I'm looking to translate words by letters/vowels.
I'll try to explain.
I have an Arabic text with ~300,000 words. My goal is to enable users to search the text using one of 10 languages I'll define. So if someone searches for "Stack overflow" in English, I'll need to break it down into parts as S-TA-CK O-VE-R-F-LOW (I need to break it that way to get the equivalent Arabic letters).
Is there something like that already existing, or do I just need to start from scratch and do linguistic research?
Thank you for your time.
You need to analyze your words by finding their syllables. Take a look at the Sphinx-4 Java library; I believe there are example programs for splitting a word into its syllables based on defined grammar rules.
Related
Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (say, 20 characters)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as "misspelling". I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. the original said "misspelling", but the OCR'd copy reads "mshpeln"). I do not know what the text I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think you need not a regex, but a database with all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word "mispeling" you would fill snd with M214. If you use SQLite, it has soundex() available (when compiled with SQLITE_SOUNDEX).
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word "mshpeln" it would be soundex('mshpeln') = M214. There you go; this way you can get the correct word back.
But this would not look anything like regex - sorry.
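The soundex() bucket lookup described above could be sketched like this (a minimal hand-rolled American Soundex, shown in Python rather than SQL for illustration; SQLite's implementation may differ on edge cases):

```python
def soundex(word):
    # consonant -> digit groups of American Soundex
    codes = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()          # keep the first letter as-is
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        if ch in 'hw':
            continue                  # h/w do not separate duplicate codes
        code = codes.get(ch, '')
        if code and code != prev:     # skip vowels and adjacent duplicates
            result += code
        prev = code
    return (result + '000')[:4]       # pad/truncate to letter + 3 digits

soundex('mshpeln')   # → 'M214', the same bucket as soundex('misspelling')
```

Both the OCR-garbled "mshpeln" and the correct "misspelling" hash to M214, which is exactly why an indexed snd column can recover the intended word.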
To be honest, I think a project like this would be better for an actual human to do, not a computer. If the project is too large for one or two people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense: a collection of elements guaranteed to be unique) with all of the unique letters in "misspelling": {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then perform a set intersection on this set and the set of the word you are matching against; this gives you the letters contained in both sets. If this set has 5 or more characters in it, you have a possible match.
If the OCR can add erroneous spaces, then consider two words at a time instead of single words, and so on, based on your requirements.
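The three steps above can be sketched in a few lines of Python (the threshold of 5 shared letters comes straight from the question):

```python
def candidate_matches(target, text, min_common=5):
    target_letters = set(target)              # step 1: unique letters of the target word
    matches = []
    for word in text.split():                 # step 2: split the text into words
        common = target_letters & set(word)   # step 3: set intersection of unique letters
        if len(common) >= min_common:
            matches.append(word)              # enough shared letters: possible match
    return matches

candidate_matches('misspelling', 'the mshpeln text was fixed')  # → ['mshpeln']
```

"mshpeln" shares six letters ({m, s, p, e, l, n}) with "misspelling", so it clears the threshold while ordinary words do not.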
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can be anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let "nello world" be a first guess for "hello world", which is quite similar. Then, with another font written in "painful" yellow or something, a second guess for the same expression is "noiio verio". How should a computer know that this word would have been similar if it had been recognized better?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures
I'm calling Microsoft Translator to convert Arabic names to English.
But it translates the names; I just want to transliterate them.
for example:
أحمد ماهر
need to be
Ahmed Maher
The service is working, but it translates the meaning of the names, not just the names themselves.
I don't know about other languages but for your example
أحمد ماهر = Ahmed Maher
What I know is that generic romanization doesn't work for this.
Why? Because of inflection and the letter "movements" (the harakat diacritics) that give a single letter more than one pronunciation. For example:
مَح = Mah
مُح = Muh
Same letters but with a different pronunciation.
The solution I've made is a set of rules to convert the names.
You can find the code in the npm package that I've made.
If you have any questions or issues, or you want to understand the rules for Arabic, I will be happy to help.
Please check below to learn more about the rules needed to convert an Arabic name:
First Letter Rule
I check whether the letter is the first letter, since if the letter "و" is the first one, people tend to write it as "W", but if it is inside the word it will be written as "O".
Inner Letter Rule
By this I mean all the letters that are neither the first nor the last; each is mapped to a base letter, for example "م" maps to "M".
Next Letter Rule
In this rule I check the letter and the upcoming one. For example:
if the letter is "م" and the next one is "ع", it should be written "Mua";
if the letter is "م" and the next one is "ي", it should be written "May".

Special Letter Rule
It looks like the Next Letter Rule, but it includes another action, like a "slice".
For example, in the "First Letter Rule" I said that "و" inside a word is converted to "O". But if the next letter is "ا", I need to delete the "O" and put a "W", then an "A" for the "ا". I believe this rule could be enhanced, but it's important to have.
Last Letter Rule
Without this rule, the name "Sarah" would come out of the Arabic as "Sara".
This rule checks whether the last letter is "ه" or "ة"; if it is, it adds the "H" to complete the name.
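Taken together, the position-dependent rules above could be sketched like this (shown in Python for illustration; the rule tables here are tiny, hypothetical stand-ins, and the actual package has far richer tables plus the Next Letter and Special Letter rules):

```python
# Hypothetical minimal rule tables, just enough for the "Sarah" example.
FIRST = {'و': 'W', 'س': 'S'}    # First Letter Rule: letters written differently at the start
INNER = {'ا': 'a', 'ر': 'r', 'و': 'o'}  # Inner Letter Rule: base mappings inside the word
LAST  = {'ة': 'ah', 'ه': 'h'}   # Last Letter Rule: final ta-marbuta/ha gets the trailing "h"

def romanize(name):
    out = []
    for i, ch in enumerate(name):
        if i == 0 and ch in FIRST:               # First Letter Rule
            out.append(FIRST[ch])
        elif i == len(name) - 1 and ch in LAST:  # Last Letter Rule
            out.append(LAST[ch])
        else:                                    # Inner Letter Rule (fallback)
            out.append(INNER.get(ch, ch))
    return ''.join(out)

romanize('سارة')   # → 'Sarah', not 'Sara'
```

The Next Letter and Special Letter rules would add a lookahead on `name[i + 1]` (and, for the slice case, rewriting of already-emitted output), but the layered dispatch by position is the core idea.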
Sounds like you want a character substitution, not a true translator. You could build your own transliterator with String.Replace.
Someone asked a similar question about transliterating Cyrillic to Latin:
How to transliterate Cyrillic to Latin text
If it is on a web site, you can enable the Collaborative Translations Framework functionality in the Widget, and use that to 'override' the translation into a name.
The API also supports this if you are building an app.
Widget documentation is here: http://blogs.msdn.com/b/translation/p/ctf1.aspx
and here: http://blogs.msdn.com/b/translation/p/ctf2.aspx
I am using the follow regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from a html document.
I've found it to work quite well on the documents we are using (documents converted with Word's "save as filtered HTML"). However, I have a problem: if the table contains another table inside it, the regex will match the outer table's start tag and the inner table's end tag rather than the outer table's end tag.
Is there a way in regex to specify that if it finds another opening table tag within the match, it should ignore the next closing tag and go for the one after it, and so on?
Don't do this.
HTML is not a regular grammar, so a regular expression is not a good tool with which to parse it. What you are asking for in your last sentence is a contextual parser, not a regular expression. Bare regular-expression parsing is too likely to fail to parse HTML correctly for it to be responsible coding.
HtmlAgilityPack is a MsPL-licensed solution I've used in the past that has widely acceptable license terms and provides a well-formed DOM which can be probed with XPath or manipulated in other useful ways ("Extract all text, dropping out tags" being a popular one for importing HTML mail for search, for example, that is nigh trivial after letting a DOM parser rip through the HTML and only coding the part that adds value for your specific business case).
Is there a way in regex to specify that if it finds another opening table tag within the match, it should ignore the next closing tag and go for the one after it, and so on?
Since nobody's actually answered this part, I will—No.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in syntax that looks very much like basic regular expressions (10* to match 1 followed by any number of 0s), or a DFA. "Regular Expressions" are based strongly off of these regular languages, as their name implies, but add some functions such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it's seen, only what it's looking at currently, and which of its finite states it's in.
TL;DR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> elements you have seen. An HTML parser does just that: since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
This is the prime example of why it's best not to use regular expressions to parse HTML: even though you know how it may be formed, you cannot parse it, since there may be nested elements. If you could guarantee there would be no nested tables, it might be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!
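To make the "a parser can count, a regex cannot" point concrete, here is a minimal sketch using Python's stdlib HTMLParser (illustrative only; in .NET you would use HtmlAgilityPack as suggested in another answer). It keeps a running tally of open <table> tags, so a nested table stays inside its outermost match:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <table> tags are currently open
        self.current = []
        self.tables = []     # text content of each outermost table

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            if self.depth == 0:
                self.current = []   # a new outermost table begins
            self.depth += 1         # the tally a regex cannot keep

    def handle_endtag(self, tag):
        if tag == 'table':
            self.depth -= 1
            if self.depth == 0:     # only close when the OUTERMOST table ends
                self.tables.append(''.join(self.current))

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)

p = TableExtractor()
p.feed('<table>outer<table>inner</table>tail</table><p>skipped</p>')
p.tables   # → ['outerinnertail']
```

The lazy regex from the question would have stopped at the inner </table>; the depth counter is exactly the piece of state a regular language lacks.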
Regular expressions are not really suited for this, as what you're trying to do requires knowledge of the fact that this is a nested language. Without this knowledge it will be really hard (and also hard to read and maintain) to extract this information.
Maybe do something with an XPath navigator?
I just developed a simple ASP.NET MVC application for English only. I want to block any user input in a language other than English. Is it possible to know whether users are inputting other languages when they write something in a textbox or editor, so that I can give a popup message?
You could limit the input box to Latin characters, but there's no automatic way to see whether the user entered something in, say, English, Finnish or Norwegian; they all mostly use a-z. Any character outside of a-z could give you an indication, but certain accents need to be allowed in English as well, so it's not 100%.
Google Translate exposes a JavaScript API to detect the language of text.
Use the following code:
<p>Note that this community uses the English language exclusively, so please be
considerate and write your posts in English. Thank you!</p>
There are two tests you can do. One is to find out what the CultureInfo is set to on the user's machine:
http://msdn.microsoft.com/en-us/library/system.threading.thread.currentuiculture.aspx
This will give you their current culture setting, which is a start. Of course, you can have your setting as 'English' but still be typing in Russian, and most of the letters will be the same...
So the next step is to detect the language using this: http://www.google.com/uds/samples/language/detect.html
It's not the greatest, according to online discussions, but it's a place to start. I'm sure there are better natural-language identifiers out there, though.
Checking for Latin 26
If you want to ensure that no non-English letters are submitted, you could simply validate that all characters fall inside the A-Z, a-z, 0-9 and normal punctuation ranges. It sounds like you want non-Latin characters to be detected and rejected.
Detecting the user's OS settings or keyboard settings isn't the best way, as the user could have multiple keyboards attached, or could simply use copy/paste.
UI Validation
At the user-interface level, you could create a jQuery handler that checks the value of a textbox for anything outside your acceptable range; perhaps that's A-Z, a-z and numeric. You could do this on the blur event. Remember that you might want to allow characters like ' and .

$('#customerName').blur(function() {
    // allow letters, digits, spaces and basic punctuation
    var isAlphaNumeric = /^[a-zA-Z0-9 .,'-]*$/.test($(this).val());
    alert(isAlphaNumeric);
});
Controller Validation
If you wanted to ALSO implement this at the controller level, you could run a regex on the incoming values.
public ActionResult CreateCustomer(string custName)
{
    if (IsAcceptableRange(custName))
    {
        // continue processing
    }
    return View();
}

public bool IsAcceptableRange(string input)
{
    // whitelist all the valid inputs here; be sure to include
    // space, period, apostrophe, hyphen, etc.
    Regex alphaNumericPattern = new Regex("[^a-zA-Z0-9 .'-]");
    return !alphaNumericPattern.IsMatch(input);
}
Google Translate was mentioned in two answers, but I want to add that the Microsoft Word API may also be used to detect language, just as Word does for spell checking.
It is for sure not the best solution, since language detection by Microsoft Office doesn't work very well (IMHO), but it may be an alternative if making web requests to Google or another remote service for every posted message is not an option.
Also, spell checking through the Microsoft Word API can be useful too. If a message has a huge number of misspelled words when checked in English, it's probably because the message is written in another language (or the author simply writes very badly).
Finally, I completely agree with Matti Virkkunen. The best, and maybe the only way to ensure that messages will be written in English is to ask the users to write in English. Otherwise, it's just as bad as implementing obscenity filters.
I am working on an augmentative and alternative communication (AAC) program. My current goal is to store a history of input/spoken text and search for common phrase fragments or word n-grams. I am currently using an implementation based on the LZW compression algorithm, as discussed at CodeProject - N-gram and Fast Pattern Extraction Algorithm. This approach, although it produces n-grams, does not behave as needed.
Let's say, for example, that I enter "over the mountain and through the woods" several times. My desired output would be the entire phrase "over the mountain and through the woods". Using my current implementation, the phrase is broken into trigrams, and on each repeated entry one word is added. So on the first entry I get "over the mountain"; on the second entry, "over the mountain and"; etc.
Let's assume we have the following text:
this is a test
this is another test
this is also a test
the test of the emergency broadcasting system interrupted my favorite song
My goal would be that if "this is a test of the emergency broadcasting system" were entered next, I could use a regex to return "this is a test" and "test of the emergency broadcasting system". Is this something that is possible through regex, or am I walking the wrong path? I appreciate any help.
I have been unable to find a way to do what I need with regular expressions alone although the technique shown at Matching parts of a string when the string contains part of a regex pattern comes close.
I ended up using a combination of my initial system along with some regex as shown below.
flow chart http://www.alsmatters.org/files/phraseextractor.png
This parses the transcript of the first presidential debate (about 16,500 words) in about 30 seconds which for my purposes is quite fast.
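For the longest-shared-fragment part of the problem, a rough alternative sketch (not the system in the flow chart above) is the stdlib difflib module, whose SequenceMatcher finds the longest matching runs between two word lists:

```python
from difflib import SequenceMatcher

def common_phrases(history, sentence, min_words=3):
    """Return word sequences of at least min_words shared between
    the new sentence and each previously stored phrase."""
    new = sentence.split()
    found = []
    for old in history:
        sm = SequenceMatcher(None, old.split(), new)
        for block in sm.get_matching_blocks():   # longest shared runs of words
            if block.size >= min_words:
                found.append(' '.join(new[block.b:block.b + block.size]))
    return found

history = [
    'this is a test',
    'the test of the emergency broadcasting system interrupted my favorite song',
]
common_phrases(history, 'this is a test of the emergency broadcasting system')
# → ['this is a test', 'test of the emergency broadcasting system']
```

This reproduces exactly the two fragments the question asked for, without fixing the n-gram length in advance; the trade-off is that it compares against each stored phrase rather than using a compressed index.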
From your use case it appears you do not want fixed-length n-gram matches, but rather the longest matching sequence. Just saw your answer to your own post, which confirms this ;)
In python you can use the fuzzywuzzy library to match a set of phrases to a canonical/normalized set of phrases through an associated list of "synonym" phrases or words. The trick is segmenting your phrases appropriately (e.g. when do commas separate phrases and when do they join lists of related words within a phrase?)
Here's the structure of the python dict in RAM. Your data structure in C or a database would be similar:
phrase_dict = {
    'alternative phrase': 'canonical phrase',
    'alternative two': 'canonical phrase',
    'less common phrasing': 'different canonical phrase',
}

from fuzzywuzzy.process import extractOne

# fuzzy-match the query against the alternative phrasings (the keys),
# then look up the canonical form
phrase_dict[extractOne('unknown phrase', list(phrase_dict))[0]]
and that returns
'canonical phrase'
FuzzyWuzzy seems to use something like a simplified Levenshtein edit distance. It's fast, but it doesn't deal well with capitalization (normalize your case first), word sounds (there are other libraries, like soundex, that can hash phrases by what they sound like), or word meanings (that's what your phrase dictionary is for).