Efficient and fast way to parse a string with different languages - c#

I have a string something like (generated via Google Transliterate REST call, and transliterated into 2 languages):
" This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड
থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious
अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস "
Now Google Transliterate REST call allows FIVE words at a time, so I had to loop, add it to the list and then concatenate the string. That's why we see that each CHUNK (of each language) is of 5 words. The total number of words is 7 words, so first 5 (This world is beautiful and) lies before rest 2 (amazingly mysterious) later.
How do I most efficiently parse the sentence such that I get something like:
This world is beautiful and amazingly mysterious थिस वर्ल्ड इस बेऔतिफुल एंड अमज़िन्ग्ली म्य्स्तेरिऔस থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ আমাজিন্গ্লি ম্য্স্তেরীয়ুস
Since the length of sentence, and the number of languages it can be converted into can be dynamic, may be using lists of each language can work, and then concatenated later?
I used an approach where I transliterated each word, one at a time, it works well, but too slow as it increases the number of calls to the API.
Can someone help me with an efficient (and dynamic) implementation of such a scenario? Thanks a bunch!

One list per language is the way to go.

if you mean different character ASCII code by different languages, you can use this answer here:
Regular expression Spanish and Arabic words

Pay for google translate's API and then your length restriction goes up to 5,000 characters per request https://developers.google.com/translate/v2/faq
Also, yes, as Daniel has said - grouping the text by language will be necessary

I have tried a work out, correct me if i misinterpret your question
string statement = "This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস ";
string otherLangStmt = statement;
MatchCollection matchCollection = Regex.Matches(statement, "([a-zA-Z]+)");
string result = "";
foreach (Match match in matchCollection)
{
if (match.Groups.Count > 0)
{
result += match.Groups[0].Value + " ";
otherLangStmt = otherLangStmt.Replace(match.Groups[0].Value, string.Empty);
}
}
otherLangStmt = Regex.Replace(otherLangStmt.Trim(), "[\\s]", " ");
Console.WriteLine(result);
Console.WriteLine(otherLangStmt);

Related

Regex hangs trying to find match

I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
#"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a
match in. My code is "if (isMatch && match.index == 0) { ... }. Can
I tell it to only check position 0 and if it's not a match move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it get's to ") = " in the string? - Like I said, fairly new to Regex.
Thanks in advance!
This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example in RegExStorm.net example

parsing words in a continuous string

If a have a string with words and no spaces, how should I parse those words given that I have a dictionary/list that contains those words?
For example, if my string is "thisisastringwithwords" how could I use a dictionary to create an output "this is a string with words"?
I hear that using the data structure Tries could help but maybe if someone could help with the pseudo code? For example, I was thinking that maybe you could index the dictionary into a trie structure, then follow each char down the trie; problem is, I'm unfamiliar with how to do this in (pseudo)code.
I'm assuming that you want an efficient solution, not the obvious one where you repeatedly check if your text starts with a dictionary word.
If the dictionary is small enough, I think you could try and modify the standard KMP algorithm. Basically, build a finite-state machine on your dictionary which consumes the text character by character and yields the constructed words.
EDIT: It appeared that I was reinventing tries.
I already did something similar. You cannot use a simple dictionary. The result will be messy. It depends if you only have to do this once or as whole program.
My solution was to:
Connect to a database with working
words from a dictionary list (for
example online dictionary)
Filter long and short words in dictionary and check if you want to trim stuff (for example don't use words with only one character like 'I')
Start with short words and compare your bigString with the database dictionary.
Now you need to create a "table of possibility". Because a lot of words can fit into 100% but are wrong. As longer the word as more sure you are, that this word is the right one.
It is cpu intensive but it can work precise in the result.
So lets say, you are using a small dictionary of 10,000 words and 3,000 of them are with a length of 8 characters, you need to compare your bigString at start with all 3,000 words and only if result was found, it is allowed to proceed to the next word. If you have 200 characters in your bigString you need about (2000chars / 8 average chars) = 250 full loops minimum with comparation.
For me, I also did a small verification of misspelled words into the comparation.
example of procedure (don't copy paste)
Dim bigString As String = "helloworld.thisisastackoverflowtest!"
Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive
dictionary.Add("Hello")
dictionary.Add("World")
dictionary.Add("this")
dictionary.Add("is")
dictionary.Add("a")
dictionary.Add("stack")
dictionary.Add("over")
dictionary.Add("flow")
dictionary.Add("stackoverflow")
dictionary.Add("test")
dictionary.Add("!")
For Each word As String In dictionary
If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
word = word.ToLower 'make it case insentitive
Next
Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result
Dim i As Integer = 0 'start at the beginning
Dim Found As Boolean = False
Do
For Each word In dictionary
If bigString.IndexOf(word, i) > 0 Then
ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value
Found = True
End If
Next
If Found = True Then
i += ResultComparer(BestWordWithBestWeight).Length
Else
i += 1
End If
Loop
I told you that it seems like an impossible task. But you can have a look at this related SO question - it may help you.
If you are sure you have all the words of the phrase in the dictionary, you can use that algo:
String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
foreach(item in myDictionary){
if(phrase.startsWith(item){
fullPhrase += item + " ";
phrase.remove(item);
break;
}
}
} while(phrase.length != 0);
There are so many complications, like, some items starting equally, so the code will be changed to use some tree search, BST or so.
This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.
This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)
words with string a Isis th
At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.
A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.
Ok, I will make a hand wavy attempt at this. The perfect(ish) data structure for your problem is (as you've said a trie) made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code, a Java(ish) style class for this would be :
Class State
{
String matchedWord;
Map<char,State> mapChildren;
}
From hereon, building the trie is easy. Its like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. The use of a HashMap kind of structure trims down time to look up character to next State mappings. Alternately if all you have are 26 characters for the alphabet, a fixed size array of 26 would do the trick as well.
Now, assuming all of that made sense, you have a trie, your problem still isn't fully solved. This is where you start doing things like regular expressions engines do, walk down the trie, keep track of states which match to a whole word in the dictionary (thats what I had the matchedWord for in the State structure), use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know its general but given the trie structure, the rest is fairly straightforward.
If you have dictionary of words and need a quick implmentation this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code, the substring extraction could and dictionary lookup could be improved.
public static String[] StringToWords(String str, HashSet<string> words)
{
//Index of char - length of last valid word
int[] bps = new int[str.Length + 1];
for (int i = 0; i < bps.Length; i++)
bps[i] = -1;
for (int i = 0; i < str.Length; i++)
{
for (int j = i + 1; j <= str.Length ; j++)
{
if (bps[j] == -1)
{
//Destination cell doesn't have valid backpointer yet
//Try with the current substring
String s = str.Substring(i, j - i);
if (words.Contains(s))
bps[j] = i;
}
}
}
//Backtrack to recovery sequence and then reverse
List<String> seg = new List<string>();
for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp])
seg.Add(str.Substring(bps[bp], bp - bps[bp]));
seg.Reverse();
return seg.ToArray();
}
Building a hastset with the word list from /usr/share/dict/words and testing with
foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
Console.WriteLine(s);
I get the output "t hi sis a string with words". Because as others have pointed out this algorithm will return a valid segmentation (if one exists), however this may not be the segmentation you expect. The presence of short words is reducing the segmentation quality, you might be able to add heuristic to favour longer words if two valid sub-segmentation enter an element.
There are more sophisticated methods that finite state machines and language models that can generate multiple segmentations and apply probabilistic ranking.

Regex MatchCollection obj hangs\"Function evuluation timed out" after Regex.Matches

I'm kind of new too C#, and regular expression for that matter, but I've searched a couple of hours to find a solution too this problem so, hopefully this is easy for you guys:)
My application uses a regex to match email addresses in a given string,
then loops throu the matches.:
String EmailPattern = "\\w+([-+.]\\w+)*#\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but, when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection(mcemail) object "hangs" the loop. When using a break point and accessing the object, I get "Function evuluation timed out" on everything(.Count etc).
Update
I've tried my pattern and other email patterns on the same string, everyone(regex desingers, python based web pages etc.) fails/timesout when trying too match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?
If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the # with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.
I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(#"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = #"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
//...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check you could only include lines that contain the # character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => line.IndexOf('#') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());
From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly quick timeouts with regard to how long a method takes. Not eveything happens quickly. I would suggest going the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time; that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet, doesn't mean that value will work well with Matches. The ultimate validation logic is up to you; but, starting with the length of rawHtmlLines might be useful. (i.e. if the lenght is 1000000 bytes, Matches might take a while) But, you have to decide what to do if the length is too long; e.g give an error to the user.

Parsing a String for Special characters in C#

I am getting a string in the following format in the query string:
Arnstung%20Chew(20)
I want to convert it to just Arnstung Chew.
How do I do it?
Also how do I make sure that the user is not passing a script or anything harmful in the query string?
string str = "Arnstung Chew (20)";
string replacedString = str.Substring(0, str.IndexOf("(") -1 ).Trim();
string safeString = System.Web.HttpUtility.HtmlEncode(replacedString);
It's impossible to provide a comprehensive answer without knowing what variations might appear on your input text. For example, will there always be two words separated by a space followed by a number in parentheses? Or might there be other variations as well?
I have a lot of parsing code on my Black Belt Coder site, including a sscanf() replacement for .NET that may potentially be useful in your case.

how to create a parser for search queries

for example i'd need to create something like google search query parser to parse such expressions as:
flying hiking or swiming
-"**walking in boots **" **author:**hamish **author:**reid
or
house in new york priced over
$500000 with a swimming pool
how would i even go about start building something like it? any good resources?
c# relevant, please (if possible)
edit: this is something that i should somehow be able to translate to a sql query
How many keywords do you have (like 'or', 'in', 'priced over', 'with a')? If you only have a couple of them I'd suggest going with simple string processing (regexes) too.
But if you have more than that you might want to look into implementing a real parser for those search expressions. Irony.net might help you with that (I found it extremely easy to use as you can express your grammar in a near bnf-form directly in code).
The Lucene/NLucene project have functionality for boolean queries and some other query formats as well. I don't know about the possibilities to add own extensions like author in your case, but it might be worthwile to check it out.
There are few ways doing it, two of them:
Parsing using grammar (useful for complex language)
Parsing using regular expression and basic string manipulations (for simpler language)
According to your example, the language is very basic so splitting the string according to keyword can be the best solution.
string sentence = "house in new york priced over $500000 with a swimming pool";
string[] values = sentence.Split(new []{" in ", " priced over ", " with a "},
StringSplitOptions.None);
string type = values[0];
string area = values[1];
string price = values[2];
string accessories = values[3];
However, some issues that may arise are: how to verify if the sentence stands in the expected form? What happens if some of the keywords can appear as part of the values?
If this is the case you encounter there are some libraries you can use to parse input using a defined grammar. Two of these libraries that works with .Net are ANTLR and Gold Parser, both are free. The main challenge is defining the grammar.
A grammar would work very well for the second example you gave but the first (any order keyword/command strings) would be best handled using Split() and a class to handle the various keywords and commands. You will have to do initial processing to handle quoted regions before the split (for example replacing spaces within quoted regions with a rare/unused character).
The ":" commands are easy to find and pull out of the search string for processing after the split is completed. Simply traverse the array looking.
The +/- keywords are also easy to find and add to the sql query as AND/AND NOT clauses.
The only place you might run into issues is with the "or" since you'll have to define how it is handled. What if there are multiple "or"s? But the order of keywords in the array is the same as in the query so that won't be an issue.
i think you should just do some string processing. There is no smart way of doing this.
So replace "OR" with your own or operator (e.g. ||). As far as i know there is no library for this.
I suggest you go with regexes.

Categories