Parsing mathematical expressions in C# - c#

As a project, I want to write a parser for mathematical expressions in C#. I know there are libraries for this, but want to create my own to learn about this topic.
As an example, I have the expression
min(3,4) + 2 - abs(-4.6)
I then create token from this string by specifying regular expressions and going through the expression from the user trying to match one of the regex. This is done from the front to the back:
private static List<string> Tokenize(string expression)
{
List<string> result = new List<string>();
List<string> tokens = new List<string>();
tokens.Add("^\\(");// matches opening bracket
tokens.Add("^([\\d.\\d]+)"); // matches floating point numbers
tokens.Add("^[&|<=>!]+"); // matches operators and other special characters
tokens.Add("^[\\w]+"); // matches words and integers
tokens.Add("^[,]"); // matches ,
tokens.Add("^[\\)]"); // matches closing bracket
while (0 != expression.Length)
{
bool foundMatch = false;
foreach (string token in tokens)
{
Match match = Regex.Match(expression, token);
if (false == match.Success)
{
continue;
}
result.Add(match.Value);
expression = Regex.Replace(expression, token, "");
foundMatch = true;
break;
}
if (false == foundMatch)
{
break;
}
}
return result;
}
This works quite well. Now I want the user to be able to enter strings into the expression. I found a question to this at Regex tokenize issue however the answer provide regex which match the text anywhere in the expression. However I need this to match only the first occurrence at the front of the expression so I can keep the order of token.
As an example see this:
5 + " is smaller than " + 10
should give me the tokens
5 + " is greater than " + 10
If possible I would also like to be able to enter escape characters so the user is able to use the character " in strings, like "This is an apostrophe \" " gives me the token "This is an apostrophe " "
The answer from Wiktor Stribiżew at that question looked really good, but I couldn't modify it so it only matches at the beginning and only one word. Help is appreciated!

Funny you referencing that question. I actually adopted (yet again) my answer in there to work for you here ;)
Here's a fiddle showing the solution.
The regex is
(?!\+)(?:"((?:\\"|[^"])*)"?)
I changed the code to use capture groups to be able to in a simple manner not add the surrounding quotes. Also the loop removes the + sign separating the tokens.
Regards

Related

Regex for matching Functions and Capturing their Arguments

I'm working on a calculator and it takes string expressions and evaluates them. I have a function that searches the expression for math functions using Regex, retrieves the arguments, looks up the function name, and evaluates it. What I'm having problem with is that I can only do this if I know how many arguments there are going to be, I can't get the Regex right. And if I just split the contents of the ( and ) characters by the , character then I can't have other function calls in that argument.
Here is the function matching pattern: \b([a-z][a-z0-9_]*)\((..*)\)\b
It only works with one argument, have can I create a group for every argument excluding the ones inside of nested functions? For example, it would match: func1(2 * 7, func2(3, 5)) and create capture groups for: 2 * 7 and func2(3, 5)
Here the function I'm using to evaluate the expression:
/// <summary>
/// Attempts to evaluate and store the result of the given mathematical expression.
/// </summary>
public static bool Evaluate(string expr, ref double result)
{
expr = expr.ToLower();
try
{
// Matches for result identifiers, constants/variables objects, and functions.
MatchCollection results = Calculator.PatternResult.Matches(expr);
MatchCollection objs = Calculator.PatternObjId.Matches(expr);
MatchCollection funcs = Calculator.PatternFunc.Matches(expr);
// Parse the expression for functions.
foreach (Match match in funcs)
{
System.Windows.Forms.MessageBox.Show("Function found. - " + match.Groups[1].Value + "(" + match.Groups[2].Value + ")");
int argCount = 0;
List<string> args = new List<string>();
List<double> argVals = new List<double>();
string funcName = match.Groups[1].Value;
// Ensure the function exists.
if (_Functions.ContainsKey(funcName)) {
argCount = _Functions[funcName].ArgCount;
} else {
Error("The function '"+funcName+"' does not exist.");
return false;
}
// Create the pattern for matching arguments.
string argPattTmp = funcName + "\\(\\s*";
for (int i = 0; i < argCount; ++i)
argPattTmp += "(..*)" + ((i == argCount - 1) ? ",":"") + "\\s*";
argPattTmp += "\\)";
// Get all of the argument strings.
Regex argPatt = new Regex(argPattTmp);
// Evaluate and store all argument values.
foreach (Group argMatch in argPatt.Matches(match.Value.Trim())[0].Groups)
{
string arg = argMatch.Value.Trim();
System.Windows.Forms.MessageBox.Show(arg);
if (arg.Length > 0)
{
double argVal = 0;
// Check if the argument is a double or expression.
try {
argVal = Convert.ToDouble(arg);
} catch {
// Attempt to evaluate the arguments expression.
System.Windows.Forms.MessageBox.Show("Argument is an expression: " + arg);
if (!Evaluate(arg, ref argVal)) {
Error("Invalid arguments were passed to the function '" + funcName + "'.");
return false;
}
}
// Store the value of the argument.
System.Windows.Forms.MessageBox.Show("ArgVal = " + argVal.ToString());
argVals.Add(argVal);
}
else
{
Error("Invalid arguments were passed to the function '" + funcName + "'.");
return false;
}
}
// Parse the function and replace with the result.
double funcResult = RunFunction(funcName, argVals.ToArray());
expr = new Regex("\\b"+match.Value+"\\b").Replace(expr, funcResult.ToString());
}
// Final evaluation.
result = Program.Scripting.Eval(expr);
}
catch (Exception ex)
{
Error(ex.Message);
return false;
}
return true;
}
////////////////////////////////// ---- PATTERNS ---- \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
/// <summary>
/// The pattern used for function calls.
/// </summary>
public static Regex PatternFunc = new Regex(#"([a-z][a-z0-9_]*)\((..*)\)");
As you can see, there is a pretty bad attempt at building a Regex to match the arguments. It doesn't work.
All I am trying to do is extract 2 * 7 and func2(3, 5) from the expression func1(2 * 7, func2(3, 5)) but it must work for functions with different argument counts as well. If there is a way to do this without using Regex that is also good.
There is both a simple solution and a more advanced solution (added after edit) to handle more complex functions.
To achieve the example you posted, I suggest doing this in two steps, the first step is to extract the parameters (regexes are explained at the end):
\b[^()]+\((.*)\)$
Now, to parse the parameters.
Simple solution
Extract the parameters using:
([^,]+\(.+?\))|([^,]+)
Here are some C# code examples (all asserts pass):
string extractFuncRegex = #"\b[^()]+\((.*)\)$";
string extractArgsRegex = #"([^,]+\(.+?\))|([^,]+)";
//Your test string
string test = #"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, #"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
Explanation of regexes. The arguments extraction as a single string:
\b[^()]+\((.*)\)$
where:
[^()]+ chars that are not an opening, closing bracket.
\((.*)\) everything inside the brackets
The args extraction:
([^,]+\(.+?\))|([^,]+)
where:
([^,]+\(.+?\)) character that are not commas followed by characters in brackets. This picks up the func arguments. Note the +? so that the match is lazy and stops at the first ) it meets.
|([^,]+) If the previous does not match then match consecutive chars that are not commas. These matches go into groups.
More advanced solution
Now, there are some obvious limitations with that approach, for example it matches the first closing bracket, so it doesn't handle nested functions very well. For a more comprehensive solution (if you require it), we need to use balancing group definitions(as I mentioned before this edit). For our purposes, balancing group definitions allow us to keep track of the instances of the open brackets and subtract the closing bracket instances. In essence opening and closing brackets will cancel each other out in the balancing part of the search until the final closing bracket is found. That is, the match will continue until the brackets balance and the final closing bracket is found.
So, the regex to extract the parms is now (func extraction can stay the same):
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
Here are some test cases to show it in action:
string extractFuncRegex = #"\b[^()]+\((.*)\)$";
string extractArgsRegex = #"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+";
//Your test string
string test = #"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, #"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
//A more advanced test string
test = #"someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)";
match = Regex.Match( test, extractFuncRegex );
innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, #"a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2" );
matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "a" );
Assert.AreEqual( matches[1].Value.Trim(), "b" );
Assert.AreEqual( matches[2].Value.Trim(), "func1(a,b+c)" );
Assert.AreEqual( matches[3].Value.Trim(), "func2(a*b,func3(a+b,c))" );
Assert.AreEqual( matches[4].Value.Trim(), "func4(e)+func5(f)" );
Assert.AreEqual( matches[5].Value.Trim(), "func6(func7(g,h)+func8(i,(a)=>a+2))" );
Assert.AreEqual( matches[6].Value.Trim(), "g+2" );
Note especially that the method is now quite advanced:
someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)
So, looking at the regex again:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
In summary, it starts out with characters that are not commas or brackets. Then if there are brackets in the argument, it matches and subtracts the brackets until they balance. It then tries to repeat that match in case there are other functions in the argument. It then goes onto the next argument (after the comma). In detail:
[^,()]+ matches anything that is not ',()'
?: means non-capturing group, i.e. do not store matches within brackets in a group.
\( means start at an open bracket.
?> means atomic grouping - essentially, this means it does not remember backtracking positions. This also helps to improve performance because there are less stepbacks to try different combinations.
[^()]+| means anything but an opening or closing bracket. This is followed by | (or)
\((?<open>)| This is the good stuff and says match '(' or
(?<-open>) This is the better stuff that says match a ')' and balance out the '('. This means that this part of the match (everything after the first bracket) will continue until all the internal brackets match. Without the balancing expressions, the match would finish on the first closing bracket. The crux is that the engine does not match this ')' against the final ')', instead it is subtracted from the matching '('. When there are no further outstanding '(', the -open fails so the final ')' can be matched.
The rest of the regex contains the closing parenthesis for the group and the repetitions (, and +) which are respectively: repeat the inner bracket match 0 or more times, repeat the full bracket search 0 or more times (0 allows arguments without brackets) and repeat the full match 1 or more times (allows foo(1)+foo(2))
One final embellishment:
If you add (?(open)(?!)) to the regex:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*(?(open)(?!))\)))*)+
The (?!) will always fail if open has captured something (that hasn't been subtracted), i.e. it will always fail if there is an opening bracket without a closing bracket. This is a useful way to test whether the balancing has failed.
Some notes:
\b will not match when the last character is a ')' because it is not a word character and \b tests for word character boundaries so your regex would not match.
While regex is powerful, unless you are a guru among gurus it is best to keep the expressions simple because otherwise they are hard to maintain and hard for other people to understand. That is why it is sometimes best to break up the problem into subproblems and simpler expressions and let the language do some of the non search/match operations that it is good at. So, you may want to mix simple regexes with more complex code or visa versa, depending on where you are comfortable.
This will match some very complex functions, but it is not a lexical analyzer for functions.
If you can have strings in the arguments and the strings themselves can contains brackets, e.g. "go(..." then you will need to modify the regex to take strings out of the comparison. Same with comments.
Some links for balancing group definitions: here, here, here and here.
Hope that helps.
This regex does what you want:
^(?<FunctionName>\w+)\((?>(?(param),)(?<param>(?>(?>[^\(\),"]|(?<p>\()|(?<-p>\))|(?(p)[^\(\)]|(?!))|(?(g)(?:""|[^"]|(?<-g>"))|(?!))|(?<g>")))*))+\)$
Don't forget to escape backslashes and double quotes when pasting it in your code.
It will match correctly arguments in double quotes, inner functions and numbers like this one:
f1(123,"df""j"" , dhf",abc12,func2(),func(123,a>2))
The param stack will contains
123
"df""j"" , dhf"
abc12
func2()
func(123,a>2)
I'm sorry to burst the RegEx bubble, but this is one of those things that you just can't do effectively with regular expressions alone.
What you're implementing is basically an Operator-Precedence Parser with support for sub-expressions and argument lists. The statement is processed as a stream of tokens - possibly using regular expressions - with sub-expressions processed as high-priority operations.
With the right code you can do this as an iteration over the full token stream, but recursive parsers are common too. Either way you have to be able to effectively push state and restart parsing at each of the sub-expression entry points - a (, , or <function_name>( token - and pushing the result up the parser chain at the sub-expression exit points - ) or , token.
Regular expressions aren't going to get you completely out of trouble with this...
Since you have nested parentheses, you need to modify your code to count ( against ). When you encounter an (, you need to take note of the position then look ahead, incrementing a counter for each extra ( you find, and decrementing it for each ) you find. When your counter is 0 and you find a ), that is the end of your function parameter block, and you can then parse the text between the parentheses. You can also split the text on , when the counter is 0 to get function parameters.
If you encounter the end of the string while the counter is 0, you have a "(" without ")" error.
You then take the text block(s) between the opening and closing parentheses and any commas, and repeat the above for each parameter.
There are some new (relatively very new) language-specific enhancements to regex that make it possible to match context free languages with "regex", but you will find more resources and more help when using the tools more commonly used for this kind of task:
It'd be better to use a parser generator like ANTLR, LEX+YACC, FLEX+BISON, or any other commonly used parser generator. Most of them come with complete examples on how to build simple calculators that support grouping and function calls.

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx I'm new to RegEx so google-search can't help me really.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every character that is followed by the same character. Every found letter is stored in a capturing group (because of the brackets around), this capturing group is then reused by the positive lookahead assertion (the (?=...)) by using \1 (in \1 there is the matches character stored)
\p{L} is a unicode code point with the category "letter"
Code
String text = "aaaabaa";
Regex reg = new Regex(#"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex("(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = "(a)\1"
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

Why isn't C# following my regex?

I have a C# application that reads a word file and looks for words wrapped in < brackets >
It's currently using the following code and the regex shown.
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
I've used several online testing tools / friends to validate that the regex works, and my application proves this (For those playing at home, http://wordfiller.codeplex.com)!
My problem is however the regex will also pickup extra rubbish.
E.G
I'm walking on <sunshine>.
will return
sunshine>.
it should just return
<sunshine>
Anyone know why my application refuses to play by the rules?
I don't think the problem is your regex at all. It could be improved somewhat -- you don't need the ([]) around each bracket -- but that shouldn't affect the results. My strong suspicion is that the problem is in your C# implementation, not your regex.
Your regex should split <sunshine> into three separate groups: <, sunshine, and >. Having tested it with the code below, that's exactly what it does. My suspicion is that, somewhere in the C# code, you're appending Group 3 to Group 2 without realizing it. Some quick C# experimentation supports this:
private readonly Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
private string sunshine()
{
string input = "I'm walking on <sunshine>.";
var match = _regex.Match(input);
var regex2 = new Regex("<[^>]*>", RegexOptions.Compiled); //A slightly simpler version
string result = "";
for (int i = 0; i < match.Groups.Count; i++)
{
result += string.Format("Group {0}: {1}\n", i, match.Groups[i].Value);
}
result += "\nWhat you're getting: " + match.Groups[2].Value + match.Groups[3].Value;
result += "\nWhat you want: " + match.Groups[0].Value + " or " + match.Value;
result += "\nBut you don't need all those brackets and groups: " + regex2.Match(input).Value;
return result;
}
Result:
Group 0: <sunshine>
Group 1: <
Group 2: sunshine
Group 3: >
What you're getting: sunshine>
What you want: <sunshine> or <sunshine>
But you don't need all those brackets and groups: <sunshine>
We will need to see more code to solve the problem. There is an off by one error somewhere in your code. It is impossible for that regular expression to return sunshine>.. Therefore the regular expression in question is not the problem. I would assume, without more details, that something is getting the index into the string containing your match and it is one character too far into the string.
If all you want is the text between < and > then you'd be better off using:
[<]([^>]*)[>] or simpler: <([^>]+)>
If you want to include < and > then you could use:
([<][^>]*[>]) or simpler: (<[^>]+>)
You're expression currently has 3 Group Matches - indicated by the brackets ().
In the case of < sunshine> this will currently return the following:
Group 1 : "<"
Group 2 : "sunshine"
Group 3 : ">"
So if you only looked at the 2nd group it should work!
The only explanation I can give for your observed behaviour is that where you pull the matches out, you are adding together Groups 2 + 3 and not Group 1.
What you posted works perfectly fine.
Regex _regex = new Regex("([<])([^>]*)([>])", RegexOptions.Compiled);
string test = "I'm walking on <sunshine>.";
var match = _regex.Match(test);
Match is <sunshine> i guess you need to provide more code.
Regex is eager by default. Teach it to be lazy!
What I mean is, the * operator considers as many repetitions as possible (it's said to be eager). Use the *? operator instead, this tells Regex to consider as few repetitions as possible (i.e. to be lazy):
<.*?>
Because you are using parenthesis, you are creating matching groups. This is causing the match collection to match the groups created by the regular expression to also be matched. You can reduce your regular expression to [<][^>]*[>] and it will match only on the <text> that you wish.

How to capitalize first letter of each sentence?

I know how to capitalize first letter in each word. But I want to know how to capitalize first letter of each sentence in C#.
This is not necessarily a trivial problem. Sentences can end with a number of different punctuation marks, and those same punctuation marks don't always denote the end of a sentence (abbreviations like Dr. may pose a particular problem because there are potentially many of them).
That being said, you might be able to get a "good enough" solution by using regular expressions to look for words after a sentence-ending punctuation, but you would have to add quite a few special cases. It might be easier to process the string character by character or word by word. You would still have to handle all the same special cases, but it might be easier than trying to build that into a regex.
There are lots of weird rules for grammar and punctuation. Any solution you come up with probably won't be able to take them all into account. Some things to consider:
Sentences can end with different punctuation marks (. ! ?)
Some punctuation marks that end sentences might also be used in the middle of a sentence (e.g. abbreviations such as Dr. Mr. e.g.)
Sentences could contain nested sentences. Quotations could pose a particular problem (e.g. He said, "This is a hard problem! I wonder," he mused, "if it can be solved.")
As a first approximation, you could probably treat any sequence like [a-z]\.[ \n\t] as the end of a sentence.
Consider a sentence as a word containing spaces an ending with a period.
There's some VB code on this page which shouldn't be too hard to convert to C#.
However, subsequent posts point out the errors in the algorithm.
This blog has some C# code which claims to work:
It auto capitalises the first letter after every full stop (period), question mark and exclamation mark.
UPDATE 16 Feb 2010: I’ve reworked it so that it doesn’t affect strings such as URL’s and the like
Don't forget sentences with parentheses. Also, * if used as an idicator for bold text.
http://www.grammarbook.com/punctuation/parens.asp
I needed to do something similar, and this served my purposes. I pass in my "sentences" as a IEnumerable of strings.
// Read sentences from text file (each sentence on a separate line)
IEnumerable<string> lines = File.ReadLines(inputPath);
// Call method below
lines = CapitalizeFirstLetterOfEachWord(lines);
private static IEnumerable<string> CapitalizeFirstLetterOfString(IEnumerable<string> inputLines)
{
// Will output: Lorem lipsum et
List<string> outputLines = new List<string>();
TextInfo textInfo = new CultureInfo("en-US", false).TextInfo;
foreach (string line in inputLines)
{
string lineLowerCase = textInfo.ToLower(line);
string[] lineSplit = lineLowerCase.Split(' ');
bool first = true;
for (int i = 0; i < lineSplit.Length; i++ )
{
if (first)
{
lineSplit[0] = textInfo.ToTitleCase(lineSplit[0]);
first = false;
}
}
outputLines.Add(string.Join(" ", lineSplit));
}
return outputLines;
}
I know I'm little late, but just like You, I needed to capitalize every first character on each of my sentences.
I just fell here (and a lot of other pages while I was researching) and found nothing to help me out. So, I burned some neurons, and made a algorithm by myself.
Here is my extension method to capitalize sentences:
public static string CapitalizeSentences(this string Input)
{
if (String.IsNullOrEmpty(Input))
return Input;
if (Input.Length == 1)
return Input.ToUpper();
Input = Regex.Replace(Input, #"\s+", " ");
Input = Input.Trim().ToLower();
Input = Char.ToUpper(Input[0]) + Input.Substring(1);
var objDelimiters = new string[] { ". ", "! ", "? " };
foreach (var objDelimiter in objDelimiters)
{
var varDelimiterLength = objDelimiter.Length;
var varIndexStart = Input.IndexOf(objDelimiter, 0);
while (varIndexStart > -1)
{
Input = Input.Substring(0, varIndexStart + varDelimiterLength) + (Input[varIndexStart + varDelimiterLength]).ToString().ToUpper() + Input.Substring((varIndexStart + varDelimiterLength) + 1);
varIndexStart = Input.IndexOf(objDelimiter, varIndexStart + 1);
}
}
return Input;
}
Details about the algorithm:
This simple algorithm starts removing all double spaces. Then, it capitalize the first character of the string. then search for every delimiter. When find one, capitalize the very next character.
I made it easy to Add/Remove or Edit the delimiters, so You can change a lot how code works with a little change on it.
It doesn't check if the substrings go out of the string length, because the delimiters end with spaces, and the algorithm starts with a "Trim()", so every delimiter if found in the string will be followed by another character.
Important:
You didn't specify what were exactly your needs. I mean, it's a grammar corrector, it's just to prettify a text, etc... So, it's important to consider that my algorithm is just perfect for my needs, that can be different of yours.
*This algorithm was created to format a "Product Description" that isn't normalized (almost always it's entirely uppercased) in a nice format to the user (To be more specific, I need to show a pretty and "smaller" text for user. So, all characters in Upper Case is just opposite of what I want). So, it was not created to be grammatically perfect.
*Also, there maybe some exceptions where the character will not be uppercased because bad formatting.
*I choose to include spaces in the delimiter, so "http://www.stackoverflow.com" will not become "http://www.Stackoverflow.Com". In the other hand, sentences like "the box is blue.it's on the floor" will become "The box is blue.it's on the floor", and not "The box is blue.It's on the floor"
*In abbreviations cases, it will capitalize, but once again, it's not a problem because my needs is just show a product description (where grammar is not extremely critic). And in abbreviations like Mr. or Dr. the very first character is a name, so, it's perfect to be capitalized.
If You, or somebody else needs a more accurate algorithm, I'll be glad to improve it.
Hope I could help somebody!
However you can make a class or method to convert each text in TitleCase. Here is the example you just need to call the method.
public static string ToTitleCase(string strX)
{
string[] aryWords = strX.Trim().Split(' ');
List<string> lstLetters = new List<string>();
List<string> lstWords = new List<string>();
foreach (string strWord in aryWords)
{
int iLCount = 0;
foreach (char chrLetter in strWord.Trim())
{
if (iLCount == 0)
{
lstLetters.Add(chrLetter.ToString().ToUpper());
}
else
{
lstLetters.Add(chrLetter.ToString().ToLower());
}
iLCount++;
}
lstWords.Add(string.Join("", lstLetters));
lstLetters.Clear();
}
string strNewString = string.Join(" ", lstWords);
return strNewString;
}

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces with in double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for RegEx's. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but your should avoid that route unless necessary as it is significantly more work. (on the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided).
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = #"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
The real benefit of this method is it can be easily extened to include your "-" requirement like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = #"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go char by char to the string like this: (sort of pseudo code)
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
if in_quotes:
if c is '"':
append word to words
word = "" // empty word
in_quotes = false
else:
append c to word
else if c is '"':
in_quotes = true
else if c is ' ': // space
if not empty word:
append word to words
word = "" // empty word
else:
append c to word
// Rest
if not empty word:
append word to words
I was looking for a Java solution to this problem and came up with a solution using #Michael La Voie's. Thought I would share it here despite the question being asked for in C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}

Categories