Google-like search query tokenization & string splitting


I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces within double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.

So far, this looks like a good candidate for a regex. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary, as it is significantly more work. (On the other hand, for complex schemas, a regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
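string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token (quotes included) without the trailing
    // whitespace that Groups[0], the whole match, would carry
    string group = m.Groups[1].Value;
}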
The real benefit of this method is that it can be easily extended to include your "-" requirement like so:
string data = "the quick \"brown fox\" jumps over " +
    "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token without the trailing whitespace
    string group = m.Groups[1].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
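As a possible follow-up to the breakdown above, here is a minimal sketch that folds the four alternatives into two named groups so the caller gets the bare term plus an "excluded" flag. The class name QueryTokenizer, the method Tokenize and the group names neg and term are illustrative assumptions, not part of the original answer, and the pattern is a variation on the one above rather than the same regex.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryTokenizer
{
    // Optional leading '-', then either a quoted phrase or a bare word.
    // Both alternatives capture into the same named group "term"
    // (.NET allows a group name to be reused within one pattern).
    private static readonly Regex TokenRegex = new Regex(
        @"(?<neg>-)?(?:""(?<term>[^""]+)""|(?<term>\w+))\s*");

    public static List<(string Term, bool Excluded)> Tokenize(string query)
    {
        var tokens = new List<(string Term, bool Excluded)>();
        foreach (Match m in TokenRegex.Matches(query))
        {
            // "neg" only participates in the match when a '-' was present
            tokens.Add((m.Groups["term"].Value, m.Groups["neg"].Success));
        }
        return tokens;
    }
}
```

Calling QueryTokenizer.Tokenize on the second sample string should yield "lazy cat" and "energetic" with Excluded set to true and every other term with it set to false. Like the original pattern, this treats a hyphen inside a word (for example "well-known") as the start of an excluded term, which may or may not be what you want.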

I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better

Go through the string char by char, like this (sort of pseudo code):
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
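The pseudo code above translates to C# fairly directly. The following is a sketch of one such translation; the class and method names (QuoteAwareSplitter, Split) are made up for illustration, and it keeps the pseudo code's behaviour as-is, including treating only the space character as a separator.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    public static List<string> Split(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote: the quoted phrase is complete
                    words.Add(word.ToString());
                    word.Clear();
                    inQuotes = false;
                }
                else
                {
                    word.Append(c); // spaces are kept inside quotes
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ' ')
            {
                // A space ends the current word, if there is one
                if (word.Length > 0)
                {
                    words.Add(word.ToString());
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // Whatever is left after the loop is the last word
        if (word.Length > 0)
        {
            words.Add(word.ToString());
        }
        return words;
    }
}
```

Because the loop only looks at one character at a time, adding a rule such as a leading "-" for exclusion would mean adding one more branch and a flag, in the same way in_quotes is handled.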

I was looking for a Java solution to this problem and came up with one using @Michael La Voie's regex. Thought I would share it here despite the question being asked for C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
    List<String> words = new ArrayList<>();
    Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
    Matcher matcher = pattern.matcher(q);
    while (matcher.find()) {
        MatchResult result = matcher.toMatchResult();
        if (result != null && result.group() != null) {
            if (result.group().contains("\"")) {
                words.add(result.group().trim().replaceAll("\"", "").trim());
            } else {
                words.add(result.group().trim());
            }
        }
    }
    return words;
}

Related

C#: Remove Excess Text From String

Okay, so after looking around here on SO, I have found a solution that meets about 95% of my requirement, although I believe it may need to be redone at this point.
ISSUE
Say I have a value range supplied as "1000 - 1009 ABC1 ABC SOMETHING ELSE" where I just need the 1000 - 1009 part. I need to be able to remove excess characters from the string supplied, even if they truly are accepted characters, but only if they are part of secondary strings with text. (Sorry if that description seems odd, my mind isn't full power today.)
CURRENT SOLUTION
I currently have a simple method utilizing Linq to return only accepted characters; however, this will return "1000 - 10091", which is not the range I need. I've thought about looping through the string's individual characters and comparing to previous characters as I go, using IsDigit and IsLetter to my advantage, but then comes the issue of replacing the unacceptable characters or removing them. I think if I gave it a day or two I could figure it out with a clear mind, but it needs to be done by the end of the day, and I am banging my head against the keyboard.
void RemoveExcessText(ref string val) {
    string allowedChars = "0123456789-+>";
    val = new string(val.Where(c => allowedChars.Contains(c)).ToArray());
}
// Alternatively?
char previousChar = ' ';
for (int i = 0; i < val.Length; i++) {
    if (char.IsLetter(val[i])) {
        previousChar = val[i];
        val.Remove(i, 1);
    } else if (char.IsDigit(val[i])) {
        if (char.IsLetter(previousChar)) {
            val.Remove(i, 1);
        }
    }
}
But how do I account for white space and leave in the +, -, and > characters? I am losing my mind on this one today.

Why not use a regular expression?
Regex.Match("1000 - 1009 ABC1 ABC SOMETHING ELSE", @"^(\d+)([\s\-]+)(\d+)");
Should give you what you want
I made a fiddle

You use a regular expression with a capturing group:
Regex r = new Regex("^(?<v>[-0-9 ]+)");
This means "from the start of the input string (^), match [0 to 9 or space or hyphen], keep going for as many occurrences of these characters as are available (+), and store the result in the named group v (?<v>...)". Note that the match can end with a trailing space, so you may want to Trim() it.
We get it out like this:
r.Matches(input)[0].Groups["v"].Value
Note though that if the input string doesn't match, the match collection will be empty and indexing it with [0] will throw. You might want to make it more robust with some extra error checking:
MatchCollection mc = r.Matches(input);
if (mc.Count > 0)
    MessageBox.Show(mc[0].Groups["v"].Value);

You could match this with a regular expression. \d{1,4} means match a decimal digit at least once up to 4 times. Followed by space, hyphen, space, and 1 to 4 digits again, then anything else. Only the part inside parenthesis is output in your results.
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        var pattern = @"(^\d{1,4} - \d{1,4}).*";
        string input = ("1000 - 1009 ABC1 ABC SOMETHING ELSE");
        string replacement = "$1";
        string result = Regex.Replace(input, pattern, replacement);
        Console.WriteLine(result);
    }
}
https://dotnetfiddle.net/cZGlX4

How to remove multiple, repeating & unnecessary punctuation from string in C#?

Considering strings like this:
"This is a string....!"
"This is another...!!"
"What is this..!?!?"
...
// There are LOTS of examples of weird/angry sentence-endings like the ones above.
I want to replace the unnecessary punctuation at the end to make it look like this:
"This is a string!"
"This is another!"
"What is this?"
What I basically do is:
- split by space
- check if last char in string contains a punctuation
- start replacing with the patterns below
I have tried a very big ".Replace(string, string)" function, but it does not work - there has to be a simpler regex I guess.
Documentation:
Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.
As well as:
Because this method returns the modified string, you can chain together successive calls to the Replace method to perform multiple replacements on the original string.
Something is wrong here.
EDIT: ALL the proposed solutions work fine! Thank you very much!
This one was the best suited solution for my project:
Regex re = new Regex("[.?!]*(?=[.?!]$)");
string output = re.Replace(input, "");

Your solution works almost fine (demo), the only issue is when the same sequence could be matched starting at different spots. For example, ..!?!? from your last line is not part of the substitution list, so ..!? and !? get replaced by two separate matches, producing ?? in the output.
It looks like your strategy is pretty straightforward: in a chain of multiple punctuation characters the last character wins. You can use regular expressions to do the replacement:
[!?.]*([!?.])
and replace it with $1, i.e. the capturing group that has the last character:
string s;
while ((s = Console.ReadLine()) != null) {
s = Regex.Replace(s, "[!?.]*([!?.])", "$1");
Console.WriteLine(s);
}
Demo
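To make the "last character wins" rule concrete, here is a small sketch that wraps the same pattern in a helper and can be checked against the three sample strings from the question. The names PunctuationCleaner and Collapse are illustrative only.

```csharp
using System.Text.RegularExpressions;

public static class PunctuationCleaner
{
    // A run of '.', '!' or '?' characters; the capturing group holds
    // the last character of the run, which is the one we keep.
    private static readonly Regex Run = new Regex(@"[!?.]*([!?.])");

    public static string Collapse(string s)
    {
        return Run.Replace(s, "$1");
    }
}
```

Note that this pattern is not anchored, so it collapses every run of punctuation in the string, not only the one at the end. If only the sentence ending should be touched, a pattern anchored with $ is the safer choice.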

Simply
[.?!]*(?=[.?!]$)
should do it for you. Like
Regex re = new Regex("[.?!]*(?=[.?!]$)");
Console.WriteLine(re.Replace("This is a string....!", ""));
This replaces all punctuations but the last with nothing.
[.?!]* matches any number of consecutive punctuation characters, and the (?=[.?!]$) is a positive lookahead making sure it leaves one at the end of the string.
See it here at ideone.

Or you can do it without regExps:
string TrimPuncMarks(string str)
{
    HashSet<char> punctMarks = new HashSet<char>() {'.', '!', '?'};
    int i = str.Length - 1;
    for (; i >= 0; i--)
    {
        if (!punctMarks.Contains(str[i]))
            break;
    }
    // the very last punctuation mark, or null if the string does not end with one
    char? suffix = i < str.Length - 1 ? str[str.Length - 1] : (char?)null;
    return str.Substring(0, i+1) + suffix;
}
Debug.Assert("What is this?" == TrimPuncMarks("What is this..!?!?"));
Debug.Assert("What is this" == TrimPuncMarks("What is this"));
Debug.Assert("What is this." == TrimPuncMarks("What is this."));

How to find a string with missing fragments?

I'm building a chatbot in C# using AIML files, at the moment I've this code to process:
<aiml>
<category>
<pattern>a * is a *</pattern>
<template>when a <star index="1"/> is not a <star index="2"/>?</template>
</category>
</aiml>
I would like to do something like:
if (user_string == pattern_string) return template_string;
but I don't know how to tell the computer that the star character can be anything, and especially that it can be more than one word!
I was thinking to do it with regular expressions, but I don't have enough experience with it. Can somebody help me? :)

Using Regex
static bool TryParse(string pattern, string text, out string[] wildcardValues)
{
    // ^ and $ means that whole string must be matched
    // Regex.Escape (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.escape(v=vs.110).aspx)
    // (.+) means capture at least one character and place it in match.Groups
    var regexPattern = string.Format("^{0}$", Regex.Escape(pattern).Replace(@"\*", "(.+)"));
    var match = Regex.Match(text, regexPattern, RegexOptions.Singleline);
    if (!match.Success)
    {
        wildcardValues = null;
        return false;
    }
    //skip the first one since it is the whole text
    wildcardValues = match.Groups.Cast<Group>().Skip(1).Select(i => i.Value).ToArray();
    return true;
}
Sample usage
string[] wildcardValues;
if(TryParse("Hello *. * * to *", "Hello World. Happy holidays to all", out wildcardValues))
{
//it's a match
//wildcardValues contains the values of the wildcard which is
//['World','Happy','holidays','all'] in this sample
}
By the way, you don't really need Regex for this, it's overkill. Just implement your own algorithm by splitting the pattern into tokens using string.Split then finding each token using string.IndexOf. Although using Regex does result in shorter code
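The Split/IndexOf idea mentioned above could look roughly like the sketch below. WildcardMatcher and its Match method are hypothetical names; the sketch assumes each * must cover at least one character (mirroring (.+) in the regex version) and compares case-sensitively, which may not be what an AIML engine expects.

```csharp
using System;
using System.Collections.Generic;

public static class WildcardMatcher
{
    // Returns the text covered by each '*', or null when the text
    // does not fit the pattern.
    public static List<string> Match(string pattern, string text)
    {
        string[] parts = pattern.Split('*'); // literal fragments between wildcards
        if (parts.Length == 1)
            return pattern == text ? new List<string>() : null;

        if (!text.StartsWith(parts[0], StringComparison.Ordinal))
            return null;

        var values = new List<string>();
        int pos = parts[0].Length; // where the current wildcard starts

        for (int i = 1; i < parts.Length; i++)
        {
            int next;
            if (i == parts.Length - 1)
            {
                // The final fragment must sit at the very end of the text,
                // leaving at least one character for the wildcard before it.
                next = text.Length - parts[i].Length;
                if (next <= pos || !text.EndsWith(parts[i], StringComparison.Ordinal))
                    return null;
            }
            else
            {
                // Search one character later so the wildcard is never empty.
                if (pos + 1 > text.Length) return null;
                next = text.IndexOf(parts[i], pos + 1, StringComparison.Ordinal);
                if (next < 0) return null;
            }
            values.Add(text.Substring(pos, next - pos));
            pos = next + parts[i].Length;
        }
        return values;
    }
}
```

One difference from the regex version is worth knowing: IndexOf takes the first occurrence of each fragment, so wildcards here are as short as possible, whereas (.+) is greedy. Both agree on whether the text matches, but they can split an ambiguous input differently.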

Do you think this should work for you?
Match match = Regex.Match(pattern_string, @"<pattern>a [^<]+ is a [^<]+</pattern>");
if (match.Success)
{
// do something...
}
Here [^<]+ matches one or more characters that are not <.
If you think you may have a < character in your *, then you can simply use .+ instead of [^<]+.
But this is risky, as .+ matches one or more of any character.

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx. I'm new to RegEx, so a Google search doesn't really help me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong

You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. Each matched letter is stored in a capturing group (because of the parentheses around it); the positive lookahead assertion (the (?=...)) then reuses that group via \1, which holds the matched character.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
    Console.WriteLine(item.Index);
}
Output
0
1
2
5

The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other expression you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
    Console.WriteLine(m.Index);
    if (m.Index < textLength)
    {
        // restart the search one character after the start of the previous match
        m = rx.Match(text, m.Index + 1);
    }
    else
    {
        m = null;
    }
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
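Putting the lookahead trick and the empty-match behaviour described above together, a reusable helper might look like the following sketch. OverlapFinder and FindAll are illustrative names; the expression is inserted into the lookahead as-is, so it is assumed to be a valid regex fragment.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapFinder
{
    // Returns the start index of every position where the expression
    // matches, including matches that overlap each other.
    public static List<int> FindAll(string text, string expression)
    {
        // The lookahead consumes no characters, so every match is empty
        // and the engine advances one character at a time.
        var lookahead = new Regex("(?=" + expression + ")");
        return lookahead.Matches(text)
                        .Cast<Match>()
                        .Select(m => m.Index)
                        .ToList();
    }
}
```

For the question's input, OverlapFinder.FindAll("aaaabaa", "aa") gives the positions 0, 1, 2 and 5.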

Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

c# Regex for list Parsing

I have a text field that accepts user input in the form of delimited lists of strings. I have two main delimiters, a space and a comma.
If an item in the list contains more than one word, a user can delineate it by enclosing it in quotes.
Sample Input:
Apple, Banana Cat, "Dog starts with a D" Elephant Fox "G is tough", "House"
Desired Output:
Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is a tough one
House
I've been working on getting a regex for this, and I can't figure out how to allow the commas. Here is what I have so far:
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
    .Cast<Match>()
    .Select(m => m.Groups["match"].Value.Replace("\"", ""))
    .Where(x => x != "")
    .Distinct()
    .ToList()

That regex is pretty smart if it can turn "G is tough" into G is a tough one :-)
On a more serious note, code up a parser and don't try to rely on a singular regex to do this for you.
You'll find you learn more, the code will be more readable, and you won't have to concern yourself with edge cases that you haven't even figured out yet, like:
Apple, Banana Cat, "Dog, not elephant, starts with a D" Elephant Fox
A simple parser for that situation would be:
state = whitespace
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            word = character
            state = inword
    else:
        if character is whitespace:
            process word
            word = ""
            state = whitespace
        else:
            word = word + character
and it's relatively easy to add support for quoting:
state = whitespace
quote = no
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is quote:
            // an opening quote starts a word but is not part of it
            quote = yes
            state = inword
        else if character is not whitespace:
            word = character
            state = inword
    else:
        if character is whitespace and quote is no:
            process word
            word = ""
            state = whitespace
        else:
            if character is quote:
                quote = not quote
            else:
                word = word + character
Note that I haven't tested these thoroughly but I've done these quite a bit in the past so I'm quietly confident. It's only a short step from there to one that can also allow escaping (for example, if you want quotes within quotes like "The \" character is inside").
To get a single regex capable of handling multiple separators isn't that hard, getting it to monitor state, such as when you're within quotes, so you can treat separators differently, is another level.
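For the input format in the question, a state-tracking parser along these lines could be written in C# as sketched below. It is not the answer's exact algorithm: ListParser and Parse are made-up names, commas are treated as separators alongside whitespace, and a backslash is assumed to escape the next character (the escaping extension mentioned above).

```csharp
using System.Collections.Generic;
using System.Text;

public static class ListParser
{
    public static List<string> Parse(string input)
    {
        var items = new List<string>();
        var word = new StringBuilder();
        bool inWord = false;   // an item has been started
        bool inQuotes = false; // separators are literal while this is set
        bool escaped = false;  // previous character was a backslash

        foreach (char c in input)
        {
            if (escaped)
            {
                word.Append(c); // take the character literally
                escaped = false;
            }
            else if (c == '\\')
            {
                escaped = true;
                inWord = true;
            }
            else if (c == '"')
            {
                inQuotes = !inQuotes; // quotes toggle state, not content
                inWord = true;
            }
            else if (!inQuotes && (c == ',' || char.IsWhiteSpace(c)))
            {
                if (inWord)
                {
                    items.Add(word.ToString());
                    word.Clear();
                    inWord = false;
                }
            }
            else
            {
                word.Append(c);
                inWord = true;
            }
        }

        // Flush the last item (also covers an unterminated quote)
        if (inWord)
        {
            items.Add(word.ToString());
        }
        return items;
    }
}
```

With the sample input from the question this produces eight items, and a quoted item such as "Dog, not elephant, starts with a D" stays in one piece because commas are ignored while inQuotes is set.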

You should choose between using spaces or commas as delimiters. Using both is a bit confusing. If that choice is not yours to make, I would grab things between quotes first. When they are gone, you can just replace all commas with spaces and split the line on spaces.

You could perform two regexes. The first one to match the quoted sections, then remove them. With the second regex you could match the remaining words.
string pat = "\"(.*?)\"", pat2 = "(\\w+)";
string x = "Apple, Banana Cat, \"Dog starts with a D\" Elephant Fox \"G is tough\", \"House\"";
IEnumerable<Match> combined = Regex.Matches(Regex.Replace(x, pat, ""), pat2).OfType<Match>().Union(Regex.Matches(x, pat).OfType<Match>()).Where(m => m.Success);
foreach (Match m in combined)
Console.WriteLine(m.Groups[1].ToString());
Let me know if this isn't what you were looking for.

I like paxdiablo's parser, but if you want to use a single regex, then consider my modified version of a CSV regex parser.
Step 1: the original
string regex = "((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))";
Step 2: using multiple delimiters
char quoter = '"'; // quotation mark
string delimiter = " ,"; // either space or comma
string regex = string.Format("((?<field>[^\\r\\n{1}{0}]*)|[{1}](?<field>([^{1}]|[{1}][{1}])*)[{1}])([{0}]|(?<rowbreak>\\r\\n|\\n|$))", delimiter, quoter);
Using a simple loop to test:
Regex re = new Regex(regex);
foreach (Match m in re.Matches(input))
{
string field = m.Result("${field}").Replace("\"\"", "\"").Trim();
// string rowbreak = m.Result("${rowbreak}");
if (field != string.Empty)
{
// Print(field);
}
}
We get the output:
Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is tough
House
That's it!
Look at the original CSV regex parser for ideas on handling the matched regex data. You might have to modify it slightly, but you'll get the idea.
Just for interest sake, if you are crazy enough to want to use multiple characters as a single delimiter, then consider this answer.
