How to create a parser for search queries - C#

For example, I need to create something like a Google search query parser to parse expressions such as:
flying hiking or swimming -"walking in boots" author:hamish author:reid
or
house in new york priced over $500000 with a swimming pool
How would I even start building something like this? Are there any good resources?
C#-relevant, please (if possible).
Edit: this is something that I should somehow be able to translate to a SQL query.

How many keywords do you have (like 'or', 'in', 'priced over', 'with a')? If you only have a couple of them I'd suggest going with simple string processing (regexes) too.
But if you have more than that, you might want to look into implementing a real parser for those search expressions. Irony.net might help you with that (I found it extremely easy to use, as you can express your grammar in a near-BNF form directly in code).

The Lucene/NLucene projects have functionality for boolean queries and some other query formats as well. I don't know whether you can add your own extensions, like author in your case, but it might be worthwhile to check it out.

There are a few ways of doing it; two of them are:
Parsing using a grammar (useful for a complex language)
Parsing using regular expressions and basic string manipulation (for a simpler language)
Judging by your example, the language is very basic, so splitting the string on the keywords may be the best solution.
string sentence = "house in new york priced over $500000 with a swimming pool";
string[] values = sentence.Split(new []{" in ", " priced over ", " with a "},
StringSplitOptions.None);
string type = values[0];
string area = values[1];
string price = values[2];
string accessories = values[3];
However, some issues may arise: how do you verify that the sentence is in the expected form? What happens if some of the keywords can appear as part of the values?
If you run into that, there are libraries you can use to parse input using a defined grammar. Two such libraries that work with .NET are ANTLR and GOLD Parser; both are free. The main challenge is defining the grammar.
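As an in-between step before reaching for a grammar, the "is the sentence in the expected form?" check can be sketched with a single regular expression with named groups. This is only a sketch: the class name ListingQuery, the method TryParse and the pattern itself are made up for illustration, and it assumes the one sentence shape from the example with a whole-number price.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ListingQuery
{
    // Hypothetical pattern for the single sentence shape in the example:
    // "<type> in <area> priced over <price> with a <accessories>".
    private static readonly Regex Pattern = new Regex(
        @"^(?<type>.+?) in (?<area>.+?) priced over \$?(?<price>[0-9]+) with a (?<accessories>.+)$",
        RegexOptions.IgnoreCase);

    // Returns false when the sentence does not have the expected form,
    // rather than failing later when indexing the result of Split.
    public static bool TryParse(string sentence, out string type, out string area,
                                out decimal price, out string accessories)
    {
        type = area = accessories = string.Empty;
        price = 0m;

        Match m = Pattern.Match(sentence);
        if (!m.Success) return false;

        // NumberStyles.None accepts digits only; very long digit runs fail here.
        if (!decimal.TryParse(m.Groups["price"].Value, NumberStyles.None,
                              CultureInfo.InvariantCulture, out price))
            return false;

        type = m.Groups["type"].Value;
        area = m.Groups["area"].Value;
        accessories = m.Groups["accessories"].Value;
        return true;
    }
}
```

Calling ListingQuery.TryParse on the example sentence fills in "house", "new york", 500000 and "swimming pool"; a sentence missing one of the keywords simply returns false. It does not solve the second issue (keywords appearing inside values), which is where a grammar starts to pay off.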

A grammar would work very well for the second example you gave but the first (any order keyword/command strings) would be best handled using Split() and a class to handle the various keywords and commands. You will have to do initial processing to handle quoted regions before the split (for example replacing spaces within quoted regions with a rare/unused character).
The ":" commands are easy to find and pull out of the search string for processing after the split is completed. Simply traverse the array looking for them.
The +/- keywords are also easy to find and add to the SQL query as AND/AND NOT clauses.
The only place you might run into issues is with the "or" since you'll have to define how it is handled. What if there are multiple "or"s? But the order of keywords in the array is the same as in the query so that won't be an issue.
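A rough sketch of that tokenising step, under some assumptions. It uses one regular expression to pick out quoted phrases, field:value commands and +/- prefixes, instead of the replace-spaces-then-Split trick described above. The names SearchTerm and SearchQueryTokenizer are invented for the example, and it leaves the handling of "or" and the SQL generation to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class SearchTerm
{
    public bool Negated;   // term was prefixed with '-'
    public string Field;   // e.g. "author"; empty for a plain term
    public string Text;    // word, or quoted phrase without the quotes
    public bool IsOr;      // the bare, unquoted keyword "or"
}

public static class SearchQueryTokenizer
{
    // optional sign, optional "field:", then a quoted phrase or a bare word
    private static readonly Regex TermPattern = new Regex(
        @"(?<sign>[-+])?(?:(?<field>\w+):)?(?:""(?<phrase>[^""]*)""|(?<word>[^\s""]+))");

    public static List<SearchTerm> Tokenize(string query)
    {
        var terms = new List<SearchTerm>();
        foreach (Match m in TermPattern.Matches(query))
        {
            bool quoted = m.Groups["phrase"].Success;
            bool hasField = m.Groups["field"].Success;
            bool hasSign = m.Groups["sign"].Success;
            string text = quoted ? m.Groups["phrase"].Value.Trim()
                                 : m.Groups["word"].Value;

            terms.Add(new SearchTerm
            {
                Negated = m.Groups["sign"].Value == "-",
                Field = m.Groups["field"].Value,   // "" when no field matched
                Text = text,
                IsOr = !quoted && !hasField && !hasSign &&
                       string.Equals(text, "or", StringComparison.OrdinalIgnoreCase)
            });
        }
        return terms;
    }
}
```

For the first example query this yields seven terms in their original order: three plain words, one IsOr marker, a negated phrase and two author terms. When building the SQL, pass each Text as a query parameter rather than concatenating it into the statement.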

I think you should just do some string processing. I don't know of a smarter way of doing this.
So replace "OR" with your own or operator (e.g. ||). As far as I know there is no library for this.
I suggest you go with regexes.

Related

Match expressions in Strings

I have a database here with certain rules I need to apply to a bunch of Strings; the rules are expressions that can occur within the Strings. They are expressed like
(word1 AND word2) OR (word3)
I can't hardcode those (because they may be changed in the database), so I thought about programmatically turning those expressions into Regex patterns.
Has anybody done such a task yet or has an idea on how to do this the best way?
I'm not quite sure how to deal with more complex expressions, how to take them apart and so on.
Edit: I'm using C# in VisualStudio / .NET.
The data is basically directory paths, a customer wants to get their documents organized, so the String I'm having are paths, the expressions in the DB could look like:
(office OR headquarter) AND (official OR confidential)
So if the file's directory path contains office and confidential, it should match.
Hope this makes it clearer.
EDIT2:
Here are some dummy examples:
The paths could look like:
c:\documents\official\johnmeyer\court\out\letter.doc
c:\documents\internal\appointments\court\in\september.doc
c:\documents\official\stevemiller\meeting\in\letter.doc
And the expressions like:
(meyer or miller) AND (court OR jail)
So this expression would match the 1st path/ file, but not the 2nd and 3rd one.
No answer, but a good hint:
The expressions you have are actual trees constructed by the parentheses. You need a stack machine to parse the text into a (binary) tree structure, where each node is an AND or OR element and the leaves are the words.
Afterwards, you can simply construct your regex in whatever language you need by walking the tree using depth first search and adding prefix and suffix data as needed before/after reading the subtree.
Consider an abstract class TreeNode having a method GenerateExpression(StringBuilder result).
Each actual TreeNode item will be either a CombinationTreeNode (with a CombinationMode And/Or) or a SearchTextTreeNode (with a SearchText property).
GenerateExpression(StringBuilder result) for CombinationTreeNode will look something like this:
result.Append("(");
leftSubTree.GenerateExpression(result);
result.Append(") " + this.CombinationMode.ToString() + " (");
rightSubTree.GenerateExpression(result);
result.Append(")");
GenerateExpression(StringBuilder result) for SearchTextTreeNode is much easier:
result.Append(this.SearchText);
Of course, your code would emit a regular expression rather than reproducing the input text, as mine does.
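To make the tree idea concrete, here is a small sketch with a few liberties taken. It builds the tree with a recursive-descent parser rather than an explicit stack machine, and it evaluates the tree directly against a path with a case-insensitive substring test instead of generating a regex. The class names (Node, WordNode, CombinationNode, RuleParser) are invented for the example, and it assumes AND binds tighter than OR.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public abstract class Node
{
    public abstract bool Matches(string text);
}

public sealed class WordNode : Node
{
    private readonly string word;
    public WordNode(string word) { this.word = word; }
    // Leaf: case-insensitive "contains" test.
    public override bool Matches(string text) =>
        text.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0;
}

public sealed class CombinationNode : Node
{
    private readonly Node left, right;
    private readonly bool isAnd;
    public CombinationNode(Node left, Node right, bool isAnd)
    { this.left = left; this.right = right; this.isAnd = isAnd; }
    public override bool Matches(string text) =>
        isAnd ? left.Matches(text) && right.Matches(text)
              : left.Matches(text) || right.Matches(text);
}

public sealed class RuleParser
{
    private readonly List<string> tokens = new List<string>();
    private int pos;

    private RuleParser(string rule)
    {
        // Tokens are "(", ")" or runs of other non-space characters.
        foreach (Match m in Regex.Matches(rule, @"[()]|[^\s()]+"))
            tokens.Add(m.Value);
    }

    public static Node Parse(string rule)
    {
        var parser = new RuleParser(rule);
        Node root = parser.ParseOr();
        if (parser.pos != parser.tokens.Count)
            throw new FormatException("Unexpected token: " + parser.tokens[parser.pos]);
        return root;
    }

    private bool NextIs(string token) =>
        pos < tokens.Count &&
        string.Equals(tokens[pos], token, StringComparison.OrdinalIgnoreCase);

    // or-expression := and-expression ( OR and-expression )*
    private Node ParseOr()
    {
        Node node = ParseAnd();
        while (NextIs("OR")) { pos++; node = new CombinationNode(node, ParseAnd(), false); }
        return node;
    }

    // and-expression := primary ( AND primary )*
    private Node ParseAnd()
    {
        Node node = ParsePrimary();
        while (NextIs("AND")) { pos++; node = new CombinationNode(node, ParsePrimary(), true); }
        return node;
    }

    // primary := word | "(" or-expression ")"
    private Node ParsePrimary()
    {
        if (pos >= tokens.Count) throw new FormatException("Unexpected end of rule");
        string token = tokens[pos++];
        if (token == ")") throw new FormatException("Unexpected ')'");
        if (token != "(") return new WordNode(token);

        Node inner = ParseOr();
        if (!NextIs(")")) throw new FormatException("Missing ')'");
        pos++;
        return inner;
    }
}
```

With the rule (meyer or miller) AND (court OR jail), RuleParser.Parse(rule).Matches(path) is true for the first sample path and false for the other two. Adding a GenerateExpression method to the two node classes would give the regex-producing variant described above.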

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... regexes are great but too tough for my users.
My plan is to enable users to specify a list of allowed characters, for example
a-z|A-Z|0-9|,
I can transform this into a regex which does the matching, like so:
[a-zA-Z0-9,]*
However, I'm a little lost on how to deal with all the escaping. Imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually, but before I write such a nasty solution: is there some nifty way to do this?
Thanks
David
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jettison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi-line text box, for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 *{}()
You get the idea. I think this will be much simpler.
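Putting the suggestions in this thread together, here is a sketch that avoids regex escaping entirely by expanding the user's items into a set of characters. It assumes the items arrive as separate strings (say, one per line of a text box) and that each item is either a single character or a three-character range such as a-z. The class name AllowedCharacters and its methods are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AllowedCharacters
{
    // Each item is either one character (",", "|", "-") or a range "a-z".
    // Because items are already separate strings, no delimiter or
    // escaping rules are needed.
    public static HashSet<char> Build(IEnumerable<string> items)
    {
        var allowed = new HashSet<char>();
        foreach (string item in items)
        {
            if (item.Length == 1)
            {
                allowed.Add(item[0]);
            }
            else if (item.Length == 3 && item[1] == '-' && item[0] <= item[2])
            {
                // int loop variable avoids wrap-around at char.MaxValue
                for (int c = item[0]; c <= item[2]; c++) allowed.Add((char)c);
            }
            else
            {
                throw new FormatException("Not a character or range: " + item);
            }
        }
        return allowed;
    }

    // Same check as the one-liner above, against the expanded set.
    public static bool IsValid(string input, HashSet<char> allowed) =>
        input.All(c => allowed.Contains(c));
}
```

For instance, building the set from "a-z", "A-Z", "0-9", ",", "|", "\", "*", "[" and "]" accepts Hello,World|[*] and rejects hello world, since the space was never listed.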

Complex String Processing - well complex to me

I am calling a web service and all I get back is a giant blob of text. I am left to process it myself. Problem is not all lines are necessarily the same. They each have 2 or 3 sections to them and they are similar. Here are the most common examples
text1 [text2] /text3/
text1/test3
text1[text2]/text3
text1 [text2] /text /3 here/
I am not exactly sure how to approach this problem. I am not too good at anything advanced when it comes to manipulating strings.
I was thinking a regular expression might work, but I'm not too sure about that either. If I can get each of these 3 sections broken up, it is easier from there to do the rest. It's just that there doesn't seem to be any uniformity to the main 3 sections that I know how to work with.
EDIT: Thanks for mentioning that I didn't actually say what I wanted to do.
Basically, I want to split these 3 sections of text into their own separate strings, i.e. go from one single string to an array of 3 strings.
string[0] = text1
string[1] = text2
string[2] = text3
Here is some of the text I get back from a call as an example
スルホ基 [スルホき] /(n) sulfo group/
鋭いナイフ [するどいナイフ] /(n) sharp knife/
鋭い批判 [するどいひはん] /(n) sharp criticism/
スルナーイ /(n) (See ズルナ) (obsc) surnay (Anatolian woodwind instrument) (per:)/zurna/
スルピリン /(n) sulpyrine/
スルファミン /(n) sulfamine/
剃る [そる(P);する] /(v5r,vt) to shave/(P)/
Taking the first line as an example, I want to pull it out into an array:
string[0] = スルホ基
string[1] = [スルホき]
string[2] = /(n) sulfo group/
Those examples seem a bit random. There has to be some kind of order; isn't there a spec for the service? If not, I suggest posting more examples so that we can understand the rules.
Read up on some of the info here on finite state machines, and see if you can use some of the concepts on your input parsing problem.
If there is some order to the groups on each line, then maybe you can use a regex to separate the groups out.
Edit: after seeing your samples, you may get by with a regex, breaking on some of those specific delimiters. It will take maybe half an hour to test theory: pick yourself up a free regex tester, make yourself a regex that will isolate out just one of those groups, and pump a few sample lines through. If it performs reliably on the real data that you have, then expand it and see if you can also isolate out the other groups.
I should mention, though, that your regexes will break or just become a nightmare if there are any vagaries in your data (and frequently there are). So test long and hard before settling on them. If you start finding exceptions in your data, then you will need to choose some sort of parsing algorithm (the FSM I mentioned above is a pattern you can follow if you implement a parsing mechanism).
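As a starting point for that experiment, here is one regex that handles the sample lines shown in the question. It is a sketch built only from those samples: it assumes the first section contains no "[" or "/", the bracketed section is optional, and the last section runs from the first remaining "/" to an optional trailing "/". The names EntryParser and TrySplit are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntryParser
{
    // head, optional [reading], then /gloss/ (trailing slash optional)
    private static readonly Regex LinePattern = new Regex(
        @"^(?<head>[^\[/]+?)\s*(?:\[(?<reading>[^\]]*)\]\s*)?/(?<gloss>.*?)/?\s*$");

    // On success, parts is { head, reading, gloss } with the brackets and
    // outer slashes removed; reading is "" when the line has no [..] part.
    public static bool TrySplit(string line, out string[] parts)
    {
        Match m = LinePattern.Match(line);
        if (!m.Success)
        {
            parts = Array.Empty<string>();
            return false;
        }
        parts = new[]
        {
            m.Groups["head"].Value,
            m.Groups["reading"].Value,
            m.Groups["gloss"].Value
        };
        return true;
    }
}
```

Inner slashes such as the one in "to shave/(P)" stay inside the third section. Lines that do not fit the assumed layout return false, which is a cheap way to find the exceptions in the real data before committing to this approach.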
The most stupid answer is "use a regex", but more information is needed for a better one.

How to do partial word searches in Lucene.NET?

I have a relatively small index containing around 4,000 locations. Among other things, I'm using it to populate an autocomplete field on a search form.
My index contains documents with a Location field containing values like
Ohio
Dayton, Ohio
Dublin, Ohio
Columbus, Ohio
I want to be able to type in "ohi" and have all of these results appear and right now nothing shows up until I type the full word "ohio".
I'm using Lucene.NET v2.3.2.1 and the relevant portion of my code is as follows for setting up my query....
BooleanQuery keywords = new BooleanQuery();
QueryParser parser = new QueryParser("location", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true);
keywords.Add(parser.Parse("\"*" + location + "*\""), BooleanClause.Occur.SHOULD);
luceneQuery.Add(keywords, BooleanClause.Occur.MUST);
In short, I'd like to get this working like a LIKE clause similar to
SELECT * from Location where Name LIKE '%ohi%'
Can I do this with Lucene?
Try this query:
parser.Parse(query.Keywords.ToLower() + "*")
Yes, this can be done, but a leading wildcard can result in slow queries; check the documentation. Also, if you are indexing the entire string (e.g. "Dayton, Ohio") as a single token, most of the queries will degenerate into leading-wildcard queries. Using a tokenizing analyzer like StandardAnalyzer (which I suppose you are already doing) will lessen the need for a leading wildcard.
If you don't want leading wildcards for performance reasons, you can try indexing n-grams. That way, there will not be any leading wildcard queries. The n-gram tokenizer (assuming only grams of length 4) will create tokens for "Dayton Ohio" such as "dayt", "ayto", "yton" and so on.
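To show what those tokens look like, here is a plain C# sketch of per-word character n-grams. It deliberately uses no Lucene.NET API: the class NGrams and its method are invented for the example, and wiring the output into an analyzer is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NGrams
{
    // Lower-cases the text, takes each run of letters/digits as a word and
    // emits every substring of length n. Words of length n or less are
    // emitted whole.
    public static List<string> Of(string text, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        var grams = new List<string>();
        foreach (Match word in Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{N}]+"))
        {
            string w = word.Value;
            if (w.Length <= n) { grams.Add(w); continue; }
            for (int i = 0; i + n <= w.Length; i++)
                grams.Add(w.Substring(i, n));
        }
        return grams;
    }
}
```

NGrams.Of("Dayton Ohio", 4) gives dayt, ayto, yton, ohio. Note that a three-letter query such as "ohi" would not match any 4-gram exactly, so for the autocomplete case in the question the gram length (or a range of lengths) has to be chosen to suit the shortest input you want to match.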
It's more a matter of populating your index with partial words in the first place. Your analyzer needs to put the partial keywords into the index as it analyzes (and hopefully weight them lower than full keywords as it does so).
Lucene index lookup trees work from left to right. If you want to search in the middle of a keyword, you have to break it up as you analyze. The problem is that partial keywords will usually explode your index size.
People usually use really creative analyzers that break words down into root words (by taking off prefixes and suffixes).
Dig deep into understanding Lucene. It's good stuff. :-)

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification: the reason I need this is to simplify the parsing component of my application. The application is an assembly language teaching tool which allows students to write, compile, and execute assembly-like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to the syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages, for example:
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code (e.g. use http://www.codeplex.com/NetMassDownloader to download the .NET source).
Change the code to expose a read-only property with the failure index.
Make sure your code uses that Regex rather than Microsoft's.
I guess such an index would only have meaning in some simple cases, like the one in your example.
If you take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", which index would you expect?
It will depend on the algorithm used for matching...
It could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to tell where a regex fails, so you need to take a different approach. You need to compare strings: use a regex to remove all the parts that can vary, and compare the result with the string that you know does not change.
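One different approach, sketched here under assumptions, stays at the token level that the question already has. Describe each statement's syntax as a list of expected token kinds and report the character position of the first token that differs. The types TokenKind, Token and SyntaxChecker are invented for the example, and the sketch assumes the tokenizer records each token's offset in the source line.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Opcode, Whitespace, Label, Eol }

public sealed class Token
{
    public TokenKind Kind;
    public int Position;   // character offset in the source line

    public Token(TokenKind kind, int position)
    {
        Kind = kind;
        Position = position;
    }
}

public static class SyntaxChecker
{
    // Returns -1 when the tokens match the expected sequence. Otherwise
    // returns the character position of the first token that differs, or
    // inputLength when the input ends before the expected sequence does.
    public static int FirstMismatch(IList<Token> tokens, IList<TokenKind> expected,
                                    int inputLength)
    {
        int count = Math.Min(tokens.Count, expected.Count);
        for (int i = 0; i < count; i++)
        {
            if (tokens[i].Kind != expected[i]) return tokens[i].Position;
        }
        if (tokens.Count > expected.Count) return tokens[expected.Count].Position;
        if (tokens.Count < expected.Count) return inputLength;
        return -1;
    }
}
```

For "INP :x:" tokenised as Opcode at 0, Whitespace at 3, Label at 4 and Eol at 7, checking against the sequence Opcode, Whitespace, Label, Eol returns -1. If the third token were another Opcode, the result would be 4, which is enough to tell the student where the error is and which token kind was expected there. This only covers fixed sequences; optional or repeated parts would need a richer description of the syntax.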
I ran into the same problem, came across your question, and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps
