Parsing Lucene Query syntax and escaping for CloudSearch

Parsing Lucene Query syntax and escaping for CloudSearch - c#

Basically, I have an application that needs to support both Lucene.NET and Amazon CloudSearch.
So, I can't re-write the queries, I need to use the standard queries from lucene, and use the .ToString() on the query to get the syntax.
The issue is that in Lucene.NET (I don't know if this is the same in the java version), the .ToString() method return the raw string without the escape characters.
Therefore, things like:
(title:blah:blah summary:"lala:la")
should be
(title:blah\:blah summary:"lala\:la")
What I need is a regex that will add the escapes.
Is this possible? and if so, what would it look like.
Some additional possible variances:
(title:"this is a search:term")
(field5:"this is a title:term")

Based on comments and edits, it seems that you want any query string to be able to be correctly escaped by the regex, and any given lucene query to be accurately represented by the resulting string.
That ain't gonna happen.
Lucene query syntax is not capable of expressing all lucene queries. In fact, the string you get from Query.toString() often can't even be parsed by the QueryParser, nevermind being an accurate reconstruction of the query.
The long and short of it: You are going about this the wrong way. Query.ToString() is not designed to serialize the query, and it's goal is not to create a parsable string query. It's mainly for debugging and such. If you keep attempting to use it this way, this tomfoolery of trying to use a regex to escape ambiguous query syntax will likely just be the start of your troubles.
This question provides another example of this.

You can use this regex to escape the colon : at strategic points of the string
(?<!title|summary):
Then escape the captured colon :
Explanation
Look behind ?<! for any colon that is not followed by title or summary, then match the colon :
See Demo
input
(title:blah:blah summary:"lala:la")
Output
(title:blah\:blah summary:"lala\:la")

Related

C# Regex filter problems

At this moment in time, i posted something earlier asking about the same type of question regarding Regex. It has given me headaches, i have looked up loads of documentation of how to use regex but i still could not put my finger on it. I wouldn't want to waste another 6 hours looking to filter simple (i think) expressions.
So basically what i want to do is filter all filetypes with the endings of HTML extensions (the '*' stars are from a Winforms Tabcontrol signifying that the file has been modified. I also need them in IgnoreCase:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when i tried to learn from the expression above i would always end up with opening a .xhtml only, the rest are not valid.
For the HTML expression, i currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml case sensitive or insensitive. .html files, .htm and the rest did not match. I would really appreciate an explanation to each of the expressions you provide (so i don't have to ask the same question ever again).
Thank you.

For such cases you may start with a simple regex that can be simplified step by step down to a good regex expression:
In C# this would basically, with IgnoreCase, be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: The most easy one is simply concatenating all valid results with OR + escaping (if possible):
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .*(Any character, 0-infinite times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot and may have an optional end and ending.* always contains ending:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And, I always check if it's still working in regexstorm for .Net
That way, you may also get regular expressions for the other 2 ones and concat them all together in the end.

Regex.Unescape exception

The following folder path stored on a database table as \\SnowAngel\IcedData. However when reading from the database it is coming as:
string myFolderName = "\\\\SnowAngel\\IcedData"; Where SnowAngel is the server name.
Regex.Unescape(myFolderName);
The above line throws the following exception:
{"parsing \"\\SnowAngel\IcedData\" - Unrecognized escape sequence \I."}
What I'm missing here ?

One has to deal with two parsers, the first is the C# language and the second is the regex parser. You have added multiple slashes to speak to the C# parser and that is confusing to the regex parser.
I recommend that you use the C# literal # when dealing with regex patterns. That way one doesn't have to worry about the C# parser. Simply change it to
string myFolderName = #"\\SnowAngel\IcedData";
and work with it in regex, though that doesn't look like a pattern.

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working something at the moment and need to extract an attribute from a big list tags, they are formatted like this:
<appid="928" appname="extractapp" supportemail="me#mydomain.com" /><appid="928" appname="extractapp" supportemail="me#mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to seperate each individual tag, then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.

Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx

Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.

What about modify the string to have proper xml format and load xml to extract all the values of supportemail attribute?

Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach(Match m in matches)
Console.WriteLine(m.Groups[1].Value);
See it here.

Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entitites &, &apos;, >, <, and ".
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.

RegEx replace with calculations?

Is it possible somehow to do a RegEx-replace with a calculation in the result? (in VS2010)
Such as:
Grid\.Row\=\"{[0-9]+}\"
to
Grid.Row="eval(int(\1) + 1)"

You can use a MatchEvaluator do achieve this, like
String s = Regex.Replace("1239", #"\d", m => (Int32.Parse(m.ToString()) + 1).ToString());
Output: 23410
Edit:
I just noticed... if you mean "using the VS2010 find-replace feature" and not "using C#", then the answer is "no", i am afraid.

You could always use capturing to retrieve any values you need for your calculation and then perform a RegEx Replace with a new RegEx that's constructed from you're equation and any values you captured.
If the equation doesn't use anything from the input text, one RegEx would be sufficient. You'd simply construct it by concatenating the static portions together with the computed value(s).

Unfortunately, C# and .NET do not provide an eval method or equivalent. However, it is possible to either use a library for expression parsing (a quick google gave me this .NET Math Expression Parser) or write your own (which is actually pretty easy, check out the Shunting-yard Algorithm and Postfix Notation). Simply capture the group then output the group value to the library/method you have written.
Edit: I see now you want this for the VS2010 program. This is unachievable unless you write your own VS extension. You could always write a program to search and replace your code and feed the code into it, then replace it the original code with its output.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.

I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.

I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."

I don't believe it's possible, but I am intrigued why you would want it.

In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.

It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Parsing Lucene Query syntax and escaping for CloudSearch - c#

Related

C# Regex filter problems

Regex.Unescape exception

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

RegEx replace with calculations?

Regex index in matching string where the match failed

Categories

Resources