Regex.Unescape exception - c#

The following folder path is stored in a database table as \\SnowAngel\IcedData. However, when read from the database it comes back as:
string myFolderName = "\\\\SnowAngel\\IcedData"; where SnowAngel is the server name.
Regex.Unescape(myFolderName);
The above line throws the following exception:
{"parsing \"\\SnowAngel\IcedData\" - Unrecognized escape sequence \I."}
What am I missing here?

One has to deal with two parsers: the first is the C# language parser and the second is the regex parser. You have added multiple backslashes to speak to the C# parser, and that is confusing to the regex parser.
I recommend that you use the C# verbatim literal @ when dealing with regex patterns. That way one doesn't have to worry about the C# parser. Simply change it to
string myFolderName = @"\\SnowAngel\IcedData";
and work with it in regex, though that doesn't look like a pattern.
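A small check may help separate the two parsers. The sketch below assumes only what the question shows; the class and method names (UnescapeDemo, TryUnescape) are made up for illustration. It wraps Regex.Unescape so you can see which inputs it accepts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class UnescapeDemo
{
    // Returns true when Regex.Unescape accepts the input, false when it
    // rejects it with an "Unrecognized escape sequence" ArgumentException.
    public static bool TryUnescape(string input, out string result)
    {
        try
        {
            result = Regex.Unescape(input);
            return true;
        }
        catch (ArgumentException)
        {
            result = null;
            return false;
        }
    }
}
```

Note that "\\\\SnowAngel\\IcedData" and @"\\SnowAngel\IcedData" are the same string at run time, so TryUnescape returns false for both: \I is not an escape sequence Regex.Unescape knows. If the value from the database was never regex-escaped in the first place, it probably does not need to be passed through Regex.Unescape at all.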

Related

Parsing Lucene Query syntax and escaping for CloudSearch

Basically, I have an application that needs to support both Lucene.NET and Amazon CloudSearch.
So, I can't rewrite the queries; I need to use the standard queries from Lucene and call .ToString() on the query to get the syntax.
The issue is that in Lucene.NET (I don't know if this is the same in the Java version), the .ToString() method returns the raw string without the escape characters.
Therefore, things like:
(title:blah:blah summary:"lala:la")
should be
(title:blah\:blah summary:"lala\:la")
What I need is a regex that will add the escapes.
Is this possible, and if so, what would it look like?
Some additional possible variances:
(title:"this is a search:term")
(field5:"this is a title:term")
Based on comments and edits, it seems that you want any query string to be correctly escaped by the regex, and any given Lucene query to be accurately represented by the resulting string.
That ain't gonna happen.
Lucene query syntax is not capable of expressing all Lucene queries. In fact, the string you get from Query.toString() often can't even be parsed by the QueryParser, never mind being an accurate reconstruction of the query.
The long and short of it: you are going about this the wrong way. Query.ToString() is not designed to serialize the query, and its goal is not to create a parsable query string. It's mainly for debugging and such. If you keep attempting to use it this way, this tomfoolery of trying to use a regex to escape ambiguous query syntax will likely just be the start of your troubles.
This question provides another example of this.
You can use this regex to escape the colon : at strategic points of the string
(?<!title|summary):
Then escape the captured colon :
Explanation
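The negative lookbehind (?<! ... ) rejects any position that is immediately preceded by title or summary, so the pattern matches only a colon : that does not directly follow one of those field names.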
See Demo
input
(title:blah:blah summary:"lala:la")
Output
(title:blah\:blah summary:"lala\:la")
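In C#, the lookbehind approach above could be applied roughly like this. It is a sketch, not a general Lucene escaper: the class and method names are invented here, and the list of field names is something the caller has to supply.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Lucene.NET.
public static class LuceneColonEscaper
{
    // Escapes every ':' that is not immediately preceded by one of the
    // given field names, by building the pattern (?<!name1|name2|...):
    public static string EscapeColons(string query, params string[] fieldNames)
    {
        string alternation = string.Join("|", fieldNames.Select(n => Regex.Escape(n)));
        string pattern = "(?<!" + alternation + "):";
        // In a .NET replacement string only '$' is special, so @"\:" is a literal backslash plus colon.
        return Regex.Replace(query, pattern, @"\:");
    }
}
```

Calling EscapeColons on (title:blah:blah summary:"lala:la") with the field names title and summary gives (title:blah\:blah summary:"lala\:la"). Every field has to be listed, though: field5 from the question would need adding. A value that happens to end in a field name, such as "this is a title:term", is left unescaped. That is the kind of ambiguity the other answer warns about.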

Parsing XML in VB.Net is failing due to a special character

I have some VB.Net code which is parsing an XML string.
The XML String comes from a TCP 3rd Party stream and as such we have to take the data we get and deal with it.
The issue we have is that the data in one of the elements can sometimes contain special characters, e.g. &, $, <, and so "XMLDoc.LoadXml(XML)" fails when it is executed. Note that XMLDoc is declared as "Dim XMLDoc As XmlDocument = New XmlDocument()".
I have tried to Google answers for this but am really struggling to find a solution. I have looked at regex but realised it has some limitations; or I just don't understand it well enough.
If it helps, here is an example of the XML that would be streamed to us (just for info, the Message tag comes from an SMS message):
(The only part that will ever have an error, and all I have to check, is the <Message>O&N</Message> section; in this case the message has come in with an &.)
<IncomingMessage><DeviceSendTime>19/02/2013 14:00:50</DeviceSendTime>
<Sender>0000111111</Sender>
<Status>New</Status>
<Transport>Sms</Transport>
<Id>-1</Id>
<Message>O&N</Message>
<Timestamp>19/02/2013 14:00:50</Timestamp>
<ReadTimestamp>19/02/2013 14:00:50</ReadTimestamp>
</IncomingMessage>
If we're looking specifically within Message elements, and assuming there are no nested elements within the Message element:
Dim url = "put url here"
Dim s As String
Dim characterMappings = New Dictionary(Of String, String) From {
    {"&", "&amp;"},
    {"<", "&lt;"},
    {">", "&gt;"},
    {"""", "&quot;"}
}
Using client As New WebClient
    s = client.DownloadString(url)
End Using
' Match only the text between the Message tags, then escape each special character inside it.
' (Returning the entity for the whole match would replace the entire Message element.)
s = Regex.Replace(s,
    "(?<=<Message>).*?(?=</Message>)",
    Function(content As Match) Regex.Replace(content.Value,
        String.Join("|", characterMappings.Keys),
        Function(m As Match) characterMappings(m.Value)),
    RegexOptions.Singleline)
Dim x = XDocument.Parse(s)
$ should not be an issue with XML, but if it is you can add it to the dictionary.
Use of WebClient comes from here.
Updated
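Since $ has special meaning in regular expressions, it cannot simply be added to the dictionary and joined into the pattern; it needs to be escaped with \ in the regular expression. The simplest way to do this is to write the alternation manually instead of joining the dictionary keys (the dictionary still needs an entry for $):
' Match only the text between the Message tags, then replace each listed character inside it.
s = Regex.Replace(s,
    "(?<=<Message>).*?(?=</Message>)",
    Function(content As Match) Regex.Replace(content.Value,
        "&|<|>|\$",
        Function(m As Match) characterMappings(m.Value)),
    RegexOptions.Singleline)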
Also, I highly recommend Expresso for working with regular expressions.
Your XML is invalid and hence it is not XML. Either fix the code that generates the XML (the correct approach) or pretend this is a text file and enjoy all the problems of parsing unstructured text.
As you've stated in the question, <Message>O&N</Message> is not valid XML. The most likely reason for such "XML" is that it was built by string concatenation instead of proper XML manipulation methods. Unless you use some arcane language, all practically used languages have built-in or library support for XML creation, so it should not be too hard to create the XML correctly.
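For whoever controls the generating side, here is a minimal sketch of what "proper XML manipulation methods" could look like in .NET, using LINQ to XML. The class and method names are invented, and only two of the elements from the sample are included to keep it short.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical builder; element names are taken from the sample in the question.
public static class IncomingMessageBuilder
{
    // XElement escapes special characters in text content when serializing,
    // so the caller passes the raw message text.
    public static string Build(string sender, string message)
    {
        var doc = new XElement("IncomingMessage",
            new XElement("Sender", sender),
            new XElement("Message", message));
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Build("0000111111", "O&N") produces <IncomingMessage><Sender>0000111111</Sender><Message>O&amp;N</Message></IncomingMessage>, which loads without error and reads back as O&N.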

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working on something at the moment and need to extract an attribute from a big list of tags. They are formatted like this:
<appid="928" appname="extractapp" supportemail="me@mydomain.com" /><appid="928" appname="extractapp" supportemail="me@mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to separate each individual tag, then loop through the result and pull out the emails?
I would then add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used regex much, so I don't know if it's suitable for the above.
I would spend more time trying it myself, but it's quite urgent. So hopefully somebody can help.
Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx
Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.
What about modifying the string so that it is properly formed XML, then loading it as XML to extract all the values of the supportemail attribute?
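A rough sketch of that idea, assuming every tag starts exactly with <appid= as in the question. The element name app and the wrapper element root are invented here so that the text becomes well-formed XML; the class name is also hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for the tag list shown in the question.
public static class SupportEmailExtractor
{
    public static List<string> Extract(string tags)
    {
        // Give each tag an element name ("app") and wrap everything in a
        // single root element so that XDocument can parse it.
        string xml = "<root>" + tags.Replace("<appid=", "<app appid=") + "</root>";

        return XDocument.Parse(xml)
            .Root.Elements("app")
            .Select(e => (string)e.Attribute("supportemail"))
            .Where(v => v != null)   // skip tags that have no supportemail attribute
            .ToList();
    }
}
```

The comma-separated output asked for in the question is then string.Join(",", SupportEmailExtractor.Extract(input)). Because the parser does the work, attribute order and single versus double quotes stop mattering.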
Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach (Match m in matches)
    Console.WriteLine(m.Groups[1].Value);
See it here.
Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entities &amp;, &apos;, &gt;, &lt;, and &quot;.
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.

Heredoc strings in C#

Is there a heredoc notation for strings in C#, preferably one where I don't have to escape anything (including double quotes, which are a quirk in verbatim strings)?
As others have said, there isn't.
Personally I would avoid creating them in the first place though - I would use an embedded resource instead. They're pretty easy to work with, and if you have a utility method to load a named embedded resource from the calling assembly as a string (probably assuming UTF-8 encoding) it means that:
If your embedded document is something like SQL, XSLT, HTML etc you'll get syntax highlighting because it really will be a SQL (etc) file
You don't need to worry about any escaping
You don't need to worry about either indenting your document or making your C# code look ugly
You can use the file in a "normal" way if that's relevant (e.g. view it as an HTML page)
Your data is separated from your code
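The utility method mentioned above might look something like the following. This is a sketch rather than the answerer's actual helper: the class name is invented, UTF-8 is assumed as stated, and the assembly is passed in explicitly instead of being discovered from the caller.

```csharp
using System;
using System.IO;
using System.Reflection;
using System.Text;

// Hypothetical utility for reading an embedded resource as a string.
public static class EmbeddedResource
{
    public static string ReadAllText(Assembly assembly, string resourceName)
    {
        // GetManifestResourceStream returns null when the name is not found.
        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        {
            if (stream == null)
                throw new FileNotFoundException("Embedded resource not found: " + resourceName);

            using (var reader = new StreamReader(stream, Encoding.UTF8))
                return reader.ReadToEnd();
        }
    }
}
```

Usage would be along the lines of EmbeddedResource.ReadAllText(Assembly.GetExecutingAssembly(), "MyApp.Queries.GetUsers.sql"), where the resource name is a made-up example. The real name normally follows the project's default namespace and folder structure, and the file's build action has to be set to Embedded Resource.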
Well even though it doesn't support HEREDOC's, you can still do stuff like the following using Verbatim strings:
string miniTemplate = @"
Hello ""{0}"",
Your friend {1} sent you this message:
{2}
That's all!";
string populatedTemplate = String.Format(miniTemplate, "Fred", "Jack", "HelloWorld!");
System.Console.WriteLine(populatedTemplate);
Snagged from:
http://blog.luckyus.net/2009/02/03/heredoc-in-c-sharp/
No, there is no "HEREDOC" style string literal in C#.
C# has only two types of string literals:
Regular literal, with many escape sequences necessary
Verbatim literal, @-quoted: double quotes need to be escaped by doubling
References
csharpindepth.com - General Articles - Strings
MSDN - C# Programmer's Reference - Strings
String literals are of type string and can be written in two forms, quoted and @-quoted.
November 2022 update:
Starting with C# 11 this is now possible using Raw string literals:
var longMessage = """
This is a long message.
Some "quoted text" here.
""";

Parser using RegEx and XML, in C#

I am making an application where I need to verify the syntax of each line which contains a command involving a keyword as the first word.
Also, if the syntax is correct I need to check the type of the variables used in the keywords.
Like if there's a print command:
print "string" < variable < "some another string" //some comments
print\s".*"((\s?<\s?".*")*\s?<\s?(?'string1'\w+))?(\s*//.*)?
So I made the following regex:
\s*[<>]\s*((?'variant'\w+)(\[\d+\])*)
This is to access all words in variant group to extract the variables used and verify their type.
My tool has many keywords like this, and currently I am crudely writing a regex for each keyword. If something changes tomorrow, I would have to make the same change everywhere, in every keyword.
I am storing a regex for each keyword in an XML file. However, I am interested in making it extensible, so that if the specification changes tomorrow I need to change it only once and the change is reflected everywhere. For example, I would transform the print regex to:
print %string% (%<% %string%|%variable%)* %comments%
This way I write a specification for each keyword, and put the definitions of string, variable and comments in another file that stores their regexes. Then I write a parser which parses this string and creates a regex string for me.
Is this possible?
Is there any better way of doing this or is there any way I can do this in XML?
Last time I asked a question like this, someone pointed me to http://www.antlr.org/. Enjoy. :-)
I got an idea and made my own replacer. I used %myname%-style tags to define my regular expressions, and I wrote the definitions of the %myname% tags separately using regex. Then I scanned the string recursively and replaced each occurrence of a %myname% tag with its definition. It did the job. Thanks anyway.
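For anyone wanting to try the same thing, a replacer along those lines could be sketched as below. This is not the poster's actual code: the class name, the depth limit of 20 and the rule that a tag is any run of non-space, non-% characters between two % signs are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical %name% template expander.
public static class PatternTemplate
{
    // A tag is %...% with no whitespace or '%' inside, so %string% and %<% both qualify.
    private static readonly Regex Placeholder = new Regex(@"%([^%\s]+)%");

    // Replaces each tag with its definition, expanding tags inside definitions recursively.
    public static string Expand(string template, IDictionary<string, string> definitions, int depth = 0)
    {
        if (depth > 20)
            throw new InvalidOperationException("Tag definitions appear to be circular.");

        return Placeholder.Replace(template, m =>
        {
            string name = m.Groups[1].Value;
            string definition;
            if (!definitions.TryGetValue(name, out definition))
                throw new KeyNotFoundException("No definition for %" + name + "%");
            return Expand(definition, definitions, depth + 1);
        });
    }
}
```

The definitions dictionary can be loaded from the XML file, so a change to, say, the string definition is made once and picked up by every keyword the next time the patterns are expanded. Spaces in a template are kept as literal spaces in the resulting regex, so either leave them out of the template or compile the result with RegexOptions.IgnorePatternWhitespace.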
