C#: Help polishing string parsing - c#

One of the things I really value from this community is learning ways to do things in two lines that would normally take me twenty. In that spirit, I've done my best to take some string parsing down from about a dozen lines to three. But I feel like there's someone out there who wants to show me how this is actually a mess. Just for my own edification, is there a cleaner way to do the following? Could it all be done in one line?
string getThis = "<add key=\"messageFilter\" value=\"";
string subStr = strFile.Substring(strFile.IndexOf(getThis) + getThis.Length);
string[] igPhrases = subStr.Substring(0, subStr.IndexOf(";\"")).Split(';');
UPDATE
Thanks for the quick responses! Really helpful examples AND good advice with a minimum of snark. :) Fewer lines is not the same thing as clean and elegant, and reducing lines may actually make the code worse.
Let me rephrase the question.
I've got an XML doc that has the following line: <add key="messageFilter" value="Out of Office AutoReply;Automatic reply;"/>. This doc tells our automated ticketing system not to create tickets from emails that have those phrases in the subject line. Otherwise, we get an endless loop.
I'm working on a small program that will list phrases already included, and then allow users to add new phrases. If we notice that a new autoreply message is starting to loop through the system, we need to be able to add the language of that message to the filter.
I don't work a lot with XML. I like Sperske's solution, but I don't know how to make it dynamic. In other words, I can't put the value in my code. I need to find the key "messageFilter" and then get all the values associated with that key.
What I've done works, but it seems a little cumbersome. Is there a more straightforward way to get the key values? And to add a new one?

A slightly different one liner (split for readability):
System.Xml.Linq.XDocument
.Parse("<add key='messageFilter' value='AttrValue'/>")
.Root
.Attribute("value")
.Value
Outputs:
AttrValue
To address the updated question you could turn all of your <add> nodes into a dictionary (borrowing from Pako's excellent answer, and using a slightly longer string):
var keys = System.Xml.Linq.XDocument
.Parse("<keys><add key='messageFilter' value='AttrValue'/><add key='userFilter' value='AttrValueUser'/></keys>")
.Descendants("add")
.ToDictionary(r => r.Attribute("key").Value, r => r.Attribute("value").Value);
This lets you access your keys like so:
keys["messageFilter"] == "AttrValue"
keys["userFilter"] == "AttrValueUser"
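For the follow-up part of the question (finding the key at run time, listing its phrases, and adding a new one), a minimal sketch along the same lines might look like the following. The names `MessageFilter`, `GetPhrases` and `AddPhrase` are made up for illustration, and the code assumes the document contains exactly one `<add>` element whose `key` is `messageFilter` and whose `value` is a semicolon-separated list with a trailing semicolon, as in the question.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class MessageFilter
{
    private static readonly char[] Separator = { ';' };

    // Returns the phrases currently stored under key="messageFilter".
    public static string[] GetPhrases(XDocument doc)
    {
        return FindValue(doc).Value
            .Split(Separator, StringSplitOptions.RemoveEmptyEntries);
    }

    // Appends a phrase (if not already present) and rewrites the attribute,
    // keeping the trailing ';' used in the question's file.
    public static void AddPhrase(XDocument doc, string phrase)
    {
        XAttribute value = FindValue(doc);
        var phrases = value.Value
            .Split(Separator, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
        if (!phrases.Contains(phrase))
        {
            phrases.Add(phrase);
        }
        value.Value = string.Join(";", phrases) + ";";
    }

    // Assumption: exactly one matching <add> element exists; First() throws otherwise.
    private static XAttribute FindValue(XDocument doc)
    {
        return doc.Descendants("add")
            .First(e => (string)e.Attribute("key") == "messageFilter")
            .Attribute("value");
    }
}
```

In a real program the document would probably come from `XDocument.Load(path)` and be written back with `doc.Save(path)` after `AddPhrase`.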

It has been answered already, but for future readers - if you want to parse bigger XML, with root and many add nodes, you may need to use something slightly different.
string xmlPart = "<add key=\"messageFilter\" value=\"\" />";
string xml = "<root>" + xmlPart + "</root>";
var x = XDocument.Parse(xml, LoadOptions.None);
var attributes1 = x.Descendants("add").Select(n => n.Attributes());
var attributes2 = x.Descendants("add").SelectMany(n => n.Attributes());
This will get you IEnumerable<IEnumerable<XAttribute>> (see attributes1) or IEnumerable<XAttribute> (see attributes2). The second option simply flattens the results: all attributes are held in one collection, no matter which node they came from.
Of course, nothing stops you from filtering the XAttributes by name or some other criterion. It's all up to you!
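As a rough illustration of the filtering mentioned above, the flattened attribute sequence can be narrowed with `Where`. The helper name `AttributeFilter.ValuesOf` is hypothetical; the sketch assumes the interesting elements are all called `add`.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AttributeFilter
{
    // Collects the values of every attribute with the given name
    // across all <add> elements in the document.
    public static List<string> ValuesOf(string xml, string attributeName)
    {
        return XDocument.Parse(xml)
            .Descendants("add")
            .SelectMany(n => n.Attributes())
            .Where(a => a.Name.LocalName == attributeName)
            .Select(a => a.Value)
            .ToList();
    }
}
```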

one ugly line:
string[] igPhrases = strFile.Substring(strFile.IndexOf("<add key=\"messageFilter\" value=\"") + ("<add key=\"messageFilter\" value=\"").Length).Substring(0, strFile.Substring(strFile.IndexOf("<add key=\"messageFilter\" value=\"") + ("<add key=\"messageFilter\" value=\"").Length).IndexOf(";\"")).Split(';');

Related

c# remove (null) from XML tags

I need to figure out a good way using C# to parse an XML file for (NULL) and remove it from the tags and replace it with the word BAD.
For example:
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
should be replaced with
<GC5_BAD DIRTY="False"></GC5_BAD>
Part of the problem is I have no control over the original XML, I just need to fix it once I receive it. The second problem is that the (NULL) can appear in zero, one, or many tags. It appears to be an issue with users filling in additional fields or not. So I might get
<GC5_(NULL) DIRTY="False"></GC5_(NULL)>
or
<MH_OTHSECTION_TXT_(NULL) DIRTY="False"></MH_OTHSECTION_TXT_(NULL)>
or
<LCDATA_(NULL) DIRTY="False"></LCDATA_(NULL)>
I am a newbie to C# and programming.
EDIT:
So I have come up with the following function that, while not pretty, works so far.
public static string CleanInvalidXmlChars(string fileText)
{
List<char> charsToSubstitute = new List<char>();
charsToSubstitute.Add((char)0x19);
charsToSubstitute.Add((char)0x1C);
charsToSubstitute.Add((char)0x1D);
foreach (char c in charsToSubstitute)
fileText = fileText.Replace(Convert.ToString(c), string.Empty);
StringBuilder b = new StringBuilder(fileText);
b.Replace("", string.Empty);
b.Replace("", string.Empty);
b.Replace("<(null)", "<BAD");
b.Replace("(null)>", "BAD>");
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String result = nullMatch.Replace(b.ToString(), "<$1_BAD$2>");
result = result.Replace("(NULL)", "BAD");
return result;
}
I have only been able to find 6 or 7 bad XML files to test this code on, but it has worked on each of them and not removed good data. I appreciate the feedback and your time.
In general, regular expressions are not the right way of handling XML files. There's a range of solutions to handle XML files correctly - you can read up on System.Xml.Linq for a good start. If you're a newbie, it's certainly something you should learn at some point. As Ed Plunkett pointed out in the comments, though, your XML is not actually XML: ( and ) characters are not allowed in XML element names.
Since you will have to do it as an operation on a string, Corak's comment to use
contentOfXml.Replace("(NULL)", "BAD");
may be a good idea, but will break if any elements can contain the string (NULL) as anything other than their name.
If you want a regex approach, this might work decently, but I'm not sure if it's not missing any edge cases:
var regex = new Regex(@"(<\/?[^_]*_)\(NULL\)([^>]*>)");
var result = regex.Replace(contentOfXml, "$1BAD$2");
Will it be suitable for you to read this XML as a string and perform a regex replacement? Like:
Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
String processedXmlString = nullMatch.Replace(originalXmlString, "<$1_BAD$2>");
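One edge case worth checking with the patterns above is a tag name that contains more than one underscore, such as MH_OTHSECTION_TXT_(NULL), since a `[^_]*` class stops at the first underscore. The following is only a sketch of a variant that allows underscores in the name and touches `(NULL)` only when it ends a tag name; the class name `NullTagFixer` is made up, and it has been reasoned through against the three sample tags in the question, not against real files.

```csharp
using System.Text.RegularExpressions;

public static class NullTagFixer
{
    // "<" or "</", then a tag-name prefix ending in "_", then "(NULL)",
    // which must be followed by whitespace, "/" or ">" (end of the tag name).
    private static readonly Regex NullInTagName =
        new Regex(@"(</?[^<>\s]*_)\(NULL\)(?=[\s/>])");

    public static string Fix(string contentOfXml)
    {
        // Text content such as "<a>value (NULL)</a>" is left alone because
        // the match has to start at "<" and stay inside the tag name.
        return NullInTagName.Replace(contentOfXml, "$1BAD");
    }
}
```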

Build regular expression for replacing duplicated string into single word

I'm working on filtering comments. I'd like to replace a string like this:
llllolllllllllllooooooooooooouuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooouuuuuuuuuuuuuuuuudddddddddddddd
with two words: lol loud
string like this:
cuytwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
with: cuytw
And string like this:
hyyuyuyuyuyuyuyuyuyuyuyuyuyu
with: hyu
but not modify strings like look, geek.
Is there any way to achieve this with single regular expression in C#?
I think I can answer this categorically.
This definitely can't be done with a regex, or even with ordinary code, given your input and output requirements, at least not without some sort of dictionary and an algorithm that tries reducing doubled letters and checks each permutation against legitimate words.
The result (at best) would be a list of possible, not mutually exclusive, combinations of nonsense words and legitimate words with doubles.
In fact, I'd go as far as to say that with your current requirements and no more specific rules, the mapping from input to output is impossible in general and could only be hard-coded for the cases you have given.
I'm not sure how to use RegEx for this problem, but here is an alternative which is arguably easier to read.*
Assuming you just want to return a string comprising the distinct letters of the input in order, you can use GroupBy:
private static string filterString(string input)
{
var groups = input.GroupBy(c => c);
var output = new string(groups.Select(g => g.Key).ToArray());
return output;
}
Passes:
Returns loud for llllolllllllllllooooooooooooouuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooouuuuuuuuuuuuuuuuudddddddddddddd
Returns cuytw for cuytwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
Returns hyu for hyyuyuyuyuyuyuyuyuyuyuyuyuyu
Failures:
Returns lok for look
Returns gek for geek
* On second read you want to leave words like look and geek alone; this is a partial answer.

Best Way to Parse A Fedex Transaction Reply String Using C#

I'm sure this isn't as complicated as I'm making it.
I have a string that follows the following pattern:
#,"value"#,"next value"#,"next value" . . .
I need to parse out the number/value pairs in order to use the data in my application.
Here is a sample of a return:
0,"120"1,""10,"298630427"29,"577015971830"30,"SG MSNA "33,"A1"34,"4625"35,"239"36,"2105"37,"2759"60,"15"112,"0"
To complicate matters the string can contain newline characters (\r,\n, or \r\n).
I can handle that by simply removing the newlines with a few string.replace calls.
I would ultimately like to parse the data into key/value pairs. I just can't seem to get my mind onto the right path.
I apologize if this is trivial, but I've been pulling 18+ hour days for two months trying to meet a deadline and my brain is shot. Any assistance or guidance in the right direction will be most appreciated.
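var numVal = Regex.Matches(input, @"(\d+),""([^""]*)""")
    .Cast<Match>()
    .Select(x => new
    {
        num = x.Groups[1].Value,   // the number before the comma
        value = x.Groups[2].Value  // the quoted text, which may be empty
    });
Now you can iterate over numVal:
foreach (var nv in numVal)
{
    Console.WriteLine(nv.num + " = " + nv.value);
}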
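Putting the pieces from the question together (strip the newlines, then collect the number/value pairs), a self-contained sketch could look like this. `FedexReply.Parse` is a hypothetical name, and two choices here are assumptions that the question does not settle: keys are kept as strings, and a repeated key simply overwrites the earlier value.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FedexReply
{
    // A number, a comma, then a double-quoted value that may be empty.
    private static readonly Regex Pair = new Regex(@"(\d+),""([^""]*)""");

    public static Dictionary<string, string> Parse(string reply)
    {
        // The reply may contain \r, \n or \r\n anywhere; drop them first.
        string flat = reply.Replace("\r", "").Replace("\n", "");

        var fields = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(flat))
        {
            // Indexer assignment: a duplicate key replaces the earlier entry.
            fields[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return fields;
    }
}
```

With the sample reply from the question, `FedexReply.Parse(sample)["29"]` would be `"577015971830"` and the entry for `"1"` would be an empty string.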
If you're going straight to key/value pairs, you might be able to use LINQ to make your life easier. You'll have to be aware of and handle cases where the input doesn't match the key/value format. However, you might be able to achieve it using something like this, which splits on the quote character so that the pieces alternate between a key (with its trailing comma) and a value.
var quote = new[] { '"' };
// Pieces alternate: "0,", "120", "1,", "", "10,", "298630427", ...
var parts = string_value.Split(quote);
var dictionary = Enumerable.Range(0, parts.Length / 2)
    .ToDictionary(i => parts[2 * i].Trim().TrimEnd(','), i => parts[2 * i + 1]);

Complex String Processing - well complex to me

I am calling a web service and all I get back is a giant blob of text. I am left to process it myself. Problem is not all lines are necessarily the same. They each have 2 or 3 sections to them and they are similar. Here are the most common examples
text1 [text2] /text3/
text1/test3
text1[text2]/text3
text1 [text2] /text /3 here/
I am not exactly sure how to approach this problem. I am not too good at doing anything advanced as far as manipulating strings.
I was thinking a regular expression might work, but I'm not too sure about that either. If I can get each of these 3 sections broken up, it is easier from there to do the rest. It's just that there doesn't seem to be any uniformity to the main 3 sections that I know how to work with.
EDIT: Thanks for mentioning that I didn't actually say what I wanted to do.
Basically, I want to split these 3 sections of text into their own separate strings, so take it from one single string to an array of 3 strings.
string[0] = text1
string[1] = text2
string[2] = text3
Here is some of the text I get back from a call as an example
スルホ基 [スルホき] /(n) sulfo group/
鋭いナイフ [するどいナイフ] /(n) sharp knife/
鋭い批判 [するどいひはん] /(n) sharp criticism/
スルナーイ /(n) (See ズルナ) (obsc) surnay (Anatolian woodwind instrument) (per:)/zurna/
スルピリン /(n) sulpyrine/
スルファミン /(n) sulfamine/
剃る [そる(P);する] /(v5r,vt) to shave/(P)/
Taking the first line as an example, I want to pull it out into an array:
string[0] = スルホ基
string[1] = [スルホき]
string[2] = /(n) sulfo group/
Those examples seem a bit random. There has to be some kind of order; isn't there a spec for the service? If not, I suggest posting more examples so that we can understand the rules.
Read up on some of the info here on finite state machines, and see if you can use some of the concepts on your input parsing problem.
If there is some order to the groups on each line, then maybe you can use a regex to separate the groups out.
Edit: after seeing your samples, you may get by with a regex, breaking on some of those specific delimiters. It will take maybe half an hour to test the theory: pick yourself up a free regex tester, make yourself a regex that will isolate just one of those groups, and pump a few sample lines through. If it performs reliably on the real data that you have, then expand it and see if you can also isolate the other groups.
I should mention, though, that your regexes will break or just become a nightmare if there are any vagaries in your data (and frequently there are). So test long and hard before settling on them. If you find you start to have exceptions in your data, then you will need to choose some sort of parsing algorithm (the FSM I mentioned above is a pattern you can follow if you implement a parsing mechanism).
The most stupid answer is "Use regex". But more information is needed for a better one.
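To make the regex suggestion concrete, here is one possible sketch for the sample lines in the question. It rests on assumptions read off those samples only: the first section never contains `[` or `/`, the bracketed section is optional, and the last section starts at the first `/` and may end with a trailing `/`. The name `EntrySplitter` is hypothetical, and the delimiters themselves are stripped from the results.

```csharp
using System.Text.RegularExpressions;

public static class EntrySplitter
{
    private static readonly Regex Line = new Regex(
        @"^\s*(?<head>[^\[/]+?)\s*(?:\[(?<reading>[^\]]*)\]\s*)?/(?<gloss>.*?)/?\s*$");

    // Returns { text1, text2, text3 } with text2 empty when there is no
    // [bracketed] section, or null when the line does not fit the pattern.
    public static string[] Split(string line)
    {
        Match m = Line.Match(line);
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups["head"].Value,
            m.Groups["reading"].Success ? m.Groups["reading"].Value : "",
            m.Groups["gloss"].Value
        };
    }
}
```

Lines that return null are the "exceptions in your data" case mentioned above and would need separate handling.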

how to create a parser for search queries

For example, I need to create something like a Google search query parser to handle expressions such as:
flying hiking or swimming -"walking in boots" author:hamish author:reid
or
house in new york priced over $500000 with a swimming pool
How would I even start building something like this? Are there any good resources?
C#-relevant, please (if possible).
Edit: this is something that I should eventually be able to translate into a SQL query.
How many keywords do you have (like 'or', 'in', 'priced over', 'with a')? If you only have a couple of them I'd suggest going with simple string processing (regexes) too.
But if you have more than that, you might want to look into implementing a real parser for those search expressions. Irony.net might help you with that (I found it extremely easy to use, as you can express your grammar in a near-BNF form directly in code).
The Lucene/NLucene project has functionality for boolean queries and some other query formats as well. I don't know whether you can add your own extensions, like author in your case, but it might be worthwhile to check it out.
There are a few ways of doing it. Two of them:
Parsing using grammar (useful for complex language)
Parsing using regular expression and basic string manipulations (for simpler language)
According to your example, the language is very basic so splitting the string according to keyword can be the best solution.
string sentence = "house in new york priced over $500000 with a swimming pool";
string[] values = sentence.Split(new []{" in ", " priced over ", " with a "},
StringSplitOptions.None);
string type = values[0];
string area = values[1];
string price = values[2];
string accessories = values[3];
However, some issues that may arise are: how to verify if the sentence stands in the expected form? What happens if some of the keywords can appear as part of the values?
If this is the case you encounter, there are some libraries you can use to parse input using a defined grammar. Two such libraries that work with .NET are ANTLR and Gold Parser; both are free. The main challenge is defining the grammar.
A grammar would work very well for the second example you gave but the first (any order keyword/command strings) would be best handled using Split() and a class to handle the various keywords and commands. You will have to do initial processing to handle quoted regions before the split (for example replacing spaces within quoted regions with a rare/unused character).
The ":" commands are easy to find and pull out of the search string for processing after the split is completed. Simply traverse the array looking.
The +/- keywords are also easy to find and add to the sql query as AND/AND NOT clauses.
The only place you might run into issues is with the "or" since you'll have to define how it is handled. What if there are multiple "or"s? But the order of keywords in the array is the same as in the query so that won't be an issue.
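The steps described above (protect quoted regions, split into terms, then pick out the ":" commands and +/- prefixes) can be sketched with a single tokenizing regex instead of a placeholder character. This is only one way to do it: `QueryTerm` and `QueryTokenizer` are made-up names, "or" is returned as an ordinary term for the caller to interpret, and nothing here builds SQL.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class QueryTerm
{
    public bool Excluded { get; set; }  // true when the term was prefixed with '-'
    public string Field { get; set; }   // e.g. "author"; null for a plain term
    public string Text { get; set; }    // the bare word or the quoted phrase
}

public static class QueryTokenizer
{
    // Optional sign, optional "field:", then a quoted phrase or a bare word.
    private static readonly Regex Token = new Regex(
        @"(?<sign>[+-])?(?:(?<field>\w+):)?(?:""(?<text>[^""]*)""|(?<text>\S+))");

    public static List<QueryTerm> Tokenize(string query)
    {
        var terms = new List<QueryTerm>();
        foreach (Match m in Token.Matches(query))
        {
            terms.Add(new QueryTerm
            {
                Excluded = m.Groups["sign"].Value == "-",
                Field = m.Groups["field"].Success ? m.Groups["field"].Value : null,
                Text = m.Groups["text"].Value.Trim()
            });
        }
        return terms;
    }
}
```

Each resulting term could then be mapped to an AND / AND NOT clause, passing `Text` as a query parameter and never concatenating it into the SQL string.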
I think you should just do some string processing. There is no smart way of doing this.
So replace "OR" with your own or operator (e.g. ||). As far as I know, there is no library for this.
I suggest you go with regexes.
