Extracting and Manipulating Strings in C#.Net - c#

We have a requirement to extract and manipulate strings in C#.NET. We have a string such as:
($name$:('George') AND $phonenumer$:('456456') AND
$emailaddress$:("test#test.com"))
We need to extract the strings between the $ characters.
In the end, we need a list of strings containing name, phonenumber, emailaddress.
What would be the ideal way to do it? Are there any out-of-the-box features available for this?
Regards,
John

The simplest way is to use a regular expression to match all word characters between two $ characters:
var regex=new Regex(@"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test#test.com\"))";
var matches=regex.Matches(input);
This will return a collection of matches. The .Value property of each match contains the matching string. \$ is used because $ has special meaning in regular expressions: it matches the end of a string. \w matches a word character (a letter, digit, or underscore). + means one or more.
Since this is a collection, you can use LINQ on it to get eg an array with the values:
var values=matches.OfType<Match>().Select(m=>m.Value).ToArray();
That array will contain the values $name$,$phonenumer$,$emailaddress$.
Capture by name
You can specify groups in the pattern and attach names to them. For example, you can group the field name values:
var regex=new Regex(@"\$(?<name>\w+)\$");
var names=regex.Matches(input)
                .OfType<Match>()
                .Select(m=>m.Groups["name"].Value);
This will return name,phonenumer,emailaddress. Parentheses are used for grouping. (?<somename>pattern) is used to attach a name to the group
Extract both names and values
You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, eg as an object or anonymous type.
The pattern in this case is more complex:
@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"
The parentheses around each value are escaped because we want to match them literally. Both ' and " characters are used to quote values, so ['"] is used to specify a choice of characters. The pattern is a verbatim string (i.e. it starts with @), so the double quote has to be doubled: ['""]. Any character has to be matched, hence .+, but only up to the next part of the pattern, hence the lazy .+?. Without the ?, the greedy .+ would run on to the last closing quote and parenthesis in the string.
Putting this together:
var regex = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
                    .OfType<Match>()
                    .Select(m=>new { Name=m.Groups["name"].Value,
                                     Value=m.Groups["value"].Value
                                   })
                    .ToArray();
Turn them into a dictionary
Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), eg with .ToDictionary(it=>it.Name,it=>it.Value). You could omit the select step and generate the dictionary from the matches themselves :
var myDict = regex.Matches(input)
.OfType<Match>()
.ToDictionary(m=>m.Groups["name"].Value,
m=>m.Groups["value"].Value);
Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that parses the input and skips non-matching input immediately. Each match and group contains only the indexes of its starting and ending characters in the input string. A string is only generated when .Value is called.
Regular expressions are thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request.
Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.
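To make the static-field advice concrete, here is a minimal sketch that stores the name/value pattern from above in a static readonly field and returns a dictionary. The class name FieldParser and the method name Parse are made up for illustration, and RegexOptions.Compiled is an optional choice rather than something the answer requires.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Built once and shared; a Regex instance can be used from several threads.
    private static readonly Regex FieldPattern =
        new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)", RegexOptions.Compiled);

    // Maps each field name found between $ characters to its quoted value.
    // ToDictionary throws if the same field name appears twice in the input.
    public static Dictionary<string, string> Parse(string input)
    {
        return FieldPattern.Matches(input)
            .Cast<Match>()
            .ToDictionary(m => m.Groups["name"].Value,
                          m => m.Groups["value"].Value);
    }
}
```

Calling FieldParser.Parse(input) on the string from the question should give a dictionary with the keys name, phonenumer and emailaddress.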
Can it go faster?
Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text when a $ is detected and stop at the next $. This can be done with the following method :
IEnumerable<string> GetNames(string input)
{
    var builder=new StringBuilder(20);
    bool started=false;
    foreach(var c in input)
    {
        if (started)
        {
            if (c!='$')
            {
                builder.Append(c);
            }
            else
            {
                started=false;
                var value=builder.ToString();
                yield return value;
                builder.Clear();
            }
        }
        else if (c=='$')
        {
            started=true;
        }
    }
}
A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's larger than 20 characters.
Modifying this code to extract values though isn't so easy.

Here's one way to do it, but certainly not very elegant. Basically splitting the string on the '$' and taking every other item will give you the result (after some additional trimming of unwanted characters).
In this example, I'm also grabbing the value of each item and then putting both in a dictionary:
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test#test.com\"))";
var inputParts = input.Replace(" AND ", "")
.Trim(')', '(')
.Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);
var keyValuePairs = new Dictionary<string, string>();
for (int i = 0; i < inputParts.Length - 1; i += 2)
{
var key = inputParts[i];
var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');
keyValuePairs[key] = value;
}
foreach (var kvp in keyValuePairs)
{
Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}
// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();

Related

Trim().Split causes a problem in Contains()

I am getting a string and trimming it first, then splitting it and assigning it to a string[]. Then, I am using every element in the array for a string.Contains() or string.StartsWith() method. The interesting thing is that even if the string contains the element, Contains() doesn't work properly. The situation is the same for StartsWith(), too. Does anyone have any idea about the problem?
P.S.: I trimmed strings after splitting and problem was solved.
string inputTxt = "tasklist";
string commands = "net, netsh, tasklist";
string[] maliciousConsoleCommands = commands.Trim(' ').Split(',');
for (int i = 0; i < maliciousConsoleCommands.Length; i++) {
if (inputTxt.StartsWith(maliciousConsoleCommands[i])) {
return false;
}
}
//this code works but no idea why previous code didn't work.
string[] maliciousConsoleCommands = commands.Split(',');
for (int i = 0; i < maliciousConsoleCommands.Length; i++) {
if (inputTxt.StartsWith(maliciousConsoleCommands[i].Trim(' '))) {
return false;
}
}
I expected to work properly but it is solved by trimming after splitting.
Your delimiter is not a comma char, it's a comma followed by a white-space - so instead of splitting by ',', simply split by ", ":
string[] maliciousConsoleCommands = commands.Split(new string[] {", "}, StringSplitOptions.None);
This will return the items without the leading space so the trim will be redundant.
It seems you should Trim each item :
// ["net", "netsh", "tasklist"]
string[] maliciousConsoleCommands = commands
    .Split(',') // "net" " netsh", " tasklist" - note leading spaces
    .Select(item => item.Trim()) // removing leading spaces from each item
    .ToArray();
Finally, if you want to test if inputTxt is malicious:
if (commands
    .Split(',')
    .Select(item => item.Trim()) // You can combine Select and Any
    .Any(item => inputTxt.StartsWith(item)))
    return false;
First code you presented won't work because you want to trim initial string, so "net, netsh, tasklist" will stay unchanged after trimming (no leading and trailing spaces), then splitting it by comma will produce entries, that have leading space. Thus, you will get unexpected results. You should be trimming after splitting the string.
The second code works because Trim is applied to each array element before that element is passed to StartsWith, so the leading space is already gone when the comparison runs.
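As a sketch of the "trim after splitting" advice: assuming .NET 5 or later, StringSplitOptions.TrimEntries lets Split do the trimming itself. On older frameworks the Select(item => item.Trim()) approach shown above does the same job. The class and method names here are hypothetical.

```csharp
using System;
using System.Linq;

public static class CommandFilter
{
    // Hypothetical helper: true when input starts with any command
    // from the comma-separated list, ignoring spaces around the commas.
    public static bool StartsWithAny(string input, string commands)
    {
        // TrimEntries (assumes .NET 5+) trims each item after the split.
        var items = commands.Split(',',
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
        return items.Any(item => input.StartsWith(item, StringComparison.Ordinal));
    }
}
```

With the values from the question, CommandFilter.StartsWithAny("tasklist", "net, netsh, tasklist") finds the match because " tasklist" has already been trimmed to "tasklist".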
Yet another way to split if the commands themselves have no spaces is to use ' ' itself as a delimiter, and discard empty entries :
var maliciousConsoleCommands = commands.Split(new[]{',',' '},StringSplitOptions.RemoveEmptyEntries)
.ToArray();
This avoids the temporary strings generated by every string manipulation command.
For your code to work though, you'd have to use Contains for each command, instead of using StartsWith :
var isSuspicious = maliciousCommands.Any(cmd=>input.Contains(cmd));
Or even :
var isSuspicious = maliciousCommands.Any(input.Contains);
This can get rather slow if you have multiple commands, or if the input text is large
Regular expression alternative
A potentially much faster technique is to use a regular expression, which tests all keywords in a single pass over the input instead of one pass per keyword :
var regex=new Regex("net|netsh|tasklist");
var isSuspicious=regex.IsMatch(inputTxt);
Regular expressions are thread-safe which means they can be created once and reused by different threads/requests.
By using Match/Matches instead of IsMatch the regex could return the actual keywords that were detected :
var detection=regex.Match(inputTxt);
if (detection.Success)
{
var detectedKeyword=detection.Value;
....
}
Converting the original comma-separated list to a regular expression can be performed with a single String.Replace(", ", "|") or another regular expression that can handle any whitespace character :
string commands = "net , netsh, \ttasklist";
var pattern=Regex.Replace(commands,@"\s*,\s*","|");
var regex=new Regex(pattern);
Detecting whole words only
Both Contains and the original regular expression would match tasklist1 as well as tasklist. It's possible to match whole words only, if the pattern is surrounded by the word delimiter, \b :
@"\b(" + pattern + @")\b"
This will match tasklist and net but reject tasklist1
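Putting the last two steps together, a whole-word detector could be built from the comma-separated list roughly as follows. This is only a sketch: KeywordDetector and Build are invented names, and Regex.Escape is added on the assumption that a keyword might contain regex metacharacters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordDetector
{
    // Builds \b(?:kw1|kw2|...)\b from a comma-separated keyword list.
    public static Regex Build(string commands)
    {
        var keywords = commands
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(k => k.Trim())       // drop spaces and tabs around each keyword
            .Where(k => k.Length > 0)    // skip entries that were only whitespace
            .Select(Regex.Escape);       // treat each keyword as literal text
        return new Regex(@"\b(?:" + string.Join("|", keywords) + @")\b");
    }
}
```

Because of the trailing \b, the regex for "net , netsh, \ttasklist" reports netsh rather than net for the input "netsh wlan", and it rejects tasklist1.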

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in above case its 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
For a strict matching, you can do:
Regex reg = new Regex(@"^Likes \((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested by I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str,@"\d+");
and in this case, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
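const string likes = "Likes (20)";
int open = likes.IndexOf('(');
int close = likes.IndexOf(')');
// The length is the distance between the brackets, not the distance to the end of the string.
int likesCount = int.Parse(likes.Substring(open + 1, close - open - 1));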
Matching when the part in parentheses is supposed to be a number:
string inputstring="Likes (20)";
Regex reg=new Regex(@"\((\d+)\)");
string num= reg.Match(inputstring).Groups[1].Value;
Explanation:
By definition regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stands for a digit. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for previous symbol or group repeated 1 or more times.
So \d+ will match one or more digits. It will match 20.
To ensure that we get the number that is in parentheses we say that it should be between ( and ). These are special characters in regexp so we need to escape them.
\(\d+\) would match (20), and we are almost there.
Since we want the part inside the parentheses, and not including the parentheses, we tell regexp that the digits part is a single group.
We do that by using parentheses in our regexp. \((\d+)\) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parentheses things get a little bit harder.
Regex reg=new Regex(@"\((.+)\)");
Would work for many strings (the dot matches any character). But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character except a closing parenthesis":
Regex reg=new Regex(@"\(([^\)]+)\)");
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested parentheses, but since regular expressions are not the correct tool for nested parentheses I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.
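The two patterns discussed above can be wrapped up as in the sketch below. ParenExtractor, Extract and TryGetCount are illustrative names only, and [^)]* is used instead of [^)]+ on the assumption that empty brackets should produce an empty string rather than be skipped.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParenExtractor
{
    // Any run of characters that are not a closing parenthesis, wrapped in ( and ).
    private static readonly Regex Pattern = new Regex(@"\(([^)]*)\)");

    // Returns the text inside each non-nested pair of parentheses, in order.
    public static List<string> Extract(string input)
    {
        return Pattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    // Parses the first bracketed run of digits; false when there is none
    // or when the digits do not fit in an int.
    public static bool TryGetCount(string input, out int count)
    {
        count = 0;
        var m = Regex.Match(input, @"\((\d+)\)");
        return m.Success && int.TryParse(m.Groups[1].Value, out count);
    }
}
```

For "Likes (20)", TryGetCount yields 20. For "This is an example(parantesis1)(parantesis2)", Extract returns the two bracketed strings separately.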

what is a good pattern to processes each individual regex match through a method

I'm trying to figure out a pattern where I run a regex match on a long string, and each time it finds a match, it runs a replace on it. The thing is, the replace will vary depending on the matched value. This new value will be determined by a method. For example:
var matches = Regex.Match(myString, myPattern);
while(matches.Success){
Regex.Replace(myString, matches.Value, GetNewValue(matches.Groups[1]));
matches = matches.NextMatch();
}
The problem (i think) is that if I run the Regex.Replace, all of the match indexes get messed up so the result ends up coming out wrong. Any suggestions?
If you replace each pattern with a fixed string, Regex.Replace does that for you. You don't need to iterate the matches:
Regex.Replace(myString, myPattern, "replacement");
Otherwise, if the replacement depends upon the matched value, use the MatchEvaluator delegate, as the 3rd argument to Regex.Replace. It receives an instance of Match and returns string. The return value is the replacement string. If you don't want to replace some matches, simply return match.Value:
string myString = "aa bb aa bb";
string myPattern = @"\w+";
string result = Regex.Replace(myString, myPattern,
    match => match.Value == "aa" ? "0" : "1" );
Console.WriteLine(result);
// 0 1 0 1
If you really need to iterate the matches and replace them manually, you need to start replacement from the last match towards the first, so that the index of the string is not ruined for the upcoming matches. Here's an example:
var matches = Regex.Matches(myString, myPattern);
var matchesFromEndToStart = matches.Cast<Match>().OrderByDescending(m => m.Index);
var sb = new StringBuilder(myString);
foreach (var match in matchesFromEndToStart)
{
if (IsGood(match))
{
sb.Remove(match.Index, match.Length)
.Insert(match.Index, GetReplacementFor(match));
}
}
Console.WriteLine(sb.ToString());
Just be careful, that your matches do not contain nested instances. If so, you either need to remove matches which are inside another match, or rerun the regex pattern to generate new matches after each replacement. I still recommend the second approach, which uses the delegates.
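Applied to the shape of the code in the question, where the replacement is computed from the first capture group, the delegate approach might look like this sketch. The {token} pattern and the body of GetNewValue are placeholders invented for the example; only the use of Regex.Replace with a MatchEvaluator comes from the answer.

```csharp
using System.Text.RegularExpressions;

public static class TokenReplacer
{
    // Hypothetical stand-in for the question's GetNewValue method.
    private static string GetNewValue(string captured)
    {
        return captured.ToUpperInvariant();
    }

    // Replaces every {token} with a value computed from the captured token name.
    // Regex.Replace handles all index bookkeeping, so earlier replacements
    // cannot shift the positions of later matches.
    public static string ReplaceTokens(string input)
    {
        return Regex.Replace(input, @"\{(\w+)\}",
            m => GetNewValue(m.Groups[1].Value));
    }
}
```

TokenReplacer.ReplaceTokens("a {x} b {yy}") returns "a X b YY".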
If I understand your question correctly, you want to perform a replace based on a constant Regular Expression, but the replacement text you use will change based on the actual text that the regex matches on.
The Matches method of the Regex class (not the Match method) returns a collection of all the matches of your regex within the input string. Each Match contains information like the position within the string, the matched value and the length of the match. If you iterate over this collection with a foreach loop you should be able to treat each match individually and perform some string manipulations where you can dynamically modify the replacement value.
I would use something like
Regex regEx = new Regex("some.*?pattern");
string input = "someBLAHpattern!";
foreach (Match match in regEx.Matches(input))
{
DoStuffWith(match.Value);
}

Regex questions

I'm trying to get some text from a large text file, the text I'm looking for is:
Type:Production
Color:Red
I pass the whole text in the following method to get (Type:Production , Color:Red)
private static void FindKeys(IEnumerable<string> keywords, string source)
{
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + @"\B\s" + keys + @"\B\s" + "):",
    RegexOptions.Singleline);
foreach (Match m in matches)
{
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found)
{
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
}
My problems are the following:
The search returns _Type: as well, where I only need Type:
The search return Color:Red/n/n/n/n/n (with the rest of the text, where I only need Color:Red
So, basically:
- How can I force Regex to get the exact match for Type and ignore _Type
- How to get only the text after : and ignore /n/n/ and any other text
I hope this is clear
Thanks,
Your regex currently looks like this:
(?<key>\B\sWord1|Word2|Word3\B\s):
I see the following issues here:
First, Word1|Word2|Word3 should be put in parenthesis. Otherwise, it will search for \B\sWord1 or Word2 or Word3\B\s, which is not what you want (I guess).
Why \B\s? A non-boundary followed by a whitespace? That doesn't make sense. I guess you want just \b (= word boundary). There's no need to use it in the end, because the colon already constitutes a word boundary.
So, I would suggest to use the following. It will fix the _Type problem, because there is no word boundary between _ and Type (since _ is considered to be a word character).
\b(?<key>Word1|Word2|Word3):
If the text following the key is always just a single word, I'd match it in the regex as well: (\s* allows for whitespace after the colon, I don't know if you need this. \w+ ensures that only word characters -- i.e. no line breaks etc. -- are matched as the value.)
\b(?<key>Word1|Word2|Word3):\s*(?<value>\w+)
Then you just need to iterate through all the matches and extract the key and value groups. No need for any string operations or index arithmetic.
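A sketch of the question's FindKeys method rewritten around the suggested pattern is shown below. It returns the dictionary instead of printing it, and it lets a later occurrence of a key overwrite an earlier one. Both are choices made for this example, as is the use of Regex.Escape on the keywords.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyFinder
{
    public static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        // \b stops _Type from matching, because _ counts as a word character.
        var keys = string.Join("|", keywords.Select(Regex.Escape));
        var pattern = @"\b(?<key>" + keys + @"):\s*(?<value>\w+)";

        var found = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(source, pattern))
        {
            // Assumption: a repeated key keeps its last value.
            found[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return found;
    }
}
```

Because the value group is \w+, the match stops at the first line break, so Color:Red followed by blank lines yields just Red.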
So if I understand correctly, you have:
Pairs of key:values
Each pair is separated by a space
Within each pair, the key and value is separated by “:”
Then I would not use regex at all. I would:
use String.Split(' ') to get an array of pairs
loop over all the pairs
use String.Split(':') to get the key and value from each pair

Regex: replace inner string

I'm working with X12 EDI Files (Specifically 835s for those of you in Health Care), and I have a particular vendor who's using a non-HIPAA compliant version (3090, I think). The problem is that in a particular segment (PLB- again, for those who care) they're sending a code which is no longer supported by the HIPAA Standard. I need to locate the specific code, and update it with a corrected code.
I think a Regex would be best for this, but I'm still very new to Regex, and I'm not sure where to begin. My current methodology is to turn the file into an array of strings, find the array that starts with "PLB", break that into an array of strings, find the code, and change it. As you can guess, that's very verbose code for something which should be (I'd think) fairly simple.
Here's a sample of what I'm looking for:
~PLB|1902841224|20100228|49>KC15X078001104|.08~
And here's what I want to change it to:
~PLB|1902841224|20100228|CS>KC15X078001104|.08~
Any suggestions?
UPDATE: After review, I found I hadn't quite defined my question well enough. The record above is an example, but it is not necessarily a specific formatting match: there are three things which could change between this record and some other (in another file) I'd have to fix. They are:
The Pipe (|) could potentially be any non-alphanumeric character. The file itself will define which character (normally a Pipe or Asterisk).
The > could also be any other non-alphanumeric character (most often : or >)
The set of numbers immediately following the PLB is an identifier, and could change in format and length. I've only ever seen numeric Ids there, but technically it could be alphanumeric, and it won't necessarily be 10 characters.
My Plan is to use String.Format() with my Regex match string so that | and > can be replaced with the correct characters.
And for the record. Yes, I hate ANSI X12.
Assuming that the "offending" code is always 49, you can use the following:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\|\d{10}\|\d{8}\|)49(?=>\w+\|)", "CS");
This looks for 49 if it's the first element after a | delimiter, preceded by a group of 8 digits, another |, a group of 10 digits, yet another |, and ~PLB. It also looks if it is followed by >, then any number of alphanumeric characters, and one more |.
With the new requirements (and the lucky coincidence that .NET is one of the few regex flavors that allow variable repetition inside lookbehind), you can change that to:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\1\w+\1\d{8}(\W))49(?=\W\w+\1)", "CS");
Now any non-alphanumeric character is allowed as separator instead of | or > (but in the case of | it has to be always the same one), and the restrictions on the number of characters for the first field have been loosened.
Another, similar approach that works on any valid X12 file to replace a single data value with another on a matching segment:
public void ReplaceData(string filePath, string segmentName,
int elementPosition, int componentPosition,
string oldData, string newData)
{
string text = File.ReadAllText(filePath);
    Match match = Regex.Match(text,
        @"^ISA(?<e>.).{100}(?<c>.)(?<s>.)(\w+.*?\k<s>)*IEA\k<e>\d*\k<e>\d*\k<s>$");
if (!match.Success)
throw new InvalidOperationException("Not an X12 file");
char elementSeparator = match.Groups["e"].Value[0];
char componentSeparator = match.Groups["c"].Value[0];
char segmentTerminator = match.Groups["s"].Value[0];
var segments = text
.Split(segmentTerminator)
.Select(s => s.Split(elementSeparator)
.Select(e => e.Split(componentSeparator)).ToArray())
.ToArray();
foreach (var segment in segments.Where(s => s[0][0] == segmentName &&
s.Count() > elementPosition &&
s[elementPosition].Count() > componentPosition &&
s[elementPosition][componentPosition] == oldData))
{
segment[elementPosition][componentPosition] = newData;
}
File.WriteAllText(filePath,
string.Join(segmentTerminator.ToString(), segments
.Select(e => string.Join(elementSeparator.ToString(),
e.Select(c => string.Join(componentSeparator.ToString(), c))
.ToArray()))
.ToArray()));
}
The regular expression used validates a proper X12 interchange envelope and assures that all segments within the file contain at least a one character name element. It also parses out the element and component separators as well as the segment terminator.
Assuming that your code is always a two digit number that comes after a pipe character | and before the greater than sign > you can do it like this:
var result = Regex.Replace(yourString, @"(\|)(\d{2})(>)", @"$1CS$3");
You can break it down with regex yes.
If i understand your example correctly the 2 characters between the | and the > need to be letters and not digits.
~PLB\|\d{10}\|\d{8}\|(\d{2})>\w{14}\|\.\d{2}~
This pattern will match the old one and capture the characters between the | and the >. Which you can then use to modify (lookup in a db or something) and do a replace with the following pattern:
(?<=\|)\d{2}(?=>)
This will look for the ~PLB|#|#| at the start and replace the 2 numbers before the > with CS.
Regex.Replace(testString, @"(?<=~PLB\|[0-9]{10}\|[0-9]{8})(\|)([0-9]{2})(>)", @"$1CS$3")
The X12 protocol standard allows the specification of element and component separators in the header, so anything that hard-codes the "|" and ">" characters could eventually break. Since the standard mandates that the characters used as separators (and segment terminators, e.g., "~") cannot appear within the data (there is no escape sequence to allow them to be embedded), parsing the syntax is very simple. Maybe you're already doing something similar to this, but for readability...
// The original segment string (without segment terminator):
string segment = "PLB|1902841224|20100228|49>KC15X078001104|.08";
// Parse the segment into elements, then the fourth element
// into components (bounds checking is omitted for brevity):
var elements = segment.Split('|');
var components = elements[3].Split('>');
// If the first component is the bad value, replace it with
// the correct value (again, not checking bounds):
if (components[0] == "49")
components[0] = "CS";
// Reassemble the segment by joining the components into
// the fourth element, then the elements back into the
// segment string:
elements[3] = string.Join(">", components);
segment = string.Join("|", elements);
Obviously more verbose than a single regular expression but parsing X12 files is as easy as splitting strings on a single character. Except for the fixed length header (which defines the delimiters), an entire transaction set can be parsed with Split:
// Starting with a string that contains the entire 835 transaction set:
var segments = transactionSet.Split('~');
var segmentElements = segments.Select(s => s.Split('|')).ToArray();
// segmentElements contains an array of element arrays,
// each composite element can be split further into components as shown earlier
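To avoid hard-coding the separators, they can be read from the fixed-length header first. The sketch below assumes the layout implied by the interchange regex earlier in this thread: the element separator right after "ISA", then 100 characters, then the component separator and the segment terminator. It also assumes there are no line breaks between segments and that the code sits in the first component of the fourth element, as in the example. X12Fixer and ReplaceCode are invented names.

```csharp
using System;
using System.Linq;

public static class X12Fixer
{
    public static string ReplaceCode(string interchange, string segmentName,
                                     string oldCode, string newCode)
    {
        if (interchange.Length < 106 ||
            !interchange.StartsWith("ISA", StringComparison.Ordinal))
            throw new ArgumentException("Not an X12 interchange", nameof(interchange));

        // Delimiters taken from fixed positions in the ISA header (assumed layout).
        char element = interchange[3];
        char component = interchange[104];
        char terminator = interchange[105];

        var segments = interchange.Split(terminator).Select(segment =>
        {
            var elements = segment.Split(element);
            if (elements[0] != segmentName || elements.Length <= 3)
                return segment;                    // not the segment we are after

            var components = elements[3].Split(component);
            if (components[0] != oldCode)
                return segment;                    // code already fine

            components[0] = newCode;
            elements[3] = string.Join(component.ToString(), components);
            return string.Join(element.ToString(), elements);
        });

        return string.Join(terminator.ToString(), segments);
    }
}
```

For the example in the question, the call would be X12Fixer.ReplaceCode(text, "PLB", "49", "CS"). Segments other than the matching one are passed through unchanged.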
What I found is working is the following:
parts = original.Split(record);
for(int i = parts.Length -1; i >= 0; i--)
{
    string s = parts[i];
    string nString =String.Empty;
    if (s.StartsWith("PLB"))
    {
        string[] elems = s.Split(elem);
        if (elems[3].Contains("49" + subelem.ToString()))
        {
            string regex = string.Format(@"(\{0})49({1})", elem, subelem);
            nString = Regex.Replace(s, regex, @"$1CS$2");
        }
    }
}
I'm still having to split my original file into a set of strings and then evaluate each string, but that seems to be working now.
If anyone knows how to get around that string.Split up at the top, I'd love to see a sample.
