I have a string that might contain escape characters. Let's assume this is '\'. I follow the MSDN Escape Sequences definition
I want to reverse this string, but keep the escape sequences.
Example:
string input = #"Hello\_World";
string reversed = #"dlroW\_elloH";
Note that in my input string the backslashes are separate characters. The reversed string is meant to be used in a SQL LIKE statement where the underscore is not meant as a wild card, but literally as an underscore. The backslash in the SQL LIKE functions as an escape character
The problem is, that if a character in my original string is preceded by a backslash, then in my reversed string this backslash should still precede the character: #"_" (two separate characters) should in reverse still be #"_".
Bonus points: Reverse escape sequences with numbers '\x0128'
I've tried it as extension functions:
public static string EscapedReverse(this string txt, char escapeChar)
{
IList<char> charList = txt.ToList();
return new string(EscapedReverse(charList, escapeChar).ToArray());
}
public static IEnumerable<char> EscapedReverse(this IList<char> text, char escapeChar)
{
int i = text.Count-1;
// Text[i] is the last character of the sequence;
// text[i] is the next character to return, except if text[i-1] is escapeChar
while (i > 0)
{
if(text[i-1] == escapeChar)
{
yield return text[i-1];
yield return text[i];
i -= 2;
}
else
{
yield return text[i];
i -= 1;
}
}
// return the last character
if (i == 0)
yield return text[i];
}
This works. However, my string is converted to array / list twice. I wondered if there would be a smarter method where the elements don't have to be accessed so often?
Addition: what is my problem anyway?
Comments suggested to add more information about my problem.
There is a requirement to show a list of matching elements while an operator is typing in a text box. Most elements he can see start with a similar prefix. The difference the operator searches for is in the end of the name.
Therefore we want to show a list of names ending with the typed character. So if the operator types "World" he will see a list with all names ending with "World".
The already existing database (change is out of the question) has a table with a NAME and a REVERSEDNAME. Software takes care that if a name is inserted or updated the correct reversed name is inserted / updated. REVERSEDNAME is indexed, so using a WHERE with reversed name is fast.
So if I need to return all names ending with "World", I need to return the names of all records where the REVERSEDNAME starts with the reverse of "WORLD":
SELECT TOP 30 [MYTABLE].[NAME] as Name
FROM [MYTABLE]
WHERE [MYTABLE].REVERSEDNAME LIKE 'dlroW%'
This works fine as long as no wild cards (like underscore) are used. This was solved by the software by escaping the underscore character (I know, bad design, the fact that SQL LIKE uses underscore as wild card should not seep through, but I have to live with this existing software)
So the operator types #"My_World"
My software received #"My_World", the backslash is a separate character
I have to reverse to #"dlrow_yM", note that the backslash is still before the underscore
My Dapper code:
IEnumerable<string> FetchNamesEndingWith(string nameEnd)
// here is my reversal procedure:
string reversedNameEnd = nameEnd.EscapedReverse() = '%';
using (var dbConnection = this.CreateOpenDbConnection())
{
return dbConnection.Query<string>(#"
SELECT TOP 30 [MYTABLE].[NAME] as Name
FROM [MYTABLE]
WHERE [MYTABLE].REVERSEDNAME LIKE #param ESCAPE '\'",
new {param = reversedNameEnd});
}
MSDN about using escape characters in SQL LIKE
Changing the escape character to a different character doesn't help. The problem is not that the escape character is a backslash, but that reversing my string should keep the escape character in front of the escaped character.
My code works, I only wondered if there would be a better algorithm that doesn't copy the string twice. Not only for this specific problem, but also if in future problems I need to reverse strings and keep certain characters in place.
You can use regular expressions:
var pattern = #"\\x[1-9a-fA-F]{4}|\\x[1-9a-fA-F]{2}|\\[0-7]{3}|\\.|.";
var rgx = new Regex(pattern);
return new string(
rgx.Matches(txt)
.Cast<Match>()
.OrderByDescending(x => x.Index)
.SelectMany(x => x.Value)
.ToArray());
pattern covers single characters and escape sequences in formats:
\x????
\x??
\???
\?
Related
We have a requirement to extract and manipulate strings in C#. Net. The requirement is - we have a string
($name$:('George') AND $phonenumer$:('456456') AND
$emailaddress$:("test#test.com"))
We need to extract the strings between the character - $
Therefore, in the end, we need to get a list of strings containing - name, phonenumber, emailaddress.
What would be the ideal way to do it? are there any out of the box features available for this?
Regards,
John
The simplest way is to use a regular expression to match all non-whitespace characters between $ :
var regex=new Regex(#"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test#test.com\"))";
var matches=regex.Matches(input);
This will return a collection of matches. The .Value property of each match contains the matching string. \$ is used because $ has special meaning in regular expressions - it matches the end of a string. \w means a non-whitespace character. + means one or more.
Since this is a collection, you can use LINQ on it to get eg an array with the values:
var values=matches.OfType<Match>().Select(m=>m.Value).ToArray();
That array will contain the values $name$,$phonenumer$,$emailaddress$.
Capture by name
You can specify groups in the pattern and attach names to them. For example, you can group the field name values:
var regex=new Regex(#"\$(?<name>\w+)\$");
var names=regex.Matches(input)
.OfType<Match>()
.Select(m=>m.Groups["name"].Value);
This will return name,phonenumer,emailaddress. Parentheses are used for grouping. (?<somename>pattern) is used to attach a name to the group
Extract both names and values
You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, eg as an object or anonymous type.
The pattern in this case is more comples:
#"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"
Parentheses are escaped because we want them to match the values. Both ' and " characters are used in values, so ['"] is used to specify a choice of characters. The pattern is a literal string (ie starts with #) so the double quotes have to be escaped: ['""] . Any character has to be matched .+ but only up to the next character in the pattern .+?. Without the ? the pattern .+ would match everything to the end of the string.
Putting this together:
var regex = new Regex(#"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
.OfType<Match>()
.Select(m=>new { Name=m.Groups["name"].Value,
Value=m.Groups["value"].Value
})
.ToArray()
Turn them into a dictionary
Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), eg with .ToDictionary(it=>it.Name,it=>it.Value). You could omit the select step and generate the dictionary from the matches themselves :
var myDict = regex.Matches(input)
.OfType<Match>()
.ToDictionary(m=>m.Groups["name"].Value,
m=>m.Groups["value"].Value);
Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that parses the input and skips non-matching input immediatelly. Each match and group contain only the index to their starting and ending character in the input string. A string is only generated when .Value is called.
Regular expressions are thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request
Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.
Can it go faster?
Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text if $ is detected up until the first $. This can be done with the following method :
IEnumerable<string> GetNames(string input)
{
var builder=new StringBuilder(20);
bool started=false;
foreach(var c in input)
{
if (started)
{
if (c!='$')
{
builder.Append(c);
}
else
{
started=false;
var value=builder.ToString();
yield return value;
builder.Clear();
}
}
else if (c=='$')
{
started=true;
}
}
}
A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's larger than 20 characters.
Modifying this code to extract values though isn't so easy.
Here's one way to do it, but certainly not very elegant. Basically splitting the string on the '$' and taking every other item will give you the result (after some additional trimming of unwanted characters).
In this example, I'm also grabbing the value of each item and then putting both in a dictionary:
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test#test.com\"))";
var inputParts = input.Replace(" AND ", "")
.Trim(')', '(')
.Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);
var keyValuePairs = new Dictionary<string, string>();
for (int i = 0; i < inputParts.Length - 1; i += 2)
{
var key = inputParts[i];
var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');
keyValuePairs[key] = value;
}
foreach (var kvp in keyValuePairs)
{
Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}
// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();
Output
Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in above case its 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
Demo
For a strict matching, you can do:
Regex reg = new Regex(#"^Likes\((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested from I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str,#"\d+")
and in this canse, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
const string likes = "Likes (20)";
int likesCount = int.Parse(likes.Substring(likes.IndexOf('(') + 1, (likes.Length - likes.IndexOf(')') + 1 )));
Matching when the part in paranthesis is supposed to be a number;
string inputstring="Likes (20)"
Regex reg=new Regex(#"\((\d+)\)")
string num= reg.Match(inputstring).Groups[1].Value
Explanation:
By definition regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stand for digits. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for previous symbol or group repeated 1 or more times.
So \d+ will match one or more digits. It will match 20.
To insure that we get the number that is in paranteses we say that it should be between ( and ). These are special characters in regexp so we need to escape them.
(\d+) would match (20), and we are almost there.
Since we want the part inside the parantheses, and not including the parantheses we tell regexp that the digits part is a single group.
We do that by using parantheses in our regexp. ((\d+)) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parantheses things gets a little bit harder.
Regex reg=new Regex(#"\((.+)\)")
Would work for many strings. (the dot matches any character) But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character exept a closing paranthesis"
Regex reg=new Regex(#"\(([^\(]+)\)")
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested paranthesis, but since regular expressions are not the correct tool for nested paranthesis I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.
I have to validate a text box from a list of special characters that are not allowed.
This all are not allowed characters.
"&";"\";"/";"!";"%";"#";"^";"(";")";"?";"|";"~";"+";" ";
"{";"}";"*";",";"[";"]";"$";";";":";"=";"
Where semi-column is used to just separate between characters .I tried to write a regex for some characters to validate if it had worked i would extend it.it is not working .
What I am doing wrong in this.
Regex.IsMatch(textBox1.Text, #"^[\%\/\\\&\?\,\'\;\:\!\-]+$")
^[\%\/\\\&\?\,\'\;\:\!\-]+$
matches the strings that consist entirely of special characters. You need to invert the character class to match the strings that do not contain a special character:
^[^\%\/\\\&\?\,\'\;\:\!\-]+$
^--- added
Alternatively, you can use this regex to match any string containing only alphanumeric characters, hyphens, underscores and apostrophes.
^[a-zA-Z0-9\-'_]$
The regex you mention in the comments
[^a-zA-Z0-9-'_]
matches a string that contains any character except those that are allowed (you might need to escape the hyphen, though). This works as well, assuming you reverse the condition correctly (accept the strings that do not match).
If you are just looking for any of a list of characters then a regular expression is the more complicated option. String.IndexOfAny will return the first index of any of an array of characters or -1. So the check:
if (input.IndexOfAny(theCharacetrers) != -1) {
// Found one of them.
}
where theCharacetrers has previously been set up at class scope:
private readonly char[] theCharacetrers = new [] {'&','\','/','!','%','#','^',... };
You needed to remove ^ from the beginning and $ from the end of the pattern, otherwise in order to match the string should start and end with the special characters.
So, instead of
#"^[\%\/\\\&\?\,\'\;\:\!\-]+$"
it should be
#"[\%\/\\\&\?\,\'\;\:\!\-]+"
You can read more about start of string and end of string anchors here
Your RegExp is "string consiting only of special characters (since you have begin/end markers ^ and $).
You probably want just check if string does not contain any of the characters #"[\%\/\\\&\?\,\'\;\:\!\-]") would be enough.
Also String.IndexOfAny may be better fit if you just need to see if any of the characters is present in the source string.
PLease use this in textchange event
//Regex regex = new Regex("([a-zA-Z0-9 ._#]+)");
Regex regex = new Regex("^[a-zA-Z0-9_#(+).,-]+$");
string alltxt = txtOthers.Text;//txtOthers is textboxes name;
int k = alltxt.Length;
for (int i = 0; i <= k - 1; i++)
{
string lastch = alltxt.Substring(i, 1);
MatchCollection matches = regex.Matches(lastch);
if (matches.Count > 0)
{
}
else
{
txtOthers.Text = alltxt.Remove(i, 1);
i = i - 1;
alltxt = txtOthers.Text;
k = alltxt.Length;
}
txtOthers.Select(txtOthers.TextLength, 0);
}
BY Sharafu Hameed
I am trying to get a string to have an underscore '_' everywhere on a case changes in a string to make it more clear to the user. For example if we have a String 'personal IDnumber', I want to make it to 'personal_ID_number'. This is in C#.
Thank you, I appreciate any help, suggestions.
-Sid
Replace on a case change
You are looking for this (I removed the space before ID, assuming it a typo)
(?:(?<=[a-z])(?=[A-Z]))|(?:(?<=[A-Z])(?=[a-z]))
It will transform personalIDnumber into personal_ID_number
See it here on Regexr
the Lookbehind ((?<=[a-z])) and lookahead ((?=[A-Z])) construction is matching the empty string between a lower case and a uppercase letter (or the other way round in the second part after the pipe |) and replace this with a underscore.
Replace on a case change with optional whitespace
If you want to include whitespace in the replace, just include it between the lookarounds
(?:(?<=[a-z])\s*(?=[A-Z]))|(?:(?<=[A-Z])\s*(?=[a-z]))
It will transform personal IDnumber into personal_ID_number
See this here on Regexr
Replace on a case change, but each word starts with an uppercase
If you say every word starts with an uppercase letter, then you can do this
(?:(?<=[a-z])(?=[A-Z]))|(?:(?<=[A-Z])(?=[A-Z][a-z]))
This will make from
PersonalIDNumber
JustAnotherTest
OtherWordWithID
SomeMORETest
this
Personal_ID_Number
Just_Another_Test
Other_Word_With_ID
Some_MORE_Test
See this here on Regexr
Non-regex solution (just in case)
string str = "personalIDnumber";
var result =
Enumerable.Range(0, str.Length - 1)
.Where(i => char.IsUpper(str[i]) != char.IsUpper(str[i + 1]))
.ToList();
string resultString = result.Aggregate(str, (current, i) => current.Insert(i + 1 + result.IndexOf(i), "_"));
the result stores a list of locations where an _ has to be inserted and the Aggregate() function inserts it.
I'm working with X12 EDI Files (Specifically 835s for those of you in Health Care), and I have a particular vendor who's using a non-HIPAA compliant version (3090, I think). The problem is that in a particular segment (PLB- again, for those who care) they're sending a code which is no longer supported by the HIPAA Standard. I need to locate the specific code, and update it with a corrected code.
I think a Regex would be best for this, but I'm still very new to Regex, and I'm not sure where to begin. My current methodology is to turn the file into an array of strings, find the array that starts with "PLB", break that into an array of strings, find the code, and change it. As you can guess, that's very verbose code for something which should be (I'd think) fairly simple.
Here's a sample of what I'm looking for:
~PLB|1902841224|20100228|49>KC15X078001104|.08~
And here's what I want to change it to:
~PLB|1902841224|20100228|CS>KC15X078001104|.08~
Any suggestions?
UPDATE: After review, I found I hadn't quite defined my question well enough. The record above is an example, but it is not necessarilly a specific formatting match- there are three things which could change between this record and some other (in another file) I'd have to fix. They are:
The Pipe (|) could potentially be any non-alpha numeric character. The file itself will define which character (normally a Pipe or Asterisk).
The > could also be any other non-alpha numeric character (most often : or >)
The set of numbers immediately following the PLB is an identifier, and could change in format and length. I've only ever seen numeric Ids there, but technically it could be alpha numeric, and it won't necessarilly be 10 characters.
My Plan is to use String.Format() with my Regex match string so that | and > can be replaced with the correct characters.
And for the record. Yes, I hate ANSI X12.
Assuming that the "offending" code is always 49, you can use the following:
resultString = Regex.Replace(subjectString, #"(?<=~PLB|\d{10}|\d{8}|)49(?=>\w+|)", "CS");
This looks for 49 if it's the first element after a | delimiter, preceded by a group of 8 digits, another |, a group of 10 digits, yet another |, and ~PLB. It also looks if it is followed by >, then any number of alphanumeric characters, and one more |.
With the new requirements (and the lucky coincidence that .NET is one of the few regex flavors that allow variable repetition inside lookbehind), you can change that to:
resultString = Regex.Replace(subjectString, #"(?<=~PLB\1\w+\1\d{8}(\W))49(?=\W\w+\1)", "CS");
Now any non-alphanumeric character is allowed as separator instead of | or > (but in the case of | it has to be always the same one), and the restrictions on the number of characters for the first field have been loosened.
Another, similar approach that works on any valid X12 file to replace a single data value with another on a matching segment:
public void ReplaceData(string filePath, string segmentName,
int elementPosition, int componentPosition,
string oldData, string newData)
{
string text = File.ReadAllText(filePath);
Match match = Regex.Match(text,
#"^ISA(?<e>.).{100}(?<c>.)(?<s>.)(\w+.*?\k<s>)*IEA\k<e>\d*\k<e>\d*\k<s>$");
if (!match.Success)
throw new InvalidOperationException("Not an X12 file");
char elementSeparator = match.Groups["e"].Value[0];
char componentSeparator = match.Groups["c"].Value[0];
char segmentTerminator = match.Groups["s"].Value[0];
var segments = text
.Split(segmentTerminator)
.Select(s => s.Split(elementSeparator)
.Select(e => e.Split(componentSeparator)).ToArray())
.ToArray();
foreach (var segment in segments.Where(s => s[0][0] == segmentName &&
s.Count() > elementPosition &&
s[elementPosition].Count() > componentPosition &&
s[elementPosition][componentPosition] == oldData))
{
segment[elementPosition][componentPosition] = newData;
}
File.WriteAllText(filePath,
string.Join(segmentTerminator.ToString(), segments
.Select(e => string.Join(elementSeparator.ToString(),
e.Select(c => string.Join(componentSeparator.ToString(), c))
.ToArray()))
.ToArray()));
}
The regular expression used validates a proper X12 interchange envelope and assures that all segments within the file contain at least a one character name element. It also parses out the element and component separators as well as the segment terminator.
Assuming that your code is always a two digit number that comes after a pipe character | and before the greater than sign > you can do it like this:
var result = Regex.Replace(yourString, #"(\|)(\d{2})(>)", #"$1CS$3");
You can break it down with regex yes.
If i understand your example correctly the 2 characters between the | and the > need to be letters and not digits.
~PLB\|\d{10}\|\d{8}\|(\d{2})>\w{14}\|\.\d{2}~
This pattern will match the old one and capture the characters between the | and the >. Which you can then use to modify (lookup in a db or something) and do a replace with the following pattern:
(?<=|)\d{2}(?=>)
This will look for the ~PLB|#|#| at the start and replace the 2 numbers before the > with CS.
Regex.Replace(testString, #"(?<=~PLB|[0-9]{10}|[0-9]{8})(\|)([0-9]{2})(>)", #"$1CS$3")
The X12 protocol standard allows the specification of element and component separators in the header, so anything that hard-codes the "|" and ">" characters could eventually break. Since the standard mandates that the characters used as separators (and segment terminators, e.g., "~") cannot appear within the data (there is no escape sequence to allow them to be embedded), parsing the syntax is very simple. Maybe you're already doing something similar to this, but for readability...
// The original segment string (without segment terminator):
string segment = "PLB|1902841224|20100228|49>KC15X078001104|.08";
// Parse the segment into elements, then the fourth element
// into components (bounds checking is omitted for brevity):
var elements = segment.Split('|');
var components = elements[3].Split('>');
// If the first component is the bad value, replace it with
// the correct value (again, not checking bounds):
if (components[0] == "49")
components[0] = "CS";
// Reassemble the segment by joining the components into
// the fourth element, then the elements back into the
// segment string:
elements[3] = string.Join(">", components);
segment = string.Join("|", elements);
Obviously more verbose than a single regular expression but parsing X12 files is as easy as splitting strings on a single character. Except for the fixed length header (which defines the delimiters), an entire transaction set can be parsed with Split:
// Starting with a string that contains the entire 835 transaction set:
var segments = transactionSet.Split('~');
var segmentElements = segments.Select(s => s.Split('|')).ToArray();
// segmentElements contains an array of element arrays,
// each composite element can be split further into components as shown earlier
What I found is working is the following:
parts = original.Split(record);
for(int i = parts.Length -1; i >= 0; i--)
{
string s = parts[i];
string nString =String.Empty;
if (s.StartsWith("PLB"))
{
string[] elems = s.Split(elem);
if (elems[3].Contains("49" + subelem.ToString()))
{
string regex = string.Format(#"(\{0})49({1})", elem, subelem);
nString = Regex.Replace(s, regex, #"$1CS$2");
}
I'm still having to split my original file into a set of strings and then evaluate each string, but the that seams to be working now.
If anyone knows how to get around that string.Split up at the top, I'd love to see a sample.