C# - Removing a Line that matches a Regex

I have some data.. it looks similar to this:
0423 222222 ADH, TEXTEXT
0424 1234 ADH,MORE TEXT
0425 98765 ADH, TEXT 3609
2000 98765-4 LBL,IUC,PCA,S/N
0010 99999-27 LBL,IUI,1.0x.25
9000 12345678 HERE IS MORE, TEXT
9010 123-123 SOMEMORE,TEXT1231
9100 SD178 YAYFOR, TEXT01
9999 90123 HEY:HOW-TO DOTHIS
And I would like to remove each entire line that begins with a 9xxx. Right now I have tried Replacing the value using Regex. Here is what I have for that:
output = Regex.Replace(output, @"^9[\d]{3}\s+[\d*\-*\w*]+\s+[\d*\w*\-*\,*\:*\;*\.*\d*\w*]+", "");
However, this is really hard to read and it actually does not delete the entire line.
CODE:
Here is the section of the code I am using:
try
{
// Resets the formattedTextRichTextBox so multiple files aren't loaded on top of each other.
formattedTextRichTextBox.ResetText();
foreach (string line in File.ReadAllLines(openFile.FileName))
{
// Uses regular expressions to find a line that has, digit(s), space(s), digit(s) + letter(s),
// space(s), digit(s), space(s), any character (up to 25 times).
Match theMatch = Regex.Match(line, @"^[\.*\d]+\s+[\d\w]+\s+[\d\-\w*]+\s+.{25}");
if (theMatch.Success)
{
// Stores the matched value in string output.
string output = theMatch.Value;
// Replaces the text with the required layout.
output = Regex.Replace(output, @"^[\.*\d]+\s+", "");
//output = Regex.Replace(output, @"^9[\d]{3}\s+[\d*\-*\w*]+\s+[\d*\w*\-*\,*\:*\;*\.*\d*\w*]+", "");
output = Regex.Replace(output, @"\s+", " ");
// Sets the formattedTextRichTextBox to the string output.
formattedTextRichTextBox.AppendText(output);
formattedTextRichTextBox.AppendText("\n");
}
}
}
OUTCOME:
So what I would like the new data to look like is in this format (removed 9xxx):
0423 222222 ADH, TEXTEXT
0424 1234 ADH,MORE TEXT
0425 98765 ADH, TEXT 3609
2000 98765-4 LBL,IUC,PCA,S/N
0010 99999-27 LBL,IUI,1.0x.25
QUESTIONS:
Is there an easier way to go about this?
If so, can I use regex to go about this or must I use a different way?

Just reformulate the regex that tests your format to match everything that doesn't begin with 9 - that way lines starting with 9 are not added to the rich text box.
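A minimal sketch of that idea, assuming a valid line is four digits, then an identifier, then free text. The class and method names are hypothetical and not from the question. A negative lookahead rejects lines that start with 9xxx, so they never reach the rich text box:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class LineFilter
{
    // Assumed line shape: 4 digits, whitespace, an identifier, whitespace, free text.
    // The (?!9\d{3}\b) lookahead makes the match fail for lines that begin with 9xxx.
    static readonly Regex Keep = new Regex(@"^(?!9\d{3}\b)\d{4}\s+\S+\s+.+");

    // Returns only the lines that match the expected format and do not start with 9xxx.
    public static List<string> KeepLines(IEnumerable<string> lines) =>
        lines.Where(l => Keep.IsMatch(l)).ToList();
}
```

Each kept line could then be passed to AppendText in the existing loop.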

Try this (uses LINQ):
// Requires: using System.IO; using System.Linq; using System.Text.RegularExpressions;
// Create a regex to identify lines that start with 9XXX.
Regex rgx = new Regex(@"^9\d{3}");
// LINQ query that filters out the lines that start with 9XXX.
var validLines =
(
// Specifies which sequence to pick the data from.
from ln in File.ReadAllLines(openFile.FileName)
// Specifies the filter applied when selecting the data.
where !rgx.IsMatch(ln)
// Specifies what to select from the filtered data.
select ln
).ToArray(); // Converts the IEnumerable<string> to a string array (ln above is a string).
// Finally, join the filtered entries with "\n" using String.Join and append the result to the text box.
formattedTextRichTextBox.AppendText(String.Join("\n", validLines));

Yes, there is a simpler way. Just use Regex.Replace method, and provide Multiline option.
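A minimal sketch of what that could look like, assuming the whole file content is held in one string (the helper name is hypothetical). With RegexOptions.Multiline, ^ matches at the start of every line; the pattern then consumes the rest of the line and its trailing newline so no blank line is left behind:

```csharp
using System.Text.RegularExpressions;

static class NineRemover
{
    // Removes every line that starts with 9 followed by three digits.
    // '.' does not match '\n' but does match '\r', so ".*\n?" swallows
    // the rest of the line including a "\r\n" or "\n" line ending.
    public static string RemoveNineLines(string text) =>
        Regex.Replace(text, @"^9\d{3}\b.*\n?", "", RegexOptions.Multiline);
}
```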

Why don't you just match the leading 9xxx part and then use a wildcard to match the rest of the line? It would be a lot more readable.
output = Regex.Replace(output, @"^9\d{3}.*", "");

Related

Regex Replace between groups

So I have the following regex.replace in C#:
Regex.Replace(inputString, @"^([^,]*,){5}(.*)", @"$1somestring,$2");
where 5 is a variable number in code, but that's not really relevant since at the time of execution it will always have a set value (like 5, for example). Same with somestring,.
Essentially I want to input somestring, between the two groups. The output works for somestring,$2, but $1 is just printed as $1. So say whatever (.*) grabs = "2, a, f2" the resulting string I'd get out is $1somestring,2,a,f2 no matter what $1 is. Is this because of the repeating group feature {5}? If so, how do I grab the collection of repeats and put it in place of where I have $1 right now?
Edit: I know the first group captures correctly, as well. I grab the content of somestring, using this regex:
Regex.Match(line, @"^([^,]*,){5}([0-9]+\.[0-9]+),.*");
The first part is identical to the first group in the replacement regex, and it works fine, so there shouldn't be an issue (and they're both used on the same string).
Edit2:
Ok I'll try to explain more of the process since someone said it was hard to understand. I have three variables, line a string I work with, and latIndex and lonIndex which are just ints (tells me between what ,'s two doubles I look for are located). I have the two following matches:
var latitudeMatch = Regex.Match(line, @"^([^,]*,){" + latIndex + @"}([0-9]+\.[0-9]+),.*");
var longitudeMatch = Regex.Match(line, @"^([^,]*,){" + lonIndex + @"}([0-9]+\.[0-9]+),.*");
I then grab the doubles:
var latitude = latitudeMatch.Groups[2].Value;
var longitude = longitudeMatch.Groups[2].Value;
I use these doubles to get a string from a web API, which I store in a variable called veiRef. Then I want to insert it after the doubles, using the following code (insert after lat or lon, depending on which one appears last):
if (latIndex > lonIndex)
{
line = Regex.Replace(line, @"^([^,]*,){" + (latIndex + 1) + @"}(.*)", $@"$1{veiRef},$2");
}
else
{
line = Regex.Replace(line, @"^([^,]*,){" + (lonIndex + 1) + @"}(.*)", $@"$1{veiRef},$2");
}
However, this results in a string line which doesn't have the content of $1 inserted before it ($2 works fine).
You have a repeated capturing group at the start of the pattern that you need to turn into a non-capturing one and wrap with a capturing group. Then, you may access the whole part of the match with the $1 backreference.
var line = "a, s, f, double, double, 12, sd, 1";
var latIndex = 5;
var pat = $@"^((?:[^,]*,){{{latIndex+1}}})(.*)";
// Console.WriteLine(pat); // => ^((?:[^,]*,){6})(.*)
var veiRef = "str";
line = Regex.Replace(line, pat, $"${{1}}{veiRef.Replace("$","$$")}$2");
Console.WriteLine(line); // => a, s, f, double, double, 12,str sd, 1
See the C# demo
The pattern - ^((?:[^,]*,){6})(.*) - now contains ((?:[^,]*,){6}) after ^, and this is now what $1 holds after a match is found.
Since your replacement string is dynamic, you need to make sure any $ inside gets doubled (hence, .Replace("$","$$")) and that the first backreference is unambiguous, thus it should look like ${1} (it will work regardless whether the veiRef starts with a digit or not).
Replacement string in details:
It is an interpolated string literal...
$" - declaration of the interpolated string literal (start)
${{1}} - a literal ${1} string (the { and } must be doubled to denote literal symbols)
{veiRef.Replace("$","$$")} - a piece of C# code inside the interpolated string literal (we delimit this part where code is permitted with single {...})
$2 - a literal $2 string
" - end of the interpolated string literal.
Adding an extra group around the repeating capturing group seems to provide the desired output for the example you gave.
Regex.Replace("a, s, f, double, double, 12, sd, 1", @"^(([^,]*,){5})(.*)", @"$1somestring,$3");
I'm not an expert on RegEx and someone can probably explain it better than I, but:-
Group 1 is the set of 5 repeating capture groups
Group 2 is the last of the repeating capture groups
Group 3 is the text after the 5 repeating capture groups.
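For the "how do I grab the collection of repeats" part, .NET also keeps every repetition of a repeated group in Group.Captures, while Group.Value only holds the last one. A minimal sketch, assuming a single-line input; the class and method names are hypothetical. It rebuilds the string by hand, which also sidesteps any $ escaping in the inserted text:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class RepeatedGroups
{
    // Concatenates every repetition captured by group 1, not just the last one.
    public static string AllRepeats(Match m) =>
        string.Concat(m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value));

    // Inserts `insert` after the first `count` comma-terminated fields.
    // Returns the line unchanged when it has fewer than `count` commas.
    public static string InsertAfter(string line, int count, string insert)
    {
        Match m = Regex.Match(line, @"^([^,]*,){" + count + @"}(.*)");
        if (!m.Success) return line;
        return AllRepeats(m) + insert + m.Groups[2].Value;
    }
}
```

Here m.Groups[1].Value would be only the final field (for example "c," for "a,b,c,d" with count 3), which is why the Captures collection is used instead.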

Extract Menu from String

I want to extract a menu from a string whenever there is one.
recipe ABC: Quelle bonne idée!
L: 33348, C: 2130
1 Like
2 Comment
3 Next
4 See Comments
# Home
Since I am new to regex, I tried this for a start:
If Regex.IsMatch(text, "(\d\w*\n)*") Then
End If
And it returned true.
Am I doing this right?
I want to be able to extract the menu whenever there is one. Menus don't have a pre-defined format. So I used whatever starts with number \d followed by alphanumeric character \w and new line \n.
After regex returning true, how can I extract the text that did match the regex?
Any help would be appreciated.
You can use the regex (?sm).*?(?=^\d+\s+\p{L}+[\r\n]), which takes everything from the beginning up to the first line (due to ^) that starts with a number (\d+), then some whitespace (\s+), then some letters (\p{L}+), then a newline ([\r\n]):
var txt ="Lorem ipsum:amet, consectetur adipiscing elit!!\r\nL: 33348, C: 2130\r\n\r\n1 Next\r\n\r\n2 Forward\r\n\r\n3 Last\r\n\r\n4 See more";
var rx = new Regex(@"(?sm).*?(?=^\d+\s+\p{L}+[\r\n])");
var res = rx.Match(txt).Value;
However, I believe your menu always starts with 1 at the line start, and all menu items are generally capitalized. That is why I suggest using another regex to reflect the following conditions: take all until a line that starts with 1 followed by some space(s), and then by an uppercase letter:
var rx = new Regex(@"(?sm).*(?=^1\s+\p{Lu})");
Or, you can try to split the string into lines, and check if a line starts with 1.
var out2 = string.Join("\r\n",txt.Split(new string[] { "\r\n" }, StringSplitOptions.None).TakeWhile(p => !p.StartsWith("1 ")).ToList());
You are using IsMatch, which only tells you whether the pattern matched anything.
To get the matched text, use something like this:
Regex regex = new Regex(@"(\d\w*\n)*");
Match match = regex.Match(yourText);
if (match.Success)
{
Console.WriteLine(match.Value);
}
Disclaimer : As your question was not about your expression itself, I haven't checked what it does. You haven't asked help on that part so I didn't give any.
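To pull out the menu items themselves, one option is to collect every line that starts with a number. This is a rough sketch under the assumption that a menu item is "digits, spaces or tabs, then text" on its own line; the names are hypothetical, and items such as "# Home" would need an extra alternative in the pattern:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MenuExtractor
{
    // With Multiline, ^ matches at the start of each line.
    // [^\r\n]+ stops at the line ending, so "\r" is never included in an item.
    public static List<string> ExtractMenu(string text) =>
        Regex.Matches(text, @"^\d+[ \t]+[^\r\n]+", RegexOptions.Multiline)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();
}
```

An empty list then means "no menu found", which replaces the separate IsMatch check.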

C# Replacing Multiple Spaces with 1 space leaving special characters intact

Having a bit of a problem as I have to translate a string into a table. I'd like to remove multiple spaces, but not all of them. So the data in text comes back with lots of spaces in between like so:
SESSIONNAME USERNAME ID STATE TYPE DEVICE\r\n
services 0 Disc \r\n
console 1 Conn \r\n
alinav 2 Disc \r\n
rdp-tcp 65536 Listen \r\n
I would like to keep the \r\n values that define my rows, keep the empty values that are legitimate under some columns, and keep the spaces that separate the columns. But I want to remove the extra spaces so they are not fed into the values.
I've tried:
output = Regex.Replace(output, @"\s{2,}", " ", RegexOptions.Multiline);
output = output.Replace("  ", " ");
But the first one just removes everything (things I need and don't need). And the second one still leaves too many spaces.
Thanks.
You can do two things:
Use a literal space in the regular expression; \s also matches other whitespace characters such as \n, \r and \t, which is why your line breaks disappear:
output = Regex.Replace(output, @" +", " ", RegexOptions.Multiline);
Or apply the second method until convergence:
string s2 = output;
do {
output = s2;
s2 = s2.Replace("  ", " ");
} while(output != s2);
In most cases the first method will outperform the second one. This is because the first method collapses each run of spaces in a single pass. Regexes are generally a bit slower than simple string replacement, but if the string contains long runs of spaces, the first method will be faster.
In your example the data is delimited by position, not by characters; is that correct? If so, you should extract by position; something like:
foreach (string s in output.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
{
var sessionName = s.Substring(0, 18).Trim();
var userName = s.Substring(18, 19).Trim();
var id = Int32.Parse(s.Substring(37, 8).Trim());
var whateverType = s.Substring(45, 12).Trim();
var device = s.Substring(57, 6).Trim();
}
Of course you need to do proper error checking, and should probably put the field widths in an array and calculate positions instead of hard-coding them as I have shown.
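A minimal sketch of the "field widths in an array" suggestion, with bounds handling so short lines do not throw. The helper name is hypothetical, and the widths you pass in are whatever your real column layout turns out to be:

```csharp
using System;

static class FixedWidth
{
    // Splits one line into trimmed fields using the given column widths.
    // Columns that lie beyond the end of a short line come back as empty strings.
    public static string[] SplitFixed(string line, int[] widths)
    {
        var fields = new string[widths.Length];
        int pos = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            int start = Math.Min(pos, line.Length);
            int len = Math.Min(widths[i], line.Length - start);
            fields[i] = line.Substring(start, len).Trim();
            pos += widths[i];
        }
        return fields;
    }
}
```

With the widths from the snippet above, a call would look like FixedWidth.SplitFixed(s, new[] { 18, 19, 8, 12, 6 }); an empty USERNAME column then stays an empty field instead of shifting the remaining values.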

Regex replacing inside of

Well, I have this code:
StreamReader sr = new StreamReader(@"main.cl", true);
String str = sr.ReadToEnd();
Regex r = new Regex(@"&");
string[] line = r.Split(str);
foreach (string val in line)
{
string Change = val.Replace("puts","System.Console.WriteLine()");
Console.Write(Change);
}
As you can see, I'm trying to replace puts (content) with Console.WriteLine(content), but that would need regular expressions and I haven't found a good article about how to do this.
Basically, taking * as the value that is coming, I'd like to do this:
string Change = val.Replace("puts *","System.Console.WriteLine(*)");
Then, if I receive:
puts "Hello World";
I want to get:
System.Console.WriteLine("Hello World");
You need to use Regex.Replace to capture part of the input by using a capturing group and include the captured match into the output. Example:
Regex.Replace(
"puts 'foo'", // input
"puts (.*)", // .* means "any number of characters"
"System.Console.WriteLine($1)") // $1 stands for whatever (.*) matched
If the input always ends in a semicolon you would want to move that semicolon outside the WriteLine parens. One way to do that is:
Regex.Replace(
"puts 'foo';", // input
"puts (.*);", // ; outside parens -- now it's not captured
"System.Console.WriteLine($1);") // manually adding the fixed ; at the end
If you intend to adapt these examples it's a good idea to consult a technical reference first; you can find a very good one here.
What you want to do is look at Grouping Expressions. Give the following a try
Regex.Replace(val, "puts (.*);", "System.Console.WriteLine(${1});");
Note that you can also name your groups, as opposed to using their indexes for replacement. You can do this like so:
Regex.Replace(val, "puts (?<str>.*);", "System.Console.WriteLine(${str});");
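Putting the pieces together, a small sketch of a translation helper (the name Translate is hypothetical). It assumes each puts statement ends at the first semicolon, so an argument that itself contains a ; would be cut short:

```csharp
using System.Text.RegularExpressions;

static class PutsTranslator
{
    // \bputs keeps words such as "inputs" from matching.
    // (?<arg>.+?) lazily captures everything up to the first semicolon.
    public static string Translate(string source) =>
        Regex.Replace(source, @"\bputs\s+(?<arg>.+?);", "System.Console.WriteLine(${arg});");
}
```

Because Regex.Replace handles every match in the input, the same call can be applied to the whole file content rather than to each split part.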

Regex: replace inner string

I'm working with X12 EDI Files (Specifically 835s for those of you in Health Care), and I have a particular vendor who's using a non-HIPAA compliant version (3090, I think). The problem is that in a particular segment (PLB- again, for those who care) they're sending a code which is no longer supported by the HIPAA Standard. I need to locate the specific code, and update it with a corrected code.
I think a Regex would be best for this, but I'm still very new to Regex, and I'm not sure where to begin. My current methodology is to turn the file into an array of strings, find the array that starts with "PLB", break that into an array of strings, find the code, and change it. As you can guess, that's very verbose code for something which should be (I'd think) fairly simple.
Here's a sample of what I'm looking for:
~PLB|1902841224|20100228|49>KC15X078001104|.08~
And here's what I want to change it to:
~PLB|1902841224|20100228|CS>KC15X078001104|.08~
Any suggestions?
UPDATE: After review, I found I hadn't quite defined my question well enough. The record above is an example, but it is not necessarily a specific formatting match: there are three things which could change between this record and some other (in another file) I'd have to fix. They are:
The Pipe (|) could potentially be any non-alpha numeric character. The file itself will define which character (normally a Pipe or Asterisk).
The > could also be any other non-alpha numeric character (most often : or >)
The set of numbers immediately following the PLB is an identifier, and could change in format and length. I've only ever seen numeric Ids there, but technically it could be alphanumeric, and it won't necessarily be 10 characters.
My Plan is to use String.Format() with my Regex match string so that | and > can be replaced with the correct characters.
And for the record. Yes, I hate ANSI X12.
Assuming that the "offending" code is always 49, you can use the following:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\|\d{10}\|\d{8}\|)49(?=>\w+\|)", "CS");
This looks for 49 if it's the first element after a | delimiter, preceded by a group of 8 digits, another |, a group of 10 digits, yet another |, and ~PLB. It also looks if it is followed by >, then any number of alphanumeric characters, and one more |.
With the new requirements (and the lucky coincidence that .NET is one of the few regex flavors that allow variable repetition inside lookbehind), you can change that to:
resultString = Regex.Replace(subjectString, @"(?<=~PLB\1\w+\1\d{8}(\W))49(?=\W\w+\1)", "CS");
Now any non-alphanumeric character is allowed as separator instead of | or > (but in the case of | it has to be always the same one), and the restrictions on the number of characters for the first field have been loosened.
Another, similar approach that works on any valid X12 file to replace a single data value with another on a matching segment:
public void ReplaceData(string filePath, string segmentName,
int elementPosition, int componentPosition,
string oldData, string newData)
{
string text = File.ReadAllText(filePath);
Match match = Regex.Match(text,
@"^ISA(?<e>.).{100}(?<c>.)(?<s>.)(\w+.*?\k<s>)*IEA\k<e>\d*\k<e>\d*\k<s>$");
if (!match.Success)
throw new InvalidOperationException("Not an X12 file");
char elementSeparator = match.Groups["e"].Value[0];
char componentSeparator = match.Groups["c"].Value[0];
char segmentTerminator = match.Groups["s"].Value[0];
var segments = text
.Split(segmentTerminator)
.Select(s => s.Split(elementSeparator)
.Select(e => e.Split(componentSeparator)).ToArray())
.ToArray();
foreach (var segment in segments.Where(s => s[0][0] == segmentName &&
s.Count() > elementPosition &&
s[elementPosition].Count() > componentPosition &&
s[elementPosition][componentPosition] == oldData))
{
segment[elementPosition][componentPosition] = newData;
}
File.WriteAllText(filePath,
string.Join(segmentTerminator.ToString(), segments
.Select(e => string.Join(elementSeparator.ToString(),
e.Select(c => string.Join(componentSeparator.ToString(), c))
.ToArray()))
.ToArray()));
}
The regular expression used validates a proper X12 interchange envelope and assures that all segments within the file contain at least a one character name element. It also parses out the element and component separators as well as the segment terminator.
Assuming that your code is always a two digit number that comes after a pipe character | and before the greater than sign > you can do it like this:
var result = Regex.Replace(yourString, @"(\|)(\d{2})(>)", @"$1CS$3");
Yes, you can break it down with a regex.
If I understand your example correctly, the 2 characters between the | and the > need to be letters and not digits.
~PLB\|\d{10}\|\d{8}\|(\d{2})>\w{14}\|\.\d{2}~
This pattern will match the old record and capture the characters between the | and the >, which you can then use to decide on the new value (look it up in a db or something) and do a replace with the following pattern:
(?<=\|)\d{2}(?=>)
The following looks for the ~PLB|#|#| at the start and replaces the 2 digits before the > with CS:
Regex.Replace(testString, @"(?<=~PLB\|[0-9]{10}\|[0-9]{8})(\|)([0-9]{2})(>)", @"$1CS$3")
The X12 protocol standard allows the specification of element and component separators in the header, so anything that hard-codes the "|" and ">" characters could eventually break. Since the standard mandates that the characters used as separators (and segment terminators, e.g., "~") cannot appear within the data (there is no escape sequence to allow them to be embedded), parsing the syntax is very simple. Maybe you're already doing something similar to this, but for readability...
// The original segment string (without segment terminator):
string segment = "PLB|1902841224|20100228|49>KC15X078001104|.08";
// Parse the segment into elements, then the fourth element
// into components (bounds checking is omitted for brevity):
var elements = segment.Split('|');
var components = elements[3].Split('>');
// If the first component is the bad value, replace it with
// the correct value (again, not checking bounds):
if (components[0] == "49")
components[0] = "CS";
// Reassemble the segment by joining the components into
// the fourth element, then the elements back into the
// segment string:
elements[3] = string.Join(">", components);
segment = string.Join("|", elements);
Obviously more verbose than a single regular expression but parsing X12 files is as easy as splitting strings on a single character. Except for the fixed length header (which defines the delimiters), an entire transaction set can be parsed with Split:
// Starting with a string that contains the entire 835 transaction set:
var segments = transactionSet.Split('~');
var segmentElements = segments.Select(s => s.Split('|')).ToArray();
// segmentElements contains an array of element arrays,
// each composite element can be split further into components as shown earlier
What I found is working is the following:
parts = original.Split(record);
for(int i = parts.Length -1; i >= 0; i--)
{
string s = parts[i];
string nString =String.Empty;
if (s.StartsWith("PLB"))
{
string[] elems = s.Split(elem);
if (elems[3].Contains("49" + subelem.ToString()))
{
string regex = string.Format(@"(\{0})49({1})", elem, subelem);
nString = Regex.Replace(s, regex, @"$1CS$2");
}
}
}
I'm still having to split my original file into a set of strings and then evaluate each string, but that seems to be working now.
If anyone knows how to get around that string.Split up at the top, I'd love to see a sample.
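One possible way around the Split is a single Regex.Replace over the whole file, with the three delimiters injected into the pattern. This is only a sketch under stated assumptions: the bad code is always 49, it is the first component of the fourth element of a PLB segment (as in the sample record), and the delimiters have already been read from the file header. The class and method names are hypothetical. Each delimiter is written as a \uXXXX escape so it stays literal both inside and outside a character class:

```csharp
using System.Text.RegularExpressions;

static class PlbFixer
{
    // Encodes a delimiter as \uXXXX so characters such as | * ^ ] are always literal.
    static string Hex(char c) => "\\u" + ((int)c).ToString("X4");

    public static string FixPlbCode(string text, char segmentTerminator,
                                    char elementSeparator, char componentSeparator)
    {
        string s = Hex(segmentTerminator);
        string e = Hex(elementSeparator);
        string c = Hex(componentSeparator);
        // Group 1: start of a PLB segment plus its first two data elements.
        // Then the literal 49, which must be followed by the component separator.
        string pattern = "((?:^|" + s + ")\\s*PLB" + e
                       + "[^" + e + s + "]*" + e
                       + "[^" + e + s + "]*" + e + ")49(?=" + c + ")";
        return Regex.Replace(text, pattern, "${1}CS");
    }
}
```

Segments other than PLB, and a 49 in any other position, are left untouched because the pattern is anchored to the start of a PLB segment.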
