Regex between and including characters across multiple lines - c#

I have the below text:
BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING VERIFICATION LETTER
Document Handle: 712826
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 261711
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558564.PDF
BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING CODES COMPLIANCE LETTER
Document Handle: 712825
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 19441
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558563.pdf
I need to use regex (which will go in a C# program) to convert this into something suitable for a CSV. The most vital data is the document handle and filename (path) from each section (a section being everything under a "BEGIN:"). I'm working on this for someone else, so I'd like to retain as much as possible in case they decide they need some of the other data. This was my initial attempt:
\r\n(?!BEGIN).*\:
However, not every section has an "Ad Hoc:" component, which throws off the cell alignment when pulled into Excel. Ad Hoc I know for sure is not part of the data that is needed for the end result.
The best case scenario would be to just select and remove everything between every "Ad Hoc" and "Handle:" to be replaced with the delimiter (;). I would then pipe this along with my above regex.
My only other requirement is that this has to all be in one regex statement - otherwise in the program I've written I'll have to set up some sort of loop or while business which I'm not prepared to do yet.

Based on what I understood from the comments underneath the question, the example data given in the question should be transformed into two text lines like this:
Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf
To achieve this result while avoiding a loop (although I wonder why you would want to avoid loops, as they are basic and omnipresent constructs), I would suggest applying two (or three, see section 3 below) regex substitutions.
1. Removal of "Label:" and replacement of line breaks with ";"
The first regular expression replaces each label in front of a ":" (including the ":" itself), together with any preceding line break, with a semicolon. However, it will neither remove nor replace a line break in front of "BEGIN:", nor will it touch the "BEGIN:" itself.
@"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*"
This regex is an OR-combination of two regexes:
[\r\n]+\s*Ad\sHoc:.*?[\r\n]+.*?:\s*
which will match "Ad Hoc:" lines including any "Label:" string in the following line, and
([\r\n]+(?!\s*BEGIN)).*?:\s*
which will match any "Label:" including the line break in front of it, except for the "BEGIN:" label.
Applying this regex to your example and replacing all matches with ";" will result in the following:
BEGIN:;Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
BEGIN:;Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf
Note the "BEGIN:;" which we will take care of now.
2. Elimination of the "BEGIN:" labels
This is a rather simple pattern when looking at the result of the first regex substitution.
"(?m)^BEGIN:;"
You might think that you can do this through a string replacement, and so did I when writing the first version of my answer. However, a mere string replacement would become a problem if "BEGIN:;" could be part of the content of any other text field. Better to be correct and safe by specifying a regex which matches only at the beginning of a line.
3. Code example, including elimination of empty lines in the source text
If you have empty lines containing whitespace in the source text, the regular expressions shown above might not work properly. The solution is to do another regex substitution beforehand, which reduces empty lines (including whitespace) to a single line break. If you are certain that your source data does not contain empty lines, you can omit this step.
A complete code example, which would produce the result as mentioned at the beginning of my answer, could look like this:
string sourceData = "..."; // your text with the source data
Regex reEmptyLines = new Regex(@"[\s\r\n]+[\r\n]", RegexOptions.Compiled);
Regex reSemicolons = new Regex(@"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*", RegexOptions.Compiled);
Regex reBegin = new Regex("(?m)^BEGIN:;", RegexOptions.Compiled);
string processed =
    reBegin.Replace(
        reSemicolons.Replace(
            reEmptyLines.Replace(sourceData, "\r\n"),
            ";"
        ),
        string.Empty
    );

You can use a regex, but I wouldn't say it is easier than doing it manually in a loop.
(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)
Sample code:
foreach (Match m in Regex.Matches(text, @"(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)"))
{
    Console.WriteLine(string.Join(",", m.Groups["value"].Captures.Cast<Capture>().Select(c => c.Value)));
}
Output:
Zoning Letter,4/16/2014,355211,712826,102,367,0,261711,0,1,0,0,16,0,2,0,0,\V367\2855\1558564.PDF
Zoning Letter,4/16/2014,355211,712825,102,367,0,19441,0,1,0,0,16,0,2,0,0,\V367\2855\1558563.pdf

How's this:
BEGIN:((?:(?!BEGIN:).)*)
With RegexOptions.Singleline set (so that . also matches line breaks), this would match everything from one BEGIN: up to, but not including, the next.
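As a rough illustration of how that pattern could be used, here is a minimal sketch that splits the text into sections and pulls out only the two fields the question calls vital. The class name, the method name and the two helper patterns for "Document Handle:" and ">>FileName:" are assumptions made for this example, not part of the original answer, and the sketch does use a loop.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SectionSplitter
{
    // The pattern from the answer; Singleline lets '.' match line breaks too.
    static readonly Regex Section =
        new Regex(@"BEGIN:((?:(?!BEGIN:).)*)", RegexOptions.Singleline);

    // Hypothetical helper patterns for the two fields of interest.
    static readonly Regex Handle = new Regex(@"Document Handle:\s*(\S+)");
    static readonly Regex FileName = new Regex(@">>FileName:\s*([^\r\n]+)");

    // Returns one "handle;filename" row per BEGIN: section.
    public static List<string> HandleAndFileName(string text)
    {
        var rows = new List<string>();
        foreach (Match section in Section.Matches(text))
        {
            string body = section.Groups[1].Value;
            rows.Add(Handle.Match(body).Groups[1].Value + ";" +
                     FileName.Match(body).Groups[1].Value);
        }
        return rows;
    }

    static void Main()
    {
        string text =
            "BEGIN:\r\nDocument Handle: 712826\r\n>>FileName: \\V367\\2855\\1558564.PDF\r\n" +
            "BEGIN:\r\nDocument Handle: 712825\r\n>>FileName: \\V367\\2855\\1558563.pdf";
        foreach (string row in HandleAndFileName(text))
            Console.WriteLine(row);
        // 712826;\V367\2855\1558564.PDF
        // 712825;\V367\2855\1558563.pdf
    }
}
```

A section that lacks one of the two fields simply yields an empty string in that position, because Groups[1].Value of a failed match is empty.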


Regex.Split returning whitespaces

I want to export a View as a HTML-Document to the User on my ASP.NET page. I want to give the option to only get a part of the view.
Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code so that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.
You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->. These matches are removed and the substrings that are not matched are returned, but the last char captured into the repeated capturing group is also returned.
So, if you plan to use regex for this task, you should consider rewriting it to @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->" or @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->" that will be much more efficient, or even @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->" that will ensure no nested Graphic2 comments are matched.
As you can see, the complexity of the regexps rises when you want to make sure your patterns work more efficiently and safely. However, even these longer versions do not guarantee 100% safety.
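The effect of the capturing group on Regex.Split can be seen in a small sketch. The sample string and the class and method names are made up for illustration; only the two patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Original pattern: the repeated capturing group (.|\n)* leaks its last capture.
    public static string[] WithCapture(string text) =>
        Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");

    // Rewritten pattern without a capturing group.
    public static string[] WithoutCapture(string text) =>
        Regex.Split(text, @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->");

    static void Main()
    {
        // Hypothetical sample input; note the space before <!--EndDiscarded-->.
        string text = "head<!--Graphic2-->cut me <!--EndDiscarded-->tail";

        Console.WriteLine(WithCapture(text).Length);    // 3: "head", " ", "tail"
        Console.WriteLine(WithoutCapture(text).Length); // 2: "head", "tail"
    }
}
```

In the first call the extra element is the single character captured by the last iteration of the group, which is the behavior the question ran into.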

getting the correct regex to print out in c#

Below is a regex statement I have been working on for quite some time:
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
What this is supposed to be doing is taking the email out of the text below:
2.3|[0246303@up.com]
For clarification, this email comes from a table in SQL Server. There are many emails formatted like this in there, and the regex is supposed to extract everything inside the brackets. However, it is matching the entirety of this line instead of what's inside of it. So my question is: is there something wrong with my regex statement, or is there something I need to add to my code?
Your regex is storing the email address in capture group 1. Try referencing group 1 like this:
parsedRequestData.Groups[1];
Code Sample:
string requestData = "2.3|[0246303@up.com]";
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
if (parsedRequestData.Success)
{
    Console.WriteLine(parsedRequestData.Groups[1]);
}
Results:
0246303@up.com
Your regex is OK. All you need is to use Groups[1]:
var email = Regex.Match("2.3|[0246303@up.com]", @"^.*\[(.*)\]$").Groups[1].Value;
However, it is matching the entirety of this line instead of what's inside of it.
Unless one uses named match captures, the match capture groups are indexed.
Match.Groups[0].Value is the whole match: all the text the pattern matched, whether captured or not.
Match.Groups[1].Value through Match.Groups[N].Value are the match captures, in the order in which their ( ) parentheses appear in the pattern. If there is only one ( ) there will be two indexed groups: 0 as mentioned above, and 1 for the captured text.
You only have one ( ) set, so the data you want is found in match capture group 1. Group 0 holds the non-captured text along with the captured data.
If one names the match capture such as (?<MyNameHere> ) one can also access the match via Match.Groups["MyNameHere"].Value.
Suggestion on your pattern away from the answer
Usage of * (zero or more) in patterns can be problematic in that it can significantly increase the time the parser takes, because it backtracks through scenarios that can never match.
If one knows there is text to be found, don't tell the parser zero items may happen when that is impossible, change it to + one or more. That slight change can greatly affect the parsing operations, both in time and operations.
Change ^.*\[(.*)\]$ to ^.+\[(.+)\]$.
But to increase the efficiency of the pattern even further, focus on the known characters [ and ] as anchors.
Pattern Restructure To Use Anchors
^[^[]+\[([^\]]+)[\s\]]+$
Why is this pattern better? Because we will look for "[" and "]" as anchors.
Let us break it down
^ - Beginning of the pattern (a hard anchor)
[^ ]+ This is a set notation where the ^ says NOT.
[^[]+ So we want to match all text + (one or more) that is NOT a [. This tells the pattern to match up to our anchor [ in the text. Note that we don't have to escape it, for the regex parser treats all characters in a set [ ] as literals, so [^[] is valid. (To be clear, this is a match-but-don't-capture text anchor, so we will not find this text in an index above the 0 index; only in 0.)
\[ Our literal anchor the "[" character.
([^\]]+) This is our match capture which says match this set where any character is valid but not an "]". Here we have to escape the ] because otherwise it would signify the end of our set.
[\s\]]+ we know the end of our text there will be spaces and the "]" character, so let us match (but not to capture) any combination of spaces and a ] before the end.
$ our final anchor, the end of the file/buffer indicator (or line if the right parser rule is set).
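To tie the group indexing and the named-group syntax together, here is a minimal sketch. It uses a simplified variant of the pattern above (a single closing \] instead of [\s\]]+, and an escaped \[ inside the set), and the group name email plus the class and method names are assumptions chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Simplified variant of the anchored pattern, with the capture given a name.
    public static Match Parse(string requestData) =>
        Regex.Match(requestData, @"^[^\[]+\[(?<email>[^\]]+)\]$");

    static void Main()
    {
        Match m = Parse("2.3|[0246303@up.com]");

        Console.WriteLine(m.Groups[0].Value);       // 2.3|[0246303@up.com]
        Console.WriteLine(m.Groups[1].Value);       // 0246303@up.com
        Console.WriteLine(m.Groups["email"].Value); // 0246303@up.com
    }
}
```

Because the pattern contains no unnamed groups, the named group is also reachable as index 1.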

Regular Expression to Replace Unwanted Letters

I wrote a small program in C# to capture in-game text.
My issue is that the text also contains color codes, which I don't want to have. I read about the function Regex.Replace,
which I think is going to be suitable for that.
I have the following string (line) that I want to clean up. I used the small tool Expresso to play a little bit with regular expressions, but I never really figured it out.
This is the String i am going to work with:
|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r |cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R
I tried to use ^|( [a-zA-Z0-9]{9})
which gave me these matches
c001177ff
cff00AA00
cff00AA00
cff00AA00
cffff69b4
cff00AA00
cff40e0d0
cffffff00
cffffff00
cff40e0d0
cffff69b4
cff00AA00
Well, I am not good at regex; more likely, I have only just started with it. I don't expect anybody to present me with a complete solution (though you are more than welcome to do that), just a little help on how I can solve this issue. I want to filter the text.
Input Code
|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r |cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R
Should be Filtered to this
Save Code = AGQg R9$# 4fR
I think these are hexadecimal color codes: the |c marks the beginning and the |r the end of the string. I think the |r is just used to indicate that the first color string ends, then we get a space, and the next | indicates the next start.
How about a simple Linq?
var output = String.Join("", input.Split('|')
.Select(s => s.Length != 10 ? ' ' : s.Last()))
.Trim();
So I think the problem you were having was not escaping your |... the following regex works for me:
var replaced = Regex.Replace(input, @"\|c[0-9a-zA-Z]{8}|\|r", "");
\|c[0-9a-zA-Z]{8} - match starting with "|c" and then any 8 letters or numbers
| - or
\|r - match "|r"
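A small self-contained sketch of that replacement, run against the string from the question. It narrows the character class to hexadecimal digits, which assumes (as the question suspects) that the eight characters after |c are always a hex color; the class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ColorCodes
{
    // Removes "|c" followed by eight hex digits, and the "|r" reset marker.
    // Assumption: color codes are always exactly eight hexadecimal digits.
    public static string Strip(string input) =>
        Regex.Replace(input, @"\|c[0-9a-fA-F]{8}|\|r", "");

    static void Main()
    {
        string input =
            "|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r " +
            "|cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R";

        Console.WriteLine(Strip(input)); // Save Code = AGQg R9$# 4fR
    }
}
```

Matching exactly eight characters matters here: in |cff40e0d09 the trailing 9 is payload, not part of the color code.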
You're on the right track. Your regex
^|( [a-zA-Z0-9]{9})
both forces the match to be only at the start of your input string, due to the ^ start-of-line anchor, and leaves the | unescaped. Unescaped, | is the special "or" (alternation) operator, which completely changes the meaning of your regex.
In addition, the space after the | is undesired, and the capture group is unnecessary, as you only want to eliminate this portion.
If you replace all instances of this
\|[a-zA-Z0-9]{9}
with nothing (the empty string)
You will achieve most of your goal. Try it here: http://regex101.com/r/rF6yB6/1
But it seems you really want to eliminate not just nine characters after the pipe, but up through nine characters. So use the {1,9} range quantifier instead:
\|[a-zA-Z0-9]{1,9}
Try it: http://regex101.com/r/rF6yB6/2
This seems to achieve your goal exactly.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.
string input = "[The example input from your question]";
string output = input.Replace("|r", "");
while (output.Contains("|c"))
    output = output.Remove(output.IndexOf("|c"), 10);
// output = "Save Code = AGQg R9$# 4fR"
I like this much more than using Regexes just because it's so much more clear to me.
var str1 = "|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r |cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R";
var str2 = Regex.Replace(str1, @"\|(r|[a-zA-Z0-9]{9})", ""); // "Save Code = AGQg R9$# 4fR"
In addition to this answer re: escaping the "pipe" character, you're starting your regex with the caret (^) character. This matches the beginning of a line.
A correct regex would be:
\|c[0-9a-zA-Z]{8}
This regex should match all of the characters you want to remove:
([|]c([0-9]|[a-f]|[A-F]){8})|[|]r
Here's the breakdown...
The vertical pipe is an OR marker, so to search for it, place it in square brackets [ and ].
The parenthesis makes a set. So you're searching for ([|]c([0-9]|[a-f]|[A-F]){8}) OR [|]r which is all of your color codes OR |r.
Breakdown of the color codes is the set that begins with |c and is followed by the set of exactly 8 characters that can be 0 through 9 or a through f or A through F.
I tested it at RegexPal.com.

Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple regex is ^[0-9]{4}[A-Z]{2}\.(?<noteline>.*), which would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if that line has the code WITHOUT a '.' (i.e. if the NEXT line would match ^[0-9]{4}[A-Z]{2}[^\.]).
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below gives me all lines, but I'd like to know IF a blank line (line with AA0000 code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (Like with a grouping name that's not spaces or empty, for high-level example.)
Edit 2: I may be misusing the term 'lookahead'. Going back over .NET's regex, I see something referred to as an Alternation Construct, but I'm not sure if that could be used here.
Thanks! Mike.
Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $, making them match the beginning and the end of every line instead of the beginning and end of the entire string.
var matches = Regex.Matches(input,
    @"^[0-9]{4}[A-Z]{2}\..*$?(?!^[0-9]{4}[A-Z]{2}[^.])",
    RegexOptions.Multiline);
The negative lookahead is
find(?!suffix)
It matches find only where it is not followed by suffix. Don't escape the dot within the brackets [ ]. The brackets disable the special meaning of most characters anyway.
I also added .*$? to make the pattern match until the end of the current line. Note that the ? here applies to the $ anchor (making it optional) rather than making * lazy; a lazy star would be written .*?. Since . does not match a line feed unless RegexOptions.Singleline is set, .* cannot run past the end of the current line anyway.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*$?(?!^[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
After doing a lot of research, and hit and misses, I'm certain now that it can't be done - or, rather - it CAN be but would be prohibitively difficult - easier to do it in code.
To recap, I was looking at a multiline string (document), where every line was preceded by a 6-character code. Some lines (the lines I'm interested in) have a '.' after the code, and then open text. I was hoping there would be a way to get me each line in a group, along with a flag letting me know if the next line has no free-text entry (no '.' after the code). I.e. a two-line data entry would give me two matches on the document. The first match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second match would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, the content wouldn't matter).
From what I understand, lookaheads are zero-width assertions, so that if it matches, the returnable value is still empty. Without using lookahead, the match for 'lastline' would consume the next line's code, making the 'notetext' skip that line (giving me every other line of text.) So, I would need to have some back-reference to revert back to.
By this time, it'd be easier (code-wise) to simply get all the lines, and add up text until I get to the end of their notes. (Looping over the entire document, which can't be more than 200 lines, as opposed to looping through the regex-matched lines, plus the ease of reading the code for future modifications, would outweigh any slight speed advantage the regex could get me.)
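The line-by-line approach described above might look roughly like the following sketch. It assumes the code format from the question's pattern (four digits followed by two letters), joins the lines of one note with a space, and treats any line without a '.' after the code as the end of a note; the class, method and group names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class NoteReader
{
    // Code (four digits, two letters), optionally followed by '.' and free text.
    static readonly Regex Line =
        new Regex(@"^[0-9]{4}[A-Z]{2}(?<dot>\.)?(?<noteline>.*)$");

    // Collects consecutive "code.text" lines into one note each.
    public static List<string> ReadNotes(string document)
    {
        var notes = new List<string>();
        var current = new StringBuilder();
        foreach (string raw in document.Split('\n'))
        {
            Match m = Line.Match(raw.TrimEnd('\r'));
            if (m.Success && m.Groups["dot"].Success)
            {
                // Still inside a note: append this line's text.
                if (current.Length > 0) current.Append(' ');
                current.Append(m.Groups["noteline"].Value);
            }
            else if (current.Length > 0)
            {
                // A line without '.' after the code ends the current note.
                notes.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) notes.Add(current.ToString());
        return notes;
    }

    static void Main()
    {
        string document =
            "0000AA.The epoch date for horizontal control is a decimal equivalence\n" +
            "0000AA.of Year/Month/Day.\n" +
            "0000AA\n" +
            "0000AA.Next note";
        foreach (string note in ReadNotes(document))
            Console.WriteLine(note);
        // The epoch date for horizontal control is a decimal equivalence of Year/Month/Day.
        // Next note
    }
}
```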
Thanks guys -
-Mike.

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that formats Brazilian phone numbers, but I want it to do so by matching from the end of the string, and not the beginning, so it would turn input strings into the following pattern:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"
Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (so I thought), then replacing with the desired end format, and afterwards just getting rid of the parentheses "()" in front if they were empty.
This is the C# code:
string s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning
The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"
It seems that it ignores the $ at the end to do the match, except that if I test with something less than 7 digits it goes like this:
"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"
Notice that it keeps the "minimum" {n} number of times of the third capture group always capturing it from the end, but then, the first two groups are capturing from the beginning as if the last group was non greedy from the end, just getting the minimum... weird or it's me?
Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");
"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?
I know this is probably some stupidity of mine, but wouldn't it be more reasonable if I want to capture at the end of the string, that all previous capture groups would be captured in reverse order?
I would think that "54444" would turn into "5-4444" in this last example... then it does not...
How would one accomplish this?
(I know maybe there's a better way to accomplish the very same thing using different approaches... but what I'm really curious about is why this particular behavior of the Regex seems odd. So, the answer to this question should focus on explaining why the last capture is anchored at the end of the string, and why the others are not, as demonstrated in this example. So I'm not particularly interested in the actual phone # formatting problem, but in understanding the Regex syntax)...
Thanks...
So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?
Use
^(\d{0,2}?)(\d{0,4})(\d{4})$
As a C# snippet, commented:
resultString = Regex.Replace(subjectString,
    @"^          # anchor the search at the start of the string
    (\d{0,2}?)   # match as few digits as possible, maximum 2
    (\d{0,4})    # match up to four digits, as many as possible
    (\d{4})      # match exactly four digits
    $            # anchor the search at the end of the string",
    "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);
By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.
Without the ? in the first group, what would happen when trying to match 123456?
First, the \d{0,2} matches 12.
Then, the \d{0,4} matches 3456.
Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.
Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match.
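Putting the lazy first group together with the clean-up steps from the question gives a sketch like the one below. The class and method names are hypothetical; the three regexes are the digit filter and empty-parentheses removal from the question plus the anchored pattern from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneFormatter
{
    public static string Format(string s)
    {
        // Drop anything that is not a digit.
        string digits = Regex.Replace(s, @"\D", "");

        // Lazy first group: the area code only takes digits the other groups cannot use.
        string formatted = Regex.Replace(
            digits, @"^(\d{0,2}?)(\d{0,4})(\d{4})$", "($1) $2-$3");

        // Remove an empty "() " prefix when there was no area code.
        return Regex.Replace(formatted, @"^\(\) ", "");
    }

    static void Main()
    {
        Console.WriteLine(Format("5135554444")); // (51) 3555-4444
        Console.WriteLine(Format("35554444"));   // 3555-4444
        Console.WriteLine(Format("5554444"));    // 555-4444
    }
}
```

Inputs with fewer than four digits do not match the pattern at all, so this sketch returns their digits unchanged.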
You have to tell it that it's OK if the first matching groups aren't there, but not the last one:
(\d{0,2}?)(\d{0,4}?)(\d{1,4})$
Matches your examples properly in my testing.
