count a multiple set of string that all appeared in one line - c#

here is the C# line of code that tries to find the matching string in a line using regex in a textfile
MatchCollection collection = Regex.Matches(readedLine, #"/funcdesc=cls/ && /jobcat=VSO/");
countedChars = collection.Count;
This is the sample textfile content
2016-01-01 d;D;;D;funcdesc=cls&workcode=file&jobcat=VSO&jobcat=DSO;
2016-01-02 d;D;;D;funcdesc=cls&workcode=file&jobcat=DSO&jobcat=DSO;
expected total count output should be 1
(because line 1 meet the requirement where both "funcdesc=cls" and "jobcat=VSO" are there, however line 2 did not because there is no "jobcat=VSO" found only the first string.

Since the order of the "funcdesc=cls" and "jobcat=VSO" aren't fixed (i.e., "funcdesc" can come after "jobcat"), you can use the following Regex to capture a match for either case. There may be more efficient ways to do it, this is just off the top of my head:
/funcdesc=cls.+jobcat=VSO|jobcat=VSO.+funcdesc=cls/
The | is a way to say "OR" in Regex, i.e., either "funcdesc=cls" followed by one or more (.+) characters followed by "jobcat=VSO", or "jobcat=VSO" followed by one or more characters followed by "funcdesc=cls".
This will match the following:
2016-01-01 d;D;;D;funcdesc=cls&workcode=file&jobcat=VSO&jobcat=DSO;
or
2016-01-01 d;D;;D;jobcat=VSO&workcode=file&funcdesc=cls&jobcat=DSO;
but will not match
2016-01-02 d;D;;D;funcdesc=cls&workcode=file&jobcat=DSO&jobcat=DSO;

Right pattern would be funcdesc=cls.*jobcat=VSO
Also $ in regex means end of line.

Related

Matching a sequence of characters splitted by spaces after a prefix

I have the following strings:
-prefix <#141222969505480701> where the second part e.g. <#141222969505480701> can be repeated unlimited times (only the numbers change).
-prefix 141222969505480701 which should behave the same as above.
-prefix 141222969505480701 <#141222969505480702> which would still be able to repeat itself forever.
The last one should have groups containing 141222969505480701 and 141222969505480702.
So a few bits of information:
The digit chains are always 18 in total so I use \d{18} in my regex
I would like to have the numbers in groups for me to use them afterwards.
What I tried
First of I tried to match the first of my example strings.
-prefix(\s<#\d{18}>)\1* which would match the entire string, but I would like to have the digits itself in its own group. Also this method only matches the same parts e.g. <#141222969505480701> <#141222969505480701> <#141222969505480701> would match, but any other number in between wouldn't match.
What would sound logical in my head
-prefix (\d{18})+ but it would only match the first one of the 'digit parts'.
While I was testing it on regex101 it told me the following:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
I tried to adjust the regex to the following -prefix ((\d{18})+), but with the same result.
With the help to of #madreflection in the comments I was able to come up with this solution:
-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+
Which is exactly what I needed, which even ignores spaces in between. Also with the use of match.Groups["digits"].Captures it made the whole story a lot easier.
You could use an alternation to list the 3 different allowed formats. In .NET it is supported to reuse the group name.
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digits>[0-9]{18}))
Pattern parts
-prefix\s* Match literally followed by 0+ whitespace characters
(?: Non capturing group
(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})> 2 named capturing groups which will match the digits
| Or
(?<digits>[0-9]{18}) Named capturing group, match digits only
| Or
<#(?<digits>[0-9]{18}) Named capturing group, match digits between brackets only
)
Regex demo
You could also use 2 named capturing groups, 1 for each format. For example:
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digitsBrackets>[0-9]{18}))
Regex demo

Find a pattern using Regex

I have a bunch of string lines with different formats. I want to find a pattern using regex in order to match specific lines. I have tried to figure it out myself up to some degree, using this: \b([A-Z0-9]{2,})\b. However I wasn't able to find the right pattern that will match only the lines 3, 6 and 8.Thank you.
// DONE:
return Test;
TESTER
MessageBoxButtons.OK,
.GetConnectionString();
TOURNAMENT TRACKER
// Create
TEST 4 ME
My guess is that your solution also matched the first and fourth line. If you want to exclude the lines with characters other than those specified, you could look at the whole line instead of checking single words:
^[0-9A-Z]+(\s[0-9A-Z]+)*$
It will match lines consisting of white space separated words which contain numbers or uppercase letters.
If you check the whole line you can use this
^[A-Z0-9 ]+$
Assuming case-sensitivity is set then this will match only uppercase characters, numbers and spaces from the start to the end of the line.
See demo here

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how to say I can get a regex (C#) search of the first 3 letters of a full name?
Without the use of (.*)
I used (.**)but it scrolls the text far beyond the requested name, or
if it finds the first condition and after 100 words find the second condition he return a text that is not the look, so I have to limit in number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with less than 3 characters if they exist in name.
Search any name that contains the first 3 characters that start with:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero or more quantifier (*) is 'greedy' by default—that is, it will consume as many characters as possible in order to finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace characters containing at most two characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
I think what you want is this regular expression to check if it is true and is case insensitive
#"^[Mar|Jac|Rey]{3}"
Less specific:
#"^[\w]{3}"
If you want to capture the first three letters of every words of at least three characters words you could use something like :
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return you a serie of Match named "name", each one of them is a result.
Code sample :
Regex regex = new Regex(#"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you want capture also 3 characters words, you can replace the last "\w" by a space. In this case think to handle the last word of the phrase.

Regex between and including characters across multiple lines

I have the below text:
BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING VERIFICATION LETTER
Document Handle: 712826
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 261711
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558564.PDF
BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING CODES COMPLIANCE LETTER
Document Handle: 712825
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 19441
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558563.pdf
I need to use regex (which will go in a C# program) to convert this into something effective for a CSV. The data that is most vital is the document handle and filename (path) from each section (being a section under "BEGIN:") I'm working on this for someone else, so I'd like to retain as much as possible in the event they decide they need some of the other data. This was my initial attempt:
\r\n(?!BEGIN).*\:
However, not every section has an "Ad Hoc:" component, which throws off the cell alignment when pulled into Excel. Ad Hoc I know for sure is not part of the data that is needed for the end result.
The best case scenario would be to just select and remove everything between every "Ad Hoc" and "Handle:" to be replaced with the delimiter (;). I would then pipe this along with my above regex.
My only other requirement is that this has to all be in one regex statement - otherwise in the program I've written I'll have to set up some sort of loop or while business which I'm not prepared to do yet.
Based on what i understood from the comments underneath the question, the example data given in the question should be transformed into two text lines like this:
Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf
To achieve this result while avoiding a loop (although i wonder why you would want to avoid loops - they are basic and omni-present constructs), i would suggest applying two (or three, see section 3. below) regex substitutions.
1. Removal of "Label:" and replacement of line breaks with ";"
The first regular expression will remove a label in front of ":" including ":" and any preceding line break with a semicolon. However, it will not remove or replace a line break in front of "BEGIN:", and neither will it touch the "BEGIN:" itself.
#"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*"
This regex is an OR-combination of two regex (which is easy to see in the visualization above):
[\r\n]+\s*Ad\sHoc:.*?[\r\n]+.*?:\s*
which will match Ad Hoc:" lines including any "Label:" string in the following line, and
([\r\n]+(?!\s*BEGIN)).*?:\s*
which will match any "Label:" including the line break in front of it, except for the "BEGIN:" label.
Applying this regex to your example and replacing all matches with ";" will result in the following:
BEGIN:;Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
BEGIN:;Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf
Note the "BEGIN:;" which we will take care of now.
2. Elimination of the "BEGIN:" labels
This is rather simple pattern when looking at the result of the first regex substitution.
"(?m)^BEGIN:;"
You might think that you can do this through a string replacement - and so did i when writing the first version of my answer. However, a mere string replacement would become a problem when "BEGIN:;" could be part of the content of any other text field. Better to be correct and safe by specifying a regex which matches only at the beginning of a line.
3. Code example, including elimination of empty lines in the source text
If you have empty lines containing white-spaces in the source text, the regular expression displayed above might not work properly. The solution is to do another regex substitution beforehand, which reduces empty lines (including white-spaces) to a single line break (if you are certain that your source data does not contain empty lines, you can omit this step).
A complete code example, which would produce the result as mentioned at the beginning of my answer, could look like this:
string sourceData = ... your text with the source data ...
Regex reEmptyLines = new Regex(#"[\s\r\n]+[\r\n]", RegexOptions.Compiled);
Regex reSemicolons = new Regex(#"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*", RegexOptions.Compiled);
Regex reBegin = new Regex("(?m)^BEGIN:;", RegexOptions.Compiled);
string processed =
reBegin.Replace(
reSemicolons.Replace(
reEmptyLines.Replace(sourceData, "\r\n"),
";"
),
string.Empty
);
You can use the regex, but I wouldn't say it is easier than doing it in cycle manually.
(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)
Sample code:
foreach (Match m in Regex.Matches(text, #"(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)"))
{
Console.WriteLine(string.Join(",", m.Groups["value"].Captures.Cast<Capture>().Select(c => c.Value)));
}
Output:
Zoning Letter,4/16/2014,355211,712826,102,367,0,261711,0,1,0,0,16,0,2,0,0,\V367\2855\1558564.PDF
Zoning Letter,4/16/2014,355211,712825,102,367,0,19441,0,1,0,0,16,0,2,0,0,\V367\2855\1558563.pdf
How's this:
BEGIN:((?:(?!BEGIN:).)*)
This would match everything between the first BEGIN and the next.

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that format Brazilian phone numbers, but I want it to do it matching from the end of the string, and not the beginning, so it would turn input strings according to the following pattern:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"
Since the begining portion is what usually changes, I thought of building the match using the $ sign so it would start at the end, and then capture backwards (so I thought), replacing then by the desired end format, and after, just getting rid of the parentesis "()" in front if they were empty.
This is the C# code:
s = "5135554444";
string str = Regex.Replace(s, #"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, #"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, #"^\(\) ", ""); //Get rid of empty () at the beginning
The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"
It seems that it ignores the $ at the end to do the match, except that if I test with something less than 7 digits it goes like this:
"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"
Notice that it keeps the "minimum" {n} number of times of the third capture group always capturing it from the end, but then, the first two groups are capturing from the beginning as if the last group was non greedy from the end, just getting the minimum... weird or it's me?
Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:
str = Regex.Replace(str, #"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");
"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?
I know this is probably some stupidity of mine, but wouldn't it be more reasonable if I want to capture at the end of the string, that all previous capture groups would be captured in reverse order?
I would think that "54444" would turn into "5-4444" in this last example... then it does not...
How would one accomplish this?
(I know maybe there's a better way to accomplish the very same thing using different approaches... but what I'm really curious is to find out why this particular behavior of the Regex seems odd. So, the answer tho this question should focus on explaining why the last capture is anchored at the end of the string, and why the others are not, as demonstrated in this example. So I'm not particularly interested in the actual phone # formatting problem, but to understand the Regex sintax)...
Thanks...
So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?
Use
^(\d{0,2}?)(\d{0,4})(\d{4})$
As a C# snippet, commented:
resultString = Regex.Replace(subjectString,
#"^ # anchor the search at the start of the string
(\d{0,2}?) # match as few digits as possible, maximum 2
(\d{0,4}) # match up to four digits, as many as possible
(\d{4}) # match exactly four digits
$ # anchor the search at the end of the string",
"($1) $2-$3", RegexOptions.IgnorePatternWhitespace);
By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i. e. tell it to match as few characters as possible while still allowing an overall match to be found.
Without the ? in the first group, what would happen when trying to match 123456?
First, the \d{0,2} matches 12.
Then, the \d{0,4} matches 3456.
Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.
Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match.
You have to tell it that it's OK if the first matching groups aren't there, but not the last one:
(\d{0,2}?)(\d{0,4}?)(\d{1,4})$
Matches your examples properly in my testing.

Categories