How to supress whitespace regex and evenly space during Split

How to supress whitespace regex and evenly space during Split - c#

I'm trying to create a regex to tokenize a string. An example of the string is
ExeScript.ps1 $1, $0,%1, %3
Or it may be carelessly typed in as . ExeScript.ps1 $1, $0,%1, %3`
I use a simple regex string.
Regex RE = new Regex(#"[\s,\,]");
return (RE.Split(ActionItem));
I get a whole bunch of zeros between the script1.ps1 and the $1
When it is evenly spaced, as in first example I get no space between the script1.ps1 and the $1.
What am I doing wrong. How do I supress whitespace and ensure each array cell has a value in it which is now a whitespace.
Bob.

Try this regex:
Regex RE = new Regex(#"[\s,\,]+");
The + makes the delimiter "one or more" of the previous items. This may help, but it won't detect the situation of two commas (it would interpret it as one delimiter, which may not be what you want). Another possibility would be:
Regex RE = new Regex(#"\s*,\s*");
which is zero or more spaces, followed by a comma, followed by zero or more spaces.
You may also have to decide how you want to handle inputs such as:
foo, ,bar
which you might view as a list of three items, the second of which is a single space.

Related

Matching multiple different start and end of line in one multiline regex [duplicate]

I need to remove lines that match a particular pattern from some text. One way to do this is to use a regular expression with the begin/end anchors, like so:
var re = new Regex("^pattern$", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works fine except that it leaves an empty line instead of removing the entire line (including the line break).
To solve this, I added an optional capturing group for the line break, but I want to be sure it includes all of the different flavors of line breaks, so I did it like so:
var re = new Regex(#"^pattern$(\r\n|\r|\n)?", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works, but it seems like there should be a more straightforward way to do this. Is there a simpler way to reliably remove the entire line including the ending line break (if any)?

To match any single line break sequence you may use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029]) pattern. So, instead of (\r\n|\r|\n)?, you can use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?.
Details:
‎000A - a newline, \n
‎000B - a line tabulation char
‎000C - a form feed char
‎000D - a carriage return, \r
‎0085 - a next line char, NEL
‎2028 - a line separator char
‎- 2029 - a paragraph separator char.
If you want to remove any 0+ non-horizontal (or vertical) whitespace chars after a matched line, you may use [\s-[\p{Zs}\t]]*: any whitespace (\s) but (-[...]) a horizontal whitespace (matched with [\p{Zs}\t]). Note that for some reason, \p{Zs} Unicode category class does not match tab chars.
One more aspect must be dealt with here since you are using the RegexOptions.Multiline option: it makes $ match before a newline (\n) or end of string. That is why if your line endings are CRLF the pattern may fail to match. Hence, add an optional \r? before $ in your pattern.
So, either use
#"^pattern\r?$(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?"
or
#"^pattern\r?$[\s-[\p{Zs}\t]]*"

Removing commas from numbers with .NET regex

So I'm processing a report that (brilliantly, really) spits out number values with commas in them, in a .csv output. Super useful.
So, I'm using (C#)regex lookahead positive and lookbehind positive expressions to remove commas that have digits on both sides.
If I use only the lookahead, it seems to work. However when I add the lookbehind as well, the expression breaks down and removes nothing. Both ends of the comma can have arbitrary numbers of digits around them, so I just want to remove the comma if the pattern has one or more digits around it.
Here's the expression that works with the lookahead only:
str = Regex.Replace(str, #"[,](?=(\d+)),"");
Here's the expression that doesn't work as I intend it:
str = Regex.Replace(str, #"[,](?=(\d+)?<=(\d+))", "");
What's wrong with my regex! If I had to guess, there's something I'm misunderstanding about how lookbehind works. Any ideas?

You may use any of the solutions below:
var s = "abc,def,2,100,xyz!,:))))";
Console.WriteLine(Regex.Replace(s, #"(\d),(\d)", "$1$2")); // Does not handle 1,2,3,4 cases
Console.WriteLine(Regex.Replace(s, #"(\d),(?=\d)", "$1")); // Handles consecutive matches with capturing group+backreference/lookahead
Console.WriteLine(Regex.Replace(s, #"(?<=\d),(?=\d)", "")); // Handles consecutive matches with lookbehind/lookahead, the most efficient way
Console.WriteLine(Regex.Replace(s, #",(?<=\d,)(?=\d)", "")); // Also handles all cases
See the C# demo.
Explanations:
(\d),(\d) - matches and captures single digits on both sides of , and $1$2 are replacement backreferences that insert captured texts back into the result
(\d),(?=\d) - matches and captures a digit before ,, then a comma is matched and then a positive lookahead (?=\d) requires a digit after ,, but since it is not consumed, onyl $1 is required in the replacement pattern
(?<=\d),(?=\d) - only such a comma is matched that is enclosed with digits without consuming the digits ((?<=\d) is a positive lookbehind that requires its pattern match immediately to the left of the current location)
,(?<=\d,)(?=\d) - matches a comma and only after matching it, the regex engine checks if there is a digit and a comma immediately before the location (that is after the comma), and if the check if true, the next char is checked for a digit. If it is a digit, a match is returned.
RegexHero.net test:
Bonus:
You may just match a pattern like yours with \d,\d and pass the match to the MatchEvaluator method where you may manipulate the match further:
Console.WriteLine(Regex.Replace(s, #"\d,\d", m => m.Value.Replace(",",string.Empty))); // Callback method
Here, m is the match object and m.Value holds the whole match value. With .Replace(",",string.Empty), you remove all commas from the match value.

You can always check a website that evaluates regex expressions.
I think this code might be able to help you:
str = Regex.Replace(str, #"[,](?=(\d+))(?<=(\d))", "");

.NET Regular Expression white space special characters

This pattern is not working sometimes (it works only for the 3rd instance). The pattern is ^\s*flood\s{55}\s+\w+
I am new to regular expression and I am trying to write a regular expression that captures all the following conditions:
Example 1: flood a)
Example 2: flood As respects
Example 3: flood USD100,000
(it's in a tabular format and there's a lot of space between flood and the next word)

Your expression is saying:
^\s* The start of the string may have zero or more whitespace characters
flood followed by the string flood
\s{55} followed by exactly 55 whitespace characters
\s+\w+ followed by one or more whitespace characters and then one or more word characters.
If you want a minimum number of whitespace characters, say at least 30, followed by one or more word chraracters, then you could do this:
^\s*flood\s{30,}\w+

Try this:
string input =
#" flood a)
flood As respects
flood USD100,000";
string pattern = #"^\s*flood\s+.+$";
MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.Multiline);

If there are a lot of spaces between flood and the next word you could omit \s{55} which is a quantifier that matches a whitespace character 55 times.
That would leave you with ^\s*flood\s+\w+ which does not yet match all the values at the end because \w matches a word character but not a whitespace or any of ),.
To match your values you might use a character class and add the characters that you allow to match:
^\s*flood\s+[\w,) ]+
Or if you want to match any character you could use a dot instead of a character class.
According to your comment, you might use a positive lookbehind:
(?<=\(13\. Deductible\))\s*(\s*flood\s+[\w,) ]+)+
Demo

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a logfile it the regex should be able to handle any line that is of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
Appreciate any answer using C#, .net and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follows the same format and pattern.

Incredibly complex regex incoming:
logFile.Split(' ');

This will give you an array that you can iterate through to retrieve all of the "lines" which are separated by a space
string[] lines = log.Split(' ');

You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex (#"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group, when you call the Match method.
As the OP pointed out in a comment that the separator can be anything appart from a single space, then the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. I.e. if space as well as tab are possible separators, replace that part with (?: |\t|$) and [^ \t]. If you need to accept more than one of those characters as separators, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*

The fastest and most obvious way is to use String.Split:
string[] substrings = result = line->Split( nullptr, StringSplitOptions::RemoveEmptyEntries );
But if you insist on a MatchCollection then this will do what you want
MatchCollection ^ substrings = Regex.Matches(line, "\\S+")

Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be possible that it will be different based on region/culture settings?
(?<LogDate>dddd-dd-dd)
Next, you have the time. Same thing:
(?<LogTime>dd:dd:dd)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's either going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set. IE:
(?<LogDate>dddd-dd-dd) (?<LogTime>dd:dd:dd) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!

Format String : Parsing

I have a parsing question. I have a paragraph which has instances of :  word  . So basically it has a colon, two spaces, a word (could be anything), then two more spaces.
So when I have those instances I want to convert the string so I have
A new line character after : and the word.
Removed the double space after the word.
Replace all double spaces with new line characters.
Don't know exactly how about to do this. I'm using C# to do this. Bullet point 2 above is what I'm having a hard time doing this.
Thanks

Assuming your original string is exactly in the form you described, this will do:
var newString = myString.Trim().Replace(" ", "\n");
The Trim() removes leading and trailing whitespaces, taking care of your spaces at the end of the string.
Then, the Replace replaces the remaining " " two space characters, with a "\n" new line character.
The result is assigned to the newString variable. This is needed, as myString will not change - as strings in .NET are immutable.
I suggest you read up on the String class and all its methods and properties.

You can try
var str = ": first : second ";
var result = Regex.Replace(str, ":\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2}",
":\n${word}\n");

Using RegularExpressions will give you exact matches on what you are looking for.
The regex match for a colon, two spaces, a word, then two more spaces is:
Dim reg as New Regex(": [a-zA-Z]* ")
[a-zA-Z] will look for any character within the alphabetical range. Can append 0-9 on as well if you accept numbers within the word. The * afterwards indicated that there can be 0 or more instances of the preceding value.
[a-zA-Z]* will attempt to do a full match of any set of contiguous alpha characters.
Upon further reading, you may use [\w] in place of [a-zA-Z0-9] if that's what you are looking for. This will match any 'word' character.
source: http://msdn.microsoft.com/en-us/library/ms972966.aspx
You can retrieve all the matches using reg.Matches(inputString).
Review http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx for more information on regular expression replacements and your options from there out
edit: Before I was using \s to search for spaces. This will match any whitespace character including tabs, new lines and other. That is not what we want, so I reverted it back to search for exact space characters.

You can use string.TrimEnd - http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx - to trim spaces at the end of the string.

The following is an example using Regular Expressions. See also this question for more info.
Basically the pattern string tells the regex to look for a colon followed by two spaces. Then we save in a capture group named "word" whatever the word is surrounded by two spaces on either side. Finally two more spaces are specified to finish the pattern.
The replace uses a lambda which says for every match, replace it with a colon, a new line, the "lone" word, and another newline.
string Paragraph = "Jackdaws love my big sphinx of quartz: fizz The quick onyx goblin jumps over the lazy dwarf. Where: buzz The crazy dogs.";
string Pattern = #": (?<word>\S*) ";
string Result = Regex.Replace(Paragraph, Pattern, m =>
{
var LoneWord = m.Groups[1].Value;
return #":" + Environment.NewLine + LoneWord + Environment.NewLine;
},
RegexOptions.IgnoreCase);
Input
Jackdaws love my big sphinx of quartz: fizz The quick onyx goblin jumps over the lazy dwarf. Where: buzz The crazy dogs.
Output
Jackdaws love my big sphinx of quartz:
fizz
The quick onyx goblin jumps over the lazy dwarf. Where:
buzz
The quick brown fox.
Note, for item 3 on your list, if you also want to replace individual occurrences of two spaces with newlines, you could do this:
Result = Result.Replace(" ", Environment.NewLine);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to supress whitespace regex and evenly space during Split - c#

Related

Matching multiple different start and end of line in one multiline regex [duplicate]

Removing commas from numbers with .NET regex

.NET Regular Expression white space special characters

Matching a substring of any length and characters using RegEx

Format String : Parsing

Categories

Resources