How to escape a delimiter by doubling the delimiter in a regex

I need to split a string on a delimiter, but not where the delimiter is doubled.
For instance "\m55.\m207|DEFAULT||DEFAULT|55||207" once split should result in
\m55.\m207
DEFAULT||DEFAULT
55||207
I'm trying to do this with a regex. If it makes a difference, I'm using C# System.Text.RegularExpressions.Regex.
So far I have "[^|]\|[^|]", but that doesn't handle the case where an escaped delimiter is next to the delimiter, i.e. |||.
I'm sure there is a solution on the net, but I've tried searching with multiple different terms and couldn't find the right combination of terms to find it.
How do I escape the delimiter by doubling it in a regex? Or, if there is a simpler solution, what is it?
EDIT
Here is a more complicated example:
Input: "\m55.\m207|DEFAULT||DEFAULT|||55||207"
Expected output:
"\m55.\m207"
"DEFAULT||DEFAULT||"
"55||207"

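Because your demo is so simple and you just want to split on a single |, you can use \b here. Note that this only works when the single | has a word character on each side, so it does not handle the ||| case from the edit.
string txt = @"\m55.\m207|DEFAULT||DEFAULT|55||207";
string pattern = @"\b\|\b";
foreach (var str in Regex.Split(txt, pattern))
{
    Console.WriteLine(str);
}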

(?<=[^|](?:\|{2})+)\|(?!\|)|(?<!\|)\|(?!\|)
You need to use lookarounds to make sure the split happens only on a single |, including a single | that directly follows one or more doubled || pairs.
See Demo
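As a minimal sketch of how the lookaround pattern above could be used, the following console program splits the more complicated example from the question. The class and method names (DoubledDelimiterSplit, SplitOnSingle) are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class DoubledDelimiterSplit
{
    // Splits on a "|" that is not part of a doubled "||" pair.
    // First alternative: a "|" preceded by a non-pipe plus one or more "||" pairs.
    // Second alternative: a lone "|" with no pipe on either side.
    public static string[] SplitOnSingle(string input)
    {
        const string pattern = @"(?<=[^|](?:\|{2})+)\|(?!\|)|(?<!\|)\|(?!\|)";
        return Regex.Split(input, pattern);
    }

    static void Main()
    {
        string input = @"\m55.\m207|DEFAULT||DEFAULT|||55||207";
        foreach (string part in SplitOnSingle(input))
        {
            Console.WriteLine(part);
        }
        // Prints:
        // \m55.\m207
        // DEFAULT||DEFAULT||
        // 55||207
    }
}
```

The doubled delimiters are left in place; if they should be collapsed to a single |, a part.Replace("||", "|") on each piece would be one way to do it.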

Related

How to use regex to match anything from A to B, where B is not preceded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various online testers, this works the way it should, but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the letter a, a white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.

ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use the following regex declaration:
(?:\\"|[^"])+
var rx = new Regex(@"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(@"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the @ in front of the string literal, which makes it a verbatim string literal: we double the quotes to match literal quotes and use single backslashes instead of double ones. This makes regexes easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(@"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, according to RegexHero test results.
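A small sketch of the quoted-string pattern in use, assuming the quoted text sits somewhere inside a longer input. The helper names (QuotedStringDemo, FirstQuoted) and the sample input are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedStringDemo
{
    // A quoted string whose body may contain backslash-escaped characters.
    static readonly Regex Quoted =
        new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");

    // Returns the body of the first quoted string, or null if there is none.
    public static string FirstQuoted(string input)
    {
        Match m = Quoted.Match(input);
        return m.Success ? m.Groups["res"].Value : null;
    }

    static void Main()
    {
        // The runtime value of this literal is: say "a \"b\" c" now
        string input = @"say ""a \""b\"" c"" now";
        Console.WriteLine(FirstQuoted(input)); // a \"b\" c
    }
}
```

Because every backslash is consumed together with the character that follows it, an escaped quote can never be mistaken for the closing quote.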

The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
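Zero or more occurrences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is greedy: it first consumes everything up to the end of the line, and the engine then backtracks until the final quote can be matched by the literal " rather than by the "any single character" . part. As a consequence, if a line contains several quoted strings, the match runs to the last quote on that line.
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results.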
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
    const string pattern = @"""(?<Result>(\\""|.)*)""";
    const RegexOptions options = RegexOptions.ExplicitCapture;
    Regex regex = new Regex(pattern, options);
    var match = regex.Match(input);
    var substring = match.Groups["Result"].Value;
    return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a static field and use the RegexOptions.Compiled option. This pre-compiles the expression and gives you faster throughput at the expense of longer initialization.
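The optimization described above might look roughly like this. It is a sketch only; the class and member names (QuotedStringMatcher, Extract) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedStringMatcher
{
    // Built once and reused; RegexOptions.Compiled trades a slower
    // construction for faster matching on repeated use.
    private static readonly Regex Quoted = new Regex(
        @"""(?<Result>(\\""|.)*)""",
        RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    public static string Extract(string input)
    {
        return Quoted.Match(input).Groups["Result"].Value;
    }

    static void Main()
    {
        // The runtime value of this literal is: "a \"b\" c"
        Console.WriteLine(Extract(@"""a \""b\"" c""")); // a \"b\" c
    }
}
```

If there is no match, Groups["Result"].Value is an empty string, so callers that need to tell "no match" from "empty quotes" should check Match.Success first.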

c#: how to quickly insert a character into string in front of any occurrences of special combinations of characters?

I have a string where I need to escape any occurrences of special combinations of characters. In other words, I need to stick a "\" in front of any occurrence of any such combination. Most combinations are actually single characters (e.g. a double quote or a backslash), but some are multi-character (e.g. "&&"). One approach is to create an array of strings with these combinations, loop over them and run a String.Replace(), with the backslash being checked the last to avoid recursive escaping. But is there a better (more elegant/quick/etc.) way of doing it? Thx

Use your idea of Replace, but with a StringBuilder instead (much better performance).
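A rough sketch of the StringBuilder variant follows. The list of special combinations and the names (Escaper, Specials, Escape) are assumptions made for illustration. One detail worth noting: in this sketch the backslash is handled first, not last, because a backslash replacement that ran last would also double the backslashes inserted by the earlier replacements.

```csharp
using System;
using System.Text;

static class Escaper
{
    // Hypothetical list of combinations to escape. The backslash comes
    // first so that backslashes added by later steps are not escaped again.
    static readonly string[] Specials = { "\\", "'", "\"", "&&" };

    public static string Escape(string input)
    {
        var sb = new StringBuilder(input);
        foreach (string special in Specials)
        {
            sb.Replace(special, "\\" + special);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // The runtime value of this literal is: abc'def&&aa\cc"ff
        Console.WriteLine(Escape(@"abc'def&&aa\cc""ff"));
        // Prints: abc\'def\&&aa\\cc\"ff
    }
}
```

This stays correct only as long as no combination in the list contains another one (for example both "&" and "&&"); in that case a single regex pass is easier to reason about.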

You can use Regex.Replace for this.
var input = @"abc'def&&aa\cc""ff";
var output = Regex.Replace(input, @"'|&&|""|\\", m => @"\" + m); // => "abc\'def\&&aa\\cc\"ff"

You can just take your entire string and run String.Replace() for each replacement type you want to do. As far as I know, that is the quickest and most elegant way to do it; that's why it is a built-in method.

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a log file, the regex should be able to handle any line of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
I'd appreciate any answer using C#, .NET and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follow the same format and pattern.

Incredibly complex regex incoming:
logFile.Split(' ');

This will give you an array that you can iterate through to retrieve all of the "lines" which are separated by a space
string[] lines = log.Split(' ');

You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex(@"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group when you call the Match method.
As the OP pointed out in a comment, the separator can be something other than a single space. In that case the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. For example, if both space and tab are possible separators, replace those parts with (?: |\t|$) and [^ \t]. If you need to accept more than one of those characters in a row as a separator, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*
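To illustrate how the captures of the "s" group can be read back, here is a minimal sketch using the single-space version of the pattern. The shortened log line and the names (CaptureDemo, Fields) are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureDemo
{
    // Returns every capture of the repeated "s" group as an array.
    public static string[] Fields(string line)
    {
        Match m = Regex.Match(line, @"(?:(?<s>[^ ]+)(?: |$))*");
        CaptureCollection captures = m.Groups["s"].Captures;
        var result = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            result[i] = captures[i].Value;
        }
        return result;
    }

    static void Main()
    {
        // Shortened, hypothetical log line in the same shape as the question's.
        string line = "2012-05-15 00:49:02 192.168.100.10 POST /default.eas 443";
        foreach (string field in Fields(line))
        {
            Console.WriteLine(field);
        }
    }
}
```

Note that this relies on .NET keeping every capture of a repeated group in Group.Captures; many other regex engines keep only the last one.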

The fastest and most obvious way is to use String.Split:
string[] substrings = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
But if you insist on a MatchCollection, then this will do what you want:
MatchCollection substrings = Regex.Matches(line, @"\S+");

Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be possible that it will be different based on region/culture settings?
(?<LogDate>\d\d\d\d-\d\d-\d\d)
Next, you have the time. Same thing:
(?<LogTime>\d\d:\d\d:\d\d)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's either going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set. For example:
(?<LogDate>\d\d\d\d-\d\d-\d\d) (?<LogTime>\d\d:\d\d:\d\d) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!
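A sketch of the named-group approach applied to the start of the sample log line. The ServerIp group is an assumption added here, because the sample line has an IP address between the time and the method; the remaining fields are left unparsed.

```csharp
using System;
using System.Text.RegularExpressions;

static class LogLineParser
{
    // Only the first four fields are parsed. "ServerIp" is a hypothetical
    // group name for the field between the time and the HTTP method.
    static readonly Regex Prefix = new Regex(
        @"^(?<LogDate>\d\d\d\d-\d\d-\d\d) (?<LogTime>\d\d:\d\d:\d\d) " +
        @"(?<ServerIp>\S+) (?<LogMethod>POST|GET) ");

    public static Match Parse(string line)
    {
        return Prefix.Match(line);
    }

    static void Main()
    {
        string line = "2012-05-15 00:49:02 192.168.100.10 POST " +
                      "/Microsoft-Server-ActiveSync/default.eas 443";
        Match m = Parse(line);
        if (m.Success)
        {
            Console.WriteLine(m.Groups["LogDate"].Value);   // 2012-05-15
            Console.WriteLine(m.Groups["LogTime"].Value);   // 00:49:02
            Console.WriteLine(m.Groups["LogMethod"].Value); // POST
        }
    }
}
```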

C# regex not matching string

I have a string which is formatted like this: $20,$40,$AA,$FF. Basically, hex numbers and they can be of many bytes. I want to check if a string is in the above format, so I tried something like this:
string a = "$20,$30,$40";
Regex reg = new Regex(@"$[0-9a-fA-F],");
if (a.StartsWith(string.Format("{0}{1}", reg, reg)))
MessageBox.Show("A");
It doesn't seem to work though, is there anything I'm missing?

$ is a special character in regular expressions and means end of string. That regex won't match anything at all since you're specifying stuff after the string end. Escape the $ character like
"\$[0-9a-fA-F]{2},"
Anyway, AFAIK this will not work with your string since it doesn't end with a ",". You might try:
"^(\$[0-9a-fA-F]{2},?)+$"
You can even simplify the regex by using case-insensitive regex matching:
Regex reg = new Regex(@"^(\$[0-9A-F]{2},?)+$", RegexOptions.IgnoreCase);
EDIT: corrected to match exactly 2 hexadecimal digits.
EDIT: maybe you should write your regex checking like:
if (Regex.IsMatch(a, @"^(\$[0-9A-F]{2},?)+$", RegexOptions.IgnoreCase))
{
    // Do whatever
}
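Because the comma is optional in the pattern above, it also accepts inputs such as "$20$30" or "$20,". If that is not wanted, a stricter variant could be sketched as follows. This variant is an assumption, not part of the answer, and the names (HexListValidator, IsHexList) are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class HexListValidator
{
    // One "$hh" item, followed by zero or more ",$hh" items.
    static readonly Regex HexList = new Regex(
        @"^\$[0-9A-F]{2}(?:,\$[0-9A-F]{2})*$",
        RegexOptions.IgnoreCase);

    public static bool IsHexList(string input)
    {
        return HexList.IsMatch(input);
    }

    static void Main()
    {
        Console.WriteLine(IsHexList("$20,$40,$AA,$ff")); // True
        Console.WriteLine(IsHexList("$20,$40,"));        // False (trailing comma)
        Console.WriteLine(IsHexList("$20$40"));          // False (missing comma)
        Console.WriteLine(IsHexList("$20,$4"));          // False (one digit)
    }
}
```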

I think you are missing a quantifier:
"\$[0-9a-fA-F]+,"
For the problem with the comma at the end, I would simply append one at the end to keep the regex as simple as possible. But this is just the way I would do it.

There are 3 things that need to be changed:
Need to escape your $ symbol as it represents end of line.
\$
Need to tweak your regex pattern to match the entire string instead of parts.
^(\$[0-9a-fA-F]{2},+)+\$[0-9a-fA-F]{2}$
Need to change your code to use Regex.IsMatch.
string a = "$20,$30,$40";
if (Regex.IsMatch(a, @"^(\$[0-9a-fA-F]{2},+)+\$[0-9a-fA-F]{2}$", RegexOptions.IgnoreCase))
    MessageBox.Show("A");
PS:
If the input string has white space like a tab or a space in between, then this regex will need to be modified. In such cases, you have to use "\s" at the right positions. For example, if you have white space around the commas like
string a = "$20 ,$30, $40";
then you need to tweak your RegEx this way:
^(\$[0-9a-fA-F]{2}\s*,+\s*)+\$[0-9a-fA-F]{2}\s*$
Old answer below (Ignore):
Try this:
"\$[0-9a-fA-F]{2}?[,]{0,1}"

You might also want to add a repeat modifier to your set, so that it becomes:
"\$[0-9a-fA-F]+,"

.NET Regular Expression: Get paragraphs

I'm trying to get paragraphs from a string in C# with Regular Expressions.
By paragraphs; I mean string blocks ending with double or more \r\n. (NOT HTML paragraphs <p>)...
Here is a sample text:
For example this is a paragraph with a carriage return here
and a new line here. At this point, second paragraph starts. A paragraph ends if double or more \r\n is matched or if reached at the end of the string ($).
I tried the pattern:
Regex regex = new Regex(@"(.*)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Multiline);
but this does not work. It matches every line ending with a single \r\n. What I need is to get all characters including single carriage returns and newline chars till reached a double \r\n.

.* is greedy and consumes as much as it can. By default . does not match \n, so (.*) stops at the end of each line, and because RegexOptions.Multiline makes $ match at the end of every line, the $ alternative of your second group succeeds there. To make the .* lazy instead of greedy, follow it with a ?.
RegexOptions.Multiline only changes ^ and $ so that they match at the start and end of each line. Use RegexOptions.Singleline instead, so that . also matches \n and $ matches only at the end of the input (or before a final newline).
Regex regex = new Regex(@"(.*?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Singleline);
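A minimal sketch of how the Singleline pattern could be used to collect paragraphs. The names (ParagraphMatcher, GetParagraphs) and the sample text are illustrative. Since the pattern can also match the empty string, for instance at the very end of the input, empty captures are skipped here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ParagraphMatcher
{
    static readonly Regex Paragraph = new Regex(
        @"(.*?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Singleline);

    public static List<string> GetParagraphs(string text)
    {
        var result = new List<string>();
        foreach (Match m in Paragraph.Matches(text))
        {
            // Group 1 holds the paragraph text without its separator.
            if (m.Groups[1].Length > 0)
            {
                result.Add(m.Groups[1].Value);
            }
        }
        return result;
    }

    static void Main()
    {
        string text = "First line\r\nstill first.\r\n\r\nSecond paragraph.";
        List<string> paragraphs = GetParagraphs(text);
        Console.WriteLine(paragraphs.Count); // 2
    }
}
```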

An opposite approach will be to match the separators instead of the paragraphs, making the problem almost trivial. Consider:
string[] paragraphs = Regex.Split(text, @"^\s*$", RegexOptions.Multiline);
By splitting the input string on empty lines you can easily get all paragraphs. If you only want blank lines with no spaces, you can simplify that even further and use the pattern ^$. In that case you can also use the non-regex String.Split, with an array of separators:
string[] separators = {"\n\n", "\r\r", "\r\n\r\n"};
string[] paragraphs = text.Split(separators,
    StringSplitOptions.RemoveEmptyEntries);
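The String.Split variant can be tried out with a short program like the one below. It is a sketch; the class name, method name and sample text are made up for the example.

```csharp
using System;

static class ParagraphSplitter
{
    static readonly string[] Separators = { "\n\n", "\r\r", "\r\n\r\n" };

    public static string[] SplitParagraphs(string text)
    {
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string text = "First line\r\nstill first.\r\n\r\nSecond paragraph.";
        string[] paragraphs = SplitParagraphs(text);
        Console.WriteLine(paragraphs.Length); // 2
        Console.WriteLine(paragraphs[1]);     // Second paragraph.
    }
}
```

Single line breaks inside a paragraph are preserved. With an odd number of consecutive line breaks, a leftover newline can remain at the start of the next paragraph, so calling Trim() on each piece may be worthwhile.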

Do you have to use a regular expression? Tools like COCO/R could make this job pretty easy as well. In addition it might just prove to be faster than generating code at runtime using a regex.
COMPILER YourParaProcessor
// your code goes here
TOKENS
newLine= '\r'|'\n'.
paraLetter = ANY - '\n' - '\r' .
YourParaProcessor
=
{Paragraph}
.
Paragraph =
{paraLetter} '\r\n' .
