Regex match pattern - c#

I am trying to count the number of times \n\n\n (3 line breaks) appears after some text. It counts almost the way I want; the problem is that when newlines are spammed, they get counted too, which I don't want.
Edit: It seems the regex from regexr does not support .NET, so I have to come up with a pattern that works with .NET.
Example for the text that the regex will check on:
Description for something
text \n \n \n // this will make DescriptionAmount++
Description for something
text\n \n \n // this will make DescriptionAmount++
\n \n \n // this shouldn't add on DescriptionAmount
Here's the code I've done so far.
int DescriptionAmount = Regex.Matches(somestring, @"[^\w$](\r\n){2}[^\w$]").Count;

Try using a quantifier {x,y} to select how many tokens you want to match.
The '*' will match the preceding character 0 or many times, meaning it will match any \n after the 3rd token.
\n{3} says \n must be matched 3 times no more no less.
I find this tool http://regexr.com/ very useful for building and debugging regex statements.

To ensure you capture the 3 linebreaks after some text would take something like:
\w+\s*\n{3}
Since this is .NET, you either need to put an @ in front (a verbatim string):
@"\w+\s*\n{3}"
or escape the backslashes like:
"\\w+\\s*\n{3}"
You mentioned that you are searching for three \n, but your search has \r\n. If you are looking for \r\n instead, just add \r in front of the \n and surround with () for (\\r\\n) or (\r\n) in the expressions above.
One other thing, depending on the text in some string, you may want to apply the multiline option like:
Regex.Matches(somestring, "\\w+\\s*\n{3}", RegexOptions.Multiline)
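As a minimal sketch of the pattern suggested above, the snippet below counts blocks of text followed by three line feeds. The sample string and the helper name CountDescriptions are made up for illustration, and it assumes the input uses bare \n line endings. It is written with top-level statements (C# 9 or later) to keep it short.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: counts runs of word characters followed by three line feeds.
// The verbatim string (@"...") lets the regex engine interpret \w, \s and \n itself.
static int CountDescriptions(string text) =>
    Regex.Matches(text, @"\w+\s*\n{3}").Count;

// Two descriptions, each ending in three line feeds, then three spammed line feeds.
string sample = "Description for something\ntext\n\n\n" +
                "Description for something\ntext\n\n\n" +
                "\n\n\n";

Console.WriteLine(CountDescriptions(sample)); // prints 2
```

The extra line feeds at the end are not counted because every match has to start with at least one word character.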


Text files - how to programmatically mimic opening in Wordpad and overwriting as plain text [duplicate]

How can I replace lone instances of \n with \r\n (LF alone with CRLF) using a regular expression in C#?
I know how to do it using plain String.Replace, like:
myStr.Replace("\n", "\r\n");
myStr.Replace("\r\r\n", "\r\n");
However, this is inelegant, and would destroy any "\r+\r\n" already in the text (although they are not likely to exist).
It might be faster if you use this.
(?<!\r)\n
It basically looks for any \n that is not preceded by a \r. This would most likely be faster, because in the other case, almost every letter matches [^\r], so it would capture that, and then look for the \n after that. In the example I gave, it would only stop when it found a \n, and then look before that to see if it found \r.
Will this do?
[^\r]\n
Basically it matches a '\n' that is preceded with a character that is not '\r'.
If you want it to detect lines that start with just a single '\n' as well, then try
([^\r]|$)\n
Which says that it should match a '\n', but only one that is the first character of a line or that is not preceded by '\r'.
There might be special cases to check: since you're messing with the definition of lines itself, the '$' might not work too well. But I think you should get the idea.
EDIT: credit @Kibbee. Using a lookbehind is clearly better since it won't capture the matched preceding character and should help with any edge cases as well. So here's a better regex, and the code becomes:
myStr = Regex.Replace(myStr, "(?<!\r)\n", "\r\n");
I was trying to do the code below to a string and it was not working.
myStr.Replace("(?<!\r)\n", "\r\n")
I used Regex.Replace and it worked
Regex.Replace( oldValue, "(?<!\r)\n", "\r\n")
I guess that "myStr" is an object of type String, in that case, this is not regex.
\r and \n are the equivalents for CR and LF.
My best guess is that if you know that you have an \n for EACH line, no matter what, then you first should strip out every \r. Then replace all \n with \r\n.
The answer chakrit gives would also go, but then you need to use regex, but since you don't say what "myStr" is...
Edit:looking at the other examples tells me one thing.. why do the difficult things, when you can do it easy?, Because there is regex, is not the same as "must use" :D
Edit2: A tool is very valuable when fiddling with regex, xpath, and whatnot that gives you strange results, may I point you to: http://www.regexbuddy.com/
myStr = Regex.Replace(myStr, "([^\r])\n", "$1\r\n");
$ may need to be a \
Try this: Replace(Char.ConvertFromUtf32(13), Char.ConvertFromUtf32(10) + Char.ConvertFromUtf32(13))
If I know the line endings must be one of CRLF or LF, something that works for me is
myStr = Regex.Replace(myStr, "\r?\n", "\r\n");
This essentially does the same as neslekkiM's answer, except it performs only one replace operation on the string rather than two. This is also compatible with Regex engines that don't support negative lookbehinds or backreferences.
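To make the lookbehind approach from this thread concrete, here is a small sketch that converts lone LF characters to CRLF and leaves existing CRLF pairs alone. The helper name ToCrlf and the sample string are illustrative only; top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces every \n that is not already preceded by \r.
// Strings are immutable, so the result has to be returned (or assigned back).
static string ToCrlf(string s) => Regex.Replace(s, "(?<!\r)\n", "\r\n");

string mixed = "one\ntwo\r\nthree\n";
string normalized = ToCrlf(mixed);

Console.WriteLine(normalized == "one\r\ntwo\r\nthree\r\n"); // prints True
```

Note that String.Replace treats its arguments as literal text, which is why the pattern has to go through Regex.Replace.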

Regex filtering Regex, extra additional final \r

I want to filter a regex with a ... regex ...
My target is in a file which content is
...
information 1...
Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
information 2...
...
The file is transferred to mystring with the method ReadAllText(path); where path is the path to the text file.
I use the code
//Retrieve regex like ^\|1[\s\t]+[\S]+[\s\t]+(.*)$ in Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
//\d for any digit followed by =
// . for any character found 1 or + times, ended with space character
m = Regex.Match(mystring, @"Entity\d=(.+)\s");
string regex = m.Groups[1].Value;
which works almost fine
What I get is (seen from inside the debugger):
^\|1[\s\t]+[\S]+[\s\t]+(.*)$\r
There is an additional \r at the end of the result. It causes an unwanted extra newline in other parts of the code.
Trying @"Entity\d=(.+)" (i.e. removing the final \s) does not help.
Any idea how to avoid the additional \r gracefully? (If possible, I do not want to track down the final \r and remove it.)
Online regex testers like regex101 did not let me foresee this problem before going to C# code.
Use a negated character class to make sure \r is not matched:
m = Regex.Match(mystring, @"Entity\d=([^\r\n]+)");
The [^\r\n] class means match any character other than a carriage return and a line feed.
It is true that regex101 does not keep carriage returns. You can see the \r matching at regexhero.net:
Check if this works:
@"Entity\d=(.+)(?=(\r|\n))";
(?=(\r|\n)) is a positive lookahead and means that the \r or \n won't be included in the result.
Edit:
@"Entity\d=(.+?)(?=\r|\n)";
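The stray \r shows up because . in .NET matches every character except \n, so it happily consumes the carriage return of a CRLF line ending. A minimal sketch of the negated-class fix follows; the helper name ExtractEntityPattern and the in-memory file content are assumptions standing in for the real file read with ReadAllText. Top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text after "EntityN=" up to, but not including, the line ending.
static string ExtractEntityPattern(string fileText)
{
    Match m = Regex.Match(fileText, @"Entity\d=([^\r\n]+)");
    return m.Success ? m.Groups[1].Value : string.Empty;
}

// Stand-in for File.ReadAllText(path) on a file with CRLF line endings.
string fileText = "information 1...\r\n"
                + @"Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$"
                + "\r\ninformation 2...\r\n";

string pattern = ExtractEntityPattern(fileText);

Console.WriteLine(pattern == @"^\|1[\s\t]+[\S]+[\s\t]+(.*)$"); // prints True
Console.WriteLine(pattern.EndsWith("\r"));                      // prints False
```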

Regex to remove trailing whitespace and multiple blank lines

I'd like regex expressions to use in a Visual Studio 2013 extension written in C#.
I'm trying to remove trailing whitespaces from a line while preserving empty lines. I'd also like to remove multiple empty lines. The existing line endings should be preserved (generally carriage return line feed).
So the following text (spaces shown as underscores):
hello_world__
___hello_world_
__
__
hello_world
Would become:
hello_world
___hello_world
hello_world
I've tried a number of different patterns to remove the trailing spaces but I either end up not matching the trailing spaces or losing the carriage returns. I haven't yet tried to remove the multiple empty lines.
Here's a couple of the patterns I've tried so far:
\s+$
(?<=\S)\s+$
Thanks for the answers so far. None of them are quite right for what I need, but they've helped me come up with what I needed. I think the issue is that there are some oddities with regex in VS2013 (see Using Regular Expressions in Visual Studio). These two operations work for me:
Replace \ +(?=(\n|\r?$)) with nothing.
Replace ^\r?$(\n|\r\n){2,} with \r\n.
To remove multiple blank lines and trailing whitespace, use
(?:\r\n[\s-[\rn]]*){3,}
and replace with \r\n\r\n.
See demo
And to remove the remaining whitespace, you can use
(?m)[\s-[\r]]+\r?$
See demo 2
\ +(?=(\n|$))
This matches any number of spaces, then checks that a newline or the end of the line (the last characters in your string/text) follows. Multiline and global mode need to be enabled, of course.
Just as a punt, without the use of Regex, you could always split the document by its end of line marker and then feedback using TrimEnd (as highlighted by Anton Semenov)...
(Assuming a text document read into a string...)
// Ascertain the linefeed...
string str = "This is a test \r\nto see if I can force \ra string to be broken \non multiple lines \r\n into an array.";
string[] t = str.Split(new string[] { "\r\n", "\r", "\n" } ,StringSplitOptions.RemoveEmptyEntries);
thediv.InnerHtml = str + "<br /><br />";
foreach(string s in t)
{
thediv.InnerHtml += s.TrimEnd() + "<br />";
}
I haven't timed this at all, but if you prefer to avoid the complications of Regex (which I do if I can - see below*), you should find this fast enough to do what you want.
* I avoid Regex if I can. That doesn't mean that I don't use it. Regex has its place, but I believe it to be a last resort tool for involved jobs, for instance complex flexible strings that adhere to a format - something where the alternative will generate large amounts of code. Keeping Regex to an absolute minimum aids the readability of your code.
As separate operations -
Remove trailing whitespace any (?m)[^\S\r\n]+$
Remove trailing whitespace lines with text (?m)(?<=\S)[^\S\r\n]+$
Remove duplicate blank lines (along with whitespace trim)
# Find: (?>\A(?:[^\S\r\n]*\r\n)+)|(?>\r\n(?:[^\S\r\n]*(\r\n)){2,})
# Replace: $1\r\n
(?>
\A
(?: [^\S\r\n]* \r \n )+
)
|
(?>
\r \n
(?:
[^\S\r\n]*
( \r \n ) # (1)
){2,}
)
The \s includes the linefeed, I would search for just multiple blanks instead. I do not know the specifics of VS, but this should hopefully do it:
[" "]*?$
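Putting the ideas from these answers together, the following is a rough two-pass sketch using plain .NET Regex (not the Visual Studio Find/Replace dialect): first trim trailing spaces and tabs, then collapse repeated blank lines. One detail worth knowing is that with RegexOptions.Multiline, $ matches just before \n but not before \r, which is why the lookahead allows an optional \r. The helper names and the sample input are illustrative; top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Strip spaces/tabs at the end of each line; the lookahead leaves CR and LF untouched.
static string TrimTrailingWhitespace(string text) =>
    Regex.Replace(text, @"[^\S\r\n]+(?=\r?$)", "", RegexOptions.Multiline);

// Collapse two or more consecutive blank lines into one, reusing the first line ending found.
static string CollapseBlankLines(string text) =>
    Regex.Replace(text, @"(\r?\n)(?:\r?\n){2,}", "$1$1");

// The example from the question, with real spaces instead of underscores.
string input = "hello world  \r\n   hello world \r\n  \r\n  \r\nhello world";
string cleaned = CollapseBlankLines(TrimTrailingWhitespace(input));

Console.WriteLine(cleaned == "hello world\r\n   hello world\r\n\r\nhello world"); // prints True
```

Running the trim pass first matters here, because whitespace-only lines only become truly empty (and therefore collapsible) once their spaces are gone.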

Regex to match comma separated string with no comma at the end of the line

I am trying to write a regex that will allow input of all characters on the keyboard(even space) but will restrict the input of comma at the end of the line. I have tried this, which includes all the possible characters, but it still does not give me the correct output:
[RegularExpression("^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+$", ErrorMessage = "Comma is not allowed at the end of {0} ")]
^.*[^,]$
.* matches any characters, so the pattern doesn't need to be that long.
^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+(?<!,)$
Just add a negative lookbehind, (?<!,), right before the final $.
a regex that will allow input of all characters on the keyboard(even space) but will restrict the input of comma at the end of the line.
Mind that you can type much more than what you typed using a keyboard. Basically, you want to allow any character but a comma at the end of the line.
So,
(?!,).(?=\r\n|\z)
This regex checks each line (because of the (?=\r\n|\z) look-ahead), and the (?!,) look-ahead makes sure the last character (that we match using .) is not a comma. \z is an unambiguous string end anchor.
See regex demo
This will work even on a client side.
To also get the full line match, you can just add .* at the beginning of the pattern (as we are not using singleline flag, . does not match newline symbols):
.*(?!,).(?=\r\n|\z)
Or (making it faster with an atomic group or an inline multiline option with ^ start of line anchor, but will not work on the client side)
(?>.*)(?!,).(?=\r\n|\z)
(?m)^.*?(?!,).(?=\r\n|\z) // The fastest of the last three
See demo
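For a quick server-side check, the lookbehind idea from the answers above can be sketched as follows. This assumes a single-line value (. does not match \n without RegexOptions.Singleline), and the helper name HasNoTrailingComma is made up for illustration. Top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the whole single-line value does not end with a comma.
// \A and \z anchor to the very start and end of the string; (?<!,) rejects a comma just before the end.
static bool HasNoTrailingComma(string value) =>
    Regex.IsMatch(value, @"\A.*(?<!,)\z");

Console.WriteLine(HasNoTrailingComma("a,b,c"));  // prints True
Console.WriteLine(HasNoTrailingComma("a,b,c,")); // prints False
```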

Regex to exclude part of string on split

I asked a similar question a few weeks ago on how to split a string based on a specific substring. However, I now want to do something a little different. I have a line that looks like this (sorry about the formatting):
What I want to do is split this line at all the newline \r\n sequences. However, I do not want to do this if there is a PA42 after one of the PA41 lines. I want the PA41 and the PA42 line that follows it to be on the same line. I have tried using several regex expressions to no avail. The output that I am looking for will ideally look like this:
This is the regex that I am currently using, but it does not quite accomplish what I am looking for.
string[] p = Regex.Split(parameterList[selectedIndex], @"[\r\n]+(?=PA41)");
If you need any clarifications, please feel free to ask.
You're trying a positive look-ahead, but you want a negative one. (Positive ensures that the pattern does follow, whereas negative ensures it does not.)
(\\r\\n)(?!PA42)
Works for me.
string[] splitArray = Regex.Split(subjectString, @"\\r\\n(?!PA42)");
This should work. It uses a negative lookahead assertion to ensure that a \r\n sequence is not followed by PA42.
Explanation :
@"
\\ # Match the character “\” literally
r # Match the character “r” literally
\\ # Match the character “\” literally
n # Match the character “n” literally
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PA42 # Match the characters “PA42” literally
)
"
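Note that the pattern above matches the four literal characters \r\n (backslash, r, backslash, n) as they might appear in escaped text. If the input contains real carriage-return/line-feed characters instead, a single backslash in the verbatim string is what is needed. The sketch below assumes the latter; the record contents and the helper name SplitRecords are invented for illustration, since the original sample line is not shown. Top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on CRLF unless the next line starts with PA42,
// so a PA42 line stays attached to the line before it.
static string[] SplitRecords(string text) =>
    Regex.Split(text, @"\r\n(?!PA42)");

// Made-up sample data with real CRLF line endings.
string record = "PA41 first\r\nPA42 continuation\r\nPA41 second\r\nPA43 other";
string[] parts = SplitRecords(record);

Console.WriteLine(parts.Length);                                  // prints 3
Console.WriteLine(parts[0] == "PA41 first\r\nPA42 continuation"); // prints True
```

The first piece still contains the CRLF between the PA41 and PA42 lines; it can be replaced with a space afterwards if the two should appear on one physical line.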
