How can I create a regular expression that includes a line feed? - c#

I have this code:
Regex regex = new Regex(#"\[see=[^\]]*\]");
var list = phraseSources.ToList();
list.ForEach(item => item.JmdictMeaning = regex.Replace(item.JmdictMeaning, ""));
It removes all between "[see" and "]"
However I found out that before the "[see" is a line feed and I would like to remove this in addition to all between "[see" and "]".
I used a hex editor and I can see:
0D 0A
Which I think is causing the line feed but I am not sure how to make a regex that includes this.
Can someone tell me how I can remove this plus all that comes between "[see" and "]"

You can change your regex to also include carriage returns (0D, aka \r) and line feeds (0A, aka \n)
Regex regex = new Regex(#"\[see=[^\]]*\]|\r\n");
So now it matches anything that's [see=...] or \r\n.
If you expect either of them to appear by itself in the string you could also match something like this
Regex regex = new Regex(#"\[see=[^\]]*\]|[\r\n]");
It would clean up any \r and \n characters even if they appear alone.
If you meant that you want to only remove a \r\n that might appear before the [see=... part, then this would do
Regex regex = new Regex(#"(\r\n)?\[see=[^\]]*\]");

Related

Regex filtering Regex, extra additional final \r

I want to filter a regex with a ... regex ...
My target is in a file which content is
...
information 1...
Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
information 2...
...
The file is transferred to mystring with the method ReadAllText(path); where path is the path to the text file.
I use the code
//Retrieve regex like ^\|1[\s\t]+[\S]+[\s\t]+(.*)$ in Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
//\d for any digit followed by =
// . for any character found 1 or + times, ended with space character
m = Regex.Match(mystring, #"Entity\d=(.+)\s");
string regex = m.Groups[1].Value;
which works almost fine
What I get is ( seen from inside the degugger )
^\|1[\s\t]+[\S]+[\s\t]+(.*)$\r
There is an additional \r at the end of the result. It causes an unwanted extra newline in other parts of the code.
Trying #"Entity\d=(.+)" (i.e removing the final \s) does not help.
Any idea of how to avoid the additionnal \r gracefully ( I do not want,if possible, to track the finale \r and remove it )
Online regex tester like regex101 did not permit to foresee this problem before going to C# code
Use a negated character class to make sure \r is not matched:
m = Regex.Match(mystring, #"Entity\d=([^\r\n]+)");
The [^\r\n] class means match any character other than a carriage return and a line feed.
It is true that regex101 does not keep carriage returns. You can see the \r matching at regexhero.net:
Check if this works:
#"Entity\d=(.+)(?=(\r|\n))";
(?=(\r|\n)) is a positive lookahead and means that the \r or \n won't be included in the result.
Edit:
#"Entity\d=(.+?)(?=\r|\n)";

Regex.Replace removes '\r' character in "\r\n"

Here is a simple example
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, #"(?<=parameter\s*=).*", newValue.ToString());
text will be "parameter=250\n" after replacement. Replace() method removes '\r'. Does it uses unix-style for line feed by default? Adding \b to my regex (?<=parameter\s*=).*\b solves the problem, but I suppose there should be a better way to parse lines with windows-style line feeds.
Take a look at this answer. In short, the period (.) matches every character except \n in pretty much all regex implementations. Nothing to do with Replace in particular - you told it to remove any number of ., and that will slurp up \r as well.
Can't test now, but you might be able to rewrite it as (?<=parameter\s*=)[^\r\n]* to explicitly state which characters you want disallowed.
. by default doesn't match \n..If you want it to match you have to use single line mode..
(?s)(?<=parameter\s*=).*
^
(?s) would toggle the single line mode
Try this:
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, #"(parameter\s*=).*\r\n", "${1}" + newValue.ToString() + "\n");
Final value of text:
parameter=250\n
Match carriage return and newline explicitly. Will only match lines ending in \r\n.

Replace a string in multiline regex with end of line token

I got the following regex
var fixedString = Regex.Replace(subject, #"(:[\w]+ [\d]+)$", "",
RegexOptions.Multiline);
which doesn't work. It works if I use \r\n, but I would like to support all types of line breaks. As another answer states I have to use RegexOptions.Multiline to be able to use $ as end of line token (instead of end of string). But it doesn't seem to help.
What am I doing wrong?
I am not sure what you want to achieve, I think I understood, you want to replace also the newline character at the end of the row.
The problem is the $ is a zero width assertion. It does not match the newline character, it matches the position before \n.
You could do different other things:
If it is OK to match all following newlines, means also all following empty rows, you could do this:
var fixedString = Regex.Replace(subject, #"(:[\w]+ [\d]+)[\r\n]+", "");
If you only want to match the newline after the row and keep following empty rows, you have to make a pattern for all possible combinations, e.g.:
var fixedString = Regex.Replace(subject, #"(:[\w]+ [\d]+)\r?\n", "");
This would match the combination \n and \r\n

.NET Regular Expression: Get paragraphs

I'm trying to get paragraphs from a string in C# with Regular Expressions.
By paragraphs; I mean string blocks ending with double or more \r\n. (NOT HTML paragraphs <p>)...
Here is a sample text:
For example this is a paragraph with a carriage return here
and a new line here. At this point, second paragraph starts. A paragraph ends if double or more \r\n is matched or if reached at the end of the string ($).
I tried the pattern:
Regex regex = new Regex(#"(.*)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Multiline);
but this does not work. It matches every line ending with a single \r\n. What I need is to get all characters including single carriage returns and newline chars till reached a double \r\n.
.* is being greedy and consuming as much as it can. Your second set of () has a $ so the expression that is being used is (.*)(?). In order to make the .* not be greedy, follow it with a ?.
When you specify RegexOptions.Multiline, .NET will split the input on line breaks. Use RegexOptions.Singleline to make it treat the entire input as one.
Regex regex = new Regex(#"(.*?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Singleline);
An opposite approach will be to match the separators instead of the paragraphs, making the problem almost trivial. Consider:
string[] paragraphs = Regex.Split(text, #"^\s*$", RegexOptions.Multiline);
By splitting the input string by empty lines you can easily get all paragraphs. If you only want blank lines with no spaces you can simplify that even further, and use the parretn ^$. In that case you can also use the non-regex String.Split, with an array of separators:
string[] separators = {"\n\n", "\r\r", "\r\n\r\n"};
string[] paragraphs = text.Split(separators,
StringSplitOptions.RemoveEmptyEntries);
Do you have to use a regular expression? Tools like COCO/R could make this job pretty easy as well. In addition it might just prove to be faster than generating code at runtime using a regex.
COMPILER YourParaProcessor
// your code goes here
TOKENS
newLine= '\r'|'\n'.
paraLetter = ANY - '\n' - '\r' .
YourParaProcessor
=
{Paragraph}
.
Paragraph =
{paraLetter} '\r\n' .

Matching an (easy??) regular expression using C#'s regex

Ok sorry this might seem like a dumb question but I cannot figure this thing out :
I am trying to parse a string and simply want to check whether it only contains the following characters : '0123456789dD+ '
I have tried many things but just can't get to figure out the right regex to use!
Regex oReg = new Regex(#"[\d dD+]+");
oReg.IsMatch("e4");
will return true even though e is not allowed...
I've tried many strings, including Regex("[1234567890 dD+]+")...
It always works on Regex Pal but not in C#...
Please advise and again i apologize this seems like a very silly question
Try this:
#"^[0-9dD+ ]+$"
The ^ and $ at the beginning and end signify the beginning and end of the input string respectively. Thus between the beginning and then end only the stated characters are allowed. In your example, the regex matches if the string contains one of the characters even if it contains other characters as well.
#comments: Thanks, I fixed the missing + and space.
Oops, you forgot the boundaries, try:
Regex oReg = new Regex(#"^[0-9dD +]+$");
oReg.IsMatch("e4");
^ matches the begining of the text stream, $ matches the end.
It is matching the 4; you need ^ and $ to terminate the regex if you want a full match for the entire string - i.e.
Regex re = new Regex(#"^[\d dD+]+$");
Console.WriteLine(re.IsMatch("e4"));
Console.WriteLine(re.IsMatch("4"));
This is because regular expressions can also match parts of the input, in this case it just matches the "4" of "e4". If you want to match a whole line, you have to surround the regex with "^" (matches line start) and "$" (matches line end).
So to make your example work, you have to write is as follows:
Regex oReg = new Regex(#"^[\d dD+]+$");
oReg.IsMatch("e4");
I believe it's returning True because it's finding the 4. Nothing in the regex excludes the letter e from the results.
Another option is to invert everything, so it matches on characters you don't want to allow:
Regex oReg = new Regex(#"[^0-9dD+]");
!oReg.IsMatch("e4");

Categories