In the end (after a few days of debugging) I found the problem, and it isn't in the regex at all. It seems that I was trimming extra whitespace with
input = Regex.Replace(input, "\\s+", " ");
so all newlines were replaced with " ". Stupid! Moderator, please remove this if unnecessary!
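For anyone hitting the same thing, here is a minimal sketch of the effect. The sample text and variable names are made up for illustration, and `[^\S\r\n]` is just one common way to say "whitespace except CR and LF".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input with a blank line in the middle.
string input = "one  two\n\nthree";

// \s also matches \r and \n, so every line break collapses into a space.
string collapsedAll = Regex.Replace(input, @"\s+", " ");

// Collapse only horizontal whitespace; line breaks survive.
string collapsedHorizontal = Regex.Replace(input, @"[^\S\r\n]+", " ");

Console.WriteLine(collapsedAll.Contains("\n"));        // False
Console.WriteLine(collapsedHorizontal.Contains("\n")); // True
```

With the second pattern the tokenizer would still see the "\n\n" it expects.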
I have a regex for tokenizing some text, and it looks like this:
"(?<html>Ç)|
(?<number>\\d+(?:[.]\\d+)?(?=[][ \f\n\r\t\v!?.,():;\"'„Ç]|$))|
(?<other>(?:[^][Ç \f\n\r\t\v!?.,():;\"'„A-Za-zčćšđžČĆŠĐŽäöÖü][^ Ç\f\n\r\t\vA-Za-zčćšđžČĆŠĐŽäöÖü]*)?[^][ Ç\f\n\r\t\v!?.,():;\"'„A-Za-zčćšđžČĆŠĐŽäöÖü](?=[][!?.,():;\"'„]*(?:$|[ Ç\f\n\r\t\v])))|
(?<word>(?:[^][ Ç\f\n\r\t\v!?.,():;\"'„][^ Ç\f\n\r\t\v]*)?[^][ Ç\f\n\r\t\v!?.,():;\"'„])|
(?<punctuation>[][ \f\n\r\t\v!?.,():;\"'„])"
The problem is in this part: (?<punctuation>[][ \f\n\r\t\v!?.,():;\"'„]). When I'm parsing text with the input "\n\n", the punctuation group matches " ", " " (in other words, space and space), and I don't know why.
I could be wrong, but the pattern is handed to Regex as an ordinary C# string, which means you need to escape the backslashes.
... (?=[][ \\f\\n\\r\\t\\v!?.,():;\\" ...
Otherwise C# will replace \n with a line break inside the regex pattern.
Edit: It's also possible to use verbatim string literals, but they need to be prefixed with @ (see Martin's answer).
If you put an @ in front of the string, you can use single backslashes, and line breaks will be recognized.
@"(?<html>Ç)|
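A small sketch of what the three ways of writing the pattern actually hand to the regex engine. The variable names are made up. The last two lines assume RegexOptions.IgnorePatternWhitespace, where a real line-feed character in the pattern is thrown away as pattern whitespace while the two-character escape \n is kept.

```csharp
using System;
using System.Text.RegularExpressions;

// Regular literal: the C# compiler turns \n into one real line-feed character.
string compilerEscaped = "\n";
// Escaped backslash: the regex engine receives the two characters \ and n.
string doubleBackslash = "\\n";
// Verbatim literal: the same two characters, no doubling needed.
string verbatim = @"\n";

Console.WriteLine(compilerEscaped.Length);      // 1
Console.WriteLine(verbatim.Length);             // 2
Console.WriteLine(doubleBackslash == verbatim); // True

// Without special options all of them match a line feed.
Console.WriteLine(Regex.IsMatch("a\nb", compilerEscaped)); // True
Console.WriteLine(Regex.IsMatch("a\nb", verbatim));        // True

// With IgnorePatternWhitespace the real line feed is stripped from the pattern,
// so "a<LF>b" behaves like the pattern "ab".
bool strippedMatchesAb = Regex.IsMatch("ab", "a\nb", RegexOptions.IgnorePatternWhitespace);
bool escapedMatchesAb = Regex.IsMatch("ab", @"a\nb", RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(strippedMatchesAb); // True
Console.WriteLine(escapedMatchesAb);  // False
```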
Set RegexOptions.IgnorePatternWhitespace
Update:
Are you sure [^] is correct? Unless it's some kind of character group (that I have never used), it will be the same as . (dot). The same goes for []. Perhaps I just haven't used all of regex before :p
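As far as I know, in .NET (without RegexOptions.ECMAScript) a ] that comes first in a character class, or right after the ^, is taken as a literal, so [][ ...] and [^][ ...] are ordinary classes that happen to contain the two brackets. A quick check, with made-up variable names:

```csharp
using System;
using System.Text.RegularExpressions;

// A leading ] inside the class is a literal, so this class is: ']', '[' or ' '.
var bracketOrSpace = new Regex(@"[][ ]");
// Same class, negated: anything except ']', '[' or ' '.
var notBracketOrSpace = new Regex(@"[^][ ]");

Console.WriteLine(bracketOrSpace.IsMatch("]"));      // True
Console.WriteLine(bracketOrSpace.IsMatch("x"));      // False
Console.WriteLine(notBracketOrSpace.IsMatch("x"));   // True
Console.WriteLine(notBracketOrSpace.IsMatch("[ ]")); // False
```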
Related
How can I replace lone instances of \n with \r\n (LF alone with CRLF) using a regular expression in C#?
I know how to do it using plain String.Replace, like:
myStr.Replace("\n", "\r\n");
myStr.Replace("\r\r\n", "\r\n");
However, this is inelegant, and would destroy any "\r+\r\n" already in the text (although they are not likely to exist).
It might be faster if you use this.
(?<!\r)\n
It basically looks for any \n that is not preceded by a \r. This would most likely be faster, because in the other case almost every letter matches [^\r], so it would capture that and then look for the \n after it. In the example I gave, it would only stop when it found a \n, and then look before that to see if it found \r.
Will this do?
[^\r]\n
Basically it matches a '\n' that is preceded with a character that is not '\r'.
If you want it to detect lines that start with just a single '\n' as well, then try
([^\r]|^)\n
This says that it should match a '\n', but only those that are the first character of a line or those that are not preceded by '\r'.
There might be special cases to check: since you're messing with the definition of lines itself, the '^' might not work too well. But I think you should get the idea.
EDIT: credit @Kibbee. Using a lookbehind is clearly better, since it won't capture the matched preceding character and should help with any edge cases as well. So here's a better regex, and the code becomes:
myStr = Regex.Replace(myStr, "(?<!\r)\n", "\r\n");
I was trying to do the code below to a string and it was not working.
myStr.Replace("(?<!\r)\n", "\r\n")
I used Regex.Replace and it worked
Regex.Replace( oldValue, "(?<!\r)\n", "\r\n")
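To make the difference concrete, a minimal sketch (the sample text is made up): String.Replace treats its first argument as literal text, while Regex.Replace interprets it as a pattern.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "a\nb\r\nc";

// String.Replace looks for the literal characters "(?<!", CR, ")", LF.
// That sequence never occurs in the text, so nothing changes.
string viaString = text.Replace("(?<!\r)\n", "\r\n");

// Regex.Replace interprets the pattern: a LF that is not preceded by a CR.
string viaRegex = Regex.Replace(text, "(?<!\r)\n", "\r\n");

Console.WriteLine(viaString == text);         // True
Console.WriteLine(viaRegex == "a\r\nb\r\nc"); // True
```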
I guess that "myStr" is an object of type String, in that case, this is not regex.
\r and \n are the equivalents for CR and LF.
My best guess is that if you know that you have an \n for EACH line, no matter what, then you first should strip out every \r. Then replace all \n with \r\n.
The answer chakrit gives would also work, but then you need to use regex, and since you don't say what "myStr" is...
Edit: Looking at the other examples tells me one thing: why do things the difficult way when you can do them the easy way? Just because regex exists doesn't mean you must use it :D
Edit2: A tool is very valuable when fiddling with regex, xpath, and whatnot that gives you strange results, may I point you to: http://www.regexbuddy.com/
myStr = Regex.Replace(myStr, "([^\r])\n", "$1\r\n");
$ may need to be a \
Try this: Replace(Char.ConvertFromUtf32(13), Char.ConvertFromUtf32(10) + Char.ConvertFromUtf32(13))
If I know the line endings must be one of CRLF or LF, something that works for me is
myStr = Regex.Replace(myStr, "\r?\n", "\r\n");
This essentially does the same as neslekkiM's answer, except it performs only one replace operation on the string rather than two. It is also compatible with regex engines that don't support negative lookbehinds or backreferences.
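A quick sketch comparing the single-pass pattern with the strip-then-replace approach on made-up mixed input. Both assume the text only contains LF and CRLF endings; a lone CR would be dropped by the two-step version and left alone by the regex.

```csharp
using System;
using System.Text.RegularExpressions;

string mixed = "a\nb\r\nc\n";

// One pass: an optional CR followed by LF always becomes CRLF.
string onePass = Regex.Replace(mixed, "\r?\n", "\r\n");

// Two plain string replaces: drop every CR, then expand every LF.
string twoStep = mixed.Replace("\r", "").Replace("\n", "\r\n");

Console.WriteLine(onePass == "a\r\nb\r\nc\r\n"); // True
Console.WriteLine(onePass == twoStep);           // True
```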
I need to match a string:-
that always starts with 'P#' (case-insensitive)
that always contains 'Z#'
and ends with new line (\r or \n or \r\n)
Example strings:
P#M1RE2Z#
P#M2S0Z#M2SX0
P#M3S12Z#
Here is what I have figured out so far, but I still need to match 'Z#' in between:
(P#.*?(\r|\n|\r\n))
this one should work for you
^P\#.*Z\#.*[\n\r]+
Note: I put \ before # because # starts a comment in a regex when RegexOptions.IgnorePatternWhitespace is enabled.
This regex will match only if the line ends with \n or \r.
This will work
\bP#(?=.*Z#)(?=.*[\r\n]+)\b
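A hedged sketch of one way to run this kind of match from C#. The pattern below is my own variation, not taken from either answer: it uses [^\r\n]* so a match cannot run across a line break, and a lookahead so the line ending is required but not included in the match. The sample text is made up from the strings in the question plus two lines that should be rejected. Note that RegexOptions.IgnoreCase makes 'Z#' case-insensitive as well, which may or may not be wanted.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "P#M1RE2Z#\r\nP#M2S0Z#M2SX0\nX#M1Z#\nP#M9\np#M3S12Z#\r\n";

// ^ with Multiline anchors at the start of each line;
// the lookahead demands a CR or LF right after the matched text.
var lineRegex = new Regex(
    @"^P#[^\r\n]*Z#[^\r\n]*(?=[\r\n])",
    RegexOptions.Multiline | RegexOptions.IgnoreCase);

MatchCollection found = lineRegex.Matches(text);
foreach (Match m in found)
{
    Console.WriteLine(m.Value);
}
```

On this input it finds the three P#...Z# lines and skips "X#M1Z#" and "P#M9".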
I'd like regex expressions to use in a Visual Studio 2013 extension written in C#.
I'm trying to remove trailing whitespaces from a line while preserving empty lines. I'd also like to remove multiple empty lines. The existing line endings should be preserved (generally carriage return line feed).
So the following text (spaces shown as underscores):
hello_world__
___hello_world_
__
__
hello_world
Would become:
hello_world
___hello_world
hello_world
I've tried a number of different patterns to remove the trailing spaces but I either end up not matching the trailing spaces or losing the carriage returns. I haven't yet tried to remove the multiple empty lines.
Here's a couple of the patterns I've tried so far:
\s+$
(?<=\S)\s+$
Thanks for the answers so far. None of them are quite right for what I need, but they've helped me come up with what I needed. I think the issue is that there are some oddities with regex in VS2013 (see Using Regular Expressions in Visual Studio). These two operations work for me:
Replace \ +(?=(\n|\r?$)) with nothing.
Replace ^\r?$(\n|\r\n){2,} with \r\n.
To remove multiple blank lines and trailing whitespace, search for
(?:\r\n[\s-[\r\n]]*){3,}
and replace with \r\n\r\n.
And to remove the remaining whitespace, you can use
(?m)[\s-[\r]]+\r?$
\ +(?=(\n|$))
This matches any number of spaces and checks that they are followed by a newline or by the end of the text. (Multiline and global mode need to be enabled, of course.)
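A rough sketch of how the two steps could look with the plain .NET Regex class (not the Visual Studio find dialog, which may behave differently). The patterns are my own variation on the ones in this thread. In .NET, $ with RegexOptions.Multiline matches just before \n but not before \r, hence the optional \r in the lookahead. The sample text mirrors the example in the question.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "hello world  \r\n   hello world \r\n  \r\n  \r\nhello world";

// Step 1: drop spaces and tabs that sit right before a line ending or the end of the text.
string trimmed = Regex.Replace(text, @"[ \t]+(?=\r?$)", "", RegexOptions.Multiline);

// Step 2: three or more line endings in a row mean two or more blank lines;
// keep the first ending twice so exactly one blank line remains, in its original style.
string collapsed = Regex.Replace(trimmed, @"(\r?\n)(?:\r?\n){2,}", "$1$1");

Console.WriteLine(collapsed == "hello world\r\n   hello world\r\n\r\nhello world"); // True
```

Leading indentation is untouched, and the existing CRLF endings are reused instead of being hard-coded in the replacement.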
Just as a punt, without the use of Regex, you could always split the document by its end of line marker and then feedback using TrimEnd (as highlighted by Anton Semenov)...
(Assuming a text document read into a string...)
// Ascertain the linefeed...
string str = "This is a test \r\nto see if I can force \ra string to be broken \non multiple lines \r\n into an array.";
string[] t = str.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries);
thediv.InnerHtml = str + "<br /><br />";
foreach(string s in t)
{
thediv.InnerHtml += s.TrimEnd() + "<br />";
}
I haven't timed this at all, but if you prefer to avoid the complications of Regex (which I do if I can - see below*), you should find this fast enough to do what you want.
* I avoid Regex if I can. That doesn't mean that I don't use it. Regex has its place, but I believe it to be a last resort tool for involved jobs, for instance complex flexible strings that adhere to a format - something where the alternative will generate large amounts of code. Keeping Regex to an absolute minimum aids the readability of your code.
As separate operations -
Remove any trailing whitespace: (?m)[^\S\r\n]+$
Remove trailing whitespace only on lines with text: (?m)(?<=\S)[^\S\r\n]+$
Remove duplicate blank lines (along with whitespace trim)
# Find: (?>\A(?:[^\S\r\n]*\r\n)+)|(?>\r\n(?:[^\S\r\n]*(\r\n)){2,})
# Replace: $1\r\n
(?>
    \A
    (?: [^\S\r\n]* \r \n )+
)
|
(?>
    \r \n
    (?:
        [^\S\r\n]*
        ( \r \n )    # (1)
    ){2,}
)
Since \s includes the line feed, I would search for just multiple blanks instead. I do not know the specifics of VS, but this should hopefully do it:
[" "]*?$
I have this text
'Random Text', 'a\nb\\c\'d\\', 'ok'
I want it to become
'Random Text', 'a\nb\c''d\', 'ok'
The issue is escaping. Instead of escaping with \ I now escape only ' with ''. This is for a 3rd party program so I can't change it thus needing to change one escaping method to another.
The issue is \\'. If I do a string replace it will become \'' rather than \'. Also, \n is not a newline but the literal text \n, which shouldn't be modified. I tried using regex, but I couldn't think of a way to say: if ' replace with '', else if \\ replace with \. Obviously, doing this in two steps creates the problem.
How do I replace this string properly?
If I understand your question correctly, the issue lies in replacing \\ with \, which can then cause another replacement if it occurs right before '. One technique would be to replace it to an intermediary string first that you're sure will not occur anywhere else, then replace it back after you're done.
var str = @"'Random Text', 'a\nb\\c\'d\\', 'ok'";
str = str.Replace(@"\\", "NON_OCCURRING_TEMP")
.Replace(@"\'", "''")
.Replace("NON_OCCURRING_TEMP", @"\");
As pointed out by @AlexeiLevenkov, you can also use Regex.Replace to do both modifications simultaneously.
str = Regex.Replace(str, @"(\\\\)|(\\')",
match => match.Value == @"\\" ? @"\" : @"''");
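To see why the order matters, here is a small self-contained check against the sample from the question (variable names are made up). The naive two-step replace lets the backslash left over from \\ pair up with the closing quote, while the single-pass regex consumes each escape sequence exactly once.

```csharp
using System;
using System.Text.RegularExpressions;

string input = @"'Random Text', 'a\nb\\c\'d\\', 'ok'";
string expected = @"'Random Text', 'a\nb\c''d\', 'ok'";

// Naive: after \\ becomes \, the trailing \' is rewritten a second time.
string naive = input.Replace(@"\\", @"\").Replace(@"\'", "''");

// Single pass: at each position try \\ first, then \'; nothing is visited twice.
string singlePass = Regex.Replace(input, @"\\\\|\\'",
    m => m.Value == @"\\" ? @"\" : "''");

Console.WriteLine(naive == expected);      // False
Console.WriteLine(singlePass == expected); // True
```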
Seems voithos' interpretation of the question is the right one. Another approach is to use a regex to find all tokens at once and replace them with Regex.Replace.
Starting point:
var matches = new Regex(@"\\\\'|\\'|'");
Console.Write(matches.Replace(@"'a b\nc d\\e\'f\\'",
match =>"["+match + "]"));
I imagine I should use Regex for that but I still scratch my head a lot about it (and the only similar question I found wasn't exactly my case) so I decided to ask for help. This is the input and expected output:
Input: "c-0.68219,-0.0478 -1.01455-0.0441 0.9e-4,0.43212"
Output: "c -0.68219,-0.0478 -1.01455 -0.0441 0.9e-4,0.43212"
Basically I need either commas or spaces as value separators, but I can't break the exponential index (e‑4). Maybe do two successive replacements?
Been some time for me, but you should be able to use something like this:
Regex rx = new Regex(#"([^e\s])-(\d)");
string output = rx.Replace(input, "$1 -$2");
Edit: This will add a space in front even if there's a comma. Any reason not to do this?
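If the space after a comma is not wanted, a lookbehind variant might do. This is only a sketch under the assumption that every negative number starts with a digit after the minus sign (so "-.5" would be missed); the pattern is my own, not from the answer above. It leaves alone any minus that follows an exponent marker, whitespace, or a comma.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "c-0.68219,-0.0478 -1.01455-0.0441 0.9e-4,0.43212";

// Insert a space before a minus sign only when the previous character is
// not e/E (exponent), not whitespace and not a comma, and a digit follows.
string output = Regex.Replace(input, @"(?<=[^eE\s,])-(?=\d)", " -");

Console.WriteLine(output);
// c -0.68219,-0.0478 -1.01455 -0.0441 0.9e-4,0.43212
```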
You can also use the following:
ResultString = Regex.Replace(MyString, "([^0-9,e,., ,e-])", "$1 ",
RegexOptions.IgnoreCase);