Regular Expression to ignore all text between 2 points in .NET - c#

Start code is "abc123" now, this is irrelevant text and characters,
possibly new lines and line breaks, quotes etc. Finish.
I would like my regular expression to find a match where there is text that begins with 'Start code is', retrieves the value that follows within the quotes (in this case abc123), but then ignores all that immediately follows that closing quote UP UNTIL the word 'Finish'.
So if I had the following text:
This is some dummy text. Start code is "abc123" now, this is
irrelevant text and characters, possibly new lines and line
breaks, quotes etc. Finish. And this is some more dummy text.
... the match would be successful, and I would be able to retrieve the value abc123. That is the only part that I actually need to use.

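// RegexOptions.Singleline lets '.' match line breaks, so the text before "Finish." may span several lines.
var matches = Regex.Matches(str, ".*Start code.[^\"]*\"([^\"]*)\"[^\"]*.*Finish\\..*", RegexOptions.Singleline);
if (matches.Count > 0)
{
// Groups[1] holds the text captured between the quotes.
result = matches[0].Groups[1].Value;
}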
Basically, you just need to check the results of Regex.Matches. The only "trick" used here is that I limit the match to non-quote characters before, between and after the quotes.
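A self-contained variant of this idea is sketched below. It is only an illustration: the class and method names (StartCode, Extract) are made up for the example, and the pattern is a shortened form that captures the quoted value and then skips lazily to the first "Finish". It assumes RegexOptions.Singleline is acceptable, so that the skipped text may contain line breaks and further quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StartCode
{
    // Hypothetical helper: returns the quoted value after "Start code is",
    // or null when the text is not followed by "Finish" somewhere later.
    public static string Extract(string text)
    {
        // Singleline lets '.' cross line breaks; '.*?' stops at the first "Finish".
        Match m = Regex.Match(
            text,
            @"Start code is ""([^""]+)"".*?Finish",
            RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string text = "This is some dummy text. Start code is \"abc123\" now, this is\n"
                    + "irrelevant text, \"quotes\" etc. Finish. And this is some more dummy text.";
        Console.WriteLine(Extract(text));
    }
}
```

Because only group 1 is read, the text between the closing quote and "Finish" never needs to be captured at all.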

You can use this pattern with lookarounds:
@"(?<=Start code is "")[^""]+(?="")"
or a simple capturing group:
@"Start code is ""([^""]+)"""


How to get string between starting text and ending character using regular expression?

I want to get the string that starts with color=" and ends with the closing double quote, where the value contains parentheses (), with or without parameters. Sometimes there are several quoted values on a line. I want to select only the text from the matching start word to its closing quote.
This is my input file
color="functions.getcolor('someinput')"
color="getcolor()"
color="!model.type && functions.getcolor(model.type, cofig.value)"
color="model.type == enums.someenum"
color="(something=something)||(Something=somethingelse)"
color="model.type" mode="getmode()"
This is my regular expression
color=\".+\(.+\)*\"$
Currently every line is selected except line 4 of the input file,
but my requirement is that only lines similar to the first 3 lines are selected.
Expected result
color="functions.getcolor('someinput')"
color="getcolor()"
color="!model.type && functions.getcolor(model.type, cofig.value)"
How to write the regular expression for this?
Seems like within quotes, you want to be able to find text that has some kind of function call. If this is the case, this regex will match lines 1-3 but not 4-6. You can keep expanding the characters allowed within the [ ] if you encounter more.
^color=\"([a-zA-Z.! &])+\(.+\)*\"$
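As a quick check, the pattern can be run against the six input lines from the question. This is a sketch only; the class and method names (ColorFilter, IsFunctionCall) are invented for the example, and each line is tested on its own so no RegexOptions.Multiline is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorFilter
{
    // Same pattern as above, written as a C# verbatim string ("" is one quote).
    private static readonly Regex FunctionCall =
        new Regex(@"^color=""([a-zA-Z.! &])+\(.+\)*""$");

    // True when the whole line is a color attribute whose value contains a call.
    public static bool IsFunctionCall(string line)
    {
        return FunctionCall.IsMatch(line);
    }

    public static void Main()
    {
        string[] lines =
        {
            "color=\"functions.getcolor('someinput')\"",
            "color=\"getcolor()\"",
            "color=\"!model.type && functions.getcolor(model.type, cofig.value)\"",
            "color=\"model.type == enums.someenum\"",
            "color=\"(something=something)||(Something=somethingelse)\"",
            "color=\"model.type\" mode=\"getmode()\"",
        };
        foreach (string line in lines)
        {
            // Print only the lines that the pattern accepts.
            if (IsFunctionCall(line)) Console.WriteLine(line);
        }
    }
}
```

Lines 5 and 6 are rejected because the character class must match at least once before the opening parenthesis and allows neither ( nor a quote.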

Remove all Trailing <br> Using Regex, Substitute Group not Returning Full Match

Here is the problem. I have a block of pasted HTML text. I need to remove trailing line breaks and white space from the text, even ones that are followed by closing tags. The text below is simply an example, but it closely represents the real text I'm dealing with.
EG:
This:
<span>Here is some<br></span><br>
<span><span>Here is some text</span><br><span><br> </span></span><br><br>
Becomes this:
<span>Here is some<br></span><br>
<span><span>Here is some text<span></span></span>
My first pass: I use Regex.Replace(htmlString, @"(?:\<br\s*?\>)*$", "") to get rid of the trailing line breaks. Now all I have left is the line breaks stuck behind closing tags and white space.
I'm attempting to use this:
while (Regex.IsMatch(htmlString, @"(<br>|\s| )*(<[^>]*>)*$"))
{
htmlString = Regex.Replace(htmlString, @"(<br>|\s| )*(<[^>]*>)*$", "$2");
}
The regex pattern is actually working great, the problem is that the substitute by matched group 2 is only giving back a single closing span. So that I end up with the below:
<span>Here is some<br></span><br>
<span><span>Here is some text</span></span>
The regular expression is @"(<br>|\s| )*(<[^>]*>)*$". The second group is followed by a *, meaning the group is repeated, so $2 yields only the last repetition of the group.
Putting the repetition in a group will capture the whole repetition. Change the regular expression to @"(<br>|\s| )*((<[^>]*>)*)$".
Note that repeating the first group with a * may make the code spin on some input strings, as there is no guarantee that the Replace will change the text to a different string. As the first group is optional (i.e. zero or more repeats), the Replace might replace one string with exactly the same string. So I suggest changing the regular expression to @"(<br>|\s| )+((<[^>]*>)*)$", meaning that one or more occurrences of the first group are required.
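Put together, the suggested pattern could be used roughly as follows. This is a sketch, not the asker's actual code: the names TrailingBreaks and Trim are hypothetical, and the result of Replace is assigned back to the variable because .NET strings are immutable.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingBreaks
{
    // Group 1: one or more <br> or whitespace; group 2: the whole run of tags after it.
    private static readonly Regex Trailing =
        new Regex(@"(<br>|\s| )+((<[^>]*>)*)$");

    public static string Trim(string html)
    {
        // Each pass removes at least one <br> or whitespace character,
        // so the string gets shorter and the loop must terminate.
        while (Trailing.IsMatch(html))
        {
            html = Trailing.Replace(html, "$2");
        }
        return html;
    }

    public static void Main()
    {
        string html = "<span><span>Here is some text</span><br><span><br> </span></span><br><br>";
        Console.WriteLine(Trim(html));
    }
}
```

Note that group 2 also matches <br> tags, so a trailing <br> can be carried over by one pass and removed by a later one; that is why the loop is still needed.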
I guess you can use:
resultString = Regex.Replace(subjectString, @"<br>| |\n", "");

Skip text between quotes in Regex

I have the following Regex using IgnoreCase and Multiline in .NET:
^.*?\WtextToFind\W.*?$
Given a multiline input like:
1 Some Random Text textToFind
2 Some more "textToFind" random text
3 Another textToFinddd random text
The current regular expression matches lines 1 and 2. However, I need to skip all the lines in which textToFind is inside single quotes or double quotes.
Any ideas how to achieve this?
Thanks!
EDIT:
Explanation: My purpose is to find some method calls inside VBScript code. I thought this would be irrelevant for my question, but after reading the comments I realised I should explain this.
So basically I want to skip text that is between quotes or single quotes and all the text that is between a quote and the end of line since that would be a comment in VBScript:
If I'm looking for myFunc
Call myFunc("parameter") // should match
Call anotherFunc("myFunc") //should not match
Call someFunc("parameter") 'Comment myFunc //should not match
If(myFunc("parameter") And someFunc("myFunc")) //should match
With all of the possible cases involving mixed sets of quotes, a regex may not be your best option here. What you could do instead (after using your current regex to filter for everything but quotes), is count the number of quotes before and after the occurrence of textToFind. If both counts are odd, then you have quotes around your keyword and should scrap the line. If both are even, you've got matched quotes elsewhere (or no quotes at all), and should keep the line. Then repeat the process for double quotes. You could do all this only walking through the string once.
Edit to address the update that you're searching through code:
There are some additional considerations to take into account.
Escaped quotes (skip over the character after an escape character, and it won't be counted).
Commented quotes, e.g. /* " */ in the middle of a line. When you hit a /*, just jump to the next occurrence of */ and then continue inspecting characters. You may also want to check whether the occurrence of textToFind is in a comment.
End-of-line ' quotes - if it occurs (outside a literal string) before the keyword, it's not a valid method call.
The bottom line is still that regexes aren't the droids you're looking for, here. You're better off walking through lines and parsing them.
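A minimal sketch of such a line walker is shown below. It assumes VBScript rules only: double-quoted string literals (a doubled "" inside a literal simply toggles the state twice), a ' outside a literal starts a comment, and names are case-insensitive. The names VbCallFinder and ContainsCall are hypothetical, and Rem comments and line continuations are not handled.

```csharp
using System;

public static class VbCallFinder
{
    // True when 'name' occurs as a whole word outside string literals and comments.
    public static bool ContainsCall(string line, string name)
    {
        bool inString = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"') { inString = !inString; continue; }
            if (inString) continue;          // skip everything inside a literal
            if (c == '\'') return false;     // the rest of the line is a comment
            if (IsNameAt(line, i, name)) return true;
        }
        return false;
    }

    private static bool IsNameAt(string line, int i, string name)
    {
        int end = i + name.Length;
        if (end > line.Length) return false;
        if (string.Compare(line, i, name, 0, name.Length,
                           StringComparison.OrdinalIgnoreCase) != 0) return false;
        // Require word boundaries so "myFuncs" or "xmyFunc" do not count.
        bool startOk = i == 0 || !IsWordChar(line[i - 1]);
        bool endOk = end == line.Length || !IsWordChar(line[end]);
        return startOk && endOk;
    }

    private static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    public static void Main()
    {
        Console.WriteLine(ContainsCall("Call myFunc(\"parameter\")", "myFunc"));
        Console.WriteLine(ContainsCall("Call anotherFunc(\"myFunc\")", "myFunc"));
    }
}
```

The string is walked once, and no quote counting is needed because the in-string state is tracked as the walk proceeds.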
It seems like this should work for your actual implementation in all the examples you've given:
/\bmyFunc\(/
as long as you don't have something like "i'm going to call myFunc()", but if you start trying to deal with quotes, multiple quotes, nested quotes, etc... it will get very messy (like trying to parse dom with regex).
Also, it appears that you are checking within VBScript code. Comments in VBScript code start with a ', right? You could check for this as well. Since it looks like you are doing this on a line-by-line basis, this should work for lines that begin with that type of comment:
/^\s*[^'].*\bmyFunc\(/

Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple Regex is ^[0-9]{4}[A-Z]{2}\.(?<noteline>.*), which would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if that line has the code WITHOUT a '.' (i.e. if the NEXT line would match ^[0-9]{4}[A-Z]{2}[^\.]).
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below, gives me all lines, but I'd like to know IF a blank line (a line with the 0000AA code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (For example, with a named group that is not empty or whitespace.)
Edit 2: I may be misusing the term 'lookahead'. Going back over .NET's regex documentation, I see something referred to as an Alternation Construct, but I'm not sure whether that could be used here.
Thanks! Mike.
Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $, making them match at the beginning and the end of every line instead of the beginning and end of the entire string.
var matches = Regex.Matches(input,
@"^[0-9]{4}[A-Z]{2}\..*$?(?!^[0-9]{4}[A-Z]{2}[^.])",
RegexOptions.Multiline);
The negative look ahead is
find(?!suffix)
It matches find only at a position that is not followed by suffix. Don't escape the dot within the brackets [ ]. The brackets disable the special meaning of most characters anyway.
I also added .*$? so that the pattern matches to the end of the current line. Note that the ? here applies to $, making the anchor optional; it does not make .* lazy (that would be written .*?). Without RegexOptions.Singleline the dot does not match a newline, so .* stops at the end of the current line in any case and cannot match several lines at a time.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*$?(?!^[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
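One caveat: a lookahead placed after .* is evaluated at the end of the line, before the line break, and ^ cannot match at that position, so the lookahead has to step over the line break itself (\r?\n). The sketch below takes that route and also reports whether the next line is a bare code. It is an illustration under assumptions, not a tested solution for the real data: the class name NoteLines and the sample input are made up, while the group names notetext and lastline follow the follow-up post in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoteLines
{
    // notetext: the free text after "code.".
    // lastline: captured inside the lookahead only when the NEXT line is a code
    // that is not followed by '.'; otherwise the empty alternative is taken.
    private static readonly Regex NotePattern = new Regex(
        @"^[0-9]{4}[A-Z]{2}\.(?<notetext>[^\r\n]*)" +
        @"(?:(?=\r?\n(?<lastline>[0-9]{4}[A-Z]{2})(?!\.))|)",
        RegexOptions.Multiline);

    public static MatchCollection Find(string input)
    {
        return NotePattern.Matches(input);
    }

    public static void Main()
    {
        string input = "0000AA.first\n0000AA.second\n0000AA\n0000AB.third";
        foreach (Match m in Find(input))
        {
            // Success on the lastline group is the "a blank code line follows" flag.
            Console.WriteLine("{0} | last={1}",
                m.Groups["notetext"].Value,
                m.Groups["lastline"].Success);
        }
    }
}
```

Because the lookahead is zero-width, the following line is not consumed, so the next match still starts on it.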
After doing a lot of research, and some hits and misses, I'm now fairly certain that it can't be done. Or rather, it CAN be done, but it would be prohibitively difficult; it is easier to do it in code.
To recap, I was looking at a multiline string (document), where every line was preceded by a 6-character code. Some lines (the lines I'm interested in) have a '.' after the 6-character code, and then open text. I was hoping there would be a way to get me each line in a group, along with a flag letting me know if the next line has no free-text entry (no '.' after the 6-character code). I.e. a two-line data entry would give me two matches on the document. The first match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second match would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, the content wouldn't matter).
From what I understand, lookaheads are zero-width assertions, so that if it matches, the returnable value is still empty. Without using lookahead, the match for 'lastline' would consume the next line's code, making the 'notetext' skip that line (giving me every other line of text.) So, I would need to have some back-reference to revert back to.
By this time, it'd be easier (code-wise) to simply get all the lines, and add up text until I get to the end of their notes. That means looping over the entire document, which can't be more than 200 lines, as opposed to looping through the regex-matched lines, and the ease of reading the code for future modifications would outweigh any slight speed advantage the regex could get me.
Thanks guys -
-Mike.

Regular expression to replace whole line if it contains a particular word

I have a Word document that contains some confidential information, for example NIC:343434343.
I need a regular expression that does the following:
if it finds NIC on a line, it should replace the whole line with the specified text.
Since by default the dot does not match NewLine, you can simply use
.*NIC.*
to find lines containing "NIC". You'd use this expression like
string result = Regex.Replace(originalString, ".*NIC.*", "replacement string");
You can see it at work at ideone.com.
Use the start and end-of-line markers:
^.*NIC.*$
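With RegexOptions.Multiline set, ^ matches the start of a line and $ matches the end of a line (without that option they match only at the start and end of the whole string). This will cause the entire line to be matched if it contains "NIC" at least once.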
Use this regex: (?m-i)^.*?NIC.*$. It enables multiline option and disables ignore case option.
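The answers above can be combined into a small helper, sketched here under a few assumptions. The names LineRedactor and Redact are hypothetical. The pattern uses [^\r\n]* instead of .* because in .NET the dot also matches \r, which would otherwise swallow the carriage return of a Windows line ending. A MatchEvaluator is used so that a $ in the replacement text is not treated as a substitution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineRedactor
{
    // Replaces every line that contains "NIC" (case-sensitive) with 'replacement'.
    public static string Redact(string text, string replacement)
    {
        return Regex.Replace(
            text,
            @"^[^\r\n]*NIC[^\r\n]*",   // whole line, excluding the line break
            m => replacement,           // inserted literally, no $-substitutions
            RegexOptions.Multiline);
    }

    public static void Main()
    {
        string text = "Name: A\r\nNIC:343434343\r\nEnd";
        Console.WriteLine(Redact(text, "[removed]"));
    }
}
```

The line breaks themselves are left untouched, so the number of lines in the document does not change.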
