Remove "[ ]" and everything inside them from a string - C#

I'm building a small text clean-up program and I'm currently testing it on Wiki articles, and I'm trying to efficiently remove the "[2]", "[14]", "[nb 6]" etc.
I have this code which nearly does the job, but it seems overly long and I feel there must be a way to do it in one line; I'm new to Regex and can't figure it out. Also, I've read mixed opinions on Regex, so if there's an alternative way that'd be great.
Anyway here is my current code:
string refinedText = Regex.Replace(sourceText, @"\[[0-9]\]", "");
refinedText = Regex.Replace(refinedText, @"\[[0-9]", "");
refinedText = Regex.Replace(refinedText, @"\[[a-z]", "");
refinedText = Regex.Replace(refinedText, @"[0-9]\]", "");
The issue is that when there are two digits within the "[ ]" I don't know how to tell it to remove both, as "[0-9]" only matches a single digit. I can do the replace in two parts for those, but for instances of "[nb 3]" the "b" always remains, because once the brackets are gone there is nothing left to anchor a pattern for the lone "b". "[nb 14]" has the same issue when there are double digits after the "nb".
I'm sure this is simple to do in one line, but I can't find anywhere that explains regex to this extent.
-Thanks.

If you would like to delete the square brackets along with their content, no matter what that content is, the expression looks like this:
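@"\[[^\]]*\]"
This means "match an opening bracket, then everything up to the closing bracket". It is more efficient than the dot with a reluctant quantifier, .*?, because the negated character class never has to backtrack, whereas the lazy quantifier has to test for the closing bracket after every character it consumes.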
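As a minimal sketch of that pattern applied to the citation markers from the question (the class and method names are illustrative and not part of the original answer):

```csharp
using System.Text.RegularExpressions;

public static class BracketStripper
{
    // \[      literal opening bracket
    // [^\]]*  any run of characters that are not a closing bracket
    // \]      literal closing bracket
    private static readonly Regex Bracketed = new Regex(@"\[[^\]]*\]");

    // Removes every "[...]" group, whatever it contains.
    public static string Strip(string text) => Bracketed.Replace(text, "");
}
```

For example, BracketStripper.Strip("Paris[2] is the capital[nb 6] of France.[14]") returns "Paris is the capital of France.", which covers single digits, double digits and the "nb" notes in one pass.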

Use the + modifier:
string refinedText = Regex.Replace(sourceText, @"\[[0-9]+\]", "");
As the Regular Expression Language - Quick Reference explains:
Matches the previous element one or more times.
To remove any characters between the brackets:
string refinedText = Regex.Replace("[0as9]", @"\[.+\]", "");
Or if you want to also handle the "[]" case, then change + to *:
Matches the previous element zero or more times.
string refinedText = Regex.Replace("[0as9]", @"\[.*\]", "");
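One caveat worth checking before using the greedy forms on real text: .+ and .* run to the last closing bracket on the line, so a string with several bracketed groups loses everything between the first "[" and the last "]". The sketch below compares the greedy, lazy and negated-class variants; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Greedy: .* extends to the LAST ']' in the input line.
    public static string Greedy(string s) => Regex.Replace(s, @"\[.*\]", "");

    // Lazy: .*? stops at the first ']' that lets the match succeed.
    public static string Lazy(string s) => Regex.Replace(s, @"\[.*?\]", "");

    // Negated class: the content cannot contain a ']' at all.
    public static string Negated(string s) => Regex.Replace(s, @"\[[^\]]*\]", "");
}
```

With the input "a[1] b[2] c", Greedy returns "a c" (the " b" in the middle is swallowed), while Lazy and Negated both return "a b c".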

Try like this:
string refinedText = Regex.Replace(sourceText, @"\[[0-9]+\]", "");
Also you may try like this:
var refinedText = Regex.Replace(sourceText, @" ?\[.*?\]", string.Empty);
This removes the brackets and everything inside them, letters and digits alike, along with an optional preceding space.

Related

Match regex pattern in a line of text without targeting the text within quotations

Stackoverflow has been very generous with answers to my regex questions so far, but with this one I'm blanking on what to do and just can't seem to find it answered here.
So I'm parsing a string, let's say for example's sake, a line of VB-esque code like either of the following:
Call Function ( "Str ing 1 ", "String 2" , " String 3 ", 1000 ) As Integer
Dim x = "This string should not be affected "
I'm trying to parse the text in order to eliminate all leading spaces, trailing spaces, and extra internal spaces (when two "words/chunks" are separated by two or more spaces, or when there are one or more spaces between a character and a parenthesis) using regex in C#. The result after parsing the above should look like:
Call Function("Str ing 1 ", "String 2", " String 3 ", 1000) As Integer
Dim x = "This string should not be affected "
The issue I'm running into is that, I want to parse all of the line except any text contained within quotation marks (i.e. a string). Basically if there are extra spaces or whatever inside a string, I want to assume that it was intended and move on without changing the string at all, but if there are extra spaces in the line text outside of the quotation marks, I want to parse and adjust that accordingly.
So far I have the following regex which does all of the parsing I mentioned above, the only issue is it will affect the contents of strings just like any other part of the line:
var rx = new Regex(@"\A\s+|(?<=\s)\s+|(?<=.)\s+(?=\()|(?<=\()\s+(?=.)|(?<=.)\s+(?=\))|\s+\z");
.
.
.
lineOfText = rx.Replace(lineOfText, String.Empty);
Anyone have any idea how I can approach this, or know of a past question answering this that I couldn't find? Thank you!
Since you are reading the file line by line, you can use the following fix:
("[^"]*(?:""[^"]*)*")|^\s+|(?<=\s)\s+|(?<=\w)\s+(?=\()|(?<=\()\s+(?=\w)|(?<=\w)\s+(?=\))|\s+$
Replace the matched text with $1 to restore the captured string literals that were captured with ("[^"]*(?:""[^"]*)*").
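A minimal sketch of how that pattern might be wired up in C#, assuming the regex above is used unchanged (the class and method names are illustrative). Inside a verbatim string every double quote of the pattern has to be doubled, which is why the literal looks denser than the regex itself.

```csharp
using System.Text.RegularExpressions;

public static class SpaceNormalizer
{
    // Alternative 1 captures a whole string literal ("" inside it is an escaped quote);
    // the remaining alternatives match whitespace that should be dropped.
    private static readonly Regex Rx = new Regex(
        @"(""[^""]*(?:""""[^""]*)*"")|^\s+|(?<=\s)\s+|(?<=\w)\s+(?=\()|(?<=\()\s+(?=\w)|(?<=\w)\s+(?=\))|\s+$");

    // "$1" re-inserts the literal when group 1 matched and is empty otherwise,
    // so unwanted whitespace disappears while quoted text is left untouched.
    public static string Normalize(string line) => Rx.Replace(line, "$1");
}
```

The design point is that the string-literal alternative comes first: the engine consumes a whole quoted string before any whitespace alternative gets a chance to match inside it. For instance, an input of two spaces, Dim, three spaces, x, two spaces, =, two spaces, "a   b", two spaces comes back as Dim x = "a   b" with the spacing inside the quotes preserved.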

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired output
Keepword mysecond mythird myfourthKeep
Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(@"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word. If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(@"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.
string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
This works in notepad++:
(?<!Keep)word(?!Keep)
It uses a negative lookbehind and a negative lookahead.
You can use a negative lookbehind assertion if you want to remove every "word" that is not preceded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");
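The answers above do not all implement the same rule, so it may help to run two of them side by side on the sample input. This is only a comparison sketch; the class and method names are invented for illustration.

```csharp
using System.Text.RegularExpressions;

public static class KeepWordDemo
{
    // Skips "word" only when "Keep" comes immediately before it.
    public static string AdjacentOnly(string s) =>
        Regex.Replace(s, @"(?<!Keep)word", "");

    // Skips "word" when "Keep" appears anywhere in the same run of word
    // characters (relies on .NET's variable-length lookbehind).
    public static string WholeToken(string s) =>
        Regex.Replace(s, @"(?<!Keep\w*)word(?!\w*Keep)", "");
}
```

On "Keepword mywordsecond mythirdword myfourthwordKeep", AdjacentOnly yields "Keepword mysecond mythird myfourthKeep", which is the desired output listed in the question, while WholeToken leaves the last token alone and yields "Keepword mysecond mythird myfourthwordKeep". Which one is right depends on whether "does not contain Keep" is meant per token or only for the text directly before "word".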

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than just ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I only want to replace the ones with numeric characters between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());

What C# regex expression can be used to strip out dots (.) in a string?

I need a string with non alpha-numeric characters etc stripped out of it; I used the following:
wordsstr = Regex.Replace(wordsstr, "[^A-Za-z0-9,-_]", "");
The problem is that dots (.) are left in the string even though they are not specified to be kept. How can I make sure dots are removed too?
Many thanks.
You are specifying that they need to be kept - you're using ,-_ which is everything from U+002C to U+005F, including U+002E (period).
If you meant the ,-_ to just mean comma, dash and underscore you'll need to escape the dash, such as:
wordsstr = Regex.Replace(input, @"[^A-Za-z0-9,\-_]", "");
Alternatively, (as in Oded's comment) put the dash as the first or last character in the set, to prevent it being interpreted as a range specifier:
wordsstr = Regex.Replace(input, "[^A-Za-z0-9,_-]", "");
If that's not the aim, please be more specific: "non alpha-numeric characters etc" isn't really enough information to go on.
Try the code below:
wordsstr = Regex.Replace(wordsstr, "[^-A-Za-z0-9,_]", "");
Your problem would be easier to understand if you write your expectation and actual result.
Try
wordstr = Regex.Replace(wordstr, "[^A-Za-z0-9,\\-_]", "");
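or better if you just want to have alpha-numerical characters (note that a range written as A-z would also admit [, \, ], ^, _ and `, so the two letter ranges are spelled out):
wordstr = Regex.Replace(wordstr, "[^A-Za-z0-9]", "");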
The problem in your first regex is that the - char defines a range, so you have to escape it to make it behave the way you want it to.
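The effect of the accidental range can be seen in a small comparison. This is a sketch with made-up class and method names, using the pattern from the question and one of the corrected patterns from the answers.

```csharp
using System.Text.RegularExpressions;

public static class DashInClassDemo
{
    // ",-_" is a range from ',' (U+002C) to '_' (U+005F); '.' (U+002E) falls
    // inside it, so dots survive the negated class.
    public static string RangeByAccident(string s) =>
        Regex.Replace(s, "[^A-Za-z0-9,-_]", "");

    // With the dash last it is a literal, so only ',', '_' and '-' are kept
    // alongside letters and digits.
    public static string LiteralDash(string s) =>
        Regex.Replace(s, "[^A-Za-z0-9,_-]", "");
}
```

For the input "a.b,c-d_e!", RangeByAccident returns "a.b,c-d_e" (the dot is kept) and LiteralDash returns "ab,c-d_e".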

Why is "$1" ending up in my Regex.Replace() result?

I am trying to write a regular expression to rewrite URLs to point to a proxy server.
bodystring = Regex.Replace(bodystring, "(src='/+)", "$1" + proxyStr);
The idea of this expression is pretty simple, basically find instances of "src='/" or "src='//" and insert a PROXY url at that point. This works in general but occasionally I have found cases where a literal "$1" will end up in the result string.
This makes no sense to me because if there was no match, then why would it replace anything at all?
Unfortunately I can't give a simple example of this at it only happens with very large strings so far, but I'd like to know conceptually what could make this sort of thing happen.
As an aside, I tried rewriting this expression using a positive lookbehind as follows:
bodystring = Regex.Replace(bodystring, "(?<=src='/+)", proxyStr);
But this ends up with proxyStr TWICE in the output if the input string contains "src='//". This also doesn't make much sense to me because I thought that "src=" would have to be present in the input twice in order to get proxyStr to end up twice in the output.
When proxyStr = "10.15.15.15:8008/proxy?url=http://", the replacement string becomes "$110.15.15.15:8008/proxy?url=http://". It contains a reference to group number 110, which certainly does not exist.
You need to make sure that your proxy string does not start with a digit. In your case you can do it by not capturing the last slash, and changing the replacement string to "$1/"+proxyStr, like this:
bodystring = Regex.Replace(bodystring, "(src='/*)/", "$1/" + proxyStr);
Edit:
Rawling pointed out that .NET's regexp library addresses this issue: you can enclose 1 in curly braces to avoid false aliasing, like this:
bodystring = Regex.Replace(bodystring, "(src='/+)", "${1}" + proxyStr);
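To make the difference concrete, here is a minimal sketch that runs both replacement strings against the same input. The class name, method names and the sample proxy value are illustrative assumptions, not taken from the original code base.

```csharp
using System.Text.RegularExpressions;

public static class ProxyRewriteDemo
{
    // "$1" followed by a digit-leading string is read as a longer group number
    // (for example "$110"), which does not exist in this pattern.
    public static string Ambiguous(string body, string proxy) =>
        Regex.Replace(body, "(src='/+)", "$1" + proxy);

    // "${1}" delimits the group number explicitly, so the digits that follow
    // are plain replacement text.
    public static string Braced(string body, string proxy) =>
        Regex.Replace(body, "(src='/+)", "${1}" + proxy);
}
```

With body "<img src='/a.png'>" and proxy "10.0.0.1/p?u=", Braced returns "<img src='/10.0.0.1/p?u=a.png'>". Ambiguous drops the matched "src='/" and, because .NET treats a reference to a nonexistent group as literal text, leaves "$110.0.0.1/p?u=" in the output instead.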
The replacement as written can't work, because of how .NET parses the replacement string. Your problem is that your proxy string starts with a digit: proxyStr = "10.15.15.15:8008/proxy?url=http://"
When you combine this with your $1, the regex engine thinks it has to look for group $110, which doesn't exist.
You can remedy this by matching something else, or by matching and constructing the replacement string manually etc. Use what suits you best.
Based on dasblinkenlight's answer (already +1), the solution is this:
bodystring = Regex.Replace(bodystring, "(src='/+)", "${1}" + proxyStr);
This ensures that group 1 is used, rather than the following digits being read as part of a longer group number.
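In the second version, proxyStr appears twice because the zero-width lookbehind succeeds at two positions when the input contains "src='//": once after the first slash and once after the second. Adding a negative lookahead restricts the match to the position after the last slash. Try
string s2 = Regex.Replace(s, "(?<=src='/+)(?!/)", proxyStr);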
