C# Trouble with Regex.Replace

C# Trouble with Regex.Replace - c#

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (It replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worrying that this will only work for single numbers (\"1\" rather than \"11\").. So that led me into thinking about using the "*" or "+" expression rather than ".", however I foresaw the problem of this picking up all of the text inbetween the desired characters (which are dotted all over the place) whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)

Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.

This works
String input = #"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, #"\\""(\d+)\\""", match => match.Result("$1"));
//control string for excpected result
String shouldBe = #"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints true
Console.WriteLine(result.Equals(shouldBe).ToString());

Related

Regex replace all matching words that do not contain a certain string

How can I use regex to replace matching strings that do not include a specific string?
input string
Keepword mywordsecond mythirdword myfourthwordKeep
string to replace
word
exclude string
Keep
Desired out put
Keepword mysecond mythird myfourthKeep

Will there ever be more than one word in a word? If there are more than one, do you want to replace all of them? If not, this should sort you out:
Regex r = new Regex(#"\b((?:(?!Keep|word)\w)*)word((?:(?!Keep)\w)*)\b");
s1 = r.Replace(s0, "$1$2");
to explain:
First, \b((?:(?!Keep|word)\w)*) captures whatever text precedes the first occurrence of word or Keep.
The next thing it sees must be word, If it sees Keep or the end of the string instead, the match attempt immediately fails.
Then ((?:(?!Keep)\w)*)\b captures the remainder of the text in order to ensure it doesn't contain Keep.
When faced with a problem like this, most users' first impulse is to match (in the sense of consuming) only the part of the string they're interested in, using lookarounds to establish the context. It's usually much easier to write the regex so that it always moves forward through the string as it matches. You capture the parts you want to retain so you can plug them back into the result string by means of group references ($1, $2, etc.).
Given that you're using C#, you could use the lookaround approach:
Regex r = new Regex(#"(?<!Keep\w*)word(?!\w*Keep)");
s1 = r.Replace(s0, "");
But please don't. There are very few regex flavors that support unrestricted lookbehinds like .NET does, and most problems don't work so neatly as this one anyway.

string str = "Keepword mywordsecond mythirdword myfourthwordKeep";
str = Regex.Replace(str, "(?<!Keep)word", "");
And I'm going to link you to a one of good Regular Expressions Cheat sheet here

This works in notepad++:
(?<!Keep)word(?!Keep)
It uses "look ahead".

You can use negative look-behind assertion if you want to remove all "word" that are not proceeded by "Keep":
String input = "Keepword mywordsecond mythirdword myfourthwordKeep";
String pattern = "(?<!Keep)word";
String output = Regex.Replace(input, pattern, "");

regex approach for extracting strings surrounded with double quotes

I have a search string that is getting passed
Eg: "a+b",a, b, "C","d+e",a-b,d
I want to filter out all sub strings surrounded by double quotes("").
In above sample Output should contain:
"a+b","C","d+e"
Is there a way to do this without looping?
Also I then need to extract a string without above values to do further processing
Eg: a,b,a-b,d
Any suggestions on how to do this with minimal performance impact?
Thank you in advance for all your comments and suggestions

Since you didn't say anything about how exactly you want your output (do you need to keep the commas and extra whitespace? Is it comma delimited to begin with? Let's assume that it is NOT comma delimited and you are just trying to remove the occurences of the "xyz":
string strRegex = #"""([^""])+""";
string strTargetString = #" ""a+b"",a, b, ""C"",""d+e"",a-b,d";
string strOutput = Regex.Replace(strTargetString, strRegex, x => "");
Will remove all of the items (leaving the extra commas and whitespace).
If you are trying to do something where you need each individual match then you might want to try:
var y = (from Match m in Regex.Matches(strTargetString, strRegex) select m.Value).ToList<string>();
y.ForEach(s => Console.WriteLine(s));
To get the list of items without the surrounding quotes, you could either reverse the regex pattern OR use the replace method in the first code sample and then split on the commas, trimming white space (again, assuming you are splitting on commas which it sounds like you are)

First, add a comma to the end of your output:
"a+b",a, b, "C","d+e",a-b,d,
Then, use this regular expression:
((?<quoted>\".+?\")|(?<unquoted>.+?)),\s*
Now you have 2 problems. Kidding!
You'll have to find a way of extracting the matches without using a loop, but at least they are separated into quoted and unquoted strings by using the group. You could use a lamdba expression to pull the data out and join it, one each for quoted and unquoted, but it's just doing a loop behind the scenes, and may add more overhead than a simple for loop. It sounds like you're trying to eek out performance here, so time and test each method to see what gives the best results.

What C# regex expression can be used to strip out dots (.) in a string?

I need a string with non alpha-numeric characters etc stripped out of it; I used the following:
wordsstr = Regex.Replace(wordsstr, "[^A-Za-z0-9,-_]", "");
The problem being dots (.)s are left in the string yet they are not specified to be kept. How could I make sure dots are gotten rid of too?
Many thanks.

You are specifying that they need to be kept - you're using ,-_ which is everything from U+002C to U+005F, including U+002E (period).
If you meant the ,-_ to just mean comma, dash and underscore you'll need to escape the dash, such as:
wordsstr = Regex.Replace(input, #"[^A-Za-z0-9,\-_]", "");
Alternatively, (as in Oded's comment) put the dash as the first or last character in the set, to prevent it being interpreted as a range specifier:
wordsstr = Regex.Replace(input, "[^A-Za-z0-9,_-]", "");
If that's not the aim, please be more specific: "non alpha-numeric characters etc" isn't really enough information to go on.

Try the code below:
wordsstr = Regex.Replace(wordsstr, "[^-A-Za-z0-9,_]", "");
Your problem would be easier to understand if you write your expectation and actual result.

Try
wordstr = Regex.Replace(wordstr, "[^A-Za-z0-9,\\-_]", "");
or better if you just want to have alpha-numerical characters:
wordstr = Regex.Replace(wordstr, "[^A-z0-9]", "");
The problem in your first regex is that the - char defines a range, so you have to escape it to make it behave the way you want it to.

Remove all "invisible" chars from a string?

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespaces, tabs, extra line jumps and stuff like that? I can probably use Replace to remove whitespaces and tabs, but, is there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" and line jumps (since line jumps are what separate one pair from another), and then remove all extra line jumps. If this, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show a "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespaces at all, even inside a key or a value.

I did this one recently when I finally got pissed off at too much undocumented garbage forming bad xml was coming through in a feed. It effectively trims off anything that doesn't fall between a space and the ~ in the ASCII table:
static public string StripControlChars(this string s)
{
return Regex.Replace(s, #"[^\x20-\x7F]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.

If you use Regex (Regular Expressions) you can filter out all of that with one function.
string newVariable Regex.Replace(variable, #"\s", "");
That will remove whitespace, invisible chars, \n, and \r.

One of the "white" spaces that regularly bites us is the non-breakable space. Also our system must be compatible with MS-Dynamics which is much more restrictive. First, I created a function that maps the 8th bit characters to their approximate 7th bit counterpart, then I removed anything that was not in the x20 to x7f range further limited by the Dynamics interface.
Regex.Replace(s, #"[^\x20-\x7F]", "")
should do that job.

The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key"?
These problems will result in code filled with one off's and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.

var split = textLine.Split(":").Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.

You can use string.Trim() to remove white-space characters:
var results = lines
.Select(line => {
var pair = line.Split(new[] {':'}, 2);
return new {
Key = pair[0].Trim(),
Value = pair[1].Trim(),
};
}).ToList();
However, if you want to remove all white-spaces, you can use regular expressions:
var whiteSpaceRegex = new Regex(#"\s+", RegexOptions.Compiled);
var results = lines
.Select(line => {
var pair = line.Split(new[] {':'}, 2);
return new {
Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
};
}).ToList();

If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());

Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear",#"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)

Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.

var splits = Regex.Split("aaa $$$bbb ccc", #"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", #"\P{L}+");

Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", #"[A-Za-z]+")
Using Matches() will collect only the words, Split() will divide the string which is not what you want.

The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, email addresses in a string it will.
var regex = new Regex(#"[\p{L}\p{N}\p{M}]+(?:[-.'´_#][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# Trouble with Regex.Replace - c#

Try this var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');"; str = Regex.Replace(str , "\"(\\d+)\"", "$1"); (\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.

Related

Regex replace all matching words that do not contain a certain string

regex approach for extracting strings surrounded with double quotes

What C# regex expression can be used to strip out dots (.) in a string?

Remove all "invisible" chars from a string?

Regex Word splitting in C#

Categories

Resources