Using RegEx to replace invalid characters - c#

I have a directory with lots of folders and sub-folders, all with files in them. The idea of my project is to recurse through the entire directory, gather up all the names of the files and replace invalid characters (invalid for a SharePoint migration).
However, I'm completely unfamiliar with Regular Expressions. The characters I need to get rid of in file names are: ~, #, %, &, *, { }, \, /, :, < >, ?, -, | and "
I want to replace these characters with a blank space. I was hoping to use a string.replace() method to look through all these file names and do the replacement.
So far, the only code I've gotten to is the recursion. I was thinking of the recursion scanning the drive, fetching the names of these files and putting them in a List<string>.
Can anybody help me with how to find/replace invalid chars with RegEx with those specific characters?

string pattern = "[\\\\~#%&*{}/:<>?|\"-]"; // "\\\\" yields one literal backslash inside the character class
string replacement = " ";
Regex regEx = new Regex(pattern);
string sanitized = Regex.Replace(regEx.Replace(input, replacement), @"\s+", " ");
This will replace runs of whitespace with a single space as well.

Is there a way to get rid of extra spaces?
Try something like this:
string pattern = " *[\\\\~#%&*{}/:<>?|\"-]+ *"; // "\\\\" yields one literal backslash inside the character class
string replacement = " ";
Regex regEx = new Regex(pattern);
string sanitized = regEx.Replace(input, replacement);
Consider learning a bit about regular expressions yourself, as it's also very useful in developing (e.g. search/replace in Visual Studio).
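The two answers above can be combined into one small helper. This is only a sketch: the class and method names (`FileNameSanitizer`, `Sanitize`) are made up for illustration, and the final `Trim()` is an extra assumption that neither answer includes. The pattern is written as a verbatim string, where `\\` is a literal backslash for the regex engine and `""` is a double quote.

```csharp
using System.Text.RegularExpressions;

public static class FileNameSanitizer
{
    // Characters listed in the question: ~ # % & * { } \ / : < > ? | " -
    // The hyphen comes last in the class so it is taken literally.
    private static readonly Regex Invalid = new Regex(@"[\\~#%&*{}/:<>?|""-]");

    // Runs of whitespace, including the spaces introduced by the replacement.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Sanitize(string name)
    {
        // Step 1: replace every invalid character with a space.
        string spaced = Invalid.Replace(name, " ");
        // Step 2: collapse runs of whitespace, then trim the ends (assumption).
        return Whitespace.Replace(spaced, " ").Trim();
    }
}
```

For example, `FileNameSanitizer.Sanitize("Q3#report~v2.docx")` gives `Q3 report v2.docx`. Renaming the files on disk is a separate step that this sketch leaves out.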

Related

Regex to get multiple filters

I am new to Regex and trying to find all the files with .cs,.json etc.
But, I am just getting only 1 file extension i.e. only 1 filter value.
Code :
string ext = "json|cs|xml";
Regex RegEx = new Regex(@"<(Compile|Content|None) Include=\""([^\""]+." + ext + @")\""( /)?>",RegexOptions.IgnoreCase);
Match match = RegEx.Match(line); //Only takes json, does not take cs or xml
So, here it matches only json file.
Can anyone help me with this regex.
Short answer: you need to add parentheses around the place where you include your ext variable, so that the alternation applies only to the extensions.
Currently, the alternation splits everything inside the second group, so it matches either (any run of characters other than a double quote, followed by any one character and json) OR just cs OR just xml. With the extra parentheses (as below), it matches any run of characters other than a double quote, followed by any one of the extensions you provide.
Replace
<(Compile|Content|None) Include=\""([^\""]+." + ext + @")\""( /)?>
with
<(Compile|Content|None) Include=\""([^\""]+.(" + ext + @"))\""( /)?>
PS. I find Expresso very useful in debugging Regular Expressions. Not affiliated, just been using for quite a number of years.
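Here is a minimal sketch of the corrected pattern in use. The helper name `TryGetInclude` is hypothetical, and there is one deliberate change from the answer: the dot before the extension is escaped as `\.` so that it matches only a literal dot.

```csharp
using System.Text.RegularExpressions;

public static class ProjectIncludeMatcher
{
    // ext is an alternation such as "json|cs|xml"; wrapping it in its own
    // group keeps the | from splitting the rest of the pattern.
    public static bool TryGetInclude(string line, string ext, out string path)
    {
        var regex = new Regex(
            @"<(Compile|Content|None) Include=\""([^\""]+\.(" + ext + @"))\""( /)?>",
            RegexOptions.IgnoreCase);

        Match match = regex.Match(line);
        // Group 2 is the full Include path; group 3 is the extension alone.
        path = match.Success ? match.Groups[2].Value : string.Empty;
        return match.Success;
    }
}
```

With `ext = "json|cs|xml"`, a line such as `<Compile Include="Program.cs" />` matches and yields `Program.cs`, while a `.txt` include does not match.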

Regex pattern in c# start with # and end with &#x9;

Need a regex pattern for text that starts with "#" and ends with "&#x9;".
I tried the below pattern
string pattern = "^[#].*?[&#x9;]$";
but it is not working
Since &#x9; is the hex code of the tab character, why not just use the StartsWith and EndsWith methods instead?
if(yourString.StartsWith("#") && yourString.EndsWith("\t"))
{
// Pass
}
This pattern works fine. I have tested it.
string pattern = "#(.*?)9";
See below link to test it online.
https://regex101.com/r/iR6nP6/1
C#
const string str = "dadasd#beetween9ddasdasd";
var match = Regex.Match(str, "#(.*?)9");
Console.WriteLine(match.Groups[1].Value);
In regex syntax, [] denotes a set of characters, of which the engine will attempt to match one. Thus, [&#x9] means: match one of &, #, x or 9, in no particular order.
If you are after order, which seems you are, you will need to remove the []. Something like so should work: string pattern = "^#.*?&#x9$";
you mean something like:
string pattern = "^#.*?[ ]$"
There are also many fine regex helpers on the web, for example https://regex101.com/. It gives a nice explanation of how your text will be handled.
You should use \t to match tab character
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09)
Try following Regex
^\#.*\t\;$
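If the string really ends with a tab character (rather than the literal text of the entity), the `\t` suggestion can be sketched as below, next to the regex-free alternative. The class and method names are hypothetical, and the ordinal comparisons are an added assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TabSuffixCheck
{
    // In a verbatim string, \t reaches the regex engine unchanged,
    // and the engine interprets it as the tab character (0x09).
    private static readonly Regex HashToTab = new Regex(@"^#.*\t$");

    public static bool ViaRegex(string s)
    {
        return HashToTab.IsMatch(s);
    }

    // Same idea without a regex; here "\t" is the C# escape for a tab.
    public static bool ViaStringMethods(string s)
    {
        return s.StartsWith("#", StringComparison.Ordinal)
            && s.EndsWith("\t", StringComparison.Ordinal);
    }
}
```

The two checks agree for single-line input. They can differ when the string ends with a newline, because `$` in .NET also matches just before a final `\n`.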

Return RegExp C# with linebreak

I’m having a problem with Regular Expressions in C#.
What I have is a string representing a page (HTML etc.). The string also contains \r\n, \r and \n in different places, now I’m trying to match something in the string:
Match currentMatch = Regex.Match(contents, "Title: <strong>(.*?)</strong>");
string org = currentMatch.Groups[1].ToString();
This works fine, however, when I want to match something that has any of the characters mentioned earlier (line breaks) in the string, it doesn’t return anything (empty, no match):
Match currentMatch = Regex.Match(contents, "Description: <p>(.*?)</p>");
string org = currentMatch.Groups[1].ToString();
It does however work if I add the following lines above the match:
contents = contents.Replace("\r", " ");
contents = contents.Replace("\n", " ");
However, I don't like that this modifies the source. What can I do about this?
The . does not match newline characters by default. You can change this, by using the Regex Option Singleline. This treats the whole input string as one line, i.e. the dot matches also newline characters.
Match currentMatch = Regex.Match(contents, "Title: <strong>(.*?)</strong>", RegexOptions.Singleline);
By the way, I hope you are aware that regex is normally not the way to deal with Html?
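Applied to the Description pattern from the question, the Singleline option might be used as in the sketch below. The helper name `Extract` is made up, and returning an empty string when nothing matches is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class DescriptionExtractor
{
    public static string Extract(string contents)
    {
        // Singleline lets . match \n as well, so the lazy (.*?) can span
        // line breaks without modifying the source string.
        Match m = Regex.Match(contents, "Description: <p>(.*?)</p>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

The captured group keeps the original line breaks, so `contents` itself stays untouched.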

regex to find a word or words between spaces

I want to find the words in a sentence between spaces, that is, everything from the first space before the search word to the first space after it.
For example, "This is anexampleof what I want" should return "anexampleof" if my search word is "example".
I now have this regex "(?:^|\S*\s*)\S*" + searchword + "\S*(?:$|\s*\S*)" but this gives me an extra word in the beginning and the end.
'This is anexampleof what I want' --> returns 'is anexampleof what'
I tried to change the regex but I'm not good at it at all..
I'm using c#. Thx for the help.
Full C# code:
MatchCollection m1 = Regex.Matches(content, @"(?:^|\S*\s*)\S*" + searchword + @"\S*(?:$|\s*\S*)",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
You can simply leave out the non-capturing groups at the end:
@"\S*" + searchword + @"\S*";
Due to greediness you will get as many non-space characters on each side as possible.
Also, the idea of non-capturing groups is not that they are excluded from the match. All they do is not produce captures of sub-matches. If you want to check that something is there, but don't want to include it in the match, you need lookarounds:
@"(?<=^|\S*\s*)\S*" + searchword + @"\S*(?=$|\s*\S*)"
However these lookarounds don't really do anything in this case, because \s*\S* is satisfied with an empty string (because * makes both characters optional). But just for further reference... if you want to make assertions at the boundary of your match, which should not be part of the match... lookarounds are the way to go.
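The simplified pattern can be tried out with a sketch like this one. The `WordFinder` class is hypothetical, and the call to `Regex.Escape` is an addition that is not in the answer; it only matters if the search word contains regex metacharacters.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFinder
{
    public static List<string> WordsContaining(string content, string searchword)
    {
        // \S* on both sides greedily extends the match to the surrounding
        // whitespace, so the whole "word" around the search term is returned.
        var regex = new Regex(
            @"\S*" + Regex.Escape(searchword) + @"\S*",
            RegexOptions.IgnoreCase);

        var result = new List<string>();
        foreach (Match m in regex.Matches(content))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

For the sentence from the question and the search word `example`, this yields the single match `anexampleof`.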

c# Regular expression for words in brackets with separator

I need to parse a text and check if between all squared brackets is a - and before and after the - must be at least one character.
I tried the following code, but it doesn't work. The match count is too large.
Regex regex = new Regex(@"[\.*-.*]");
MatchCollection matches = regex.Matches(textBox.Text);
SampleText:
Node
(Entity [1-5])
Figured I might as well provide an answer... To reiterate my points (with modifications):
* matches 0 or more occurrences. You probably want +.
Square brackets are special characters and will need to be escaped. They are used to define sets of characters.
You will probably want to exclude [ and ] from your "any character" matching.
Put this all together and the following should do you better:
Regex regex = new Regex(@"\[[^-[\]]+-[^[\]]+\]");
Although it's a little messy, the key thing is that [^[\]] means any character except a square bracket. [^-[\]] means the same but also disallows -. This is an optimisation and not required; it just reduces the work the regular expression engine has to do when working out the match. Thanks to ridgerunner for pointing out this optimisation.
Square brackets mean something special in Regexes, you'll need to escape them. Additionally, if you want at least one character then you need to use + rather than *.
Regex regex = new Regex(@"\[.+-.+\]");
MatchCollection matches = regex.Matches(textBox.Text);
string txt = "(Entity [1-5])";
Regex reg = new Regex(@"\[.+\-.+\]");
If it is only for numbers:
string txt = "(Entity [1-5])";
Regex reg = new Regex(@"\[\d+\-\d+\]");
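To compare the suggestions against the sample text, a small sketch such as the following may help. The class name is hypothetical, and the pattern is a variant of the first answer with every special character inside the classes escaped explicitly, so that it cannot be mistaken for .NET's character class subtraction syntax.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketRangeFinder
{
    // \[ ... \]     literal square brackets
    // [^\-\[\]]+    one or more characters that are not -, [ or ]
    // -             the separator
    // [^\[\]]+      one or more characters that are not [ or ]
    private static readonly Regex Range = new Regex(@"\[[^\-\[\]]+-[^\[\]]+\]");

    public static List<string> Find(string text)
    {
        var result = new List<string>();
        foreach (Match m in Range.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

On the sample text this finds `[1-5]`, while inputs such as `[-5]`, `[1-]` or `[]` produce no match because each side of the hyphen needs at least one character.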
