Return RegExp C# with linebreak - c#

I’m having a problem with Regular Expressions in C#.
What I have is a string representing a page (HTML etc.). The string also contains \r\n, \r and \n in different places. I’m trying to match something in the string:
Match currentMatch = Regex.Match(contents, "Title: <strong>(.*?)</strong>");
string org = currentMatch.Groups[1].ToString();
This works fine, however, when I want to match something that has any of the characters mentioned earlier (line breaks) in the string, it doesn’t return anything (empty, no match):
Match currentMatch = Regex.Match(contents, "Description: <p>(.*?)</p>");
string org = currentMatch.Groups[1].ToString();
It does however work if I add the following lines above the match:
contents = contents.Replace("\r", " ");
contents = contents.Replace("\n", " ");
However, I don’t like that this modifies the source. What can I do about this?

By default, the . matches any character except \n, so a match cannot span a line break. You can change this with the regex option Singleline. It treats the whole input string as one line, i.e. the dot also matches newline characters.
Match currentMatch = Regex.Match(contents, "Title: <strong>(.*?)</strong>", RegexOptions.Singleline);
By the way, I hope you are aware that regex is normally not the way to deal with HTML?
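A minimal sketch of the difference, applied to the Description pattern from the question. The sample contents string and the class name are made up for illustration; only Regex.Match and RegexOptions.Singleline come from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

class SinglelineDemo
{
    static void Main()
    {
        // Hypothetical page fragment with a Windows-style line break inside the <p> element.
        string contents = "Description: <p>first line\r\nsecond line</p>";
        string pattern = "Description: <p>(.*?)</p>";

        // Default options: the dot stops at \n, so the pattern cannot reach </p>.
        Match defaultMatch = Regex.Match(contents, pattern);

        // Singleline: the dot also matches \n, so the match spans the line break.
        Match singlelineMatch = Regex.Match(contents, pattern, RegexOptions.Singleline);

        Console.WriteLine(defaultMatch.Success);     // prints "False"
        Console.WriteLine(singlelineMatch.Success);  // prints "True"

        // The captured text still contains the original line break; the source is untouched.
        Console.WriteLine(singlelineMatch.Groups[1].Value.Contains("\n")); // prints "True"
    }
}
```

The contents string is never modified, which was the concern in the question.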

Related

C# Replacing \" with " - double quote backslash with double quote and others

I'm trying to parse some HTML that has a bunch of escaped chars inside it, a lot of
\t, \n, \r, and every double quote is escaped by a backslash. Sample HTML:
<div id=\"error-modal\" title=\"Retrieving Document Error\" class=\"text-hide\">\n We're sorry, we were unable to retrieve your requested document or image.</div>
I'm trying to replace these characters by doing this:
var xpar = new XML.Parser(wConn.RawString.Replace("\\n", "").Replace("\\t", "").Replace("\\r","").Replace("\\\"", "\""));
The parser errs out because there's something else in the HTML it doesn't like, but in the exception the string is the same as it was before, the backslashes are all still there. What am I doing wrong?
The problem is that the replacement method treats \n, \r and \t as escape codes and not as the literal text that you want.
You can use a regular expression to achieve that.
var patternToMatch = "\\\\(n|r|t|\\\")";
var replacement = "";
var escapedString = Regex.Replace(inputString, patternToMatch, replacement);
Modify the pattern to match your requirements, but basically this expression can solve your problem.
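Note that with an empty replacement, the pattern above also deletes the double quote itself, not just the backslash in front of it. A hedged sketch of one way to drop \n, \r and \t while turning \" back into ", assuming the backslashes are literally present in the data. The sample string is shortened from the question and the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class UnescapeDemo
{
    static void Main()
    {
        // Verbatim string: every backslash below is a literal character in the data.
        string raw = @"<div id=\""error-modal\"">\n We're sorry.</div>";

        // Match a backslash followed by n, r, t or a double quote.
        // Keep the quote, drop everything else.
        string cleaned = Regex.Replace(
            raw,
            @"\\([nrt""])",
            m => m.Groups[1].Value == "\"" ? "\"" : "");

        Console.WriteLine(cleaned); // prints: <div id="error-modal"> We're sorry.</div>
    }
}
```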

Parse multiple hostnames from string

I am trying to parse multiple hostnames from a string using a Regex in C#.
Example string: abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk
The code I have been trying is below:
string input = "abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk";
string FQDN_Pat = @"^([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))*$";
Regex r = new Regex(FQDN_Pat);
Match m = r.Match(input);
while (m.Success)
{
txtBoxOut.Text += "Match: " + m.Value + " ";
m = m.NextMatch();
}
The code works if the string fits the pattern exactly e.g. abc.google.com.
How can I change the Regex to match the patterns that fit within the example string e.g. so the output would be:
Match: abc.google.com
Match: abc.microsoft.com
Match: abc.bbc.co.uk
Apologies in advance if this is something very simple as my knowledge of regular expressions is not great! :) Thanks!
UPDATE:
Updating the Regex to the following (removing the ^ and $):
string FQDN_Pat = @"([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?)(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))";
Results in the following output:
Match 1: abc.g
Match 2: oogle.c
Match 3: abc.m
Match 4: icrosoft.c
Match 5: abc.b
Match 6: bc.c
Match 7: o.u
As the regexp is quite complicated, I tried to simplify it a bit. What I've done was to:
Remove ^ and $ to make the regexp match anywhere.
Simplify the characters that you match, so instead of ([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]) I'm using ([a-zA-Z0-9])+, which means look for any alphanumeric sequence of length one or more (the + sign means the preceding character class appears once or more). Let's call it X. If the rules for names in an FQDN are more complex, please modify this value.
The expression for finding an FQDN is X(\.X)+. This can be viewed as a sequence of characters followed by one or more further sequences, all separated by dots (.).
Substituting X, you have the full expression given as
string FQDN_Pat = @"([a-zA-Z0-9]+)(\.([a-zA-Z0-9])+)+";
which actually matches your example, but I suggest you read the C# regexp manuals for further reference in case there are some tricks in domain names.
You get this behavior because you are only matching a string that contains nothing but your pattern. You are using ^ (start of the string) and $ (end of the string). If you want to match your pattern anywhere in the input string, remove those characters from the pattern.
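A sketch that combines both answers: no anchors, but the original label rule (alphanumeric at both ends, hyphens allowed inside) is kept. The way the pattern is assembled from a label variable and the class name are illustrative choices, not something taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class HostnameDemo
{
    static void Main()
    {
        string input = "abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk";

        // One label: starts and ends with an alphanumeric, hyphens allowed in between.
        string label = @"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?";

        // A hostname here is a label followed by one or more ".label" parts.
        // There is no ^ or $, so it can match anywhere in the input.
        string pattern = label + @"(?:\." + label + @")+";

        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine("Match: " + m.Value);
        }
        // Prints:
        // Match: abc.google.com
        // Match: abc.microsoft.com
        // Match: abc.bbc.co.uk
    }
}
```

Plain words such as "another" are skipped because the pattern requires at least one dot followed by a further label.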

Regex.Replace removes '\r' character in "\r\n"

Here is a simple example
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, @"(?<=parameter\s*=).*", newValue.ToString());
text will be "parameter=250\n" after replacement. The Replace() method removes '\r'. Does it use Unix-style line feeds by default? Adding \b to my regex, (?<=parameter\s*=).*\b, solves the problem, but I suppose there should be a better way to parse lines with Windows-style line feeds.
Take a look at this answer. In short, the period (.) matches every character except \n in pretty much all regex implementations. Nothing to do with Replace in particular - you told it to remove any number of ., and that will slurp up \r as well.
Can't test now, but you might be able to rewrite it as (?<=parameter\s*=)[^\r\n]* to explicitly state which characters you want disallowed.
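A quick sketch of that suggestion, reusing the text and newValue from the question. The class name is hypothetical; the point is only that [^\r\n]* stops before the carriage return, whereas . matches it.

```csharp
using System;
using System.Text.RegularExpressions;

class PreserveLineEndingDemo
{
    static void Main()
    {
        string text = "parameter=120\r\n";
        int newValue = 250;

        // [^\r\n]* consumes the value but stops before \r, so the line ending survives.
        text = Regex.Replace(text, @"(?<=parameter\s*=)[^\r\n]*", newValue.ToString());

        Console.WriteLine(text == "parameter=250\r\n"); // prints "True"
    }
}
```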
. by default doesn't match \n. If you want it to match, you have to use single-line mode:
(?s)(?<=parameter\s*=).*
The leading (?s) toggles single-line mode.
Try this:
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, @"(parameter\s*=).*\r\n", "${1}" + newValue.ToString() + "\n");
Final value of text:
parameter=250\n
Match carriage return and newline explicitly. Will only match lines ending in \r\n.

parse string with regex

I am trying to search a string for email addresses, but my regex does not work when the string contains characters other than the email. Meaning, if I try on a small string like "me@email.com", the regex finds a match. If I insert a blank space in the string, like " me@mail.com", the regex does not find an email match.
Here is my code (the regex pattern is from the web):
string emailpattern = @"^(([^<>()[\]\\.,;:\s@\""]+"
+ @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@"
+ @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
+ @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
+ @"[a-zA-Z]{2,}))$";
Regex rEmail = new Regex(emailpattern);
string str = @" me@mail.com";
MatchCollection mcolResults = rEmail.Matches(str);
MessageBox.Show(mcolResults.Count.ToString());
Please let me know what I am doing wrong.
Thank you.
Best regards,
^ and $ mean (respectively) the start and end of the input text (or line in multi-line mode) - generally used to check that the entire text (or line) matches the pattern. So if you don't want that, take them away.
Remove the ^ and the $ from the beginning and the end. They mean "Start of string" and "End of string" respectively.
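A sketch of that change: the same pattern as in the question with only the leading ^ and trailing $ removed, run against the string with the leading space. The class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class EmailSearchDemo
{
    static void Main()
    {
        // Pattern from the question, minus the ^ and $ anchors.
        string emailPattern = @"(([^<>()[\]\\.,;:\s@\""]+"
            + @"(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@"
            + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
            + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
            + @"[a-zA-Z]{2,}))";

        string str = " me@mail.com";

        // Without anchors the address is found even though the string starts with a space.
        foreach (Match m in Regex.Matches(str, emailPattern))
        {
            Console.WriteLine(m.Value); // prints "me@mail.com"
        }
    }
}
```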
Are you learning how to use regex, or do you actually need to parse email addresses?
There is a class designed specifically for this: MailAddress.
Here is the MSDN documentation: http://msdn.microsoft.com/en-us/library/591bk9e8.aspx
When you initialize it with a string that holds a mail address that is not in the correct format, a FormatException will be thrown.
Good luck!
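A minimal sketch of the MailAddress approach, assuming you have one candidate string at a time. IsValidAddress is a hypothetical helper name. MailAddress validates a single address, it does not search a longer text, so the candidate is trimmed first.

```csharp
using System;
using System.Net.Mail;

class MailAddressDemo
{
    // Hypothetical helper: true when MailAddress accepts the candidate.
    // Only FormatException is handled; null or empty input is not covered here.
    static bool IsValidAddress(string candidate)
    {
        try
        {
            var address = new MailAddress(candidate.Trim());
            return address.Address.Length > 0;
        }
        catch (FormatException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsValidAddress(" me@mail.com"));
        Console.WriteLine(IsValidAddress("not an address"));
    }
}
```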
First obvious problem: your expression only matches email addresses at the start of a string.
You need to drop the ^ at the start.
^ matches the start of a string.
The regex is correct. E-mail addresses don't contain whitespace.
You can use escapes like \s in your regex in order to match whitespace, or you can do str.Trim() to fix your string before trying to match against it.

Using RegEx to replace invalid characters

I have a directory with lots of folders, sub-folder and all with files in them. The idea of my project is to recurse through the entire directory, gather up all the names of the files and replace invalid characters (invalid for a SharePoint migration).
However, I'm completely unfamiliar with Regular Expressions. The characters I need to get rid of in filenames are: ~, #, %, &, *, { } , \, /, :, <>, ?, -, | and ""
I want to replace these characters with a blank space. I was hoping to use a string.replace() method to look through all these file names and do the replacement.
So far, the only code I've gotten to is the recursion. I was thinking of the recursion scanning the drive, fetching the names of these files and putting them in a List<string>.
Can anybody help me with how to find/replace invalid chars with RegEx with those specific characters?
string pattern = "[\\\\~#%&*{}/:<>?|\"-]";
string replacement = " ";
Regex regEx = new Regex(pattern);
string sanitized = Regex.Replace(regEx.Replace(input, replacement), @"\s+", " ");
This will replace runs of whitespace with a single space as well.
Is there a way to get rid of extra spaces?
Try something like this:
string pattern = " *[\\\\~#%&*{}/:<>?|\"-]+ *";
string replacement = " ";
Regex regEx = new Regex(pattern);
string sanitized = regEx.Replace(input, replacement);
Consider learning a bit about regular expressions yourself, as it's also very useful in developing (e.g. search/replace in Visual Studio).
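Putting the two answers together, a hedged sketch of a helper that replaces the listed characters and collapses the resulting spaces. Sanitize and the sample file name are hypothetical. The character class is written as a verbatim string, where \\ is a literal backslash and "" is a double quote.

```csharp
using System;
using System.Text.RegularExpressions;

class FileNameSanitizer
{
    // Hypothetical helper: replaces each run of invalid characters with one space,
    // then collapses repeated whitespace and trims the ends.
    static string Sanitize(string name)
    {
        string replaced = Regex.Replace(name, @"[\\~#%&*{}/:<>?|""-]+", " ");
        return Regex.Replace(replaced, @"\s+", " ").Trim();
    }

    static void Main()
    {
        Console.WriteLine(Sanitize("Q3 {draft} - 50% & more.txt")); // prints "Q3 draft 50 more.txt"
    }
}
```

Each name in the List<string> gathered by the recursion could be passed through such a helper before renaming.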
