Parse multiple hostnames from string - c#

I am trying to parse multiple hostnames from a string using a Regex in C#.
Example string: abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk
The code I have been trying is below:
string input = "abc.google.com another example here abc.microsoft.com and another example abc.bbc.co.uk";
string FQDN_Pat = #"^([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))*$";
Regex r = new Regex(FQDN_Pat);
Match m = r.Match(input);
while (m.Success)
{
txtBoxOut.Text += "Match: " + m.Value + " ";
m = m.NextMatch();
}
The code works if the string fits the pattern exactly e.g. abc.google.com.
How can I change the Regex to match the patterns that fit within the example string e.g. so the output would be:
Match: abc.google.com
Match: abc.microsoft.com
Match: abc.bbc.co.uk
Apologies in advance if this is something very simple as my knowledge of regular expressions is not great! :) Thanks!
UPDATE:
Updating the Regex to the following (removing the ^ and $):
string FQDN_Pat = #"([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?)(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA‌​-Z0-9\-]{0,61}[a-zA-Z0-9]))";
Results in the following output:
Match 1: abc.g
Match 2: oogle.c
Match 3: abc.m
Match 4: icrosoft.c
Match 5: abc.b
Match 6: bc.c
Match 7: o.u

As the regexp is quite complicated I tried to simplify it a bit. So what I've done was to
Remove ^ and $ to make the regexp match anywhere
Simplify characters that you match to , so instead of ([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]) i'm using ([a-zA-Z0-9])+ which means look for any alphanumeric sequence with length higher than one (the + sign means that you match to a char that appears once or more). Let's call it X. If the rules for names in FQDN are more complex please modify this value
Expression for finding FQDN is X(\.X)+. This can be viewed as sequence of chars followed by one or more sequences, all are separated by dots (.).
Substitiuting X you have full expression given as
string FQDN_Pat = #"([a-zA-Z0-9]+)(\.([a-zA-Z0-9])+)+";
which actually matches to your example but I suggest you read C# regexp manuals for further references in case there are some tricks in domain names

You get this behavior because you are only matching the string that contain nothing else but your pattern. You are using ^ (start of the string) and $ (end of the string). If you want to match your pattern anywhere in the input string remove those characters from the pattern.

Related

Regex pattern in c# start with # and end with 9;

Need regex pattern that text start with"#" and end with " ";
I tried the below pattern
string pattern = "^[#].*?[ ]$";
but not working
Since is an hex code of tab character, why not just using StartsWith and EndsWith methods instead?
if(yourString.StartsWith("#") && yourString.EndsWith("\\t"))
{
// Pass
}
This patterns works fine. I have tested it.
string pattern = "#(.*?)9";
See below link to test it online.
https://regex101.com/r/iR6nP6/1
C#
const string str = "dadasd#beetween9ddasdasd";
var match = Regex.Match(str, "#(.*?)9");
Console.WriteLine(match.Groups[1].Value);
In regex syntaxt, the [] denotes a group of characters of which the engine will attempt to match one of. Thus, [&#x9] means, match one of an &, #, x or 9 in no particular order.
If you are after order, which seems you are, you will need to remove the []. Something like so should work: string pattern = "^#.*?&#x9$";
you mean something like:
string pattern = "^#.*?[ ]$"
There are also many fine regex expression helpers on the web. for example https://regex101.com/ It gives a nice explanation of how your text will be handled.
You should use \t to match tab character
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09)
Try following Regex
^\#.*\t\;$

Find a string pattern using Regular expression

How can I use regular expressions to find if the string matches a pattern like [sometextornumber] is a [sometextornumber].
For instance, if the input is This is a test, the output should be this and test.
I was thinking something like ([a-zA-Z0-9]) is a([a-zA-Z0-9]) but looks like I am way off the correct path.
Your question is geared towards grabbing the first and last word of a sentence. If this is all you're going to be interested in, this pattern will suffice:
"^(\\w+)|(\\w+)$"
Pattern breakdown:
^ indicates the beginning of a line
^(\\w+) capture group for a word at the beginning of the line. This is equivalent to [a-zA-Z0-9]+, where the + says you want a one or more letters and numbers.
| acts as an OR operator in Regex
$ indicates the end of a line
(\\w+)$ capture group for a word at the end of the line. This is equivalent to [a-zA-Z0-9]+, where the + says you want a one or more letters and numbers.
This pattern allows you to ignore what's in between the first and last word, so it doesn't care about "is a", and give you one capture group to pull from.
Usage:
string data = "This is going to be a test";
Match m = Regex.Match(data, "^(\\w+)|(\\w+)$");
while (m.Success)
{
Console.WriteLine(m.Groups[0]);
m = m.NextMatch();
}
Results:
This
test
If you're really only interested in the first and last word of a sentence, you also don't need to bother with Regex. Just split the sentence by a space and grab the first and last element of the array.
string[] dataPieces = data.Split(' ');
Console.WriteLine(dataPieces[0]);
Console.WriteLine(dataPieces[dataPieces.Length - 1]);
And the results are the same.
References:
https://msdn.microsoft.com/en-us/library/hs600312(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
Try this:
([a-zA-Z0-9])+ is a ([a-zA-Z0-9])+
Edit:
You need a space after the a since it is another word. Without the + it will only match from last letter of the first word till the first letter of the last word. The + will match 1 or more of whatever is in the (), so in this case the whole word.
If you're looking to match a specific pattern such as "This" or "Test" you can simply do a case insensitive string compare.
From your question, I'm not sure that you necessarily need a regular expression here.
Here is a quick LINQpad:
var r = new Regex("(.*) is a (.*)");
var match = r.Match("This is a test");
match.Groups.OfType<Group>().Skip(1).Select(g=>g.Value).Dump();
That outputs:
IEnumerable<String> (2 items)
This
test

How to use regex to match anything from A to B, where B is not preceeded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various test online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the a-letter, white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.
ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use following regex declaration:
(?:\\"|[^"])+
var rx = Regex(#"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(#"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the # in front of the string literal that lets us use verbatim string literals where we have to double quotes to match literal quotes and use single escape slashes instead of double. This makes regexps easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(#"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(#"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, see RegexHero test resuls:
The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
Zero or more occurences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is non-greedy so the final quote is matched by the literal " and not the "any single character" . part.
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results:
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
const string pattern = #"""(?<Result>(\\""|.)*)""";
const RegexOptions options = RegexOptions.ExplicitCapture;
Regex regex = new Regex(pattern, options);
var matches = regex.Match(input);
var substring = matches.Groups["Result"].Value;
return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option, this pre-compiles the expression and gives you faster throughput at the expense of longer initialization.

How do I exclude a regex value in a replace

I have a regex expression which searches for strings using a Prefix and a Suffix. In it's simplest form \$\$\w+\$\$ will find $$My_Name$$ (in this case the Prefix and Suffix are both equal to $$) Another example would be \[\#\w+\#\] to match [#My_Name#].
The Prefix and Suffix will always be a specific string of 0 to n characters which I can always safely escape for a direct character match.
I extract the Matches in my C# code so I can work with them but obviously my match contains $$My_Name$$ but what I want is to simply get the inner string between the Suffix and Prefix: My_Name.
How do I exclude the Prefix and Suffix from the result?
Change your REGEX to \$\$(\w+)\$\$ and use $1 to get the matching (inner) string.
For example
string pattern = #"\$\$(\w+)\$\$";
string input = "$$My_Name$$";
Regex rgx = new Regex(pattern);
Match result = rgx.Match(input);
Console.WriteLine(result.Groups[1]);
Outputs: "My Name"
P.S - There's no need to use explicitly typed local variables, but I just wanted the types to be clear.
You can group your w+ into a group like this (w+) then when you retrieve the matched string you might be able to ask for that subgroup.
I do not know if I am wrong (but you didn't provided any code whatsoever) but I think this is how it is done: .Groups[1].Value on the the result of Regex.Match.
How about the regex below. It works by capturing the first character into a named group then capturing any repeats into a named group called first group which it then uses to match the end of the string. It will work with any number of repeated character so long as they repeated at the end of the word.
'(?<first_group>(?<first_char>.)\k<first_char>+)(?<word>\w+)\k<first_group>+'
You just need to then extract the capture group named word like so:
String sample = "$$My_Name$$";
Regex regex = new Regex("(?<first_group>(?<first_char>.)\k<first_char>+)(?<word>\w+)\k<first_group>+");
Match match = regex.Match(sample);
if (match.Success)
{
Console.WriteLine(match.Groups["word"].Value);
}
You can use named group like this:
(\$\$)(?<group1>.+?)\1 -- pattern 1 (first case)
\[(#)(?<group2>.+?)\1\] -- pattern 2 (second case)
or combined representation would be:
(\$\$)(?<group1>.+?)\1|\[(#)(?<group2>.+?)\3\]
I would suggest you to use .+? it will help you to match any character other than your prefix/suffix.
Live Demo

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\ (" or
"\ )".
However using the following code (removed error checking for conciseness) matches quite something different:
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)
You want the Groups member of the first match. In your example case there is only 1 match, which is the whole string. In the Groups collection you will have 3 items. Try this sample code, left should be A, and right should be B. If you look at the group[0] value it will be the whole string.
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;
\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = #"\((\w+),(\w+)\)";
if words may be empty:
string pattern = #"\((\w*),(\w*)\)";
+: means one or more repetitions.
*: means zero, one or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.
I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains a list of strings that matched your entire regex, not just the parenthetical groups inside that Regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = #"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = #"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;
Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern. The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/

Categories