I am trying to figure out a regex expression for a project, and struggling here.
Here's my sample string:
[link="http://www.cnn.com"]CNN Webpage[/link]
I want to regex.replace the above example to this:
CNN Webpage
I know there is a Regex way to do this. Can anyone help?
I personally prefer using named groups when I can. As you'll see it makes the regex/code a little more maintainable/readable. This also helps with maintenance on the code as the captured groups are no longer being referenced by the index. As you probably know, the index groups will change if you change any preceding capturing groups within the regex.
The named groups will stay consistent through the lifetime of the regex unless you specifically change it.
Regex
\[link=["\u201C](?<href>[^"\u201D]+)["\u201D]\](?<title>[^\[]+)\[/link\]
Regex Demo - Note the regex is different because of the different regex engines, but the regex is equal to the one I present here.
Code
var str = "[link=\"http://www.cnn.com\"]CNN Webpage[/link] OR [link=“http://www.cnn.com”]CNN Webpage[/link]";
var regex = new Regex(#"\[link=[""\u201C](?<href>[^""\u201D]+)[""\u201D]\](?<title>[^\[]+)\[/link\]");
//The ${name} refers to a named capture group in the regex. Makes it a little more readable, and maintainable.
str = regex.Replace(str, "${title}");
Console.WriteLine(str);
Please note that the regex only supports the "smart quotes" if the quotes are used properly, to handle cases where the quotes might be reversed you'd need to do something like this:
\[link=["\u201C\u201D](?<href>[^"\u201D\u201C]+)["\u201D\u201C]\](?<title>[^\[]+)\[/link\]
Just for clarity, the example below shows where this regex would be useful. Notice the last link has the unicode characters messed up. It uses the unicode right quote (\u201D ”) on both sides of the text. This regex will parse the data, but the one at the beginning of the post will not.
var str = "[link=\"http://www.cnn.com\"]CNN Webpage[/link] OR [link=“http://www.cnn.com”]CNN Webpage[/link] OR [link=”http://www.cnn.com”]CNN Webpage[/link]";
var regex = new Regex(#"\[link=[""\u201C\u201D](?<href>[^""\u201D\u201C]+)[""\u201D\u201C]\](?<title>[^\[]+)\[/link\]");
//The ${name} refers to a named capture group in the regex. Makes it a little more readable, and maintainable.
str = regex.Replace(str, "${title}");
Use capturing groups to capture the http link and the content of [link] tag.
Regex:
\[link="([^"]*)"\]([^\[\]]*)\[\/link]
Replacement string:
$2
DEMO
\[link(="[^"]+")\]([^\[]+)\[\/link\]
Try this.Replace by <a href$1 target="_blank">$2</a>.See demo.
http://regex101.com/r/kP8uF5/18
Related
I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various test online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the a-letter, white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.
ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use following regex declaration:
(?:\\"|[^"])+
var rx = Regex(#"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(#"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the # in front of the string literal that lets us use verbatim string literals where we have to double quotes to match literal quotes and use single escape slashes instead of double. This makes regexps easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(#"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(#"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, see RegexHero test resuls:
The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
Zero or more occurences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is non-greedy so the final quote is matched by the literal " and not the "any single character" . part.
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results:
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
const string pattern = #"""(?<Result>(\\""|.)*)""";
const RegexOptions options = RegexOptions.ExplicitCapture;
Regex regex = new Regex(pattern, options);
var matches = regex.Match(input);
var substring = matches.Groups["Result"].Value;
return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option, this pre-compiles the expression and gives you faster throughput at the expense of longer initialization.
I need some help on a problem.
In fact I search to check for an image type by the hexadecimal code.
string JpgHex = "FF-D8-FF-E0-xx-xx-4A-46-49-46-00";
Then I have a condition on
string.StartsWith(pngHex).
The problem is that the "x" characters presents in my "JpgHex" string can be whatever I want.
I think I need a regex to check that but I don't know how!!
Thanks a lot!
I'm not quite clear what exactly you want to do, but the dot '.' character represents any character in Regex.
So the regex "^FF-D8-FF-E0-..-..-4A-46-49-46-00" will probably do the trick. '^' = Start of input.
If you want to allow only hex chars you can use "^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00".
Like I said, I'd need a better idea of what pattern you need to match.
Here are some examples:
Regex rgx =
new Regex(#"^FF-D8-FF-E0-[a-zA-Z0-9]{2}-[a-zA-Z0-9]{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex); // is match will return a bool.
I use [a-zA-Z0-9]{2} to denote two instances of a character, caps or small or a number. So the above regex would match :
FF-D8-FF-E0-aa-zZ-4A-46-49-46-00
FF-D8-FF-E0-11-22-4A-46-49-46-00
.. etc
Based on your need change the regex accordingly so for capitals and numbers only you change to [A-Z0-9]. The {2} denotes two occurrences.
The ^ denotes the string should start with FF and $ means the string should end with 00.
Lets say you wanted to only match two numbers, so you would use \d{2}, the whole thing would look like this:
Regex rgx = new Regex(#"^FF-D8-FF-E0-\d{2}-\d{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex);
How do I know of these magical characters? Simple, there are docs everywhere. See this MSDN page for some basic regex patterns. This page shows some quantifiers, those are things like match one or more or match only one.
Cheat-sheets also come in handy.
A regex would help you; you can use the following tool to help you test and learn: -
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I recommend you have a play because then you'll learn!
To simply match any character in place of the x, the following should work: -
"^FF-D8-FF-E0-..-..-4A-46-49-46-00$"
In C#, it would be something like this: -
var test = "FF-D8-FF-E0-AB-CD-4A-46-49-46-00";
var foo = new Regex("^FF-D8-FF-E0-..-..-4A-46-49-46-00$");
if (foo.IsMatch(test))
{
// Do magic
}
You will need to read up on regular expressions to understand some of the characters that may not look familiar, i.e. ^ and $. See http://www.regular-expressions.info/
I need a regex pattern which will accommodate for the following.
I get a response from a UDP server, it's a very long string and each word is separated by \, for example:
\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7
I need the strings after \player_n\, so in the above example I would need name0, name1 and name3,
I know this is the second regex question of the day but I have the book (Mastering Regular Expressions) on order! Thank you.
UPDATE. elusive's regex pattern will suffice, and I can add the match(0) to a textbox. However, what if I want to add all the matches to the text box ?
textBox1.Text += match.Captures[0].ToString(); //this works fine.
How do I add "all" match.captures to the text box? :s sorry for being so lame, this Regex class is brand new to me .
Try this one:
\\player_\d+\\([^\\]+)
i think that this test sample can help you
string inp = #"\g79g97\g879o\wot87gord\player_0\name0\g6868o\g78og89\g79g79\player_1\name1\gyuvui\yivyil\player_2\name2\g7g87\g67og9o\v78v9i7";
string rex = #"[\w]*[\\]player_[0-9]+[\\](?<name>[A-Za-z0-9]*)\b";
Regex re = new Regex(rex);
Match mat = re.Match(inp);
for (Match m = re.Match(inp); m.Success; m = m.NextMatch())
{
Console.WriteLine(m.Groups["name"]);
}
you can take the name of the player from the m.Groups["name"]
To get only the player name, you could use:
(?<=\\player_\d+\\)[^\\]+
This (?<=\\player_\d+\\) is something called a positive look-behind. It makes sure that the actual match [^\\]+ is preceded by the expression in the parentheses.
In this case, it's even specific to only a few regex engines (.NET being among them, luckily), in that it contains a variable length expression (due to \d+). Most regex engines only support fixed-length look-behind.
In any case, look-behind is not necessarily the best approach to this problem, match groups are simpler easier to read.
I am trying to get all strings enclosed in <*> by using following Regex:
Regex regex = new Regex(#"\<(?<name>\S+)\>", RegexOptions.IgnoreCase);
string name = e.Match.Groups["name"].Value;
But in some cases where I have text like :
<Vendors><Vtitle/> <VSurname/></Vendors>
It's returning two strings instead of four, i.e. above Regex outputs
<Vendors><Vtitle/> //as one string and
<VSurname/></Vendors> //as second string
Where as I am expecting four strings:
<Vendors>
<Vtitle/>
<VSurname/>
</Vendors>
Could you please guide me what change I need to make to my Regex.
I tried adding '\b' to specify word boundry
new Regex(#"\b\<(?<name>\S+)\>\b", RegexOptions.IgnoreCase);
, but that didn't help.
You'll get most of what what you want by using the regex /<([^>]*)>/. (No need to escape the angle brackets' as angle brackets aren't special characters in most regex engines, including the .NET engine.) The regex I provided will also capture trailing whitespace and any attributes on the tag--parsing those things reliably is way, way beyond the scope of a reasonable regex.
However, be aware that if you're trying to parse XML/HTML with a regex, that way lies madness
Regexes are the wrong tool for parsing XML. Try using the System.Xml.Linq (XElement) API.
Your regex is using \S+ as the wildcard. In english, this is "a series of one or more characters, none of which is non-whitespace". In other words, when the regex <(?<name>\S+)> is applied to this string: '`, the regex will match the entire string. angle brackets are non-whitespace.
I think what you want is "a series of one or more characters, none of which is an angle bracket".
The regex for that is <(?<name>[^>]+)> .
Ahhh, regular expressions. The language designed to look like cartoon swearing.
At the moment I am trying to match patterns such as
text text date1 date2
So I have regular expressions that do just that. However, the issue is for example if users input data with say more than 1 whitespace or if they put some of the text in a new line etc the pattern does not get picked up because it doesn't exactly match the pattern set.
Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.
Anyone got any better solutions?
Update
Here is a small example of the pattern I would need to match:
FIND me#test.com 01/01/2010 to 10/01/2010
Here is my current regex:
FIND [A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}
This works fine 90% of the time, however, if users submit this information via email it can have all different kinds of formatting and HTML I am not interested in. I am using a combination of the HtmlAgilityPack and a HTML tag removing regex to strip all the HTML from the email, but even at that I can't seem to get a match on some occassions.
I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...
To match at least one or more whitespace characters (space, tab, newline), use:
\s+
Substitute the above wherever you have the physical space in your pattern and you should be fine.
Example of matching multiple groups in a text with multiple whitespaces and/or newlines.
var txt = "text text date1\ndate2";
var matches = Regex.Match(txt, #"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)", RegexOptions.Singleline);
matches.Groups[n].Value with n from 1 to 4 will contain your matches.
I would split the string into a string array and match each resulting string to the necessary Regular Expression.
\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b
Its a nasty expression but here is something that will work for the input you provided:
^(\w+)\s+([\w#.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$
This will work with variable amounts of whitespace between the capture groups as well.
Through ORegex you can tokenize your string and just pattern match on token sequences:
var tokens = input.Split(new[]{' ','\t','\n','\r'}, StringSplitOptions.RemoveEmptyEntries);
var oregex = new ORegex<string>("{0}{0}{1}{1}", IsText, IsDate);
var matches = oregex.Matches(tokens); //here is your subsequence tokens.
...
public bool IsText(string str)
{
...
}
public bool IsDate(string str)
{
...
}