C# Capturing the first match with regex

I've got an input string that looks like this:
url=https%3A%2F%2Fdomain.com%2Fsale-deal%3Futm_source%3Dinsider-primary-action%3Dinsider-primary-action&utm_source=FB
or
url=https%3A%2F%2Fdomain.com%2Fsale&utm_source=FB&sub_id1=M12
The string sometimes contains %3Futm_source and sometimes does not.
How can I get the link between url= and either %3Futm_source or &utm_source?
Regex reg = new Regex(@"url=(https%3A%2F%2Fdomain.com[a-zA-Z0-9-_/%\.]+)%3Futm_source|&utm_source");
Match result = reg.Match(inPut);
Console.WriteLine(result.Groups[1].Value);
It always captures everything from url= up to &utm_source.

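You can use this:
(?<=url=).*?(?=%3Futm_source|&utm_source)
(?<=url=) - Positive lookbehind. Asserts that url= appears immediately before the match, without including it.
.*? - Lazily matches any characters except a newline, stopping at the first position where the lookahead succeeds.
(?=%3Futm_source|&utm_source) - Positive lookahead. Asserts that %3Futm_source or &utm_source follows, without including it.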
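As a minimal sketch of how the lookaround pattern above behaves on the two sample inputs from the question, the following wraps it in a small helper. The class name UrlExtractor and the method ExtractEncodedUrl are hypothetical names chosen for illustration; the final Uri.UnescapeDataString call is an optional extra step, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Lazy match between "url=" and the first "%3Futm_source" or "&utm_source".
    private static readonly Regex UrlPattern =
        new Regex(@"(?<=url=).*?(?=%3Futm_source|&utm_source)");

    // Hypothetical helper: returns the still percent-encoded URL, or null if nothing matches.
    public static string ExtractEncodedUrl(string input)
    {
        Match m = UrlPattern.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string a = "url=https%3A%2F%2Fdomain.com%2Fsale-deal%3Futm_source%3Dinsider-primary-action%3Dinsider-primary-action&utm_source=FB";
        string b = "url=https%3A%2F%2Fdomain.com%2Fsale&utm_source=FB&sub_id1=M12";

        Console.WriteLine(ExtractEncodedUrl(a)); // https%3A%2F%2Fdomain.com%2Fsale-deal
        Console.WriteLine(ExtractEncodedUrl(b)); // https%3A%2F%2Fdomain.com%2Fsale

        // Optional: decode the percent-encoding afterwards.
        Console.WriteLine(Uri.UnescapeDataString(ExtractEncodedUrl(a))); // https://domain.com/sale-deal
    }
}
```

Because the quantifier is lazy, the match stops at whichever of the two markers comes first, which is why both inputs are handled by the same pattern.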


Simplify Regex grouping

var pattern = @"(?:[P|p]rint\("")(.+)(?:""\);?)";
var input = "Print(\"Hello World\");";
This results in two groups. The second one captures exactly what I want, and the first one is completely useless. How do I remove the first one?
I tried (?:ABC), but it didn't work.
Your pattern uses one capturing group, (), and two non-capturing groups, (?:). Note that Groups[0] always holds the entire match and cannot be removed; the captured text is in Groups[1].
You can omit both non-capturing groups, as well as the | inside the character class, where it matches a literal pipe rather than acting as "or". You probably also want to make .+ non-greedy, as .+?, to prevent overmatching.
Your pattern could then look like this (matching an optional semicolon at the end):
[Pp]rint\("(.+?)"\);?
You might also use a negated character class to match anything that is not a double quote:
[Pp]rint\("([^"]+)"\);?
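To illustrate the overmatching mentioned above, here is a small sketch that runs a greedy, a lazy, and a negated-class pattern against an input with two Print calls on one line. The input string and the helper name Capture are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrintArgument
{
    // Hypothetical helper: returns capture group 1 of the first match, or null.
    public static string Capture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string input = "Print(\"a\"); Print(\"b\");";

        // Greedy .+ runs to the last closing quote on the line.
        Console.WriteLine(Capture(input, @"[Pp]rint\(""(.+)""\);?"));     // a"); Print("b
        // Lazy .+? stops at the first closing quote followed by ")".
        Console.WriteLine(Capture(input, @"[Pp]rint\(""(.+?)""\);?"));    // a
        // The negated class cannot cross a double quote at all.
        Console.WriteLine(Capture(input, @"[Pp]rint\(""([^""]+)""\);?")); // a
    }
}
```

In all three cases Groups[0] still holds the whole match; only Groups[1] is read here.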
Try the following:
string input = "var input = Print(\"Hello World\");";
string pattern = "[Pp]rint\\(\"(?'message'[^\"]+)";
Match match = Regex.Match(input, pattern);
string message = match.Groups["message"].Value;

Check pattern of HTTP GET using regex

I'm new to regex and am having trouble getting a pattern right.
I have a request whose first line looks like this:
GET /someFolder/someSubfolder/someFile.fileExtenstion?param1=abc HTTP/1.1
I would like to check that it follows the correct pattern,
meaning the first word is GET, followed by a valid URL, followed by HTTP/version.
What I have so far is:
string input = line;
Match match = Regex.Match(input, @"GET /([A-Za-z0-9-.+!*'();:#&=+$,/?%#[]])\ HTTP/1.1",
    RegexOptions.IgnoreCase);
// Check the Match instance.
if (match.Success)
{
    string URL = match.Groups[1].Value;
}
But I get no match.
What am I missing?
You can simplify the regex a lot:
^GET.*HTTP\/1\.1$
^ anchors the regex at the start of the string.
.* matches any sequence of characters except a newline.
$ anchors the regex at the end of the string, ensuring that nothing follows the matched text.
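This is an old question, but it deserves a new answer for anyone looking to correctly match an HTTP start line and extract values from it.
The .* in the pattern above only checks that something sits between GET and HTTP/1.1; it does not capture the URL or reject extra whitespace inside it. Escaping the forward slash is also unnecessary in C#, where \/ and / match the same character.
Here is sample code with named capturing groups: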
var httpRegex = new Regex(@"^(?<method>[a-zA-Z]+)\s(?<url>.+)\sHTTP/(?<major>\d)\.(?<minor>\d+)$");
var match = httpRegex.Match("GET http://www.google.com HTTP/1.1");
if (match.Success)
{
    Console.WriteLine(
        $"Method: {match.Groups["method"].Value}\r\n" +
        $"Url: {match.Groups["url"].Value}\r\n" +
        $"httpVersion: HTTP/{match.Groups["major"].Value}.{match.Groups["minor"].Value}"
    );
}
Escaping the forward slash is required in languages such as PHP and JavaScript, where / delimits the regex literal. The same pattern for PHP, with escaping: https://regex101.com/r/2l7k83/1/

How to replace raw urls inside paragraphs to html links using regex

How can I change an absolute URL inside a paragraph:
<p>http://www.google.com</p>
into an HTML link inside the paragraph:
<p><a href="http://www.google.com">http://www.google.com</a></p>
There can be a lot of paragraphs. I want the regex to cut the generic url value out of <p>url</p> and put it into a template like <p><a href="url">url</a></p>.
What is the shortest way to do it? Can it be done with the Regex.Replace() method?
BTW: Regular expression used for absolute urls matching can be like this: ^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$ (taken from msdn)
Try this regex:
(?<!\")(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?(?!\")
to avoid matching URLs that are already enclosed in double quotes, as in <a href="http://www.google.com">.
Sample code:
var inputString = @"<p>http://www.google.com</p><p>my web link</p>";
var pattern = @"(?<url>(?<!"")(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?(?!""))";
var result = Regex.Replace(inputString, pattern, "<a href=\"${url}\">${url}</a>");
Explanation:
(?<!subexpression) Zero-width negative lookbehind assertion.
(?!subexpression) Zero-width negative lookahead assertion.
(?<name>subexpression) Captures the matched subexpression into a named group.
From your regex, remove the leading ^ and the trailing $; together they mean "match the whole input string from start to end".
string regexPattern = @"(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?";
string input = @"<p>http://www.google.com</p>";
var reg = new Regex(regexPattern, RegexOptions.IgnoreCase);
// $0 is a substitution that refers to the text matched by the whole pattern
var output = reg.Replace(input, "<a href=\"$0\">$0</a>");
More about substitutions: http://msdn.microsoft.com/en-us/library/ewy2t5e0.aspx
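The combination of a negative lookbehind and the $0 substitution can be sketched as follows. The URL pattern here is a deliberately simplified stand-in, not the MSDN pattern quoted above, and the names Linkifier and Linkify are hypothetical; it only shows that a raw URL is wrapped while a URL that directly follows a double quote is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Simplified, illustrative URL pattern (an assumption, not the MSDN one):
    // http, https or ftp, not preceded by a double quote,
    // running until whitespace, '<' or '"'.
    private static readonly Regex RawUrl =
        new Regex(@"(?<!"")(?:https?|ftp)://[^\s<""]+", RegexOptions.IgnoreCase);

    // Wraps every raw URL in an anchor; $0 is the whole matched text.
    public static string Linkify(string html)
    {
        return RawUrl.Replace(html, "<a href=\"$0\">$0</a>");
    }

    public static void Main()
    {
        // Raw URL inside a paragraph is turned into a link.
        Console.WriteLine(Linkify("<p>http://www.google.com</p>"));
        // URL inside an existing href follows a double quote, so it is skipped.
        Console.WriteLine(Linkify("<p><a href=\"http://www.google.com\">my web link</a></p>"));
    }
}
```

A URL that appears as the visible text of an existing link is not preceded by a quote, so this sketch would still wrap it; handling that case reliably usually calls for an HTML parser rather than a regex.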

Quick Help With Regex C#

How can I match on the following string: A constant string name, followed by a period, followed by any positive integer, followed by another dot.
For example I want to find anything like this:
SomeText.1.
SomeText.99.
SomeText.100.
SomeText.1002.
Regex.Match(input, @"SomeText\.\d+\.");
Try something like this:
^SomeText\.\d+\.$
To explain:
The ^ means the beginning of the string (or of each line with RegexOptions.Multiline), and $ means the end. This ensures that the entire string matches the expression, rather than something inside it happening to match the pattern.
The SomeText part is self explanatory.
The \. means "match a single .". The \ is required to escape the meaning of the period, which by itself would mean "Any single character"
The \d+ means "One or more digits".
Then the \. again, and finally $ to signify that's where we expect the string to end.
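The difference the anchors make can be seen in a short sketch. The class and method names (SomeTextDemo, IsExact, ContainsMatch) and the test strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SomeTextDemo
{
    private const string Anchored = @"^SomeText\.\d+\.$";
    private const string Unanchored = @"SomeText\.\d+\.";

    // True only when the whole string has the form SomeText.<digits>.
    public static bool IsExact(string s)
    {
        return Regex.IsMatch(s, Anchored);
    }

    // True when the form appears anywhere inside the string.
    public static bool ContainsMatch(string s)
    {
        return Regex.IsMatch(s, Unanchored);
    }

    public static void Main()
    {
        Console.WriteLine(IsExact("SomeText.99."));         // True
        Console.WriteLine(IsExact("xSomeText.99.y"));       // False
        Console.WriteLine(ContainsMatch("xSomeText.99.y")); // True
        Console.WriteLine(IsExact("SomeText.."));           // False: \d+ needs at least one digit
    }
}
```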
If you want to be able to retrieve the number, try:
var exp = new Regex(@"SomeText\.(?<number>\d+)\.", RegexOptions.Compiled);
foreach (string s in allStrings)
{
    var match = exp.Match(s);
    if (match.Success)
    {
        int myNumber = int.Parse(match.Groups["number"].Value);
        // ...
    }
}
Your regex would look like SomeText\.\d+\.
which, in C# code, would be
var result = Regex.Match(stringToMatch, @"SomeText\.\d+\.");

Regular expression with "|"

I need to be able to check for a pattern with | in them. For example an expression like d*|*t should return true for a string like "dtest|test".
I'm no regular expression hero so I just tried a couple of things, like:
Regex Pattern = new Regex("s*\|*d"); //unable to build because of single backslash
Regex Pattern = new Regex("s*|*d"); //argument exception error
Regex Pattern = new Regex(@"s*\|*d"); //returns true when I use "dtest" as input, so incorrect
Regex Pattern = new Regex(@"s*|*d"); //argument exception error
Regex Pattern = new Regex("s*\\|*d"); //returns true when I use "dtest" as input, so incorrect
Regex Pattern = new Regex("s*" + "\\|" + "*d"); //returns true when I use "dtest" as input, so incorrect
Regex Pattern = new Regex(@"s*\\|*d"); //argument exception error
I'm a bit out of options, what should I then use?
I mean this is a pretty basic regular expression I know, but I'm not getting it for some reason.
In regular expressions, * means "zero or more of the pattern before it", e.g. a* means zero or more a, and (xy)* matches strings of the form xyxyxyxy....
To match any characters, you should use .*, i.e.
Regex Pattern = new Regex(@"s.*\|.*d");
(Also, an unescaped | means "or".)
Here . will match any character[1], including |. To avoid this you need to use a character class:
new Regex(@"s[^|]*\|[^d]*d");
Here [^x] means "any character except x".
You may read http://www.regular-expressions.info/tutorial.html to learn more about RegEx.
[1]: Except a new line \n. But . will match \n if you pass the Singleline option. Well this is more advanced stuff...
A | inside a char class will be treated literally, so you can try the regex:
[|]
How about s.*\|.*d?
The problem with your attempts is that you wrote something like s*, which means: match any number of s (including zero). You need to define the characters following the s by using ., as in my example. You can use \w instead to match only word characters (letters, digits, and underscore).
Try this.
string test1 = "dtest|test";
string test2 = "apple|orange";
string pattern = @"d.*?\|.*?t";
Console.WriteLine(Regex.IsMatch(test1, pattern));
Console.WriteLine(Regex.IsMatch(test2, pattern));
Regex Pattern = new Regex(@"s*\|*d"); would work, except that \|* means "zero or more pipes". So you probably want Regex Pattern = new Regex(@"s.*\|.*d");
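A short sketch of why the original attempts returned true for "dtest", and of how the .* version behaves. It uses the d ... | ... t form from the question's example; the class name PipeDemo and its helper are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PipeDemo
{
    // Thin wrapper so each case reads the same way.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // s* and \|* may both match zero characters, so any string containing 'd' matches.
        Console.WriteLine(Matches("dtest", @"s*\|*d"));        // True

        // d, then anything, then a literal pipe, then anything, then t.
        Console.WriteLine(Matches("dtest|test", @"d.*\|.*t")); // True
        Console.WriteLine(Matches("dtest", @"d.*\|.*t"));      // False: no pipe in the input

        // A verbatim string and a regular string with doubled backslashes hold the same pattern.
        Console.WriteLine(@"d.*\|.*t" == "d.*\\|.*t");         // True
    }
}
```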
In JavaScript, if you construct
var regex = /somestuff\otherstuff/;
then backslashes are as you'd expect. But if you construct the very same thing with the other syntax
var regex = new RegExp("somestuff\\otherstuff");
then you have to double all backslashes, because the string literal consumes one level of escaping before the regex engine sees the pattern. The same applies to regular (non-verbatim) string literals in C#. I suspect your first attempt was correct, but you introduced a new problem while solving the old one by running afoul of this issue with single backslashes.
