Single Regex to match URL - c#

Any ideas how I can use a single regular expression to validate a single url and also match urls in a text block?
var x = "http://myurl.com";
var t = "http://myurl.com ref";
var y = "some text that contains a url http://myurl.com some where";
var expression = "\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";
Regex.IsMatch(x, expression, RegexOptions.IgnoreCase); // returns true;
Regex.IsMatch(t, expression, RegexOptions.IgnoreCase); // returns false;
Regex.Matches(y, expression, RegexOptions.IgnoreCase); // returns http://myurl.com;

First of all, you have to escape correctly: use "\\b..." instead of "\b...", because in a regular C# string literal "\b" is a backspace character, not the regex word boundary. IsMatch also returns true for partial matches. You can check whether the whole input matches like this:
Match match = Regex.Match(x, expression, RegexOptions.IgnoreCase);
if (match.Success && match.Length == x.Length)
// full match
With this check and the escape fix, your expression will work as it is. You can also write a helper method for it:
private bool FullMatch(string input, string pattern, RegexOptions options)
{
Match match = Regex.Match(input, pattern, options);
return match.Success && match.Length == input.Length;
}
Your code will change to this:
var x = "http://myurl.com";
var t = "http://myurl.com ref";
var y = "some text that contains a url http://myurl.com some where";
var expression = "\\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";
FullMatch(x, expression, RegexOptions.IgnoreCase); // returns true;
FullMatch(t, expression, RegexOptions.IgnoreCase); // returns false;
Regex.Matches(y, expression, RegexOptions.IgnoreCase); // returns http://myurl.com;
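As an alternative to comparing lengths, the same pattern can be wrapped in \A(?:...)\z so that the regex engine itself requires the match to span the whole input. This is only a sketch of that variant, not the method from the answer above; the class name UrlRegexDemo and the helper IsWholeUrl are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRegexDemo
{
    // The URL pattern from the question, as a verbatim string so \b reaches the regex engine.
    public const string UrlPattern =
        @"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";

    // Hypothetical helper: \A and \z anchor the pattern to the start and end of the input.
    public static bool IsWholeUrl(string input)
    {
        return Regex.IsMatch(input, @"\A(?:" + UrlPattern + @")\z", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(IsWholeUrl("http://myurl.com"));     // True
        Console.WriteLine(IsWholeUrl("http://myurl.com ref")); // False

        // The unanchored pattern is still used as-is to find URLs inside a text block.
        var text = "some text that contains a url http://myurl.com some where";
        foreach (Match m in Regex.Matches(text, UrlPattern, RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Value);                        // http://myurl.com
        }
    }
}
```

The single pattern is kept in one place; only the validation call adds the anchors.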

I think the word boundary is getting you; \b does not match at a position where both neighbouring characters are non-word characters.
Try this:
var expression = @"(^|\s)(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]($|\s)";
This binds the start of the match to the beginning of the string or a whitespace character, and the end of the match to the end of the string or a whitespace character.
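One thing to keep in mind with (^|\s) and ($|\s) is that the whitespace is consumed and therefore becomes part of the match value. The sketch below compares that pattern with an assumed lookaround variant, (?<!\S) and (?!\S), which checks the same neighbours without consuming them. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlDelimiterDemo
{
    // The URL part of the pattern, without any delimiters.
    private const string Core =
        @"(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";

    // Delimiters as groups: the surrounding whitespace is included in the match.
    public static string MatchWithGroups(string text)
    {
        return Regex.Match(text, @"(^|\s)" + Core + @"($|\s)", RegexOptions.IgnoreCase).Value;
    }

    // Assumed variant: lookarounds only assert that no non-whitespace character is adjacent.
    public static string MatchWithLookarounds(string text)
    {
        return Regex.Match(text, @"(?<!\S)" + Core + @"(?!\S)", RegexOptions.IgnoreCase).Value;
    }

    public static void Main()
    {
        const string text = "see http://myurl.com here";
        Console.WriteLine("[" + MatchWithGroups(text) + "]");      // [ http://myurl.com ]
        Console.WriteLine("[" + MatchWithLookarounds(text) + "]"); // [http://myurl.com]
    }
}
```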
More info: http://www.regular-expressions.info/wordboundaries.html
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".

Related

Remove Adjacent Space near a Special Character using regex

Using a regex, I want to remove the spaces adjacent to a replacement character.
replacementCharacter = '-'
this._adjacentSpace = new Regex($@"\s*([\{replacementCharacter}])\s*");
MatchCollection replaceCharacterMatch = this._adjacentSpace.Matches(extractedText);
foreach (Match replaceCharacter in replaceCharacterMatch)
{
if (replaceCharacter.Success)
{
cleanedText = extractedText.Replace(replaceCharacter.Value, replaceCharacter.Value.Trim());
}
}
Extractedtext = - whi, - ch
cleanedtext = -whi, -ch
expected result : cleanedtext = -whi,-ch

You can use
var Extactedtext = "- whi, - ch";
var replacementCharacter = "-";
var _adjacentSpace = new Regex($@"\s*({Regex.Escape(replacementCharacter)})\s*");
var cleanedText = _adjacentSpace.Replace(Extactedtext, "$1");
Console.WriteLine(cleanedText); // => -whi,-ch
NOTE:
replacementCharacter is of type string in the code above.
$@"\s*({Regex.Escape(replacementCharacter)})\s*" creates a regex like \s*(-)\s*. Regex.Escape() escapes any regex-special character (like +, (, etc.) so it can be used safely in a regex pattern. The whole regex matches the replacementCharacter, captured into Group 1 by the capturing parentheses, together with zero or more whitespace characters on either side.
There is no need to use Regex.Matches; Regex.Replace replaces all matches if there are any.
_adjacentSpace is the Regex object; to replace, just call the .Replace() method on that instance.
The replacement is a backreference to the Group 1 value, the - char here.
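To illustrate why Regex.Escape matters, here is a small sketch that runs the same replacement with "-" and with "+", which is a quantifier unless escaped. The class name AdjacentSpaceDemo and the method RemoveAdjacentSpaces are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdjacentSpaceDemo
{
    // Removes whitespace on both sides of every occurrence of replacementCharacter.
    public static string RemoveAdjacentSpaces(string text, string replacementCharacter)
    {
        // Regex.Escape turns "+" into "\+"; "-" needs no escaping outside a character class.
        var pattern = $@"\s*({Regex.Escape(replacementCharacter)})\s*";
        return Regex.Replace(text, pattern, "$1");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveAdjacentSpaces("- whi, - ch", "-")); // -whi,-ch
        Console.WriteLine(RemoveAdjacentSpaces("a + b", "+"));       // a+b
    }
}
```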

Split a regex matched string with commas?

I have a regex pattern that alters the string being matched:
var output = Regex.Replace(entity.NamingPattern, @"\[(?<token>.+?)\]|(?<word>[^\[\]])", (match) =>
{
var wordMatch = match.Groups["word"];
if (wordMatch.Success) return $"'{wordMatch.Value}'";
return "new."+match.Groups["token"].Value;
});
But is it also possible to ensure that all matched words and tokens are separated with a comma?
So something like this
(something[tester]somethi[worker]some[i]sadas,
is returned as this
'(','s','o','m','e','t','h','i','n','g','new.tester','s','o','m','e','t','h','i','new.worker','s','o','m','e','new.i','s','a','d','a','s',','
The word group matches each character outside the brackets, and the token group matches the content of each pair of square brackets and drops the brackets. But I am not sure how to join the results.

You could add a comma after the word group, and add surrounding single quotes and a comma for the token group.
In the final result, trim the comma at the end of the output using TrimEnd.
The non-greedy .+? in the pattern could also be written as a negated character class, [^[\]]* (which, unlike .+?, also accepts empty brackets).
Pattern
\[(?<token>[^[\]]*)\]|(?<word>[^[\]])
Example code
var s = "(something[tester]somethi[worker]some[i]sadas,";
var output = Regex.Replace(s, @"\[(?<token>[^[\]]*)\]|(?<word>[^[\]])", (match) =>
{
var wordMatch = match.Groups["word"];
if (wordMatch.Success) return $"'{wordMatch.Value}',";
return "'new."+match.Groups["token"].Value+"',";
});
Console.WriteLine(output.TrimEnd(','));
Output
'(','s','o','m','e','t','h','i','n','g','new.tester','s','o','m','e','t','h','i','new.worker','s','o','m','e','new.i','s','a','d','a','s',','
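A possible variation, shown here only as a sketch, is to collect the matches and let string.Join place the commas, which avoids trimming afterwards. It uses the same pattern as above; the class name JoinTokensDemo and the method ToCommaSeparated are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class JoinTokensDemo
{
    // Turns each single character into 'c' and each [token] into 'new.token', joined by commas.
    public static string ToCommaSeparated(string namingPattern)
    {
        var items = Regex.Matches(namingPattern, @"\[(?<token>[^[\]]*)\]|(?<word>[^[\]])")
            .Cast<Match>()
            .Select(m => m.Groups["word"].Success
                ? "'" + m.Groups["word"].Value + "'"
                : "'new." + m.Groups["token"].Value + "'");
        return string.Join(",", items);
    }

    public static void Main()
    {
        Console.WriteLine(ToCommaSeparated("so[tester]me,")); // 's','o','new.tester','m','e',','
    }
}
```

Because the separator is added between items only, a literal comma in the input (the last item here) is not at risk of being trimmed away.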

Find hashtags in string

I am working on a Xamarin.Forms PCL project in C# and would like to detect all the hashtags.
I tried splitting at spaces and checking whether each word begins with a #, but if the post contains two consecutive spaces, like "Hello  #World Test", the double space is lost.
string body = "Example string with a #hashtag in it";
string newbody = "";
foreach (var word in body.Split(' '))
{
if (word.StartsWith("#"))
newbody += "[" + word + "]";
newbody += word;
}
Goal output:
Example string with a [#hashtag] in it
I also want a hashtag to contain only A-Z, a-z, 0-9 and _, stopping at any other character:
Test #H3ll0_W0rld$%Test => Test [#H3ll0_W0rld]$%Test
Other Stack questions try to detect the hashtag and extract it. I would like to work with it and put it back in the string without losing anything, which methods such as splitting on certain characters would do.

You can use Regex with #\w+ and $&
Explanation
# matches the character # literally
\w matches any word character. In .NET this is Unicode-aware by default, so it also matches letters and digits outside ASCII; use [A-Za-z0-9_] or RegexOptions.ECMAScript to restrict it to ASCII
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$& Includes a copy of the entire match in the replacement string.
Example
var input = "asdads sdfdsf #burgers, #rabbits dsfsdfds #sdf #dfgdfg";
var regex = new Regex(@"#\w+");
var matches = regex.Matches(input);
foreach (var match in matches)
{
Console.WriteLine(match);
}
or
var result = regex.Replace(input, "[$&]" );
Console.WriteLine(result);
Output
#burgers
#rabbits
#sdf
#dfgdfg
asdads sdfdsf [#burgers], [#rabbits] dsfsdfds [#sdf] [#dfgdfg]
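Applied to the inputs from the question, the same idea can be sketched as follows. This version uses the explicit class [A-Za-z0-9_] instead of \w to keep hashtags ASCII-only, as the question asks; the class name HashtagDemo and the method BracketHashtags are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashtagDemo
{
    // Wraps every hashtag in square brackets and leaves all other characters untouched.
    public static string BracketHashtags(string body)
    {
        // $& in the replacement string stands for the entire match.
        return Regex.Replace(body, "#[A-Za-z0-9_]+", "[$&]");
    }

    public static void Main()
    {
        Console.WriteLine(BracketHashtags("Example string with a #hashtag in it")); // Example string with a [#hashtag] in it
        Console.WriteLine(BracketHashtags("Test #H3ll0_W0rld$%Test"));              // Test [#H3ll0_W0rld]$%Test
        Console.WriteLine(BracketHashtags("Hello  #World Test"));                   // Hello  [#World] Test
    }
}
```

Since nothing is split and rejoined, consecutive spaces survive unchanged.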

Use a regular expression: \#\w*
string pattern = @"\#\w*";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);

Regex MatchEvaluator doesn't work with words contains "_" underscore

I am trying to match regex results and format the output. I have a list of words, e.g.:
var resultArray = new List<string> {"new", "new_"}; // notice the word with an underscore
But when I search a sentence like this:
New Law_Book_with_New_Cover
it matches the first word "New" but not the one in the middle, "New_". Here is my code:
if (resultArray.Count > 0)
{
string regex = "\\b(?:" + String.Join("|", resultArray.ToArray()) + ")\\b";
MatchEvaluator myEvaluator = new MatchEvaluator(GetHighlightMarkup);
return Regex.Replace(result, regex, myEvaluator, RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnoreCase);
}
private static string GetHighlightMarkup(Match m)
{
return string.Format("<span class=\"focus\">{0}</span>", m.Value);
}
And yes, I did try escaping the word as "\New_", but still no luck.
What am I missing?

\b fails here because the underscore is itself a word character: in "with_New_Cover" there is no word boundary between "_" and "N", and none after "New_" either. It seems you need to match your items only if they are not enclosed with letters.
You may replace the word boundaries in your regex with lookarounds:
string regex = @"(?<!\p{L})(?:" + String.Join("|", resultArray.ToArray()) + @")(?!\p{L})";
where \p{L} matches any letter, (?<!\p{L}) requires the absence of a letter before the match, and (?!\p{L}) disallows a letter after the match.
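The following sketch runs the lookaround pattern on the sentence from the question. Two details are assumptions added for illustration: each word is passed through Regex.Escape before joining, and the class name HighlightDemo and method Highlight are hypothetical. Note that in "with_New_Cover" it is the "new" alternative that matches, because "New_" is followed by the letter "C"; the underscore stays outside the highlighted span.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightDemo
{
    // Wraps every listed word that is not directly surrounded by letters in a span element.
    public static string Highlight(string text, IEnumerable<string> words)
    {
        var alternatives = string.Join("|", words.Select(w => Regex.Escape(w)));
        var pattern = @"(?<!\p{L})(?:" + alternatives + @")(?!\p{L})";
        return Regex.Replace(
            text,
            pattern,
            m => "<span class=\"focus\">" + m.Value + "</span>",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var words = new List<string> { "new", "new_" };
        Console.WriteLine(Highlight("New Law_Book_with_New_Cover", words));
        // <span class="focus">New</span> Law_Book_with_<span class="focus">New</span>_Cover
    }
}
```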

search string for everything before a set of characters in C#

I'm looking for a way to search a string for everything before a set of characters in C#. For Example, if this is my string value:
This is is a test.... 12345
I want to build a new string with all of the characters before "12345".
So my new string would equal "This is is a test.... "
Is there a way to do this?
I've found Regex examples where you can focus on one character but not a sequence of characters.

You don't need to use a Regex:
public string GetBitBefore(string text, string end)
{
var index = text.IndexOf(end);
if (index == -1) return text;
return text.Substring(0, index);
}

You can use a lazy quantifier to match anything, followed by a lookahead:
var match = Regex.Match("This is is a test.... 12345", @".*?(?=\d{5})");
where:
.*? lazily matches everything (up to the lookahead)
(?=…) is a positive lookahead: the pattern must be matched, but is not included in the result
\d{5} matches exactly five digits. I'm assuming this is your lookahead; you can replace it

You can do so with help of regex lookahead.
.*(?=12345)
Example:
var data = "This is is a test.... 12345";
var rxStr = ".*(?=12345)";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var match = rx.Match(data);
if (match.Success) {
Console.WriteLine (match.Value);
}
The code snippet above prints everything up to 12345 (since .* is greedy, up to the last occurrence if there are several):
This is is a test....
For more detail, see a regex reference on positive lookahead.
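The lazy .*? and the greedy .* give the same result when the marker occurs once, but differ when it repeats. A minimal sketch of the difference, with hypothetical names (BeforeMarkerDemo, BeforeFirst, BeforeLast) and the marker passed through Regex.Escape as an added precaution:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BeforeMarkerDemo
{
    // Lazy quantifier: stops at the first position where the lookahead succeeds.
    public static string BeforeFirst(string text, string marker)
    {
        return Regex.Match(text, ".*?(?=" + Regex.Escape(marker) + ")").Value;
    }

    // Greedy quantifier: backtracks from the end, so it stops at the last occurrence.
    public static string BeforeLast(string text, string marker)
    {
        return Regex.Match(text, ".*(?=" + Regex.Escape(marker) + ")").Value;
    }

    public static void Main()
    {
        const string text = "a 12345 b 12345";
        Console.WriteLine("[" + BeforeFirst(text, "12345") + "]"); // [a ]
        Console.WriteLine("[" + BeforeLast(text, "12345") + "]");  // [a 12345 b ]
    }
}
```

Both helpers return an empty string when the marker is absent, since a failed Match has an empty Value.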

This should get you started:
var reg = new Regex("^(.+)12345$");
var match = reg.Match("This is is a test.... 12345");
var group = match.Groups[1]; // This is is a test....
Of course you'd want to do some additional validation, but this is the basic idea.

^ means start of string
$ means end of string
The asterisk tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more
{min,max} indicate the minimum/maximum number of matches.
\d matches a single character that is a digit, \w matches a "word character" (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks).
[^a] is a negated character class: it matches any single character except a
The dot matches a single character, except line break characters
In your case there are many ways to accomplish the task.
E.g. matching everything up to the first digit: ^[^\d]*
If you know the exact set of characters and it is not just "any digits", don't use a regex; use IndexOf(). If you know that the separator between the first and second part is "...", you can use Split().
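A few of the constructs listed above, tried on the string from the question. This is only an illustrative sketch; the class name RegexBasicsDemo and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBasicsDemo
{
    public const string Input = "This is is a test.... 12345";

    // ^ anchors at the start; [^\d]* takes zero or more characters that are not digits.
    public static string LeadingNonDigits(string s)
    {
        return Regex.Match(s, @"^[^\d]*").Value;
    }

    // \d{5} is exactly five digits; $ anchors at the end of the string.
    public static bool EndsWithFiveDigits(string s)
    {
        return Regex.IsMatch(s, @"\d{5}$");
    }

    // {2,4} bounds the repetition: between two and four literal dots.
    public static string FirstDotRun(string s)
    {
        return Regex.Match(s, @"\.{2,4}").Value;
    }

    public static void Main()
    {
        Console.WriteLine("[" + LeadingNonDigits(Input) + "]"); // [This is is a test.... ]
        Console.WriteLine(EndsWithFiveDigits(Input));           // True
        Console.WriteLine(FirstDotRun(Input));                  // ....
    }
}
```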

Take a look at this snippet:
using System;
using System.Text.RegularExpressions;

class Program
{
static void Main(string[] args)
{
string input = "This is is a test.... 12345";
// Here we call Regex.Matches and read the two named groups from each match.
MatchCollection matches = Regex.Matches(input, @"(?<MySentence>(\w+\s*)*)(?<MyNumberPart>\d*)");
foreach (Match item in matches)
{
Console.WriteLine(item.Groups["MySentence"]);
Console.WriteLine("******");
Console.WriteLine(item.Groups["MyNumberPart"]);
}
Console.ReadKey();
}
}

You could just split, although this is not as efficient as the IndexOf solution:
string value = "oiasjdoiasj12345";
string end = "12345";
string result = value.Split(new string[] { end }, StringSplitOptions.None)[0]; // Take the first part of the result; not the quickest, but fairly simple
