Using wildcards to get URLs from a string


I'm using Visual Studio C# and want to list the wildcard matches in a string that are URLs. I've used regex with Perl for years, but I just cannot figure it out in C#. The string may contain no URLs at all, or it may contain one or more.
str = "This has more than one URL http://findme.com/lost and another one named http://www.amidumb.net but then there is this one https://hello.ua/findme/ifyoucan/ at last."
I want to list findme.com, www.amidumb.net and hello.ua
This is where I am:
string pattern = @"\/\/(.*)\/(.*)";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Match m = r.Match(newLineItem);
ArrayList results = new ArrayList();
while(m.Success) {
Console.WriteLine(m + "\n");
m = m.NextMatch();
}
When I run the above, it prints the whole line after the first instance of ":", starting with the first occurrence of "//".
It seems like I should be able to select the first (.*) after // and each one after that.
I'm guessing I somehow need to add each found instance to a list, but I am totally lost. Am I even headed in the right direction?

Here is a simple example that I believe does what you are looking for.
void Main()
{
var rawString = "This has more than one URL http://findme.com/lost and another one named http://www.amidumb.net but then there is this one https://hello.ua/findme/ifyoucan/ at last.";
var urlList = UrlMaker(rawString);
}
// You can define other methods, fields, classes and namespaces here
public List<string> UrlMaker(string input)
{
List<string> urls = new List<string>();
var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
var rawString = input;
foreach (Match m in linkParser.Matches(rawString))
{
if (m.Value.Contains("http"))
{
Uri url = new Uri(m.Value);
urls.Add(url.Host);
}
else
{
urls.Add(m.Value);
}
}
return urls;
}
For the sample string, urlList ends up containing:
findme.com
www.amidumb.net
hello.ua
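If only the host names are needed, a capture group can also pull them out directly, without going through Uri. The following is a minimal sketch of that alternative, not the method used in the answer above. The names UrlHosts and ExtractHosts are made up for illustration, and the pattern assumes plain http/https URLs with no port, user info, or trailing punctuation stuck to the host.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlHosts
{
    // Group 1 captures everything after "//" up to the next '/' or whitespace.
    // Unlike the greedy (.*) in the question, [^/\s]+ cannot run past the host.
    private static readonly Regex HostPattern =
        new Regex(@"https?://([^/\s]+)", RegexOptions.IgnoreCase);

    public static List<string> ExtractHosts(string input)
    {
        var hosts = new List<string>();
        foreach (Match m in HostPattern.Matches(input))
        {
            hosts.Add(m.Groups[1].Value);
        }
        return hosts;
    }
}
```

Calling UrlHosts.ExtractHosts(str) on the string from the question should give the three hosts in order. The greedy (.*) in the original pattern is what swallowed the rest of the line, since it is free to match any character up to the last "/" it can find.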

Related

C# "between strings" run several times

Here is my code to find a string between { }:
var text = "Hello this is a {Testvar}...";
int tagFrom = text.IndexOf("{") + "{".Length;
int tagTo = text.LastIndexOf("}");
String tagResult = text.Substring(tagFrom, tagTo - tagFrom);
tagResult Output: Testvar
This only works for one time use.
How can I apply this for several Tags? (eg in a While loop)
For example:
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
tagResult[] Output (eg Array): Testvar, Tagvar, Endvar

IndexOf() has another overload that takes the index at which to start searching the string. If you omit it, the search always starts from the beginning and always finds the first occurrence.
var text = "Hello this is a {Testvar}...";
int start = 0, end = -1;
List<string> results = new List<string>();
while(true)
{
start = text.IndexOf("{", start) + 1;
if(start != 0)
end = text.IndexOf("}", start);
else
break;
if(end==-1) break;
results.Add(text.Substring(start, end - start));
start = end + 1;
}

I strongly recommend using regular expressions for the task.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
var regex = new Regex(@"(\{(?<var>\w*)\})+", RegexOptions.IgnoreCase);
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
var matches = regex.Matches(text);
foreach (Match match in matches)
{
var variable = match.Groups["var"];
Console.WriteLine($"Found {variable.Value} from position {variable.Index} to {variable.Index + variable.Length}");
}
}
}
}
Output:
Found Testvar from position 17 to 24
Found Tagvar from position 47 to 53
Found Endvar from position 71 to 77
For more information about regular expression visit the MSDN reference page:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
and this tool may be great to start testing your own expressions:
http://regexstorm.net/tester
Hope this helps!

I would use Regex pattern {(\\w+)} to get the value.
Regex reg = new Regex("{(\\w+)}");
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
string[] tagResult = reg.Matches(text)
.Cast<Match>()
.Select(match => match.Groups[1].Value).ToArray();
foreach (var item in tagResult)
{
Console.WriteLine(item);
}
Result
Testvar
Tagvar
Endvar

Many ways to skin this cat, here are a few:
Split it on { then loop through, splitting each result on } and taking element 0 each time
Split on { or } then loop through taking only odd numbered elements
Adjust your existing logic so you use IndexOf twice (instead of LastIndexOf). When you're looking for a }, pass the index of the { as the start index of the search
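The second suggestion (split on { or }, then keep the odd-numbered elements) can be sketched as follows. This is only an illustration: TagSplitter and ExtractTags are hypothetical names, and the approach assumes the braces are balanced and never nested.

```csharp
using System.Collections.Generic;

public static class TagSplitter
{
    public static List<string> ExtractTags(string text)
    {
        // Splitting on both braces leaves the text outside the braces at even
        // indices and the text inside the braces at odd indices.
        string[] parts = text.Split('{', '}');

        var tags = new List<string>();
        for (int i = 1; i < parts.Length; i += 2)
        {
            tags.Add(parts[i]);
        }
        return tags;
    }
}
```

For the sample text from the question, ExtractTags should return Testvar, Tagvar and Endvar. A stray or unmatched brace shifts the odd/even pattern, which is why the regex answers are more robust for untrusted input.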

This is easy with regular expressions, using a simple pattern like {([\d\w]+)}.
See the example below:-
using System.Text.RegularExpressions;
...
MatchCollection matches = Regex.Matches("Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.", @"{([\d\w]+)}");
foreach (Match match in matches)
{
    // Groups[1] is the text inside the braces; Index is where the whole match starts
    Console.WriteLine("match : {0}, index : {1}", match.Groups[1], match.Index);
}
It can find any series of letters or number in these brackets one by one.

"Cut out" a specific Text of a file and put it into a string

My application saves information into a file like this:
[Name]{ExampleName}
[Path]{ExamplePath}
[Author]{ExampleAuthor}
I want to cut the [Name]{....} out and just get back the "ExampleName".
//This is what the strings should contain in the end.
string name = "ExampleName";
string path = "ExamplePath";
Is there any way to do this in C#?

You can extract the keys and the values and push them into a dictionary that you can later easily access like this:
var text = "[Name]{ExampleName} [Path]{ExamplePath} [Author]{ExampleAuthor}";
// You can use regex to extract the Value/Pair
var rgx = new Regex(@"\[(?<key>[a-zA-Z]+)\]{(?<value>[a-zA-Z]+)}", RegexOptions.IgnorePatternWhitespace);
var matches = rgx.Matches(text);
// Now you can add the values to a dictionary
var dic = new Dictionary<string, string>();
foreach (Match match in matches)
{
dic.Add(match.Groups["key"].Value, match.Groups["value"].Value);
}
// Then you can access your values like this.
var name = dic["Name"];

You could use a regular expression to cut out the string part between brackets:
var regex = new Regex(@"\[.*\]{(?<variableName>.*)}");
If you try matching this regular expression on your strings, you end up with your resulting strings in the match group 'variableName' (match.Groups["variableName"].Value).
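Applied line by line, that idea might look like the sketch below. It is written under a few assumptions: each entry sits on its own line, keys contain no ']' and values contain no '}', and the names SettingsParser and Parse are invented for this example. The negated character classes replace the greedy .* so that a match cannot spill over into a second entry on the same line.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SettingsParser
{
    // [Key]{Value}, anchored so that the whole line has to fit the shape.
    private static readonly Regex LinePattern =
        new Regex(@"^\[(?<key>[^\]]+)\]\{(?<value>[^}]*)\}$");

    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line.Trim());
            if (m.Success)
            {
                // A later duplicate key overwrites an earlier one.
                result[m.Groups["key"].Value] = m.Groups["value"].Value;
            }
        }
        return result;
    }
}
```

With the file from the question, something like SettingsParser.Parse(File.ReadLines(path))["Name"] should yield ExampleName; lines that do not fit the shape are skipped silently.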

I'm not sure I understand what you need, but I will try.
You can use this :
var regex = new Regex(Regex.Escape("Name"));
var newText = regex.Replace("NameExampleName", "", 1);

Attempting to capture multiple groups but only the last group is captured

I am trying to use regex to help to convert the following string into a Dictionary:
{TheKey|TheValue}{AnotherKey|AnotherValue}
Like such:
["TheKey"] = "TheValue"
["AnotherKey"] = "AnotherValue"
To parse the string for the dictionary, I am using the regex expression:
^(\{(.+?)\|(.+?)\})*?$
But it will only capture the last group of {AnotherKey|AnotherValue}.
How do I get it to capture all of the groups?
I am using C#.
Alternatively, is there a more straightforward way to approach this rather than using Regex?
Code (Properties["PromptedValues"] contains the string to be parsed):
var regex = Regex.Matches(Properties["PromptedValues"], @"^(\{(.+?)\|(.+?)\})*?$");
foreach(Match match in regex) {
if(match.Groups.Count == 4) {
var key = match.Groups[2].Value.ToLower();
var value = match.Groups[3].Value;
values.Add(key, new StringPromptedFieldHandler(key, value));
}
}
This is coded to work for the single value, I would be looking to update it once I can get it to capture multiple values.

The $ says that: The match must occur at the end of the string or before \n at the end of the line or string.
The ^ says that: The match must start at the beginning of the string or line.
Read this for more regex syntax: msdn RegEx
Once you remove the ^ and $, your regex will match each of the sets. Read up on Match.Groups, and you should end up with something like the following:
public class Example
{
public static void Main()
{
string pattern = @"\{(.+?)\|(.+?)\}";
string input = "{TheKey|TheValue}{AnotherKey|AnotherValue}";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
Console.WriteLine("The Key: {0}", match.Groups[1].Value);
Console.WriteLine("The Value: {0}", match.Groups[2].Value);
Console.WriteLine();
}
Console.WriteLine();
}
}

Your regex tries to match against the entire line. You can get individual pairs if you don't use anchors:
var input = "{TheKey|TheValue}{AnotherKey|AnotherValue}";
var matches=Regex.Matches(input,@"(\{(.+?)\|(.+?)\})");
Debug.Assert(matches.Count == 2);
It's better to name the fields though:
var matches=Regex.Matches(input,@"\{(?<key>.+?)\|(?<value>.+?)\}");
This allows you to access the fields by name, and even use LINQ:
var pairs= from match in matches.Cast<Match>()
select new {
key=match.Groups["key"].Value,
value=match.Groups["value"].Value
};

Alternatively, you can use the Captures property of your groups to get all of the times they matched.
if (regex.Success)
{
for (var i = 0; i < regex.Groups[1].Captures.Count; i++)
{
var key = regex.Groups[2].Captures[i].Value.ToLower();
var value = regex.Groups[3].Captures[i].Value;
}
}
This has the advantage of still checking that your entire string was made up of matches. Solutions suggesting you remove the anchors will find things that look like matches in a longer string, but will not fail for you if anything was malformed.
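A fuller sketch of the Captures approach is shown here, under some assumptions of my own: the class and method names (PairParser, Parse) are hypothetical, the groups are named rather than numbered, and the lazy .+? from the question is swapped for a negated character class so that a key or value can never contain '{', '}' or '|'.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    public static Dictionary<string, string> Parse(string input)
    {
        // The anchors stay in place, so the whole input must consist of pairs.
        Match match = Regex.Match(
            input, @"^(?:\{(?<key>[^|{}]+)\|(?<value>[^|{}]+)\})*$");
        if (!match.Success)
        {
            throw new FormatException("Input is not a sequence of {key|value} pairs.");
        }

        // Each repetition of the outer group adds one capture to each named group.
        CaptureCollection keys = match.Groups["key"].Captures;
        CaptureCollection values = match.Groups["value"].Captures;

        var result = new Dictionary<string, string>();
        for (int i = 0; i < keys.Count; i++)
        {
            result[keys[i].Value] = values[i].Value;
        }
        return result;
    }
}
```

Match.Groups[n].Value only ever holds the last repetition, which is the behaviour described in the question; the Captures collection keeps all of them.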

.NET Regex: how to retrieve multiple matches on multiple lines

I have the following Regex:
\b((.|\n)*)=((.|\n)*)new((.|\n)*)\(\)
It is used to detect an object assignment from a c# source code string,
like this one: var a = new Person();
it works fine when I have only one match, but if I try to process this:
var a = new Person();
var x = new WebClient();
It returns only one match, like this: {var a = new Person(); var x = new WebClient()}
I need to extract both matches. How do I do that? I'm relatively new to regex and have no idea what to do.
When I test my regex on RegExr, it works just fine (with the global checkbox checked).

This expression should get you started. Try passing in the Multiline regex option rather than trying to deal with newlines in the regex itself:
var src = @"var a = new Person();
var x = new WebClient();";
var pattern = @"(\w+\s*)(\w*\s*)=\s+new\s+(\w+)\(\)";
var expr = new System.Text.RegularExpressions.Regex(pattern,RegexOptions.Multiline);
foreach(Match match in expr.Matches(src) )
{
var assignType = match.Groups[1].Value;
var id = match.Groups[2].Value;
var objType = match.Groups[3].Value;
}
That said, there are (much) better tools than RegEx to deal with C# parsing, are you interested in those?

\n is allowing it to match new line.
This works for me against your test data in expresso:
\b((.)*)=((.)*)new((.)*)\(\)
If you don't need the matching groups - the brackets - this seems to work as well:
\b.*=.*new.*\(\)
This is possibly a better fit than using . (any character).
\b[\w\s]*=[\w\s]*new[\w\s]*\(\)
If you're confident the code base has exact spacing (e.g. enforced by something like StyleCop) then you can get more specific again with regards to the \w (word character) and \s (space character).
Also I'm not sure if it is intentional, but you're not matching the ; at the end of the line.

You can use named groups. I modified the pattern to the following and the groups named asgn will match a whole assignment:
(?<asgn>\b\w+\s+\w+\s*\=\s*new\s+\w+\([^)]*\)\s*;)
This is how to access the named group:
string pat = @"(?<asgn>\b\w+\s+\w+\s*\=\s*new\s+\w+\([^)]*\)\s*;)";
string input = @"var a = new Person();
var x = new WebClient();";
foreach (Match m in Regex.Matches(input, pat))
{
Console.WriteLine(m.Groups["asgn"].Value);
}
If you need to parse and extract each part of the assignment, you can name more groups into the pattern, as the following:
(?<asgn>\b(?<vtype>\w+)\s+(?<name>\w+)\s*\=\s*new\s+(?<type>\w+)\((?<args>[^)]*)\)\s*;)
with which you can extract variable-type, variable name, type, and constructor args from the matched string.
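Reading those named groups back out could look roughly like this. The group names are the ones from the pattern above (without the outer asgn group); AssignmentFinder and Describe are made-up names, and the summary string format is just for demonstration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AssignmentFinder
{
    private const string Pattern =
        @"\b(?<vtype>\w+)\s+(?<name>\w+)\s*=\s*new\s+(?<type>\w+)\((?<args>[^)]*)\)\s*;";

    // Returns one summary line per "<vtype> <name> = new <type>(<args>);" found.
    public static List<string> Describe(string source)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(source, Pattern))
        {
            found.Add(string.Format("{0} {1} : {2}({3})",
                m.Groups["vtype"].Value,
                m.Groups["name"].Value,
                m.Groups["type"].Value,
                m.Groups["args"].Value));
        }
        return found;
    }
}
```

Because the pattern requires the closing ; and never uses . to cross lines, each assignment comes back as its own match even when several sit in one string. Generic types, qualified names such as System.Net.WebClient, and nested parentheses in the arguments are not handled by this sketch.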

Increasing Regex Efficiency

I have about 100k Outlook mail items that have about 500-600 chars per Body. I have a list of 580 keywords that must search through each body, then append the words at the bottom.
I believe I've increased the efficiency of the majority of the function, but it still takes a lot of time. Even for 100 emails it takes about 4 seconds.
I run two functions for each keyword list (290 keywords each list).
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + @currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Is there any way I can increase the efficiency of this function?
The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List item, and not an array.

I assume that the COM call nSearch.InnerHtml is pretty slow and you repeat the call for every single word that you are checking. You can simply cache the result of the call:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
foreach (string currWord in _keywordList)
{
bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b",
RegexOptions.IgnoreCase);
if (isMatch)
{
wordFound.Add(currWord);
}
}
return wordFound;
}
Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:
string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";

I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
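One way to read that suggestion is a single pass over the body that looks each word up in a hash set. The sketch below assumes every keyword is a single word made of letters, digits or underscores (roughly what \w matches); WordScanner and FindKeywords are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class WordScanner
{
    public static List<string> FindKeywords(string body, IEnumerable<string> keywords)
    {
        // Case-insensitive set, so each word lookup is O(1) on average.
        var wanted = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var found = new List<string>();

        int start = -1; // start index of the word being scanned, or -1 between words
        for (int i = 0; i <= body.Length; i++)
        {
            bool inWord = i < body.Length &&
                          (char.IsLetterOrDigit(body[i]) || body[i] == '_');
            if (inWord)
            {
                if (start < 0) start = i;
            }
            else if (start >= 0)
            {
                // A word just ended: report it once if it is a keyword.
                string word = body.Substring(start, i - start);
                if (wanted.Contains(word) && seen.Add(word)) found.Add(word);
                start = -1;
            }
        }
        return found;
    }
}
```

The cost is proportional to the length of the body rather than to body length times keyword count. Multi-word keywords would need a different approach, and HTML tags in the body are scanned like any other text.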

Most of the time comes from matches that fail, so you want to minimize failures.
If the search keywords are not frequent, you can test for all of them at the same time (with the regexp \b(aaa|bbb|ccc|....)\b) and exclude the emails with no matches. For the ones that have at least one match, you then do a thorough search.

One thing you can easily do is match against all the words in one go by building an expression like:
\b(?:word1|word2|word3|....)\b
Then you can precompile the pattern and reuse it to look up all occurrences for each email (not sure how you do this with the .NET API, but there must be a way).
Another thing is instead of using the ignorecase flag, if you convert everything to lowercase, that might give you a small speed boost (need to profile it as it's implementation dependent). Don't forget to warm up the CLR when you profile.
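In .NET, precompiling and reusing the pattern usually means building one Regex instance up front (optionally with RegexOptions.Compiled) and keeping it around. A minimal sketch, assuming the keywords are literals: KeywordMatcher and Find are names invented here, and Regex.Escape is added so that characters like '.' or '+' inside a keyword are not treated as regex syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class KeywordMatcher
{
    private readonly Regex _pattern;

    public KeywordMatcher(IEnumerable<string> keywords)
    {
        // Built once; every call to Find reuses the same Regex object.
        string alternation = string.Join("|", keywords.Select(k => Regex.Escape(k)));
        _pattern = new Regex(@"\b(?:" + alternation + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    public List<string> Find(string body)
    {
        // The set drops repeated hits of the same keyword within one body.
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in _pattern.Matches(body))
        {
            found.Add(m.Value);
        }
        return found.ToList();
    }
}
```

Create the matcher once (for example one per keyword list) and call Find for each email body. Whether RegexOptions.Compiled pays off depends on how many bodies are processed, since compilation has a start-up cost, so it is worth profiling both ways. An empty keyword list is not handled by this sketch.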

This may be faster. You can leverage Regex Groups like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
// Group 0 represents the entire match so we skip that one
for (int i = 1; i < myMatch.Groups.Count; i++)
{
if (myMatch.Groups[i].Success)
wordFound.Add(_keywordList[i-1]);
}
}
return wordFound;
}
This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);
UPDATE:
After looking at my code again I just realized that putting the matches into Groups is really unnecessary, and Regex Groups have some overhead. Instead, you could remove the parentheses from the pattern and simply add the matches themselves to the wordFound list. This would produce the same effect, but it'd be faster.
It'd be something like this:
public List<string> Keyword_Search(HtmlNode nSearch)
{
var wordFound = new List<string>();
// cache inner HTML
string innerHtml = nSearch.InnerHtml;
string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection myMatches = myRegex.Matches(innerHtml);
foreach (Match myMatch in myMatches)
{
wordFound.Add(myMatch.Value);
}
return wordFound;
}

Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".

If the regular expression is indeed the bottleneck, and even optimizing it (by concatenating the search words into one expression) doesn't help, consider using a multi-pattern search algorithm, such as Wu-Manber.
I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.
Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.

If your problem is about searching for Outlook items containing a certain string, you should gain from using Outlook's own search facilities.
see:
http://msdn.microsoft.com/en-us/library/bb644806.aspx

If your keywords are plain literals, i.e. they contain no regex syntax, then another method may be more appropriate. The following code demonstrates one such method. It goes through each email only once, whereas your code went through each email 290 times (twice).
public List<string> FindKeywords(string emailbody, List<string> keywordList)
{
    // may want to clean up the input a bit, such as replacing '.' and ',' with a space
    // and remove double spaces
    string[] emailWords = emailbody.Split(' ');
    // Case-insensitive comparison, so neither the body nor the keywords need upper-casing
    return keywordList.Intersect(emailWords, StringComparer.OrdinalIgnoreCase).ToList();
}

If you can use .Net 3.5+ and LINQ you could do something like this.
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<string> keywordList)
{
//// as regex
//var innerHtml = nSearch.InnerHtml;
//return keywordList.Where(kw =>
// Regex.IsMatch(innerHtml,
// @"\b" + kw + @"\b",
// RegexOptions.IgnoreCase)
// );
//would be faster if you don't need the pattern matching
var innerHtml = ' ' + nSearch.InnerHtml + ' ';
return keywordList.Where(kw => innerHtml.Contains(kw));
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var matched = h.MatchedKeywords(keyworkList).ToList();
//hello, world
}
}
... reused regex example ...
public static class HtmlNodeTools
{
public static IEnumerable<string> MatchedKeywords(
this HtmlNode nSearch,
IEnumerable<KeyValuePair<string, Regex>> keywordList)
{
// as regex
var innerHtml = nSearch.InnerHtml;
return from kvp in keywordList
where kvp.Value.IsMatch(innerHtml)
select kvp.Key;
}
}
class Program
{
static void Main(string[] args)
{
var keyworkList = new string[] { "hello", "world", "nomatch" };
var h = new HtmlNode()
{
InnerHtml = "hi there hello other world"
};
var keyworkSet = keyworkList.Select(kw =>
new KeyValuePair<string, Regex>(kw,
new Regex(
@"\b" + kw + @"\b",
RegexOptions.IgnoreCase)
)
).ToArray();
var matched = h.MatchedKeywords(keyworkSet).ToList();
//hello, world
}
}
