Extracts all sub strings between string separators in a string (C#) - c#

I'm trying to parse the content of a string to see whether it includes URLs, and then convert the full string to HTML so that the links are clickable.
I'm not sure whether there is a smarter way of doing this, but I started by trying to write a parser with the string Split method, or Regex.Split, in C#. I can't find a good way of doing it.
(It is an ASP.NET MVC application, so perhaps there is a smarter way of doing this.)
For example, I want to convert the string:
"Customer office is responsible for this. Contact info can be found {link}{www.customerservice.com}{here!}{link} More info can be found {link}{www.customerservice.com/moreinfo}{here!}{link}"
Into
"Customer office is responsible for this. Contact info can be found <a href=www.customerservice.com>here!</a> More info can be found <a href=www.customerservice.com/moreinfo>here!</a>"
i.e.
{link}{url}{text}{link} --> <a href=url>text</a>
Anyone have a good suggestion? I can also change the way the input string is formatted.

You can use the following to match:
{link}{([^}]*)}{([^}]*)}{link}
And replace with:
<a href=$1>$2</a>
Explanation:
{link} match {link} literally
{([^}]*)} match all characters except } in capturing group 1 (for url)
{([^}]*)} match all characters except } in capturing group 2 (for value)
{link} match {link} literally again
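As a minimal sketch of how the pattern explained above could be applied in C#: the class name LinkMarkup and the method ToHtml are hypothetical, the braces are escaped only for readability, and the HTML encoding of the captured values is an added assumption that goes beyond the original answer. Text outside the link markers is passed through unchanged.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkMarkup
{
    // Group 1 captures the URL, group 2 captures the link text.
    private static readonly Regex LinkPattern =
        new Regex(@"\{link\}\{([^}]*)\}\{([^}]*)\}\{link\}");

    // Hypothetical helper: turns every {link}{url}{text}{link} into an anchor tag.
    public static string ToHtml(string input)
    {
        return LinkPattern.Replace(input, m =>
            "<a href=\"" + WebUtility.HtmlEncode(m.Groups[1].Value) + "\">"
            + WebUtility.HtmlEncode(m.Groups[2].Value) + "</a>");
    }
}
```

Calling LinkMarkup.ToHtml on the string from the question should replace both link markers. Note that an href without a scheme, such as www.customerservice.com, is treated by browsers as a relative URL.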

You can use the regex
{link}{(.*?)}{(.*?)}{link}
and the substitution
<a href=\1>\2</a>
(in .NET's Regex.Replace, write the backreferences as $1 and $2 rather than \1 and \2).

For your simple link format {link}{url}{text} you can use a simple Regex.Replace (append \{link\} to the pattern if the closing marker should be consumed as well):
Regex.Replace(input, @"\{link\}\{([^}]*)\}\{([^}]*)\}", @"<a href=$1>$2</a>");

Also this non-regex idea may help
var input = "Customer office is responsible for this. Contact info can be found {link}{www.customerservice.com}{here!}{link} More info can be found {link}{www.customerservice.com/moreinfo}{here!}{link}";
var output = input.Replace("{link}{", "<a href=")
                  .Replace("}{link}", "</a>")
                  .Replace("}{", ">");

Related

Regex for extract URL from string fails when string contains multiple double quotes?

I am using regex for extracting url from string and it's working mostly;
var regex=new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")",RegexOptions.IgnoreCase);
following strings working fine:
"This is Test page <a href='test.aspx'>test page</a>"
"This is Test page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Tests\"s page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Test page"
"This is Test page\"s without problem"
But sometimes it does not return the right result. The following code returns a bad result (the string contains two double quotes):
var inputString="This string create \"problem\" for me";
var regex=new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")",RegexOptions.IgnoreCase);
var urls=regex.Matches(inputString).OfType<Match>().Select(m =>m.Groups["href"].Value);
foreach (var url in urls)
{
    Console.WriteLine(url);
}
Could anyone help me to solve this problem?
Maybe you can change your regex like this: <a .*?href=(?:['"](?<href>[^'"]*?)['"])
In C#: "<a .*?href=(?:['\"](?<href>[^'\"]*?)['\"])"
Solution:
You should use an HTML parser to save yourself current and future headaches. A tested and working example can be found for example here.
Regex explanation:
As for your regex, it currently fails because the alternation is not enclosed in a group. The | therefore splits the whole pattern in two, so the second branch can match any double-quoted text, even when there is no <a ... href in front of it. Moreover, there are other issues you can run into with your current regex.
A "fixed" regex (meaning it will be capable of handling escaped entities and both double and single quotes) would look like:
(?i)<a\b[^<]*href=(?:(?:'(?<href>[^'\\]*(?:\\.[^'\\]*)*)')|(?:\"(?<href>[^'\\]*(?:\\.[^'\\]*)*))\")
But it is unlikely you can fully rely on regex when parsing HTML. Use the solution, not a workaround.
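To illustrate the alternation problem described above, here is a small sketch that compares the original pattern with a variant in which both quote styles are wrapped in one group. The class and member names (HrefExtractor, Ungrouped, Grouped, Extract) are hypothetical, and the grouped pattern is a simplified illustration, not the "fixed" regex from the answer and not a substitute for an HTML parser.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Original pattern: the top-level | makes the second branch match any "..." text.
    public static readonly Regex Ungrouped = new Regex(
        "<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")", RegexOptions.IgnoreCase);

    // Both quote styles sit inside one group, so "<a ... href=" is required either way.
    public static readonly Regex Grouped = new Regex(
        "<a [^>]*href=(?:'(?<href>[^']*)'|\"(?<href>[^\"]*)\")", RegexOptions.IgnoreCase);

    // Returns the value of the href group for every match.
    public static string[] Extract(Regex regex, string input)
    {
        return regex.Matches(input).Cast<Match>()
                    .Select(m => m.Groups["href"].Value).ToArray();
    }
}
```

With the input from the question, "This string create \"problem\" for me", the ungrouped pattern reports problem as a URL, while the grouped pattern returns no matches.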

Using regex to split a formatted string to URL like StackOverFlow

I'm trying to write a parser that will create links found in posted text that are formatted like so:
[Site Description](http://www.stackoverflow.com)
to be rendered as a standard HTML link like this:
<a href="http://www.stackoverflow.com">Site Description</a>
So far what I have is the expression listed below. It works on the example above, but it will not work if the URL has anything after the ".com". Obviously no single regular expression will find every URL, but I would like to match as many as I can.
(\[)([A-Za-z0-9 -_]*)(\])(\()((http|https|ftp)\://[A-Za-z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?)(\))
Any help would be greatly appreciated. Thanks.
Darn. It seems @Jerry and @MikeH beat me to it. My answer is best, however, as the link tags are all uppercase ;)
Find what: \[([^]]+)\]\(([^)]+)\)
Replace with: <A HREF="$2">$1</A>
http://regex101.com/r/cY7lF0
Well, you could try negated classes so you don't have to worry about the parsing of the url itself?
\[([^]]+)\]\(([^)]+)\)
And replace with:
<a href="$2">$1</a>
Or maybe use only the beginning parts to identify a url?
\[([^]]+)\]\(((?:https?|ftp)://[^)]+)\)
The replace is the same.
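A minimal C# sketch of the second suggestion, assuming the link text should become the anchor text (group 1) and the URL the href (group 2). The names MarkdownLinks and ToHtml are hypothetical, and the captured values are not HTML-encoded here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkdownLinks
{
    // Group 1: text between [ and ]; group 2: a URL starting with http, https or ftp.
    private static readonly Regex Pattern =
        new Regex(@"\[([^\]]+)\]\(((?:https?|ftp)://[^)]+)\)");

    // Hypothetical helper: rewrites [text](url) as an HTML anchor.
    public static string ToHtml(string text)
    {
        return Pattern.Replace(text, "<a href=\"$2\">$1</a>");
    }
}
```

Because the URL part only excludes the closing parenthesis, anything after ".com" (paths, query strings) is carried along, which is the case the original expression missed.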

C# Replace URL Regex

I am trying to pull a URL out of a string and use it later to create a Hyperlink. I would like to be able to do the following:
- determine if the input string contains a URL
- remove the URL from the input string
- store the extracted URL in a variable for later use
Can anyone help me with this?
Here is a great solution for recognizing URLs in popular formats such as:
www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/
However, I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see the working example.
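Replace the empty string below with your input:
string input = string.Empty;
// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
// The JavaScript-style / delimiters are dropped, because .NET would treat them as literal slashes.
var matches = Regex.Matches(input,
    @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)");
List<string> urlList = matches.Cast<Match>().Select(match => match.Value).ToList();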
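The three steps from the question (detect, remove, store) could be sketched as follows. This deliberately uses a much simpler pattern than the one above, assuming only http and https URLs that end at the next whitespace; the names UrlExtractor and TryExtract are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Assumption: a URL starts with http:// or https:// and runs to the next whitespace.
    private static readonly Regex UrlPattern =
        new Regex(@"\bhttps?://\S+", RegexOptions.IgnoreCase);

    // Returns true if a URL was found. url receives the first URL,
    // remainder receives the input with that URL removed.
    public static bool TryExtract(string input, out string url, out string remainder)
    {
        Match m = UrlPattern.Match(input);
        if (!m.Success)
        {
            url = null;
            remainder = input;
            return false;
        }
        url = m.Value;
        remainder = input.Remove(m.Index, m.Length);
        return true;
    }
}
```

Only the first URL is extracted; the whitespace that surrounded it is left in place, so the remainder may need trimming or collapsing of double spaces.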

I need a regex expression which can return to me the relative URL + query string from an HTML content string

I have found useful regex expressions from the site, but this particular one eludes me.
Basically, I need to extract this:
/uploadedimages/space earth nasa hd wallpapers 62.jpg?n=6965
from this string using regex:
<p>test james lafferty joseph <strong>swami</strong> is a great guy.<img src=\"/uploadedimages/space earth nasa hd wallpapers 62.jpg?n=6965\" alt=\"nasa1\" title=\"nasa1\" style=\"width: 100px; height: 57px; \" width=\"100\" height=\"57\" /></p>\r\n<p><br /></p>\r\n<p><br /></p>
The regex expression I have extracts the URL without the query string. It is ok if the regex hard codes the string '/uploadedimages/'. However, other than this hard-coding, everything else needs to be generic. This could be anything - not just an image, could be an href linked to a pdf file. Query string could be anything valid as well.
Other regex expressions I have found work only with the absolute URLs starting with http, etc.
I am not sure why nobody was able to provide an acceptable answer for this question. As this would be a very real problem for any developer who needs to extract URLs of any kind fully from an HTML fragment which may or may not be valid HTML, here is the answer which I have verified as working in C#:
matches = Regex.Matches(target, "(?<=\")(http:|https:)?[/\\\\](?:[A-Za-z0-9-._~!$&'()*+,;=:@ ]|%[0-9a-fA-F]{2})*([/\\\\](?:([A-Za-z0-9-._~!$&'()*+,;=:@ ]|%[0-9a-fA-F]{2}))*)*(?:\\?[a-zA-Z0-9=/\\\\&]+)?(?=\")", RegexOptions.IgnoreCase);
This will extract any number of URLs in the HTML fragment with query string, and I have also gone ahead and modified the REGEX so that it works properly with escape characters in C# regex. The pure REGEX will not work as-is in C# as we have to escape the "\" and """ characters.
Assuming you want a regex like this?
<([^=<>]+)=\\?"([^\\"]+)
Otherwise, please be less ambiguous about what you are actually trying to parse out. Thanks!
I'd recommend doing this in stages, since it will be much simpler. You can use .net in a cleaner way, regexes are not needed here, and neither is a full dom parser if you know the format the data will come in. Assuming for the moment that what you really want is the relative url of the image source, and that there is only ever one image in the html, I would recommend something like the following.
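string Parse(string html)
{
    // Skip past src=" (5 characters) to the start of the attribute value.
    var temp = html.Substring(html.IndexOf("src=") + 5);
    // The value ends at the next double quote.
    return temp.Substring(0, temp.IndexOf("\""));
}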
To do it using regular expressions, based off kgoedtel's answer (modified slightly) you'll need to do something like:
string Parse(string html)
{
var r = new Regex("<img [^=<>]+=\\\\?\"([^\\\\\"]+)");
return r.Match(html).Groups[1].Value;
}
IEnumerable<string> ParseMany(string html)
{
var r = new Regex("[^=<>]+=\\\\?\"([^\\\\\"]+)");
return r.Matches(html).OfType<Match>().Select(m=>m.Groups[1].Value);
}

C# Need to locate web addresses using REGEX is that possible?

Basically I need to parse a string prior to loading it into a WebBrowser
myString = "this is an example string http://www.google.com , and I need to make the link clickable";
webBrow.DocumentText = myString;
What I want is to replace every web address in the string with a hyperlink, so that each address reads like
<a href='web address'>web address</a>
This would make the links clickable.
Any ideas?
new Regex(@"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?").Match(myString)
It's possible depending on how strict or permissive you want your parsing to be.
As a first cut, you can try @"\bhttp://\S+" which will match any string starting with "http://" at a word boundary (that is, preceded by a non-word character such as whitespace or punctuation, or by the start of the string).
To search using a regex and replace all occurrences with your custom text, you could use the Regex.Replace method.
You may want to read up on Regular Expression Language Elements to learn more.
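As a rough sketch of the Regex.Replace approach mentioned above: the pattern below is a permissive first cut, assuming URLs start with http:// or https:// and end at whitespace, a quote, or an angle bracket. The names Linkifier and Linkify are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Assumption: a URL runs until whitespace, a quote character, or < or >.
    private static readonly Regex UrlPattern =
        new Regex(@"\bhttps?://[^\s<>""']+", RegexOptions.IgnoreCase);

    // Wraps every matched URL in an anchor tag; $0 stands for the whole match.
    public static string Linkify(string text)
    {
        return UrlPattern.Replace(text, "<a href='$0'>$0</a>");
    }
}
```

The result of Linkifier.Linkify(myString) can then be assigned to webBrow.DocumentText. Punctuation that directly follows a URL without a space (for example a trailing comma or period) would be included in the link, so a stricter pattern may be needed for real input.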
