Get the number of an href url parameter from downloaded html page?


I am trying to get an ID from a url parameter inside an href that looks like this:
MyItemName
I want only the 71312, and at the moment I am trying to do it using regex (but if you have a better approach I would be glad to try it):
string html, itemID;
using (var client = new WebClient())
{
    html = client.DownloadString("http://www.mysite.com/search.php?search_text=" + myItemName);
}
string pattern = "" + myItemName + "";
Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
if (m.Success)
{
    itemID = m.Groups[1].Value;
    MessageBox.Show(itemID);
}
Example of the html:
more html body
<h1>Items - List</h1>
<p>MyItemNameTest, MyItemNameTestB, MYItemNameOther</p>
</div>
more html body

To show where your regex went wrong:
. and ? are special characters in regular expressions: . means "any character" and ? means "zero or one occurrences of the previous expression". Therefore your regex fails to match. Also, you need to use verbatim strings in C# (unless you want to escape every backslash):
@"" + myItemName + "";
will probably work.
That said, unless all the links you're examining follow exactly this format, you might run into problems. It's kind of a running gag here on SO that parsing HTML with regular expressions will earn you the wrath of Cthulhu.
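A minimal sketch of the escaping advice above. The anchor shape (an href ending in ?id=<digits>, followed by the item name as link text) is an assumption, since the question's markup is not shown; ItemIdFinder and FindItemId are hypothetical names. Regex.Escape takes care of any special characters in the item name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemIdFinder
{
    // Returns the digits after "?id=" in the link whose text is itemName,
    // or null when no such link is found.
    // Assumed markup: <a href="...?id=123">ItemName</a>
    public static string FindItemId(string html, string itemName)
    {
        // Verbatim string: backslashes are literal, "" is one double quote.
        string pattern = @"<a\s+href=""[^""]*\?id=(\d+)""[^>]*>"
                         + Regex.Escape(itemName)   // escape ., ?, + and friends
                         + "</a>";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling FindItemId(html, myItemName) on the downloaded page would then give the ID as a string, provided the real markup follows the assumed shape.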

Use:
Uri u = new Uri("http://www.mysite.com/myitem.php?id=12313");
string s = u.Query;
string id = HttpUtility.ParseQueryString(s).Get("id");
The variable id now holds the number. Figure out the rest of the function :)
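As a rough sketch, the same idea can be wrapped in a helper that also accepts relative hrefs, which is what a downloaded page usually contains. QueryIdReader and GetId are hypothetical names, and the base URL is an assumption. HttpUtility lives in System.Web, so on .NET Framework the project needs a reference to that assembly.

```csharp
using System;
using System.Web;

public static class QueryIdReader
{
    // Resolves href against baseUrl and returns the "id" query parameter,
    // or null when the parameter is absent.
    public static string GetId(string baseUrl, string href)
    {
        var absolute = new Uri(new Uri(baseUrl), href);
        return HttpUtility.ParseQueryString(absolute.Query).Get("id");
    }
}
```

For example, GetId("http://www.mysite.com/", "myitem.php?id=71312") resolves the href first and then reads the parameter.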

Related

.Net regex for string between \" "\

I have been trying to get the id from the following text
<body id=\"body\" runat=\"server\">
In C# using substring or even Regex, but nothing seems to be working. No matter what regex I use, I always get the whole line back. I have been trying ^id, ^id.*, ^id=\\\\\\\\.* and id=.*, but they either don't work or don't give me the desired output. Is there any way I can get the id portion of this text, which is enclosed between the characters \" "\?

Try this:
string htmlString = "<body id=\"body\" runat=\"server\">";
Regex regex = new Regex("id=\"(.*?)\"");
Match m = regex.Match(htmlString);
Group g = m.Groups[1];
string id = g.ToString();
Console.WriteLine(id); //body
Test here:
http://rextester.com/BQSF93427
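To illustrate why the attempts in the question returned the whole line, here is a small sketch comparing the full match of id=.* with the first capture group of the lazy pattern. IdAttribute, WholeMatch and Captured are hypothetical names used only for this comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdAttribute
{
    // Greedy and without a group: the match runs to the end of the line.
    public static string WholeMatch(string html)
    {
        return Regex.Match(html, "id=.*").Value;
    }

    // Lazy quantifier inside a capture group: only the text between the
    // quotes ends up in Groups[1].
    public static string Captured(string html)
    {
        return Regex.Match(html, "id=\"(.*?)\"").Groups[1].Value;
    }
}
```

Groups[0] (or Match.Value) is always the entire matched text; the part you actually want has to be wrapped in parentheses and read from Groups[1].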

How do I remove url from text

I have sample texts like
EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL
How do I remove URLs such as http://t.co/Wpwj0R1EQm and http://t.co/yPZXvNnugL from the text? I need to perform sentiment analysis and want clean words.
I am able to get rid of bad characters using a simple regex.
The pattern to remove is http://t.co/{Whatever-first-word}

Regular expressions are your friend.
Simplifying your requirement to "remove all URLs in a given string": if we accept that a URL is anything that starts with http and ends at a space (URLs cannot contain spaces), then something like the following should suffice. This regex finds any string that starts with http (it will also catch https) and runs up to the next whitespace, and replaces it with an empty string:
string text = "EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL";
string cleanedText = Regex.Replace(text, @"http[^\s]+", "");
//cleanedText is now "EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay  Tibet snake.... " (the spaces around each removed URL remain)
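Since the goal is clean words for sentiment analysis, a possible follow-up is to collapse the whitespace left behind by the removed URLs. This is only a sketch: TextCleaner and RemoveUrls are hypothetical names, and the pattern is narrowed to http:// and https:// so that ordinary words starting with "http" are left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Removes http/https URLs, then collapses runs of whitespace
    // and trims the ends.
    public static string RemoveUrls(string text)
    {
        string withoutUrls = Regex.Replace(text, @"https?://\S+", "");
        return Regex.Replace(withoutUrls, @"\s+", " ").Trim();
    }
}
```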

text = Regex.Replace(text, @"((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)", "");
The pattern above will match a URL like you want, for example
http://this.com/ah.aspx?id=1
in:
this is a url http://this.com/ah.aspx?id=1 sdfsdf
You can see this in action in a regex fiddle for it.

You can use this function: https://stackoverflow.com/a/17253735/2577248
Step 1. sub = find the substring between "http://" and " " (white space)
Step 2. Replace "http://" + sub with @""
Step 3. Repeat until the original string does not contain any "http://t.co/any"
string str = @"EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL" + " ";
while (str.Contains("http://"))
{
    // Substring(start, end) is the extension method from the linked answer
    string removedStr = str.Substring("http://", @" ");
    str = str.Replace("http://" + removedStr, @"");
}

Regex.Replace
And I would try this pattern (written in .NET syntax, without ^ and $ anchors so that it matches URLs inside a longer string):
var regex_url_pattern = @"(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?";
Combined:
string output = Regex.Replace(input, regex_url_pattern, "", RegexOptions.IgnoreCase);

Trouble with Regular Expressions replacing custom tags in C#

I have a simple editor that I allow people to update text on part of a website with.
I allow a couple of pseudo tags that I replace with html when I actually render their content. I'd like to use regular expressions to locate these tags and replace them with the appropriate html markup.
Basically there will be a block of text that may have one or more of the following embedded pseudo tags that I need to replace via regex using C#:
[E]me@myemail.com[/E]
needs to turn into
<a class='LinkText' href='mailto:me@myemail.com'>me@myemail.com</a>
and
[L text='My Link Text']www.google.com[/L]
needs to turn into
<a class="MyLinkClass" href="www.google.com">My Link Text</a>
For the email pseudo-tag I came up with the following Regex, but it doesn't work:
Content = Regex.Replace(Content, @"\[E\](?(email)[^<>]+)\[/E\]", "<a class='LinkText' href='mailto:?{email}'>?{email}</a>");
Since I'm stuck on this one I haven't made much headway on the other one either.
Any thoughts on how I might get this to work? I've always struggled with the syntax of these regular expressions... Any help or direction would be greatly appreciated!!

A few pointers:
It looks like you're trying to use named capture groups. You can create one of these inside of your regular expression using (?<name>subexpression)
When accessing a named capture group using Regex.Replace, you can access the named capture group using ${name}.
Other than that you're pretty close. Here are two regular expressions that should be a good starting point:
Links:
string linkReplacement =
    Regex.Replace(
        linkContent,
        @"\[L text='(?<text>[^']*)'\](?<link>[^\]]*)\[/L\]",
        "<a class='MyLinkClass' href='${link}'>${text}</a>");
Emails:
string emailReplacement =
    Regex.Replace(
        emailContent,
        @"\[E\](?<email>[^\]]*)\[/E\]",
        "<a class='LinkText' href='mailto:${email}'>${email}</a>");
Working example: https://dotnetfiddle.net/nhsoJ9
Edit: Updated to remove greediness.
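The two replacements can be chained over the same block of text. The following is a sketch only; PseudoTags and Render are hypothetical names, while the patterns and replacement strings are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PseudoTags
{
    // Replaces [E]...[/E] and [L text='...']...[/L] pseudo tags with anchors.
    public static string Render(string content)
    {
        // ${email} refers to the named group (?<email>...) in the pattern.
        content = Regex.Replace(
            content,
            @"\[E\](?<email>[^\]]*)\[/E\]",
            "<a class='LinkText' href='mailto:${email}'>${email}</a>");

        // Two named groups: the link text and the link target.
        content = Regex.Replace(
            content,
            @"\[L text='(?<text>[^']*)'\](?<link>[^\]]*)\[/L\]",
            "<a class='MyLinkClass' href='${link}'>${text}</a>");

        return content;
    }
}
```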

Whipped this up in LINQPad...
void Main()
{
    string s =
@"[E]me@myemail.com[/E]
blagra
shlarga";
    foreach ( Match m in Regex.Matches( s, @"\[E\](\w+@\w+\.\w+)\[/E\]") )
    {
        string emailMatch = m.Groups[1].Value;
        string entireMatch = m.Groups[0].Value;
        string replacement = string.Format( @"<a class=""MyLinkClass"" href=""{0}"">My Link Text</a>", m.Groups[1] );
        string newString = s.Replace( entireMatch, replacement );
        newString.Dump();
    }
}
The second replacement is left as an exercise to the reader :) ;-)
You can simplify the line:
foreach ( Match m in Regex.Matches( s, @"\[E\](\w+@\w+\.\w+)\[/E\]") )
to be:
foreach ( Match m in Regex.Matches( s, @"\[E\](.+)\[/E\]") )
if you want.

Regex string issue in making plain text urls clickable

I need working Regex code in C# that detects plain text URLs (http/https/ftp/ftps) in a string and makes them clickable by putting an anchor tag around each one with the same URL. I have already made a Regex pattern, and the code is attached below.
However, if the input string already contains a clickable URL, the code wraps it in another anchor tag. For example, the already-linked substring "ftp://www.abc.com'>ftp://www.abc.com" in sContent below gets a second anchor tag around it when the code runs. Is there any way to fix it?
string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
Regex regx = new Regex("(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(sContent);
foreach (Match match in matches)
{
    sContent = sContent.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
}
Also, I want Regex code that makes email addresses clickable with a "mailto" link. I can do that myself, but the double anchor tag issue mentioned above will appear there too.

I noticed in your example test string that if a duplicate link, e.g. ftp://www.abc.com, is in the string and is already linked, then the result will be to double anchor that link. The regular expression that you already have and the one that @stema has supplied will work, but you need to approach how you replace the matches in the sContent variable differently.
The following code example should give you what you want:
string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(sContent);
// Walk the matches from last to first so earlier match indexes stay valid
for (int i = matches.Count - 1; i >= 0; i--)
{
    string newURL = "<a href='" + matches[i].Value + "'>" + matches[i].Value + "</a>";
    sContent = sContent.Remove(matches[i].Index, matches[i].Length).Insert(matches[i].Index, newURL);
}
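An alternative to the index bookkeeping above is to let Regex.Replace substitute each match in place through a MatchEvaluator. The sketch below is not the answer's code: Linkifier and Linkify are hypothetical names, and the URL pattern is a deliberately simplified assumption (scheme followed by characters that are not whitespace, quotes or angle brackets). The lookbehind is the same idea as in the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Simplified URL pattern (an assumption for this sketch). The negative
    // lookbehind skips URLs that directly follow href=' or an opening <a ...> tag.
    private static readonly Regex Url = new Regex(
        @"(?<!href='|<a[^>]*>)(?:https?|ftps?)://[^\s'""<>]+",
        RegexOptions.IgnoreCase);

    // Each match is replaced where it was found, so identical URLs elsewhere
    // in the string are not touched by accident.
    public static string Linkify(string content)
    {
        return Url.Replace(content, m => "<a href='" + m.Value + "'>" + m.Value + "</a>");
    }
}
```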

Try this
Regex regx = new Regex("(?<!(?:href='|>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
It should work for your example.
(?<!(?:href='|>)) is a negative lookbehind, which means the pattern matches only if it is not preceded by "href='" or ">".
See lookarounds on regular-expressions.info
and especially the zero-width negative lookbehind assertion on MSDN.
See something similar on Regexr. I had to remove the alternation from the lookbehind, but .NET should be able to handle it.
Update
To ensure that cases like "<p>ftp://www.def.com</p>" are also handled correctly, I improved the regex:
Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
The lookbehind (?<!(?:href='|<a[^>]*>)) now checks that the match is preceded neither by "href='" nor by a tag starting with "<a".
The output for the test string
ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com
is with this expression
ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p><a href='ftp://www.def.com'>ftp://www.def.com</a></p> abbbbb <a href='http://www.ghi.com'>http://www.ghi.com</a>

I know I arrived late to this party, but there are several problems with the regex that the existing answers don't address. First and most annoying, there's that forest of backslashes. If you use C#'s verbatim strings, you don't have to do all that double escaping. And anyway, most of the backslashes weren't needed in the first place.
Second, there's this bit: ([\\w+?\\.\\w+])+. The square brackets form a character class, and everything inside them is treated either as a literal character or a class shorthand like \w. But getting rid of the square brackets isn't enough to make it work. I suspect this is what you were trying for: \w+(?:\.\w+)+.
Third, the quantifiers at the end of the regex - ]*)? - are mismatched. * can match zero or more characters, so there's no point making the enclosing group optional. Also, that kind of arrangement can result in severe performance degradation. See this page for details.
There are other, minor problems, but I won't go into them right now. Here's the new and improved regex:
@"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))"
The negative lookahead - (?!(?>[^<>]*)(>|</a>)) - is what prevents matches inside tags or in the content of an anchor element. The atomic group (?>...) is used because .NET does not support possessive quantifiers such as *+. It's still very crude, though. There are several areas, like inside <script> elements, where you don't want it to match but it does. But trying to cover all the possibilities would result in a mile-long regex.
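The first two points can be checked with a short sketch. CharClassDemo and its members are hypothetical names; the patterns are the fragment from the question and the suggested replacement.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // A regular string needs doubled backslashes; a verbatim string does not.
    public static bool SamePattern()
    {
        return "\\w+\\.\\w+" == @"\w+\.\w+";
    }

    // [\w+?\.\w+] is a character class: any ONE of word char, '+', '?' or '.',
    // repeated by the outer +. It accepts runs of punctuation as well.
    public static string WithClass(string s)
    {
        return Regex.Match(s, @"([\w+?\.\w+])+").Value;
    }

    // Word characters followed by one or more ".word" parts, as a host name.
    public static string WithSequence(string s)
    {
        return Regex.Match(s, @"\w+(?:\.\w+)+").Value;
    }
}
```

A string such as "++??.." is accepted in full by the character-class version but not at all by the sequence version, which is the difference the second point describes.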

Check out "Detect email in text using regex" and "Regex URL Replace, ignore Images and existing Links". Just swap in your own regex for links; with this approach it will never replace a link inside a tag, only in text content.
http://html-agility-pack.net/?z=codeplex
Something like:
string textToBeLinkified = "... your text here ...";
const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(textToBeLinkified);
// Select only text nodes that are not already inside an anchor element
var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]") ?? new HtmlNodeCollection();
foreach (var node in nodes)
{
    node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
}
string linkifiedText = doc.DocumentNode.OuterHtml;

extract links regex c#

I've been trying to solve this problem for the last two hours, but it seems I can't find any solution.
I need to extract links from an HTML file. There are 100+ links, but only 25 of them are valid.
Valid links are placed inside
<td><a href=" (link) ">
First I had (and still have) a problem with double quotes inside verbatim strings. So, I have replaced verbatim with "normal" strings so I can use \" for " but the problem is that this Regex I have written doesn't work
Match LinksTemp = Regex.Match(
htmlCode,
"<td><a href=\"(.*)\">",
RegexOptions.IgnoreCase);
as I get "<td><a href="http://www.google.com"> as output instead of http://www.google.com
Does anyone know how I can solve this problem, and how I can use double quotes inside verbatim strings (example @" <>"das"sa ")?

Escaped double quotes sample: @"some""test"
Regex sample: "<a href=\"(.*?)\">"
var match = Regex.Match(html, "<td><a href=\"(.*?)\">",
    RegexOptions.Singleline); //spelling error
var url = match.Groups[1].Value;
Also you may want to use Regex.Matches(...) instead of Regex.Match(...)
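Putting these pieces together, a sketch that collects every link wrapped in <td><a href="..."> might look like the following. LinkExtractor and ExtractLinks are hypothetical names; the pattern is the one above, written as a verbatim string with doubled quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Returns the href values of all anchors that directly follow a <td> tag.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        // (.*?) is lazy, so each capture stops at the first closing quote.
        foreach (Match m in Regex.Matches(html, @"<td><a href=""(.*?)"">", RegexOptions.IgnoreCase))
        {
            links.Add(m.Groups[1].Value);   // group 1 = the URL only
        }
        return links;
    }
}
```

Groups[1] holds just the URL, while Groups[0] would be the whole <td><a href="..."> fragment that the question was getting back.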

If you want every element, use code like this:
string htmlCode = "<td><a href=\" www.aa.pl \"><td> <a href=\" www.cos.com \"><td>";
Regex r = new Regex( "<a href=\"(.*?)\">", RegexOptions.IgnoreCase );
MatchCollection mc = r.Matches(htmlCode);
foreach ( Match m1 in mc ) {
    MessageBox.Show( m1.Groups[1].ToString() );
}

Why not parse this with an HTML parser? HtmlAgilityPack is a good and fast one.
Example:
string HTML = "<td><a href='http://www.google.com'>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HTML);
HtmlNodeCollection a = doc.DocumentNode.SelectNodes("//a[@href]");
string url = a[0].GetAttributeValue("href", null);
Console.WriteLine(url);
Console.ReadLine();
You need to import it with using HtmlAgilityPack;
