Regular Expression to get the SRC of images in C# - c#

I'm looking for a regular expression to isolate the src value of an img.
(I know that this is not the best way to do this but this is what I have to do in this case)
I have a string which contains simple HTML code, some text, and an image. I need to get the value of the src attribute from that string. So far I have only managed to isolate the whole tag:
string matchString = Regex.Match(original_text, @"(<img([^>]+)>)").Value;

string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
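For example (the input string here is made up for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical input for illustration
string original_text = "<p>Hello</p><img class=\"photo\" src=\"images/cat.jpg\" alt=\"cat\" />";

// Group 1 captures just the src value, not the whole tag
string src = Regex.Match(original_text,
    "<img.+?src=[\"'](.+?)[\"'].*?>",
    RegexOptions.IgnoreCase).Groups[1].Value;

// src == "images/cat.jpg"
```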

I know you say you have to use regex, but if possible I would really give this open source project a chance:
HtmlAgilityPack
It is really easy to use; I just discovered it and it helped me out a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath to get your elements.
Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.
The code for your query would look something like this: (uncompiled code)
List<string> imgSrcs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText); //or doc.Load(htmlFileStream)
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
foreach (var img in nodes)
{
HtmlAttribute att = img.Attributes["src"];
imgSrcs.Add(att.Value);
}

I tried what Francisco Noriega suggested, but it looks like the API of the HtmlAgilityPack has been altered. Here is how I solved it:
List<string> images = new List<string>();
WebClient client = new WebClient();
string site = "http://www.mysite.com";
var htmlText = client.DownloadString(site);
var htmlDoc = new HtmlDocument()
{
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true
};
htmlDoc.LoadHtml(htmlText);
foreach (HtmlNode img in htmlDoc.DocumentNode.SelectNodes("//img"))
{
HtmlAttribute att = img.Attributes["src"];
images.Add(att.Value);
}

This should capture all img tags and just the src part no matter where it's located (before or after class, etc.) and supports html/xhtml :D
<img.+?src="(.+?)"[^>]*>

The regex you want should be along the lines of:
(<img.*?src="([^"]*)".*?>)
Hope this helps.

You can also use a lookbehind to do it without needing to pull out a group:
(?<=<img.*?src=")[^"]*
Remember to escape the quotes if needed.
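A quick sketch of how that looks in C# (sample input is made up):

```csharp
using System.Text.RegularExpressions;

string html = "text <img width=\"10\" src=\"a.png\"> text";
// .NET supports variable-length lookbehind, so .*? inside it is fine
string src = Regex.Match(html, "(?<=<img.*?src=\")[^\"]*").Value;
// src == "a.png"
```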

This is what I use to get the tags out of strings:
</? *img[^>]*>
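For instance, to strip the tags out with Regex.Replace (a quick sketch, sample string made up):

```csharp
using System.Text.RegularExpressions;

string text = "before <img src=\"a.png\"> after";
string stripped = Regex.Replace(text, @"</? *img[^>]*>", "");
// stripped == "before  after"
```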

Here is the one I use:
<img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>
The good part is that it matches any of the below:
<img src='test.jpg'>
<img src=test.jpg>
<img src="test.jpg">
And it can also match some unexpected scenarios, like extra attributes, e.g.:
<img src = "test.jpg" width="300">
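A small sketch that runs the pattern over all four variants (only tested mentally against these samples):

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>";
string[] samples =
{
    "<img src='test.jpg'>",
    "<img src=test.jpg>",
    "<img src=\"test.jpg\">",
    "<img src = \"test.jpg\" width=\"300\">"
};
foreach (string html in samples)
{
    // The named group "src" is filled by whichever alternative matched
    Console.WriteLine(Regex.Match(html, pattern).Groups["src"].Value);
}
```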

Related

Retrieve the last match (or a list of matches) with a regular expression and then work with it

My issue is that I download the HTML page content to a string with
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("http://prices.shufersal.co.il/");
and am trying to retrieve the last page number from the navigation menu:
<a data-swhglnk=\"true\" href=\"/?page=2\">2</a>
So in the end I want to find the last data-swhglnk and retrieve the last page number from it.
I tried
Regex.Match(webData, @"swhglnk", RegexOptions.RightToLeft);
I would be happy to understand the right approach to issues like this.
If you're about to parse HTML and find some information in it, you should use a method more reliable than regex, e.g.:
- HtmlAgilityPack https://htmlagilitypack.codeplex.com/
- CsQuery https://github.com/jamietre/CsQuery
and operate on objects, not strings.
Update
If you decide to use HtmlAgilityPack, you will have to write code like this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webData);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[@data-swhglnk]"))
{
HtmlAttribute data = node.Attributes["data-swhglnk"];
//do your processing here
}
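Building on that, a minimal sketch for getting the last page number (it assumes every paging href looks like "/?page=N", as in the snippet above):

```csharp
// Take the last <a data-swhglnk> node and read the number from its href
var pageLinks = doc.DocumentNode.SelectNodes("//a[@data-swhglnk]");
var last = pageLinks[pageLinks.Count - 1];
string href = last.GetAttributeValue("href", "");
int lastPage = int.Parse(href.Substring(href.IndexOf("page=") + "page=".Length));
```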

How can I read an HTML file a Paragraph at a time?

I reckon it would be something like (pseudocode):
var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
par = getNextParagraph();
pars.Add(par);
}
...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.
Does anybody have insight on how exactly to do this / a better methodology?
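A literal implementation of that pseudocode with plain string searching (a sketch that assumes flat, lowercase <p>...</p> pairs with no attributes and no nesting):

```csharp
using System.Collections.Generic;
using System.IO;

var pars = new List<string>();
string html = File.ReadAllText("Platypus.html");
int pos = 0;
while (true)
{
    int start = html.IndexOf("<p>", pos);
    if (start < 0) break;
    int end = html.IndexOf("</p>", start);
    if (end < 0) break;
    pars.Add(html.Substring(start + 3, end - start - 3));
    pos = end + 4; // "burn the bridge": resume after this paragraph
}
```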
UPDATE
I tried to use Aurelien Souchet's code.
I have the following usings:
using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;
...but this code:
HtmlDocument doc = new HtmlDocument();
is unwanted ("Cannot access private constructor 'HtmlDocument' here")
Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" err msg
UPDATE 2
Okay, I had to prepend "HtmlAgilityPack." so that the ambiguous reference was disambiguated.
As people suggest in the comments, I think HtmlAgilityPack is the best choice; it's easy to use and it's easy to find good examples and tutorials.
Here is what I would write:
//don't forget to add the reference
using HtmlAgilityPack;
//Function that takes the html as a string in parameter and return a list
//of strings with the paragraphs content.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
var pars = new List<string>();
//first create an HtmlDocument
HtmlDocument doc = new HtmlDocument();
//load the html (from a string)
doc.LoadHtml(sourceHtml);
//Select all the <p> nodes in a HtmlNodeCollection
HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");
//Iterates on every Node in the collection
foreach (HtmlNode paragraph in paragraphs)
{
//Add the InnerText to the list
pars.Add(paragraph.InnerText);
//Or paragraph.InnerHtml depends what you want
}
return pars;
}
It's just a basic example; if you have nested paragraphs in your html this code may not work as expected. It all depends on the html you are parsing and what you want to do with it.
Hope it helps!

Parsing Hyperlinks from a webpage

I have written following code to parse hyperlinks from a given page.
WebClient web = new WebClient();
string html = web.DownloadString("http://www.msdn.com");
string[] separators = new string[] { "<a ", ">" };
List<string> hyperlinks= html.Split(separators, StringSplitOptions.None).Select(s =>
{
if (s.Contains("href"))
return s;
else
return null;
}).ToList();
The string split still has to be tweaked to return URLs perfectly, though. My question: is there some data structure, something along the lines of XmlReader, which could read HTML strings efficiently?
Any suggestion for improving above code would also be helpful.
Thanks for your time.
try HtmlAgilityPack
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.msdn.com");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
Console.WriteLine(link.GetAttributeValue("href", null));
}
this will print out every link on your URL.
if you want to store the links in a list:
var linkList = doc.DocumentNode.SelectNodes("//a[#href]")
.Select(i => i.GetAttributeValue("href", null)).ToList();
You should be using a parser. The most widely used one is HtmlAgilityPack. Using that, you can interact with the HTML as a DOM.
Assuming you're dealing with well formed XHTML, you could simply treat
the text as an XML document. The framework is loaded with features to
do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Does .NET framework offer methods to parse an HTML string?
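For example, a sketch of the XmlDocument route (it only works if the markup really is valid XML, and a declared default xmlns on <html> would require an XmlNamespaceManager for the XPath):

```csharp
using System;
using System.Xml;

string xhtml = "<html><body><a href=\"http://example.com\">link</a></body></html>"; // sample input
XmlDocument doc = new XmlDocument();
doc.LoadXml(xhtml); // throws XmlException if the markup is not well formed
foreach (XmlNode a in doc.SelectNodes("//a[@href]"))
{
    Console.WriteLine(a.Attributes["href"].Value);
}
```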
refactored,
var html = new WebClient().DownloadString("http://www.msdn.com");
var separators = new[] { "<a ", ">" };
html.Split(separators, StringSplitOptions.None).Select(s => s.Contains("href") ? s : null).ToList();

Parse through current page

Is there a way to get a page to parse through itself?
So far I have:
string whatever = TwitterSpot.InnerHtml;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(whatever);
foreach("this is where I am stuck")
{
}
I want to parse the page, so what I did was create a parent div named TwitterSpot, put its InnerHtml into a string, and load it as a new HtmlDocument.
Next I want to find within that the string value "#XXXX+n " and replace it in the rendered page with some cool formatting.
I am getting stuck on my foreach loop: I do not know how I should search for a # or how to look through the loaded HtmlDocument.
The next step is to apply the change wherever I have seen a # tag. I know I could probably do this a lot easier in JavaScript, but I am adamant on seeing how I can get ASP.NET C# to do it.
The # is a string value within the HTML; I am not referring to it as a Control ID.
Assuming you're using HtmlAgilityPack, you could use xpath to find text nodes which contain your value:
var matchedNodes = document.DocumentNode
.SelectNodes("//text()[contains(.,'#XXXX+n ')]");
Then you can just iterate through these nodes and make all the necessary replacements:
foreach (HtmlTextNode node in matchedNodes)
{
node.Text = node.Text.Replace("#XXXX+n ", "brand new text");
}
You can use http://htmlagilitypack.codeplex.com/ to parse HTML and manipulate its content; works very well.
I guess you could use RegEx to find all matches and loop through them.
You could just change it to be:
string whatever = TwitterSpot.InnerHtml;
whatever = whatever.Replace("#XXXX+n ", String.Format("<b>{0}</b>", "#XXXX+n "));
No parsing required...
When I did this before, I stored the HTML in an XML doc and looped through each node. You can then apply XSLT or just parse the nodes.
It sounds like for your purposes though that you don't really need to do that. I'd recommend making the divs into server controls and programmatically looping through their child controls, as such:
foreach (Control o in divSomething.Controls)
{
if (o is TextBox && o.ID == "txtSomething")
{
((TextBox)o).Attributes.Add("style", "font: Arial; color: Red;");
}
}

To search for strings with in a string (search for all hrefs in HTML source)

I have a string variable that contains the entire HTML of a web page.
The web page would contain links to other websites. I would like to create a list of all hrefs (webcrawler-like).
What is the best possible way to do it ?
Will using any extension function help ? what about using Regex ?
Thanks in Advance
Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.
There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:
string html = "your HTML here";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.DescendantNodes()
.Where(n => n.Name == "a" && n.Attributes.Contains("href"))
.Select(n => n.Attributes["href"].Value);
I think you'll find this answers your question to a T
http://msdn.microsoft.com/en-us/library/t9e807fx.aspx
:)
I would go with Regex.
Regex exp = new Regex(
@"href\s*=\s*""[^""]*""",
RegexOptions.IgnoreCase);
string InputText = ""; //supply with the downloaded HTML
MatchCollection MatchList = exp.Matches(InputText);
Try this Regex (should work):
var matches = Regex.Matches (html, #"href=""(.+?)""");
You can go through the matches and extract the captured URL.
Have you looked into using HtmlAgilityPack? http://htmlagilitypack.codeplex.com/
With this you can simply us XPATH to get all of the links on the page and put them into a list.
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
List<string> hrefTags = new List<string>();
foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
hrefTags.Add(att.Value);
}
return hrefTags;
}
Taken from another post here - Get all links on html page?
