I have the following jQuery code that relates to an HTML document from a website.
$
Anything is appreciated,
Salute.
From what I can remember of using the HtmlAgilityPack:
var rawText = "<html><head></head><body><div id='container'><article><p>stuff<p></article><article><p>stuff2</p></article></div></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(rawText);
var stuff = doc.DocumentNode.Descendants("div")
.SelectMany(div => div.Descendants("article"));
var length = stuff.Count();
var textValues = stuff.Select(a => a.InnerHtml).ToList();
Output:
length: 2
textValues: List<String> (2 items)
<p>stuff<p>
<p>stuff2</p>
To get the HTML, instead of hardcoding it as above, use the WebClient class, since it has a simpler API than WebRequest.
var client = new WebClient();
var html = client.DownloadString("http://yoursite.com/file.html");
To answer your question specifically, using the System.Net namespace you would do this:
Go here to see how to use the WebRequest class itself to get the content:
http://msdn.microsoft.com/en-us/library/456dfw4f%28v=vs.110%29.aspx
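For reference, a minimal sketch of what the WebRequest route from that article looks like (untested here, and the URL is just a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class FetchPage
{
    static void Main()
    {
        // create the request; the URL is a placeholder
        var request = WebRequest.Create("http://yoursite.com/file.html");
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            // the raw HTML you would then hand to the Html Agility Pack
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```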
Next, after you get the content back, you need to parse it using the Html Agility Pack, found here: http://htmlagilitypack.codeplex.com/
As for how one would translate the jQuery into C#, this is an untested example:
var doc = new HtmlDocument();
doc.Load(@"D:\test.html"); // you can also use a memory stream instead
var container = doc.GetElementbyId("container");
foreach (HtmlNode node in container.Elements("img"))
{
HtmlAttribute valueAttribute = node.Attributes["value"];
if (valueAttribute != null) Console.WriteLine(valueAttribute.Value);
}
In your case, the attributes you want after you find the element are alt, src, and href.
It will take you about a day to learn the Agility Pack, but it's mature, fast, and well liked by the community.
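A hedged continuation of the example above: once you have the container node, the alt, src, and href values can be read with GetAttributeValue (the element names here are assumptions, not tested against your page):

```csharp
// assumes "container" is the node found with GetElementbyId above
foreach (HtmlNode img in container.Descendants("img"))
{
    // the second argument is the default returned when the attribute is absent
    var src = img.GetAttributeValue("src", null);
    var alt = img.GetAttributeValue("alt", null);
    if (src != null) Console.WriteLine(src + " (" + alt + ")");
}
foreach (HtmlNode a in container.Descendants("a"))
{
    var href = a.GetAttributeValue("href", null);
    if (href != null) Console.WriteLine(href);
}
```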
Related
I am currently working on a project and I am learning HAP as I go.
I get the basics of it and it seems like it could be very powerful.
I'm having an issue right now: I am trying to scrape a product on one website and get the links to the images, but I don't know how to extract the link from the XPath.
I used to do this with regex, which was a lot easier, but I am moving on to HAP.
This is my current code. I don't think it will be very useful to see, but I'll put it in either way.
private static void HAP()
{
var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
// HtmlWeb - a utility class to get an HTML document over HTTP
var web = new HtmlWeb();
// Load() downloads the specified HTML document from an Internet resource
var doc = web.Load(url);
var rootNode = doc.DocumentNode;
var divs = doc.DocumentNode.SelectNodes(String.Format("//img[@src='{0}']", "www.dhresource.com/webp/m/100x100/f2/albu/g5/M00/14/45/rBVaI1kWttaAI1IrAATeirRp-t8793.jpg"));
Console.WriteLine(divs);
Console.ReadLine();
}
This is the link I am scraping from
https://www.dhgate.com/product/2017-led-light-up-hand-spinners-fidget-spinner/398793721.html#s1-0-1b;searl|4175152669
And this should be the XPath of the first image:
//img[@src='//www.dhresource.com/webp/m/100x100s/f2-albu-g5-M00-6E-20-rBVaI1kWtmmAF9cmAANMKysq_GY926.jpg/2017-led-light-up-hand-spinners-fidget-spinner.jpg']
I created a helper method for this.
I had to get the node, then the attribute, and then cycle through the attributes to get all the links.
private static void HAP()
{
//Declare the URL
var url = "https://www.dhgate.com/product/brass-hexagonal-fidget-spinner-hexa-spinner/403294406.html#gw-0-4|ff8080815e03d6df015e9394cc681f8a:ff80808159abe8a5015a3fd78c5b51bb";
// HtmlWeb - a utility class to get an HTML document over HTTP
var web = new HtmlWeb();
// Load() downloads the specified HTML document from an Internet resource
var doc = web.Load(url);
var rootNode = doc.DocumentNode;
var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
foreach (var img in nodes)
{
var link = img.Attributes["src"].Value;
Console.WriteLine(link);
}
Console.ReadLine();
}
I searched but could not find anything that worked for me.
A while ago I started with C# and my first personal project was a simple WebCrawler.
It should check the source code for special strings, to identify whether, for example, Google Analytics or something similar is included.
It works fine, but of course I'm missing the JS and iframes, since HttpWebRequest does not render the website, as far as I know.
So I wanted to check for "<script src="" for example and then get the URL through a split.
But this does not work as expected, and I don't think it is a clean, good way.
Since I'm checking for strings, it could be broken by simply changing the string from "<script" to "< script", for example, so I have no idea how to reliably pull a specific string out of a big one.
I found regular expressions (regex) and Split, but I'm not sure they would be good, since there could be more variants of "src=" than a split("\"", "\"", text) can handle.
I don't want a "here you go", of course; I want to understand and do it myself, but I have no idea where to go from here.
Sorry for the long text and no examples, but at the moment I have no access, and there is not really much except for regex and splits.
EDIT: I think I'll create a class which checks every char for a special row like "
Best,
Mike
Try the Html Agility Pack.
I haven't used it personally, but something like this should work (I haven't tested it):
string url = "some/url";
var request = (HttpWebRequest)WebRequest.Create(url);
var webResponse = (HttpWebResponse)request.GetResponse();
var responseStream = webResponse.GetResponseStream();
var streamReader = new StreamReader(responseStream);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(streamReader.ReadToEnd());
var scripts = doc.DocumentNode.Descendants()
.Where(n => n.Name == "script");
This should get you all the script nodes, to do with them what you want =)
So I found a way to get the JS URLs; here is my code:
List<string> srcurl = new List<string>();
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("some/url");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//script[@src]");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["src"];
srcurl.Add(link.Value);
}
Regarding the code from @avidenic, if you want to use it, be sure to use
doc.LoadHtml(streamReader.ReadToEnd());
Best,
Mike
Is it possible to set a custom encoding when loading pages with the method below?
HtmlWeb hwWeb = new HtmlWeb();
HtmlDocument hd = hwWeb.Load("myurl");
I want to set encoding to "iso-8859-9".
I use C# 4.0 and WPF.
Edit: The question has been answered on MSDN.
I suppose you could try overriding the encoding in the HtmlWeb object.
Try this:
var web = new HtmlWeb
{
AutoDetectEncoding = false,
OverrideEncoding = myEncoding,
};
var doc = web.Load(myUrl);
Note: it appears that the OverrideEncoding property was added to the Html Agility Pack in revision 76610, so it is not available in the current release, v1.4 (66017). The next best thing to do is read the page manually with the encoding overridden:
var document = new HtmlDocument();
using (var client = new WebClient())
{
using (var stream = client.OpenRead(url))
using (var reader = new StreamReader(stream, Encoding.GetEncoding("iso-8859-9")))
{
var html = reader.ReadToEnd();
document.LoadHtml(html);
}
}
This is a simple version of the solution answered here (for some reason it got deleted).
A decent answer, which handles auto-detecting the encoding as well as some other nifty features, is over here:
C# and HtmlAgilityPack encoding problem
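For what it's worth, a sketch of the auto-detection approach along the lines of that answer, assuming HtmlDocument.DetectEncoding is available in your version of the Html Agility Pack (untested):

```csharp
var doc = new HtmlDocument();
using (var client = new WebClient())
{
    Encoding enc;
    // let HAP sniff the encoding from the BOM / meta charset tag
    using (var stream = client.OpenRead(url))
        enc = doc.DetectEncoding(stream) ?? Encoding.GetEncoding("iso-8859-9");
    // the first stream was consumed by the sniffing pass, so fetch again
    client.Encoding = enc;
    doc.LoadHtml(client.DownloadString(url));
}
```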
I am trying to grab data from a webpage: a particular <div> of class "personal_info", i.e. <div class="personal_info">. The page has about 10 similar <div>s, all of the same class "personal_info" (as shown in the HTML code below), and I want to extract all the divs of class personal_info, of which there are 10-15 on every webpage.
<div class="personal_info"><span class="bold">Rama Anand</span><br><br> Mobile: 9916184586<br>rama_asset@hotmail.com<br> Bangalore</div>
To do the needful, I started using the Html Agility Pack, as suggested by someone on Stack Overflow,
and I got stuck at the very beginning because of my lack of knowledge of the HtmlAgilityPack. My C# code goes like this:
HtmlAgilityPack.HtmlDocument docHtml = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlWeb docHFile = new HtmlWeb();
docHtml = docHFile.Load("http://127.0.0.1/2.html");
Then how do I code further, so that the data from the divs whose class is "personal_info" can be grabbed? A suggestion with an example will be appreciated.
I can't check this right now, but isn't it:
var infos = from info in docHtml.DocumentNode.SelectNodes("//div[@class='personal_info']") select info;
To get a url loaded you can do something like:
var document = new HtmlAgilityPack.HtmlDocument();
var url = "http://www.google.com";
var request = (HttpWebRequest)WebRequest.Create(url);
using (var responseStream = request.GetResponse().GetResponseStream())
{
document.Load(responseStream, Encoding.UTF8);
}
Also note there is a fork to let you use jquery selectors in agility pack.
IEnumerable<HtmlNode> myList = document.QuerySelectorAll(".personal_info");
http://yosi-havia.blogspot.com/2010/10/using-jquery-selectors-on-server-sidec.html
What happened to Where?
node.DescendantNodes().Where(node_it => node_it.Name == "div");
If you want the top node (root), you use page.DocumentNode as "node".
Folks,
I'm trying to extract data from a web page using C#. For the moment, I use the Stream from the WebResponse and I parse it as one big string. It's long and painful. Does someone know a better way to extract data from a webpage? I saw WINHTTP, but it isn't for C#.
To download data from a web page, it is easier to use WebClient:
string data;
using (var client = new WebClient())
{
data = client.DownloadString("http://www.google.com");
}
For parsing the downloaded data, provided that it is HTML, you can use the excellent Html Agility Pack library.
And here's a complete example extracting all the links from a given page:
class Program
{
static void Main(string[] args)
{
using (var client = new WebClient())
{
string data = client.DownloadString("http://www.google.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(data);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlNode link in nodes)
{
HtmlAttribute att = link.Attributes["href"];
Console.WriteLine(att.Value);
}
}
}
}
If the webpage is valid XHTML, you can read it into an XPathDocument and XPath your way quickly and easily straight to the data you want. If it's not valid XHTML, there are HTML parsers out there you can use.
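To illustrate the XPathDocument route with plain System.Xml (the XHTML fragment here is made up; a real page would come from a stream or URL):

```csharp
using System;
using System.IO;
using System.Xml.XPath;

class XhtmlQuery
{
    public static void Main()
    {
        // a stand-in XHTML fragment; in practice this would be the downloaded page
        var xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
                    "<a href='http://a.example/'>one</a>" +
                    "<a href='http://b.example/'>two</a></body></html>";
        var doc = new XPathDocument(new StringReader(xhtml));
        XPathNavigator nav = doc.CreateNavigator();
        // XHTML elements live in a namespace, so match them with local-name()
        XPathNodeIterator it = nav.Select("//*[local-name()='a']/@href");
        while (it.MoveNext())
            Console.WriteLine(it.Current.Value); // prints the two hrefs
    }
}
```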
I found a similar question with an answer that should help:
Looking for C# HTML parser