I've been trying to extract data from a website by giving it the HTML string.
I did some research and figured out that I had to use the HtmlAgilityPack; however,
I can't figure out how to apply the examples to my case.
I've done different tests but none seem to work.
A webpage example could be
http://www.trivago.com/?aDateRange[arr]=2014-11-02&aDateRange[dep]=2014-11-03&iRoomType=7&iPathId=34741&iGeoDistanceItem=0&iViewType=0&bIsSeoPage=false&bIsSitemap=false&
I would only need to extract the contact data,
address, telephone, the official homepage link and the title of the
element in the list.
I tried moving through the source with Firebug and the class structure
to get to this data is as follows:
class="no-touch"
class="web10152"
class="page_wrapper"
class="main_content"
class="main"
class="centercol content"
class="content"
class="container_itemlist itemlist_simplified"
class="itemlist hotellist group component" // Has a List of each item
// Item (undernode of itemlist hotellist group component)
class="hotel item bookmarkable historisable" //item main class
// Path to get title
class="cf item_wrapper"
class="item_prices"
<h3 title="ITEM TITLE"></h3>
// Path to get contact info
class="slideout_wrapper component expand"
class="slideout_content_container"
class="slideout_content info item_info js_trivago_info active"
class="item_info_block contact" // Contains info
<em> ADDRESS INFORMATION </em>
<em> TELEPHONE INFO </em>
class="partnerHomepageLink link"
//Contains Link info
I don't know how to communicate this with HtmlAgilityPack.
Here is the last thing I tried:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
try
{
var table = doc.DocumentNode.SelectSingleNode("//h3[@class='jsheadline js_slideout_trigger js_trackable']/title");
var table1 = doc.DocumentNode.SelectSingleNode("//div[@class='item_info_block contact']");
var ele = table1.Elements("em");
}
catch
{
Program.ChangeColor(Program.TextColors.PROGRAM_ERROR);
Console.WriteLine("\nError Report: Failed to parse page!");
}
How can I achieve this?
Related
I'm trying to scrape a link from the source code of a website; the link changes every time I fetch the source.
For example:
<div align="center">
<a href="http://www10.site.com/d/the rest of the link">
<span class="button_upload green">
The next time I get the source code, the http://www10 prefix changes to http://www followed by some other number, such as http://www65.
How can I scrape the exact link, whatever the number has changed to?
Edit:
Here's how I use the regex:
MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);
You mentioned in the comments that you use regular expressions to parse the HTML document. That is the hardest way to do this (and generally not recommended). Try using an HTML parser such as HTML Agility Pack: http://html-agility-pack.net
You can install HTML Agility Pack via NuGet. Here is an example (adapted from their website):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
att.Value = FixLink(att);
}
doc.Save("file.htm");
It can also load string contents, not just files. You use XPath (or CSS selectors, through an add-on such as Fizzler) to navigate inside the document and select what you want.
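As a rough sketch of the string-loading route applied to the markup from the question: the sample HTML, the class name LinkScraper, and the assumption that the button span is a direct child of the link are all made up for illustration, so adjust the XPath to the real page.

```csharp
using System;
using HtmlAgilityPack;

class LinkScraper
{
    static void Main()
    {
        // Hypothetical sample modelled on the question; the host number varies per request.
        string html =
            "<div align=\"center\">" +
            "<a href=\"http://www65.site.com/d/the rest of the link\">" +
            "<span class=\"button_upload green\"></span></a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml takes markup, not a file path

        // Pick the <a> that wraps the upload button, whatever its host number is.
        HtmlNode link = doc.DocumentNode.SelectSingleNode(
            "//a[span[contains(@class, 'button_upload')]]");

        // SelectSingleNode returns null when nothing matches, so guard before reading.
        string href = link == null ? "" : link.GetAttributeValue("href", "");
        Console.WriteLine(href);
    }
}
```

Because the selector keys on the button's class rather than on the host name, it should not matter whether the link starts with www10 or www65.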
How about a JS function like this, run when the page loads:
// jQuery is required!
var updateLinkUrl = function (num) {
$.each($('.button_upload.green'), function (pos, el) {
var orig = $(el).parent().prop("href");
var newurl = orig.replace("www10", "www" + num);
$(el).parent().prop("href", newurl);
});
};
$(document).ready(function () { updateLinkUrl(65); });
I've tried two ways to get just the text from an HTML page with HTML Agility Pack:
Method 1
var root = doc.DocumentNode;
foreach (HtmlNode node in root.SelectNodes("//text()"))
{
sb.AppendLine(node.InnerText.Trim() + " ");
}
Method 2
var root = doc.DocumentNode;
foreach (var node in root.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim() + " ");
}
}
Both of these leave behind the </form> tags if they are present on the page. For example, here's www.google.com:
"body": " Search Images Maps Play YouTube News Gmail Drive More Calendar
Translate Mobile Books Wallet Shopping Blogger Finance Photos Videos Docs
Even more » Account Options Sign in Search settings Web History
× Try a fast, secure browser with updates built in. Yes, get Chrome
now Advanced search Language tools </form> Advertising Programs
Business Solutions +Google About Google © 2016 - Privacy - Terms "
What gives?
Edit: By "Just the text" I mean "language text"....so:
<i>book reports</i> becomes book reports
More Details becomes More Details
<div>Check out our <b>deals</b>!</div> becomes Check out our deals!
Please search for your question before posting
Using C# regular expressions to remove HTML tags
Samples pulled from this webpage
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
Or if you want to use Agility (also pulled from webpage)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
output.AppendLine(line);
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());
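On the stray </form> specifically: as far as I know, HTML Agility Pack flags form as a special element by default (through the static HtmlNode.ElementsFlags dictionary), so the closing tag can end up as a text node. A minimal sketch of the usual workaround follows, assuming that default applies to your version; the sample markup and the class name TextOnly are invented for illustration, and the XPath also skips script and style content.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class TextOnly
{
    static void Main()
    {
        // Assumption: "form" is in ElementsFlags by default; removing it makes HAP
        // parse <form> as an ordinary container. Remove is harmless if the key is absent.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<body><form><div>Check out our <b>deals</b>!</div></form>" +
            "<script>var x = 1;</script></body>");

        // Text nodes only, excluding anything inside <script> or <style>.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::script) and not(ancestor::style)]");

        // SelectNodes returns null when nothing matches.
        string raw = nodes == null
            ? ""
            : string.Concat(nodes.Select(n => WebUtility.HtmlDecode(n.InnerText)));

        // Collapse runs of whitespace so the result reads as plain language text.
        string text = Regex.Replace(raw, @"\s+", " ").Trim();
        Console.WriteLine(text);
    }
}
```

The flag has to be removed before the document is loaded, since it affects parsing rather than querying.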
Let's say this is my HTML:
<a class="" data-tracking-id="0_Motorola"
href="/motorola?otracker=nmenu_sub_electronics_0_Motorola">
Motorola
</a>
I used this C# code to find the href value:
var tags = htmlDoc.DocumentNode.SelectNodes("//div[@class='top-menu unit']//ul//li//div[@id='submenu_electronics']//a");
if (tags != null)
{
foreach (var t in tags)
{
var name = t.InnerText.Trim();
var url =t.Attributes["href"].Value;
}
}
I am getting url='/motorola', but I need url='/motorola?otracker=nmenu_sub_electronics_0_Motorola'.
The text after ? and & is being dropped. Where did I go wrong?
I have used HtmlAgilityPack in the past and I have previously used it like this :
var url = t.GetAttributeValue("href","");
You can try that and see if it works.
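For reference, a small self-contained sketch of that call against the markup from the question. The second anchor without an href and the class name HrefDemo are added here purely to show the default value being used; HtmlEntity.DeEntitize is included on the assumption that the real page may encode & as &amp; inside attributes.

```csharp
using System;
using HtmlAgilityPack;

class HrefDemo
{
    static void Main()
    {
        string html =
            "<a class=\"\" data-tracking-id=\"0_Motorola\" " +
            "href=\"/motorola?otracker=nmenu_sub_electronics_0_Motorola\">Motorola</a>" +
            "<a>no href here</a>"; // hypothetical extra link with no href

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//a");
        if (tags != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode t in tags)
            {
                string name = t.InnerText.Trim();
                // The second argument is returned when the attribute is missing,
                // which avoids a NullReferenceException from Attributes["href"].Value.
                string url = HtmlEntity.DeEntitize(t.GetAttributeValue("href", ""));
                Console.WriteLine(name + " -> " + url);
            }
        }
    }
}
```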
I am using VS2010 with HtmlAgilityPack 1.4.6 (from the Net40 folder).
Following is my HTML
<html>
<body>
<div id="header">
<h2 id="hd1">
Patient Name
</h2>
</div>
</body>
</html>
I am using following code in C# to access "hd1".
Please tell me correct way to do it.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
string filePath = "E:\\file1.htm";
htmlDoc.LoadHtml(filePath);
if (htmlDoc.DocumentNode != null)
{
HtmlNodeCollection _hdPatient = htmlDoc.DocumentNode.SelectNodes("//h2[@id=hd1]");
// htmlDoc.DocumentNode.SelectNodes("//h2[@id='hd1']");
//_hdPatient.InnerHtml = "Patient SurName";
}
}
catch (Exception ex)
{
throw ex;
}
I have tried many permutations and combinations, but I always get null.
Please help.
The problem is how you load data into the HtmlDocument. To load data from a file, use the Load(fileName) method. You are calling LoadHtml(htmlString) instead, which treats "E:\\file1.htm" as the document content itself. When HtmlAgilityPack looks for h2 tags in the string "E:\\file1.htm", it finds nothing. Here is the correct way to load an HTML file:
string filePath = "E:\\file1.htm";
htmlDoc.Load(filePath); // use instead of LoadHtml
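To see the difference side by side, here is a minimal sketch. It writes a temporary file first so it can run anywhere; the temp path and the class name LoadDemo are assumptions for the demo, not part of the original setup.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class LoadDemo
{
    static void Main()
    {
        // Hypothetical temp file standing in for E:\file1.htm.
        string filePath = Path.Combine(Path.GetTempPath(), "file1.htm");
        File.WriteAllText(filePath,
            "<html><body><div id=\"header\"><h2 id=\"hd1\">Patient Name</h2></div></body></html>");

        var fromFile = new HtmlDocument();
        fromFile.Load(filePath);       // reads and parses the file's contents

        var fromString = new HtmlDocument();
        fromString.LoadHtml(filePath); // parses the path text itself as markup

        const string xpath = "//h2[@id='hd1']";
        Console.WriteLine(fromFile.DocumentNode.SelectSingleNode(xpath) != null);   // True
        Console.WriteLine(fromString.DocumentNode.SelectSingleNode(xpath) != null); // False
    }
}
```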
Also, @Simon Mourier correctly pointed out that you should use the SelectSingleNode method if you are trying to get a single node:
// Single HtmlNode
var patient = doc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
patient.InnerHtml = "Patient SurName";
Or, if you are working with a collection of nodes, process them in a loop:
// Collection of nodes
var patients = doc.DocumentNode.SelectNodes("//div[@class='patient']");
foreach (var patient in patients)
patient.SetAttributeValue("style", "visibility: hidden");
You were almost there:
HtmlNode _hdPatient = htmlDoc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
_hdPatient.InnerHtml = "Patient SurName";
I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
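Whichever selector you use, it is worth guarding against the element being absent from the downloaded HTML, since SelectSingleNode returns null in that case and dereferencing it is what raises "Object reference not set to an instance of an object". A minimal sketch, with invented static markup standing in for a page whose heading is only written by JavaScript:

```csharp
using System;
using HtmlAgilityPack;

class NullGuardDemo
{
    static void Main()
    {
        // Hypothetical markup as a non-JavaScript client might receive it:
        // the element with id 'my_name' has not been written yet.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div id=\"content\"></div></body></html>");

        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']");

        if (node == null)
        {
            // No match: report it instead of dereferencing null.
            Console.WriteLine("Element not found in the downloaded HTML.");
        }
        else
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
```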
Try this:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public async Task<List<string>> GetAllTagLinkContent(string content)
{
string html = string.Format("<html><head></head><body>{0}</body></html>", content);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
return nodes.ToList().ConvertAll(r => r.InnerText).Select(j => j).ToList();
}
This also works with ("//a[@href]"). You can try it as shown above. Hope this helps.