Extracting content from Webpage - c#

I am attempting to use HtmlAgilityPack to extract all the content from a webpage.
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
sb.AppendLine(node.Text);
}
When I try to parse google.com using the above code, I get lots of JavaScript. All I want is to extract the content of the webpage, such as the text in h or p tags. For example, taking the question, answers and comments on this page and removing everything else.
I am really new to XPath and don't know how to move forward, so any help would be appreciated.

You can filter the unwanted tags by name and remove them from your document before reading the text.
doc = page.Load("http://www.google.com");
doc.DocumentNode.Descendants().Where(n => n.Name == "script" || n.Name == "style").ToList().ForEach(n => n.Remove());

You could use this XPath expression:
//body//*[local-name() != 'script']/text()
It takes only the elements inside the body and skips the script elements
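A variation on the expression above also skips style elements and whitespace-only text nodes, and uses ancestor:: so that text nested deeper inside a script is excluded as well. The following is only a sketch: it uses System.Xml.XmlDocument (which needs well-formed markup) instead of HtmlAgilityPack so that it runs without extra packages, and the class and method names are made up for illustration. The same XPath string should be usable with HtmlAgilityPack's SelectNodes.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class VisibleTextSketch
{
    // Returns the trimmed, non-empty text nodes under <body>,
    // skipping anything nested inside <script> or <style>.
    public static List<string> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: input is well-formed XHTML

        const string xpath =
            "//body//text()[not(ancestor::script) and not(ancestor::style)][normalize-space()]";

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            result.Add(node.Value.Trim());
        }
        return result;
    }
}
```

The [normalize-space()] predicate drops text nodes that contain only whitespace, which otherwise show up as empty lines in the output.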

Related

Selecting text from some elements inside a div and ignore other elements. HTML Agility Pack

I'm trying to build a web scraping tool for a news website. I'm having problems selecting the relevant text since the text is divided into multiple different elements. I'm using HTML Agility Pack and I have tried to select text ( //text() ) from the main div, but when I do this I get a lot of garbage text I don't want, like javascript code.
How can I select text from some nested elements and ignore other elements?
<div class="texto_container paywall">
Some text I want
<a href="https://www.sabado.pt/sabermais/ana-gomes" target="_blank" rel="noopener">
Text I want
</a>
sample of text I want
<em>
another text i want
</em>
<aside class="multimediaEmbed contentRight">
A lot of nested elements here with some text I dont want
</aside>
<div class="inContent">
A lot of nested elements here with some text I don't want
</div>
Back to the text I want!
<twitter-widget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" >
Don't want any of this text located in nested elements!
</twitter-widget>
<p>
Final revelant text i want to collect!
</p>
</div>
EDIT
I tried to use XPath to exclude the tags I don't want, but I still get text nodes from those tags in the result.
var parse_me = htmlDoc.DocumentNode.SelectNodes("//div[@class='texto_container paywall']//text()[not(parent::aside)][not(parent::div[@class='inContent'])][not(parent::twitter-widget)]");
I think this code doesn't work because the text I want to exclude is not a direct child of the tags I named; it sits inside several levels of nested tags, so its parent node is never the "main" tag.
EDIT
After some thinking and some research I fixed the previous problem by using ancestor:: instead of parent::, which got rid of some of the unwanted text.
But I still can't get rid of the twitter-widget text, because it always returns a null node even with the XPath copied from the Google Chrome inspect element tool.
var Twitter_Node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id='twitter - widget - 0']");
This gets returned as null. How is this possible? XPath was copied from Chrome.
You can try to exclude the text from specific tags:
//body//text()[not(parent::aside)][not(parent::div[@class="inContent"])][not(parent::twitter-widget)]
You could use concat, but it's more complicated since you have to know the number and the position of each tag in the "chain":
concat(//body//div[@class="texto_container paywall"]/text()[1],//body//a[@href]/text(),//body//div[@class="texto_container paywall"]/text()[2],//body//em/text(),//body//div[@class="texto_container paywall"]/text()[5],//body//p/text())
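Note that parent:: only tests a text node's immediate parent, so text nested more deeply inside aside, div.inContent or twitter-widget still gets through, as the question's second edit reports. Below is a minimal sketch of the ancestor:: variant. It assumes a well-formed copy of the sample markup and uses System.Xml.XmlDocument rather than HtmlAgilityPack so it can run on its own; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AncestorFilterSketch
{
    // Collects the wanted text nodes of the container div, excluding any
    // text that has an unwanted element anywhere among its ancestors.
    public static List<string> WantedText(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: input is well-formed XHTML

        const string xpath =
            "//div[@class='texto_container paywall']//text()" +
            "[not(ancestor::aside)]" +
            "[not(ancestor::div[@class='inContent'])]" +
            "[not(ancestor::twitter-widget)]" +
            "[normalize-space()]";

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            result.Add(node.Value.Trim());
        }
        return result;
    }
}
```

This only helps for elements that exist in the HTML as downloaded. If an element is created by JavaScript in the browser, it may not be present in the document that the parser sees at all.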
I am using the ScrapySharp NuGet package, which adds the CssSelect extension method used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
You can simply extract, one by one, all the texts you do NOT want, and then replace their occurrences in the main div text with an empty string, removing them from the final result.
var doc = new HtmlDocument();
doc.Load(@"C:\Desktop\z.html"); //I created an html with your sample HTML set as the html body
List<string> textsIWant = new List<string>();
var textsIdoNotWant = new List<string>();
//text I do not want
var aside = doc.DocumentNode.CssSelect(".multimediaEmbed.contentRight").FirstOrDefault();
if (aside != null)
{
textsIdoNotWant.Add(aside.InnerText);
}
var inContent = doc.DocumentNode.CssSelect(".inContent").FirstOrDefault();
if (inContent != null)
{
textsIdoNotWant.Add(inContent.InnerText);
}
var twitterWidget = doc.DocumentNode.CssSelect("#twitter-widget-0").FirstOrDefault();
if (twitterWidget != null)
{
textsIdoNotWant.Add(twitterWidget.InnerText);
}
var div = doc.DocumentNode.CssSelect(".texto_container.paywall").FirstOrDefault();
if (div != null)
{
var text = div.InnerText;
foreach (var textIDoNotWant in textsIdoNotWant)
{
text = text.Replace(textIDoNotWant, string.Empty);
}
textsIWant.Add(text);
}
foreach (var text in textsIWant)
Console.WriteLine(text);
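One caveat with the string-replace approach is that it could also delete wanted text that happens to match an unwanted fragment. An alternative, sketched below, is to remove the unwanted elements from the tree first and then read the remaining text. This sketch swaps in LINQ to XML on well-formed markup so that it is runnable without packages; the class and method names are invented for illustration. With HtmlAgilityPack, HtmlNode.Remove() and InnerText would play the corresponding roles.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RemoveThenReadSketch
{
    // Removes the unwanted elements from the container, then returns
    // the concatenated text of whatever is left.
    public static string WantedText(string xhtml)
    {
        // assumption: input is well-formed XHTML with the container as root
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);

        root.Descendants()
            .Where(e => e.Name.LocalName == "aside"
                     || e.Name.LocalName == "twitter-widget"
                     || (string)e.Attribute("class") == "inContent")
            .ToList()                  // snapshot before modifying the tree
            .ForEach(e => e.Remove());

        return root.Value;
    }
}
```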

Getting text from between two html nodes using HtmlAgilityPack

Suppose I have the following HTML
<p id="definition">
<span class="hw">emolument</span> \ih-MOL-yuh-muhnt\, <i>noun</i>:
The wages or perquisites arising from office, employment, or labor
</p>
I want to extract each part separately using HTMLAgilityPack in C#
I can get the word and word class easily enough
var definition = doc.DocumentNode.Descendants()
.Where(x => x.Name == "p" && x.Id == "definition")
.FirstOrDefault();
string word = definition.Descendants()
.Where(x => x.Name == "span")
.FirstOrDefault().InnerText;
string word_class = definition.Descendants()
.Where(x => x.Name == "i")
.FirstOrDefault().InnerText;
But how do I get the pronunciation or actual definition? These fall between nodes, and if I use definition.InnerText I get the whole lot in one string. Is there a way to do this in XPath perhaps?
How do I select text between nodes in HtmlAgilityPack?
Is there a way to do this in XPath perhaps?
Yes - and quite an easy one.
The key concept you need to understand is how text and child element nodes are organized in XML/HTML - and thus XPath.
If the textual content of an element is punctuated by child elements, they end up in separate text nodes. You can access individual text nodes by their position.
Simply using text() on any element retrieves all child text nodes. Applying //p/text() to the snippet you have shown yields (individual results separated by -------):
[EMPTY TEXT NODE, EXCEPT WHITESPACE]
-----------------------
\ih-MOL-yuh-muhnt\,
-----------------------
:
The wages or perquisites arising from office, employment, or labor
The first text node of this p element only contains whitespace, so that's probably not what you're after. //p/text()[2] retrieves
\ih-MOL-yuh-muhnt\,
and //p/text()[3]:
:
The wages or perquisites arising from office, employment, or labor
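Because the position of a text node shifts whenever a whitespace-only node is present or absent, it can be more robust to filter those out with [normalize-space()] and index into what remains. A minimal sketch follows, assuming well-formed markup and using System.Xml.XmlDocument so that it runs without HtmlAgilityPack; the class name and the tuple shape are made up for this example.

```csharp
using System;
using System.Xml;

public static class DefinitionPartsSketch
{
    // Splits the text between the child elements of <p id="definition">
    // into the pronunciation and the definition.
    public static (string Pronunciation, string Definition) Parse(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumption: input is well-formed XHTML

        // Only direct child text nodes that contain something besides whitespace.
        XmlNodeList parts =
            doc.SelectNodes("//p[@id='definition']/text()[normalize-space()]");

        string pronunciation = parts[0].Value.Trim();
        // The second node starts with the ":" that follows <i>noun</i>.
        string definition = parts[1].Value.Trim().TrimStart(':').Trim();

        return (pronunciation, definition);
    }
}
```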
HtmlNode text = doc.DocumentNode.Descendants().Where(x => x.Name == "p" && x.Id == "definition").FirstOrDefault();
foreach (HtmlNode node in text.SelectNodes(".//text()"))
{
Console.WriteLine(node.InnerText.Trim());
}
Output of this will be:
emolument
\ih-MOL-yuh-muhnt\,
noun
:
The wages or perquisites arising from office, employment, or labor
If you only want the second result, \ih-MOL-yuh-muhnt\, you need this:
HtmlNode a = text.SelectNodes(".//text()[2]").FirstOrDefault();

How to/Should I retrieve data from particularly formatted HTML without regex

I have a whole pile of HTML which is just a bunch of this:
<li id="entry-c7" data-user="ThisIsSomeonesUsername">
<img width="28" height="28" class="avatar" src="http://very_long_url.png">
<span class="time">6:07</span>
<span class="username">ThisIsSomeonesUsername</span>
<span class="message">This is my message. It is nice, no?</span>
</li>
Repeated over and over again about a hundred thousand times (with different content, of course). This is all taken from an HTMLDocument by retrieving the element which holds all this. The document is retrieved from a WebBrowser in a Windows Form. This looks like:
HtmlDocument document = webBrowser1.Document;
HtmlElement element = document.GetElementById(chatElementId);
Assume "chatElementId" is just some known ID. What I would like to do is retrieve the content in "time" (6:07 in this example), "username" (ThisIsSomeonesUsername), and "message" (This is my message... etc.). The message portion can contain almost anything, including further html (such as links, images, etc.), but I want to keep all that intact. I was going to use a regular expression to parse the InnerHtml of the element retrieved using the method above, but apparently this will bring about the destruction of the universe. How then should I go about doing this?
Edit: People keep suggesting Html Agility Pack, so is there an easy way to go about doing this in Html Agility Pack without using the full HTML source? I'm not sure if the rest of the html outside of this class is all that great... but should I just pass the whole html anyway?
Read the link in Nico's answer ... I was about to post the same one (it's hilarious).
Having said that, from your comments it seems like you're intent on regex. So, regex it away.
It shouldn't be hard to do.
Go to http://regexpal.com/, paste your data in the bottom part, play with the regex in the top part until you're happy with the result, and then just loop over your data and extract what you need to your heart's content.
(I'm not sure if I'd do it, but sometimes a quick fix is better than a long more "correct" answer).
Just an FYI: regex can't parse HTML in any usable fashion. See "RegEx match open tags except XHTML self-contained tags", for those who stumble across this post.
Now, for your requirement, have you tried using XmlDocument or XDocument?
Just try the following. Note that the img tag in your sample is missing its closing />; if that is the case in your real HTML, this won't work, as it is not valid XML.
//parse the xml
var xDoc = XDocument.Parse(html);
//create our list of results (basic tuple here, could be your class)
List<Tuple<string, string, string>> attributes = new List<Tuple<string, string, string>>();
//iterate all li elements
foreach (var element in xDoc.Root.Elements("li"))
{
//set the default values
string time = "",
username = "",
message = "";
//get the time, username message attributes
XElement tElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "time");
XElement uElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "username");
XElement mElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "message");
//set our values based on element results
if (tElem != null)
time = tElem.Value;
if (uElem != null)
username = uElem.Value;
if (mElem != null)
message = mElem.Value;
//add to our list
attributes.Add(new Tuple<string, string, string>(time, username, message));
}

Parse through current page

Is there a way to get a page to parse through itself?
So far I have:
string whatever = TwitterSpot.InnerHtml;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(whatever);
foreach("this is where I am stuck")
{
}
I want to parse the page so what I did is create a parent div named TwitterSpot. Put the InnerHtml into a string, and have loaded it as a new HtmlDocument.
Next I want to find within that a string value of "#XXXX+n " and replace it in the page in front with some cool formatting.
I am getting stuck on my foreach loop: I do not know how I should search for a # or how to look through the loaded HtmlDocument.
The next step is to apply the change wherever I have seen a # tag. I know I could probably do this a lot more easily in JavaScript, but I am adamant about seeing how I can get ASP.NET C# to do it.
The # is a string value within the html I am not referring to it as a Control ID.
Assuming you're using HtmlAgilityPack, you could use xpath to find text nodes which contain your value:
var matchedNodes = document.DocumentNode
.SelectNodes("//text()[contains(.,'#XXXX+n ')]");
Then you could just iterate through these nodes and make all the necessary replacements:
foreach (HtmlTextNode node in matchedNodes)
{
node.Text = node.Text.Replace("#XXXX+n ", "brand new text");
}
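The same find-and-replace over text nodes can be tried without any extra package, as long as the fragment is well-formed. The sketch below uses System.Xml.XmlDocument instead of HtmlAgilityPack, and its class, method and parameter names are hypothetical. It matches in C# rather than with XPath contains() to avoid quoting problems when the search string itself contains quotes.

```csharp
using System.Xml;

public static class ReplaceInTextNodesSketch
{
    // Replaces every occurrence of 'find' inside text nodes only,
    // leaving tag names and attributes untouched.
    public static string ReplaceText(string xhtml, string find, string replacement)
    {
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(xhtml); // assumption: input is well-formed XHTML

        foreach (XmlNode node in doc.SelectNodes("//text()"))
        {
            if (node.Value != null && node.Value.Contains(find))
            {
                node.Value = node.Value.Replace(find, replacement);
            }
        }
        return doc.OuterXml;
    }
}
```

With XmlDocument the replacement is stored as text, so markup such as <b> in the replacement string comes out escaped rather than as an element; adding real formatting would mean creating element nodes instead.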
You can use http://htmlagilitypack.codeplex.com/ to parse HTML and manipulate its content; works very well.
I guess you could use RegEx to find all matches and loop through them.
You could just change it to be:
string whatever = TwitterSpot.InnerHtml;
whatever = whatever.Replace("#XXXX+n ", String.Format("<b>{0}</b>", "#XXXX+n "));
No parsing required...
When I did this before, I stored the HTML in an XML doc and looped through each node. You can then apply XSLT or just parse the nodes.
It sounds like for your purposes though that you don't really need to do that. I'd recommend making the divs into server controls and programmatically looping through their child controls, as such:
foreach (Control o in divSomething.Controls)
{
    // "is" checks the runtime type and casts in one step
    if (o is TextBox textBox && textBox.ID == "txtSomething")
    {
        textBox.Attributes.Add("style", "font-family: Arial; color: Red;");
    }
}

XML problem - HTML within a node is being removed (ASP.NET C# LINQ to XML)

When I load this XML node, the HTML within the node is being completely stripped out.
This is the code I use to get the value within the node, which is text combined with HTML:
var stuff = innerXml.Descendants("root").Elements("details").FirstOrDefault().Value;
Inside the "details" node is text that looks like this:
"This is <strong>test copy</strong>. This is A Link"
When I look in "stuff" var I see this:
"This is test copy. This is A Link". There is no HTML in the output... it is pulled out.
Maybe Value should be innerXml or innerHtml? Does FirstOrDefault() have anything to do with this?
I don't think the xml needs a "cdata" block...
Here is a more complete code snippet:
announcements =
from link in xdoc.Descendants(textContainer).Elements(textElement)
where link.Parent.Attribute("id").Value == Announcement.NodeId
select new AnnouncmentXml
{
NodeId = link.Attribute("id").Value,
InnerXml = link.Value
};
XDocument innerXml;
innerXml = XDocument.Parse(item.InnerXml);
var abstractText = innerXml.Descendants("root").Elements("abstract").FirstOrDefault().Value; // "abstract" is a reserved word in C#
Finally, here is a snippet of the Xml Node. Notice how there is "InnerXml" within the standard xml structure. It starts with . I call this the "InnerXml" and this is what I am passing into the XDocument called InnerXml:
<text id="T_403080"><root> <title>How do I do stuff?</title> <details> Look Here Some Form. Please note that lorem ipsum dlor sit amet.</details> </root></text>
[UPDATE]
I tried to use this helper lambda, and it returns the HTML, but it is escaped, so when it displays on the page I see the actual HTML in the view (instead of rendering a link, the tag is printed to the screen):
Title = innerXml.Descendants("root").Elements("title").FirstOrDefault().Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString());
So I tried both HTMLEncode and HTMLDecode but neither helped. One showed the escaped chars on the screen and the other did nothing:
Title =
System.Web.HttpContext.Current.Server.HtmlDecode(
innerXml.Descendants("root").Elements("details").Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString())
);
I ended up using an XmlDocument instead of an XDocument. It doesn't seem like LINQ to XML is mature enough to support what I am trying to do. There is no InnerXml property on an XDocument, only Value.
Maybe someday I will be able to revert to LINQ. For now, I just had to get this off my plate. Here is my solution:
// XmlDoc to hold custom Xml within each node
XmlDocument innerXml = new XmlDocument();
try
{
// Parse inner xml of each item and create objects
foreach (var faq in faqs)
{
innerXml.LoadXml(faq.InnerXml);
FAQ oFaq = new FAQ();
#region Fields
// Get Title value if node exists and is not null
if (innerXml.SelectSingleNode("root/title") != null)
{
oFaq.Title = innerXml.SelectSingleNode("root/title").InnerXml;
}
// Get Details value if node exists and is not null
if (innerXml.SelectSingleNode("root/details") != null)
{
oFaq.Description = innerXml.SelectSingleNode("root/details").InnerXml;
}
#endregion
result.Add(oFaq);
}
}
catch (Exception ex)
{
// Handle Exception
}
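For readers who would rather stay with LINQ to XML: although XElement has no InnerXml property, one common workaround is to concatenate the serialized child nodes of the element. This is only a sketch, and the helper name is made up.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class InnerXmlSketch
{
    // Serializes the children of an element (text and nested tags),
    // which is roughly what XmlNode.InnerXml returns.
    public static string InnerXml(XElement element)
    {
        return string.Concat(
            element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Whether the result then appears escaped on the page depends on how the view outputs the string, which is separate from how it is read from the XML.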
I do think wrapping your details node in a CDATA block is the right decision. CDATA basically indicates that the information contained within it should be treated as text, and not parsed for XML special characters. The HTML characters in the details node, especially the < and >, are in direct conflict with the XML spec, and should really be marked as text.
You might be able to hack around this by grabbing the innerXml, but if you have control over the document content, cdata is the correct decision.
In case you need an example of how that should look, here's a modified version of the detail node:
<details>
<![CDATA[
This is <strong>test copy</strong>. This is A Link
]]>
</details>
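As a rough illustration of the CDATA route with LINQ to XML: XCData writes the section, and XElement.Value returns its content as-is when reading. The class and method names here are only for the example.

```csharp
using System.Xml.Linq;

public static class CDataSketch
{
    // Wraps raw HTML in a CDATA section inside a <details> element.
    public static string Wrap(string html)
    {
        return new XElement("details", new XCData(html))
            .ToString(SaveOptions.DisableFormatting);
    }

    // Reads the content back; Value returns the CDATA text unescaped.
    public static string Read(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

One limitation to keep in mind is that the content itself must not contain the sequence ]]>, since that ends a CDATA section.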
