Parse through current page - c#

Is there a way to get a page to parse through itself?
So far I have:
string whatever = TwitterSpot.InnerHtml;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(whatever);
foreach("this is where I am stuck")
{
}
I want to parse the page, so I created a parent div named TwitterSpot, put its InnerHtml into a string, and loaded that into a new HtmlDocument.
Next I want to find a string value such as "#XXXX+n " within it and replace it in the rendered page with some cool formatting.
I am stuck on the foreach loop: I do not know how to search for a # or how to walk through the loaded HtmlDocument.
The next step is to apply the change wherever I have seen a # tag. I could probably do this a lot more easily in JavaScript, but I am adamant about seeing how to do it in ASP.NET C#.
The # is a string value within the HTML; I am not referring to a control ID.

Assuming you're using HtmlAgilityPack, you could use XPath to find the text nodes that contain your value:
var matchedNodes = doc.DocumentNode
    .SelectNodes("//text()[contains(.,'#XXXX+n ')]");
Then you could just iterate through these nodes and make all the necessary replacements. Note that SelectNodes returns null when nothing matches:
if (matchedNodes != null)
{
    foreach (HtmlTextNode node in matchedNodes)
    {
        node.Text = node.Text.Replace("#XXXX+n ", "brand new text");
    }
}

You can use http://htmlagilitypack.codeplex.com/ to parse HTML and manipulate its content; works very well.

I guess you could use RegEx to find all matches and loop through them.
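A minimal sketch of that regex idea, using only the standard library. The class and method names are hypothetical, and the pattern `#\w+` is an assumption about what counts as a tag. It is applied to plain text here; run against raw markup it would also hit things like `href="#top"`, so treat it as an illustration rather than a drop-in fix.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashtagDemo
{
    // Wraps every "#word" token in <b>...</b>. The pattern is an assumption.
    public static string HighlightHashtags(string text)
    {
        return Regex.Replace(text, @"#\w+", m => "<b>" + m.Value + "</b>");
    }

    public static void Main()
    {
        // prints: Loving <b>#csharp</b> and <b>#aspnet</b> today
        Console.WriteLine(HighlightHashtags("Loving #csharp and #aspnet today"));
    }
}
```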

You could just change it to be:
string whatever = TwitterSpot.InnerHtml;
whatever = whatever.Replace("#XXXX+n ", String.Format("<b>{0}</b>", "#XXXX+n "));
No parsing required...

When I did this before, I stored the HTML in an XML doc and looped through each node. You can then apply XSLT or just parse the nodes.
It sounds like for your purposes though that you don't really need to do that. I'd recommend making the divs into server controls and programmatically looping through their child controls, as such:
foreach (Control c in divSomething.Controls)
{
    // Match by runtime type; comparing GetType to a string does not compile.
    if (c is TextBox && c.ID == "txtSomething")
    {
        ((TextBox)c).Attributes.Add("style", "font-family: Arial; color: Red;");
    }
}

Related

Selecting text from some elements inside a div and ignore other elements. HTML Agility Pack

I'm trying to build a web scraping tool for a news website. I'm having problems selecting the relevant text since the text is divided into multiple different elements. I'm using HTML Agility Pack and I have tried to select text ( //text() ) from the main div, but when I do this I get a lot of garbage text I don't want, like javascript code.
How can I select text from some nested elements and ignore other elements?
<div class="texto_container paywall">
Some text I want
<a href="https://www.sabado.pt/sabermais/ana-gomes" target="_blank" rel="noopener">
Text I want
</a>
sample of text I want
<em>
another text i want
</em>
<aside class="multimediaEmbed contentRight">
A lot of nested elements here with some text I dont want
</aside>
<div class="inContent">
A lot of nested elements here with some text I don't want
</div>
Back to the text I want!
<twitter-widget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" >
Don't want any of this text located in nested elements!
</twitter-widget>
<p>
Final revelant text i want to collect!
</p>
</div>
EDIT
I tried to use XPath to exclude the tags I don't want, but I still get text nodes from those tags in the result.
var parse_me = htmlDoc.DocumentNode.SelectNodes("//div[@class='texto_container paywall']//text()[not(parent::aside)][not(parent::div[@class='inContent'])][not(parent::twitter-widget)]");
I think this code doesn't work because the text nodes I want to exclude are not direct children of those tags; they sit inside many levels of nested tags.
EDIT
After some thinking and some research I fixed the previous problem by using ancestor:: instead of parent::, and I got rid of some of the unwanted text.
But I still can't get rid of the twitter-widget text, because the lookup always returns a null node, even with the XPath copied from the Google Chrome inspect element tool.
var Twitter_Node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id='twitter - widget - 0']");
This returns null. How is this possible? The XPath was copied from Chrome.
You can try to exclude the text from specific tags:
//body//text()[not(parent::aside)][not(parent::div[@class="inContent"])][not(parent::twitter-widget)]
You could use concat but it's more complicated since you have to know the number and the position of each tag in the "chain":
concat(//body//div[@class="texto_container paywall"]/text()[1],//body//a[@href]/text(),//body//div[@class="texto_container paywall"]/text()[2],//body//em/text(),//body//div[@class="texto_container paywall"]/text()[5],//body//p/text())
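When the unwanted text sits several levels below the excluded tags, the same exclusion idea can be tried with ancestor:: in place of parent::. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; `ArticleText` and `Extract` are hypothetical names, and the extra `script` filter is my own assumption about where the JavaScript noise comes from.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ArticleText
{
    // Returns the trimmed, non-empty text fragments of the article div,
    // skipping anything nested (at any depth) inside the excluded tags.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='texto_container paywall']//text()" +
            "[not(ancestor::aside)]" +
            "[not(ancestor::div[@class='inContent'])]" +
            "[not(ancestor::twitter-widget)]" +
            "[not(ancestor::script)]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0)
            .ToList();
    }
}
```

Calling `ArticleText.Extract(html)` on the sample markup should yield only the fragments outside aside, div.inContent and twitter-widget. If twitter-widget is injected by JavaScript in the browser, it will not be present in the downloaded HTML at all, which would also explain a null lookup.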
I am using the ScrapySharp NuGet package, which adds the CssSelect extension method used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
You can simply extract, one by one, all the texts you do NOT want, and then replace their occurrences in the main div text with an empty string, removing them from the final result.
// requires: using System.Linq; using HtmlAgilityPack; using ScrapySharp.Extensions;
var doc = new HtmlDocument();
doc.Load(@"C:\Desktop\z.html"); //I created an html with your sample HTML set as the html body
List<string> textsIWant = new List<string>();
var textsIdoNotWant = new List<string>();
//text I do not want
var aside = doc.DocumentNode.CssSelect(".multimediaEmbed.contentRight").FirstOrDefault();
if (aside != null)
{
textsIdoNotWant.Add(aside.InnerText);
}
var inContent = doc.DocumentNode.CssSelect(".inContent").FirstOrDefault();
if (inContent != null)
{
textsIdoNotWant.Add(inContent.InnerText);
}
var twitterWidget = doc.DocumentNode.CssSelect("#twitter-widget-0").FirstOrDefault();
if (twitterWidget != null)
{
textsIdoNotWant.Add(twitterWidget.InnerText);
}
var div = doc.DocumentNode.CssSelect(".texto_container.paywall").FirstOrDefault();
if (div != null)
{
var text = div.InnerText;
foreach (var textIDoNotWant in textsIdoNotWant)
{
text = text.Replace(textIDoNotWant, string.Empty);
}
textsIWant.Add(text);
}
foreach (var text in textsIWant)
Console.WriteLine(text);

Retrieve the last match or list with a regular expression and then work with it

My issue is that I download the HTML page content to a string with
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("http://prices.shufersal.co.il/");
and am trying to retrieve the last page number from the navigation menu
<a data-swhglnk=\"true\" href=\"/?page=2\">2</a>
so in the end I want to find the last data-swhglnk and retrieve the last page from it.
I tried
Regex.Match(webData, @"swhglnk", RegexOptions.RightToLeft);
I would be happy to understand the right approach to issues like this.
If you're about to parse HTML and find some information in it, you should use a method more reliable than regex, e.g.:
-HtmlAgilityPack https://htmlagilitypack.codeplex.com/
-csQuery https://github.com/jamietre/CsQuery
and operate on objects, not strings.
Update
If you decide to use HtmlAgilityPack, you will have to write code like this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webData);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[@data-swhglnk]"))
{
HtmlAttribute data = node.Attributes["data-swhglnk"];
//do your processing here
}
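To get from those nodes to the last page number, one option is to read each link's href and keep the largest page value. This is a rough sketch assuming the HtmlAgilityPack NuGet package and assuming the links carry a `page=N` query parameter as in the question; `Pager` and `LastPage` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class Pager
{
    // Returns the highest page number found in the pager links, or 0 if none.
    public static int LastPage(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when there are no matching links.
        var links = doc.DocumentNode.SelectNodes("//a[@data-swhglnk]");
        if (links == null)
        {
            return 0;
        }

        int last = 0;
        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");
            // Regex is used only on the small attribute value, not on the HTML.
            Match m = Regex.Match(href, @"\bpage=(\d+)");
            if (m.Success)
            {
                last = Math.Max(last, int.Parse(m.Groups[1].Value));
            }
        }
        return last;
    }
}
```

Taking the maximum, not the last link in document order, avoids depending on whether the menu ends with a "next" or "last" arrow.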

How to/Should I retrieve data from particularly formatted HTML without regex

I have a whole pile of HTML which is just a bunch of this:
<li id="entry-c7" data-user="ThisIsSomeonesUsername">
<img width="28" height="28" class="avatar" src="http://very_long_url.png">
<span class="time">6:07</span>
<span class="username">ThisIsSomeonesUsername</span>
<span class="message">This is my message. It is nice, no?</span>
</li>
Repeated over and over again about a hundred thousand times (with different content, of course). This is all taken from an HTMLDocument by retrieving the element which holds all this. The document is retrieved from a WebBrowser in a Windows Form. This looks like:
HtmlDocument document = webBrowser1.Document;
HtmlElement element = document.GetElementById(chatElementId);
Assume "chatElementId" is just some known ID. What I would like to do is retrieve the content in "time" (6:07 in this example), "username" (ThisIsSomeonesUsername), and "message" (This is my message... etc.). The message portion can contain almost anything, including further html (such as links, images, etc.), but I want to keep all that intact. I was going to use a regular expression to parse the InnerHtml of the element retrieved using the method above, but apparently this will bring about the destruction of the universe. How then should I go about doing this?
Edit: People keep suggesting Html Agility Pack, so is there an easy way to go about doing this in Html Agility Pack without using the full HTML source? I'm not sure if the rest of the html outside of this class is all that great... but should I just pass the whole html anyway?
Read the link in Nico's answer ... I was about to post the same one (it's hilarious).
Having said that, from your comments it seems like you're intent on regex. So, regex it away.
It shouldn't be hard to do.
Go to http://regexpal.com/, paste your data in the bottom part, play with the regex in the top part until you're happy with the result, and then just loop over your data and extract what you need to your heart's content.
(I'm not sure if I'd do it, but sometimes a quick fix is better than a long more "correct" answer).
Just an FYI: regex can't parse HTML in any usable fashion. See "RegEx match open tags except XHTML self-contained tags", just for those who stumble across this post.
Now, for your requirement, have you tried using XmlDocument or XDocument?
Just try the following (note that the img tag is missing the closing />; if that is the case in your HTML, this won't work, as it's not valid XML).
//parse the xml
var xDoc = XDocument.Parse(html);
//create our list of results (basic tuple here, could be your class)
List<Tuple<string, string, string>> attributes = new List<Tuple<string, string, string>>();
//iterate all li elements
foreach (var element in xDoc.Root.Elements("li"))
{
//set the default values
string time = "",
username = "",
message = "";
//get the time, username message attributes
XElement tElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "time");
XElement uElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "username");
XElement mElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "message");
//set our values based on element results
if (tElem != null)
time = tElem.Value;
if (uElem != null)
username = uElem.Value;
if (mElem != null)
message = mElem.Value;
//add to our list
attributes.Add(new Tuple<string, string, string>(time, username, message));
}
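On the question of using Html Agility Pack without the full page source: it can load a fragment, so the InnerHtml of the chat element should be enough, and it tolerates the unclosed img tag. A minimal sketch, assuming the HtmlAgilityPack NuGet package; `ChatParser` and `Parse` are hypothetical names. The message is read through InnerHtml so that nested links and images stay intact.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ChatParser
{
    // Returns (time, username, message) for every <li data-user="..."> entry.
    public static List<Tuple<string, string, string>> Parse(string fragmentHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragmentHtml); // a fragment is fine; no <html> wrapper needed

        var result = new List<Tuple<string, string, string>>();
        var items = doc.DocumentNode.SelectNodes("//li[@data-user]");
        if (items == null)
        {
            return result; // no entries found
        }

        foreach (HtmlNode li in items)
        {
            // ".//" keeps each lookup relative to the current <li>.
            HtmlNode time = li.SelectSingleNode(".//span[@class='time']");
            HtmlNode user = li.SelectSingleNode(".//span[@class='username']");
            HtmlNode message = li.SelectSingleNode(".//span[@class='message']");

            result.Add(Tuple.Create(
                time != null ? time.InnerText : "",
                user != null ? user.InnerText : "",
                message != null ? message.InnerHtml : "")); // keep nested HTML
        }
        return result;
    }
}
```

With the WebBrowser control from the question, the input could be `element.InnerHtml`.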

Using HTMLAgilityPack to get all the values of a select element

Here is what I have so far:
HtmlAgilityPack.HtmlDocument ht = new HtmlAgilityPack.HtmlDocument();
TextReader reader = File.OpenText(@"C:\Users\TheGateKeeper\Desktop\New folder\html.txt");
ht.Load(reader);
reader.Close();
HtmlNode select= ht.GetElementbyId("cats[]");
List<HtmlNode> options = new List<HtmlNode>();
foreach (HtmlNode option in select.ChildNodes)
{
if (option.Name == "option")
{
options.Add(option);
}
}
Now I have a list of all the "options" for the select element. What properties do I need to access to get the key and the text?
So if for example the html for one option would be:
<option class="level-1" value="1">Funky Town</option>
I want to get as output:
1 - Funky Town
Thanks
Edit: I just noticed something. When I got the child elements of the "Select" elements, it returned elements of type "option" and elements of type "#text".
Hmmm .. #text has the string I want, but select has the value.
I thought HTMLAgilityPack was an HTML parser? Why did it give me confusing values like this?
This is due to the default configuration for the html parser; it has configured the <option> as HtmlElementFlag.Empty (with the comment 'they sometimes contain, and sometimes they don't...'). The <form> tag has the same setup (CanOverlap + Empty) which causes them to appear as empty nodes in the dom, without any child nodes.
You need to remove that flag before parsing the document.
HtmlNode.ElementsFlags.Remove("option");
Notice that the ElementsFlags property is static and any changes will affect all further parsing.
Edit: you should probably be selecting the option nodes directly via XPath. I think this should work for that:
var options = select.SelectNodes("option");
That will get your options without the text nodes. The options should contain the string you want somewhere. Waiting for your HTML sample.
foreach (var option in options)
{
int value = int.Parse(option.Attributes["value"].Value);
string text = option.InnerText;
}
 
You can add some sanity checking on the attribute to make sure it exists.
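Putting those pieces together with the sanity check, a sketch might look like this. It assumes the HtmlAgilityPack NuGet package; `SelectReader` and `OptionLabels` are hypothetical names. Whether the "option" flag is still set by default depends on the library version, so the Remove call is kept as a harmless precaution.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SelectReader
{
    // Returns "value - text" for each <option> of the select with the given id.
    public static List<string> OptionLabels(string html, string selectId)
    {
        // Static setting: affects all later parsing. No effect if already absent.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var labels = new List<string>();
        // Assumes selectId contains no single quote (it is spliced into XPath).
        var options = doc.DocumentNode.SelectNodes(
            "//select[@id='" + selectId + "']/option");
        if (options == null)
        {
            return labels; // select missing or has no options
        }

        foreach (HtmlNode option in options)
        {
            // Sanity check: skip options that have no value attribute.
            string value = option.GetAttributeValue("value", null);
            if (value == null)
            {
                continue;
            }
            string text = HtmlEntity.DeEntitize(option.InnerText).Trim();
            labels.Add(value + " - " + text);
        }
        return labels;
    }
}
```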

XML problem - HTML within a node is being removed (ASP.NET C# LINQ to XML)

When I load this XML node, the HTML within the node is being completely stripped out.
This is the code I use to get the value within the node, which is text combined with HTML:
var stuff = innerXml.Descendants("root").Elements("details").FirstOrDefault().Value;
Inside the "details" node is text that looks like this:
"This is <strong>test copy</strong>. This is A Link"
When I look in "stuff" var I see this:
"This is test copy. This is A Link". There is no HTML in the output... it is pulled out.
Maybe Value should be innerXml or innerHtml? Does FirstOrDefault() have anything to do with this?
I don't think the xml needs a "cdata" block...
Here is a more complete code snippet:
announcements =
from link in xdoc.Descendants(textContainer).Elements(textElement)
where link.Parent.Attribute("id").Value == Announcement.NodeId
select new AnnouncmentXml
{
NodeId = link.Attribute("id").Value,
InnerXml = link.Value
};
XDocument innerXml;
innerXml = XDocument.Parse(item.InnerXml);
var abstract = innerXml.Descendants("root").Elements("abstract").FirstOrDefault().Value;
Finally, here is a snippet of the XML node. Notice how there is "InnerXml" within the standard XML structure. It starts with <root>. I call this the "InnerXml", and this is what I am passing into the XDocument called innerXml:
<text id="T_403080"><root> <title>How do I do stuff?</title> <details> Look Here Some Form. Please note that lorem ipsum dlor sit amet.</details> </root></text>
[UPDATE]
I tried to use this helper lambda, and it will return the HTML, but it is escaped, so when it displays on the page I see the actual HTML in the view (instead of rendering a link, the tag is printed to the screen):
Title = innerXml.Descendants("root").Elements("title").FirstOrDefault().Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString());
So I tried both HTMLEncode and HTMLDecode but neither helped. One showed the escaped chars on the screen and the other did nothing:
Title =
System.Web.HttpContext.Current.Server.HtmlDecode(
innerXml.Descendants("root").Elements("details").Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString())
);
I ended up using an XmlDocument instead of an XDocument. It doesn't seem like LINQ to XML is mature enough to support what I am trying to do. There is no InnerXml property on an XDocument, only Value.
Maybe someday I will be able to revert to LINQ. For now, I just had to get this off my plate. Here is my solution:
// XmlDoc to hold custom Xml within each node
XmlDocument innerXml = new XmlDocument();
try
{
// Parse inner xml of each item and create objects
foreach (var faq in faqs)
{
innerXml.LoadXml(faq.InnerXml);
FAQ oFaq = new FAQ();
#region Fields
// Get Title value if node exists and is not null
if (innerXml.SelectSingleNode("root/title") != null)
{
oFaq.Title = innerXml.SelectSingleNode("root/title").InnerXml;
}
// Get Details value if node exists and is not null
if (innerXml.SelectSingleNode("root/details") != null)
{
oFaq.Description = innerXml.SelectSingleNode("root/details").InnerXml;
}
#endregion
result.Add(oFaq);
}
}
catch (Exception ex)
{
// Handle Exception
}
I do think wrapping your details node in a CDATA block is the right decision. CDATA basically indicates that the information contained within it should be treated as text, and not parsed for XML special characters. The HTML characters in the details node, especially the < and >, are in direct conflict with the XML spec, and should really be marked as text.
You might be able to hack around this by grabbing the innerXml, but if you have control over the document content, cdata is the correct decision.
In case you need an example of how that should look, here's a modified version of the detail node:
<details>
<![CDATA[
This is <strong>test copy</strong>. This is A Link
]]>
</details>
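To see the difference in LINQ to XML, here is a small sketch using only System.Xml.Linq. The `InnerXml` helper is a hypothetical name, not a framework method. It shows that Value drops nested markup, that concatenating the child nodes keeps it, and that with a CDATA block Value returns the raw HTML text directly.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class InnerXmlDemo
{
    // Hypothetical helper: serializes an element's child nodes, keeping markup.
    public static string InnerXml(XElement element)
    {
        return string.Concat(
            element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        XElement plain = XElement.Parse(
            "<details>This is <strong>test copy</strong>.</details>");
        Console.WriteLine(plain.Value);     // prints: This is test copy.
        Console.WriteLine(InnerXml(plain)); // prints: This is <strong>test copy</strong>.

        XElement cdata = XElement.Parse(
            "<details><![CDATA[This is <strong>test copy</strong>.]]></details>");
        Console.WriteLine(cdata.Value);     // prints: This is <strong>test copy</strong>.
    }
}
```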
