Please check the code below. I am trying to grab a text value from this HTML document. I want the text Quick Kill 32 oz. Mosquito Yard Spray. I already tried SelectSingleNode as shown below, but it can't grab this text value. Any idea how to fix it?
string html = @"<div class='pod-plp__description js-podclick-analytics' data-podaction='product name'>
<a class='' data-pos='0' data-request-type='sr' data-pod-type='pr' href='/p/AMDRO-Quick-Kill-32-oz-Mosquito-Yard-Spray-100530440/304755303'>
<span class='pod-plp__brand-name'>AMDRO</span>
Quick Kill 32 oz. Mosquito Yard Spray
</a>
</div>";
var doc = new HtmlDocument();
doc.Load(html);
string title = doc.DocumentNode
.SelectSingleNode("//div[@class='pod-plp__description js-podclick-analytics']span[@class='pod-plp__brand-name']")
.InnerText;
You are targeting only span[@class='pod-plp__brand-name'], which returns only the text inside the span. To grab the text after the span you need following-sibling::text(). Please see my example code below. You can also learn more from the official html-agility-pack site.
var titleNode = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='pod-plp__brand-name']/following-sibling::text()[1]");
string title = titleNode.InnerText.Trim();
Found solution from here
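To see what the following-sibling::text()[1] step selects, here is a minimal sketch that runs the same XPath with the built-in XmlDocument instead of HtmlAgilityPack. It only works because the sample markup happens to be well-formed XML; real pages usually are not, which is why the answer above uses HtmlAgilityPack. The ProductTitle class and its Extract method are names made up for this illustration.

```csharp
using System;
using System.Xml;

static class ProductTitle
{
    // Hypothetical helper: returns the trimmed text that follows the brand-name span,
    // or null when the span (or a text node after it) is not found.
    public static string Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumes the markup is well-formed XML
        XmlNode node = doc.SelectSingleNode(
            "//span[@class='pod-plp__brand-name']/following-sibling::text()[1]");
        return node == null ? null : node.Value.Trim();
    }
}

static class Program
{
    static void Main()
    {
        string html = @"<div class='pod-plp__description js-podclick-analytics' data-podaction='product name'>
<a class='' data-pos='0' data-request-type='sr' data-pod-type='pr' href='/p/AMDRO-Quick-Kill-32-oz-Mosquito-Yard-Spray-100530440/304755303'>
<span class='pod-plp__brand-name'>AMDRO</span>
Quick Kill 32 oz. Mosquito Yard Spray
</a>
</div>";
        Console.WriteLine(ProductTitle.Extract(html)); // prints "Quick Kill 32 oz. Mosquito Yard Spray"
    }
}
```

The [1] matters: without it, following-sibling::text() can match several text nodes, and SelectSingleNode would simply return the first one in document order anyway, but being explicit keeps the intent clear.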
I'm new to C#, and I'm trying to access an element from a website using webBrowser. How can I get the "Developers" string from the site?
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried to use webBrowser1.Document.GetElementById("title"), but I have no idea how to continue from here.
Thanks :)
You can download the page source using the WebClient class,
then look within it for <b>Title:</b>**Developers**</div> and discard everything except "Developers".
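A rough sketch of that string-search idea, assuming the page source is already in a string. TitleExtractor and AfterLabel are hypothetical names, and the sample string stands in for what WebClient.DownloadString would return for the real page.

```csharp
using System;

static class TitleExtractor
{
    // Hypothetical helper: returns the text between "<b>label</b>" and the next "</div>",
    // or null when the label is not present in the HTML.
    public static string AfterLabel(string html, string label)
    {
        string marker = "<b>" + label + "</b>";
        int start = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += marker.Length;

        int end = html.IndexOf("</div>", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) end = html.Length; // no closing tag: take the rest of the string
        return html.Substring(start, end - start).Trim();
    }
}

static class Program
{
    static void Main()
    {
        // In the real case this string would come from something like
        // new System.Net.WebClient().DownloadString(url), where url is the page address.
        string html = "<div id=\"title\" style=\"display: block;\">\n<b>Title:</b> Developers\n</div>";
        Console.WriteLine(TitleExtractor.AfterLabel(html, "Title:")); // prints "Developers"
    }
}
```

This kind of search is brittle: any change in the markup around the label breaks it, which is why the other answers suggest a real parser.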
HtmlAgilityPack and CsQuery are what many people use to work with HTML pages in .NET, and I'd recommend them too.
But in case your task is limited to this simple requirement, and you have <div> markup that is valid XHTML (like the markup sample you posted), then you can treat it as XML. That means you can use a native .NET API such as XDocument or XmlDocument to parse the HTML and run an XPath query to get a specific part of it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
//or, according to your code snippet, you may be able to do the following:
//var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
//above prints " Developers"
The XPath above selects the text node (" Developers") next to the <b> node.
You can use HtmlAgilityPack (As mentioned by Giannis http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
HtmlNode in HtmlAgilityPack also has an InnerText property, allowing you to do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode(".//b").Remove(); // the leading dot keeps the search inside el; a bare //b searches the whole document
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the //b above is XPath syntax which may be interesting for you to learn:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html
I want to know how I can get data from a web page.
Example:
<li id="hello1">about me
<ul class="square">
<li><strong>name: john</strong></li>
</ul>
</li>
I want to read john, which comes after name:. How can I read it in C#?
I have tried to use HTML Agility Pack, but due to its poor documentation I was not able to use it, so I need help.
Use HtmlAgilityPack
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var nameElement = doc.DocumentNode.SelectSingleNode("//li[@id='hello1']").InnerText;
// nameElement holds the text of the whole <li>, including "about me" and "name: john"
var name = Regex.Match(nameElement, @"(?<=name:\s*)\w+").Value; // john
I have used HTML Agility Pack before and it is a great tool.
HtmlDocument document = new HtmlDocument();
document.LoadHtml(YourHTML);
var collection = document.DocumentNode.SelectNodes("//li[@id='hello1']");
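Once the <li> is selected, the remaining step is to drill down to the <strong> element and cut off the "name:" prefix. Here is a small sketch of that step. It uses the built-in XmlDocument rather than HtmlAgilityPack, which assumes the fragment is well-formed XML (the sample in the question is); with HtmlAgilityPack the same XPath could be passed to DocumentNode.SelectSingleNode. NameReader and ReadName are hypothetical names.

```csharp
using System;
using System.Xml;

static class NameReader
{
    // Hypothetical helper: finds the <strong> inside <li id="hello1"> and
    // returns whatever follows the first colon, e.g. "john" for "name: john".
    public static string ReadName(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // assumes well-formed markup
        XmlNode strong = doc.SelectSingleNode("//li[@id='hello1']//strong");
        if (strong == null) return null;

        string text = strong.InnerText;
        int colon = text.IndexOf(':');
        return colon < 0 ? text.Trim() : text.Substring(colon + 1).Trim();
    }
}

static class Program
{
    static void Main()
    {
        string html = @"<li id=""hello1"">about me
<ul class=""square"">
<li><strong>name: john</strong></li>
</ul>
</li>";
        Console.WriteLine(NameReader.ReadName(html)); // prints "john"
    }
}
```

Selecting the <strong> directly avoids having to pick the name out of the combined text of the outer <li>.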
I have a string that has HTML formatted content.
Now I want to convert that string to HTML. May I use HtmlElementCollection?
Is it possible? If yes, then how?
Kindly explain. Thanks!
Take a look at the HtmlAgilityPack. More information can be found on the answers of other similar questions.
A string will be handled as HTML when you push this string into an environment that will render the content as HTML.
When you push the content of the string to an environment that doesn't handle HTML, or you explicitly say that you don't want it rendered as HTML, it will be rendered as plain text.
Use HtmlAgilityPack as follows:
var st1 = stringdata; // your html formatted string
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(st1); // this is now html doc
If someone is still looking for this:
Just put the code below in the place where you want to show the HTML.
@((MarkupString)htmlString)
//htmlString => string that has HTML-formatted content
//tested with Blazor Server on the latest version of .NET (6.0 at the time of writing)
I am using the HTML Agility Pack to convert
<font size="1">This is a test</font>
to
This is a test
using this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string stripped = doc.DocumentNode.InnerText;
But I ran into an issue where I have this:
<font size="1">This is a test &amp; this is a joke</font>
and the code above converted this to
This is a test &amp; this is a joke
but I wanted it to convert it to:
This is a test & this is a joke
Does the HTML Agility Pack support what I am trying to do? Why doesn't the HTML Agility Pack do this by default, or am I doing something wrong?
You can run HttpUtility.HtmlDecode() on the output.
However, note that InnerText will include HTML tags that may be contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve all the text bit by bit.
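A minimal sketch of the decoding step. It uses System.Net.WebUtility.HtmlDecode, which does the same job as HttpUtility.HtmlDecode but does not need a reference to System.Web; note that WebUtility was only added in .NET 4, so on older frameworks HttpUtility is the one to use. The input string here stands in for the InnerText value from the question.

```csharp
using System;
using System.Net;

static class Program
{
    static void Main()
    {
        // Stand-in for doc.DocumentNode.InnerText, which still contains the entity.
        string stripped = "This is a test &amp; this is a joke";

        // HtmlDecode turns character entities such as &amp; back into plain characters.
        string decoded = WebUtility.HtmlDecode(stripped);

        Console.WriteLine(decoded); // prints "This is a test & this is a joke"
    }
}
```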
Good morning!
I am using C# (framework 3.5 SP1) and want to parse the following piece of HTML via regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
group 1: content of h1
group 2: content of h1-following text
group 3-n: content of subcaptions + text
What I have at the moment:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
This gives me only every odd subcaption and its content (e.g. 1, 3, ...), because the trailing <hr/> is consumed by each match. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content. I'm fine with that at the moment.
Does anybody have a hint or solution for me, or an alternative approach (e.g. parsing the HTML via a reader and assigning it that way)?
Edit:
As some brought up HtmlAgilityPack, I was curious about this nice tool. I managed to get the content of the <h1> tag.
But my problem is parsing the rest, because the tags for the content may vary, from <p> to <div> and <ul>.
At the moment this seems to mean more or less iterating over the whole document and parsing it tag by tag?
Any hints?
You will really need an HTML parser for this.
Don't use regex to parse HTML. Consider using the HTML Agility Pack.
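As a rough idea of what the parser-based approach could look like: walk the sibling elements once, start a new section at every heading, skip the <hr> separators, and collect the text of whatever comes in between, so it does not matter whether the content is a <p>, <div>, or <ul>. The sketch below uses the built-in XmlDocument and therefore assumes the fragment is well-formed XML (the sample in the question is); with HtmlAgilityPack the same loop could run over the child nodes of the parsed document. SectionSplitter and Split are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

static class SectionSplitter
{
    // Hypothetical helper: returns (caption, body text) pairs, one per <h1>/<h2>.
    public static List<KeyValuePair<string, string>> Split(string fragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + fragment + "</root>"); // wrap so there is a single root element

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new StringBuilder();

        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
        {
            if (node.NodeType != XmlNodeType.Element) continue;

            if (node.Name == "h1" || node.Name == "h2")
            {
                // A heading closes the previous section and opens a new one.
                if (caption != null)
                    sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
                caption = node.InnerText.Trim();
                body.Length = 0;
            }
            else if (node.Name != "hr" && caption != null)
            {
                // Any other element (p, div, ul, ...) contributes its text to the current section.
                body.AppendLine(node.InnerText.Trim());
            }
        }

        if (caption != null)
            sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
        return sections;
    }
}

static class Program
{
    static void Main()
    {
        string html = @"<h1>My caption</h1>
<p>Here will be some text</p>
<hr class=""cs"" />
<h2 id=""x"">CaptionX</h2>
<p>Some text</p>";
        foreach (var section in SectionSplitter.Split(html))
            Console.WriteLine(section.Key + " => " + section.Value);
    }
}
```

The first pair corresponds to groups 1 and 2 from the question (the h1 and its text), and each later pair to one subcaption with its content.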
There are some possibilities:
Regex - Fast but not reliable; it can't deal with malformed HTML.
HtmlAgilityPack - Good, but has many memory leaks. If you only need to deal with a few files, there is no problem.
SGMLReader - Really good, but there is a problem: sometimes it can't find the default namespace needed to get other nodes, and then it is impossible to parse the HTML.
http://developer.mindtouch.com/SgmlReader
Majestic-12 - Good, but not as fast as SGMLReader.
http://www.majestic12.co.uk/projects/html_parser.php
Example for SGMLReader (VB.NET):
Dim sgmlReader As New Sgml.SgmlReader()
Public htmldoc As New System.Xml.Linq.XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
htmldoc = XDocument.Load(sgmlReader)
Dim XNS As XNamespace
' This part can fail: sometimes the default namespace cannot be read, so fall back to the XHTML namespace
Try
XNS = htmldoc.Root.GetDefaultNamespace
Catch
XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim = "" Then
XNS = "http://www.w3.org/1999/xhtml"
End If
'use it with the linq commands
For Each link In htmldoc.Descendants(XNS + "script")
Scripts &= link.Value
Next
Majestic-12 is different: you have to walk to every tag with a "Next" command. You can find example code with the DLL.
As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found Fizzler, a CSS selector engine that works on top of the HtmlAgilityPack:
http://code.google.com/p/fizzler/
Using this you could find all <p> tags using:
var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();
Or find a specific div like <div id="myDiv"></div>:
var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");
It can't get any easier than that!