So far I have used HtmlAgilityPack for building HTML documents.
The problem is that it is not stable: I get StackOverflowExceptions, and it doesn't support jQuery syntax.
What I am trying to use to build Html documents is CsQuery.
My question is:
Is it designed for building HTML documents?
I like the functions it offers, but I cannot render the modified html document.
For example:
CQ fragment = CQ.CreateFragment("<p>some text</p>");
CQ html = CQ.CreateFromFile(@"index.html");
CQ modified_html = html.Select("#test").Append(fragment);
That is, I want to append the fragment variable to the element with id "test".
The problem is that I expect modified_html.Render() to return the modified version (including <p>some text</p> added to the #test element), but it doesn't.
Is there any way to achieve this?
Actually, it does. I checked with your code, and it does append <p>some text</p> to modified_html. The only possible issue I can think of is that there is no element with id = "test" in index.html. You may also want to save the modified HTML to a file so that the output is easier to examine:
modified_html.Save(@"index_modified.html");
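To rule out the missing-element case, here is a minimal sketch that runs the same append against an in-memory document instead of index.html. The class and method names (AppendDemo, AppendToTest) are made up for illustration, and it assumes the CsQuery package is referenced.

```csharp
using System;
using CsQuery;

public static class AppendDemo
{
    // Appends <p>some text</p> to the element with id "test" and
    // returns the rendered document.
    public static string AppendToTest(string sourceHtml)
    {
        CQ html = CQ.Create(sourceHtml);
        CQ fragment = CQ.CreateFragment("<p>some text</p>");

        CQ target = html.Select("#test");
        if (target.Length == 0)
        {
            // Nothing matched, so Append would silently do nothing.
            throw new InvalidOperationException("No element with id \"test\" found.");
        }

        // Append mutates the document that "html" wraps.
        target.Append(fragment);
        return html.Render();
    }
}
```

Calling AppendDemo.AppendToTest("<html><body><div id=\"test\"></div></body></html>") should return markup that contains the appended paragraph. If the exception is thrown for your real file, the selector is the problem rather than Render().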
I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to parse this so that each answer (p) belongs to the category (h2) above it.
If there is no h2 above it, the category is null.
Like this:
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HtmlAgilityPack. It will parse the HTML and allow you to use LINQ to select whatever you need from the DOM structure.
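As a rough sketch of that approach, the grouping from the question can be done by walking the nodes in document order and remembering the last h2 seen. The Answer and AnswerParser types are hypothetical names, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public class Answer
{
    public string Category;   // null when no <h2> precedes the answer
    public string Text;
}

public static class AnswerParser
{
    public static List<Answer> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Answer>();
        string currentCategory = null;

        // Descendants() yields nodes in document order, so each <p>
        // picks up the most recent <h2> that came before it.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.Name == "h2")
            {
                currentCategory = node.InnerText.Trim();
            }
            else if (node.Name == "p")
            {
                result.Add(new Answer
                {
                    Category = currentCategory,
                    Text = node.InnerText.Trim()
                });
            }
        }
        return result;
    }
}
```

For the sample markup in the question, this should give three items: Answer1 with a null category, then Answer2 and Answer3 under Category1.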
In addition to HtmlAgilityPack, I've also written a lightweight HTML parser for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.
I need to retrieve some info from an HTML document, since the web service that would return JSON or XML is not ready yet. I'm working with C# and using regular expressions to get the data I need from the HTML string. I've managed to get the div I want to work with out of the whole HTML string, but now I'm having trouble getting the content of the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag, but what I really want is the content of the first span tag.
Here's the regular expression I've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this, but it didn't work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use HtmlAgilityPack since my client does not want us to use any external library. I've also heard about using XmlReader, but I'm not sure the structure of the HTML will match that of well-formed XML.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
That said, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
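For reference, a minimal sketch of that regex in use, relying only on System.Text.RegularExpressions. The SpanExtractor class name, the RegexOptions.IgnoreCase flag and the final Trim() are additions for illustration, not part of the pattern above. The attempts in the question fail because (\r|\n|.)* is greedy and runs on to the last </span>, whereas [^<]* stops at the first tag.

```csharp
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Returns the text of the first <span class="type">, or null if there is none.
    public static string FirstTypeSpan(string html)
    {
        Match m = Regex.Match(
            html,
            "<span class=\"type\">(?<Content>[^<]*)</span>",
            RegexOptions.IgnoreCase);

        // [^<]* cannot cross a tag boundary, so the match ends at the
        // first closing </span> rather than the last one.
        return m.Success ? m.Groups["Content"].Value.Trim() : null;
    }
}
```

Passing the div from the question to SpanExtractor.FirstTypeSpan should yield the price from the first span and leave the second span (class "type2") alone, because the pattern requires the exact attribute value "type".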
Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting an external third-party library in your solution. One way around that is to fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
I have a C# application. I need to extract data from an HTML page and add it to my database. The HTML page contains some CSS code, and I am interested in all of the id attributes from the CSS. How can I pull the id info into my code? I tried something like this, but it doesn't seem to work:
var styles = document.DocumentNode.SelectNodes("//style");
foreach(HtmlNode node in styles)
{
var text = node.Attributes["id"];
}
I really appreciate any help!
More of a fishing rod than a fish, but that's all I got time to do ATM.
First, look at this tutorial: xpath on w3schools. I've done some work with XPath, and it was only after going through their tutorial that things started to make a bit of sense.
Then, please get this html agility test pack, it will let you quickly test your code against the page you're trying to parse.
From here, it should be a short way to get what you want.
Try this, accessing the Id property directly:
var styles = document.DocumentNode.SelectNodes("//*[@style]");
foreach(HtmlNode node in styles)
{
var text = node.Id;
}
Edit: expression changed to "//*[@style]", which returns only elements that have a style attribute.
Within a C# project I'm sending a WebRequest to a PHP website, which takes the values, uses a SELECT statement to query the DB, and returns an HTML page. Since only one value comes back from that query, I need to assign this value to a variable in my C# code.
The source of the body-tag of the returned HTML-page (and with the StreamReader in my c#) looks like this:
<table border='1'>
<tr><td>ValueINeed</td></tr>
</table>
How do I access the value inside this in order to assign it to a string in my c# code?
thank you.
If you are the author of the PHP code as well, I would suggest that you make another page that returns json or something instead, this way you would be able to avoid parsing HTML.
But if this really is what you are stuck with, I would suggest that you take a look at Html Agility Pack. There is another question here on Stack Overflow about how to use the Html Agility Pack.
If the result is always the same you could just split the string or use regex.
If not you may use a html parser: http://htmlagilitypack.codeplex.com/
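A minimal sketch of the parser route, assuming the HtmlAgilityPack package is referenced and that the value always sits in the first td of the response. CellReader and FirstCell are illustrative names, not part of the library.

```csharp
using HtmlAgilityPack;

public static class CellReader
{
    // Returns the text of the first <td> in the markup, or null if there is none.
    public static string FirstCell(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");
        if (cell == null)
        {
            return null;
        }

        // DeEntitize turns entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }
}
```

The string read from the StreamReader can be passed straight to CellReader.FirstCell, and the result assigned to a C# string.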
I'm developing a crawling engine. My program crawls websites through XPath with HtmlAgilityPack. I need to get the src attribute of some images directly. You can see my simple code below, which is not working correctly. Thanks in advance!
PS: Please ignore " char problem, XPath patterns are provided by database.
Agility.DocumentNode.SelectSingleNode("//img[@id="product_photo"]/@src");
And this is the line I need to crawl (the *...* part marks the text to extract):
<img id="product_photo" src="*/images/thumb/4400/10280/st.jpg*">
Some pages provide image in meta tags so .Attributes["src"] wont work.
UPDATE: You can see my query and result here
You can't get the value of "src" or any other attribute by using:
Agility.DocumentNode.SelectSingleNode(yourXpath);
Just by using:
string s=Agility.DocumentNode.SelectSingleNode(yourXpath).value;
That's because an XPath expression can't return the value of an attribute through the SelectSingleNode() function in HtmlAgilityPack. So you must use SelectSingleNode(yourXpath).value, or use a regex after parsing to get just the "src" value without the outer text.
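One possible workaround, sketched under the assumption that the stored patterns end in a plain "/@attribute" step: split that step off, select the element with the rest of the path, and read the attribute with GetAttributeValue. XPathValue and Evaluate are hypothetical names, and the split is deliberately naive (it would mishandle a "/@" that appears inside a predicate).

```csharp
using System;
using HtmlAgilityPack;

public static class XPathValue
{
    // Evaluates patterns such as //img[@id='product_photo']/@src or
    // //meta[@property='og:image']/@content against the markup.
    public static string Evaluate(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        int at = xpath.LastIndexOf("/@", StringComparison.Ordinal);
        if (at < 0)
        {
            // No attribute step: fall back to the element's text.
            HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
            return node == null ? null : node.InnerText;
        }

        string elementPath = xpath.Substring(0, at);
        string attribute = xpath.Substring(at + 2);

        HtmlNode element = doc.DocumentNode.SelectSingleNode(elementPath);
        return element == null
            ? null
            : element.GetAttributeValue(attribute, string.Empty);
    }
}
```

Because the attribute name comes from the pattern itself, the same call covers pages that expose the image through a meta tag, as long as the stored pattern names the right attribute.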