I need to parse an HTML template file (the goal is to inject an SVG element into the HTML file and then convert it to PDF via wkhtmltopdf).
I know about the HTML Agility Pack, but it seems incapable of parsing local files (attempts to use file:// URIs have caused it to throw exceptions).
So, can anyone recommend a C# HTML parser for local HTML files?
HTML Agility Pack works fine with local files; check out this example from the docs.
Alternatively, load the file's content into a string with something like File.ReadAllText, then pass it to HtmlDocument.LoadHtml(string html).
How about using the HtmlDocument.LoadHtml function of HTML Agility Pack?
You could use File.ReadAllText to read the text into memory and pass it to the LoadHtml function.
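Both suggestions above can be sketched roughly as follows. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (LocalHtmlLoader, LoadFromPath, LoadFromText) are made up for illustration. Note that Load is given an ordinary file-system path rather than a file:// URI, which may be why the URI attempts threw.

```csharp
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class, not part of the library.
public static class LocalHtmlLoader
{
    // Option 1: let the library read the file itself from a plain path.
    public static HtmlDocument LoadFromPath(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        return doc;
    }

    // Option 2: read the file into a string first, then parse the string.
    public static HtmlDocument LoadFromText(string path)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(File.ReadAllText(path));
        return doc;
    }
}
```

Either method returns a document whose DocumentNode can then be queried or modified, for example to find the element that should receive the SVG markup.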
Related
I am trying to get the HTML text from a website using HTML Agility Pack, but it hits an error during the load and only gives me part of the HTML text. Is there any way to get the whole text while avoiding errors with that DLL, or is there any other solution?
I have an XML file and an XSLT stylesheet that formats it as HTML. Is there any tool or renderer that would accept these two as input and output an RTF equivalent using C#?
I am not 100% sure what your input is: HTML or some kind of XML?
If you need to convert HTML to RTF my suggestion would be to use Word - it can be controlled from C# easily and it can open HTML and save RTF. This will work fine on a client, it becomes tricky on a server though.
If your input is simple, standard HTML (e.g. it uses only a restricted set of tags in a regular way, without CSS), you can convert it to simple RTF directly with an XSLT stylesheet.
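For the XSLT route, running a stylesheet from C# could look something like the sketch below. It uses only System.Xml.Xsl from the framework. The stylesheet that actually emits RTF is assumed to exist already and is not shown; XsltRunner and Transform are hypothetical names. The input must be well-formed XML (e.g. XHTML), otherwise the transform will throw.

```csharp
using System.IO;
using System.Xml.Xsl;

// Hypothetical helper class for illustration only.
public static class XsltRunner
{
    // Applies the stylesheet at xsltPath to the XML file at xmlPath
    // and returns the result as a string.
    public static string Transform(string xmlPath, string xsltPath)
    {
        var xslt = new XslCompiledTransform();
        xslt.Load(xsltPath);

        using (var writer = new StringWriter())
        {
            // No stylesheet parameters are passed, hence the null argument list.
            xslt.Transform(xmlPath, null, writer);
            return writer.ToString();
        }
    }
}
```

A stylesheet producing RTF would normally declare xsl:output method="text" so that the RTF control words are written out verbatim; the returned string can then be saved with File.WriteAllText.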
Is there any C# library or free tool that can convert an HTML file with many referenced resources into a single "all-in-one" HTML file?
The main goal is to end up with only one file, which means I need to include:
JavaScript external files - this will probably mean replacing all 'script' tags
with a 'src' attribute by 'script' tags with content read from the referenced file.
Images - replace src="picture.png" with a data URI - something like src="..."
CSS files
Maybe I forgot something :)
This HTML file must be readable in all browsers, that's why I cannot use MHT file format (unreadable on Safari, iPad...)
You can use HTML Agility Pack to read and write the HTML document. HTML Agility Pack supports XPath, so you can get a list of the nodes you want to modify.
Using this, changing the attribute value of image tags should be easy. You can also get a list of external js references, read them and then update the script tag accordingly.
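Here is a rough sketch of what that could look like for images and scripts, assuming the HtmlAgilityPack NuGet package and that every src is a relative path to a local file next to the HTML file. HtmlInliner, Inline and MimeType are hypothetical names, the MIME table is deliberately tiny, and CSS files are not handled.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class for illustration only.
public static class HtmlInliner
{
    public static string Inline(string htmlPath)
    {
        string baseDir = Path.GetDirectoryName(Path.GetFullPath(htmlPath));
        var doc = new HtmlDocument();
        doc.Load(htmlPath);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
        {
            foreach (HtmlNode img in images)
            {
                string file = ResolveLocal(baseDir, img.GetAttributeValue("src", ""));
                if (file == null) continue;
                string base64 = Convert.ToBase64String(File.ReadAllBytes(file));
                img.SetAttributeValue("src", "data:" + MimeType(file) + ";base64," + base64);
            }
        }

        var scripts = doc.DocumentNode.SelectNodes("//script[@src]");
        if (scripts != null)
        {
            foreach (HtmlNode script in scripts)
            {
                string file = ResolveLocal(baseDir, script.GetAttributeValue("src", ""));
                if (file == null) continue;
                script.Attributes.Remove("src");
                script.RemoveAllChildren();
                // Add the code as a text node so it is not re-parsed as HTML.
                script.AppendChild(doc.CreateTextNode(File.ReadAllText(file)));
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    // Returns the full path of a local resource, or null for remote,
    // already-inlined, or missing resources (those are left untouched).
    private static string ResolveLocal(string baseDir, string src)
    {
        if (src.Length == 0 || src.Contains("://") || src.StartsWith("data:")) return null;
        string file = Path.Combine(baseDir, src);
        return File.Exists(file) ? file : null;
    }

    private static string MimeType(string file)
    {
        switch (Path.GetExtension(file).ToLowerInvariant())
        {
            case ".png": return "image/png";
            case ".gif": return "image/gif";
            case ".jpg":
            case ".jpeg": return "image/jpeg";
            default: return "application/octet-stream";
        }
    }
}
```

One caveat: a script file that itself contains the text "</script>" would end the inlined element early, so that case needs extra handling.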
We are moving an e-commerce website to a new platform and because all of their pages are static html and they do not have all their product information in a database, we must scrape their current website for the product descriptions.
Here is one of the pages: http://www.cabinplace.com/accrugsbathblackbear.htm
What is the best way to get the description into a string? Should I use HTML Agility Pack? If so, how would this be done? I am new to HTML Agility Pack and XHTML in general.
Thanks
The HTML Agility Pack is a good library to use for this kind of work.
You did not indicate if all of the content is structured this way nor if you have already gotten the kind of fragment you posted from the HTML files, so it is difficult to advise further.
In general, if all pages are structured similarly, I would use an XPath expression to extract the paragraph and pick the InnerHtml or InnerText from each page.
Something like the following:
var description = htmlDoc.DocumentNode.SelectNodes("//p[@class='content_txt']")[0].InnerText;
Also,
If you need a good tool for testing or finding the Xpath for the HAP you can use this one:
HTML-Agility-xpath-finder. It is built on the same library, so an XPath expression that works in this tool should work in your code as well.
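A slightly more defensive version of the XPath lookup might look like this. It is only a sketch, assuming the HtmlAgilityPack NuGet package and that the description always sits in a p element with class content_txt, as in the one-liner above; DescriptionScraper and GetDescription are hypothetical names.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class for illustration only.
public static class DescriptionScraper
{
    // Returns the text of the first <p class="content_txt">,
    // or null when the page has no such paragraph.
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        HtmlNode paragraph = doc.DocumentNode.SelectSingleNode("//p[@class='content_txt']");
        if (paragraph == null) return null;

        // InnerText keeps entities such as &amp; so decode them before storing.
        return HtmlEntity.DeEntitize(paragraph.InnerText).Trim();
    }
}
```

The html argument can come from File.ReadAllText for saved pages, or from whatever download code you already have.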
I am using HttpWebRequest to put a remote web page into a string and I want to make a list of all its script tags (and their contents) for parsing.
What is the best method to do this?
The best method is to use an HTML parser such as the HTML Agility Pack.
From the site:
It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Sample applications:
Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
Web scanners. You can easily get to img/src or a/hrefs with a bunch XPATH queries.
Web scrapers. You can easily scrap any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.
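For the script-tag question specifically, the HTML Agility Pack approach could be sketched as below, assuming the HtmlAgilityPack NuGet package and that the page has already been downloaded into a string. ScriptExtractor and GetScripts are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class for illustration only.
public static class ScriptExtractor
{
    // Returns every <script> element in the page, tag and contents included.
    public static List<string> GetScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null when the page has no script elements.
        var nodes = doc.DocumentNode.SelectNodes("//script");
        if (nodes == null) return result;

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.OuterHtml);
        }
        return result;
    }
}
```

Use node.InnerHtml instead of node.OuterHtml if only the code between the tags is wanted.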
Use an XML parser to get all the script tags with their content.
Like this one: simple xml