HTML parser to validate tags - C#

I need an HTML parser that can identify errors in generated HTML and, where tags are not closed, close them and return valid HTML.
More detail: I pull data from a database and truncate each record to show a partial preview on my website; clicking the "more" button then shows the complete content. After truncating the string, I need to validate (and repair) it.
I have already tried Html Agility Pack, but I am new to it. If this library can solve my problem, please point me to a tutorial showing how; otherwise, please suggest another library.
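For reference, Html Agility Pack (which the question already mentions) will close unclosed tags when it rewrites a document, and it exposes a list of parse errors. Below is a minimal sketch, assuming its OptionFixNestedTags/OptionAutoCloseOnEnd options and ParseErrors collection behave as documented; the input fragment is only an illustration. Whether the repaired output is "valid enough" for your purposes is a separate question, which the answers below take up.
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.OptionFixNestedTags = true;    // try to repair badly nested tags
doc.OptionAutoCloseOnEnd = true;   // close any tags still open at the end of the document
doc.LoadHtml("<div><p>truncated preview text");    // illustrative broken fragment

foreach (var error in doc.ParseErrors)             // what the parser considered wrong
    Console.WriteLine($"{error.Line}:{error.LinePosition} {error.Reason}");

string repaired = doc.DocumentNode.OuterHtml;      // the missing closing tags should now be present
Console.WriteLine(repaired);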

I don't think such a library exists. The problem is that some libraries can indeed identify errors in your HTML, but they can't fix them for you.
I think using the W3C validator as a service is the best starting point here. There's an open-source library that uses the W3C validator's API to validate a document and tells you whether it is valid, along with the errors and warnings. I would start with this and go on from there.
W3C Markup Validator library in C#

Here are a couple of validation programs from the World Wide Web Consortium, the W3C:
Windows: http://validator.w3.org/docs/install_win.html
UNIX / Linux: http://validator.w3.org/docs/install.html
You can also use their web services to validate your CSS, HTML, XML, XHTML, JavaScript and many other web technologies. The W3C is one of the bodies that keeps the web highly interoperable and internet devices reasonably compatible with one another.
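As a sketch of the web-service route, the W3C Nu HTML Checker at validator.w3.org/nu/ accepts a document via HTTP POST and can return its findings as JSON. The ?out=json parameter and the response shape described in the comment are assumptions based on its public documentation, so check the current API before relying on them:
using System;
using System.Net.Http;
using System.Text;

string html = "<!DOCTYPE html><html><head><title>t</title></head><body><p>unclosed";

using var client = new HttpClient();
client.DefaultRequestHeaders.UserAgent.ParseAdd("MyHtmlChecker/1.0");   // identify the client; the public checker may block anonymous user agents

var content = new StringContent(html, Encoding.UTF8, "text/html");
var response = await client.PostAsync("https://validator.w3.org/nu/?out=json", content);

// The JSON body should contain a "messages" array; each message carries "type" ("error", "info") and "message" fields
Console.WriteLine(await response.Content.ReadAsStringAsync());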

Related

What is the best way to filter bad HTML Content from Posts using AntiXSS Library?

I want to create an ASP.NET website and I want to prevent cross-site scripting. I have a page with Summernote (a WYSIWYG HTML editor) which, when submitted, posts HTML code to an MVC ActionResult via a form or Ajax POST.
This method saves the code in my database as the content/body of a message. On another page you can display the content, which shows formatting such as lists etc.
For security reasons I want to filter the content I receive from the client. I am using the AntiXSS library from Microsoft.
Part of my MVC code:
[ValidateInput(false), HttpPost, ValidateAntiForgeryToken]
public ActionResult CreateMessage(string subject, string body)
{
    var cleanBody = Sanitizer.GetSafeHtmlFragment(body);
    // do the database thing here, then return an appropriate result
    return new EmptyResult();
}
The major problem is that it mangles my HTML <img> elements, because it empties the src attribute.
should be:
<p><img src="data:image/png;base64,some/ultra/long/picture/code/here" data-filename="grafik.png"></p>
remaining:
<p><img src="" alt=""><img src=""></p>
What can I do to prevent this?
Is there a way to add an exception rule?
Is there another, better way?
How does it work?
Thanks for the help!
There is no such thing any more as the "AntiXSS Library". It used to be a separate library, but Microsoft moved it into .NET, so it's now under System.Web.Security.AntiXss.
The reason this is important is that what you need is a sanitizer. The way you are using AntiXss currently, it takes a list of HTML tags and a list of attributes for those tags and removes everything else from your HTML. That's not very good for you, because what you really want to remove is JavaScript, regardless of tags or attributes. Take <a> with its href attribute, for example. You most probably want to allow your users to insert links, but you don't want them to be able to insert JavaScript via <a href="javascript: ...">. So you cannot filter out href on <a>, but if you leave it in, your page will be vulnerable to XSS.
So you want a sanitizer that only removes JavaScript. The original AntiXSS library contained a sanitizer, but when Microsoft moved it into .NET, the sanitizer was left out.
So in short, AntiXss will not help you with your current use case.
You can find proper HTML sanitizers, for example Google Caja (client-side sanitizer here), among many others. The point is that even if the sanitizer runs in JavaScript on the client, as long as you are careful not to insert your data into the page DOM before sanitizing it, it will all be fine.
So in short, you could save any data from the HTML editor to your database as is, without any transformation (mind SQL injection of course, but current data-access technologies should have that covered). Then, when the data is displayed, send it to the client without adding it to the page DOM (for example as JSON data, properly encoded for JSON of course), run your sanitizer to remove any JavaScript, and only then add it to the page.
The reason this works well is that your WYSIWYG HTML editor will likely have a preview screen. Don't forget to sanitize previews as well, otherwise the preview itself will be vulnerable to XSS. If sanitization happened on the server, you would have to send the editor contents to the server, sanitize them and send them back to the user for preview, which is not very user-friendly.
Also note that many WYSIWYG editors support hooking into their rendering so that you can plug in such a sanitizer. An editor that supports neither that nor its own sanitizer cannot be made secure with regard to XSS.
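If you do prefer (or also need) to sanitize on the server, a commonly used library is the HtmlSanitizer NuGet package rather than AntiXss. The following is only a sketch of how it might be configured for the Summernote case above; allowing the data: scheme and the data-filename attribute is an assumption about what the editor needs, and it widens what you accept, so weigh that risk yourself:
using Ganss.XSS;   // HtmlSanitizer NuGet package (newer versions use the Ganss.Xss namespace)

var sanitizer = new HtmlSanitizer();
sanitizer.AllowedSchemes.Add("data");             // keep src="data:image/png;base64,..." URIs
sanitizer.AllowedAttributes.Add("data-filename"); // keep the editor's data-filename attribute

string cleanBody = sanitizer.Sanitize(body);      // body: the posted HTML from the action above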

Loading Javascript with C# Console Application

I am currently using HtmlAgilityPack for some web scraping; however, I've encountered a website that builds its content with script tags, and I am unable to load it for scraping. I have little web experience and am unsure how to properly load the webpage and convert it back into something HtmlAgilityPack can parse.
Pretty much, when I inspect the element in Chrome there is a table, but HtmlAgilityPack sees only a script tag.
Any help would be appreciated.
Thank you
I have had similar problems too. It is very annoying that there is not one unified way of doing this for all websites from a C# console application.
However, depending on the site you are looking at, there may be some useful information in meta tags in the head section of the HTML. When I was writing an application to get a YouTube subscription count, I found the count in a meta tag (I assume the information is there for the scripts to use). It may be similar for the web page you are scraping.
To do this I first saved the parsed page to disk:
doc.Save(outputPath); // outputPath: wherever the HTML file needs to go
Then I opened the HTML document in Google Chrome, opened the dev tools and searched for "Subscriptions" (replace this with whatever you are looking for). Hopefully, depending on the website you are scraping, there will be a meta tag with some useful info in it for you.
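Once you know which meta tag carries the value, reading it back with HtmlAgilityPack is short. A sketch, reusing the outputPath from above; the itemprop name is only an example of the kind of attribute you might find, not something guaranteed to exist on your page:
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(outputPath);   // the file saved above, or doc.LoadHtml(pageSource) for an in-memory string

// itemprop='interactionCount' is only an example of the attribute that might hold the value you searched for
var meta = doc.DocumentNode.SelectSingleNode("//meta[@itemprop='interactionCount']");
if (meta != null)
    Console.WriteLine(meta.GetAttributeValue("content", string.Empty));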
Good Luck! :)

Locating HTML Tags

I am trying to automate the testing of web forms. To that end, I need to know how to use C# to dynamically locate input tags within an HTML page and then assign values to them. I don't want to use XPath, because each time I will be dealing with a different web form. I want to pass the web form's URL to Selenium and then automatically populate the fields. I've heard of HtmlAgilityPack. Would that help me? If so, how can I use it?
I appreciate your help.
I may have missed a crucial part of your question, however, have you looked at Selenium WebDriver?
If you write a test that handles a generic web form, you can back it with dynamic data, and so cater for changes in the page by using data-driven tests. I've written tests for many pages; there are always common actions, but I still cater for each page individually, as each page has different things on it!
[EDIT]
Following on from your comments, I think looking into Selenium would be a good idea. The way to handle different pages is to keep the element definitions ready in a 'definitions' class for each page; that way, once you know which page you are on, you just use the correct definitions class. It is best to know which elements you are going to interact with before the tests run: the point of automated UI testing is for a known set of actions to be performed and a correct result achieved.
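To make that concrete, here is a rough sketch of driving a generic form with Selenium WebDriver in C#. The testData dictionary and the URL are hypothetical stand-ins for your data-driven source, and it assumes every field you care about is an <input> with a name attribute (and that a matching chromedriver is available):
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var testData = new Dictionary<string, string>   // hypothetical data source
{
    ["firstName"] = "Ada",
    ["email"] = "ada@example.com"
};

using var driver = new ChromeDriver();
driver.Navigate().GoToUrl("https://example.com/form");   // the form URL you pass in

// Find every <input> on the page and fill it if we have data for its name
foreach (IWebElement input in driver.FindElements(By.TagName("input")))
{
    string name = input.GetAttribute("name");
    if (name != null && testData.TryGetValue(name, out var value))
        input.SendKeys(value);
}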
I would suggest you look up some tutorials such as this, and you can also see my blog, though I wrote that when I was initially learning WatiN and later replaced it with Selenium (I like it better :P).
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
HtmlDocument doc = new HtmlDocument();
doc.Load(path);   // or doc.LoadHtml(htmlString) for an in-memory string
foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input"))
{
    // Your code, e.g. input.GetAttributeValue("name", "")
}

Validate VML in HTML emails using C#

I'm looking for a C# library or Web Service that will parse an HTML email that contains VML and report any syntax errors.
Does anyone know if such a beast exists?
Thanks
I'm not aware of a library or web service - although that could well just be down to a lack of knowledge on my part - but you could probably write your own validator reasonably easily. There are published schemas for the various versions of Word/Office HTML email (e.g. http://www.microsoft.com/downloads/details.aspx?FamilyId=0B764C08-0F86-431E-8BD5-EF0E9CE26A3A&displaylang=en, http://www.microsoft.com/download/en/details.aspx?id=101), so you should be able to use them to validate your content.
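If the email body is well-formed XML (or can be coerced into XHTML), the standard .NET schema-validation machinery will report syntax errors for you. A sketch under those assumptions; the vml.xsd and email.xhtml file names are placeholders for whichever schema and input you actually have, and urn:schemas-microsoft-com:vml is the usual VML namespace:
using System;
using System.Xml;
using System.Xml.Schema;

var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
settings.Schemas.Add("urn:schemas-microsoft-com:vml", "vml.xsd");   // placeholder schema file
settings.ValidationEventHandler += (sender, e) =>
    Console.WriteLine($"{e.Severity} at line {e.Exception?.LineNumber}: {e.Message}");

using var reader = XmlReader.Create("email.xhtml", settings);       // placeholder input file
while (reader.Read()) { }   // errors and warnings surface through the handler above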

HTML to DOM Library

I am looking for a C# library that would translate HTML code (and the CSS specified in it) into a DOM tree for simpler parsing. I am looking for something similar to this one (which is in PHP):
http://simplehtmldom.sourceforge.net/
Of course I know I could embed a browser control, but I am looking for something more efficient.
Check out the HTML Agility Pack. It hasn't been updated in a while, but it still works very well.
I second Mr. Dorman on the HtmlAgilityPack. I did a brief blog post on web scraping some time ago; it mentions the 'pack, but mostly discusses other details. Depending on your application, it might be of some use.
We have used HtmlAgilityPack in our project here to extract specific HTML tags with a given set of attributes using XPath, and it has never failed us.
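For anyone wanting a feel for that usage, here is a small sketch of the XPath-plus-attribute pattern; the tag, class name and input string are made up for illustration:
using System;
using HtmlAgilityPack;

string html = "<a class=\"external\" href=\"https://example.com\">example</a>";  // illustrative input

var doc = new HtmlDocument();
doc.LoadHtml(html);

// All <a> elements carrying class="external" (made-up selector for illustration)
var links = doc.DocumentNode.SelectNodes("//a[@class='external']");
if (links != null)   // SelectNodes returns null when nothing matches
    foreach (var link in links)
        Console.WriteLine(link.GetAttributeValue("href", string.Empty));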
There is no way to get a DOM with computed styles like that. The only option is the Selenium framework, which works with a real browser.
