HtmlAgilityPack Issue in reading html


I am reading websites in C# and getting their contents as a string. Some of the sites do not have well-formed HTML.
I am using HtmlAgilityPack, which gives me trouble in that case.
Can you suggest what I should use so that it can read the whole string and I can still get useful information out of it?
Here is my code:
htmlDoc.LoadHtml(s);
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
Why is this if condition true in my case?

What is the error you're getting? Is it throwing an exception or are you just wanting to see the error? Hard to tell what your actual question is.
You can see the markup errors in the HTML by iterating through the htmlDoc.ParseErrors property. Each entry gives you the line number, the error code and the reason for the error.
You can see more info about this property here
https://stackoverflow.com/a/5367455/235644
Edit
Ok so you've updated your question since my reply. You can see the specific errors that make your if statement true by looping through .ParseErrors as described above.
Second Edit
You can loop through the errors like so:
foreach (var error in htmlDoc.ParseErrors)
{
    Debug.WriteLine(error.Line);
    Debug.WriteLine(error.Reason);
}
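A slightly fuller sketch of the same idea, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (ParseErrorDemo, DescribeErrors) are made up for illustration. It formats each reported error into one readable line, so you can log exactly what makes the condition true:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ParseErrorDemo
{
    // Returns one readable line per parse error reported by HtmlAgilityPack.
    // Code, Line, LinePosition and Reason are properties of HtmlParseError.
    public static string[] DescribeErrors(HtmlDocument doc)
    {
        return (doc.ParseErrors ?? Enumerable.Empty<HtmlParseError>())
            .Select(e => $"{e.Code} at line {e.Line}, column {e.LinePosition}: {e.Reason}")
            .ToArray();
    }
}
```

A non-empty ParseErrors collection is only a report; the document is still loaded, so you can usually keep querying htmlDoc.DocumentNode after logging the errors.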

You have to fix the bug in your HTML, and after it is valid you can go on.
Here is the same problem:
Invalid HTML in AgilityPack

If your html is external and you can't fix it, you can first run it through a cleanup preprocessor, then parse it with HtmlAgilityPack.
This will attempt to fix as many issues as possible automatically before HtmlAgilityPack gets to see it. The most popular HTML cleanup tool is Tidy. See the .NET version here:
http://sourceforge.net/projects/tidynet/

Related

Trying to get word count using HtmlAgilityPack, but node list is returning as null

I have the following code that grabs the text nodes under certain descendants of specific tags/classes. It was working before, but I haven't run this program in a couple of months (nobody else has touched it), so I'm wondering why it's throwing an error now. My nodesList looks like this:
var nodesList = doc.DocumentNode
    .SelectNodes("//article[@class='article-content']//div[@class='article-content-block']//text()[not(parent::script)]")
    .Select(node => node.InnerText).ToList();
I look at the web page, and there are multiple paragraph and ul tags that fit that particular XPath query, but nodesList is returning:
System.ArgumentNullException: 'Value cannot be null. (Parameter 'source')'
The DocumentNode has the name #document, which I would expect is normal, and its InnerHtml shows the entirety of the page's HTML. However, its InnerText shows "Javascript must be enabled for the correct page display". Any ideas as to why it would be throwing on null? I don't recall seeing that message in the DocumentNode's InnerText before, so I'm wondering if it has something to do with it.

It sounds like the webpage content is being loaded dynamically. That's not a problem for your browser, because it executes Javascript automatically, but the .NET web components don't do any of that. You should be able to use your browser's dev tools to determine which request actually contains the content you're looking for, and then replicate that request in your code.
It could also be that something else about your request isn't playing nice with the server - missing/bad HTTP headers, unexpected TLS version, maybe even firewall stuff - causing it to return a different response.
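As for the exception itself: SelectNodes returns null rather than an empty collection when the XPath matches nothing, and calling .Select on that null is what raises ArgumentNullException. A minimal sketch of a guard, assuming the HtmlAgilityPack package is referenced (TextNodeExtractor and GetTexts are hypothetical names):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TextNodeExtractor
{
    // Returns the trimmed text of every node matched by the XPath,
    // or an empty list when the XPath matches nothing.
    public static List<string> GetTexts(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return new List<string>(); // no match: SelectNodes gave us null

        return nodes
            .Select(node => node.InnerText.Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }
}
```

An empty result then tells you the response did not contain the expected markup, which fits the "Javascript must be enabled" placeholder page.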

How do you assert a text in your page body in webdriver c#?

I need to assert that the page body contains the text "error". How would I do that? I got to this point and I am getting an error on .getText
driver.FindElement(By.TagName("body")).getText().Equals("Error")

You don't really want to do it this way... but since you asked, here is the answer.
driver.FindElement(By.TagName("body")).Text.Contains("Error");
You were using .getText() which is Java. You were also using .Equals() but what you meant was .Contains() because the entirety of the body tag is likely not to be only "Error".
What I would recommend is for you to narrow down to the element that actually contains the "Error" text and then use the line above (with the proper locator), e.g.
driver.FindElement(By.Id("the ID of the element that contains Error")).Text.Contains("Error");
You'd have to provide the surrounding HTML for the element for us to help you any more specifically.

ajax auto complete extender results as strings

I have run into an interesting issue...
I am using ASP and an auto-complete extender on a textbox.
I got everything working, but I was getting very odd results.
When searching for something like 315122-111, the only result that would come up was 315011.
This is because the item number is being treated like a number instead of a string:
315122-111=315011
I am sending everything as a string. When I use Fiddler to view the traffic, all the auto-complete responses come in as 315122-111; they are just not being properly...
Any ideas on how to fix this dilemma?

Is there an eval() being called on the javascript side?
http://www.w3schools.com/jsref/jsref_eval.asp
There's a bug that was fixed in this post:
http://forums.asp.net/t/1164200.aspx?New+AutoCompleteExtender+doesn+t+work+with+Numeric+Values
They stated:
1. Download the Ajax Control Toolkit source.
2. Open the file AjaxControlToolkit\AutoComplete\AutoCompleteBehavior.js at line 748.
3. You should see: "if (String.isInstanceOfType(pair)) {"
That line is what is causing the problem. To fix it, I changed that line to "if (String.isInstanceOfType(pair) || Int.isInstanceOfType(pair)) {".

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error related to a bit of 'XML'
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there any way I can load this as XML without throwing an error?
There is no XML declaration either, nor a namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. I was wondering if there were any other options.

As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream from, dump it into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
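using System.IO;
using System.Xml;
// Read the feed as plain text so it can be patched before parsing.
var txt = File.ReadAllText(@"\path\to\wannabe.xml");
// Rename the invalid attribute "0" to a legal XML name.
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);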
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.

Just prefix the numeric attribute name with '_'.
Example: replace "0=" with "_0=".
I hope that will fix the problem, thanks.

It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.

The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.
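A minimal sketch of such a regex-based cleanup, assuming the only defect is attribute names that start with a digit. AttributeNameFixer and its methods are hypothetical names, and the pattern is an illustration rather than a general XML repair: it can also rewrite text inside attribute values that happens to look like `<space>digits=`.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AttributeNameFixer
{
    // Matches a name that starts with a digit, is preceded by whitespace,
    // and is followed by optional whitespace and '='.
    private static readonly Regex DigitLedName =
        new Regex(@"(?<=\s)(\d[\w.\-]*)(?=\s*=)", RegexOptions.Compiled);

    // Prefixes each digit-led attribute name with '_' so it becomes a legal XML name.
    public static string Fix(string almostXml)
    {
        return DigitLedName.Replace(almostXml, "_$1");
    }

    // Applies the fix and parses the result with LINQ to XML.
    public static XElement Load(string almostXml)
    {
        return XElement.Parse(Fix(almostXml));
    }
}
```

Unlike a plain Replace("0=", ...), this only touches names that begin right after whitespace, so a name such as x10 is left alone. After AttributeNameFixer.Load(text), the former 0 attribute is available as Attribute("_0").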

C# How can I get from CompilerError instance exact text which caused the error

Could you please tell me how I can get, from a CompilerError instance, the exact text which caused the error?
Edited:
What about using
compilerError.FileName
and reading the file with a text reader? I am trying to do so, but it seems that the compiler doesn't create a .cs file when the code doesn't pass compilation. Any suggestions?

This CompilerError? : http://msdn.microsoft.com/en-us/library/system.codedom.compiler.compilererror.aspx
There are FileName and Line properties, that's the best you can get.
What are you compiling - is it entirely in-memory (CodeDOM)?
If so you can add code-line pragmas to your object model: http://msdn.microsoft.com/en-us/library/system.codedom.codelinepragma.aspx then you'll be able to link an error to a DOM element.
Or, you can compile from source, then you'll have the source code itself and can get the text from the line number.
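If you do compile from a source string, a small helper is enough to map a line number back to the text. This is a sketch under two assumptions: the source is still in memory, and the Line value reported by CompilerError is 1-based. SourceLineLookup and GetLine are hypothetical names.

```csharp
using System;

public static class SourceLineLookup
{
    // Returns the text of the given 1-based line,
    // or null when the line number is out of range.
    public static string GetLine(string source, int oneBasedLine)
    {
        string[] lines = source.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        if (oneBasedLine < 1 || oneBasedLine > lines.Length)
            return null;
        return lines[oneBasedLine - 1];
    }
}
```

Usage would look like SourceLineLookup.GetLine(sourceCode, compilerError.Line), where sourceCode is the string you passed to the compiler.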

The best you're going to get is the line number.
