C# Parsing hidden fields with the HTML Agility Pack

I need to write an application for a friend's site which parses hidden fields. I've downloaded the Html Agility Pack library, but I'm kinda confused because there aren't really any examples. The HTML field looks like this:
<input type = "hidden" autocomplete="off" value="randomvalue" name="foo">
How would I go about getting the value from this field?

From memory, something like:
var value = docroot.SelectSingleNode("//input[@type='hidden' and @name='foo']")
    .Attributes["value"].Value;
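A slightly fuller sketch of the same idea, in case it helps. It assumes the page source is already in a string; the class and method names (HiddenFieldReader, GetHiddenValue) are made up for illustration, and fieldName is assumed not to contain quote characters. The main addition is the null check, since SelectSingleNode returns null when nothing matches.

```csharp
using HtmlAgilityPack;

public static class HiddenFieldReader
{
    // Returns the value attribute of the hidden input with the given name,
    // or null when no such input exists.
    // Assumption: fieldName contains no quote characters.
    public static string GetHiddenValue(string html, string fieldName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath addresses attributes with '@'.
        var node = doc.DocumentNode.SelectSingleNode(
            "//input[@type='hidden' and @name='" + fieldName + "']");

        // SelectSingleNode returns null when there is no match.
        if (node == null)
        {
            return null;
        }

        // Falls back to an empty string if the value attribute is missing.
        return node.GetAttributeValue("value", string.Empty);
    }
}
```

Calling HiddenFieldReader.GetHiddenValue(pageSource, "foo") on the markup from the question should give back the contents of the value attribute.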

Related

C# CsQuery as Html Documents Builder

So far I used HtmlAgilityPack for building Html documents.
The problem is that it is not stable, I get Stackoverflow Exceptions and it doesn't support jQuery syntax.
What I am trying to use to build Html documents is CsQuery.
My question is:
Is it designated for building Html documents?
I like the functions it offers, but I cannot render the modified html document.
For example:
CQ fragment= CQ.CreateFragment("<p>some text</p>");
CQ html = CQ.CreateFromFile(@"index.html");
CQ modified_html= html.Select("#test").Append(fragment);
That is, I want to append the fragment variable to the element with id "test".
The problem is that I expect modified_html.Render() to return the modified version (including <p>some text</p> added to the #test element), but it doesn't.
Is there any way to achieve this?

Actually it does. I also checked with your code, and it does append <p>some text</p> to modified_html. The only possible issue I can think of: there is no element with id = "test" in index.html. You may also want to save the modified HTML to a file so it is easier to examine the output:
modified_html.Save(@"index_modified.html");
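One way to rule out the missing-element case is to check the selection length before appending. This is only a sketch: it works on an HTML string rather than index.html, and the class and method names are invented for the example.

```csharp
using CsQuery;

public static class CsQueryAppendSketch
{
    // Appends a paragraph to the element with id "test" and returns the
    // rendered document, or null when no such element exists.
    public static string AppendToTest(string html)
    {
        CQ dom = CQ.Create(html);
        CQ target = dom["#test"];

        // An empty selection means Append has nothing to attach to,
        // so the rendered output would be unchanged.
        if (target.Length == 0)
        {
            return null;
        }

        target.Append("<p>some text</p>");

        // Append modifies the underlying document in place,
        // so rendering the original object shows the change.
        return dom.Render();
    }
}
```

If this returns null for your index.html, the selector is not matching anything, which would explain why the rendered output looks unmodified.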

HTML Agility Pack (C#) malforms my code

I'm currently coding a desktop application in c# which also has to handle XHTML document manipulation. For that purpose I'm using the Html Agility Pack which seemed to be okay so far. After carefully checking the output from Html Agility Pack I found out that the code isn't well formed xhtml any more.
It removes self-closing tags (slash) and overwrites other proprietary code elements...
eg. input html code:
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)" />
eg. output html code
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)">
(removed the trailing slash...)
Another example is with proprietary code elements (for Mikrotik hotspot devices):
eg input html code
<form action="$(link-login-only)" method="post" name="login" $(if chap-id) onSubmit="return doLogin()"$(endif)>
The $(if chap-id), $(endif) and $(link-login-only) parts are custom code fragments interpreted from the Mikrotik device.
eg. output html code after Html Agility Pack (which transforms it to unusable code)
<form action="$(link-login-only)" method="post" name="login" $(if="" chap-id)="" onsubmit="return doLogin()" $(endif)="">
Does anyone have an idea how to "instruct" Html Agility Pack to output well formed XHTML and to ignore "custom code" fragments (is this possible via Regex)?
Thanks in advance! :-)

In your first example, HTML Agility Pack is actually fixing your markup. The input element is a void element. Since it has no content, it needs no closing tag.
HTML Agility Pack is made for parsing valid HTML markup, not markup embedded with custom code. In your first example, the custom markup is inside quotes therefore isn't an issue. In your second example, the variables are outside quotes.
HTML Agility Pack tries to parse them as regular (but malformed) attributes of the element. There's no way to fix that. You'll have to find another way to parse your markup if you need support for custom code inside the markup.

Necromancing.
Problem 1 is because you probably didn't specify OptionOutputAsXml = true, meaning HtmlAgilityPack outputs HTML instead of XHTML.
Actually, doing this is rather clever, as it reduces the file size.
If you need XHTML, you need to specifically instruct HtmlAgilityPack to output XHTML (XML), not HTML (SGML).
SGML allows for tags with no closing tag (/>), while XML does not.
To fix this:
public static void BeautifyHtml()
{
    // Deliberately malformed input.
    string input = "<html><body><p>This is some test test<br ><ul><li>item 1<li>item2<</ul></body>";
    HtmlAgilityPack.HtmlDocument test = new HtmlAgilityPack.HtmlDocument();

    // Set the parsing options before loading so they apply to this input.
    test.OptionCheckSyntax = true;
    test.OptionFixNestedTags = true;
    // Emit XML (XHTML) instead of HTML when saving.
    test.OptionOutputAsXml = true;
    test.LoadHtml(input);

    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    using (System.IO.TextWriter stringWriter = new System.IO.StringWriter(sb))
    {
        test.Save(stringWriter);
    }
    string beautified = sb.ToString();
    System.Console.WriteLine(beautified);
}

An alternative is CsQuery which, at least for the simple cases you've got here, will leave your pre-processor tags alone by nature of just treating them like valueless attributes. That is, HAP appears to convert any attribute someattribute without a value to someattribute="". CsQuery won't do this.
However the observations @Justin Niessner makes about your markup are going to be true for any parser that is not specifically designed to parse the templating code you have in there. Just because this one example makes it through CsQuery is no guarantee some other format won't result in something that's not a valid attribute name, or if not valid, at least acceptable to an HTML5 parser.
If you need to manipulate something as HTML, then do it after templating. If you need to manipulate it before the templating engine has at it, then you're in a catch 22, since it's not HTML yet. Or alternatively you could use a templating system that uses valid HTML markup for its keywords (example: Knockout).

C# access return value of PHP within an HTML-page

Within a c# project I'm sending a WebRequest to a php website, which takes the values and uses a select statement to query the DB and return an HTML page. Since there is only one value that comes back from that query, I need to assign this value to my c# code.
The source of the body-tag of the returned HTML-page (and with the StreamReader in my c#) looks like this:
<table border='1'>
<tr><td>ValueINeed</td></tr>
</table>
How do I access the value inside this in order to assign it to a string in my c# code?
thank you.

If you are the author of the PHP code as well, I would suggest that you make another page that returns json or something instead, this way you would be able to avoid parsing HTML.
But if this really is what you are stuck with, I would suggest that you take a look at Html Agility Pack. There is another question here on StackOverflow about how to use the Html Agility Pack.
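For the table in the question, a rough sketch with Html Agility Pack could look like the following. It assumes the response body has already been read into a string (for example from the StreamReader mentioned above) and that the first td in the table holds the value; the class and method names are hypothetical.

```csharp
using HtmlAgilityPack;

public static class TableCellReader
{
    // Returns the trimmed text of the first table cell, or null if none is found.
    public static string ReadFirstCell(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First <td> anywhere below a <table>.
        var cell = doc.DocumentNode.SelectSingleNode("//table//td");

        return cell == null ? null : cell.InnerText.Trim();
    }
}
```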

If the result is always the same you could just split the string or use regex.
If not you may use a html parser: http://htmlagilitypack.codeplex.com/
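If the page really is always that one small table, the regex route might be sketched like this. It is an illustration under that assumption only (a single, non-nested td), and the names are made up; anything more varied is better left to a parser.

```csharp
using System.Text.RegularExpressions;

public static class TableCellRegex
{
    // Returns the text between the first <td> and </td>, or null if absent.
    // Assumption: the cell contains no nested <td> elements.
    public static string ReadFirstCell(string html)
    {
        Match m = Regex.Match(
            html,
            @"<td[^>]*>(.*?)</td>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```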

String manipulation, how to extract an HTML element value easily?

There are many times I need to extract value of an element from a HTML page. Something like this:
<!-- many html here -->
<input type="hidden" name="id" value="ExtractMe!">
<!-- many html here -->
How can I extract the value easily?

Have a look at the HTMLAgility pack, it makes this type of task very easy and regex-free.

If you need to parse HTML within your C# application consider using HTMLAgilityPack from here http://htmlagilitypack.codeplex.com/
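As a rough illustration of what that looks like for the input above, here is a sketch that uses LINQ over the parsed nodes instead of XPath. The class and method names are invented for the example, and the HTML is assumed to be in a string already.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class InputValueReader
{
    // Returns the value attribute of the first <input> whose name matches,
    // or null when there is no such element.
    public static string GetInputValue(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var input = doc.DocumentNode
            .Descendants("input")
            .FirstOrDefault(n => n.GetAttributeValue("name", string.Empty) == name);

        return input == null ? null : input.GetAttributeValue("value", string.Empty);
    }
}
```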

If you just want to pluck values you're probably best to parse this as XML.
You have a choice of standard XML or LINQ.
Have a look here or here for some examples.
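A minimal LINQ to XML sketch of this approach follows. The big assumption is that the markup is well-formed XML (every tag closed, a single root element, no XML namespace); the unclosed input tag from the question would make XElement.Parse throw. The names are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlInputReader
{
    // Returns the value attribute of the first <input> whose name matches,
    // or null when there is none.
    // Throws System.Xml.XmlException if the markup is not well-formed XML.
    public static string GetInputValue(string xhtml, string name)
    {
        XElement root = XElement.Parse(xhtml);

        XElement input = root
            .DescendantsAndSelf("input")
            .FirstOrDefault(e => (string)e.Attribute("name") == name);

        // Casting a missing XAttribute to string yields null rather than throwing.
        return input == null ? null : (string)input.Attribute("value");
    }
}
```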

Why don't you use regular expressions? See the MSDN Regular Expression documentation; in there you can look for the section Extracting a Single Match or the First Match.
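For completeness, a single-match regex for the input in the question might be sketched as below. It assumes double-quoted attributes and that name appears before value inside the tag, which is exactly the kind of assumption that makes regex brittle for HTML; the names are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class RegexInputReader
{
    // Returns the value attribute of the first <input> with the given name,
    // or null when the pattern does not match.
    // Assumptions: double-quoted attributes, name listed before value.
    public static string GetInputValue(string html, string name)
    {
        string pattern =
            "<input\\b[^>]*\\bname=\"" + Regex.Escape(name) + "\"[^>]*\\bvalue=\"([^\"]*)\"";

        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

        return m.Success ? m.Groups[1].Value : null;
    }
}
```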

Regex to get the tags

I have a html like this :
<h1> Headhing </h>
<font name="arial">some text</font></br>
some other text
In C#, I want to get the output below: simply the font start tag, its content, and the end tag.
<font name="arial">some text</font>

First off, your HTML is wrong. You should close an <h1> with a </h1>, not </h>. This one thing is why regex is inappropriate for parsing tags.
Second, there are hundreds of questions on SO talking about parsing html with regex. The answer is don't. Use something like the html agility pack.

I wouldn't recommend to try it with regex.
I use the HTML Agility Pack to parse HTML and get what I want.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the code you find in the wild.
There's also an HTML parser from Microsoft, MSHTML, but I haven't tried it.
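A sketch of how that might look for the font element in the question, assuming the HTML is in a string. The class and method names are invented; OuterHtml is used because the goal is the element including its own tags.

```csharp
using HtmlAgilityPack;

public static class FontTagReader
{
    // Returns the markup of the first <font name="arial"> element,
    // including its start and end tags, or null if there is none.
    public static string GetFontElement(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var font = doc.DocumentNode.SelectSingleNode("//font[@name='arial']");

        return font == null ? null : font.OuterHtml;
    }
}
```

The stray </h> in the sample should not prevent the font element from being found, since the parser tolerates unmatched end tags.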

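// Requires: using System.Text.RegularExpressions;
// html holds the markup to search; .*? keeps each match to a single font element.
Regex regExfont = new Regex(@"<font name=""arial""[^>]*>.*?</font>");
MatchCollection rows = regExfont.Matches(html);
A good website for testing patterns is http://www.regexlib.com/RETester.aspx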
