Parsing innertext of html - c#

This is the part of the HTML that I am parsing:
<li>http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421</li>
I want to get http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421 as the output.
So far I have tried:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(responseFromServer);
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
    if(link.innerText.Contains("ticket"))
    {
        Console.WriteLine(link.InnerText);
    }
}
... but output is null (no inner texts are found).

That's probably because the first link in your HTML document, as returned by SelectSingleNode(), doesn't contain the text "ticket". You can check for the target text directly in the XPath, like so:
var link = document.DocumentNode.SelectSingleNode("//a[contains(.,'ticket')]");
if (link != null)
{
    Console.WriteLine(link.InnerText);
}
Or, in LINQ style if you prefer:
var link = document.DocumentNode
    .SelectNodes("//a")
    .OfType<HtmlNode>()
    .FirstOrDefault(o => o.InnerText.Contains("ticket"));
if (link != null)
{
    Console.WriteLine(link.InnerText);
}

The code you provided won't compile because innerText is not defined (C# is case-sensitive, and the property is InnerText). If you try this code, you'll probably get what you asked for:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
    if(link.InnerText.Contains("ticket"))
    {
        Console.WriteLine(link.InnerText);
    }
}

You can use HTML Agility Pack instead of HTML Document; it lets you do deeper parsing of the HTML. For more information, see the following link:
How to use HTML Agility pack

Related

How to scrape a variable data from a source code?

I'm trying to scrape a link from the source code of a website that varies with every source code.
For example:
<div align="center">
<a href="http://www10.site.com/d/the rest of the link">
<span class="button_upload green">
The next time I get the source code the http://www10 changes to any http://www + number like http://www65.
How can I scrape the exact link with the new changed number?
Edit:
Here's how I use the regex:
MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);

You mentioned in the comments that you use regular expressions to parse the HTML document. That is the hardest way to do this (and generally not recommended!). Try using an HTML parser like http://html-agility-pack.net
For HTML Agility Pack: you install it via NuGet, and here is an example (posted on their website):
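HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is your own method that returns the corrected URL.
    att.Value = FixLink(att);
}
doc.Save("file.htm");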
It can also load string contents, not just files. You use XPath or CSS selectors to navigate inside the document and select what you want.
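Applied to the markup in the question, the parser approach might look like the sketch below. It is only an illustration under a few assumptions: the sample html string is made up (modelled on the question's snippet, with the anchor closed), the host is assumed to always be www + digits + .site.com, and the regular expression is used only to test the already-extracted href value, not to parse the HTML.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Made-up sample modelled on the question; the number after "www" varies.
        string html = "<div align=\"center\"><a href=\"http://www65.site.com/d/abc\">"
                    + "<span class=\"button_upload green\"></span></a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        // Accept any numbered host: www10, www65, and so on (assumed pattern).
        var hostPattern = new Regex(@"^http://www\d+\.site\.com/");

        string href = anchors
            .Select(a => a.GetAttributeValue("href", ""))
            .FirstOrDefault(h => hostPattern.IsMatch(h));

        Console.WriteLine(href ?? "No matching link.");
    }
}
```

Because the number is matched with \d+ rather than hard-coded as www10, the same code keeps working whichever server number the page returns.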
How about a JS function like this, run when the page loads:
// jQuery is required!
var updateLinkUrl = function (num) {
    $.each($('.button_upload.green'), function (pos, el) {
        var orig = $(el).parent().prop("href");
        var newurl = orig.replace("www10", "www" + num);
        $(el).parent().prop("href", newurl);
    });
};
$(document).ready(function () { updateLinkUrl(65); });

Delete single tag from HTML document using HtmlAgilityPack

I have an HTML document that may contain unopened or unclosed tags. I'm using HtmlAgilityPack to find the errors, but once I do, I'd like to remove the broken tag from the document without affecting any of the other content. What's the best way to do this?
Here's my code so far:
HtmlDocument articleDoc = new HtmlDocument();
articleDoc.LoadHtml(article.ArticleBody);
if (articleDoc.ParseErrors != null && articleDoc.ParseErrors.Count() > 0) {
    foreach (HtmlParseError error in articleDoc.ParseErrors) {
        HtmlParseErrorCode eCode = error.Code;
        if (eCode == HtmlParseErrorCode.TagNotOpened || eCode == HtmlParseErrorCode.TagNotClosed) {
            //Delete tag here
        }
    }
}
Thanks in advance for any help!

Loop through HTML with tags from string

I'm porting a PHP script to C# for performance reasons.
This is the PHP source I'm having trouble with:
$dom = new DOMDocument;
$dom->loadHTML($message);
foreach ($dom->getElementsByTagName('a') as $node) {
    if ($node->hasAttribute('href')) {
        $link = $node->getAttribute('href');
        if ((strpos($link, 'http://') === 0) || (strpos($link, 'https://') === 0)) {
            $add_key = ((strpos($link, '{key}') !== false) || (strpos($link, '%7Bkey%7D') !== false));
            $node->setAttribute('href', $url . 'index.php?route=ne/track/click&link=' . urlencode(base64_encode($link)) . '&uid={uid}&language=' . $data['language_code'] . ($add_key ? '&key={key}' : ''));
        }
    }
}
The problem I'm having is with the getElementsByTagName part.
As suggested here, I should use HtmlAgilityPack. My code so far is this:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(leMessage);
leMessage is a string that holds the HTML. So far so good. The only problem is that there isn't a getElementsByTagName function in HtmlAgilityPack. And with the normal HtmlDocument (without the pack), I can't use a string as the HTML page, right?
Does anybody know what I should do to make this work? The only thing I can think of now is to put a WebBrowser control on the Windows form, set its document content to leMessage, and parse it from there. But personally I don't like that solution... But if there isn't another way...

The following was the first top-of-the-page block of code that popped up when I followed your link and clicked on "Examples":
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // DO SOMETHING WITH THE LINK HERE
}
doc.Save("file.htm");
Please do your own googling in the future.
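For what it's worth, a rough port of the PHP loop using HtmlAgilityPack could look like the sketch below. It is an untested illustration, not a definitive translation: the sample leMessage, url and languageCode values are hypothetical stand-ins for $message, $url and $data['language_code'], and Uri.EscapeDataString is assumed to be an acceptable substitute for PHP's urlencode on Base64 text.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-ins for $message, $url and $data['language_code'].
        string leMessage = "<p><a href=\"http://example.com/page\">link</a></p>";
        string url = "http://shop.example/";
        string languageCode = "en";

        var doc = new HtmlDocument();
        doc.LoadHtml(leMessage);

        // The XPath "//a[@href]" plays the role of getElementsByTagName('a')
        // plus hasAttribute('href'). SelectNodes returns null if nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode node in anchors)
            {
                string link = node.GetAttributeValue("href", "");

                // Only rewrite absolute http/https links, as the PHP does.
                bool isHttp = link.StartsWith("http://", StringComparison.Ordinal)
                           || link.StartsWith("https://", StringComparison.Ordinal);
                if (!isHttp) continue;

                bool addKey = link.Contains("{key}") || link.Contains("%7Bkey%7D");

                // base64_encode followed by urlencode.
                string encoded = Uri.EscapeDataString(
                    Convert.ToBase64String(Encoding.UTF8.GetBytes(link)));

                node.SetAttributeValue("href",
                    url + "index.php?route=ne/track/click&link=" + encoded
                        + "&uid={uid}&language=" + languageCode
                        + (addKey ? "&key={key}" : ""));
            }
        }

        // The modified HTML, the equivalent of $dom->saveHTML().
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

No WebBrowser control is needed: LoadHtml accepts the HTML as a string, and the rewritten markup can be read back from DocumentNode.OuterHtml.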

Html Agility Pack/C#: how to create/replace tags?

The task is simple, but I couldn't find the answer.
Removing tags (nodes) is easy with Node.Remove()... But how to replace them?
There's a ReplaceChild() method, but it requires creating a new tag. How do I set the contents of a tag? InnerHtml and OuterHtml are read-only properties.
See this code snippet:
public string ReplaceTextBoxByLabel(string htmlContent)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);
    foreach(HtmlNode tb in doc.DocumentNode.SelectNodes("//input[@type='text']"))
    {
        string value = tb.Attributes.Contains("value") ? tb.Attributes["value"].Value : " ";
        HtmlNode lbl = doc.CreateElement("span");
        lbl.InnerHtml = value;
        tb.ParentNode.ReplaceChild(lbl, tb);
    }
    return doc.DocumentNode.OuterHtml;
}
Are you sure InnerHtml is a read-only property?
The HTML Agility Pack documentation says otherwise (cut and paste):
Gets or Sets the HTML between the start and end tags of the object.
Namespace: HtmlAgilityPack
Assembly: HtmlAgilityPack (in HtmlAgilityPack.dll) Version: 1.4.0.0 (1.4.0.0)
Syntax
C#
public virtual string InnerHtml { get; set; }
If it is read-only, could you post some code?

html agility pack remove children

I'm having difficulty trying to remove a div with a particular ID, and its children, using the HTML Agility Pack. I am sure I'm just missing a config option, but it's Friday and I'm struggling.
The simplified HTML runs:
<html><head></head><body><div id='wrapper'><div id='functionBar'><div id='search'></div></div></div></body></html>
This is as far as I have got. The error thrown by the agility pack shows it cannot find a div structure:
<div id='functionBar'></div>
Here's the code so far (taken from Stackoverflow....)
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
//htmlDoc.OptionFixNestedTags = true;
// filePath is a path to a file containing the html
htmlDoc.LoadHtml(Html);
string output = string.Empty;
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count > 0)
{
    // Handle any parse errors as required
}
else
{
    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
        if (bodyNode != null)
        {
            HtmlAgilityPack.HtmlNode functionBarNode = bodyNode.SelectSingleNode("//div[@id='functionBar']");
            bodyNode.RemoveChild(functionBarNode,false);
            output = bodyNode.InnerHtml;
        }
    }
}
bodyNode.RemoveChild(functionBarNode,false);
But functionBarNode is not a child of bodyNode.
How about functionBarNode.ParentNode.RemoveChild(functionBarNode, false)? (And forget the bit about finding bodyNode.)
You can simply call:
var documentNode = document.DocumentNode;
var functionBarNode = documentNode.SelectSingleNode("//div[@id='functionBar']");
functionBarNode.Remove();
It is much simpler, and does the same as:
functionBarNode.ParentNode.RemoveChild(functionBarNode, false);
This will work for multiple:
HtmlDocument d = this.Download(string.Format(validatorUrl, Url));
foreach (var toGo in QuerySelectorAll(d.DocumentNode, "p[class=helpwanted]").ToList())
{
    toGo.Remove();
}
