Select meta tag using c#? - c#

This is the markup from which I want to select the meta tag:
<meta charset="utf-8">
<title>Gmail: Email from Google</title>
<meta name="description" content="10+ GB of storage, less spam,
and mobile access. Gmail is email that's intuitive, efficient, and
useful. And maybe even fun.">
<link rel="icon" type="image/ico" href="//mail.google.com/favicon.ico">
I am doing this:
string texturl = textBox2.Text;
string Url = "http://" + texturl;
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
var SpanNodes = doc.DocumentNode.SelectNodes("//meta");
if (SpanNodes != null)
{
foreach (HtmlNode SN in SpanNodes)
{
string text = SN.InnerText;
MessageBox.Show(text);
}
}
It's not actually selecting any text from there. What am I doing wrong? Please help.

meta elements are void elements: they have no closing tag and no text children, so InnerText is empty. I believe you want the value of the content attribute. In HtmlAgilityPack you can read it with SN.GetAttributeValue("content", "") or SN.Attributes["content"].Value.
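A minimal sketch of reading the content attribute instead of InnerText. It assumes the HtmlAgilityPack NuGet package is referenced; the MetaReader class, the GetMetaContent helper and the inline HTML string are made up for illustration, and the markup is parsed from a string rather than downloaded.

```csharp
using System;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

public static class MetaReader
{
    // Hypothetical helper: returns the content attribute of the first
    // <meta name="..."> element, or null when no such element exists.
    public static string GetMetaContent(string html, string metaName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element by its name attribute, then read "content".
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//meta[@name='" + metaName + "']");
        return node == null ? null : node.GetAttributeValue("content", string.Empty);
    }

    public static void Main()
    {
        string html = "<meta charset=\"utf-8\"><title>Gmail</title>" +
                      "<meta name=\"description\" content=\"10+ GB of storage\">";
        Console.WriteLine(GetMetaContent(html, "description"));
    }
}
```

GetAttributeValue with a default is used here because not every meta element carries a content attribute (the charset one above does not).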

Related

Html Agility Pack Text </form> Tags Remain

I've tried two ways to get just the text from an HTML page with HTML Agility Pack:
Method 1
var root = doc.DocumentNode;
foreach (HtmlNode node in root.SelectNodes("//text()"))
{
sb.AppendLine(node.InnerText.Trim() + " ");
}
Method 2
var root = doc.DocumentNode;
foreach (var node in root.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim() + " ");
}
}
Both of these will leave behind the </form> tags if they are present on the page. For example, here's www.google.com:
"body": " Search Images Maps Play YouTube News Gmail Drive More Calendar
Translate Mobile Books Wallet Shopping Blogger Finance Photos Videos Docs
Even more » Account Options Sign in Search settings Web History
× Try a fast, secure browser with updates built in. Yes, get Chrome
now Advanced search Language tools </form> Advertising Programs
Business Solutions +Google About Google © 2016 - Privacy - Terms "
What gives?
Edit: By "Just the text" I mean "language text"....so:
<i>book reports</i> becomes book reports
More Details becomes More Details
<div>Check out our <b>deals</b>!</div> becomes Check out our deals!
Please search for your question before posting
Using C# regular expressions to remove HTML tags
Samples pulled from this webpage
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
Or if you want to use Agility (also pulled from webpage)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
output.AppendLine(line);
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());
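As a rough sketch, the regex approach can be wrapped in a small helper and checked against the examples from the question. The TagStripper class and StripTags method are hypothetical names, and WebUtility.HtmlDecode is swapped in for HttpUtility.HtmlDecode so that only System.Net is needed. A regex is not an HTML parser, so this will misbehave on input such as script bodies or attribute values that contain ">".

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes anything that looks like a tag,
    // then decodes entities such as &amp; in the remaining text.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<i>book reports</i>"));
        Console.WriteLine(StripTags("<div>Check out our <b>deals</b>!</div>"));
        // A stray closing tag like the one in the question is removed as well.
        Console.WriteLine(StripTags("Language tools </form> Advertising"));
    }
}
```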

get value from web page using Html Agility Pack

I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. Right when I hit my string hash, I get Object reference not set to an instance of an object. Can somebody tell me what I am doing wrong?
string url = "http://p2pool.org/ltcstats.php?address";
protected void Page_Load(string address)
{
string url = address;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}
Assuming you're trying to access that url, of course it should fail. That url doesn't return a full document, but just a fragment of html. There is no html tag, there is no body tag, just the div. Your xpath query returns nothing and thus the null reference exception. You need to query the right thing.
When I access that url, it returns this:
<div>
<center>
<div style="margin-right: 20px;">
<h3>Personal LTC Stats</h3>
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
<tr><td>Estimated Payout: </td><td> LTC</td></tr>
</table>
</div>
</center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
var hash = poolHashrate.InnerText;
// do stuff with hash
}
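To try the query without hitting the network, the fragment shown above can be loaded from a string. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, the HashrateReader class and ReadPoolHashrate method are hypothetical names, and the fragment is a shortened copy of the response quoted above.

```csharp
using System;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

public static class HashrateReader
{
    // Hypothetical helper: returns the text of the second cell in the first
    // table row, or null when the XPath matches nothing.
    public static string ReadPoolHashrate(string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // The fragment has no html or body element, so the path starts at div.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode("/div/center/div/table/tr[1]/td[2]");
        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        string fragment =
            "<div><center><div style=\"margin-right: 20px;\">" +
            "<h3>Personal LTC Stats</h3>" +
            "<table class='zebra-striped'>" +
            "<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>" +
            "<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>" +
            "</table></div></center></div>";
        Console.WriteLine(ReadPoolHashrate(fragment) ?? "(not found)");
    }
}
```

Returning null instead of indexing into the result is what avoids the original NullReferenceException when the page layout changes.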
The problem is that xpath is not finding the specified node. You can specify an id to the table or the tr in order to have a smaller xpath
Also, based on your code I assume that you're looking for a single node only, so you may want to use something like this
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is using Fizzler

Inject some string to a specific part of string in C#

How can I insert a string at a specific position in another string? I have an HTML string like this in a variable, say string stringContent:
<html><head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<meta name="Viewport" content="width=320; user-scaleable=no;
initial-scale=1.0">
<style type="text/css">
body {
background: black;
color: #80c0c0;
}
</style>
<script>
</script>
</head>
<body>
<button type="button" onclick="callNative();">Call to Native
Code!</button>
<br><br>
</body></html>
I need to add the string content below inside the <script></script> tag:
function callNative()
{
window.external.notify("Uulalaa!");
}
function addToBody(text)
{
document.body.innerHTML = document.body.innerHTML + "<br>" + text;
}
How can I achieve this in C#?
Assuming your content is stored in the string content, you can start by finding the script tag with:
int scriptpos = content.IndexOf("<script");
Then to get past the end of the script tag:
scriptpos = content.IndexOf(">", scriptpos) + 1;
And finally to insert your new content:
content = content.Insert(scriptpos, newContent);
This at least allows for potential attributes in the script tag.
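The three steps above can be put together as a small helper. This is a sketch using only the standard library; the ScriptInjector class and InsertIntoFirstScript method are hypothetical names, and the guard clauses for a missing tag are an addition not present in the steps above.

```csharp
using System;

public static class ScriptInjector
{
    // Hypothetical helper: inserts newContent right after the opening tag of
    // the first <script ...> element. Returns the input unchanged when no
    // complete opening script tag is found.
    public static string InsertIntoFirstScript(string content, string newContent)
    {
        int scriptPos = content.IndexOf("<script", StringComparison.OrdinalIgnoreCase);
        if (scriptPos < 0) return content;

        // Skip over any attributes to the end of the opening tag.
        int tagEnd = content.IndexOf('>', scriptPos);
        if (tagEnd < 0) return content;

        return content.Insert(tagEnd + 1, newContent);
    }

    public static void Main()
    {
        string html = "<head><script type=\"text/javascript\"></script></head>";
        Console.WriteLine(InsertIntoFirstScript(html, "function callNative() {}"));
    }
}
```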
Use htmlString.Replace(what, with)
var htmlString = "you html bla bla where's the script tag? oooups here it is!!!<script></script>";
var yourScript = "alert('HA-HA-HA!!!')";
htmlString = htmlString.Replace("<script>", "<script>" + yourScript);
Note that this will insert yourScript inside all <script> elements.
var htmlString = @"<script>$var1</script> <script>$var2</script>"
.Replace("$var1", "alert('var1')")
.Replace("$var2", "alert('var2')");
var htmlString = "you html bla bla where's the script tag? oooups here it is!!!<script></script>";
var yourScript = "alert('HA-HA-HA!!!')";
htmlString = htmlString.Insert(htmlString.IndexOf("<script>") + "<script>".Length, yourScript);
This can be done in another (safer) way, using HTML Agility Pack (open source project http://htmlagilitypack.codeplex.com). It helps you to parse and edit html without you having to worry about malformed tags (<br/>, <br />, < br / > etc). It includes operations to make it easy to insert elements, like AppendChild.
If you are dealing with HTML, this is the way to go.
For this you can read the HTML file into a string using the File.ReadAllText method. Here, for example, I have used a sample HTML string. After that, a few string operations add the tags inside the script element, as follows.
string text = "<test> 10 </test>";
string htmlString =
@" <html>
<head>
<script>
<tag1> 5 </tag1>
</script>
</head>
</html>";
int startIndex = htmlString.IndexOf("<script>");
int length = htmlString.IndexOf("</script>") - startIndex;
string scriptTag = htmlString.Substring(startIndex, length) + "</script>";
string expectedScripTag = scriptTag.Replace("<script>", "<script><br>" + text);
htmlString = htmlString.Replace(scriptTag, expectedScripTag);

Trying to scrape a webpage with Agility on C#

I'm using C# and Agility Pack to scrape a website, but the result I'm getting differs from what I see in Firebug. I suppose this is because the site is using some Ajax.
// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();
// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.saxobank.com/market-insight/saxotools/forex-open-positions");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("href");
// If there is no node with that Id, someNode will be null
richTextBox1.Text = document.DocumentNode.OuterHtml;
Could someone advise on how to do this correctly? What I'm getting back is just plain HTML. What I'm looking for is a
div id= "ctl00_MainContent_PositionRatios1_canvasContainer"
Any ideas?
If you have the ID, then use it - href is the name of an attribute, not the id.
HtmlNode someNode =
document.GetElementbyId("ctl00_MainContent_PositionRatios1_canvasContainer");
This will be the div with that id.
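You can then select any a child node of that div and read its href attribute. Note the leading dot, which keeps the XPath relative to someNode instead of searching the whole document:
var href = someNode.SelectNodes(".//a")[0].Attributes["href"].Value;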
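One detail worth illustrating: in HtmlAgilityPack an XPath that starts with // searches the whole document even when it is called on a child node, while a leading dot scopes it to that node. The sketch below assumes the HtmlAgilityPack NuGet package is referenced; the ScopedLinks class, the FirstHrefInside method, the "box" id and the inline HTML are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

public static class ScopedLinks
{
    // Hypothetical helper: returns the href of the first link inside the
    // element with the given id, or null when either is missing.
    public static string FirstHrefInside(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode container = doc.GetElementbyId(id);
        if (container == null) return null;

        // ".//a" is relative to container; "//a" would start at the document root.
        HtmlNode link = container.SelectSingleNode(".//a");
        return link == null ? null : link.GetAttributeValue("href", string.Empty);
    }

    public static void Main()
    {
        string html = "<a href=\"/outside\">x</a>" +
                      "<div id=\"box\"><a href=\"/inside\">y</a></div>";
        Console.WriteLine(FirstHrefInside(html, "box"));
    }
}
```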

How to get content via xpath

On the web page I have
<meta name="description" content="Learn about 94.100.179.159" />
How can I get exactly the text "Learn about 94.100.179.159" via XPath or HtmlAgilityPack?
I've tried
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://whois.domaintools.com/94.100.179.159");
foreach (HtmlNode link in htmldocObject.DocumentNode.SelectNodes("//meta"))
{
string s = link.InnerText;
Console.WriteLine(s);
}
Console.ReadLine();
but that doesn't give me what I want. How can I solve that?
//meta[@name = 'description']/@content
is the XPATH for the attribute you specified
string s = link.Value;
should return the attribute content.
Meta tags don't have any inner text, they have attributes.
Try this:
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://whois.domaintools.com/94.100.179.159");
foreach (HtmlNode link in htmldocObject.DocumentNode.SelectNodes("//meta"))
{
Console.WriteLine("-META-");
var attribDump=link.Attributes.Select(a=>a.Name+" : "+a.Value);
foreach (var x in attribDump)
{
Console.WriteLine(x);
}
}
Select the nodes as follows
SelectNodes("//*[local-name()='meta']")
Then, for each HtmlNode,
Console.WriteLine(link.Attributes["content"].Value);
