How to scrape a variable data from a source code?

I'm trying to scrape a link from a website's source code, but part of the link changes every time I fetch the page.
For example:
<div align="center">
<a href="http://www10.site.com/d/the rest of the link">
<span class="button_upload green">
The next time I fetch the source, http://www10 has changed to http://www plus some other number, such as http://www65.
How can I scrape the exact link regardless of the number?
Edit:
Here's how I use the regex:
MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);

You mentioned in the comments that you use regular expressions to parse the HTML document. That is the hardest way to do this (and generally not recommended). Try an HTML parser such as Html Agility Pack: http://html-agility-pack.net
You can install Html Agility Pack via NuGet. Here is an example (adapted from their website):
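HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when no node matches the XPath
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att); // FixLink is your own method that returns the new URL
}
doc.Save("file.htm");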
It can also load HTML from a string (LoadHtml), not just from files. You use XPath to navigate inside the document and select what you want; CSS selectors are available through add-ons such as Fizzler.
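Applied to the markup in the question, a minimal sketch might look like the following. It assumes the link always wraps the span with the button_upload class and that the host always has the form www<digits>.site.com; the class name UploadLinkScraper, the method ExtractUploadLink, and the sample path /d/abc123 are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class UploadLinkScraper
{
    // Accepts any numbered mirror host: www10, www65, ...
    private static readonly Regex MirrorUrl =
        new Regex(@"^http://www\d+\.site\.com/", RegexOptions.IgnoreCase);

    // Returns the href of the <a> that wraps the upload button, or null if absent.
    public static string ExtractUploadLink(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the stable part of the markup (the span's class), not on the host.
        HtmlNode anchor = doc.DocumentNode.SelectSingleNode(
            "//a[@href][span[contains(@class, 'button_upload')]]");
        if (anchor == null) return null;

        string href = anchor.GetAttributeValue("href", string.Empty);
        return MirrorUrl.IsMatch(href) ? href : null;
    }

    public static void Main()
    {
        // Sample input modelled on the question; the path and link text are invented.
        string html =
            "<div align=\"center\">" +
            "<a href=\"http://www65.site.com/d/abc123\">" +
            "<span class=\"button_upload green\">Download</span></a></div>";
        Console.WriteLine(ExtractUploadLink(html));
    }
}
```

If you would rather keep your existing Regex.Matches call, the smallest change is to replace the literal www10 in the pattern with www\d+ so that any number matches.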

How about a JS function like this, run when the page loads:
// jQuery is required!
var updateLinkUrl = function (num) {
    $.each($('.button_upload.green'), function (pos, el) {
        var orig = $(el).parent().prop("href");
        var newurl = orig.replace("www10", "www" + num);
        $(el).parent().prop("href", newurl);
    });
};
$(document).ready(function () { updateLinkUrl(65); });

Related

Html Agility Pack Text </form> Tags Remain

I've tried two ways to get just the text from an HTML page with HTML Agility Pack:
Method 1
var root = doc.DocumentNode;
foreach (HtmlNode node in root.SelectNodes("//text()"))
{
sb.AppendLine(node.InnerText.Trim() + " ");
}
Method 2
var root = doc.DocumentNode;
foreach (var node in root.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim() + " ");
}
}
Both of these leave the </form> tags behind if they are present on the page. For example, here's www.google.com:
"body": " Search Images Maps Play YouTube News Gmail Drive More Calendar
Translate Mobile Books Wallet Shopping Blogger Finance Photos Videos Docs
Even more » Account Options Sign in Search settings Web History
× Try a fast, secure browser with updates built in. Yes, get Chrome
now Advanced search Language tools </form> Advertising Programs
Business Solutions +Google About Google © 2016 - Privacy - Terms "
What gives?
Edit: By "just the text" I mean the human-readable text, so:
<i>book reports</i> becomes book reports
More Details becomes More Details
<div>Check out our <b>deals</b>!</div> becomes Check out our deals!

Please search for your question before posting
Using C# regular expressions to remove HTML tags
Samples taken from that page:
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
Or, if you want to use Html Agility Pack (also taken from that page):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
output.AppendLine(line);
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());
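As for why </form> survives: as far as I know, older Html Agility Pack releases register form in the static HtmlNode.ElementsFlags table as an element with no content, so its closing tag ends up as a stray text node. The sketch below removes that entry before parsing and then collects the text nodes much like Method 1. Treat the explanation as an assumption about your HAP version; the names TextExtractor and GetVisibleText and the sample HTML are made up.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TextExtractor
{
    public static string GetVisibleText(string html)
    {
        // Assumption: this HAP version flags <form> as an empty element.
        // Removing the entry makes the parser treat it as a normal container.
        // Remove() is harmless if the key is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null) return string.Empty;

        var parts = textNodes
            // Skip script and style bodies, which are not human-readable text.
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            .Select(n => WebUtility.HtmlDecode(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // Invented sample markup containing a form.
        string html = "<body><form action=\"/search\"><input name=\"q\">Search</form><p>About</p></body>";
        Console.WriteLine(GetVisibleText(html));
    }
}
```

Note that ElementsFlags is global state, so the change affects every document parsed afterwards in the same process.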

Parsing innertext of html

This is part of the HTML that I am parsing:
<li>http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421</li>
I want to get http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421 as the output.
So far I have tried:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(responseFromServer);
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
if(link.innerText.Contains("ticket"))
{
Console.WriteLine(link.InnerText);
}
}
... but the output is null (no inner text is found).

That's probably because the first link in your HTML document, which is what SelectSingleNode() returns, doesn't contain the text "ticket". You can check for the target text directly in the XPath, like so:
var link = document.DocumentNode.SelectSingleNode("//a[contains(.,'ticket')]");
if (link != null)
{
Console.WriteLine(link.InnerText);
}
or using LINQ, if you prefer:
var link = document.DocumentNode
.SelectNodes("//a")
.OfType<HtmlNode>()
.FirstOrDefault(o => o.InnerText.Contains("ticket"));
if (link != null)
{
Console.WriteLine(link.InnerText);
}
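One caveat with both variants: SelectNodes returns null rather than an empty collection when nothing matches, so the LINQ version throws if the page has no <a> at all. The snippet as posted in the question shows the URL as text inside an <li>, so the sketch below also looks at <li> elements; that is an assumption about the real markup, and the names TicketLinkFinder and FindTicketText are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TicketLinkFinder
{
    // Returns the text of the first <a> or <li> containing "ticket", or null.
    public static string FindTicketText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes yields null (not an empty list) when the XPath matches nothing.
        IEnumerable<HtmlNode> candidates = doc.DocumentNode.SelectNodes("//a | //li");
        if (candidates == null) return null;

        HtmlNode hit = candidates.FirstOrDefault(n => n.InnerText.Contains("ticket"));
        return hit == null ? null : hit.InnerText.Trim();
    }

    public static void Main()
    {
        string html = "<ul><li>http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421</li></ul>";
        Console.WriteLine(FindTicketText(html) ?? "(no match)");
    }
}
```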

You provided a piece of code that won't compile because innerText is not defined (the property is InnerText, with a capital I). If you try this code, you'll probably get what you asked for:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
if(link.InnerText.Contains("ticket"))
{
Console.WriteLine(link.InnerText);
}
}

You can use Html Agility Pack instead of the built-in HtmlDocument; it lets you parse the HTML in more depth.
For more information, see: How to use HTML Agility pack

get value from web page using Html Agility Pack

I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. Right when I hit my string hash, I get Object reference not set to an instance of an object. Can somebody tell me what I am doing wrong?
string url = "http://p2pool.org/ltcstats.php?address";
protected void Page_Load(string address)
{
    string url = address;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}

Assuming you're trying to access that url, of course it should fail. That url doesn't return a full document, but just a fragment of html. There is no html tag, there is no body tag, just the div. Your xpath query returns nothing and thus the null reference exception. You need to query the right thing.
When I access that url, it returns this:
<div>
<center>
<div style="margin-right: 20px;">
<h3>Personal LTC Stats</h3>
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
<tr><td>Estimated Payout: </td><td> LTC</td></tr>
</table>
</div>
</center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
var hash = poolHashrate.InnerText;
// do stuff with hash
}
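If the page layout is likely to shift, a variation is to locate the value by its label instead of by position. This is only a sketch against the fragment shown above; the names PoolStats and GetStat are made up, and it assumes the label text stays "Pool Hashrate".

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class PoolStats
{
    // Finds the <td> whose text starts with the label and returns the next cell's text.
    // Note: the label is pasted into the XPath, so it must not contain a single quote.
    public static string GetStat(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string xpath = "//td[starts-with(normalize-space(.), '" + label + "')]"
                     + "/following-sibling::td[1]";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        // The fragment returned by the URL, as quoted in the answer.
        string html =
            "<div><center><div style=\"margin-right: 20px;\">" +
            "<h3>Personal LTC Stats</h3>" +
            "<table class='zebra-striped'>" +
            "<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>" +
            "<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>" +
            "</table></div></center></div>";
        Console.WriteLine(GetStat(html, "Pool Hashrate"));
    }
}
```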

The problem is that the XPath is not finding the specified node. You can add an id to the table or the tr so that a shorter XPath works.
Also, based on your code, I assume you're looking for a single node only, so you may want to use something like this:
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is Fizzler.

Loop through HTML with tags from string

I'm porting a PHP script to C# for performance reasons.
This is the PHP source I'm having trouble with:
$dom = new DOMDocument;
$dom->loadHTML($message);
foreach ($dom->getElementsByTagName('a') as $node) {
if ($node->hasAttribute('href')) {
$link = $node->getAttribute('href');
if ((strpos($link, 'http://') === 0) || (strpos($link, 'https://') === 0)) {
$add_key = ((strpos($link, '{key}') !== false) || (strpos($link, '%7Bkey%7D') !== false));
$node->setAttribute('href', $url . 'index.php?route=ne/track/click&link=' . urlencode(base64_encode($link)) . '&uid={uid}&language=' . $data['language_code'] . ($add_key ? '&key={key}' : ''));
}
}
}
The part I'm having trouble with is getElementsByTagName.
As suggested here, I should use Html Agility Pack. My code so far is this:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(leMessage);
leMessage is a string that holds the HTML. So far so good. The only problem is that there isn't a getElementsByTagName function in Html Agility Pack. And with the normal HtmlDocument (without the pack), I can't use a string as the HTML page, right?
Does anybody know what I should do to make this work? The only thing I can think of is to put a WebBrowser control on the Windows form, set its document content to leMessage, and parse it from there. Personally I don't like that solution, but if there isn't another way...

The following was the first top-of-the-page block of code that popped up when I followed your link and clicked on "Examples":
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when no node matches the XPath
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // DO SOMETHING WITH THE LINK HERE
}
doc.Save("file.htm");
Please do your own googling in the future.
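For the PHP loop in the question specifically, a rough C# port could look like this. It is a sketch, not a drop-in replacement: baseUrl and languageCode stand in for the PHP variables $url and $data['language_code'], and the names LinkTracker and RewriteLinks are made up. How & inside the rewritten attribute is serialized may depend on the Html Agility Pack version, so check the output before relying on it.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkTracker
{
    public static string RewriteLinks(string message, string baseUrl, string languageCode)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(message);

        // XPath counterpart of getElementsByTagName('a') plus hasAttribute('href').
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return message; // no links to rewrite

        foreach (HtmlNode a in anchors)
        {
            string link = a.GetAttributeValue("href", string.Empty);

            // Same check as strpos($link, 'http://') === 0 || strpos($link, 'https://') === 0.
            bool isHttp = link.StartsWith("http://", StringComparison.Ordinal)
                       || link.StartsWith("https://", StringComparison.Ordinal);
            if (!isHttp) continue;

            bool addKey = link.Contains("{key}") || link.Contains("%7Bkey%7D");

            // urlencode(base64_encode($link)) in the PHP version.
            string encoded = Uri.EscapeDataString(
                Convert.ToBase64String(Encoding.UTF8.GetBytes(link)));

            a.SetAttributeValue("href",
                baseUrl + "index.php?route=ne/track/click&link=" + encoded
                + "&uid={uid}&language=" + languageCode
                + (addKey ? "&key={key}" : ""));
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Invented sample message and base URL.
        string message = "<p><a href=\"http://example.com/page\">Visit</a> <a href=\"mailto:a@b.c\">Mail</a></p>";
        Console.WriteLine(RewriteLinks(message, "http://shop.example/", "en"));
    }
}
```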

C# Html Agility Pack ( SelectSingleNode )

I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?

The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
    doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;

Try this:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;

string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;

public List<string> GetAllTagLinkContent(string content)
{
    string html = string.Format("<html><head></head><body>{0}</body></html>", content);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
    // SelectNodes returns null when nothing matches
    if (nodes == null) return new List<string>();
    return nodes.Select(r => r.InnerText).ToList();
}
The same approach works with "//a[@href]". Hope this helps.
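Whichever selector you end up with, the "Object reference not set to an instance of an object" error comes from calling .InnerText or .InnerHtml on the null that SelectSingleNode returns when nothing matches. A minimal null-safe sketch (the names SafeSelect and GetTextById are made up, and the id must not contain a single quote):

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class SafeSelect
{
    // Returns the inner text of the element with the given id, or null if it is missing.
    public static string GetTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node?.InnerText; // null-conditional: no exception when node is null
    }

    public static void Main()
    {
        string html = "<h1 class=\"bla\" id=\"my_name\">namehere</h1>";
        Console.WriteLine(GetTextById(html, "my_name"));                    // namehere
        Console.WriteLine(GetTextById(html, "profile_name") ?? "(not found)"); // (not found)
    }
}
```

If the lookup comes back null even though the element is visible in the browser, the element is probably created by JavaScript after the page loads, as the first answer explains.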
