Parse single data elements from HTML tables with C#?


I have the code below in my main function and want to parse only the first row of the table (e.g. Nov 7, 2017 73.78 74.00 72.32 72.71 17,245,947).
I created a node that should contain only the first row, but when I start debugging, the node is null. How can I parse this data and store it, for example in a string or in separate variables? Is there a way?
WebClient web = new WebClient();
string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var node = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table/tbody/tr[2]");
List<List<string>> rows = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();

Your XPath selection string has an error. Since tbody is a generated node and is not part of the downloaded HTML, it should not be included in the path:
//*[@id=\"prices\"]/table/tr[2]
That alone should read the value, but HtmlAgilityPack then hits a second problem: malformed HTML. None of the <tr> and <td> nodes in the parsed text have corresponding </tr> or </td> closing tags, and HtmlAgilityPack fails to select values from a table with malformed rows. It is therefore necessary to select the whole table in a first step:
//*[@id=\"prices\"]/table
In the next step, either sanitize the HTML by adding the missing </tr> and </td> closing tags and parse the corrected table again, or parse the extracted string by hand: take lines 10 to 15 of the table string and split each of them on the > character. The hand-parsing approach is shown below. The code is tested and working.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

namespace GoogleFinanceDataScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient web = new WebClient();
            string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(page);

            // Select the whole table; the malformed rows cannot be selected individually.
            var node = doc.DocumentNode.SelectSingleNode("//div[@id='prices']/table");
            string outerHtml = node.OuterHtml;
            List<string> data = new List<string>();
            using (StringReader reader = new StringReader(outerHtml))
            {
                for (int i = 0; ; i++)
                {
                    var line = reader.ReadLine();
                    if (line == null) break;   // the table has fewer lines than expected
                    if (i < 9) continue;       // skip the lines before the first data row
                    else if (i < 15)
                    {
                        // Lines 10 to 15 hold the six values; each value follows the first '>'.
                        var dataRawArray = line.Split(new char[] { '>' });
                        var value = dataRawArray[1];
                        data.Add(value);
                    }
                    else break;
                }
            }
            Console.WriteLine(string.Join(", ", data));
        }
    }
}
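
For comparison, when the table markup is well formed, the row can be selected directly and no line splitting is needed. The following is only a minimal sketch under that assumption: the sample markup, the FirstRowReader class and the ReadFirstDataRow method are made up for illustration, and the column headers are guesses. Only the row values are taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class FirstRowReader
{
    // Returns the trimmed cell texts of the first data row (the row after the
    // header row), or an empty list when the table or the row is missing.
    public static List<string> ReadFirstDataRow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // No tbody in the path: it is only there if the markup itself contains it.
        HtmlNode row = doc.DocumentNode.SelectSingleNode("//div[@id='prices']/table/tr[2]");
        if (row == null)
        {
            return new List<string>();
        }

        return row.Elements("td")
                  .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                  .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical, well-formed sample markup (headers are assumed).
        const string sample =
            "<div id=\"prices\"><table>" +
            "<tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th></tr>" +
            "<tr><td>Nov 7, 2017</td><td>73.78</td><td>74.00</td><td>72.32</td><td>72.71</td><td>17,245,947</td></tr>" +
            "</table></div>";

        List<string> cells = FirstRowReader.ReadFirstDataRow(sample);
        Console.WriteLine(string.Join(" | ", cells));
    }
}
```

Each value is then available by index (cells[0] for the date, cells[1] for the first price, and so on), so it can be copied into separate variables or parsed with decimal.Parse as needed.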

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually.
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
    string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
    MessageBox.Show(itemx);
    MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all items.

When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the descendant nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
    string thisUrl = node.GetAttributeValue("href", "");
    if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
        listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") retrieves the div that holds all the search results and ignores the rest of the document.
The loop visits only the a elements beneath that div and adds each href to a list. The leading dot in SelectNodes(".//a") restricts the search to the current node (with // instead of .//, the entire page is searched, which is not what you want).
The if statement makes sure that only unique, non-empty values are added.
You now have all the links.
Fiddle: https://dotnetfiddle.net/j5aQFp

I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string hrefValue = link.GetAttributeValue("href", string.Empty);
    MessageBox.Show(hrefValue);
    MessageBox.Show(link.InnerText);
}

How to extract data from HtmlTable in C# and arrange in a row?

I want to extract data from an HTML table row by row, but I'm having trouble separating the columns within each row. The code below gives me each cell on its own line, whereas I want each table row on one line, followed by the next row on the next line. How can I do that?
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[" + tableCounter + "]");
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    ReportFileWriter(someVariable);
}
tableCounter++;
The output I get from this code has every cell on its own line, while the original table has several columns per row. The output I want keeps each row on one line, with spaces between the columns.

Since I don't know your specific website, I used the following code to parse an HTML table from a different page.
You need to install the HtmlAgilityPack NuGet package.
Code:
WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://www.mufap.com.pk/payout-report.php?tab=01");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='mydata']")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();
string result = string.Empty;
foreach (var item in table[0])
{
    result = result + " " + item;
}
Console.WriteLine(result);
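
The code above prints only the first row (table[0]). To get every row on its own line with spaces between the columns, each inner list can be joined with string.Join. The following is a small sketch of that step, assuming the rows have already been extracted into a list of lists as above; RowFormatter, ToLines and the sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowFormatter
{
    // Turns each row (a sequence of cell texts) into one line of text,
    // with the cells separated by the given separator.
    public static IEnumerable<string> ToLines(
        IEnumerable<IEnumerable<string>> rows, string separator = " ")
    {
        return rows.Select(cells => string.Join(separator, cells));
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample data standing in for the parsed table.
        var table = new List<List<string>>
        {
            new List<string> { "Fund A", "1.25", "07-Nov-2017" },
            new List<string> { "Fund B", "0.80", "08-Nov-2017" },
        };

        foreach (string line in RowFormatter.ToLines(table))
        {
            Console.WriteLine(line); // one table row per output line
        }
    }
}
```

In the question's code, each line could be passed to ReportFileWriter instead of Console.WriteLine. A tab ("\t") as separator may be easier to read back than a space when cells themselves contain spaces.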

Get only the text of a webpage using HTML Agility Pack?

I'm trying to scrape a web page to get just the text. I'm putting each word into a dictionary and counting how many times each word appears on the page. I'm trying to use HTML Agility Pack as suggested from this post: How to get number of words on a web page?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
int wordCount = 0;
Dictionary<string, int> dict = new Dictionary<string, int>();
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    MatchCollection matches = Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);
    foreach (Match s in matches)
    {
        //Add the entry to the dictionary
    }
}
However, with my current implementation, I'm still getting lots of results that are from the markup that should not be counted. It's close, but not quite there yet (I don't expect it to be perfect).
I'm using this page as an example. My results are showing a lot of the uses of the words "width" and "googletag", despite those not being in the actual text of the page at all.
Any suggestions on how to fix this? Thanks!

You can't be sure that the word you are searching for is displayed or not to the user as there will be JS execution and CSS rules that will affect that.
The following program finds 0 matches for "width" and "googletag", but it finds 126 "html" matches where Chrome's Ctrl+F finds 106.
Note that the program counts text nodes that contain the word rather than individual occurrences, and that it skips a text node if its parent node is <script>.
using HtmlAgilityPack;
using System;

namespace WordCounter
{
    class Program
    {
        private static readonly Uri Uri = new Uri("https://www.w3schools.com/html/html_editors.asp");

        static void Main(string[] args)
        {
            var doc = new HtmlWeb().Load(Uri);
            var nodes = doc.DocumentNode.SelectSingleNode("//body").DescendantsAndSelf();
            var word = Console.ReadLine().ToLower();
            while (word != "exit")
            {
                var count = 0;
                foreach (var node in nodes)
                {
                    if (node.NodeType == HtmlNodeType.Text && node.ParentNode.Name != "script" && node.InnerText.ToLower().Contains(word))
                    {
                        count++;
                    }
                }
                Console.WriteLine($"{word} is displayed {count} times.");
                word = Console.ReadLine().ToLower();
            }
        }
    }
}
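
The dictionary step that the question leaves as a comment could be filled in roughly as follows. This is a sketch, not part of the answer above: WordFrequency and Count are hypothetical names, and it assumes the caller passes in only the text it wants counted, for example the InnerText of text nodes whose parent is not a script or style element.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFrequency
{
    // Same pattern as in the question: words of two or more letters, plus "a" and "i".
    private static readonly Regex WordPattern =
        new Regex(@"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);

    // Counts how often each word occurs across all given text fragments.
    // Keys compare case-insensitively, so "Cat" and "cat" share one entry.
    public static Dictionary<string, int> Count(IEnumerable<string> texts)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string text in texts)
        {
            foreach (Match match in WordPattern.Matches(text))
            {
                // TryGetValue leaves current at 0 when the word is new.
                counts.TryGetValue(match.Value, out int current);
                counts[match.Value] = current + 1;
            }
        }
        return counts;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up fragments standing in for the text nodes of a page.
        var fragments = new[] { "A cat and a dog", "The CAT sat" };

        foreach (KeyValuePair<string, int> entry in WordFrequency.Count(fragments))
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

Unlike the node-based count above, this counts every occurrence of a word, so its totals will generally differ from both the 126 and the 106 quoted earlier.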

Using HTML Agility Pack to load all data into listboxes?

I have a page with 300-something rows and want to load them all into list boxes, with each column going into a different list.
I want to put the date in one box, and the other 2 numbers in 2 other boxes also.
HTML ex:
<table>
<tr>
<td>01/01/2017</td>
<td>100</td>
<td>500</td>
</tr>
<tr>
<td>01/02/2017</td>
<td>200</td>
<td>400</td>
</tr>
</table>
My code that pulls this:
private void LoadHTML()
{
    int count = 0;
    var link = @"http://example.com/data";
    HtmlWeb Web = new HtmlWeb();
    var htmlDoc = Web.Load(link);
    var node = htmlDoc.DocumentNode.SelectNodes("//td");
    foreach (var x in node)
    {
        count = count + 1;
        if (count > 5)
        {
            listBox1.Items.Add(x.InnerText);
        }
    }
}
listBox1 receives all the data from x, since everything is a td. Selecting tr would give me each row, but then I have nothing to split the data on. My data starts after the fifth td, hence the count check. There are headers, but I don't know how to pull the data for specific headers in this form.

First, select the tr nodes.
Then iterate over them and get the td nodes of each row.
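var trNodes = htmlDoc.DocumentNode.SelectNodes("//tr");
foreach (var tr in trNodes)
{
    var tdNodes = tr.SelectNodes("./td");
    // Rows without td cells (for example header rows built from th) make SelectNodes return null.
    if (tdNodes == null || tdNodes.Count < 3) continue;
    listBox1.Items.Add(tdNodes[0].InnerText);
    listBox2.Items.Add(tdNodes[1].InnerText);
    listBox3.Items.Add(tdNodes[2].InnerText);
}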
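
The same row-then-cell traversal can be tried outside a form by collecting the columns into plain lists first. The following is a sketch only: ColumnSplitter and SplitColumns are hypothetical names, the sample markup is the table from the question, and HtmlAgilityPack is assumed to be installed from NuGet.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ColumnSplitter
{
    // Returns one list per column; rows with fewer than columnCount
    // td cells (such as header rows) are skipped.
    public static List<string>[] SplitColumns(string html, int columnCount)
    {
        var columns = new List<string>[columnCount];
        for (int i = 0; i < columnCount; i++)
        {
            columns[i] = new List<string>();
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            return columns;
        }

        foreach (HtmlNode tr in rows)
        {
            HtmlNodeCollection cells = tr.SelectNodes("./td");
            if (cells == null || cells.Count < columnCount)
            {
                continue;
            }
            for (int i = 0; i < columnCount; i++)
            {
                columns[i].Add(cells[i].InnerText.Trim());
            }
        }
        return columns;
    }
}

public static class Program
{
    public static void Main()
    {
        const string sample =
            "<table>" +
            "<tr><td>01/01/2017</td><td>100</td><td>500</td></tr>" +
            "<tr><td>01/02/2017</td><td>200</td><td>400</td></tr>" +
            "</table>";

        List<string>[] columns = ColumnSplitter.SplitColumns(sample, 3);
        Console.WriteLine(string.Join(", ", columns[0])); // the date column
    }
}
```

In the form, each list could then be handed to its list box in one call, for example listBox1.Items.AddRange(columns[0].ToArray()).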

Regex to remove and replace characters

I have the following
<option value="Abercrombie">Abercrombie</option>
My file has about 2000 rows, each with a different location. I'm trying to understand regex, but unfortunately nothing I learn sticks, and I'm unsure whether this is possible.
What I want to do is run a regex that strips the HTML above, leaving the following:
Abercrombie
I then want to prefix a particular number to the front so the result would be for example
2,Abercrombie
Is this possible?

Don't use a regular expression, since HTML is not a regular language. You can use LINQ to XML instead, provided the markup is well-formed XML. If you want to process the entire file, you can replace the elements inline:
int myNumber = 2;
var html = @"<html><body><option value=""Abercrombie"">Abercrombie</option><div><option value=""Forever21"">Forever21</option></div></body></html>";
var doc = XDocument.Load(new StringReader(html));
var options = doc.Descendants().Where(o => o.Name == "option").ToList();
foreach (var element in options)
{
    element.ReplaceWith(string.Format("{0},{1}", myNumber, element.Value));
}
var result = doc.ToString();
This gives:
<html>
<body>2,Abercrombie<div>2,Forever21</div></body>
</html>
If you just want to grab the text for a specific tag, you can use the following:
int myNumber = 2;
var html = @"<option value=""Abercrombie"">Abercrombie</option>";
var doc = XDocument.Load(new StringReader(html));
var element = doc.Descendants().FirstOrDefault(o => o.Name == "option");
var attribute = element.Attribute("value").Value;
var result = string.Format("{0},{1}", myNumber, attribute);
//result == "2,Abercrombie"
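
For a file with one option element per line, as described in the question, the same idea can be applied line by line. This is a rough sketch under that assumption: OptionLineConverter, ToNumberedLine and ConvertAll are hypothetical names, the sample lines are made up, and every line is assumed to be well-formed XML (XElement.Parse throws an XmlException otherwise, for instance on an HTML-only entity such as &nbsp;).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OptionLineConverter
{
    // Parses one <option ...>text</option> line and returns "number,text".
    public static string ToNumberedLine(string optionLine, int number)
    {
        XElement option = XElement.Parse(optionLine.Trim());
        return $"{number},{option.Value}";
    }

    // Converts every non-blank line, using the same number as prefix.
    public static IEnumerable<string> ConvertAll(IEnumerable<string> lines, int number)
    {
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => ToNumberedLine(line, number));
    }
}

public static class Program
{
    public static void Main()
    {
        // In-memory stand-in for the file; System.IO.File.ReadLines(path)
        // could supply the lines instead.
        var lines = new[]
        {
            "<option value=\"Abercrombie\">Abercrombie</option>",
            "<option value=\"Forever21\">Forever21</option>",
        };

        foreach (string result in OptionLineConverter.ConvertAll(lines, 2))
        {
            Console.WriteLine(result);
        }
    }
}
```

This version reads the element text rather than the value attribute. In the example from the question the two are identical, but they need not be in general.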
