I'm reading in a huge HTML string that has some info I need to extract from it. I can set up the search parameters (where to parse), but how can I achieve this without saving to a temp file then using StreamReader?
Example:
//Pertinent data starts here:
<!--
body for the page starts here
-->
<table border="0" >
<tr>
<td class='HeaderTD'><b>User Name</b></td>
<td class='HeaderTD'><b>Mark TheMan</b></td>
</tr>
<tr>
<td class='DataTD_Black_Bold '>Department</td>
<td class='DataTD'>Programming</td>
</tr>
<tr>
<td class='DataTD_Black_Bold '>Office Phone</td>
<td class='DataTD'>555-555-5555</td>
</tr>
<tr>
<td class='DataTD_Black_Bold '>Office Ext</td>
<td class='DataTD'>x5555</td>
I need to just set some attributes in a class to the various fields (which are strings):
User.UserName = "Mark TheMan";
User.Department = "Programming";
User.OfficePhone = "555-555-5555";
etc.
You see I need to search for a line that contains something like "<b>User Name</b>" then return the next line so I can parse out the desired data. Let me know if you need more info, thanks!
You should use an HTML parser; HtmlAgilityPack is very good.
Here is a little console application to show how easy it is to pull the data out of tables:
static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load("example.html");
    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
    {
        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            foreach (HtmlNode cell in row.SelectNodes("th|td"))
            {
                Console.WriteLine("Cell value : " + cell.InnerText);
            }
        }
    }
}
For your example, the output will be:
Cell value : User Name
Cell value : Mark TheMan
Cell value : Department
Cell value : Programming
Cell value : Office Phone
Cell value : 555-555-5555
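Since the question starts from an HTML string rather than a file, here is a minimal sketch of the same idea using LoadHtml and mapping each label cell to a property. The User class and the UserParser.Parse helper are hypothetical names made up for illustration, and the sketch assumes every data row holds exactly a label cell followed by a value cell, as in the question's sample.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical target class; property names mirror the question.
public class User
{
    public string UserName { get; set; }
    public string Department { get; set; }
    public string OfficePhone { get; set; }
    public string OfficeExt { get; set; }
}

public static class UserParser
{
    // Hypothetical helper: parses the HTML from memory, no temp file needed.
    public static User Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var user = new User();

        // SelectNodes returns null (not an empty list) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return user;

        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 2) continue;

            // First cell is assumed to be the label, second the value.
            string label = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
            string value = HtmlEntity.DeEntitize(cells[1].InnerText).Trim();

            switch (label)
            {
                case "User Name": user.UserName = value; break;
                case "Department": user.Department = value; break;
                case "Office Phone": user.OfficePhone = value; break;
                case "Office Ext": user.OfficeExt = value; break;
            }
        }
        return user;
    }
}
```

Matching on the label text rather than on line positions means the order of the rows does not matter, which is usually more robust than "find a line, then read the next one".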
I have a table like this:
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<td>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<td>Age: 92
</td>
</tr>
</table>
I want to get the last td from all rows using Html Agility Pack.
Here is my C# code so far:
await page.GoToAsync(NumOfSaleItems, new NavigationOptions
{
WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.DOMContentLoaded }
});
var html4 = page.GetContentAsync().GetAwaiter().GetResult();
var htmlDoc4 = new HtmlDocument();
htmlDoc4.LoadHtml(html4);
var SelectTable = htmlDoc4.DocumentNode.SelectNodes("/html/body/div[2]/div/div/div/table[2]/tbody/tr/td[1]/div[3]/div[2]/div/table[2]/tbody/tr/td[4]");
if (SelectTable.Count == 0)
{
continue;
}
else
{
foreach (HtmlNode row in SelectTable)//
{
string value = row.InnerText;
value = value.ToString();
var firstSpaceIndex = value.IndexOf(" ");
var firstString = value.Substring(0, firstSpaceIndex);
LastSellingDates.Add(firstString);
}
}
How can I get only the last column of the table?
I think the XPath you want is: //table[@id='table2']//tr/td[last()].
//table[@id='table2'] finds the table by ID anywhere in the document. This is preferable to a long, brittle path from the root, since a table ID is less likely to change than the rest of the HTML structure.
//tr gets the descendant rows in the table. I'm using two slashes in case there is an intervening <tbody> element in the actual HTML.
/td[last()] gets the last <td> in each row.
From there you just need to select the InnerText of each <td>.
var tds = htmlDoc.DocumentNode.SelectNodes("//table[@id='table2']//tr/td[last()]");
var values = tds?.Select(td => td.InnerText).ToList() ?? new List<string>();
Working demo here: https://dotnetfiddle.net/7I8yk1
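For reference, here is a self-contained sketch of the same XPath wrapped in a helper. The LastColumn.GetValues name is hypothetical, and the sketch assumes the table ID contains no quote characters, since it is interpolated straight into the XPath. It also trims each value, because in the sample markup the closing </td> sits on the next line, so InnerText carries a trailing line break.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LastColumn
{
    // Hypothetical helper: returns the trimmed text of the last <td> in each row.
    public static List<string> GetValues(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumes tableId has no quotes; it is placed directly into the XPath.
        var cells = doc.DocumentNode.SelectNodes(
            $"//table[@id='{tableId}']//tr/td[last()]");

        // SelectNodes returns null when there is no match.
        if (cells == null) return new List<string>();

        return cells
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .ToList();
    }
}
```

Header rows that contain only <th> cells produce no match for td[last()], so they are skipped without any extra filtering.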
I have an HTML file that contains many tables, but I want to access a specific table from the file (not all tables).
So how can I do that?
The code looks something like this, and none of the tables have IDs:
<table border=1>
<tr><td>VI not loadable</td><td>0</td></tr>
<tr><td>Test not loadable</td><td>0</td></tr>
<tr><td>Test not runnable</td><td>0</td></tr>
<tr><td>Test error out</td><td>0</td></tr>
</table>
Every table should have an ID or something else that distinguishes it from the others. If it does, you can get it via jQuery. For example:
<table class="table table-striped" id="tbl1">
<thead>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Email</th>
</tr>
</thead>
<tbody>
<tr>
<td>John</td>
<td>Doe</td>
<td>john@example.com</td>
</tr>
<tr>
<td>Mary</td>
<td>Moe</td>
<td>mary@example.com</td>
</tr>
<tr>
<td>July</td>
<td>Dooley</td>
<td>july@example.com</td>
</tr>
</tbody>
</table>
and get it like this:
var table = $('#tbl1').html();
If not, you can find it by its position in the file. For example, you can access the second table in the document like this:
var table = $('table').eq(1);
Or in C#, maybe this would help:
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
}
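When the tables have no IDs, two options with HtmlAgilityPack are to pick the table by its position in the document or by some text it is known to contain. The sketch below shows both; TableFinder, ByPosition and ByCellText are hypothetical names, and the text lookup assumes the search text contains no single quotes and that tables are not nested.

```csharp
using System;
using HtmlAgilityPack;

public static class TableFinder
{
    // n is 1-based, as XPath positions are. The parentheses matter:
    // (//table)[n] is the n-th table in document order, whereas
    // //table[n] means "a table that is the n-th table child of its parent".
    public static HtmlNode ByPosition(HtmlDocument doc, int n)
    {
        return doc.DocumentNode.SelectSingleNode($"(//table)[{n}]");
    }

    // Finds the first table that has a cell whose trimmed text equals `text`.
    // Assumes `text` has no single quotes; it is placed directly into the XPath.
    public static HtmlNode ByCellText(HtmlDocument doc, string text)
    {
        return doc.DocumentNode.SelectSingleNode(
            $"//table[.//td[normalize-space()='{text}']]");
    }
}
```

Both helpers return null when nothing matches, so check the result before calling SelectNodes on it. Looking a table up by a known label such as "VI not loadable" tends to survive layout changes better than relying on its position.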
Hello, I'm getting an HttpWebResponse that contains an HTML page with all the data I need, for example a table with date info. I need to save that data to an array list and then write it to an XML file.
Example of the HTML page:
<tbody>
<tr class="odd">
<tr class="even">
<td class="padding5 sorting_1">
<span class="DateHover" sort="14/03/18/22/56" title="18.03.14" ref="18.03.14">18.03.14</span>
</td>
<td class="CellStyleDefaultText">
<span class="transSpan">Info</span>
</td>
<td class="CellStyleDefaultText" title="UserNumber123">UserNumber123</td>
<td class="CellStyleSignedNumber floatopHomePage">
<span title="701,554.23 ">701,554.23 </span>
</td>
<td class="CellStyleAmount CellStyleAmountNew">
<div title="-3354999.71">-3354999.71</div>
</td>
<td class="CellStyleDetails CCMoreDetailsTd">
<span> 17.03.14 Info</span>
</td>
</tr>
</tbody>
I got the first span, the one with the date, like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[@class='DateHover']"))
and the span with the info like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='transSpan']"))
Then I got stuck trying to get UserNumber123. I tried this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='CellStyleDefaultText']"))
but it also returns the transSpan span, because that span sits inside a <td> with the same class.
I also can't get any of the other cells: CellStyleSignedNumber, CellStyleAmount and CellStyleDetails.
Any ideas?
You can simply mention an attribute name to select elements that have that attribute set. So you can try to get UserNumber123 this way:
doc.DocumentNode.SelectNodes("//td[@class='CellStyleDefaultText' and @title]")
The XPath above means: select each <td> element that has a title attribute and whose class attribute equals 'CellStyleDefaultText'.
For the remaining <td> elements, try the XPath contains() function, for example:
doc.DocumentNode.SelectNodes("//td[contains(@class,'CellStyleSignedNumber')]")
UPDATE:
Responding to the latter part of your comment: if you intend to get a <td> that has a child <span> element, you can add that criterion like this:
doc.DocumentNode.SelectNodes("//td[span and contains(@class,'CellStyleSignedNumber')]")
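One thing to watch with contains(@class, ...) is that it is a substring test, so 'CellStyleAmount' would also match a cell whose only class is 'CellStyleAmountNew'. Below is a minimal sketch of a common XPath 1.0 idiom that matches a whole class token instead. RowReader, HasClass and CellText are hypothetical helper names, and the sketch assumes class names contain no quote characters.

```csharp
using System;
using HtmlAgilityPack;

public static class RowReader
{
    // Builds a predicate that matches one whole, space-separated class token.
    // Padding both sides with spaces prevents partial matches.
    public static string HasClass(string name)
    {
        return $"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')";
    }

    // Returns the trimmed text of the first direct <td> child of `row`
    // carrying the given class, or null when the row has no such cell.
    public static string CellText(HtmlNode row, string className)
    {
        HtmlNode cell = row.SelectSingleNode($"td[{HasClass(className)}]");
        return cell == null ? null : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }
}
```

With this in place you could loop over the rows and call, for example, RowReader.CellText(row, "CellStyleAmount") for each one. For the two cells that share CellStyleDefaultText, the extra "and @title" condition from the answer is still what tells them apart.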
I have a table in the HTML code below:
<table style="padding: 0px; border-collapse: collapse;">
<tr>
<td><h3>My Regional Financial Office</h3></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><h3>My Address</h3></td>
</tr>
<tr>
<td>000 Test Ave S Ste 000</td>
</tr>
<tr>
<td>Golden Valley, MN 00000</td>
</tr>
<tr>
<td>Get Directions</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
How can I get the inner text of the next two <tr> elements after the table row containing the text "My Address"?
You can use the following XPath:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var tdOfInterests =
htmlDoc.DocumentNode
.SelectNodes("//tr[td/h3[.='My Address']]/following-sibling::tr[position() <= 2]/td");
foreach (HtmlNode td in tdOfInterests)
{
//given html input in question following code will print following 2 lines:
//000 Test Ave S Ste 000
//Golden Valley, MN 00000
Console.WriteLine(td.InnerText);
}
The key to the XPath above is the following-sibling axis combined with a position() filter.
UPDATE:
A brief explanation of the XPath used in this answer:
//tr[td/h3[.='My Address']]
This part selects the <tr> element that has a child <td> element that in turn has a child <h3> element whose value equals 'My Address'.
/following-sibling::tr[position() <= 2]
The next part selects the following sibling <tr> elements at position 1 or 2, counted from the current <tr> element (the one selected by the previous part).
/td
The last part selects the child <td> elements of each of those <tr> elements.
I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract name_of_the_movie from the td with class="ttr_name" and the download link from the td with class="td_dl".
This is the code I use to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
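Getting the correct nodes is the part that trips up the loop in the question: XPath positions start at 1, so td[0] never matches, and an expression starting with // searches the whole document even when called on a row or table node. Here is a minimal sketch with both points addressed. MovieRows.Extract is a hypothetical name, and the sketch assumes the real page wraps the download image in an <a href="..."> element, which the sample markup does not show.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MovieRows
{
    // Hypothetical helper: returns (name, link) pairs for each tt_row row.
    public static List<(string Name, string Link)> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<(string Name, string Link)>();

        // Only the data rows; the header row has no tt_row class.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[@id='content-table']//tr[@class='tt_row']");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            // Relative paths (no leading //) stay inside the current row.
            HtmlNode nameLink = row.SelectSingleNode("td[1]/a");
            HtmlNode linkCell = row.SelectSingleNode("td[2]");
            if (nameLink == null || linkCell == null) continue;

            string name = nameLink.GetAttributeValue("title", "");

            // Assumption: the download link is an <a href> inside the second cell.
            HtmlNode anchor = linkCell.SelectSingleNode(".//a[@href]");
            string link = anchor == null ? "" : anchor.GetAttributeValue("href", "");

            result.Add((name, link));
        }
        return result;
    }
}
```

If the second cell really contains only an <img>, as in the sample, the link comes back empty and you would read the src attribute of the img instead.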
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using that regex in one of my projects; you can modify it to match your needs and rewrite it to match the title as well. I find it more convenient to match them in bulk.