C#: How can I retrieve this information?

On the HTML page I have something like this:
<table class="information">
<tbody>
<tr>
<td class="name">Name:</td>
<td>John</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
....
</tbody>
</table>
How can I retrieve the name? (There is other information too, but my example shows only the name.)
Note: the HTML has more than one table.
I tried this:
foreach (HtmlElement item in wb.Document.GetElementsByTagName("table"))
{
if (item.OuterHtml.Contains("information"))
{
... // Here I don't know how to continue
}
}

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.SelectSingleNode("//table[@class='information']");
// The leading dot keeps the search inside this table; a bare // would search the whole document.
var td = table.SelectSingleNode(".//td[@class='name']");
Console.WriteLine(td.InnerText);
or
var text = doc.DocumentNode.Descendants("td")
.First(td => td.Attributes["class"] != null && td.Attributes["class"].Value == "name")
.InnerText;
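Note that the snippets above return the label cell ("Name:"), while the question asks for the value next to it. A minimal sketch of one way to reach the value cell, assuming the HtmlAgilityPack NuGet package is installed; the sample markup (including the extra table) is made up for illustration:

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Sample markup modeled on the question. The first table is an assumption,
// added only to show that the class filter picks the right table.
var html = @"
<table class=""other""><tr><td class=""name"">Name:</td><td>Wrong</td></tr></table>
<table class=""information"">
  <tbody>
    <tr><td class=""name"">Name:</td><td>John</td></tr>
  </tbody>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Find the label cell inside the right table, then step to the next <td> in the same row.
var valueCell = doc.DocumentNode.SelectSingleNode(
    "//table[@class='information']//td[@class='name']/following-sibling::td[1]");

// SelectSingleNode returns null when nothing matches, so guard before reading.
var name = valueCell?.InnerText.Trim();
Console.WriteLine(name);
```

The `following-sibling::td[1]` step avoids relying on the position of the cell within the whole document, so it keeps working if more rows are added.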

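HtmlElementCollection tData = wb.Document.GetElementsByTagName("td");
// Declared outside the loop so the value is still available afterwards.
string name = "";
foreach (HtmlElement td in tData)
{
// The class attribute is exposed as "className" in the WebBrowser DOM.
if (td.GetAttribute("className") == "name")
{
name = td.InnerText;
break;
}
}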

Check out HtmlAgilityPack. It is a free and quite good library for working with HTML sources.

Related

Get specific table from html document with HtmlAgilityPack C#

I have an HTML document with two tables. For example:
<html>
<body>
<p>This is where first table starts</p>
<table>
<tr>
<th>head</th>
<th>head1</th>
</tr>
<tr>
<td>data</td>
<td>data1</td>
</tr>
</table>
<p>This is where second table starts</p>
<table>
<tr>
<th>head</th>
<th>head1</th>
</tr>
<tr>
<td>data</td>
<td>data1</td>
</tr>
</table>
</body>
</html>
I want to parse the first and the second table, but separately.
I will explain:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(@richTextBox1.Text);
if(comboBox_tables.Text.Equals("Table1"))
{
DataTable dt = new DataTable();
dt.Columns.Add("id", typeof(string));
dt.Columns.Add("inserted_at", typeof(string));
dt.Columns.Add("DisplayName", typeof(string));
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var row in doc.DocumentNode.SelectNodes("//tr"))
{
var nodes = row.SelectNodes("td");
if (nodes != null)
{
var id = nodes[0].InnerText;
var inserted_at = nodes[1].InnerText;
var DisplayName = nodes[2].InnerText;
dt.Rows.Add(id, inserted_at, DisplayName);
}
dataGridView1.DataSource = dt;
I'm trying to select the first table with //table[1], but it always takes both tables. How can I select the first table for if (table1) and the second for else if (table2)?
You are selecting table[1], but not doing anything with the return value.
Use the table variable to select the tr nodes, and use a relative path (.//tr), because a path starting with // searches the whole document even when it is called on a node.
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var row in table.SelectNodes(".//tr"))
.. rest of the code
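One detail worth spelling out: in XPath, an expression that starts with // is evaluated from the document root even when it is called on a single node, so the row query needs a leading dot to stay inside the chosen table. The following is a small sketch, assuming the HtmlAgilityPack NuGet package; the sample markup and the CellsOf helper are hypothetical and only serve the illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Two sibling tables, shaped like the question's document (cell values are made up).
var html = @"
<html><body>
<p>first</p>
<table><tr><th>head</th></tr><tr><td>a1</td><td>a2</td></tr></table>
<p>second</p>
<table><tr><th>head</th></tr><tr><td>b1</td><td>b2</td></tr></table>
</body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// (//table)[n] indexes into the document-wide list of tables (1-based).
var first = doc.DocumentNode.SelectSingleNode("(//table)[1]");
var second = doc.DocumentNode.SelectSingleNode("(//table)[2]");

// Hypothetical helper: the leading dot keeps the search inside the given table.
// SelectNodes returns null when nothing matches, hence the null handling.
string[] CellsOf(HtmlNode table) =>
    table?.SelectNodes(".//tr/td")?.Select(td => td.InnerText.Trim()).ToArray()
    ?? Array.Empty<string>();

var firstCells = CellsOf(first);
var secondCells = CellsOf(second);
Console.WriteLine(string.Join(",", firstCells));
Console.WriteLine(string.Join(",", secondCells));
```

The parenthesized form `(//table)[2]` is used because `//table[2]` means "a table that is the second table child of its parent", which is not the same thing when tables sit in different containers.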

How to find last column of a table using Html Agility Pack

I have a table like this:
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<th>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<th>Age: 92
</td>
</tr>
</table>
I want to get the last td from all rows using Html Agility Pack.
Here is my C# code so far:
await page.GoToAsync(NumOfSaleItems, new NavigationOptions
{
WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.DOMContentLoaded }
});
var html4 = page.GetContentAsync().GetAwaiter().GetResult();
var htmlDoc4 = new HtmlDocument();
htmlDoc4.LoadHtml(html4);
var SelectTable = htmlDoc4.DocumentNode.SelectNodes("/html/body/div[2]/div/div/div/table[2]/tbody/tr/td[1]/div[3]/div[2]/div/table[2]/tbody/tr/td[4]");
if (SelectTable.Count == 0)
{
continue;
}
else
{
foreach (HtmlNode row in SelectTable)//
{
string value = row.InnerText;
value = value.ToString();
var firstSpaceIndex = value.IndexOf(" ");
var firstString = value.Substring(0, firstSpaceIndex);
LastSellingDates.Add(firstString);
}
}
How can I get only the last column of the table?
I think the XPath you want is: //table[@id='table2']//tr/td[last()].
//table[@id='table2'] finds the table by ID anywhere in the document. This is preferable to a long brittle path from the root, since a table ID is less likely to change than the rest of the HTML structure.
//tr gets the descendant rows in the table. I'm using two slashes in case there might be an intervening <tbody> element in the actual HTML.
/td[last()] gets the last <td> in each row.
From there you just need to select the InnerText of each <td>.
var tds = htmlDoc.DocumentNode.SelectNodes("//table[@id='table2']//tr/td[last()]");
var values = tds?.Select(td => td.InnerText).ToList() ?? new List<string>();
Working demo here: https://dotnetfiddle.net/7I8yk1

String with email format turns into Hyperlink

Hi, this is probably an easy question, but I can't find information on how to solve it. I have a table with a field named Email whose values are strings, but MVC or the browser automatically turns each email string into a hyperlink.
When I inspect the element, it is a hyperlink:
lacubana@la.com
What can I do to display the emails as plain strings? I don't want that information in hyperlink format. Thanks very much.
Edit: here is the code of my view:
<table class="table table-bordered table-striped">
<tr>
<th>Email</th>
<th>Contraseña</th>
<th>NickName</th>
<th>TipoUsuario</th>
<th>Acciones</th>
</tr>
@foreach (var item in Model)
{
<tr>
<td>@Html.DisplayFor(modelItem => item.Email)</td>
<td>@Html.DisplayFor(modelItem => item.Contraseña)</td>
<td>@Html.DisplayFor(modelItem => item.NickName)</td>
@if (item.TipoUsuario == 1)
{
<td>Administrador</td>
}
else
{
<td>Vendedor</td>
}
<td>
@Html.ActionLink("Editar", "EditarUsuario", new { id = item.IdUser }) |
@Html.ActionLink("Eliminar", "EliminarUsuario", new { id = item.IdUser })
</td>
</tr>
}
</table>
and here is the code of my controller:
IList<Usuario> UsuarioList = new List<Usuario>();
var query = from usu in database.ReportingUsersT
where usu.Activo == 1
select usu;
var listdata = query.ToList();
foreach (var Usuariodata in listdata)
{
UsuarioList.Add(new Usuario()
{
IdUser = Usuariodata.IdUser,
Email = Usuariodata.Email,
Contraseña = Usuariodata.Contraseña,
NickName = Usuariodata.NickName,
TipoUsuario = Usuariodata.TipoUsuario
});
}
return View(UsuarioList);
@Html.DisplayFor(...) determines that the text is an email address and wraps it in a link. You can simply use
<td>@item.Email</td>
to display it as text.

Parsing html with html agility pack

I want to collect all anchor tags from this div, but I don't know the best way to do this with XPath.
<div class="biz_info">
<h3>Sørby Rehab</h3>
<table class="string_14">
<tbody>
<tr>
<td>Postadr.:</td>
<td class="tab_space">Rognerudveien 8 B, 0681 Oslo</td>
</tr>
<tr>
<td>Telefon:</td>
<td class="tab_space">928 70 700</td>
</tr>
<tr>
<td>Nettside:</td>
<td class="tab_space">www.sorby-rehab.no</td>
</tr>
</tbody>
</table>
</div>
Currently my code looks like this (it is very bad):
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(result));
HtmlNode root = doc.DocumentNode;
List<string> anchorTags = new List<string>();
foreach (HtmlNode link in root.SelectNodes("//@class=biz_info"))
{
string att = link.OuterHtml;
anchorTags.Add(att);
}
Can someone experienced with XPath help me?
HtmlDocument html = new HtmlDocument();
html.Load(new StringReader(result));
var anchorTags = html.DocumentNode.SelectNodes("//div[@class='biz_info']//a")
.Select(a => a.OuterHtml)
.ToList();
That will give you a list of the anchor tags' HTML. If you need just the URLs:
var urls = html.DocumentNode.SelectNodes("//div[@class='biz_info']//a[@href!='']")
.Select(a => a.Attributes["href"].Value)
.ToList();

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract name_of_the_movie from td class="ttr_name" and the download link from td class="td_dl".
This is the code I use to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
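Putting these suggestions together, a per-row extraction might look like the sketch below. It assumes the HtmlAgilityPack NuGet package, and it assumes the download cell wraps its image in an <a> element; the href values in the sample markup are invented for illustration, since the question only shows "#".

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Sample markup modeled on the question. The <a> around the image and both
// href values are assumptions made for this example.
var html = @"
<table id=""content-table"">
  <tr><th>Name</th><th>link</th></tr>
  <tr class=""tt_row"">
    <td class=""ttr_name""><a title=""name_of_the_movie"" href=""/details/1""><b>name_of_the_movie</b></a></td>
    <td class=""td_dl""><a href=""/download/1""><img alt=""Download"" src=""#""></a></td>
  </tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var results = new List<(string Title, string Url)>();

// Select only the data rows of this table; the header row has no tt_row class.
var rows = doc.DocumentNode.SelectNodes("//table[@id='content-table']//tr[@class='tt_row']");
if (rows != null) // SelectNodes returns null when nothing matches
{
    foreach (var row in rows)
    {
        // Paths starting with ./ are relative to the current row.
        var nameLink = row.SelectSingleNode("./td[@class='ttr_name']/a");
        var downloadLink = row.SelectSingleNode("./td[@class='td_dl']/a");
        if (nameLink == null || downloadLink == null) continue;

        results.Add((nameLink.GetAttributeValue("title", ""),
                     downloadLink.GetAttributeValue("href", "")));
    }
}

foreach (var (title, url) in results)
    Console.WriteLine($"{title} -> {url}");
```

Selecting cells by class rather than by index also sidesteps the fact that XPath positions are 1-based, so td[0] never matches anything.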
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using that regex in one of my projects; you can modify it to match your needs and rewrite it to match the title as well. I guess it is more convenient to match them in bulk.
