Find indexes in String using multiple search items and one single iteration - c#

I have the following HTML sample document:
.....
<div class="TableElement">
<table>
<tr>
<th class="boxToolTip" title="La quotazione di A2A è in rialzo o in ribasso?"> </th>
..
<th class="boxToolTip" class="ColumnLast" title="Trades più recenti su A2A">Ora <img title='' alt='' class='quotePageRTupgradeLink' href='#quotePageRTupgradeContainer' id='cautionImageEnt' src='/common/images/icons/caution_sign.gif'/></th>
</tr>
<tr class="odd">
..
<td align="center"><span id="quoteElementPiece6" class="PriceTextUp">1,619</span></td>
<td align="center"><span id="quoteElementPiece7" class="">1,6235</span></td>
<td align="center"><span id="quoteElementPiece8" class="">1,591</span></td>
<td align="center"><span id="quoteElementPiece9" class="">1,5995</span></td>
..
</tr>
</table>
</div>
......
I need to get the values of the elements with the ids quoteElementPiece6, 7, 8, 9 and 17 (the last one appears further down in the document).
I am simply searching one by one in the code at the moment:
int index6 = doc.IndexOf("quoteElementPiece6");
..
int index17 = doc.IndexOf("quoteElementPiece17");
I want to improve this by scanning in one go and having all the indexes for the substrings I need. Example:
var searchstrings = new string[]
{
"quoteElementPiece6",
"quoteElementPiece7",
"quoteElementPiece8",
"quoteElementPiece9",
"quoteElementPiece17"
};
int[] indexes = getIndexes(document, searchstrings); // indexes should be in the same order as searchstrings
Is there anything native in .NET that does this (LINQ, for instance)?
I know there are HTML parser libraries, but I would prefer to avoid them; I would like to learn how to do this for any kind of document.

var words = new []{
"quoteElementPiece6",
"quoteElementPiece7"};
// I take for granted your `document` is a string and not an `HtmlDocument` or whatnot.
var result = words.Select(word=>document.IndexOf(word));
Console.WriteLine(string.Join(",", result));
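Note that calling IndexOf once per word still scans the document once per search string. If a single pass really matters, one possible approach is to combine the search strings into one regular expression and record the first position at which each one matches. The following is only a sketch: MultiSearch and GetIndexes are hypothetical names, it assumes the search strings are non-empty, and it assumes no search string is a prefix of another (a shorter string that occurs only inside a longer match would be reported as -1).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MultiSearch
{
    // Hypothetical helper: returns the first index of each term (or -1 if absent),
    // in the same order as `terms`, using one regex scan over the document.
    public static int[] GetIndexes(string document, string[] terms)
    {
        if (terms.Length == 0) return new int[0];

        var distinct = terms.Distinct().ToList();
        var first = distinct.ToDictionary(t => t, t => -1);

        // Longer terms go first so the alternation prefers them over shorter ones.
        var pattern = string.Join("|",
            distinct.OrderByDescending(t => t.Length).Select(Regex.Escape));

        foreach (Match m in Regex.Matches(document, pattern))
        {
            // Keep only the first occurrence of each term.
            if (first[m.Value] < 0) first[m.Value] = m.Index;
        }
        return terms.Select(t => first[t]).ToArray();
    }
}
```

Called as MultiSearch.GetIndexes(document, searchstrings), this returns one index per search string. The regex match is case-sensitive and ordinal, which can differ slightly from the culture-sensitive default of string.IndexOf.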

You can do this with LINQ. Check my solution:
var doc = "this is my document";
List<string> searchstrings = new List<string>
{
"quoteElementPiece6",
"quoteElementPiece7",
"quoteElementPiece8",
"quoteElementPiece9",
"quoteElementPiece17"
};
var lastIndexOfList = new List<int>(searchstrings.Count);
searchstrings.ForEach(x => lastIndexOfList.Add(doc.LastIndexOf(x)));

var pattern = @"(?s)<tr class=""odd"">.+?</tr>";
var tr = Regex.Match(html, pattern).Value.Replace("&nbsp;", "");
var xml = XElement.Parse(tr);
var nums = xml
.Descendants()
.Where(n => (string)n.Attribute("id") != null)
.Where(n => n.Attribute("id").Value.StartsWith("quoteElementPiece"))
.Select(n => Regex.Match(n.Attribute("id").Value, "[0-9]+").Value);

Related

Html Agility Pack parsing table into object

So I have HTML like this:
<tr class="row1">
<td class="id">123</td>
<td class="date">2014-08-08</td>
<td class="time">12:31:25</td>
<td class="notes">something here</td>
</tr>
<tr class="row0">
<td class="id">432</td>
<td class="date">2015-02-09</td>
<td class="time">12:22:21</td>
<td class="notes">something here</td>
</tr>
And it continues like that for each customer row. I want to parse the contents of each table row into an object. I've tried a few methods but I can't seem to get it to work right.
This is what I have currently
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//table[@id='customerlist']//tr"))
{
Customer cust = new Customer();
foreach (HtmlNode info in row.SelectNodes("//td"))
{
if (info.GetAttributeValue("class", String.Empty) == "id")
{
cust.ID = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "date")
{
cust.DateAdded = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "time")
{
cust.TimeAdded = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
cust.Notes = info.InnerText;
}
}
Console.WriteLine(cust.ID + " " + cust.TimeAdded + " " + cust.DateAdded + " " + cust.Notes);
}
It works to the point that it prints info of the last row of the table on each loop. I'm just missing something very simple but cannot see what.
Also is my way of creating the object fine, or should I use a constructor and create the object from variables? E.g.
string Notes = String.Empty;
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
Notes = info.InnerText;
}
..
Customer cust = new Customer(id, other_variables, Notes, etc);
Your XPath query is wrong. You need to use td instead of //td:
foreach (HtmlNode info in row.SelectNodes("td"))
Passing //td to SelectNodes() matches all <td> elements in the document, so your inner loop runs 8 times instead of 4, and the last 4 iterations always overwrite the values previously set on your Customer object.
See XPath Examples
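To illustrate the relative-path idea without extra packages, here is a minimal sketch that uses System.Xml.Linq and System.Xml.XPath instead of Html Agility Pack, so it assumes the table is well-formed XML. The Customer class is a stand-in for the one in the question and RowParser is a hypothetical name. It also adds each customer to a list rather than only printing it.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical stand-in for the Customer class from the question.
class Customer
{
    public string ID = "", DateAdded = "", TimeAdded = "", Notes = "";
}

static class RowParser
{
    // Assumes `xml` is a well-formed <table> element whose direct children are <tr> rows.
    public static List<Customer> Parse(string xml)
    {
        var customers = new List<Customer>();
        var table = XElement.Parse(xml);

        foreach (var row in table.XPathSelectElements("tr"))
        {
            var cust = new Customer();

            // "td" is evaluated relative to the current row, so only its own cells match.
            foreach (var cell in row.XPathSelectElements("td"))
            {
                switch ((string)cell.Attribute("class"))
                {
                    case "id": cust.ID = cell.Value; break;
                    case "date": cust.DateAdded = cell.Value; break;
                    case "time": cust.TimeAdded = cell.Value; break;
                    case "notes": cust.Notes = cell.Value; break;
                }
            }
            customers.Add(cust);
        }
        return customers;
    }
}
```

Whether you fill properties as here or pass the values to a constructor is mostly a matter of taste; the important part is that the inner query is relative to the row.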

How to parse this HTML text using htmlagilitypack?

So below are the lines of code,
<td class="line1left">SCN02_MS_AddNotes_CAM</td><td class="line1left">798 (6.14%)
</td><td class="line1left">0.9</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
<td class="line1left">SCN05_MS_UpdateCustomer_CAM</td><td class="line1left">888 (6.83%)
</td><td class="line1left">1.0</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
From the first block, I need to get SCN02_MS_AddNotes_CAM and 798. To get 798 I am using this code, but I am getting the (6.14%) also, which I don't want.
var content1 = doc1.DocumentNode.SelectNodes("//td[@class='line1left']")[1].InnerText;
I want to get 798 only. So can anybody help me?
I also want to know how to get the same values from the second block. I was under the impression that the number inside the brackets represents the different occurrences of the class line1left. But here it is representing the different InnerHtml elements.
[1]
Does anybody know how to get this to work?
Thanks a lot in advance.!
var line1left_list = (from d in document.DocumentNode.Descendants()
where d.Name == "td" && d.Attributes["class"] != null
&& (d.Attributes["class"].Value == "line1left")
select d);
foreach (HtmlNode line1left in line1left_list)
{
var _link = line1left.Descendants("a").FirstOrDefault();
string linkUrl = "";
string link = "";
if (_link != null)
{
linkUrl = _link.Attributes["href"].Value;
link = _link.InnerText;
}
}
It looks like you want the InnerText of all <td> tags with the class attribute of "line1left", unless that <td> has an <a> inside of it, in which case you want the InnerText of <a>.
Here is an example that will do just that. If the <td> has an <a>, then <a> is selected, otherwise <td> is selected.
HtmlDocument doc1 = new HtmlDocument();
doc1.Load("xmlfile2.xml");
var nodes = doc1.DocumentNode.SelectNodes("(//td[@class='line1left']/a) | (//td[@class='line1left' and not(a)])");
foreach(var node in nodes)
Console.WriteLine(node.InnerText.Trim());
This will select all the nodes in the document. You can use regular C# code to strip off the unwanted formatting on the individual values.
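As a rough sketch of that last step, a value such as "798 (6.14%)" can be cut at the first opening parenthesis and trimmed. CellText and LeadingValue are hypothetical names, and the sketch assumes the part you want always comes before the first "(".

```csharp
using System;

static class CellText
{
    // Hypothetical helper: keeps the text before the first '(' and trims whitespace.
    // Text without a parenthesis is returned trimmed but otherwise unchanged.
    public static string LeadingValue(string innerText)
    {
        int paren = innerText.IndexOf('(');
        string head = paren >= 0 ? innerText.Substring(0, paren) : innerText;
        return head.Trim();
    }
}
```

Applied to each node's InnerText in the loop above, this would leave values like SCN02_MS_AddNotes_CAM untouched and reduce the count cell to 798.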

Taking different Table of Elements with HtmlAgilityPack

I have this loop structure several times.
Table 1
<table>
<tbody>
<tr>
<th>titulo</th>
</tr>
</tbody>
</table>
Table 2
<table>
<tbody>
<tr>
<th>Texto</th>
<th>Texto</th>
<th>Texto</th>
<th>Texto</th>
</tr>
</tbody>
</table>
This pattern is repeated several times.
How do I put them into an array or a list so that I can get the values of each?
Short Demo using a Console App:
class Program
{
static void Main(string[] args)
{
HtmlDocument doc = new HtmlDocument();
doc.Load("Demo.html");
var result = doc.DocumentNode.SelectNodes("//table")
.Select(table => new //create anonymous type
{
Table = table,
HeaderNodes = table.SelectNodes("./tbody/tr/th").ToList() //the th subnodes
});
foreach (var table in result)
{
foreach (HtmlNode headerNode in table.HeaderNodes)
{
Console.WriteLine( headerNode.InnerText);
}
Console.WriteLine("--------------------------");
}
}
}
Output:
titulo
--------------------------
Texto
Texto
Texto
Texto
--------------------------

xpath and htmlagility pack

I figured it out! I will leave this posted just in case some other newbie like myself has the same question.
Answer: ("./td[2]/span[@class='smallfont']")
I am a novice at xpath and html agility. I am so close yet so far.
GOAL: to pull out 4:30am
by using the following with htmlagility pack:
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@id='weekdays']/tr[2]")){
string time = table.SelectSingleNode("./td[2]").InnerText;
I get it down to "\r\n\t\t\r\n\t\t\t4:30am\r\n\t\t\r\n\t", but when I try doing anything with the span I get XPath exceptions. What must I add to ("./td[2]") to end up with just 4:30am?
HTML
<td class="alt1 espace" nowrap="nowrap" style="text-align: center;">
<span class="smallfont">4:30am</span>
</td>
I don't know if LINQ is an option, but you could also have done something like this:
var time = string.Empty;
var html =
"<td class=\"alt1 espace\" nowrap=\"nowrap\" style=\"text-align: center;\"><span class=\"smallfont\">4:30am</span></td>";
var document = new HtmlDocument() { OptionWriteEmptyNodes = true, OptionOutputAsXml = true };
document.LoadHtml(html);
var timeSpan =
document.DocumentNode.Descendants("span").Where(
n => n.Attributes["class"] != null && n.Attributes["class"].Value == "smallfont").FirstOrDefault();
if (timeSpan != null)
time = timeSpan.InnerHtml;

Most efficient way to parse delimited into html table c#

I've got the following delimited string with pairs:
1,5|2,5|3,5
I want to create a table as follows:
<table>
<tr><td>1</td><td>5</td></tr>
<tr><td>2</td><td>5</td></tr>
<tr><td>3</td><td>5</td></tr>
</table>
What's the most efficient way in C#?
Parse the string (simple splitting should be enough) and I'd suggest using the .NET XML classes (or Html Agility Pack for the purists out there) to generate the table. Might be overkill vs building up the string manually especially for simple data but it is less verbose and should be easier to extend later.
Using LINQ to XML:
var str = "1,5|2,5|3,5";
var table =
new XElement("table",
str.Split('|')
.Select(pair =>
new XElement("tr",
pair.Split(',')
.Select(num => new XElement("td", num))
)
)
).ToString();
Yields the string:
<table>
<tr>
<td>1</td>
<td>5</td>
</tr>
<tr>
<td>2</td>
<td>5</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
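If the indented output is not wanted, the same LINQ to XML construction can be serialized without formatting by passing SaveOptions.DisableFormatting to ToString. A small sketch, where CompactTable and Build are hypothetical names:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class CompactTable
{
    // Builds the table from "a,b|c,d" style input and serializes it on a single line.
    public static string Build(string input) =>
        new XElement("table",
            input.Split('|').Select(pair =>
                new XElement("tr",
                    pair.Split(',').Select(num => new XElement("td", num)))))
        .ToString(SaveOptions.DisableFormatting);
}
```

A side benefit of building the markup through XElement is that cell values containing characters such as < or & are escaped automatically.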
Version 1: Straight-forward
String html = "<table>";
Array.ForEach<String>("1,5|2,5|3,5".Split('|'),r =>
{
html += "<tr>";
Array.ForEach(r.Split(','),c =>
{
html += String.Format("<td>{0}</td>", c);
});
html += "</tr>";
});
html += "</table>";
Untested, but something of the sort?
I take it back, battle tested and working.
Version two, less the delegate:
String html = "<table>";
foreach (String r in "1,5|2,5|3,5".Split('|'))
{
html += "<tr>";
foreach (String c in r.Split(','))
html += String.Format("<td>{0}</td>", c);
html += "</tr>";
}
html += "</table>";
Both versions in a working demo.
And another version, which uses StringBuilder:
If you are looking for an efficient way, you shouldn't use string concatenation; use StringBuilder instead:
private static string ToTable(string input)
{
var result = new StringBuilder(input.Length * 2);
result.AppendLine("<table>");
foreach (var row in input.Split('|'))
{
result.Append("<tr>");
foreach (var cell in row.Split(','))
result.AppendFormat("<td>{0}</td>", cell);
result.AppendLine("</tr>");
}
result.AppendLine("</table>");
return result.ToString();
}
Create an IList from your data using the String.Split method in the code-behind, as described above, then use the native DataList UI control and bind it by setting the control's DataSource property to your list.
<asp:DataList ID="YourDataList" RepeatLayout="Table" RepeatColumns="2" RepeatDirection="Horizontal" runat="server">
<ItemTemplate>
<%# Eval("value") %>
</ItemTemplate>
</asp:DataList>
