HtmlAgilityPack only parses table rows partially

I'm trying to parse the main <table> (the last one in the DOM tree) on this website: "https://aips.um.si/PredmetiBP5/Main.asp?Mode=prg&Zavod=77&Jezik=&Nac=1&Nivo=P&Prg=1571&Let=1"
I'm using HtmlAgilityPack and writing the code in C# in a WPF application in Visual Studio 17.
Right now I'm using this code:
iso = Encoding.GetEncoding("windows-1250");
web = new HtmlWeb()
{
AutoDetectEncoding = false,
OverrideEncoding = iso,
};
//http = https://aips.um.si/PredmetiBP5/Main.asp?Mode=prg&Zavod=77&Jezik=&Nac=1&Nivo=P&Prg=1571&Let=1
string http = formatLetnikLink(l.Attributes["onclick"].Value).ToString();
var htmlProgDoc = web.Load(http);
string s = htmlProgDoc.ParsedText;
htmlProgDoc.ParsedText correctly includes all the rows that are supposed to be in the last table. (I checked this for debugging, just in case the watch window was broken or something.)
I first tried to get all the tables on the website and realized that there are 6 <table></table> tags on it, even though you visually see only one. After debugging for a couple of hours, I realized that the main table is the last <table> in the DOM tree, and that the parser is not fully parsing all the <tr> tags that the table has. This is the problem: I need all the tr tags.
var tables = htmlProgDoc.DocumentNode.SelectNodes("//table");
There are 6 <table></table> tags, as expected, and every one of them is fully parsed, including all rows and columns, except the last one. For the last one it only parses the first two rows, and then the parser appears to append a </table> by itself. I also tried using the direct XPath selector copied from Firefox,
"/html/body/div/center[2]/font/font/font/table", instead of "//table".
It found the correct table, but the table also contained only the first 2 rows.
var theTableINeed = tables.Last();
//contains the correct table which I need, but with only the first two rows

The HTML on that page is malformed. One possible workaround is to cut the markup of the last table out of the raw page and parse it as a separate document.
var client = new WebClient();
string html = client.DownloadString(url);
int lastTableOpen = html.LastIndexOf("<table");
int lastTableClose = html.LastIndexOf("</table");
string lastTable = html.Substring(lastTableOpen, lastTableClose - lastTableOpen + 8);
Then use HtmlAgilityPack:
var table = new HtmlDocument();
table.LoadHtml(lastTable);
foreach (var row in table.DocumentNode.SelectNodes("//table//tr"))
{
Console.WriteLine(row.OuterHtml);
}
But I don't know if there are problems in the table itself.

Related

C# Xpath only returning first element

I am using HTMLAgilityPack to read and load an XML file. After the file is loaded, I want to insert the values from it into a database.
XML looks like this:
<meeting>
<jobname></jobname>
<jobexperience></jobexperience>
</meeting>
I'm trying to accomplish this using XPath statements within a foreach loop as seen here:
DataTable dt = new DataTable();
//Add Data Columns here
dt.Columns.Add("JobName");
dt.Columns.Add("JobExperience");
// Create a string to read the XML tag "job"
string xPath_job = "//job";
string xPath_job_experience = "//jobexperience";
/* Use a ForEach loop to go through all 'meeting' tags and get the values
from the 'JobName' and 'JobExperience' tags */
foreach (HtmlNode planned_meeting in doc.DocumentNode.SelectNodes("//meeting"))
{
DataRow dr = dt.NewRow();
dr["JobName"] = planned_meeting.SelectSingleNode(xPath_job).InnerText;
dr["JobExperience"] = planned_meeting.SelectSingleNode(xPath_job_experience).InnerText;
dt.Rows.Add(dr);
}
So the problem is that even though the foreach loop is going through every 'meeting' tag, it's getting the values from only the first 'meeting' tag.
Any help would be greatly appreciated!

So the problem is that even though the foreach loop is going through every 'meeting' tag, it's getting the values from only the first 'meeting' tag.
Yes, that's what the code does. An XPath expression that starts with // is evaluated from the document root regardless of the node you call it on, e.g. //job selects all job elements in the whole document.
So in your foreach loop you select all meeting elements in the whole document with
doc.DocumentNode.SelectNodes("//meeting")
and then, inside the loop, you select all //job and all //jobexperience elements in the whole document with
string xPath_job = "//job";
string xPath_job_experience = "//jobexperience";
SelectSingleNode then returns the first of those matches, which is the same element on every iteration. Hence the impression that you only get the first element.
So change the code so that the children of the current meeting element are selected (by removing the // operator):
string xPath_job = "job";
string xPath_job_experience = "jobexperience";
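To make the difference concrete, here is a minimal sketch. It uses the built-in System.Xml classes instead of HtmlAgilityPack so that it runs without extra packages; the element names follow the XML sample in the question (jobname, jobexperience), and the wrapping <meetings> element and the values are made up for illustration.

```csharp
using System;
using System.Xml;

class RelativeXPathDemo
{
    static void Main()
    {
        // Hypothetical sample data with two <meeting> elements.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<meetings>" +
            "<meeting><jobname>Developer</jobname><jobexperience>5</jobexperience></meeting>" +
            "<meeting><jobname>Tester</jobname><jobexperience>2</jobexperience></meeting>" +
            "</meetings>");

        foreach (XmlNode meeting in doc.SelectNodes("//meeting"))
        {
            // "//jobname" is evaluated from the document root, so it yields
            // the first <jobname> of the document on every iteration.
            string absolute = meeting.SelectSingleNode("//jobname").InnerText;

            // "jobname" is relative to the current <meeting> element.
            string relative = meeting.SelectSingleNode("jobname").InnerText;

            Console.WriteLine($"absolute: {absolute}, relative: {relative}");
        }
    }
}
```

The absolute path returns "Developer" for both meetings, while the relative path returns "Developer" and then "Tester". If the target elements are nested deeper than direct children, ".//jobname" (with a leading dot) searches the descendants of the current node only.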

Parsing with AngleSharp

I'm writing a program to parse some data from a website using AngleSharp. Unfortunately I didn't find any documentation, which makes it really hard to understand.
How can I get only the link using QuerySelectorAll? Right now I'm getting the whole <a ...>...</a> element with the name of the article.
1. Name of article
The method I'm using now:
var items = document.QuerySelectorAll("a").Where(item => item.ClassName != null && item.ClassName.Contains("object-title-a text-truncate"));
In the previous example I also used ClassName.Contains("object-name"), but table cells have no class at all. As I understand it, to select the right element I may also need some information about the parent. So here is the question: how can I get this '4' value from the table cell?
....<th class="strong">Room</th>
<td>4</td>....

Regarding your first question, here is an example that extracts the link address. There is also a related Stack Overflow post on this.
var source = @"<a href='http://kinnisvaraportaal-kv-ee.postimees.ee/muua-odra-tanaval-kesklinnas-valmiv-suur-ja-avar-k-2904668.html?nr=1&search_key=69ec78d9b1758eb34c58cf8088c96d10' class='object-title-a text-truncate'>1. Name of article</a>";
var parser = new HtmlParser();
var doc = parser.Parse(source);
var selector = "a";
var menuItems = doc.QuerySelectorAll(selector).OfType<IHtmlAnchorElement>();
foreach (var i in menuItems)
{
Console.WriteLine(i.Href);
}
For your second question, you can check the example in the documentation. Here is the code sample:
// Setup the configuration to support document loading
var config = Configuration.Default.WithDefaultLoader();
// Load the names of all The Big Bang Theory episodes from Wikipedia
var address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
// Asynchronously get the document in a new context using the configuration
var document = await BrowsingContext.New(config).OpenAsync(address);
// This CSS selector gets the desired content
var cellSelector = "tr.vevent td:nth-child(3)";
// Perform the query to get all cells with the content
var cells = document.QuerySelectorAll(cellSelector);
// We are only interested in the text - select it with LINQ
var titles = cells.Select(m => m.TextContent);
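For the specific Room cell in the question, one possible approach is to find the header cell by its text and then step to its next sibling. The sketch below is only an illustration under a few assumptions: the surrounding <table><tr> markup is invented around the fragment from the question, and it uses the HtmlParser.ParseDocument method and AngleSharp.Html.Parser namespace of newer AngleSharp releases (the answer above uses the older Parse method, so adjust to the version you have installed).

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

class RoomCellDemo
{
    static void Main()
    {
        // Hypothetical markup wrapping the fragment from the question.
        var html = "<table><tr><th class=\"strong\">Room</th><td>4</td></tr></table>";

        var document = new HtmlParser().ParseDocument(html);

        // Find the header cell whose text is "Room"...
        var roomHeader = document.QuerySelectorAll("th.strong")
            .FirstOrDefault(th => th.TextContent.Trim() == "Room");

        // ...then read the text of the element that directly follows it.
        var roomValue = roomHeader?.NextElementSibling?.TextContent.Trim();

        Console.WriteLine(roomValue);
    }
}
```

Matching on the header text rather than on a class keeps the lookup working even though the <td> itself carries no class; roomValue is null when no such header is found.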

JavaScript added rows not accessible in c#

I am making a web page in asp.net. I have created an HTML table with runat="server". I added some rows to it using JavaScript by getting values from input boxes. I need these rows accessible in code behind c#. When I use this table in code behind, it gives me just one row, which I added as a table heading in the HTML section. It does not consider the rows added in JavaScript. Is there any way to access these rows?
My HTML Code:
<table id="tblStaff" border="1" runat="server">
<tr>
<th>S. No.</th>
<th>Staff Name</th>
<th>Room</th>
<th>Phone No.</th>
<th>Remove</th>
</tr>
</table>
Javascript Code:
function addStaff()
{
var tbl = document.getElementById('tblStaff');
var lastRow = tbl.rows.length;
// if there's no header row in the table, then iteration = lastRow + 1
var iteration = lastRow;
var row = tbl.insertRow(lastRow);
var cell0 = row.insertCell(0);
var textNode = document.createTextNode(iteration);
cell0.appendChild(textNode);
var cell1 = row.insertCell(1);
var textNode = document.createTextNode(document.getElementById('txtStaffName').value);
cell1.appendChild(textNode);
var cell2 = row.insertCell(2);
textNode = document.createTextNode(document.getElementById('txtRoomNo').value);
cell2.appendChild(textNode);
var cell3 = row.insertCell(3);
textNode = document.createTextNode(document.getElementById('txtPhoneNo').value);
cell3.appendChild(textNode);
var lastCell = row.insertCell(4);
var el = document.createElement('input');
el.type = 'button';
el.name = 'btnDelete' + iteration;
el.id = 'btnDelete' + iteration;
el.value = 'Remove';
el.size = 40;
el.setAttribute('onclick', 'deleteRow(' + iteration + ');');
lastCell.appendChild(el);
}

I'd recommend letting the user add the data via a form, then sending it to the server and storing it wherever you want. If that succeeds, either add the re-rendered table to the response and render it on the client, or add just the new row on the client side. On the server side you cannot access the new rows, but you can access their data models, which should be sufficient.
If you want further advice, you should elaborate on why you want to access those rows on the server side.
EDIT:
Have a look at this fiddle. There you can see the structure your table needs to have. If you want to add a user, just add a new row in this format, with the user index in the name attribute incremented. When you hit submit, all users will be sent as an array to the server, where you can iterate through them and send an email.
Your Web API method could look like:
//route url/to/api
public void AddNewUser(User[] users)
{
//iterate through users and send email
}

Try this
using System.Web.UI.HtmlControls;
foreach (HtmlTableRow row in tblStaff.Rows)
{
string txtCell0 = row.Cells[0].InnerText;
string txtCell1 = row.Cells[1].InnerText;
}

It looks like your addStaff() function only adds to the table in the frontend. The C# is executed on the server, so the rows added by your JavaScript do not exist when your C# code looks for them.
As Florian Gl suggested, you could include an AJAX submit in your addStaff() function that also sends each row the JavaScript adds to the table in the browser to the C# backend. Once you have the data in your C#, you can proceed with sending the email, or even save it to a database.

Aspose.Words - MailMerge images

I am trying to loop through a Dataset, creating a page per item using Aspose.Words Mail-Merge functionality. The below code is looping through a Dataset - and passing some values to the Mail-Merge Execute function.
var blankDocument = new Document();
var pageDocument = new Document(sFilename);
...
foreach (DataRow row in ds.Tables[0].Rows){
var sBarCode = row["BarCode"].ToString();
var imageFilePath = HttpContext.Current.Server.MapPath("\\_temp\\") + sBarCode + ".png";
var tempDoc = (Document)pageDocument.Clone(true);
var fieldNames = new string[] { "Test", "Barcode" };
var fieldData = new object[] { imageFilePath, imageFilePath };
tempDoc.MailMerge.Execute(fieldNames, fieldData);
blankDocument.AppendDocument(tempDoc, ImportFormatMode.KeepSourceFormatting);
}
var stream = new MemoryStream();
blankDocument.Save(stream, SaveFormat.Docx);
// I then output this stream using headers,
// to cause the browser to download the document.
The mail merge item { MERGEFIELD Test } gets the correct data from the Dataset. However the actual image displays page 1's image on all pages using:
{ INCLUDEPICTURE "{MERGEFIELD Barcode }" \* MERGEFORMAT \d }
Say this is my data for the "Barcode" field:
c:\img1.png
c:\img2.png
c:\img3.png
Page one of this document displays c:\img1.png as text for the "Test" field, and the image that is shown is img1.png.
However, page 2 shows c:\img2.png as the text, but displays img1.png as the actual image.
Does anyone have any insight on this?
Edit: It seems this is more of a Word issue. When I toggle between Alt+F9 modes inside Word, the image actually displays c:\img1.png as the source. So that would be why it is being displayed on every page.
I've simplified it to:
{ INCLUDEPICTURE "{MERGEFIELD Barcode }" \d }
Also, added test data for this field inside Word's Mailings Recipient List. When I preview, it doesn't pull in the data, changing the image. So, this is the root problem.

I know this is an old question, but I would still like to answer it.
With Aspose.Words it is very easy to insert images while executing mail merge. To achieve this, simply use a merge field with a special name, like Image:MyImageFieldName.
https://docs.aspose.com/words/net/insert-checkboxes-html-or-images-during-mail-merge/#how-to-insert-images-from-a-database
Also, it is not required to loop through the rows in your dataset and execute mail merge for each row. Simply pass the whole data table to the MailMerge.Execute method and Aspose.Words will duplicate the template for each record in the data.
Here is a simple example of such template
After executing mail merge using the following code:
// Create dummy data.
DataTable dt = new DataTable();
dt.Columns.Add("FirstName");
dt.Columns.Add("LastName");
dt.Columns.Add("MyImage");
dt.Rows.Add("John", "Smith", @"C:\Temp\1.png");
dt.Rows.Add("Jane", "Smith", @"C:\Temp\2.png");
// Open template, execute mail merge and save the result.
Document doc = new Document(@"C:\Temp\in.docx");
doc.MailMerge.Execute(dt);
doc.Save(@"C:\Temp\out.docx");
The result will look like the following:
Disclosure: I work on the Aspose.Words team.

If this were Word doing the output (I'm not sure about Aspose), there would be two possible problems here.
INCLUDEPICTURE expects backslashes to be doubled up, e.g. "c:\\img2.png", or (somewhat less reliably) forward slashes, or Mac ":" separators on that platform. It may be OK if the data comes in via a field result as you are doing here, though.
INCLUDEPICTURE results have not updated automatically, "by design", since Microsoft modified a number of field behaviors for security reasons about 10 years ago. If you are merging to an output document, you can probably work around that by using the following nested fields:
{ INCLUDEPICTURE { IF TRUE "{ MERGEFIELD Barcode }" } }
or to remove the fields in the result document,
{ IF { INCLUDEPICTURE { IF TRUE "{ MERGEFIELD Barcode }" } } {
INCLUDEPICTURE { IF TRUE "{ MERGEFIELD Barcode }" } } }
All the { } need to be inserted with Ctrl+F9 in the usual way.
(Don't ask me where this use of "TRUE" is documented - as far as I know, it is not.)

Get text above table MS Word

This one is probably a little stupid, but I really need it. I have a document with 5 tables, and each table has a heading. The heading is regular text with no special styling. I need to extract the data from those tables plus the heading.
Currently, using MS Interop, I am able to iterate through each cell of each table using something like this:
app.Tables[1].Cell(2, 2).Range.Text;
But now I'm struggling to figure out how to get the text right above the table.
For the first table I need to get "I NEED THIS TEXT" and for the second table I need to get "And this one also please".
So, basically, I need the last paragraph before each table. Any suggestions on how to do this?

Mellamokb's answer gave me a hint and a good example of how to search through paragraphs. While implementing his solution I came across the "Previous" method, which does exactly what we need. Here's how to use it:
wd.Tables[1].Cell(1, 1).Range.Previous(WdUnits.wdParagraph, 2).Text;
Previous accepts two parameters. The first is the unit you want to move by, from this list: http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdunits.aspx
The second parameter is how many units you want to count back. In my case 2 worked. It looked like it should be 1, because the text is right before the table, but with 1 I got a strange special character: ♀, which looks like the female symbol.

You might try something along the lines of this. I compare the paragraphs to the first cell of the table, and when there's a match, grab the previous paragraph as the table header. Of course this only works if the first cell of the table contains a unique paragraph that would not be found in another place in the document:
var tIndex = 1;
var tCount = oDoc.Tables.Count;
var tblData = oDoc.Tables[tIndex].Cell(1, 1).Range.Text;
var pCount = oDoc.Paragraphs.Count;
var prevPara = "";
for (var i = 1; i <= pCount; i++) {
var para = oDoc.Paragraphs[i];
var paraData = para.Range.Text;
if (paraData == tblData) {
// this paragraph is at the beginning of the table, so grab previous paragraph
Console.WriteLine("Header: " + prevPara);
tIndex++;
if (tIndex <= tCount)
tblData = oDoc.Tables[tIndex].Cell(1, 1).Range.Text;
else
break;
}
prevPara = paraData;
}
Sample Output:
Header: I NEED THIS TEXT
Header: AND THIS ONE also please
