Unable to figure out XPath in HtmlAgilityPack - c#

I am trying to build my first C# application (one that can do more than just say "Hello world").
The HTML file has lots of tags (but only two h4 tags, both shown below).
Here is the part that I am interested in:
<table width="100%" height="400" border="0" align="center" cellpadding="0" cellspacing="0" bordercolor="#111111" background="images/page_bg.gif" style="BORDER-COLLAPSE: collapse">
<tbody valign="top">
<tr>
<td>
<table width="80%" border="0" valign=top background="images/page_bg.gif">
<tr>
<td>
<div align="center">
<h4 align="center">
<font face="Verdana, Arial, Helvetica, sans-serif" size="2">
<b>
<font size="4" face="Arial, Helvetica, sans-serif">
UNWANTED TEXT
</font></b></font></h4>
<p><br />
Name : {NAME HERE} <br>Number : {NUMBERS HERE}<br>Number2 : {NUMBERS2}<br><br><h4>UNWANTED TEXT</h4><br>detail NO. : <span class=style7>{NUmbers3}</span><br><br><a href=http://test.xom>UNWANTED TEXT</a><br><br>
</p>
<p class="content"><em><strong>
<p> </p>
I wish to get NAME, Numbers1, Numbers2 and Numbers3. So I guess I have to do something like this:
//div[@align = "centre"]/h4/followingsibling::Text();
but surely it is incomplete. Any ideas on how I should do it? I got the XPath from Firebug:
/html/body/table/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/div/table/tbody/tr/td/table/tbody/tr/td/div/h4
I have also tried the following (just to get the raw data first, then trim it further):
HtmlNodeCollection node = doc.DocumentNode.SelectNodes("//table[@height='400']//div[@align='centre']//p");
foreach(HtmlNode node1 in node)
textBox1.Text += node1.InnerText;
But the node collection here comes back as null.
Any help is greatly appreciated.

Firefox adds a tbody tag to tables (the tag may be absent in the original HTML). So I would suggest not writing out the whole path: find the most characteristic part of the path and use //.
For example: //div[@class='data']/table//tr/td

Did you notice that you have @align="centre" but the HTML has align="center" (as in, British vs. US spelling)?
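Putting the two answers together, a minimal sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced. The names DetailScraper and Extract are made up for illustration, and the "label : value" splitting is only a guess based on the snippet in the question. It walks every text node under the centered div, so it does not depend on how the parser nests the p and h4 tags.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class DetailScraper
{
    // Returns label -> value pairs such as "Name" -> "{NAME HERE}".
    public static Dictionary<string, string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();

        // Note the spelling: the markup says align="center", not "centre".
        var textNodes = doc.DocumentNode.SelectNodes("//div[@align='center']//text()");
        if (textNodes == null)
            return result; // SelectNodes returns null when nothing matches

        string pendingLabel = ""; // label whose value sits in the next text node
        foreach (var node in textNodes)
        {
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length == 0)
                continue; // whitespace between tags

            if (pendingLabel.Length > 0)
            {
                // e.g. "detail NO. :" followed by the text inside <span>
                result[pendingLabel] = text;
                pendingLabel = "";
                continue;
            }

            int colon = text.IndexOf(':');
            if (colon < 0)
                continue; // no label here, e.g. "UNWANTED TEXT"

            string label = text.Substring(0, colon).Trim();
            string value = text.Substring(colon + 1).Trim();
            if (value.Length > 0)
                result[label] = value;
            else
                pendingLabel = label;
        }
        return result;
    }
}
```

Calling DetailScraper.Extract(html) and reading result["Name"], result["Number"] and so on would then replace the manual trimming. If the real values can themselves contain a colon, the splitting would need to be adjusted.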

Related

Gembox.Document: Convert HTML to Pdf, border css does not work

I am using GemBox.Document to convert HTML to PDF. This is my HTML:
<div style="border-top:5px solid black;border-left:5px solid black;padding:2px 15px;font-size:18px;font-weight:bold;line-height:22px; background-color:aquamarine;width:230px">
Test
</div>
But in the PDF, the border is lost.
Do you know how to fix this problem?
Edit: I have added sample code (test HTML link: https://www.gemboxsoftware.com/document/examples/c-sharp-convert-html-to-pdf/307):
<!DOCTYPE html>
<html>
<body>
<table style="width:100%;" cellpadding="0" cellspacing="0">
<tr>
<td style="width:50%;">
<div style="border-top:5px solid black;border-left:5px solid black;padding:2px 15px;font-size:18px;font-weight:bold;line-height:22px; background-color:aquamarine;width:230px">
Test
</div>
</td>
<td style="width:50%;font-size:18px;font-weight:bold;line-height:22px;">Number</td>
</tr>
</table>
<div style="border-top:5px solid black;border-left:5px solid black;padding:2px 15px;font-size:18px;font-weight:bold;line-height:22px; background-color:aquamarine;width:230px">
Test
</div>
</body>
</html>
Edit 2: I found a workaround. It is not pretty, but at least it works for me:
<table style="width:100%;" cellpadding="0" cellspacing="0">
<tr>
<td style="width:50%;">
<table style="width:150px;margin:0px" cellpadding="0" cellspacing="0">
<tr>
<td style="border-top:1px solid black;border-left:1px solid black;">
<div style="padding:2px 15px;font-size:18px;font-weight:bold;line-height:22px;">
Test
</div>
</td>
</tr>
</table>
</td>
<td style="width:50%;font-size:18px;font-weight:bold;line-height:22px;">Number</td>
</tr>
</table>
What version are you using? Perhaps you should try again with the current latest bugfix version, from here.
I tried converting this HTML:
And I get this PDF:
As you may notice, the top and left borders are there. All the other CSS properties except "width" are applied as well.
Lastly, I also tried "display" and "visibility", and both seem to work.

UI distorted in Outlook

My application generates an email that renders perfectly when opened in a browser (for example, Chrome). But when the same email is opened in Microsoft Outlook, it gets heavily distorted (for example, text is not visible and button text gets wrapped). Any suggestions as to what the problem could be? I have verified that all the scripting (JS and CSS) is inline, i.e. on the .aspx page.
Email when opened in Outlook :
Email when opened in Web browser :
HTML Code
<table class="footer" style="border-collapse: collapse;border-spacing: 0;width: 100%;background-color: #f6f9fb">
<tbody>
<tr>
<td class="inner" style="padding: 0;vertical-align: top;padding-top: 60px;padding-bottom: 55px" align="center">
<table class="cols" style="border-collapse: collapse;border-spacing: 0;width: 600px">
<tbody>
<tr>
<td class="left" style="padding: 0;vertical-align: top;font-size: 11px;font-weight: 400;letter-spacing: 0.01em;line-height: 17px;padding-bottom: 22px;text-align: left;width: 35%;padding-right: 5px;color: #b3b3b3;font-family: sans-serif">
<table class="social" style="border-collapse: collapse;border-spacing: 0">
<tbody>
<tr>
<td colspan="3">
<p style="padding: 0;vertical-align: top;font-size: 11px;font-weight: 400;letter-spacing: 0.01em;line-height: 17px;padding-bottom: 4px;padding-left: 5px;color: #b3b3b3;font-family: sans-serif;text-transform:none;"><strong>Test Inc.</strong><br/>1234 Road Parkway<br/>Houston, Texas 77077<br/>1-811-811-9611<br/><br/>
<img style="border: 0;-ms-interpolation-mode: bicubic;display: block;max-width: 200px" src="SomeURL" alt="myatomDirect" width="135" height="58" border="0" />
</p>
</td>
</tr>
</tbody>
</table>
</td>
<td class="right" style="padding: 0;vertical-align: top;font-size: 11px;font-weight: 400;letter-spacing: 0.01em;line-height: 17px;padding-bottom: 5px;text-align: right;width: 65%;padding-left: 5px;color: #b3b3b3;font-family: sans-serif">
<div id="campaign">
<p style="padding: 0;vertical-align: top;font-size: 11px;font-weight: 400;letter-spacing: 0.01em;line-height: 17px;padding-bottom: 10px;padding-left: 5px;color: #b3b3b3;font-family: sans-serif;text-transform:none;">You are receiving this email because you registered for an account on
. Please do not reply to this message; it was sent from an unmonitored e-mail address. This message is a service e-mail related to your use of . For general inquiries or to request support with your account, please email us at
SomeURL.</p>
</div>
</td>
</tr>
<tr>
<td colspan="2">
<p style="padding: 0;vertical-align: top;font-size: 11px;font-weight: 400;letter-spacing: 0.01em;line-height: 17px;padding-bottom: 15px;text-align: center;padding-left: 5px;color: #b3b3b3;font-family: sans-serif;text-transform:none;"></p>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Any suggestions will be highly appreciated.
Outlook uses Microsoft Word to render emails, so certain issues can arise in Outlook that do not appear in any other email client. You need to be extra careful with your emails.
<table class="cols" width="600" style="border-collapse: collapse; border-spacing: 0;">
Try setting the width through the HTML width attribute, without the px.
This has been discussed here
Tips to bridge Outlook hurdles
In case the primary font is not available on the subscriber’s device, Outlook tends to render the entire email copy in Times New Roman, ignoring the specified fallback font. In such cases, you need to force Outlook to render the specified fallback font using a conditional comment.
<!--[if mso]> <style> h1 {
font-family: Primary font, Fallback font;
} </style><![endif]-->
Outlook will not automatically wrap text into the tables you create. Instead, table cells will widen to try and accommodate large URLs or other unbroken text strings. To avoid this, you can include the following:
<td style="word-break:break-all;">
Hope this works for you.

Cannot find elements within a div tag - Selenium

<div id="template_data">
<input id="reservaActual" type="hidden">
<input id="_ctl0_data_holder_idResActual" name="_ctl0:data_holder:idResActual" type="hidden">
<table id="MainScreen" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<tr>
<td id="panizq" class="PanelIzq" valign="top">
Within the above HTML, I am able to find template_data, which is a div tag, but I am unable to find anything within it: not the input with id="_ctl0_data_holder_idResActual", nor the table with id MainScreen, nor the td with id panizq.
I've tried XPath, CSS selectors and ids, but no luck. I want to start from scratch and get ideas on what I can try, because I am not sure if I need to go through the levels. I would like to see how people on SO tackle this issue and go through it with somebody.
Below is my code:
public void SelectAgency()
{
_driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(20));
_driver.FindElement(By.Id("template_data"));
}

How to set accessibility attributes within .NET 's MenuItems, _without_ JavaScript

I am working on a sidenav that is built on .NET MenuItems like so:
<asp:MenuItem value="19" Text="Profile" Selectable="false"></asp:MenuItem>
<asp:MenuItem value="0" Text="Overview" ToolTip="Overview" Selected="true"></asp:MenuItem>
<asp:MenuItem value="2" Text="My Info & Email Subscriptions" ToolTip="My Info & Email Subscriptions"></asp:MenuItem>
In HTML, the output produces a series of nested tables around each MenuItem which looks like this:
<div id="_links" class="span-3">
<table id="FormUserControl__tabMenu" cellpadding="0" cellspacing="0" border="0" style="clear:left;">
<tbody>
<tr id="FormUserControl__tabMenun0">
<td>
<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tbody>
<tr>
<td style="white-space:nowrap;width:100%;">
<a style="text-decoration:none;">
<div id="FormUserControl__tabMenu_ctl00__tabMenuItemPanel" class="myAccountHeading ">
Profile
</div>
</a>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr onmouseover="Menu_HoverStatic(this)" onmouseout="Menu_Unhover(this)" onkeyup="Menu_Key(this)" title="Overview" id="FormUserControl__tabMenun1">
<td>
<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tbody>
<tr>
<td style="white-space:nowrap;width:100%;"><a href="javascript:__doPostBack('FormUserControl$_tabMenu','0')" style="text-decoration:none;">
<div id="FormUserControl__tabMenu_ctl01__tabMenuItemPanel" class="sideNav">
Overview
</div>
</a>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>
How can I add an accessibility role and aria-level to these innermost divs? The goal is to achieve accessibility compliance. For example:
<div role="heading" aria-level="[2]">Profile</div>
I have looked through the MSDN documentation, and it looks like there isn't a way to add those attributes within the initial MenuItem declaration.
I also tried adding the role and aria-level attributes within CSS, which I know is hacky, but I figured since content can be set, it was worth trying. That doesn't work.
I could readily do this in JavaScript and I know how, but I really want to avoid that; it's a last resort.
Is there a way to change the MenuItem output to involve role and aria-level? Or, is there a way to have it output a header instead of a div nested within two tables?
Many thanks!

ASP.NET: How to extract a specific value from a table html source?

I want to extract the movie name from each row of IMDb's Box Office table.
Example HTML table row:
<tr class="chart_even_row">
<td style="text-align: right;">
<b>1</b>
</td>
<td>
<img border="0" src="http://ia.media-imdb.com/images/M/MV5BMjA4NDg3NzYxMF5BMl5BanBnXkFtZTcwNTgyNzkyNw@@._V1._SY30_SX23_.jpg" width="20" height="30">
</td>
<td>
<a href="/title/tt1392170/" >The Hunger Games</a> (2012)
</td>
<td style="text-align: right; padding-right: 20px;">$155M
</td>
<td style="text-align: right;">
$155M
</td>
<td style="text-align: center;">
1
</td>
</tr>
The value I want to extract is "The Hunger Games".
I need C# code that would achieve this for me.
NOTE: I want to do this via regex.
Thanks in advance,
Rashad.
Screen scraping the IMDB is complicated, fragile, and forbidden. The IMDB provides plain-text data files you can use instead at http://www.imdb.com/interfaces
Update
Allow me to reiterate: screen scraping and data mining IMDB.com is in violation of their terms of use.
Regarding Regex: see this answer.
So if you're dead-set on doing this in violation of the IMDB's terms of use, the HTML Agility Pack is probably the best way to go.
Try copying and pasting the code into a single HTML file. If you have too many pages to fetch, try writing code that reads them through the HTML Agility Pack.
You can find html agility pack here http://htmlagilitypack.codeplex.com/
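As a rough illustration of the HTML Agility Pack suggestion, here is a sketch that works on markup you already have locally, such as the sample row in the question. It assumes the HtmlAgilityPack NuGet package is referenced. ChartParser and MovieTitles are hypothetical names, and the XPath assumes every chart row has a class starting with "chart_" (only chart_even_row actually appears in the question).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class ChartParser
{
    // Returns the link text of every title link found in a chart row.
    public static List<string> MovieTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();

        // Assumption: rows are marked chart_even_row / chart_odd_row or similar,
        // and the movie link always points at /title/...
        var links = doc.DocumentNode.SelectNodes(
            "//tr[starts-with(@class,'chart_')]//a[starts-with(@href,'/title/')]");
        if (links == null)
            return titles; // SelectNodes returns null when nothing matches

        foreach (var link in links)
        {
            // InnerText is only the anchor text, so "(2012)" is not included.
            titles.Add(HtmlEntity.DeEntitize(link.InnerText).Trim());
        }
        return titles;
    }
}
```

Compared with a regex, the XPath keeps working if the attribute order or whitespace inside the row changes, which is the main reason the answers steer away from regex here.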
