Screen scraping with HTMLAgility help please - c#

Last night when I asked about screen scraping I was given an excellent article link, and it has got me to this point. I have a few questions, however. I will post my code as well as the HTML source below. I am trying to grab the data from the data table and then send it to an SQL table. I have had success grabbing "Description / Widget 3.5", "Last Modified By / Joe", etc.; however, because the first two <tr> elements contain an img with src="/..." and alt="00721408", those numbers do not get grabbed. First, I am stuck on how to alter the code so that all the data in the table is grabbed. Second, what do I need to do next in order to prepare the data to be sent to an SQL table? My code is as follows:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Windows.Forms;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the html document
            var webGet = new HtmlWeb();
            var doc = webGet.Load("http://localhost");
            // Get all tables in the document
            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
            // Iterate all rows in the first table
            HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
            for (int i = 0; i < rows.Count; ++i)
            {
                // Iterate all columns in this row
                HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
                for (int j = 0; j < cols.Count; ++j)
                {
                    // Get the value of the column and print it
                    string value = cols[j].InnerText;
                    Console.WriteLine(value);
                }
            }
        }
    }
}
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td><img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td><img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2011, 8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>
Manu
</td></tr>
</table>
<p>
</body></html>

While fragile, something like this would work in your case. It basically just includes the text content of all image alt attributes as well:
// Iterate all rows in the first table
HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
for (int i = 0; i < rows.Count; ++i)
{
    // Iterate all columns in this row
    HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
    for (int j = 0; j < cols.Count; ++j)
    {
        var images = cols[j].SelectNodes("img");
        if (images != null)
        {
            foreach (var image in images)
            {
                if (image.Attributes["alt"] != null)
                    Console.WriteLine(image.Attributes["alt"].Value);
            }
        }
        // Get the value of the column and print it
        string value = cols[j].InnerText;
        Console.WriteLine(value);
    }
}
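On the second part of the question (getting the scraped values into an SQL table), a minimal sketch using a parameterized SqlCommand is shown below. The connection string, the Parts table, and its Label/Value columns are assumptions for illustration only; substitute your own schema.

using (var conn = new System.Data.SqlClient.SqlConnection("<your connection string>"))
{
    conn.Open();
    foreach (HtmlNode row in tables[0].SelectNodes(".//tr"))
    {
        HtmlNodeCollection cells = row.SelectNodes(".//td");
        string label = cells[0].InnerText.Trim();

        // Prefer the alt text of an image if the value cell contains one, otherwise use the cell text.
        HtmlNode img = cells[2].SelectSingleNode(".//img");
        string value = (img != null && img.Attributes["alt"] != null)
            ? img.Attributes["alt"].Value
            : cells[2].InnerText.Trim();

        using (var cmd = new System.Data.SqlClient.SqlCommand(
            "INSERT INTO Parts (Label, Value) VALUES (@label, @value)", conn))
        {
            cmd.Parameters.AddWithValue("@label", label);
            cmd.Parameters.AddWithValue("@value", value);
            cmd.ExecuteNonQuery();
        }
    }
}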

I'm a little confused as to what data you're trying to obtain, however...
you could try:
SelectNodes("//td[text()='Description']/../child::*[3]")
whose inner text should be "Widget 3.5"
SelectNodes("//td[text()='Manu-Country']/../child::*[3]")
whose inner text should be "United States"
etc. etc.
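As a rough sketch of how those expressions could be used from C# (variable names here are just for illustration):

HtmlNode description = doc.DocumentNode.SelectSingleNode("//td[text()='Description']/../child::*[3]");
if (description != null)
    Console.WriteLine(description.InnerText); // "Widget 3.5"

HtmlNode country = doc.DocumentNode.SelectSingleNode("//td[text()='Manu-Country']/../child::*[3]");
if (country != null)
    Console.WriteLine(country.InnerText); // "United States"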
Btw, just as a shameless plug, you should check out systemhtml.codeplex.com.
It's yet another HTML parser.

Related

C#; WebDriver: String array is coming up null when attempting to click using link text

I am trying to read the following list:
<ol class="sublist">
<li>
Sort Out Your Values
</li>
<li>
Establish Realistic Goals
</li>
<li>
Determine Your Monthly Net Income
This is the code I wrote for it, but currently, every time it runs, my string comes up empty. I want to get the inner text so that in my loop I can grab each link, click it, and return back to the previous screen.
IWebElement container = driver.FindElement(By.ClassName("sublist"));
IList<IWebElement> elements = driver.FindElements(By.TagName("a"));
string[] newlink = new string[elements.Count()];
for (int i = 0; i < newlink.Count(); i++)
{
    if (newlink[i] != null)
    {
        driver.FindElement(By.LinkText(newlink[i])).Click();
        driver.WaitForElement(By.CssSelector("[id$='hlnkPrint']"));
        driver.Navigate().Back();
    }
}
The script is able to run, but I was getting that the links were null, so I added a check to see if any of them were null, and it turns out all of them are.
I'm sure it has something to do with '.Text' or 'ToString', but I'm not sure where to implement that.
Thanks
There are a few issues with your code.
- You haven't set the values of newlink, just created the array.
- Count is a property, but you're using it as a method.
- The link text is the .Text property of an IWebElement, and you would need to access that.
- Your current code will likely click one link, and after going back it will throw a StaleElementReferenceException.
In the following:
- I set newlink to the Text values of the links found for elements.
- I then iterate through the array of link text.
IWebElement container = driver.FindElement(By.ClassName("sublist"));
IList<IWebElement> elements = container.FindElements(By.TagName("a"));
string[] newlink = new string[elements.Count];
// First pass: capture the link text of every anchor.
for (int i = 0; i < newlink.Length; i++)
{
    newlink[i] = elements[i].Text;
}
// Second pass: click each link by its text, wait, then navigate back.
for (int i = 0; i < newlink.Length; i++)
{
    if (newlink[i] != null)
    {
        driver.FindElement(By.LinkText(newlink[i])).Click();
        driver.WaitForElement(By.CssSelector("[id$='hlnkPrint']"));
        driver.Navigate().Back();
    }
}
You can use FindElement() on an IWebElement, so in this case, if you want to find elements that are children of container, you would use container.FindElements().

how to get values from dynamically generated text boxes?

I am trying to get the values of text boxes which I generated dynamically on page load and cloned using jQuery.
Every text box has a unique ID in the form of a matrix. For example, the text boxes of row one have the IDs textbox11, textbox12, textbox13, textbox14, etc.,
and for row two textbox21, textbox22, textbox23, and so on.
Is there any way to get the values?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

public partial class product_entry : System.Web.UI.Page
{
    int count;

    protected void Page_Load(object sender, EventArgs e)
    {
        int i, j;
        for (i = 0; i < 1; i++)
        {
            TableRow tr = new TableRow();
            tr.Attributes.Add("class", "tabrow");
            for (j = 0; j <= 8; j++)
            {
                TableCell tc = new TableCell();
                if (j == 0)
                {
                    tc.Controls.Add(new LiteralControl("<Button class=remove type=button>-</button>"));
                }
                if (j == 1)
                {
                    tc.Attributes.Add("class", "sno");
                }
                if (j == 2 || j == 3 || j == 4 || j == 5 || j == 6 || j == 7 || j == 8)
                {
                    TextBox tb = new TextBox();
                    tb.Style["width"] = "98%";
                    tc.Controls.Add(tb);
                }
                tr.Controls.Add(tc);
            }
            Table1.Controls.Add(tr);
        }
    }

    protected void Button1_Click(object sender, EventArgs e)
    {
        Label2.Text = TextBox1.Text;
        for (int i = 1; i <= count; i++)
        {
            for (int j = 1; j <= 7; j++)
            {
                TextBox aa = (TextBox)Pnl.FindControl("textbox" + i + j);
                Response.Write(aa.Text);
            }
        }
    }
}
I want to fetch the values of hundreds of text boxes generated using jQuery, using the loop designed above. Is there any way to do that?
You can add a specific CSS class to all dynamically generated text boxes on page load, and then access all of them with a jQuery class selector.
Use each: i is the position in the array, obj is the DOM object you are iterating (it can be accessed with the jQuery wrapper $(this) as well).
$('input.SomeClass').each(function (i, obj) {
    var textboxid = $(this).attr('id');  // the id of the current text box
    var textboxValue = $(this).val();    // get text inside text box
});
If an element has a unique ID, you can query that element using the selector format
$('#theExactId').val()
In HTML 4.01, IDs cannot start with a number. If your IDs really are 11, 12, etc you might want/need to prepend at least one alpha character.
If you were trying to get the data on the server side, the variables would be part of the HTTP POST.
To actually get the value you use
$('#theExactId').val()
An ASP.NET Literal doesn't add any markup to the page. Therefore you have to wrap your content in some container so that you can edit it via JavaScript.
You should also consider refactoring all "hard" references to control ids from within your JavaScript code. In ASP.NET, there's the concept of naming containers that ensure unique ids throughout the page. So if you have a control called
txtName
in your Content section, the real id on the client (as seen from JavaScript) will be e.g.
ctl00_txtName
if using master pages. The prefix is automatically added by the naming container to make the id unique.
Fortunately, every control has a property (server-side) called ClientID which will reveal the actual id that is available on the client. So if you need to access a control from client-side, make sure you always use ClientID property to get the name.
Like this:
var name = document.getElementById('<% =txtName.ClientID %>');
Make sure that your C# code that initially creates the text boxes and your jQuery that duplicates the tags assign a unique name to each tag. So you have tags like this:
<input name="textbox1" ... />
<input name="textbox2" ... />
<input name="textbox99" ... />
After the form is submitted via postback or Ajax, you can use the Request.Form collection to access the values of the text boxes like so:
foreach (string key in Request.Form.Keys)
{
    if (key.StartsWith("textbox"))
    {
        string currentValue = Request.Form[key];
    }
}
If you submit the form using GET, you need to access Request.QueryString instead.
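For that case, a minimal sketch mirroring the loop above (same pattern, just the other collection):

foreach (string key in Request.QueryString.Keys)
{
    if (key.StartsWith("textbox"))
    {
        // Read the value the same way, just from the query string collection.
        string currentValue = Request.QueryString[key];
    }
}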

List<custom> to Excel c#

Can anyone help me?
I have a structure
public struct Data
{
public string aaaAAA;
public string bbbBBB;
public string cccCCC;
...
...
}
then some code to bring data into a List, creating a new list, etc.
I then want to transport this to Excel, which I have done like this:
for (int r = 0; r < newlist.Count; r++)
{
    ws.Cells[row, 1] = newlist[r].aaaAAA;
    ws.Cells[row, 2] = newlist[r].bbbBBB;
    ws.Cells[row, 3] = newlist[r].cccCCC;
}
This works, but it is painfully slow. I am inputting over 12,000 rows and my structure has 85 elements (so each row has 85 columns of data).
Can anyone help make this quicker??
Thanks,
Timujin
If, as @juharr mentioned, you are able to use OpenXML, look at the ClosedXML library for creating Excel documents, found here.
Using your example above you could then use the following code:
var wb = new XLWorkbook();
var ws = wb.Worksheets.Add("Data_Test_Worksheet");
ws.Cell(1, 1).InsertData(newList);
wb.SaveAs(@"c:\temp\Data_Test.xlsx");
If you require a header row, then you would just have to add that manually, using something like the below (you would then start inserting your data rows from row 2):
PropertyInfo[] properties = newList.First().GetType().GetProperties();
List<string> headerNames = properties.Select(prop => prop.Name).ToList();
for (int i = 0; i < headerNames.Count; i++)
{
ws.Cell(1, i + 1).Value = headerNames[i];
}
On the performance requirement, this seems to be more performant than iterating through the array. I have done some basic testing on my side, and inserting 20,000 rows of a sample object containing 2 properties took a total of 1 second.
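Putting the two together, a rough sketch for the question's scenario might look like the code below. It assumes newList is the List<Data> from the question; since the Data struct exposes public fields rather than properties, GetFields() is used for the header names, and it assumes ClosedXML's InsertData accepts the list directly.

var wb = new XLWorkbook();
var ws = wb.Worksheets.Add("Data_Test_Worksheet");

// Header row from the member names of the Data struct (public fields, hence GetFields).
System.Reflection.FieldInfo[] fields = typeof(Data).GetFields();
for (int i = 0; i < fields.Length; i++)
{
    ws.Cell(1, i + 1).Value = fields[i].Name;
}

// Insert the data starting at row 2, below the header row.
ws.Cell(2, 1).InsertData(newList);

wb.SaveAs(@"c:\temp\Data_Test.xlsx");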

Export a large data query (60k+ rows) to Excel

I created a reporting tool as part of an internal web application. The report displays all results in a GridView, and I used JavaScript to read the contents of the GridView row-by-row into an Excel object. The JavaScript goes on to create a PivotTable on a different worksheet.
Unfortunately I didn't expect that the size of the GridView would cause overloading problems with the browser if more than a few days are returned. The application has a few thousand records per day, let's say 60k per month, and ideally I'd like to be able to return all results for up to a year. The number of rows is causing the browser to hang or crash.
We're using ASP.NET 3.5 on Visual Studio 2010 with SQL Server and the expected browser is IE8. The report consists of a gridview that gets data from one out of a handful of stored procedures depending on which population the user chooses. The gridview is in an UpdatePanel:
<asp:UpdatePanel ID="update_ResultSet" runat="server">
    <Triggers>
        <asp:AsyncPostBackTrigger ControlID="btn_Submit" />
    </Triggers>
    <ContentTemplate>
        <asp:Panel ID="pnl_ResultSet" runat="server" Visible="False">
            <div runat="server" id="div_ResultSummary">
                <p>This Summary Section is Automatically Completed from Code-Behind</p>
            </div>
            <asp:GridView ID="gv_Results" runat="server"
                HeaderStyle-BackColor="LightSkyBlue"
                AlternatingRowStyle-BackColor="LightCyan"
                Width="100%">
            </asp:GridView>
        </asp:Panel>
    </ContentTemplate>
</asp:UpdatePanel>
I was relatively new to my team, so I followed their typical practice of returning the sproc to a DataTable and using that as the DataSource in the code behind:
List<USP_Report_AreaResult> areaResults = new List<USP_Report_AreaResult>();
areaResults = db.USP_Report_Area(ddl_Line.Text, ddl_Unit.Text, ddl_Status.Text, ddl_Type.Text, ddl_Subject.Text, minDate, maxDate).ToList();
dtResults = Common.LINQToDataTable(areaResults);
if (dtResults.Rows.Count > 0)
{
    PopulateSummary(ref dtResults);
    gv_Results.DataSource = dtResults;
    gv_Results.DataBind();
(I know what you're thinking! But yes, I have learned much more about parameterization since then.)
The LINQToDataTable function isn't anything special, just converts a list to a datatable.
With a few thousand records (up to a few days), this works fine. The GridView displays the results, and there's a button for the user to click which launches the JScript exporter. The external JavaScript function reads each row into an Excel sheet, and then uses that to create a PivotTable. The PivotTable is important!
function exportToExcel(sMyGridViewName, sTitleOfReport, sHiddenCols) {
    //sMyGridViewName = the name of the grid view, supplied as text
    //sTitleOfReport = will be used as the page header if the spreadsheet is printed
    //sHiddenCols = the columns you want hidden when sent to Excel, separated by semicolon (i.e. 1;3;5).
    //              Supply an empty string if all columns are visible.
    var oMyGridView = document.getElementById(sMyGridViewName);
    //If no data is on the GridView, display alert.
    if (oMyGridView == null)
        alert('No data for report');
    else {
        var oHid = sHiddenCols.split(";"); //Contains an array of columns to hide, based on the sHiddenCols function parameter
        var oExcel = new ActiveXObject("Excel.Application");
        var oBook = oExcel.Workbooks.Add;
        var oSheet = oBook.Worksheets(1);
        var iRow = 0;
        //Export all non-hidden rows of the HTML table to Excel.
        for (var y = 0; y < oMyGridView.rows.length; y++) {
            if (oMyGridView.rows[y].style.display == '') {
                var iCol = 0;
                for (var x = 0; x < oMyGridView.rows(y).cells.length; x++) {
                    var bHid = false;
                    for (iHidCol = 0; iHidCol < oHid.length; iHidCol++) {
                        if (oHid[iHidCol].length != 0 && oHid[iHidCol] == x) {
                            bHid = true;
                            break;
                        }
                    }
                    if (!bHid) {
                        oSheet.Cells(iRow + 1, iCol + 1) = oMyGridView.rows(y).cells(x).innerText;
                        iCol++;
                    }
                }
                iRow++;
            }
        }
What I'm trying to do: Create a solution (probably client-side) that can handle this data and process it into Excel. Someone might suggest using the HtmlTextWriter, but afaik that doesn't allow for automatically generating a PivotTable and creates an obnoxious pop-up warning....
What I've tried:
- Populating a JSON object -- I still think this has potential, but I haven't found a way of making it work.
- Using a SqlDataSource -- I can't seem to use it to get any data back out.
- Paginating and looping through the pages -- mixed progress. Generally ugly, though, and I still have the problem that the entire dataset is queried and returned for each page displayed.
Update:
I'm still very open to alternate solutions, but I've been pursuing the JSON theory. I have a working server-side method that generates the JSON object from a DataTable. I can't figure out how to pass that JSON into the (external) exportToExcel JavaScript function....
protected static string ConstructReportJSON(ref DataTable dtResults)
{
    StringBuilder sb = new StringBuilder();
    sb.Append("var sJSON = [");
    for (int r = 0; r < dtResults.Rows.Count; r++)
    {
        sb.Append("{");
        for (int c = 0; c < dtResults.Columns.Count; c++)
        {
            sb.AppendFormat("\"{0}\":\"{1}\",", dtResults.Columns[c].ColumnName, dtResults.Rows[r][c].ToString());
        }
        sb.Remove(sb.Length - 1, 1); //Truncate the trailing comma
        sb.Append("},");
    }
    sb.Remove(sb.Length - 1, 1);
    sb.Append("];");
    return sb.ToString();
}
Can anybody show an example of how to carry this JSON object into an external JS function? Or any other solution for the export to Excel.
It's easy and efficient to write CSV files. However, if you need Excel, it can also be done in a reasonably efficient way, that can handle 60,000+ rows by using the Microsoft Open XML SDK's open XML Writer.
- Install the Microsoft Open XML SDK if you don't have it already (google "download microsoft open xml sdk")
- Create a Console App
- Add a reference to DocumentFormat.OpenXml
- Add a reference to WindowsBase
- Try running some test code like the below (it will need a few usings)
Just Check out Vincent Tan's solution at http://polymathprogrammer.com/2012/08/06/how-to-properly-use-openxmlwriter-to-write-large-excel-files/ ( Below, I cleaned up his example slightly to help new users. )
In my own use I found this pretty straight forward with regular data, but I did have to strip out "\0" characters from my real data.
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;
...

using (var workbook = SpreadsheetDocument.Create("SomeLargeFile.xlsx", SpreadsheetDocumentType.Workbook))
{
    List<OpenXmlAttribute> attributeList;
    OpenXmlWriter writer;

    workbook.AddWorkbookPart();
    WorksheetPart workSheetPart = workbook.WorkbookPart.AddNewPart<WorksheetPart>();

    writer = OpenXmlWriter.Create(workSheetPart);
    writer.WriteStartElement(new Worksheet());
    writer.WriteStartElement(new SheetData());

    for (int i = 1; i <= 50000; ++i)
    {
        attributeList = new List<OpenXmlAttribute>();
        // this is the row index
        attributeList.Add(new OpenXmlAttribute("r", null, i.ToString()));

        writer.WriteStartElement(new Row(), attributeList);

        for (int j = 1; j <= 100; ++j)
        {
            attributeList = new List<OpenXmlAttribute>();
            // this is the data type ("t"), with CellValues.String ("str")
            attributeList.Add(new OpenXmlAttribute("t", null, "str"));
            // it's suggested you also have the cell reference, but
            // you'll have to calculate the correct cell reference yourself.
            // Here's an example:
            //attributeList.Add(new OpenXmlAttribute("r", null, "A1"));

            writer.WriteStartElement(new Cell(), attributeList);
            writer.WriteElement(new CellValue(string.Format("R{0}C{1}", i, j)));

            // this is for Cell
            writer.WriteEndElement();
        }

        // this is for Row
        writer.WriteEndElement();
    }

    // this is for SheetData
    writer.WriteEndElement();
    // this is for Worksheet
    writer.WriteEndElement();
    writer.Close();

    writer = OpenXmlWriter.Create(workbook.WorkbookPart);
    writer.WriteStartElement(new Workbook());
    writer.WriteStartElement(new Sheets());

    // you can use object initialisers like this only when the properties
    // are actual properties. SDK classes sometimes have property-like properties
    // but are actually classes. For example, the Cell class has the CellValue
    // "property" but is actually a child class internally.
    // If the properties correspond to actual XML attributes, then you're fine.
    writer.WriteElement(new Sheet()
    {
        Name = "Sheet1",
        SheetId = 1,
        Id = workbook.WorkbookPart.GetIdOfPart(workSheetPart)
    });

    writer.WriteEndElement(); // Write end for the Sheets element
    writer.WriteEndElement(); // Write end for the Workbook element
    writer.Close();

    workbook.Close();
}
If you review that code you'll notice two major writes, first the Sheet, and then later the workbook that contains the sheet. The workbook part is the boring part at the end, the earlier sheet part contains all the rows and columns.
In your own adaptation, you could write real string values into the cells from your own data. Instead, above, we're just using the row and column numbering.
writer.WriteElement(new CellValue("SomeValue"));
Worth noting, the row numbering in Excel starts at 1 and not 0. Starting rows numbered from an index of zero will lead to "Corrupt file" error messages.
Lastly, if you're working with very large sets of data, never call ToList(). Use a data reader style methodology of streaming the data. For example, you could have an IQueryable and utilize it in a for each. You never really want to have to rely on having all the data in memory at the same time, or you'll hit an out of memory limitation and/or high memory utilization.
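As a minimal sketch of that streaming idea (the connection string, stored procedure name, and column handling here are placeholders, not from the original post):

// Stream rows with a SqlDataReader and write each one inside the Row/Cell loop above,
// instead of materializing everything with ToList() first.
using (var conn = new System.Data.SqlClient.SqlConnection("<your connection string>"))
using (var cmd = new System.Data.SqlClient.SqlCommand("dbo.USP_Report_Area", conn))
{
    cmd.CommandType = System.Data.CommandType.StoredProcedure;
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        int rowIndex = 1;
        while (reader.Read())
        {
            // Write one spreadsheet row per record here, e.g.
            // writer.WriteElement(new CellValue(reader[0].ToString()));
            rowIndex++;
        }
    }
}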
I would try to use displaytag to display the results. You could set it up to display a certain number per page, which should solve your overloading issue. Then, you can set displaytag to allow for an Excel export.
We typically handle this with an "Export" command button which is wired up to a server side method to grab the dataset and convert it to CSV. Then we adjust the response headers and the browser will treat it as a download. I know this is a server side solution, but you may want to consider it since you'll continue having timeout and browser issues until you implement server side record paging.
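A rough sketch of that server-side approach is below; the button handler name and the dtResults table are illustrative (the report's DataTable would need to be available or re-queried in the handler):

protected void btn_Export_Click(object sender, EventArgs e)
{
    // Send the DataTable as a CSV attachment; the browser treats the response as a download.
    Response.Clear();
    Response.ContentType = "text/csv";
    Response.AddHeader("Content-Disposition", "attachment; filename=report.csv");

    var sb = new StringBuilder();
    string[] headers = dtResults.Columns.Cast<DataColumn>().Select(c => c.ColumnName).ToArray();
    sb.AppendLine(string.Join(",", headers));

    foreach (DataRow row in dtResults.Rows)
    {
        string[] fields = row.ItemArray
            .Select(f => "\"" + f.ToString().Replace("\"", "\"\"") + "\"")
            .ToArray();
        sb.AppendLine(string.Join(",", fields));
    }

    Response.Write(sb.ToString());
    Response.End();
}

Note that the download requires a full postback, so the export button should not be registered as an async trigger of the UpdatePanel.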
Almost a week and a half after I began this problem, I've finally managed to get it all working to some extent. I'll hold off on marking an answer for now to see if anybody else has a more efficient, better 'best practices' method.
By generating a JSON string, I've divorced the JavaScript from the GridView. The JSON is generated in code behind when the data is populated:
protected static string ConstructReportJSON(ref DataTable dtResults)
{
    StringBuilder sb = new StringBuilder();
    for (int r = 0; r < dtResults.Rows.Count; r++)
    {
        sb.Append("{");
        for (int c = 0; c < dtResults.Columns.Count; c++)
        {
            sb.AppendFormat("\"{0}\":\"{1}\",", dtResults.Columns[c].ColumnName, dtResults.Rows[r][c].ToString());
        }
        sb.Remove(sb.Length - 1, 1); //Truncate the trailing comma
        sb.Append("},");
    }
    sb.Remove(sb.Length - 1, 1);
    return String.Format("[{0}]", sb.ToString());
}
Returns a string of data such as
[ {"Caller":"John Doe", "Office":"5555","Type":"Incoming", etc},
{"Caller":"Jane Doe", "Office":"7777", "Type":"Outgoing", etc}, {etc} ]
I've hidden this string by assigning the text to a Literal in the UpdatePanel using:
<div id="div_JSON" style="display: none;">
<asp:Literal id="lit_JSON" runat="server" />
</div>
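The code-behind assignment isn't shown in the post, but it would presumably be something along these lines (a sketch reusing the method above):

// After the data is bound, stash the JSON string in the hidden Literal
// so the client-side export function can read it from div_JSON.
lit_JSON.Text = ConstructReportJSON(ref dtResults);

The export button's client-side handler would then call something like exportToExcel_Pivot('div_JSON', 'Report Title', 'Area'), where the title and report-population arguments are whatever values apply.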
And the JavaScript parses that output by reading the contents of the div:
function exportToExcel_Pivot(sMyJSON, sTitleOfReport, sReportPop) {
    //sMyJSON = the name, supplied as text, of the hidden element that houses the JSON array.
    //sTitleOfReport = will be used as the page header if the spreadsheet is printed.
    //sReportPop = determines which business logic to create a pivot table for.
    var sJSON = document.getElementById(sMyJSON).innerHTML;
    var oJSON = eval("(" + sJSON + ")");

    // DEBUG Example Test Code
    // for (x = 0; x < oJSON.length; x++) {
    //     for (y in oJSON[x])
    //         alert(oJSON[x][y]); //DEBUG, returns field value
    //     alert(y); //DEBUG, returns column name
    // }

    //If no data is in the JSON object array, display alert.
    if (oJSON == null)
        alert('No data for report');
    else {
        var oExcel = new ActiveXObject("Excel.Application");
        var oBook = oExcel.Workbooks.Add;
        var oSheet = oBook.Worksheets(1);
        var oSheet2 = oBook.Worksheets(2);
        var iRow = 0;
        var iCol = 0;

        //Take the column names of the JSON object and prepare them in Excel
        for (header in oJSON[0]) {
            oSheet.Cells(iRow + 1, iCol + 1) = header;
            iCol++;
        }
        iRow++;

        //Export all rows of the JSON object to Excel
        for (var r = 0; r < oJSON.length; r++) {
            iCol = 0;
            for (c in oJSON[r]) {
                oSheet.Cells(iRow + 1, iCol + 1) = oJSON[r][c];
                iCol++;
            } //End column loop
            iRow++;
        } //End row
The string output and the JavaScript 'eval' parsing both work surprisingly fast, but looping through the JSON object is a little slower than I'd like.
I believe that this method would be limited to around 1 billion characters of data -- maybe less, depending on how memory testing works out. (I've calculated that I'll probably be looking at a maximum of 1 million characters per day, so that should be fine within one year of reporting.)

Separate datatable for each row in db - .NET

I have long tables generated by a DataGrid control that go beyond the page width. I would like to convert that into a separate table for each row, or a definition list where each field name is followed by the field value.
How would I do that?
This uses jQuery. If you have more than one table, you'll need to change it to accommodate that. Also, it just appends to the end of the document. If you want it elsewhere, find the element you want to place it after and insert it into the DOM at that point.
$(document).ready(
    function() {
        var headers = $('tr:first').children();
        $('tr:not(:first)').each(
            function(i, row) {
                var cols = jQuery(row).children();
                var dl = jQuery('<dl></dl>');
                for (var j = 0, len = headers.length; j < len; ++j) {
                    var dt = jQuery('<dt>');
                    dt.text(jQuery(headers[j]).text());
                    var dd = jQuery('<dd>');
                    dd.text(jQuery(cols[j]).text());
                    dl.append(dt).append(dd);
                }
                $('body').append(dl);
            }
        );
        $('table').remove();
    }
);
Here's a reference:
http://www.mail-archive.com/flexcoders@yahoogroups.com/msg15534.html
The google terms I think you want are "invert datagrid". You'll get lots of hits.
