I'm using HTML Agility Pack to scrape a web page into a DataTable. However, the page has several tables that use the same column names, so the columns cannot be added again for the second table.
An error is thrown because the "2020" column has already been added.
My code is below:
public void WebDataScrap()
{
try
{
//Get the content of the URL from the Web
const string url = "https://www.wsj.com/market-data/quotes/MY/XKLS/0146/financials/annual/cash-flow";
var web = new HtmlWeb();
var doc = web.Load(url);
const string classValue = "cr_dataTable"; //cr_datatable
//var nodes = doc.DocumentNode.SelectNodes($"//table[@class='{classValue}']") ?? Enumerable.Empty<HtmlNode>();
var resultDataset = new DataSet();
foreach (HtmlNode table in doc.DocumentNode.SelectNodes($"//table[@class='{classValue}']") ?? Enumerable.Empty<HtmlNode>())
{
var resultTable = new DataTable(table.Id);
foreach (HtmlNode row in table.SelectNodes("//tr"))
{
var headerCells = row.SelectNodes("th");
if (headerCells != null)
{
foreach (HtmlNode cell in headerCells)
{
resultTable.Columns.Add(cell.InnerText);
}
}
var dataCells = row.SelectNodes("td");
if (dataCells != null)
{
var dataRow = resultTable.NewRow();
for (int i = 0; i < dataCells.Count; i++)
{
dataRow[i] = dataCells[i].InnerText;
}
resultTable.Rows.Add(dataRow);
}
}
}
}
catch (Exception ex)
{
MessageBox.Show(ex.ToString());
}
}
The URL I am trying to scrape: https://www.wsj.com/market-data/quotes/MY/XKLS/0146/financials/annual/cash-flow
I tried adding a loop that skips a column if one with the same name already exists, but then I get an error saying the column cannot be found when I debug.
Is there a way to solve this? In the end I need to export the DataTable to a CSV/Excel file.
Thanks
I think you want to do this instead:
foreach (HtmlNode table in doc.DocumentNode.SelectNodes($"//table[@class='{classValue}']") ?? Enumerable.Empty<HtmlNode>())
{
var resultTable = new DataTable(table.Id);
// select all the headers and add them to the table
var headerCells = table.SelectNodes("thead/tr/th");
if (headerCells != null)
{
foreach (HtmlNode cell in headerCells)
{
resultTable.Columns.Add(cell.InnerText);
}
}
// select all the rows and add them to the table
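// SelectNodes returns null when nothing matches, so fall back to an empty sequence
foreach (HtmlNode row in table.SelectNodes("tbody/tr") ?? Enumerable.Empty<HtmlNode>())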
{
var dataCells = row.SelectNodes("td");
if (dataCells != null)
{
var dataRow = resultTable.NewRow();
for (int i = 0; i < dataCells.Count; i++)
{
dataRow[i] = dataCells[i].InnerText;
}
resultTable.Rows.Add(dataRow);
}
}
}
The header section and the data section each have their own loop rather than the header section being nested in the data loop. We're also being more explicit about where we want data from: the header should come from thead/tr/th and the data should come from tbody/tr.
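If the same header text can still occur twice inside a single table, one option is to make each column name unique before adding it. The following is a minimal sketch, assuming a suffix such as _2 is acceptable in the exported file. AddUniqueColumn is a hypothetical helper, not part of HtmlAgilityPack or System.Data.

```csharp
using System;
using System.Data;

public static class ColumnHelper
{
    // Hypothetical helper: adds a column and, if the name is already taken,
    // appends "_2", "_3", ... until the name is unique within the table.
    public static DataColumn AddUniqueColumn(DataTable table, string name)
    {
        // Fall back to a placeholder name for empty header cells (assumption).
        string baseName = string.IsNullOrWhiteSpace(name) ? "Column" : name.Trim();
        string candidate = baseName;
        int suffix = 2;

        // DataColumnCollection.Contains compares names case-insensitively.
        while (table.Columns.Contains(candidate))
        {
            candidate = $"{baseName}_{suffix++}";
        }

        return table.Columns.Add(candidate);
    }
}
```

In the header loop, resultTable.Columns.Add(cell.InnerText) would then become ColumnHelper.AddUniqueColumn(resultTable, cell.InnerText).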
So I'm trying to scrape some website data (specifically the first table here). I am using the table xpath, and trying to get the specific row data assigned to my model.
public static async Task<List<SuspensionModel>> GetSuspensionData()
{
var htmlDocument = new HtmlDocument();
var httpResponseMessage = await _httpClient.GetAsync(_2020SuspUrl);
await EnsureSuccessStatusCode(httpResponseMessage);
var SuspStatsAsHtml = await httpResponseMessage.Content.ReadAsStringAsync();
htmlDocument.LoadHtml(SuspStatsAsHtml);
var suspData = ParseTable(htmlDocument, "/html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr");
//return ;
}
private static List<SuspensionModel> ParseTable(HtmlDocument htmlDocument, string xPath)
{
var returnData = new List<SuspensionModel>();
foreach (HtmlNode row in htmlDocument.DocumentNode.SelectNodes(xPath))
{
HtmlNodeCollection cells = row.SelectNodes("td");
var arr = new String[7];
for (int i = 0; i < cells.Count; ++i)
{
arr[i] = cells[i].InnerText;
}
var susp = new SuspensionModel
{
IncidentDate = DateTime.Parse(arr[0]),
OffenderName = arr[1],
OffenderTeam = arr[2],
OffenseDesc = arr[3],
ActionDate = DateTime.Parse(arr[4]),
OffenseLength = arr[5],
SalaryLoss = int.Parse(arr[6])
};
returnData.Add(susp);
}
return returnData;
}
In my ParseTable method, where I am assigning values in my model, how can I access the specific cell data in the given row? Basically, I want to do something like:
foreach row, step through each cell and assign to the correct model value. As I have it now, my cells variable always returns null, so I assume I am not using HtmlAgilityPack correctly.
Any assistance is appreciated here!
I ended up resolving this. I was missing two things, and it turns out it wasn't related to HtmlAgilityPack.
I needed to add .Skip(1) to my foreach row so that it skipped the table header row.
foreach (HtmlNode row in htmlDocument.DocumentNode.SelectNodes(xPath).Skip(1))
I needed to fix my SalaryLoss value. I was assigning it as an int, but I needed to change that to a double as it was a currency value.
SalaryLoss = double.Parse(arr[6], System.Globalization.NumberStyles.Currency)
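For reference, here is a small sketch of the currency parsing step on its own. It assumes the cell text is formatted US-style (for example "$1,234.56"), which may not match the real page. It also uses decimal rather than double, which is a common choice for money but a change from the line above. CurrencyParser is a hypothetical name.

```csharp
using System;
using System.Globalization;

public static class CurrencyParser
{
    // Parses text such as "$1,234.56" into a decimal.
    // NumberStyles.Currency allows the currency symbol, thousands separators
    // and a decimal point. The culture is passed explicitly (assumed en-US)
    // so the result does not depend on the machine's regional settings.
    public static decimal Parse(string cellText)
    {
        var culture = CultureInfo.GetCultureInfo("en-US");
        return decimal.Parse(cellText.Trim(), NumberStyles.Currency, culture);
    }
}
```

If a cell may be empty or malformed, decimal.TryParse with the same style and culture arguments avoids the exception.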
I have a .csv file structured as shown below. The first row is the header. For each SomeID I need to add the NetCharge values together (or subtract them, if the code calls for it) and put each item into its own column according to the Code column.
Here's the file I receive:
SomeID,OrderNumber,Code,NetCharge,Total
23473,30388,LI 126.0000, 132.00
96021, 000111, LI, 130.00, 126.00
23473,30388,FU 6.0000, 132.00
4571A,10452,LI,4100.0000, 4325.0000
4571A,10452,FU,150.00,4325.0000
4571A,10452,DT,75.00,4325.0000
I need to insert the data into my SQL table, which is structured like this. This is what I'm aiming for:
ID     OrderNumber  LICode  LICodeValue  FUCode  FUCodeValue  DTCode  DTCodeValue  Total
23473  30388        LI      126.000      FU      6.0000       NULL    NULL         132.0000
4571A  10452        LI      4100.0000    FU      150.0000     DT      75.00        4325.0000
The rows for a SomeID will not always be grouped together the way the 4571A rows are. I basically need to iterate over this file and create one record for each SomeID, and I cannot find a way to do that with CsvHelper. I'm using C# and CsvHelper. This is what I have tried so far, but I cannot get back to a SomeID once I have moved on to the next one:
using (var reader = new StreamReader( @"C:\testFiles\some.csv" )) // verbatim string so the backslashes are not treated as escapes
using (var csv = new CsvReader( reader, CultureInfo.InvariantCulture ))
{
var badRecords = new List<string>();
var isRecordBad = false;
csv.Configuration.HasHeaderRecord = true;
csv.Configuration.HeaderValidated = null;
csv.Configuration.IgnoreBlankLines = true;
csv.Configuration.Delimiter = ",";
csv.Configuration.BadDataFound = context =>
{
isRecordBad = true;
badRecords.Add( context.RawRecord );
};
csv.Configuration.MissingFieldFound = ( s, i, context ) =>
{
isRecordBad = true;
badRecords.Add( context.RawRecord );
};
List<DataFile> dataFile = csv.GetRecords<DataFile>().ToList();
//initialize variable
string lastSomeId = "";
if (!isRecordBad)
{
foreach (var item in dataFile)
{
// check if its same record
if (lastSomeId != item.SomeID)
{
MyClass someClass = new MyClass();
lastSomeId = item.SomeID;
//decimal? LI = 0;//was going to use these as vars for calculations not sure I need them???
//decimal? DSC = 0;
//decimal? FU = 0;
someClass.Id = lastSomeId;
someClass.OrdNum = item.OrderNumber;
if (item.Code == "LI")
{
someClass.LICode = item.Code;
someClass.LICodeValue = item.NetCharge;
}
if (item.Code == "DT")
{
someClass.DTCode = item.Code;
someClass.DTCodeValue = item.NetCharge;
}
if (item.Code == "FU")
{
someClass.FUCode = item.Code;
someClass.FUCodeValue = item.NetCharge;
}
someClass.Total = (someClass.LICodeValue + someClass.FUCodeValue);
//check for other values to calculate
//insert record to DB
}
else
{
//Insert into db after manipulation of values
}
}
}
isRecordBad = false;
}//END Using
Any clues would be greatly appreciated. Thank you in advance.
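One possible approach is to read all records first and then group them by SomeID, so the position of the rows in the file no longer matters. The sketch below makes several assumptions: the DataFile and MyClass shapes are guessed from the snippet above (the question does not show them), NetCharge is treated as a decimal, and the total is a plain sum of the three codes. Pivot.BySomeId is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shapes: the question does not show DataFile or MyClass.
public class DataFile
{
    public string SomeID { get; set; }
    public string OrderNumber { get; set; }
    public string Code { get; set; }
    public decimal NetCharge { get; set; }
    public decimal Total { get; set; }
}

public class MyClass
{
    public string Id { get; set; }
    public string OrdNum { get; set; }
    public string LICode { get; set; }
    public decimal? LICodeValue { get; set; }
    public string FUCode { get; set; }
    public decimal? FUCodeValue { get; set; }
    public string DTCode { get; set; }
    public decimal? DTCodeValue { get; set; }
    public decimal Total { get; set; }
}

public static class Pivot
{
    // Builds one MyClass per SomeID, wherever its rows appear in the input.
    public static List<MyClass> BySomeId(IEnumerable<DataFile> records)
    {
        return records
            .GroupBy(r => r.SomeID)
            .Select(g =>
            {
                // Sum of NetCharge for one code within this SomeID,
                // or null when the code does not occur at all.
                decimal? SumFor(string code) =>
                    g.Any(r => r.Code == code)
                        ? g.Where(r => r.Code == code).Sum(r => r.NetCharge)
                        : (decimal?)null;

                decimal? li = SumFor("LI");
                decimal? fu = SumFor("FU");
                decimal? dt = SumFor("DT");

                return new MyClass
                {
                    Id = g.Key,
                    OrdNum = g.First().OrderNumber,
                    LICode = li.HasValue ? "LI" : null,
                    LICodeValue = li,
                    FUCode = fu.HasValue ? "FU" : null,
                    FUCodeValue = fu,
                    DTCode = dt.HasValue ? "DT" : null,
                    DTCodeValue = dt,
                    // Plain sum; change the sign here if a code must be subtracted.
                    Total = (li ?? 0m) + (fu ?? 0m) + (dt ?? 0m)
                };
            })
            .ToList();
    }
}
```

With the list from csv.GetRecords<DataFile>().ToList(), calling Pivot.BySomeId(dataFile) would give one object per SomeID that can then be inserted into the database.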
I have a log file like this:
This is the segment 1
============================
<MAINELEMENT><ELEMENT1>10-10-2013 10:10:22.444</ELEMENT1><ELEMENT2>1111</ELEMENT2>
<ELEMENT3>Message 1</ELEMENT3></MAINELEMENT>
<MAINELEMENT><ELEMENT1>10-10-2013 10:10:22.555</ELEMENT1><ELEMENT2>1111</ELEMENT2>
<ELEMENT3>Message 2</ELEMENT3></MAINELEMENT>
This is the segment 2
============================
<MAINELEMENT><ELEMENT1>10-11-2012 10:10:22.444</ELEMENT1><ELEMENT2>2222</ELEMENT2>
<ELEMENT3>Message 1</ELEMENT3></MAINELEMENT>
<MAINELEMENT><ELEMENT1>10-11-2012 10:10:22.555</ELEMENT1><ELEMENT2>2222</ELEMENT2>
<ELEMENT3>Message 2</ELEMENT3></MAINELEMENT>
How can I read this into a DataTable, completely excluding the "This is the segment 1" and "This is the segment 2" lines and the ====== lines?
I would like the DataTable to have the columns "ELEMENT1", "ELEMENT2" and "ELEMENT3", filled with the content between those tags in the order in which the lines appear.
The order of the records must not change when they are inserted into the table.
HtmlAgilityPack seems to be a good tool for what you need:
using System.Data;   // DataTable
using System.Linq;   // Where, Select
using HtmlAgilityPack;
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load("log.txt");
var dt = new DataTable();
bool hasColumns = false;
foreach (HtmlNode row in doc
.DocumentNode
.SelectNodes("//mainelement"))
{
if (!hasColumns)
{
hasColumns = true;
foreach (var column in row.ChildNodes
.Where(node => node.GetType() == typeof(HtmlNode)))
{
dt.Columns.Add(column.Name);
}
}
dt.Rows.Add(row.ChildNodes
.Where(node => node.GetType() == typeof(HtmlNode))
.Select(node => node.InnerText).ToArray());
}
}
}
You could do this, where stringData is the content of the file you have:
var array = stringData.Split(new[] { "============================" }, StringSplitOptions.RemoveEmptyEntries);
var document = new XDocument(new XElement("Root"));
foreach (var item in array)
{
if(!item.Contains("<"))
continue;
var subDocument = XDocument.Parse("<Root>" + item.Substring(0, item.LastIndexOf('>') + 1) + "</Root>");
foreach (var element in subDocument.Root.Descendants("MAINELEMENT"))
{
document.Root.Add(element);
}
}
var table = new DataTable();
table.Columns.Add("ELEMENT1");
table.Columns.Add("ELEMENT2");
table.Columns.Add("ELEMENT3");
var rows =
document.Descendants("MAINELEMENT").Select(el =>
{
var row = table.NewRow();
row["ELEMENT1"] = el.Element("ELEMENT1").Value;
row["ELEMENT2"] = el.Element("ELEMENT2").Value;
row["ELEMENT3"] = el.Element("ELEMENT3").Value;
return row;
});
foreach (var row in rows)
{
table.Rows.Add(row);
}
foreach (DataRow dataRow in table.Rows)
{
Console.WriteLine("{0},{1},{2}", dataRow["ELEMENT1"], dataRow["ELEMENT2"], dataRow["ELEMENT3"]);
}
I'm not sure where your problem is.
You can use XElement to read the XML and create the DataTable manually.
For reading the XML, see "Xml Parsing using XElement".
Then you can create the DataTable dynamically.
Here's an example of creating a DataTable in code:
https://sites.google.com/site/bhargavaclub/datatablec
But why do you want to use a DataTable? It has a lot of downsides.
My code returns a DataTable containing idstudent, avg, firstname, namecourse and date. The DataTable contains more than one row.
DataTable row = mn.selectProgram("programStudent", attributes);
JsonTrans responce = new JsonTrans();
responce.Convert(row);
//if (row.Rows.Count != 0)
//{
// foreach (DataRow result in row.Rows)
// {
// string idstudent = result["id"].ToString();
// string avgstudent = result["AVG"].ToString();
// string firstname = result["fname"].ToString();
// string date = result["date"].ToString();
// string namecourse = result["name"].ToString();
// }
//}
I tried:
public string Convert(DataTable row)
{
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
JsonWriter jsonWriter = new JsonTextWriter(sw);
jsonWriter.Formatting = Formatting.Indented;
jsonWriter.WriteStartArray();
if (row.Rows.Count != 0)
{
foreach (DataRow result in row.Rows)
{
jsonWriter.WriteStartObject();
string idstudent = result["id"].ToString();
jsonWriter.WritePropertyName("id");
jsonWriter.WriteValue(idstudent);
jsonWriter.WriteEndObject();
}
}
jsonWriter.WriteEndArray();
jsonWriter.Close();
sw.Close();
// return the JSON text collected in the StringBuilder
return sb.ToString();
}
How can I convert these rows to JSON like this:
[ {idstudent:"value" ,avg :"value" , avg : "value",firstname :"value"}
{idstudent:"value" ,avg :"value" , avg : "value",firstname :"value"}
{idstudent:"value" ,avg :"value" , avg : "value",firstname :"value"} ]
It looks like your code is very close to what you're trying to accomplish. If you want every column present in your DataRows, your foreach loop should iterate through all of the table's columns. (Note that I have changed your DataTable name here to table for clarity, and the DataRow name to row.)
foreach (DataRow row in table.Rows)
{
jsonWriter.WriteStartObject();
foreach (DataColumn col in table.Columns)
{
jsonWriter.WritePropertyName(col.ColumnName);
jsonWriter.WriteValue(row.IsNull(col) ? string.Empty : row[col].ToString());
}
jsonWriter.WriteEndObject();
}
Note that this example will write out empty strings where the source data is null. A DataRow stores missing values as DBNull.Value rather than null, which is why the check uses row.IsNull(col). If you want to leave null values out entirely, perform the row.IsNull(col) check prior to writing the property name and value.
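The variant that leaves null values out entirely could look like the sketch below. It assumes Json.NET (Newtonsoft.Json) is referenced, as in the question. TableJson and ToJsonSkippingNulls are hypothetical names, and every value is written as a string, as in the code above.

```csharp
using System;
using System.Data;
using System.IO;
using Newtonsoft.Json;

public static class TableJson
{
    // Writes each row as a JSON object and omits properties whose value is DBNull.
    public static string ToJsonSkippingNulls(DataTable table)
    {
        using (var sw = new StringWriter())
        using (var jsonWriter = new JsonTextWriter(sw))
        {
            jsonWriter.Formatting = Formatting.Indented;
            jsonWriter.WriteStartArray();

            foreach (DataRow row in table.Rows)
            {
                jsonWriter.WriteStartObject();
                foreach (DataColumn col in table.Columns)
                {
                    // Check before writing the property name, so nothing
                    // is emitted at all for a missing value.
                    if (row.IsNull(col))
                    {
                        continue;
                    }
                    jsonWriter.WritePropertyName(col.ColumnName);
                    jsonWriter.WriteValue(row[col].ToString());
                }
                jsonWriter.WriteEndObject();
            }

            jsonWriter.WriteEndArray();
            jsonWriter.Flush();
            return sw.ToString();
        }
    }
}
```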
My code is below:
string[] keys = { "myCustomUserControl.ascx", "myCustomUserControl.ascx.cs", "myCustomUserControl.ascx.designer.cs" };
string customUserControlName = CommonDataCalls.GetCustomUserControlName(keys);
UserControl objUserControl = (UserControl)this.LoadControl("~/UserControls/" + customUserControlName);
userControlPlaceHolder.Controls.Add(objUserControl);
The definition of GetCustomUserControlName is as below
public string GetCustomUserControlName(string[] keys)
{
try
{
string userConrolsPhysicalPtah = System.Web.HttpContext.Current.Server.MapPath("~/UserControls/");
DataTable objDataTable = new DataTable();
foreach (string key in keys)
{
objRequestVO.addObject("ACA_KEY", key);
CResponseVO objResponseVO = (CResponseVO)objGateway.ExecuteBusinessService(CConstant.ADMIN, CConstant.ASSEMBLY_INFO, CConstant.SELECT, objRequestVO);
DataSet objDataSet = (DataSet)objResponseVO.getObject("RES_DS");
cUserTrce objGeneral = new cUserTrce();
if (!objGeneral.IsNullOrEmptyDataset(objDataSet))
{
if (objDataTable.Rows.Count == 0)
{
objDataTable = objDataSet.Tables[0].Clone();
}
objDataTable.Rows.Add(objDataSet.Tables[0].Rows[0].ItemArray);
}
}
if (objDataTable != null && objDataTable.Rows.Count == 3)
{
string containerName = "usercontrols";
foreach (DataRow dr in objDataTable.Rows)
{
string userControlFileBlobUrl = dr["ACA_ASSEMBLY_PATH"].ToString();
string userControlFileName = dr["ACA_CLASS_NAME"].ToString();
Storage.Blob blobHandler = new Storage.Blob();
Stream blobstream = blobHandler.GetBlob(userControlFileBlobUrl, containerName);
if (!(File.Exists(userConrolsPhysicalPtah + userControlFileName)))
{
MemoryStream ms = (MemoryStream)blobstream;
FileStream outStream = File.OpenWrite(userConrolsPhysicalPtah + userControlFileName);
ms.WriteTo(outStream);
outStream.Flush();
outStream.Close();
}
}
string customUserControlName = (from DataRow row in objDataTable.Rows
where row["ACA_KEY"].ToString() == keys[0]
select row["ACA_CLASS_NAME"].ToString()).First();
return customUserControlName;
}
else
{
return null;
}
}
catch
{
return null;
}
}
The method basically copies the user controls to the virtual path at run time.
In the aspx.cs page I try to load the control dynamically.
I can see the file being copied to the virtual path, but this.LoadControl throws an exception saying: Could not load type 'myCustomUserControl'.
I am using an Azure web role.
What is wrong here?
I solved the bug. I am posting the fix here for anyone to refer to.
It's a one-word change, explained at
http://blog.kjeldby.dk/2008/11/dynamic-compilation-in-a-web-application/
In the .ascx file, change
CodeBehind="myCustomUserControl.ascx.cs"
to
CodeFile="myCustomUserControl.ascx.cs"
and it will start working.
Thanks to @Roopesh & @Kristoffer Brinch Kjeldby.