I am currently reading in an HTML document using CsQuery. This document has several HTML tables and I need to read in the data while preserving the structure. At the moment, I simply have a List of List of List of strings. This is a list of tables containing a list of rows containing a list of cells containing the content as a string.
List<List<List<string>>> page_tables = document_div.Cq().Find("TABLE")
.Select(table => table.Cq().Find("TR")
.Select(tr => tr.Cq().Find("td")
.Select(td => td.InnerHTML).ToList())
.ToList())
.ToList();
Is there a better way to store this data, so I can easily access particular tables, and specific rows and cells? I'm writing several methods that deal with this page_tables object so I need to nail down its formulation first.
Is there a better way to store this data, so I can easily access particular tables, and specific rows and cells?
On most occasions, well-formed HTML fits nicely into an XML structure, so you could store it as an XML document. LINQ to XML would make querying very easy:
XDocument doc = XDocument.Parse("<html>...</html>");
var cellData = doc.Descendants("td").Select(x => x.Value);
Based on the comments I feel obliged to point out that there are a couple of other scenarios where this can fall over such as
When HTML-encoded content like &nbsp; is used
Valid HTML which doesn't require a closing tag e.g. <br> is used
(With that said, these things can be handled by some pre-processing)
To summarise, it's by no means the most robust approach. However, if you can be sure that the HTML you are parsing fits the bill, then it would be a pretty neat solution.
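As a rough sketch of how the XML approach could reproduce the nested table/row/cell structure from the question: the class name XmlTableReader and the method ReadTables are made up for illustration, and the code assumes lower-case element names, no XML namespace, and input that is already well-formed XML.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class XmlTableReader
{
    // Returns tables -> rows -> cell text for a well-formed XHTML fragment.
    // Assumes lower-case tag names; XML element names are case-sensitive.
    public static List<List<List<string>>> ReadTables(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        return doc.Descendants("table")
            .Select(table => table.Descendants("tr")
                // Elements() only looks at direct children of the row.
                .Select(tr => tr.Elements("td")
                    .Select(td => td.Value)
                    .ToList())
                .ToList())
            .ToList();
    }
}
```

Note that Descendants("tr") also picks up rows of any nested tables, so this only mirrors the page structure when tables are not nested.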
You could go fully OOP and write some model classes:
// Code kept short, minimal ctors
public class Cell
{
public string Content {get;set;}
public Cell() { this.Content = string.Empty; }
}
public class Row
{
public List<Cell> Cells {get;set;}
public Row() { this.Cells = new List<Cell>(); }
}
public class Table
{
public List<Row> Rows {get;set;}
public Table() { this.Rows = new List<Row>(); }
}
And then fill them up, for example like this:
var tables = new List<Table>();
foreach(var table in document_div.Cq().Find("TABLE"))
{
var t = new Table();
foreach(var tr in table.Cq().Find("TR"))
{
var r = new Row();
foreach(var td in tr.Cq().Find("td"))
{
var c = new Cell();
c.Content = td.InnerHTML;
r.Cells.Add(c);
}
t.Rows.Add(r);
}
tables.Add(t);
}
// Assuming the HTML was correct, now you have a cleanly organized
// class structure representing the tables!
var aTable = tables.First();
var firstRow = aTable.Rows.First();
var firstCell = firstRow.Cells.First();
var firstCellContents = firstCell.Content;
...
I'd probably choose this approach because I always prefer to know exactly what my data looks like, especially if/when I'm parsing from external/unsafe/unreliable sources.
Is there a better way to store this data, so I can easily access
particular tables, and specific rows and cells?
If you want to easily access table data, then create a class that holds the data from a table row, with nicely named properties for the corresponding columns. E.g. if you have a users table
<table>
<tr><td>1</td><td>Bob</td></tr>
<tr><td>2</td><td>Joe</td></tr>
</table>
I would create the following class to hold row data:
public class User
{
public int Id { get; set; }
public string Name { get; set; }
}
The second step would be parsing users from HTML. I suggest using HtmlAgilityPack (available from NuGet) for parsing HTML:
HtmlDocument doc = new HtmlDocument();
doc.Load("index.html");
var users = from r in doc.DocumentNode.SelectNodes("//table/tr")
let cells = r.SelectNodes("td")
select new User
{
Id = Int32.Parse(cells[0].InnerText),
Name = cells[1].InnerText
};
// NOTE: you can check cells count before accessing them by index
Now you have a collection of strongly-typed user objects (you can save them to a list, an array or a dictionary, depending on how you are going to use them). E.g.
var usersDictionary = users.ToDictionary(u => u.Id);
// Getting user by id
var user = usersDictionary[2];
// now you can read user.Name
Since you're parsing an HTML table, could you use an ADO.NET DataTable? If the content doesn't have too many row or column spans this may be an option: you wouldn't have to roll your own, and it could easily be saved to a database, a list of entities or whatever. Plus you get the benefit of strongly typed columns. As long as the HTML tables are consistent, I would prefer an approach like this, because it makes interoperability with the rest of the framework seamless and is a ton less work.
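A minimal sketch of what loading already-extracted cell text into a DataTable might look like. The helper name HtmlTableConverter and its parameters are hypothetical, every column is stored as a string, and the column names are supplied by the caller because the tables in the question have no header cells.

```csharp
using System.Collections.Generic;
using System.Data;

// Hypothetical helper, not part of ADO.NET.
public static class HtmlTableConverter
{
    public static DataTable ToDataTable(
        string name,
        IList<string> columnNames,
        IEnumerable<IList<string>> rows)
    {
        var table = new DataTable(name);

        // One string column per supplied name.
        foreach (string columnName in columnNames)
        {
            table.Columns.Add(columnName, typeof(string));
        }

        foreach (IList<string> row in rows)
        {
            DataRow dataRow = table.NewRow();

            // Extra cells are ignored; missing cells stay DBNull.
            for (int i = 0; i < table.Columns.Count && i < row.Count; i++)
            {
                dataRow[i] = row[i];
            }

            table.Rows.Add(dataRow);
        }

        return table;
    }
}
```

A cell is then addressed as table.Rows[1]["Name"], and the row list from the question (a List<List<string>>) can be passed in directly as the rows argument.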
I have 3 Excel files containing data related to Client details, Company of Stocks and Order Details of Stocks Purchase. I want to parse all the data into a Multi-layer Dictionary using C# and run "Sorting" and "Searching" Functions on the same. I am a novice when it comes to C# and was wondering what would be the code for the same.
Data e.g.:
Stock Symbol | Company Name | S&P Sector
AAPL | Apple Inc. | IT
I could be barking up the wrong tree but with what you've given us to work with, I'm assuming you want to take the data matrix in the relevant worksheet and from that data, create an enumerable list of data with the relevant type so you can perform operations over it like sorting, filtering, etc. If that's what you want then the below is an example of that.
This is the workbook I created with some test data ...
You said you're a novice with C#, but to make the below work, create a new .NET Framework project and add the NuGet package Microsoft.Office.Interop.Excel. I called the project ExcelInteropDotNet, but you can change that to whatever you want.
using Microsoft.Office.Interop.Excel;
using System.Collections.Generic;
using System.Linq;
namespace ExcelInteropDotNet
{
public class CompanyStockInfo
{
public string StockSymbol { get; set; }
public string CompanyName { get; set; }
public string SPSector { get; set; }
}
class Program
{
static void Main(string[] args)
{
// Change the below variables to the relevant values for your needs.
string workbookName = @"c:\temp\Source Data.xlsx";
string worksheetName = "CompanyStockData";
// Create a new list with the type being the CompanyStockInfo type.
var companyStockInfoList = new List<CompanyStockInfo>();
// Create an instance of Excel, open the workbook, fetch the sheet and then
// find the last row in column A.
var xlApplication = new Application();
var xlWorkbook = xlApplication.Workbooks.Open(workbookName, ReadOnly: true);
var xlSrcSheet = xlWorkbook.Worksheets[worksheetName] as Worksheet;
var lastRow = xlSrcSheet.Cells[xlSrcSheet.Rows.Count,1].End[XlDirection.xlUp].Row;
// There may be a better way to do this but essentially, the below will loop through
// all cells from the 2nd row to the last row and create a new item in the list
// that stores all of the data.
for (long row = 2; row <= lastRow; row++)
{
companyStockInfoList.Add(new CompanyStockInfo()
{
StockSymbol = (xlSrcSheet.Cells[row, 1] as Range).Text,
CompanyName = (xlSrcSheet.Cells[row, 2] as Range).Text,
SPSector = (xlSrcSheet.Cells[row, 3] as Range).Text
});
}
xlApplication.Quit();
// You can use Linq to sort and search the list for the data you're wanting to
// get your hands on.
// Will filter all entries that have Inc. in the company name.
var filteredList = companyStockInfoList.Where(item => item.CompanyName.Contains("Inc."));
// Orders all entries by the company name in alphabetical order.
var orderedList = companyStockInfoList.OrderBy(item => item.CompanyName);
}
}
}
Now, having given you the above, you should understand that the Excel library in C# also lets you perform operations directly on the workbook, just as you can within Excel, such as SORT and FILTER. That may be another way to achieve what you're wanting.
Sort
AdvancedFilter
I'm not sure if all of that helps or not but I hope it does.
Good luck ...!
I have an ASP.NET MVC web application.
The SQL table has one column ProdNum and it contains data such as 4892-34-456-2311.
The user needs a form to search the database that includes this field.
The problem is that the user wants to have 4 separate fields in the UI razor view whereas each field should match with the 4 parts of data above between -.
For example ProdNum1, ProdNum2, ProdNum3 and ProdNum4 field should match with 4892, 34, 456, 2311.
Since the entire search form contains many fields including these 4 fields, the search logic is based on a predicate which is inherited from the PredicateBuilder class.
Something like this:
...other field to be filtered
if (!string.IsNullOrEmpty(ProdNum1))
{
predicate = predicate.And(
t => t.ProdNum.ToString().Split('-')[0].Contains(ProdNum1));
}
...other fields to be filtered
But the above code throws a run-time error:
The LINQ expression node type 'ArrayIndex' is not supported in LINQ to Entities
Does anybody know how to resolve this issue?
Thanks a lot for all responses, finally, I found an easy way to resolve it.
Instead of rebuilding the models and changing the database tables, I just add the '-' separator to the search strings to match the search criteria. Since the data format is always 4892-34-456-2311, I use StartsWith(PODNum1) to search the first field, Contains("-" + PODNum2 + "-") to search the second and third fields (substituting PODNum3 for PODNum2 for the third), and EndsWith("-" + PODNum4) to search the fourth. This way I don't need to change anything else; it is simple.
Again, thanks a lot for all responses, much appreciated.
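For illustration, here is a small in-memory sketch of that separator-based matching. The class ProdNumSearch and the method Filter are invented names, and the code runs against plain strings with LINQ to Objects rather than against the entity predicate from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that mirrors the matching rules described above.
public static class ProdNumSearch
{
    public static IEnumerable<string> Filter(
        IEnumerable<string> prodNums,
        string part1, string part2, string part3, string part4)
    {
        IEnumerable<string> query = prodNums;

        // First field: the value must begin with the search text.
        if (!string.IsNullOrEmpty(part1))
            query = query.Where(p => p.StartsWith(part1, StringComparison.Ordinal));

        // Second and third fields: the search text must sit between separators.
        if (!string.IsNullOrEmpty(part2))
            query = query.Where(p => p.Contains("-" + part2 + "-"));
        if (!string.IsNullOrEmpty(part3))
            query = query.Where(p => p.Contains("-" + part3 + "-"));

        // Fourth field: the value must end with separator + search text.
        if (!string.IsNullOrEmpty(part4))
            query = query.Where(p => p.EndsWith("-" + part4, StringComparison.Ordinal));

        return query;
    }
}
```

One thing to keep in mind with this scheme is that the Contains check cannot tell the second field from the third, so a search for 456 in the second box would also match 4892-34-456-2311.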
If I understand this correctly, you have one column which you want to act like 4 different columns? This isn't worth it. For that, you need to split each row's column data, create a class to handle the split data and finally use a List of that class. That's a useless workaround. I'd rather suggest you use 4 columns instead.
But if you still want to go with your existing method, you first need to split the data as I mentioned earlier. Here's an example:
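// "command" is assumed to be a SqlCommand that is already set up with your query.
public void Test(SqlCommand command)
{
// SqlDataReader has no public constructor; it comes from ExecuteReader().
using (SqlDataReader reader = command.ExecuteReader())
{
while (reader.Read())
{
string[] parts = reader[1].ToString().Split('-');
string part1 = parts[0]; // the 1st part of your column data
string part2 = parts[1]; // the 2nd part of your column data
}
}
}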
Now, as mentioned in the comments, you can instead create a class to handle all the data. For example, let's call it MyData:
public class MyData {
public string Part1;
public string Part2;
public string Part3;
public string Part4;
}
Now, within the while loop of the SqlDataReader, declare a new instance of this class and pass the values to it. An example:
// "command" is assumed to be a SqlCommand that is already set up with your query.
public void Test(SqlCommand command)
{
using (SqlDataReader reader = command.ExecuteReader())
{
while (reader.Read())
{
string[] parts = reader[1].ToString().Split('-');
MyData allData = new MyData();
allData.Part1 = parts[0];
allData.Part2 = parts[1];
}
}
}
Create a list of the class at class level:
public class MyForm
{
List<MyData> storedData = new List<MyData>();
}
Within the while loop of the SqlDataReader, add this at the end:
storedData.Add(allData);
So finally, you have a list of all the split data, and you can write your filtering logic easily :)
As already mentioned in a comment, the error means that accessing data via index (see [0]) is not supported when translating your expression to SQL. Split('-') is also not supported hence you have to resort to the supported functions Substring() and IndexOf(startIndex).
You could do something like the following to first transform the string into 4 number strings ...
.Select(t => new {
t.ProdNum,
FirstNumber = t.ProdNum.Substring(0, t.ProdNum.IndexOf("-")),
Remainder = t.ProdNum.Substring(t.ProdNum.IndexOf("-") + 1)
})
.Select(t => new {
t.ProdNum,
t.FirstNumber,
SecondNumber = t.Remainder.Substring(0, t.Remainder.IndexOf("-")),
Remainder = t.Remainder.Substring(t.Remainder.IndexOf("-") + 1)
})
.Select(t => new {
t.ProdNum,
t.FirstNumber,
t.SecondNumber,
ThirdNumber = t.Remainder.Substring(0, t.Remainder.IndexOf("-")),
FourthNumber = t.Remainder.Substring(t.Remainder.IndexOf("-") + 1)
})
... and then you could simply write something like
if (!string.IsNullOrEmpty(ProdNum3))
{
predicate = predicate.And(
t => t.ThirdNumber.Contains(ProdNum3));
}
I have a list of objects which I sort multiple times throughout code and when the user interacts with the program. I was wondering if it would be better to insert new items into the list rather than add to the end of the list and resort the entire list.
The code below is for importing browser bookmarks - Here I add a bunch of bookmarks to the List (this._MyLinks) which are Link objects and then sort the final List - Which I think is probably best in this given scenario....
public void ImportBookmarks(string importFile)
{
using (var file = File.OpenRead(importFile))
{
var reader = new NetscapeBookmarksReader();
var bookmarks = reader.Read(file);
foreach (var b in bookmarks.AllLinks)
{
bool duplicate = this._MyLinks.Any(link => link._URL == b.Url);
if(duplicate)
{
continue;
}
Link bookmark = new Link();
bookmark._URL = b.Url;
bookmark._SiteName = b.Title;
bookmark.BrowserPath = "";
bookmark.BrowserName = "";
if (bookmark.AddToConfig(true))
{
this._MyLinks.Add(bookmark);
}
}
}
this._MyLinks = this._MyLinks.OrderBy(o => o._SiteName).ToList();
}
Now a user also has the option to add their own links (one at a time). Whenever the user adds a link the ENTIRE list is sorted again using
this._MyLinks = this._MyLinks.OrderBy(o => o._SiteName).ToList();
Is it better from a performance standpoint (or just generally) to insert the item directly into its correct location? If so, would you have suggestions on how I can go about doing that?
Thanks!
Since you want a sorted set of data you should be using a more appropriate data structure, specifically a sorted data structure, rather than using an unsorted data structure that you re-sort every time, or that forces you to inefficiently add items to the middle of a list.
SortedSet is specifically designed to maintain a sorted set of data efficiently.
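A minimal sketch of that idea, assuming a simplified stand-in for the Link class (with SiteName and Url properties rather than the _SiteName and _URL members from the question) and a made-up comparer that orders by site name and falls back to the URL so that two sites with the same name are both kept:

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the Link class in the question.
public class Link
{
    public string SiteName { get; set; }
    public string Url { get; set; }
}

// Hypothetical comparer: site name first, URL as a tie-breaker.
public class LinkComparer : IComparer<Link>
{
    public int Compare(Link x, Link y)
    {
        int byName = string.Compare(x.SiteName, y.SiteName, StringComparison.OrdinalIgnoreCase);
        if (byName != 0)
        {
            return byName;
        }
        return string.Compare(x.Url, y.Url, StringComparison.Ordinal);
    }
}

public static class LinkStore
{
    // The set keeps itself ordered according to the comparer.
    public static SortedSet<Link> Create()
    {
        return new SortedSet<Link>(new LinkComparer());
    }
}
```

With this in place, links.Add(bookmark) puts each item in its sorted position and returns false when an equal item (same name and URL under this comparer) is already present, so iterating the set always yields the links in order without a separate OrderBy. The trade-off is that SortedSet has no positional indexer, so code that relies on _MyLinks[i] would need to change.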
I have a bunch of data that I'm pulling into my application which frankly is best represented as an Excel spreadsheet. By this I mean:
There are a lot of columns which need 'summing up'
There is a reasonable amount of data (basically a sheet of numbers)
At the moment this is just raw data in a database, but I also have a spreadsheet which shows this data (along with formulas that I need to replicate in my app).
At the moment I've just got a List<of T> of each row, however I believe there might be a better collection for storing data of this type. I basically need to be able to manipulate these numbers easily.
Any suggestions?
One option would be to use a DataTable which also has a builtin aggregation method.
For example (from MSDN):
// Presumes a DataTable named "Orders" that has a column named "Total."
DataTable table;
table = dataSet.Tables["Orders"];
// Declare an object variable.
object sumObject;
sumObject = table.Compute("Sum(Total)", "EmpID = 5");
Another advantage is that it supports LINQ queries with LINQ-To-DataSet.
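To illustrate the LINQ to DataSet option, here is a hedged sketch that computes the same kind of filtered sum as the Compute call. The helper OrderTotals is hypothetical, and it assumes the EmpID column is an int and the Total column is a decimal with no null values.

```csharp
using System.Data;
using System.Linq;

// Hypothetical helper; the column names follow the MSDN example above.
public static class OrderTotals
{
    public static decimal SumForEmployee(DataTable orders, int empId)
    {
        // AsEnumerable() exposes the rows to LINQ; Field<T> reads a typed column value.
        return orders.AsEnumerable()
            .Where(row => row.Field<int>("EmpID") == empId)
            .Sum(row => row.Field<decimal>("Total"));
    }
}
```

On the .NET Framework, AsEnumerable and Field come from the System.Data.DataSetExtensions assembly, so that reference has to be present.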
If your "excel data" can be represented in models, I'd just use models. For example like so:
public class ExcelModel
{
public string Id { get; set; }
public double value1 { get; set; }
public int value2 { get; set; }
}
Then you can easily create a List<ExcelModel>, and get the total like so:
List<ExcelModel> model = repository.GetAll(); //just an example
var total = model.Sum(x => x.value1);
I would like to store values in a text file as comma- or tab-separated values (it doesn't matter which).
I am looking for a reusable library that can manipulate this data in this text file, as if it were a sql table.
I need select * from... and delete from where ID = ...... (ID will be the first column in the text file).
Is there some CodePlex project that does this kind of thing?
I do not need complex functionality like joining or relationships. I will just have 1 text file, which will become 1 database table.
SQLite
:)
Use LINQ to CSV.
http://www.codeproject.com/KB/linq/LINQtoCSV.aspx
http://www.thinqlinq.com/Post.aspx/Title/LINQ-to-CSV-using-DynamicObject.aspx
If it's not CSV, then:
Let your file hold one record per line. At runtime, each record should be read into a collection of type Record (assuming Record is a custom class representing an individual record). You can do LINQ operations on the collection and write the collection back to the file.
Use ODBC. There is a Microsoft Text Driver for CSV files. I think this would be possible. I haven't tested whether you can modify the data via ODBC, but you can test it easily.
For querying you can also use LINQ.
Have you looked at the FileHelpers library? It has the capability of reading and parsing a text file into CLR objects. Combining that with the power of something like LINQ to Objects, you have the functionality you need.
public class Item
{
public int ID { get; set; }
public string Type { get; set; }
public string Instance { get; set; }
}
class Program
{
static void Main(string[] args)
{
string[] lines = File.ReadAllLines("database.txt");
var list = lines
.Select(l =>
{
var split = l.Split(',');
return new Item
{
ID = int.Parse(split[0]),
Type = split[1],
Instance = split[2]
};
});
Item secondItem = list.Where(item => item.ID == 2).Single();
List<Item> newList = list.ToList<Item>();
newList.RemoveAll(item => item.ID == 2);
// overwrite database.txt with the new data from "newList"
}
}
What about deleting data? LINQ is for querying, not for manipulation. However, List provides a predicate-based RemoveAll that does what you want:
newList.RemoveAll(item => item.ID == 2);
You can also look at a more advanced solution, "LINQ to Text or CSV Files".
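The step that the sample above leaves as a comment, writing the remaining records back to the file, could be sketched roughly as follows. The class ItemFile and the method DeleteById are made-up names, and the code assumes one comma-separated record per line with an integer ID in the first column and no quoted commas.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper that emulates "delete from ... where ID = ..." on a text file.
public static class ItemFile
{
    public static void DeleteById(string path, int id)
    {
        // Keep every non-empty line whose first column is not the given ID.
        List<string> kept = File.ReadAllLines(path)
            .Where(line => line.Length > 0)
            .Where(line => int.Parse(line.Split(',')[0]) != id)
            .ToList();

        // Replace the file contents with the remaining lines.
        File.WriteAllLines(path, kept);
    }
}
```

This rewrites the whole file on every delete, which is fine for small files but is the main reason the database-backed suggestions in this thread scale better.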
I would also suggest using ODBC. This should not complicate the deployment: the whole configuration can be set in the connection string, so you do not need a DSN.
Together with a schema.ini file you can even set column names and data-types, check this KB article from MS.
SQLite or LINQ to Text and CSV :)