Excel file parsing/scraping using .NET - c#

Hi experts, I am trying to parse an Excel file whose structure is very complex. The possible ways I know of are:
Use the Office interop libraries
Use the OLEDB provider and read the Excel file into a DataSet.
But the issue is its complexity: some columns, cells, or rows are blank, etc.
What are the best possible ways to do this?
Thanks in advance.

I can recommend the ExcelDataReader (licensed under LGPL I think). It loads both .xls and .xlsx files, and lets you get the spreadsheet as a DataSet, with each worksheet being an individual DataTable. As far as I know from the scenarios I have used it in, it honours blank rows, empty cells, etc. Try it and see if you think it will handle your "very complex" structure. [I do notice one negative review on the site - but the rest are pretty positive. I've experienced an issue reading .xlsx if a worksheet is renamed]
I've also used the OLEDB approach in the past, but be warned that it has real problems in the way it infers datatypes from the first few rows. If the datatype of a column changes further down the sheet, it may well infer it wrongly. To make matters worse, when it does get it wrong, it will often return null as the value, making it difficult (or impossible) to tell a true null value from a datatype that changed after the sampled rows (eight by default, controlled by the TypeGuessRows registry setting).
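For anyone who still goes the OLEDB route, here is a minimal sketch of building the connection string. The provider names and the `IMEX=1` extended property are the usual Jet/ACE settings; `IMEX=1` asks the driver to read mixed-type columns as text, which reduces the null problem described above but does not eliminate it. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class ExcelOleDb
{
    // Builds an OLEDB connection string for an .xls or .xlsx workbook.
    // Assumes the matching Jet/ACE provider is installed on the machine.
    public static string ConnectionString(string path, bool firstRowIsHeader)
    {
        bool isXlsx = Path.GetExtension(path)
            .Equals(".xlsx", StringComparison.OrdinalIgnoreCase);

        string provider = isXlsx ? "Microsoft.ACE.OLEDB.12.0" : "Microsoft.Jet.OLEDB.4.0";
        string format = isXlsx ? "Excel 12.0 Xml" : "Excel 8.0";
        string hdr = firstRowIsHeader ? "YES" : "NO";

        // IMEX=1 makes the driver treat columns with mixed types as text.
        return $"Provider={provider};Data Source={path};" +
               $"Extended Properties=\"{format};HDR={hdr};IMEX=1\";";
    }
}
```

The resulting string would be passed to an `OleDbConnection`, and the sheet queried with something like `SELECT * FROM [Sheet1$]`.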

Personally I prefer to either use the OLEDB way, which is a bit clunky at times, or use a third-party library whose authors have put in the time and effort to get access to the data.
SyncFusion has a pretty nice library for this.

I have my users first save the Excel spreadsheet as a CSV file. Then they upload the CSV file to my app. That makes it much simpler to parse.

I've used OLEDB myself to read uploaded Excel files, and it presents no real problems (except for nulls in fields, instead of blanks, which can be checked with IsDBNull). Also, third-party open source tools like NPOI and Excel2007ReadWrite (http://www.codeproject.com/KB/office/OpenXML.aspx) can be useful.
I have thoroughly evaluated both of these third party tools, and both are pretty stable and easy to integrate. I would recommend NPOI for Excel 2003 files, and Excel2007ReadWrite for Excel 2007 files.

It sounds like you have a good understanding of the task at hand. You'll have to write business logic to untangle the complexities of the spreadsheet format and extract the data you're looking for.
It seems to me that VSTO/Interop is the best platform strategy for 2 reasons:
Access to the spreadsheet data will be a small part of the effort needed for your solution. So if using OLEDB saves a little time in data access, it will probably be irrelevant in terms of the overall project scope.
You may need to examine the contents of individual cells closely and take context information like formatting into account. With interop, you get full visibility of cell contents, context, and other sheet level context information like named ranges and lists. It is a risk to assume you won't need this type of information while decoding the spreadsheet.

Related

how to select rows from excel sheet in c#

Hi, I have written code to read from Excel sheets and query them according to a filter.
But I am stuck at
Select * from [sheetname] where [col] not like '%something%'
How can I write the "not" part?
All the other queries work fine.
The one above ignores the "not" and executes as if it were a plain "like".
If you don't have to use ADO and OLE to read your spreadsheet, I would recommend using EPPlus. It's a project that allows you to work with spreadsheets in a much better OOP paradigm, and it hides the gotchas of the internal .xlsx formatting. Note that it reads the .xlsx format only, not the older binary .xls files.
Is it just that you don't have quotes around %something%?
Check out this if you want to melt your brain with Excel possibilities (search the comments for 'not like'), and perhaps solve your problem at the same time.
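If the provider keeps ignoring the NOT, one workaround is to load the sheet into a DataTable first and filter in memory. The sketch below assumes the sheet is already in a DataTable and uses the DataTable expression syntax rather than the OLEDB SQL dialect; the class and method names are hypothetical.

```csharp
using System;
using System.Data;

public static class SheetFilter
{
    // Returns the rows whose value in `column` does NOT contain `text`.
    // Rows where the column is null are not returned, because the
    // expression evaluates to null for them.
    public static DataRow[] RowsNotContaining(DataTable sheet, string column, string text)
    {
        // Double single quotes so the text is safe inside the string literal.
        // Wildcard characters (* % [ ]) inside `text` are not escaped here.
        string escaped = text.Replace("'", "''");
        string filter = $"NOT ([{column}] LIKE '%{escaped}%')";
        return sheet.Select(filter);
    }
}
```

Wrapping the LIKE in parentheses and negating the whole expression avoids depending on how a given parser treats `NOT LIKE` written as one operator.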

Working with tables

I'm making a small game using XNA, but it is cumbersome to effectively type-in the stats to all the entities in the game.
I was thinking that it would be much simpler to save the required information in a separate file with a table-format and use that instead.
I have looked into reading Excel tables with C# but it seems to be overly complex.
Are there any other table-format file types that let me easily edit the contents of the file and also read the contents using C# without too much hassle?
Basically, is there any other simple alternative to Excel? I just need the simplest table files to save some text in.
CSV is probably the simplest format to store table data, and Excel can save data in it. There are no CSV classes in the core C# libraries as far as I know, although Microsoft.VisualBasic.FileIO.TextFieldParser can be referenced from C# on the desktop framework.
You may also consider XML or JSON to store data if you want something more structured. Both have built-in classes to serialize objects to and from.
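As a rough illustration of the XML option, here is a sketch using XmlSerializer. The `EntityStats` class and its fields are placeholders for whatever stats the game actually needs, and the round trip goes through a string here only to keep the example self-contained; in practice you would read and write a file.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stat block; replace the properties with your own.
public class EntityStats
{
    public string Name { get; set; }
    public int Health { get; set; }
    public int Attack { get; set; }
}

public static class StatsXml
{
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(List<EntityStats>));

    // Turns the list into an XML document that can be edited by hand.
    public static string Save(List<EntityStats> stats)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, stats);
            return writer.ToString();
        }
    }

    // Rebuilds the list from XML produced by Save (or edited by hand).
    public static List<EntityStats> Load(string xml)
    {
        using (var reader = new StringReader(xml))
        {
            return (List<EntityStats>)Serializer.Deserialize(reader);
        }
    }
}
```

The XML that comes out has one `<EntityStats>` element per entity, so adding a new monster is a copy-and-paste in any text editor.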
If you are comfortable using Excel, try exporting to a .CSV (Comma-Separated Values) file. The literal string will look like the one below.
row1col1,row1col2\nrow2col1,row2col2\nrow3col1,row3col2
The format is incredibly simple. Each row is on a separate line (separated by "\n") and the columns within a line are separated by commas. It is very easy to parse: just iterate through the lines and split on the commas.
// tr is a TextReader, for example a StreamReader opened on the CSV file
string row;
while ((row = tr.ReadLine()) != null)
{
    string[] cols = row.Split(',');
    string first = cols[0];  // first column
    string second = cols[1]; // second column
    string third = cols[2];  // etc.
}
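One caveat with a plain split: when a cell itself contains a comma, Excel wraps it in double quotes (and doubles any quotes inside it), so splitting on every comma breaks the row apart in the wrong places. A small sketch of a quote-aware splitter follows; it handles one line at a time and assumes no field contains a line break. The `CsvLine` name is made up.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits one CSV line into fields, honouring double-quoted fields
    // and "" as an escaped quote inside a quoted field.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Length = 0;      // start the next field
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());  // last field on the line
        return fields;
    }
}
```

Inside the read loop, `CsvLine.Split(row)` would replace `row.Split(',')`.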
This may be overkill, but SQLite might be worth looking into if you want expandability and maintainability. It is easy to set up, and learning SQL will be useful in many applications.
This is a good tutorial to get you started:
http://www.dreamincode.net/forums/topic/157830-using-sqlite-with-c%23/
I understand if this isn't exactly what you were looking for, but I wanted to give you a broader range of options. If you are going for absolute simplicity go with CSV or XML like Alexei said.
Edit: If necessary, there is a C# SQLite version for managed environments (Xbox, WP7): http://forums.create.msdn.com/forums/p/47127/282261.aspx

Excel Data Processing with VSTO?

I find myself in possession of an Excel Spreadsheet containing about 3,000 rows of data that represent either additions or changes to data that I need to make to an SQL Table. As you can imagine that's a bit too much to handle manually. For a number of reasons beyond my control, I can't simply use an SSIS package or other simpler method to get these changes into the database. The only option I have is to create SQL scripts that will make the changes represented in the spreadsheet to MS SQL 2005.
I have absolutely no experience with Office automation or VSTO. I've tried looking online, but most of the tutorials I've seen seem a bit confusing to me.
So, my thought is that I'd use .NET and VSTO to iterate through the rows of data (or use LINQ, whatever makes sense) and determine if the item involved is an insert or an update item. There is color highlighting in the sheet to show the delta, so I suppose I could use that or I could look up some key data to establish if the entry exists. Once I establish what I'm dealing with, I could call methods that generate a SQL statement that will either insert or update the data. Inserts would be extremely easy, and I could use the delta highlights to determine which fields need to be updated for the update items.
I would be fine with either outputting the SQL to a file, or even adding the text of the SQL for a given row in the final cell of that row.
Any direction to some sample code, examples, how-tos or whatever would lead me in the right direction would be most appreciated. I'm not picky. If there's some tool I'm unaware of or a way to use an existing tool that I haven't thought of to accomplish the basic mission of generating SQL to accomplish the task, then I'm all for it.
If you need any other information feel free to ask.
Cheers,
Steve
I suggest that before trying VSTO, you keep things simple and get some experience solving such a problem with Excel VBA. IMHO that is the easiest way of learning the Excel object model, especially because you have the macro recorder at hand. You can re-use this knowledge later when you think you have to switch to C#, VSTO or Automation or (better!) Excel-DNA.
For Excel VBA, there are lots of tutorials out there, here is one:
http://www.excel-vba.com/excel-vba-contents.htm
If you need to know how to execute arbitrary SQL commands like INSERT or UPDATE within a VBA program, look into this SO post:
Excel VBA to SQL Server without SSIS
Here is another SO post showing how to get data from an SQL server into an Excel spreadsheet:
Accessing SQL Database in Excel-VBA
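Whichever way the rows end up being read (VBA, VSTO, or anything else), the SQL-generation step described in the question can be sketched on its own. The code below is only an illustration: the class name, the column/value pair representation, and the decision to emit every value as a quoted string literal are all assumptions, and deciding whether a row is an insert or an update is left to the caller.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SqlScriptBuilder
{
    // Emits a value as a SQL string literal, doubling embedded quotes.
    // Assumption: SQL Server converts the literal to the column's type.
    private static string Quote(string value)
    {
        return value == null ? "NULL" : "'" + value.Replace("'", "''") + "'";
    }

    // Builds an INSERT for a row that does not exist yet.
    public static string Insert(string table, IList<KeyValuePair<string, string>> cells)
    {
        string columns = string.Join(", ", cells.Select(c => "[" + c.Key + "]"));
        string values = string.Join(", ", cells.Select(c => Quote(c.Value)));
        return $"INSERT INTO [{table}] ({columns}) VALUES ({values});";
    }

    // Builds an UPDATE that touches only the changed (highlighted) cells.
    public static string Update(string table,
                                KeyValuePair<string, string> key,
                                IList<KeyValuePair<string, string>> changed)
    {
        string sets = string.Join(", ",
            changed.Select(c => "[" + c.Key + "] = " + Quote(c.Value)));
        return $"UPDATE [{table}] SET {sets} WHERE [{key.Key}] = {Quote(key.Value)};";
    }
}
```

Each generated statement can be appended to a script file, or written into the last cell of the row it came from, as the question suggests.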

How can I save large amounts of data in C#?

I'm writing a program in C# that will save lots of data points and then later make a graph. What is the best way to save these points?
Can I just use a really long array or should I use a text file or excel file or something like that?
Additional information: It probably won't be more than a couple thousand. And it would be good if I could access it from a Windows Mobile app. Basically a user will be able to save times that something happens at, and then the app will use the data to find a cross correlation.
If it's millions or even thousands of records, I would probably look at using a database. You can get SQL Server 2008 Express for free, or use MySQL, or something like that.
If you go that route, LINQ to SQL makes database access a piece of cake in .NET. Entity Framework is also available, but LINQ to SQL probably has a quicker time-to-implement.
If you use a text file, an Excel file, etc., you'll still need to load the data back into memory to plot the graph.
So if you're collecting data over a long period of time, or you want to plot the graph some time in the future, write them to a plain text file. When you're ready to plot the graph, load the file up and plot the graph.
If the data collection is within a short period of time, don't bother writing to a file - it'll just add steps to the process for nothing.
A really easy way of doing this would be to serialize your object list with a BinaryFormatter or an XmlSerializer, which automatically format your data into a readable and writable format so that, when your program needs to load the data, all you have to do is deserialize it (1 line of code).
Alternatively, if you have very many records, I suggest trying to use a database. It's quite easy to interface C# with SQL Server (there's a free version called Express Edition) or MySQL, and storing and retrieving huge amounts of data is not a pain. This would be the most efficient way to accomplish your task.
Depending on how much data you have and whether you want to accomplish something like this with 1 line of code (serialization) or interface with a separate product (the database approach), you can choose either one of the above. Of course, if you wanted to, you could just manually write the contents of your data to a text file or CSV file, as you suggested, but, from personal experience, I recommend the methods I explained above.
It probably won't be more than a couple thousand. And it would be good if I could access it from a Windows Mobile app. Basically a user will be able to save times that something happens at, and then the app will use the data to find a cross correlation.
Is there any need for interoperability with other processes? If so, time to swot up on file formats.
However, from the sound of it, you're asking on a matter of "style", with no real requirement to open the file anywhere but your own app. I'd suggest using a BinaryWriter for the task.
If debugging is an issue, a human-readable format might be preferable, but would be considerably larger than the binary equivalent.
Probably the quickest way to do it would be using binary serialization.
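For the use case in the question (a few thousand timestamps), the BinaryWriter suggestion might look roughly like this. The file layout (a count followed by one 64-bit value per timestamp) and the `TimePointStore` name are assumptions made for the sketch; any Stream works, so a FileStream would be used in the real app.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TimePointStore
{
    // Writes the number of points, then each DateTime as a 64-bit value.
    public static void Save(Stream output, IList<DateTime> points)
    {
        using (var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(points.Count);
            foreach (DateTime point in points)
            {
                writer.Write(point.ToBinary());
            }
        }
    }

    // Reads the points back in the same order they were written.
    public static List<DateTime> Load(Stream input)
    {
        using (var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true))
        {
            int count = reader.ReadInt32();
            var points = new List<DateTime>(count);
            for (int i = 0; i < count; i++)
            {
                points.Add(DateTime.FromBinary(reader.ReadInt64()));
            }
            return points;
        }
    }
}
```

At 8 bytes per timestamp, a couple of thousand points come to well under 20 KB, so loading the whole file into a list before computing the correlation is not a concern.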

Best practice for Uploading Excel data in SQL Server using ASP.NET

I am looking for the best practice for uploading Excel data into a SQL Server 2000 database through an ASP.NET web application. The Excel data will be in a predefined format with almost 42 columns; of those 42, 10 fields are mandatory and the rest are conditionally mandatory, i.e. if data exists it should be in the defined format. I also need to validate for special characters, length, specified format and so on.
After validating, I need to store the valid data in a SQL Server table and provide export-to-Excel functionality for the invalid data, exporting it in the same Excel format with an indicator to identify the invalid cells.
Can anyone suggest an optimized way to do this?
Thank you...
You can use ADO.NET to read the data in from the spreadsheet, as outlined here.
Read it in to memory and parse all the data as necessary. Store the parsed data into a DataTable, and then you can persist that data in bulk to the database using a couple of possible methods.
The quickest, most efficient way to bulkload data into SQL Server is using SqlBulkCopy. The alternative method is to use an SqlDataAdapter. I recently outlined both approaches, with examples and performance comparisons here.
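The validation pass that sits between reading the sheet and bulk-loading it could be sketched like this. Only two of the checks from the question are shown (mandatory fields and maximum length), the rule parameters are hypothetical, and the SqlBulkCopy call appears only as a comment because it needs a live connection.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class UploadValidator
{
    // Returns (rowIndex, columnName) for every cell that fails a check,
    // so the export step can flag exactly those cells.
    public static List<Tuple<int, string>> FindInvalidCells(
        DataTable sheet, ICollection<string> mandatoryColumns, int maxLength)
    {
        var invalid = new List<Tuple<int, string>>();

        for (int r = 0; r < sheet.Rows.Count; r++)
        {
            foreach (DataColumn column in sheet.Columns)
            {
                object raw = sheet.Rows[r][column];
                string text = raw == DBNull.Value ? "" : Convert.ToString(raw).Trim();

                bool missing = text.Length == 0
                    && mandatoryColumns.Contains(column.ColumnName);
                bool tooLong = text.Length > maxLength;

                if (missing || tooLong)
                {
                    invalid.Add(Tuple.Create(r, column.ColumnName));
                }
            }
        }

        return invalid;
    }
}

// Rows with no invalid cells can then be copied into a second DataTable
// and sent in one go, for example:
//   using (var bulk = new SqlBulkCopy(connection))
//   {
//       bulk.DestinationTableName = "dbo.TargetTable"; // hypothetical name
//       bulk.WriteToServer(validRows);
//   }
```

Keeping the invalid cells as row/column pairs makes it straightforward to colour or annotate them when the rejected rows are exported back to Excel.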
You can, but AFAIK you are not allowed to use Excel COM Interop on a web server, and it is definitely neither recommended nor supported (source).
So you are left with 2 options:
Try to switch to a different format (XML, CSV) or use an Excel XML format that you can read and write using System.Xml or System.Xml.Linq.
Find a component that can read and write Excel binary files. There are commercial and open source components available.
FileHelpers for .NET is a decent library that will do a lot of the processing for you if you are looking for something quick and efficient without having to build a ton of it yourself. They have an example of loading Excel files into a SQL database like you describe.
