Import data from multiple CSV files to an Excel sheet - c#

I need to import data from 50 similar CSV files into a single Excel sheet.
Is there any way to get only selected columns from each file and put them together in one sheet?
Structure of my CSV files: a few columns are exactly the same in all the files (I want them in Excel). Then there is one column with the same column name but different data in each file, which I want to place next to each other under different names in the Excel sheet. I do not want any of the remaining columns from the CSV files.
In short:
read all CSV files,
get all the columns which are common to all CSV files and put them in the Excel sheet.
Now, take the one column from each file which has the same header but different data, and put these one after the other in the Excel sheet, each named after its CSV file name.
Leave out the rest of the columns.
Write the Excel sheet to an Excel file.
Initially I thought it could be done easily, but since my programming skills are still at the learning stage, it is far too difficult for me. Please help.

Microsoft Text Driver allows you to read CSV data in a DataSet, making data manipulation easy.
This Stack Overflow question is a good starting point.

The fastest way could be to use FileHelpers to read the CSV into a DataTable:
http://filehelpers.sourceforge.net/FileHelpers.CommonEngine.CsvToDataTable_overload_4.html
and then export that DataTable to Excel with EPPlus, using the method DataTableToExcelXlsx from this snippet:
https://stackoverflow.com/a/9569827/351383
With EPPlus you don't need to have Excel installed on the machine that is executing this code, and you can use it on a server (ASP.NET).

With some very simple code, I was able to read the files. Now the only thing we need to do is extend this code to loop through all the CSV files in the given folder and collect the data. Once we have read the data, it can be filtered and written to Excel as we want.
Of course, Excel can import CSV itself, but it is not practical to do this by hand every time. Also, we can add the code to an application for more flexibility, which is exactly what I am trying to do.
public static System.Data.DataTable GetDataTable(string strFileName)
{
    // The text driver treats the folder as the data source and the file name as the table.
    System.Data.OleDb.OleDbConnection dbConnect = new System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) + ";Extended Properties = \"Text;HDR=YES;FMT=TabDelimited\"");
    dbConnect.Open();
    string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
    System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, dbConnect);
    System.Data.DataSet dSet = new System.Data.DataSet("CSV File");
    adapter.Fill(dSet);
    dbConnect.Close();
    // The filled DataSet exposes its tables through the Tables collection.
    return dSet.Tables[0];
}
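The looping and filtering step mentioned above could look roughly like the sketch below. It is only an illustration under several assumptions that are not from the question: every file lists the same rows in the same order, fields never contain quoted commas (so a plain Split(',') is enough), and the file names are unique and differ from the shared column names. CsvMerger, Merge and the parameter names are made up for this example. It reads the files directly instead of going through the OleDb text driver, purely to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;

public static class CsvMerger
{
    // Hypothetical helper: builds one table holding the shared columns once,
    // plus one extra column per file, named after that file.
    public static DataTable Merge(IEnumerable<string> csvPaths, string[] commonColumns, string valueColumn)
    {
        var merged = new DataTable("Merged");
        foreach (string name in commonColumns)
        {
            merged.Columns.Add(name, typeof(string));
        }

        bool isFirstFile = true;
        foreach (string path in csvPaths)
        {
            string[] lines = File.ReadAllLines(path);
            if (lines.Length == 0) continue; // skip empty files

            string[] header = lines[0].Split(',');
            int valueIndex = Array.IndexOf(header, valueColumn);
            if (valueIndex < 0)
            {
                throw new InvalidDataException(path + " has no column named " + valueColumn);
            }

            // Name the per-file column after the file, e.g. "site1.csv" -> "site1".
            string fileColumn = Path.GetFileNameWithoutExtension(path);
            merged.Columns.Add(fileColumn, typeof(string));

            for (int i = 1; i < lines.Length; i++)
            {
                string[] fields = lines[i].Split(',');
                if (isFirstFile)
                {
                    // The first file supplies the shared columns for every row.
                    DataRow newRow = merged.NewRow();
                    foreach (string name in commonColumns)
                    {
                        newRow[name] = fields[Array.IndexOf(header, name)];
                    }
                    merged.Rows.Add(newRow);
                }
                if (i - 1 >= merged.Rows.Count) break; // rows beyond the first file's are ignored
                merged.Rows[i - 1][fileColumn] = fields[valueIndex];
            }
            isFirstFile = false;
        }
        return merged;
    }
}
```

The resulting DataTable can then be handed to whichever Excel writer you prefer, for example the EPPlus approach mentioned in the other answer.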

Related

OleDbConnection only finds cell value when workbook is also open in Excel

I have a program (actually SSIS script task, but I don't suppose that matters) that creates an OLE DB connection to an Excel workbook, and reads the cell values in each worksheet, storing them in a SQL Server table.
Each worksheet has several sections of rows, each section being for a separate product. The first two rows of each product section are a quarter row, and a year row. Here is a screen shot:
I use an OleDbDataReader with a "Select *" command to read the data in each sheet into a DataTable. I have a column called "YearQuarter" in my SQL database, where I store a concatenation of the year row value and the preceding quarter row value, with a hyphen between the two strings:
My code is like this:
OleDbConnection oleExcelConnection = new OleDbConnection(
    "Provider=Microsoft.ACE.OLEDB.12.0;" +
    "Data Source=" + strWkbkFilePath + ";" +
    "Mode=Read;" +
    "Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"");
oleExcelConnection.Open();
DataTable dtCurrSheet = new DataTable();
// Name of table is in strLoadTblNm.
OleDbCommand oleExcelCommand;
OleDbDataReader oleExcelReader;
oleExcelCommand = oleExcelConnection.CreateCommand();
oleExcelCommand.CommandText = "Select * From [" + strLoadTblNm + "]";
oleExcelCommand.CommandType = CommandType.Text;
oleExcelReader = oleExcelCommand.ExecuteReader();
// Load worksheet into data table
dtCurrSheet.Load(oleExcelReader);
oleExcelReader.Close();
Looking at the output data, I noticed that I was getting inconsistent results. Some rows would have a YearQuarter column value that would have only the Year row value in them, while others would have the cell values from both rows. For example, I'd have "2009 - Year End" followed by just "2010", with no " - 1st Qtr." appended to it.
This is because that quarter cell value is never loaded into the data reader, as the Dataset Visualizer shows:
Notice also that, in the Dataset, the column that is missing the Quarter cell value also has other numeric values missing their formatting (no commas).
If I save the file as a .csv, all cell values are preserved.
However, I noticed that it wasn't consistent. Sometimes I'd run my package and the same row would now have the full value. So, in the above example, I'd get "2010 - 1st Qtr."
I finally realized that it was working as expected only if I happened to have the workbook open in Excel at the same time that the program was running!
Why would this make a difference? Could it be that there is a macro or something in the workbook that is executed by Excel, but not when the workbook is accessed only via an OLE DB connection? Would the fact that it had been executed in Excel then affect the data obtained by OLE DB? If that's the case, how do I get around this? The spreadsheets are provided to me. So I can't modify them.
I think you're having issues with the auto-formatting that Excel tries to apply. With an OLEDB connection, I can't see how having the sheet open fixes your problem (obviously very strange).
Try adding IMEX=1 to your connection options to treat the entire sheet as text, and see if this is your issue. This is pulled from the question "OLEDB connection does not read data from excel sheet". There is also another good post on an external site: "Tips for reading Excel spreadsheets using ADO.NET".
Also, you're pulling data from an Excel sheet and writing it to another Excel sheet... Is it the same workbook? I have a couple more ideas for you, depending on your situation.
This bug turns out to be a "feature", and it should come with a big warning sign.
This article (thanks, @vb4all) explains that "ADO.NET scans the first 8 rows of data, and based on that, guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!"
In other words, it is treating the worksheet as a relational table, in which all values in a given column are of the same type. Of course, worksheet data is not bound by this restriction.
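To make the quoted description concrete, here is a toy model of it. This is not the provider's actual code, only a sketch of the behavior as described: it samples the first 8 cells of a column, picks whichever of "number" or "text" is more common, and turns every cell of the other kind into null. The class and method names are invented, and breaking ties toward numbers is an arbitrary choice made for this sketch.

```csharp
using System;
using System.Linq;

public static class TypeGuessModel
{
    // Toy model only: cells are either double (numeric cell) or string (text cell).
    public static object[] ReadColumn(object[] cells, int rowsToScan = 8)
    {
        // Step 1: look only at the first few rows.
        object[] sample = cells.Take(rowsToScan).ToArray();
        int numbers = sample.Count(c => c is double);
        int texts = sample.Count(c => c is string);

        // Step 2: guess a single type for the whole column.
        Type guessed = numbers >= texts ? typeof(double) : typeof(string);

        // Step 3: "coerce" every cell; anything of the other type comes back as null.
        return cells.Select(c => c != null && c.GetType() == guessed ? c : null).ToArray();
    }
}
```

With eight text cells followed by a numeric cell such as 2010, the model guesses "text" and returns null for the numeric cell, which is the kind of silent data loss described in the question.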
This behavior can be worked around by setting IMEX=1 in the connection string options and then modifying these registry settings:
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/Typ
(Note: registry keys vary depending on 32 vs. 64 bit. E.g., for 64-bit, the first one would be HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Jet 4.0).
I think this was a very risky design, inviting data transfer errors that could easily go unnoticed.

reading tab delimited file into table

We read a tab delimited file into a DataTable with the following code.
//read the uploaded file...
var records = File.ReadAllLines(Server.MapPath("~/UploadFiles/") + Session.SessionID + "/orderEDI.txt").ToList();
//load the data into the temporary table...
records.ForEach(record => loadTable.Rows.Add(record.Split((char)9)));
This works just fine providing there are not more tabs in the file than there are columns in the DataTable.
I'm looking to find out if there's a way I can limit the number of columns it reads from the file. Or any other suggestions around this issue. It must read an absolute minimum of 10 columns (and ideally, only 10)
I build the DataTable and add columns to it before this load occurs. Would it be better to not add columns and just load the file into the DataTable and then read the table by column index rather than name?
Really not sure which way to go and would appreciate some experienced opinions.
Thanks
Since split results in an array, why don't you just use Take(10)?
//read the uploaded file...
var records = File.ReadAllLines(Server.MapPath("~/UploadFiles/") + Session.SessionID + "/orderEDI.txt").ToList();
//load the data into the temporary table (ToArray is needed so that Rows.Add receives one value per column)...
records.ForEach(record => loadTable.Rows.Add(record.Split((char)9).Take(10).ToArray()));

Importing an Excel WorkSheet into a Datatable

I have been asked to create import functionality in my application. I am getting an Excel worksheet as input. The worksheet has column headers followed by data. The users want to simply select an .xls file from their system and click upload; the tool then deletes the table in the database and adds this new data.
I thought the best way would be to bring the data into a DataTable object and do a foreach over the rows, inserting row by row into the database.
My question is: can anyone give me code to open an Excel file, determine what line the data starts on, and import the data into a DataTable object?
Take a look at Koogra.
You instantiate a WorkBook object from a path to an XLS file.
You access a WorkSheet object from the workbook's Sheets property.
You can enumerate over the rows in the worksheet by accessing the sheet's Rows property from index MinRow to MaxRow.
You can enumerate over the cells in a given row by accessing the row's Cells property from index MinColumn to MaxColumn.
Each cell has a Value property (object) as well as a FormattedValue method (string).
Give it a try -- I've found it to be extremely intuitive and easy to use.
You can make use of an OleDbConnection to connect to the Excel file and then query it using SQL queries.
If it is an ASP.NET application, you can make use of the FileUpload control and get the bytes from the file. You will then have to convert them to a DataTable manually.
Try out these links:
OleDbConnection to excel file
Byte array to datatable
What you're looking for is the concept described here,
provided you don't want to use a third-party library; otherwise Dan's solution will suit you.
First you have to download the DLL file, namely NExcel.dll.
Using this DLL you can create various objects that are very useful for importing Excel data in .NET, from both VB and C#.
Good luck.

Problems reading in an Excel file in C#

I'm reading an Excel file over an OLEDB connection using this code:
var fileName = string.Format("{0}\\s23.xls", Directory.GetCurrentDirectory());
var connectionString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0; data source={0}; Extended Properties=Excel 8.0;", fileName);
var adapter = new OleDbDataAdapter("SELECT * FROM [TEJ3$] ", connectionString);
DataTable dt = new DataTable();
adapter.Fill(dt);
After running this code my data table is filled. But I have a column that comes back with many string cells and a few empty cells; in the Excel file, these empty cells hold numeric values.
Someone has an idea?
Check the first examples here: http://www.connectionstrings.com/excel
What often goes wrong is that Excel will estimate the type of a column based upon the first X rows. When later values don't match that type, those rows get empty values. I'm afraid that going into the registry is sometimes the only way to get the Excel driver to scan all rows first (as described in the connectionstrings.com article).
Play around with the HDR and IMEX settings in your environment. In some cases that will help as well.
I had this exact problem and solved it by using the IMEX setting. In case others are wondering how to include IMEX, here is what I have in my connection string:
string connectionString = @"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties='Excel 8.0;IMEX=1';Data Source={0};";
connectionString = string.Format(connectionString, excelWorkbookPath);
SpreadsheetGear for .NET will let you load Excel workbooks from C# and access the underlying cell values or the formatted values in any order no matter how the workbook is laid out or what the types of the cells are.
You can see live ASP.NET samples here and download the free trial here if you want to try it yourself.
Disclaimer: I own SpreadsheetGear LLC
Just make sure that your excel file is not open. Close your excel application & then start your program.
Sunil

Scientific notation when importing from Excel in .Net

I have a C#/.Net job that imports data from Excel and then processes it. Our client drops off the files and we process them. I don't have any control over the original file.
I use the OleDb library to fill up a dataset. The file contains some numbers like 30829300, 30071500, etc... The data type for those columns is "Text".
Those numbers are converted to scientific notation when I import the data. Is there any way to prevent this from happening?
One workaround to this issue is to change your select statement, instead of SELECT * do this:
"SELECT Format([F1], 'General Number') From [Sheet1$]"
-or-
"SELECT Format([F1], \"#####\") From [Sheet1$]"
However, doing so will blow up if your cells contain more than 255 characters with the following error:
"Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done."
Fortunately my customer didn't care about erroring out in this scenario.
This page has a bunch of good things to try as well:
http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/
The OleDb library will, more often than not, mess up your data in an Excel spreadsheet. This is largely because it forces everything into a fixed-type column layout, guessing at the type of each column from the values in the first 8 cells in each column. If it guesses wrong, you end up with digit strings converted to scientific-notation. Blech!
To avoid this you're better off skipping the OleDb and reading the sheet directly yourself. You can do this using the COM interface of Excel (also blech!), or a third-party .NET Excel-compatible reader. SpreadsheetGear is one such library that works reasonably well, and has an interface that's very similar to Excel's COM interface.
Using this connection string:
Provider=Microsoft.ACE.OLEDB.12.0; data source={0}; Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\"
with Excel 2010 I have noticed the following. If the Excel file is open when you run the OLEDB SELECT then you get the current version of the cells, not the saved file values. Furthermore the string values returned for a long number, decimal value and date look like this:
5.0130370071e+012
4.08
36808
If the file is not open then the returned values are:
5013037007084
£4.08
Monday, October 09, 2000
If you look at the actual .xlsx file using the Open XML SDK 2.0 Productivity Tool (or simply unzip the file and view the XML in Notepad) you will see that Excel 2007 actually stores the raw data in scientific format.
For example 0.00001 is stored as 1.0000000000000001E-5
<x:c r="C18" s="11" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:v>1.0000000000000001E-5</x:v>
</x:c>
Looking at the cell in Excel, it's displayed as 0.00001 in both the cell and the formula bar. So it is not always true that OleDb is causing the issue.
I have found that the easiest way is to choose Zip format, rather than text format for columns with large 'numbers'.
Have you tried casting the value of the field to (int) or perhaps (Int64) as you are reading it?
Look up the IMEX=1 connection string option and TypeGuessRows registry setting on google.
In truth, there is no easy way round this because the reader infers column data types by looking at the first few rows (8 by default). If the rows contain all numbers then you're out of luck.
An unfortunate workaround which I've used in the past is to use the HDR=NO connection string option and set the TypeGuessRows registry setting value to 1, which forces it to read the first row as valid data to make its datatype determination, rather than a header.
It's a hack, but it works. The code reads the first row (containing the header) as text, and then sets the datatype accordingly.
Changing the registry is a pain (and not always possible) but I'd recommend restoring the original value afterwards.
If your import data doesn't have a header row, then an alternative option is to pre-process the file and insert a ' character before each of the numbers in the offending column. This causes the column data to be treated as text.
So all in all, there are a bunch of hacks to work around this, but nothing really foolproof.
I had this same problem, but was able to work around it without resorting to the Excel COM interface or 3rd party software. It involves a little processing overhead, but appears to be working for me.
First, read in the data to get the column names.
Then create a new DataSet with each of these columns, setting each of their DataTypes to string.
Read the data in again into this new DataSet. Voila - the scientific notation is now gone and everything is read in as a string.
Here's some code that illustrates this, and as an added bonus, it's even StyleCopped!
public void ImportSpreadsheet(string path)
{
    string extendedProperties = "Excel 12.0;HDR=YES;IMEX=1";
    string connectionString = string.Format(
        CultureInfo.CurrentCulture,
        "Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1}\"",
        path,
        extendedProperties);
    using (OleDbConnection connection = new OleDbConnection(connectionString))
    {
        using (OleDbCommand command = connection.CreateCommand())
        {
            command.CommandText = "SELECT * FROM [Worksheet1$]";
            connection.Open();
            using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
            using (DataSet columnDataSet = new DataSet())
            using (DataSet dataSet = new DataSet())
            {
                columnDataSet.Locale = CultureInfo.CurrentCulture;
                adapter.Fill(columnDataSet);
                if (columnDataSet.Tables.Count == 1)
                {
                    var worksheet = columnDataSet.Tables[0];
                    // Now that we have a valid worksheet read in, with column names, we can create a
                    // new DataSet with a table that has preset columns that are all of type string.
                    // This fixes a problem where the OLEDB provider is trying to guess the data types
                    // of the cells and strange data appears, such as scientific notation on some cells.
                    dataSet.Tables.Add("WorksheetData");
                    DataTable tempTable = dataSet.Tables[0];
                    foreach (DataColumn column in worksheet.Columns)
                    {
                        tempTable.Columns.Add(column.ColumnName, typeof(string));
                    }
                    adapter.Fill(dataSet, "WorksheetData");
                    if (dataSet.Tables.Count == 1)
                    {
                        worksheet = dataSet.Tables[0];
                        foreach (var row in worksheet.Rows)
                        {
                            // TODO: Consume some data.
                        }
                    }
                }
            }
        }
    }
}
I got this solution from somewhere else, but it worked perfectly for me.
There is no need to make any code change: just format the Excel column cells as "General" instead of any other format such as "Number" or "Text". Then even Select * from [Sheet1$] or Select Column_name from [Sheet1$] will read the data perfectly, even for large numeric values of more than 9 digits.
I googled around this issue.
Here are my solution steps.
For the template Excel file:
1- Format the Excel column as Text.
2- Write a macro to disable the error warnings for the number -> text conversion:
Private Sub Workbook_BeforeClose(Cancel As Boolean)
    Application.ErrorCheckingOptions.BackgroundChecking = True
End Sub
Private Sub Workbook_Open()
    Application.ErrorCheckingOptions.BackgroundChecking = False
End Sub
In the code-behind:
3- While reading the data to import, try to parse the incoming data to Int64 or Int32.
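The parsing step could be sketched as follows. This is a minimal illustration, not code from any of the answers: ImportedNumbers and TryParseInt64 are invented names, and the sketch assumes the imported text uses "." as the decimal separator. It parses through decimal so that strings already converted to scientific notation are accepted too.

```csharp
using System;
using System.Globalization;

public static class ImportedNumbers
{
    // Hypothetical helper: accepts "30829300" as well as "3.08293E+07".
    public static bool TryParseInt64(string text, out long result)
    {
        result = 0;
        decimal value;
        // NumberStyles.Float permits a decimal point and an exponent.
        if (!decimal.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
        {
            return false;
        }
        // Reject fractions and values outside the Int64 range.
        if (value != decimal.Truncate(value) || value < long.MinValue || value > long.MaxValue)
        {
            return false;
        }
        result = (long)value;
        return true;
    }
}
```

Note that this cannot restore digits that the scientific formatting already dropped: "5.0130370071e+012" parses to 5013037007100, not to the original 5013037007084.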
