Problems reading in an Excel file in C#

Problems reading in an Excel file in C# - c#

I'm reading an Excel file with OLDB Connection using this code
var connectionString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0; data source={0}; Extended Properties=Excel 8.0;", fileName);
var fileName = string.Format("{0}\\s23.xls", Directory.GetCurrentDirectory());
var adapter = new OleDbDataAdapter("SELECT * FROM [TEJ3$] ", connectionString);
DataTable dt=new DataTable();
adapter.Fill(dt, "Table1");
and after runing this code my data table is filled. But I have a column that has many string cells and few empty cells ; in excel file this cells have numeric values.
Someone has an idea?

Check the first examples here: http://www.connectionstrings.com/excel
What often goes wrong is that Excel will estimate the type of a column based upon the first X rows. When after that the values don't match, these rows get empty values. I'm afraid that going into the registry is sometime the only way to get the Excel driver to scan all rows first (as described in the connectionstrings.com article).
Play around with the HDR and IMEX settings in your environment. In some cases that will help as well.

I had this exact problem and solve it with using IMEX setting. In case others are wondering how to include the IMEX, here is what I have on my connection string
string connectionString = #"Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties='Excel 8.0;IMEX=1';Data Source={0};";
connectionString = string.Format(connectionString, excelWorkbookPath);

SpreadsheetGear for .NET will let you load Excel workbooks from C# and access the underlying cell values or the formatted values in any order no matter how the workbook is laid out or what the types of the cells are.
You can see live ASP.NET samples here and download the free trial here if you want to try it yourself.
Disclaimer: I own SpreadsheetGear LLC

Just make sure that your excel file is not open. Close your excel application & then start your program.
Sunil

Related

Excel data extraction - Issue with column data type

I am writing a C# library to read in Excel files (both xls and xlsx) and I'm coming across an issue.
Exactly the same as what was expressed in this question, if my Excel file has a column that has string values, but has a numeric value in the first row, the OLEDB provider assumes that column to be numeric and returns NULL for the values in that column that are not numeric.
I am aware that, as in the answer provided, I can make a change in the registry, but since this is a library I plan to use on many machines and don't want to change every user's registry values, I was wondering if there is a better solution.
Maybe a DB provider other than ACE.OLEDB (and it seems JET is no longer supported well enough to be considered)?
Also, since this needs to work on XLS / XLSX, options such as EPPlus / XML readers won't work for the xls version.

Your connection string should look like this
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=c:\myFolder\myExcelfile.xlsx;Extended Properties="Excel 12.0 Xml;HDR=YES;IMEX=1";
IMEX=1 in the connection string is the part that you need to treat the column as mixed datatype. This should work fine without the need to edit the registry.
HDR=Yes is simply to mark the first row as column headers and is not needed in your particular problem, however I've included it anyways.
To always use IMEX=1 is a safer way to retrieve data for mixed data columns.
Source: https://www.connectionstrings.com/excel/
Edit:
Here is the data I'm using:
Here is the output:
This is the exact code I used:
string connString = #"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\test.xlsx;Extended Properties=""Excel 12.0 Xml;HDR=YES;IMEX=1""";
using (DbClass db = new DbClass(connString))
{
var x = db.dataReader("SELECT * FROM [Sheet1$]");
while (x.Read())
{
for (int i = 0; i < x.FieldCount; i++)
Console.Write(x[i] + "\t");
Console.WriteLine("");
}
}
The DbClass is a simple wrapper I made in order to make life easier. It can be found here:
http://tech.reboot.pro/showthread.php?tid=4713

OleDbConnection only finds cell value when workbook is also open in Excel

I have a program (actually SSIS script task, but I don't suppose that matters) that creates an OLE DB connection to an Excel workbook, and reads the cell values in each worksheet, storing them in a SQL Server table.
Each worksheet has several sections of rows, each section being for a separate product. The first two rows of each product section are a quarter row, and a year row. Here is a screen shot:
I use an OleDbDataReader with a "Select *" command to read the data in each sheet into a DataTable. I have a column called "YearQuarter" in my SQL database, where I store a concatenation of the year row value and the preceding quarter row value, with a hyphen between the two strings:
My code is like this:
OleDbConnection oleExcelConnection = new OleDbConnection(
"Provider=Microsoft.ACE.OLEDB.12.0;" +
"Data Source=" + strWkbkFilePath + ";" +
"Mode=Read;" +
"Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"");
oleExcelConnection.Open();
DataTable dtCurrSheet = new DataTable();
// Name of table is in strLoadTblNm.
OleDbCommand oleExcelCommand;
OleDbDataReader oleExcelReader;
oleExcelCommand = excel_conn.CreateCommand();
oleExcelCommand.CommandText = "Select * From [" + strLoadTblNm + "]";
oleExcelCommand.CommandType = CommandType.Text;
oleExcelReader = oleExcelCommand.ExecuteReader();
// Load worksheet into data table
dtSheet.Load(oleExcelReader);
oleExcelReader.Close();
Looking at the output data, I noticed that I was getting inconsistent results. Some rows would have a YearQuarter column value that would have only the Year row value in them, while others would have the cell values from both rows. For example, I'd have "2009 - Year End" followed by just "2010", with no " - 1st Qtr." appended to it.
This is because that quarter cell valued is never loaded into the data reader, as the Dataset Visualizer shows:
Notice also that, in the Dataset, the column that is missing the Quarter cell value also has other numeric values missing their formatting (no commas).
If I save the file as a .csv, all cell values are preserved.
However, I noticed that it wasn't consistent. Sometimes I'd run my package and the same row would now have the full value. So, in the above example, I'd get "2010 - 1st Qtr."
I finally realized that it was working as expected only if I happened to have the workbook open in Excel at the same time that the program was running!
Why would this make a difference? Could it be that there is a macro or something in the workbook that is executed by Excel, but not when the workbook is accessed only via an OLE DB connection? Would the fact that it had been executed in Excel then affect the data obtained by OLE DB? If that's the case, how do I get around this? The spreadsheets are provided to me. So I can't modify them.

I think you're having issues with the auto-formatting thing Excel tries to apply. With an OLEDB connection, I can't see how having the sheet open fixes your problem (obviously very strange).
Try Adding IMEX = 1 to your connection options to treat the entire sheet as text to see if this is your issue. Pulled from OLEDB connection does not read data from excel sheet Also another good post from an external site: Tips for reading Excel spreadsheets using ADO.NET
Also, you're pulling data from an excel sheet and writing it to another excel sheet... Same workbook? I have a couple more ideas for ya though depending on your situation.

This bug turns out to be a "feature", and it should come with a big warning sign.
This article (thanks, #vb4all) explains that "ADO.NET scans the first 8 rows of data, and based on that, guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!"
In other words, it is treating the worksheet as a relation table, in which all values in a given column are of the same type. Of course, worksheet data is not bound by this restriction.
This behavior can be gotten around by setting IMEX=1 in the connection string options and then modifying these registry settings:
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/Typ
(Note: registry keys vary depending on 32 vs. 64 bit. E.g., for 64-bit, the first one would be HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Jet 4.0).
I think this was a very risky design, inviting data transfer errors that could easily go unnoticed.

Import Excel to DataTable string's are empty

To import excel to datatable, I am using the simple code:
string connectionString = string.Format("Provider=Microsoft.ACE.OLEDB.12.0; data source={0}; Extended Properties=Excel 12.0;", physicalFolder + FileUpload1.FileName);
OleDbDataAdapter adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connectionString);
DataSet ds = new DataSet();
When in one of the rows of excel, if my row looks like below
strings are ommited and my data set looks like this
However if I add some strings and if my upload looks like this:
Then my dataset looks like it does not omit the strings:

Try to change your oledbconnection string as following format:
Code Snippet
OleDbConnection con = new OleDbConnection(
#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\book1.xls;Extended Properties='Excel 8.0;HDR=Yes;IMEX=1'");
Note: "IMEX=1;" tells the driver to always read "intermixed" (numbers, dates, strings etc) data columns as text. Note that this option might affect excel sheet write access negative.

MD.Unicorn's answer is not 100% correct. Your OLEDB provider uses a settings named TypeGuessRows to determine how many rows are read to decide the data type of a column. Unfortunately this setting cannot be specified in the connection string and must be changed in the system registry. See this question for more details.

This is because the provider decides on the type of the column from first row of the column (the row after the header row). When first row contains a number, the type of column is double or another number type, so it cannot contain string values.
I tried every possible way (setting the table structure beforehand, using a DataReader, changing the format of the cell, ...) and they all failed. It seem to be the problem with Microsoft.Jet.OLEDB provider. I highly recomment you to use a third party excel reading library. There are plenty of open source libraries available.
If your file is a Excel 2007 (.xlsx) file, I highly recommend using EPPluse. It is also available as a NuGet package.
Otherwise, you can take a look at this answer to find a few more libraries.

use IMEX=1 in connection string. hope it will resolve this issue..

How to fix "Not a legal OleAut date." when reading an Excel file in C#?

I have been working with excel spreadsheets and so far I never had any problems with them.. But this error,"Not a legal OleAut date.", showed up out of the blue when I tried to read an excel file. Does anyone know how I can fix this. Here is the code I use to read the excel and put the data into a dataset. It has worked fine previously but after I made some changes (which doesn't involve dates) to the data source this error showed up.
var fileName = string.Format("C:\\Drafts\\Excel 97-2003 formats\\All Data 09 26 2012_Edited.xls");
var connectionString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0; data source={0}; Extended Properties=Excel 8.0;", fileName);
var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connectionString);
DataSet Originalds = new DataSet();
adapter.Fill(Originalds, "Employees"); // this is where the error shows up

I sort of figured out a work around to this problem I changed the connection string to the latest oleDB provider.
var connectionString = string.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties='Excel 12.0 Xml;HDR=YES;'", fileName);
I also made sure that the empty space after the last row of my excel data is not being read as a value.

Look for different date formats, in my case one of the column was in dd/mm/yyyy format and the other was in mm/dd/yyyy. Once the date formats were rectified data import worked.

In my case, I had a date in my table inserted manually with the wrong format, it was 01/01/0019 instead of 01/01/2019.

I have this issue while reading excel in oledb.
I found that in excel many date columns are empty in the beginning records. i had 6000 records in excel and one of the datecolumn has around first 4000 empty cells so excel is not able to understand what to convert the cell so i added one dummy record fill with date and process my file. Or you can move few records which has date value in it at top

How can you fix this? Hi!jack your data with a dummy row at row 1 and force the column(s) in question into a string (in this case only - it is a data type error, so apply the fix according to the type).
It is necessary to understand what the data adaptor does, which interprets the data type of each column by examining, by default, the first 8 rows of data (sans header if HDR=Yes in connect string) and deciding on a data type (it can be over-ridden -yes there is an override - in the connection string to 16 rows - almost never very helpful).
Data adaptors can do other nasty things, like skip strings in columns of mixed data types, like string/double (which is really just a string column, but not to the adaptor if the first rows are all double). It won't even give you the courtesy of an error in this example.
This often occurs in data coming from ERP sources that contains "Free Form" columns. User defined columns are the usual suspects. It can be very difficult to find in other data type issues. I once spent quite a bit of time resolving an issue with a column that typed as a string with a max length of 255 chars. Deep in the data there were cells that exceeded that length and threw errors.
If you don't want to advance to the level of "Genetic Engineering" in working with data adaptors, the fastest way to resolve an issue like this is to hi!jack your data and force the column(s) in question to the correct type (or incorrect, which you can then correct in your own code if need be). Plan B is to give the data back to the customer/user and tell them to correct it. Good luck with Plan B. There is a reason it isn't Plan A.
More on manipulating via the connection string and similar issues with the adaptor - but be wary, results are not going to be 100% fool proof. I've tested changing IMEX and HDR settings extensively. If you want to get through the project quickly, hi!jack the data. OleDB & mixed Excel datatypes : missing data
Here is another posting similar in context, note all of the possible time consuming solutions. I have yet to be convinced there is a better solution, and it simply defies the logic a programmer brings to the keyboard every morning. Too bad, you have a job to do, sometimes you have to be a hack. DateTime format mismatch on importing from Excel Sheet

Scientific notation when importing from Excel in .Net

I have a C#/.Net job that imports data from Excel and then processes it. Our client drops off the files and we process them. I don't have any control over the original file.
I use the OleDb library to fill up a dataset. The file contains some numbers like 30829300, 30071500, etc... The data type for those columns is "Text".
Those numbers are converted to scientific notation when I import the data. Is there anyway to prevent this from happening?

One workaround to this issue is to change your select statement, instead of SELECT * do this:
"SELECT Format([F1], 'General Number') From [Sheet1$]"
-or-
"SELECT Format([F1], \"#####\") From [Sheet1$]"
However, doing so will blow up if your cells contain more than 255 characters with the following error:
"Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done."
Fortunately my customer didn't care about erroring out in this scenario.
This page has a bunch of good things to try as well:
http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/

The OleDb library will, more often than not, mess up your data in an Excel spreadsheet. This is largely because it forces everything into a fixed-type column layout, guessing at the type of each column from the values in the first 8 cells in each column. If it guesses wrong, you end up with digit strings converted to scientific-notation. Blech!
To avoid this you're better off skipping the OleDb and reading the sheet directly yourself. You can do this using the COM interface of Excel (also blech!), or a third-party .NET Excel-compatible reader. SpreadsheetGear is one such library that works reasonably well, and has an interface that's very similar to Excel's COM interface.

Using this connection string:
Provider=Microsoft.ACE.OLEDB.12.0; data source={0}; Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\"
with Excel 2010 I have noticed the following. If the Excel file is open when you run the OLEDB SELECT then you get the current version of the cells, not the saved file values. Furthermore the string values returned for a long number, decimal value and date look like this:
5.0130370071e+012
4.08
36808
If the file is not open then the returned values are:
5013037007084
£4.08
Monday, October 09, 2000

If you look at the actual .XSLX file using Open XML SDK 2.0 Productivity Tool (or simply unzip the file and view the XML in notepad) you will see that Excel 2007 actually stores the raw data in scientific format.
For example 0.00001 is stored as 1.0000000000000001E-5
<x:c r="C18" s="11" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:v>1.0000000000000001E-5</x:v>
</x:c>
Looking at the cell in Excel its displayed as 0.00001 in both the cell and the formula bar. So it not always true that OleDB is causing the issue.

I have found that the easiest way is to choose Zip format, rather than text format for columns with large 'numbers'.

Have you tried casting the value of the field to (int) or perhaps (Int64) as you are reading it?

Look up the IMEX=1 connection string option and TypeGuessRows registry setting on google.
In truth, there is no easy way round this because the reader infers column data types by looking at the first few rows (8 by default). If the rows contain all numbers then you're out of luck.
An unfortunate workaround which I've used in the past is to use the HDR=NO connection string option and set the TypeGuessRows registry setting value to 1, which forces it to read the first row as valid data to make its datatype determination, rather than a header.
It's a hack, but it works. The code reads the first row (containing the header) as text, and then sets the datatype accordingly.
Changing the registry is a pain (and not always possible) but I'd recommend restoring the original value afterwards.
If your import data doesn't have a header row, then an alternative option is to pre-process the file and insert a ' character before each of the numbers in the offending column. This causes the column data to be treated as text.
So all in all, there are a bunch of hacks to work around this, but nothing really foolproof.

I had this same problem, but was able to work around it without resorting to the Excel COM interface or 3rd party software. It involves a little processing overhead, but appears to be working for me.
First read in the data to get the column names
Then create a new DataSet with each of these columns, setting each of their DataTypes to string.
Read the data in again into this new
dataset. Voila - the scientific
notation is now gone and everything is read in as a string.
Here's some code that illustrates this, and as an added bonus, it's even StyleCopped!
public void ImportSpreadsheet(string path)
{
string extendedProperties = "Excel 12.0;HDR=YES;IMEX=1";
string connectionString = string.Format(
CultureInfo.CurrentCulture,
"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1}\"",
path,
extendedProperties);
using (OleDbConnection connection = new OleDbConnection(connectionString))
{
using (OleDbCommand command = connection.CreateCommand())
{
command.CommandText = "SELECT * FROM [Worksheet1$]";
connection.Open();
using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
using (DataSet columnDataSet = new DataSet())
using (DataSet dataSet = new DataSet())
{
columnDataSet.Locale = CultureInfo.CurrentCulture;
adapter.Fill(columnDataSet);
if (columnDataSet.Tables.Count == 1)
{
var worksheet = columnDataSet.Tables[0];
// Now that we have a valid worksheet read in, with column names, we can create a
// new DataSet with a table that has preset columns that are all of type string.
// This fixes a problem where the OLEDB provider is trying to guess the data types
// of the cells and strange data appears, such as scientific notation on some cells.
dataSet.Tables.Add("WorksheetData");
DataTable tempTable = dataSet.Tables[0];
foreach (DataColumn column in worksheet.Columns)
{
tempTable.Columns.Add(column.ColumnName, typeof(string));
}
adapter.Fill(dataSet, "WorksheetData");
if (dataSet.Tables.Count == 1)
{
worksheet = dataSet.Tables[0];
foreach (var row in worksheet.Rows)
{
// TODO: Consume some data.
}
}
}
}
}
}
}

I got one solution from somewhere else but it worked perfectly for me.
No need to make any code change, just format excel columns cells to 'General" instead of any other formatting like "number" or "text", then even Select * from [$Sheet1] or Select Column_name from [$Sheet1] will read it perfectly even with large numeric values more than 9 digits

I googled around this state..
Here are my solulition steps
For template excel file
1-format Excel coloumn as Text
2- write macro to disable error warnings for Number -> text convertion
Private Sub Workbook_BeforeClose(Cancel As Boolean)
Application.ErrorCheckingOptions.BackgroundChecking = Ture
End Sub
Private Sub Workbook_Open()
Application.ErrorCheckingOptions.BackgroundChecking = False
End Sub
On codebehind
3- while reading data to import
try to parse incoming data to Int64 or Int32....

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.