Scientific notation when importing from Excel in .Net

Scientific notation when importing from Excel in .Net - c#

I have a C#/.Net job that imports data from Excel and then processes it. Our client drops off the files and we process them. I don't have any control over the original file.
I use the OleDb library to fill up a dataset. The file contains some numbers like 30829300, 30071500, etc... The data type for those columns is "Text".
Those numbers are converted to scientific notation when I import the data. Is there anyway to prevent this from happening?

One workaround to this issue is to change your select statement, instead of SELECT * do this:
"SELECT Format([F1], 'General Number') From [Sheet1$]"
-or-
"SELECT Format([F1], \"#####\") From [Sheet1$]"
However, doing so will blow up if your cells contain more than 255 characters with the following error:
"Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done."
Fortunately my customer didn't care about erroring out in this scenario.
This page has a bunch of good things to try as well:
http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/

The OleDb library will, more often than not, mess up your data in an Excel spreadsheet. This is largely because it forces everything into a fixed-type column layout, guessing at the type of each column from the values in the first 8 cells in each column. If it guesses wrong, you end up with digit strings converted to scientific-notation. Blech!
To avoid this you're better off skipping the OleDb and reading the sheet directly yourself. You can do this using the COM interface of Excel (also blech!), or a third-party .NET Excel-compatible reader. SpreadsheetGear is one such library that works reasonably well, and has an interface that's very similar to Excel's COM interface.

Using this connection string:
Provider=Microsoft.ACE.OLEDB.12.0; data source={0}; Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\"
with Excel 2010 I have noticed the following. If the Excel file is open when you run the OLEDB SELECT then you get the current version of the cells, not the saved file values. Furthermore the string values returned for a long number, decimal value and date look like this:
5.0130370071e+012
4.08
36808
If the file is not open then the returned values are:
5013037007084
£4.08
Monday, October 09, 2000

If you look at the actual .XSLX file using Open XML SDK 2.0 Productivity Tool (or simply unzip the file and view the XML in notepad) you will see that Excel 2007 actually stores the raw data in scientific format.
For example 0.00001 is stored as 1.0000000000000001E-5
<x:c r="C18" s="11" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:v>1.0000000000000001E-5</x:v>
</x:c>
Looking at the cell in Excel its displayed as 0.00001 in both the cell and the formula bar. So it not always true that OleDB is causing the issue.

I have found that the easiest way is to choose Zip format, rather than text format for columns with large 'numbers'.

Have you tried casting the value of the field to (int) or perhaps (Int64) as you are reading it?

Look up the IMEX=1 connection string option and TypeGuessRows registry setting on google.
In truth, there is no easy way round this because the reader infers column data types by looking at the first few rows (8 by default). If the rows contain all numbers then you're out of luck.
An unfortunate workaround which I've used in the past is to use the HDR=NO connection string option and set the TypeGuessRows registry setting value to 1, which forces it to read the first row as valid data to make its datatype determination, rather than a header.
It's a hack, but it works. The code reads the first row (containing the header) as text, and then sets the datatype accordingly.
Changing the registry is a pain (and not always possible) but I'd recommend restoring the original value afterwards.
If your import data doesn't have a header row, then an alternative option is to pre-process the file and insert a ' character before each of the numbers in the offending column. This causes the column data to be treated as text.
So all in all, there are a bunch of hacks to work around this, but nothing really foolproof.

I had this same problem, but was able to work around it without resorting to the Excel COM interface or 3rd party software. It involves a little processing overhead, but appears to be working for me.
First read in the data to get the column names
Then create a new DataSet with each of these columns, setting each of their DataTypes to string.
Read the data in again into this new
dataset. Voila - the scientific
notation is now gone and everything is read in as a string.
Here's some code that illustrates this, and as an added bonus, it's even StyleCopped!
public void ImportSpreadsheet(string path)
{
string extendedProperties = "Excel 12.0;HDR=YES;IMEX=1";
string connectionString = string.Format(
CultureInfo.CurrentCulture,
"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1}\"",
path,
extendedProperties);
using (OleDbConnection connection = new OleDbConnection(connectionString))
{
using (OleDbCommand command = connection.CreateCommand())
{
command.CommandText = "SELECT * FROM [Worksheet1$]";
connection.Open();
using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
using (DataSet columnDataSet = new DataSet())
using (DataSet dataSet = new DataSet())
{
columnDataSet.Locale = CultureInfo.CurrentCulture;
adapter.Fill(columnDataSet);
if (columnDataSet.Tables.Count == 1)
{
var worksheet = columnDataSet.Tables[0];
// Now that we have a valid worksheet read in, with column names, we can create a
// new DataSet with a table that has preset columns that are all of type string.
// This fixes a problem where the OLEDB provider is trying to guess the data types
// of the cells and strange data appears, such as scientific notation on some cells.
dataSet.Tables.Add("WorksheetData");
DataTable tempTable = dataSet.Tables[0];
foreach (DataColumn column in worksheet.Columns)
{
tempTable.Columns.Add(column.ColumnName, typeof(string));
}
adapter.Fill(dataSet, "WorksheetData");
if (dataSet.Tables.Count == 1)
{
worksheet = dataSet.Tables[0];
foreach (var row in worksheet.Rows)
{
// TODO: Consume some data.
}
}
}
}
}
}
}

I got one solution from somewhere else but it worked perfectly for me.
No need to make any code change, just format excel columns cells to 'General" instead of any other formatting like "number" or "text", then even Select * from [$Sheet1] or Select Column_name from [$Sheet1] will read it perfectly even with large numeric values more than 9 digits

I googled around this state..
Here are my solulition steps
For template excel file
1-format Excel coloumn as Text
2- write macro to disable error warnings for Number -> text convertion
Private Sub Workbook_BeforeClose(Cancel As Boolean)
Application.ErrorCheckingOptions.BackgroundChecking = Ture
End Sub
Private Sub Workbook_Open()
Application.ErrorCheckingOptions.BackgroundChecking = False
End Sub
On codebehind
3- while reading data to import
try to parse incoming data to Int64 or Int32....

Related

Excel data extraction - Issue with column data type

I am writing a C# library to read in Excel files (both xls and xlsx) and I'm coming across an issue.
Exactly the same as what was expressed in this question, if my Excel file has a column that has string values, but has a numeric value in the first row, the OLEDB provider assumes that column to be numeric and returns NULL for the values in that column that are not numeric.
I am aware that, as in the answer provided, I can make a change in the registry, but since this is a library I plan to use on many machines and don't want to change every user's registry values, I was wondering if there is a better solution.
Maybe a DB provider other than ACE.OLEDB (and it seems JET is no longer supported well enough to be considered)?
Also, since this needs to work on XLS / XLSX, options such as EPPlus / XML readers won't work for the xls version.

Your connection string should look like this
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=c:\myFolder\myExcelfile.xlsx;Extended Properties="Excel 12.0 Xml;HDR=YES;IMEX=1";
IMEX=1 in the connection string is the part that you need to treat the column as mixed datatype. This should work fine without the need to edit the registry.
HDR=Yes is simply to mark the first row as column headers and is not needed in your particular problem, however I've included it anyways.
To always use IMEX=1 is a safer way to retrieve data for mixed data columns.
Source: https://www.connectionstrings.com/excel/
Edit:
Here is the data I'm using:
Here is the output:
This is the exact code I used:
string connString = #"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\test.xlsx;Extended Properties=""Excel 12.0 Xml;HDR=YES;IMEX=1""";
using (DbClass db = new DbClass(connString))
{
var x = db.dataReader("SELECT * FROM [Sheet1$]");
while (x.Read())
{
for (int i = 0; i < x.FieldCount; i++)
Console.Write(x[i] + "\t");
Console.WriteLine("");
}
}
The DbClass is a simple wrapper I made in order to make life easier. It can be found here:
http://tech.reboot.pro/showthread.php?tid=4713

OleDbConnection only finds cell value when workbook is also open in Excel

I have a program (actually SSIS script task, but I don't suppose that matters) that creates an OLE DB connection to an Excel workbook, and reads the cell values in each worksheet, storing them in a SQL Server table.
Each worksheet has several sections of rows, each section being for a separate product. The first two rows of each product section are a quarter row, and a year row. Here is a screen shot:
I use an OleDbDataReader with a "Select *" command to read the data in each sheet into a DataTable. I have a column called "YearQuarter" in my SQL database, where I store a concatenation of the year row value and the preceding quarter row value, with a hyphen between the two strings:
My code is like this:
OleDbConnection oleExcelConnection = new OleDbConnection(
"Provider=Microsoft.ACE.OLEDB.12.0;" +
"Data Source=" + strWkbkFilePath + ";" +
"Mode=Read;" +
"Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"");
oleExcelConnection.Open();
DataTable dtCurrSheet = new DataTable();
// Name of table is in strLoadTblNm.
OleDbCommand oleExcelCommand;
OleDbDataReader oleExcelReader;
oleExcelCommand = excel_conn.CreateCommand();
oleExcelCommand.CommandText = "Select * From [" + strLoadTblNm + "]";
oleExcelCommand.CommandType = CommandType.Text;
oleExcelReader = oleExcelCommand.ExecuteReader();
// Load worksheet into data table
dtSheet.Load(oleExcelReader);
oleExcelReader.Close();
Looking at the output data, I noticed that I was getting inconsistent results. Some rows would have a YearQuarter column value that would have only the Year row value in them, while others would have the cell values from both rows. For example, I'd have "2009 - Year End" followed by just "2010", with no " - 1st Qtr." appended to it.
This is because that quarter cell valued is never loaded into the data reader, as the Dataset Visualizer shows:
Notice also that, in the Dataset, the column that is missing the Quarter cell value also has other numeric values missing their formatting (no commas).
If I save the file as a .csv, all cell values are preserved.
However, I noticed that it wasn't consistent. Sometimes I'd run my package and the same row would now have the full value. So, in the above example, I'd get "2010 - 1st Qtr."
I finally realized that it was working as expected only if I happened to have the workbook open in Excel at the same time that the program was running!
Why would this make a difference? Could it be that there is a macro or something in the workbook that is executed by Excel, but not when the workbook is accessed only via an OLE DB connection? Would the fact that it had been executed in Excel then affect the data obtained by OLE DB? If that's the case, how do I get around this? The spreadsheets are provided to me. So I can't modify them.

I think you're having issues with the auto-formatting thing Excel tries to apply. With an OLEDB connection, I can't see how having the sheet open fixes your problem (obviously very strange).
Try Adding IMEX = 1 to your connection options to treat the entire sheet as text to see if this is your issue. Pulled from OLEDB connection does not read data from excel sheet Also another good post from an external site: Tips for reading Excel spreadsheets using ADO.NET
Also, you're pulling data from an excel sheet and writing it to another excel sheet... Same workbook? I have a couple more ideas for ya though depending on your situation.

This bug turns out to be a "feature", and it should come with a big warning sign.
This article (thanks, #vb4all) explains that "ADO.NET scans the first 8 rows of data, and based on that, guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!"
In other words, it is treating the worksheet as a relation table, in which all values in a given column are of the same type. Of course, worksheet data is not bound by this restriction.
This behavior can be gotten around by setting IMEX=1 in the connection string options and then modifying these registry settings:
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/Typ
(Note: registry keys vary depending on 32 vs. 64 bit. E.g., for 64-bit, the first one would be HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Jet 4.0).
I think this was a very risky design, inviting data transfer errors that could easily go unnoticed.

Import Excel to DataTable string's are empty

To import excel to datatable, I am using the simple code:
string connectionString = string.Format("Provider=Microsoft.ACE.OLEDB.12.0; data source={0}; Extended Properties=Excel 12.0;", physicalFolder + FileUpload1.FileName);
OleDbDataAdapter adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connectionString);
DataSet ds = new DataSet();
When in one of the rows of excel, if my row looks like below
strings are ommited and my data set looks like this
However if I add some strings and if my upload looks like this:
Then my dataset looks like it does not omit the strings:

Try to change your oledbconnection string as following format:
Code Snippet
OleDbConnection con = new OleDbConnection(
#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\book1.xls;Extended Properties='Excel 8.0;HDR=Yes;IMEX=1'");
Note: "IMEX=1;" tells the driver to always read "intermixed" (numbers, dates, strings etc) data columns as text. Note that this option might affect excel sheet write access negative.

MD.Unicorn's answer is not 100% correct. Your OLEDB provider uses a settings named TypeGuessRows to determine how many rows are read to decide the data type of a column. Unfortunately this setting cannot be specified in the connection string and must be changed in the system registry. See this question for more details.

This is because the provider decides on the type of the column from first row of the column (the row after the header row). When first row contains a number, the type of column is double or another number type, so it cannot contain string values.
I tried every possible way (setting the table structure beforehand, using a DataReader, changing the format of the cell, ...) and they all failed. It seem to be the problem with Microsoft.Jet.OLEDB provider. I highly recomment you to use a third party excel reading library. There are plenty of open source libraries available.
If your file is a Excel 2007 (.xlsx) file, I highly recommend using EPPluse. It is also available as a NuGet package.
Otherwise, you can take a look at this answer to find a few more libraries.

use IMEX=1 in connection string. hope it will resolve this issue..

How to fix "Not a legal OleAut date." when reading an Excel file in C#?

I have been working with excel spreadsheets and so far I never had any problems with them.. But this error,"Not a legal OleAut date.", showed up out of the blue when I tried to read an excel file. Does anyone know how I can fix this. Here is the code I use to read the excel and put the data into a dataset. It has worked fine previously but after I made some changes (which doesn't involve dates) to the data source this error showed up.
var fileName = string.Format("C:\\Drafts\\Excel 97-2003 formats\\All Data 09 26 2012_Edited.xls");
var connectionString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0; data source={0}; Extended Properties=Excel 8.0;", fileName);
var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connectionString);
DataSet Originalds = new DataSet();
adapter.Fill(Originalds, "Employees"); // this is where the error shows up

I sort of figured out a work around to this problem I changed the connection string to the latest oleDB provider.
var connectionString = string.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties='Excel 12.0 Xml;HDR=YES;'", fileName);
I also made sure that the empty space after the last row of my excel data is not being read as a value.

Look for different date formats, in my case one of the column was in dd/mm/yyyy format and the other was in mm/dd/yyyy. Once the date formats were rectified data import worked.

In my case, I had a date in my table inserted manually with the wrong format, it was 01/01/0019 instead of 01/01/2019.

I have this issue while reading excel in oledb.
I found that in excel many date columns are empty in the beginning records. i had 6000 records in excel and one of the datecolumn has around first 4000 empty cells so excel is not able to understand what to convert the cell so i added one dummy record fill with date and process my file. Or you can move few records which has date value in it at top

How can you fix this? Hi!jack your data with a dummy row at row 1 and force the column(s) in question into a string (in this case only - it is a data type error, so apply the fix according to the type).
It is necessary to understand what the data adaptor does, which interprets the data type of each column by examining, by default, the first 8 rows of data (sans header if HDR=Yes in connect string) and deciding on a data type (it can be over-ridden -yes there is an override - in the connection string to 16 rows - almost never very helpful).
Data adaptors can do other nasty things, like skip strings in columns of mixed data types, like string/double (which is really just a string column, but not to the adaptor if the first rows are all double). It won't even give you the courtesy of an error in this example.
This often occurs in data coming from ERP sources that contains "Free Form" columns. User defined columns are the usual suspects. It can be very difficult to find in other data type issues. I once spent quite a bit of time resolving an issue with a column that typed as a string with a max length of 255 chars. Deep in the data there were cells that exceeded that length and threw errors.
If you don't want to advance to the level of "Genetic Engineering" in working with data adaptors, the fastest way to resolve an issue like this is to hi!jack your data and force the column(s) in question to the correct type (or incorrect, which you can then correct in your own code if need be). Plan B is to give the data back to the customer/user and tell them to correct it. Good luck with Plan B. There is a reason it isn't Plan A.
More on manipulating via the connection string and similar issues with the adaptor - but be wary, results are not going to be 100% fool proof. I've tested changing IMEX and HDR settings extensively. If you want to get through the project quickly, hi!jack the data. OleDB & mixed Excel datatypes : missing data
Here is another posting similar in context, note all of the possible time consuming solutions. I have yet to be convinced there is a better solution, and it simply defies the logic a programmer brings to the keyboard every morning. Too bad, you have a job to do, sometimes you have to be a hack. DateTime format mismatch on importing from Excel Sheet

How to force ADO.Net to use only the System.String DataType in the readers TableSchema

I am using an OleDbConnection to query an Excel 2007 Spreadsheet. I want force the OleDbDataReader to use only string as the column datatype.
The system is looking at the first 8 rows of data and inferring the data type to be Double. The problem is that on row 9 I have a string in that column and the OleDbDataReader is returning a Null value since it could not be cast to a Double.
I have used these connection strings:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source="ExcelFile.xlsx";Persist Security Info=False;Extended Properties="Excel 12.0;IMEX=1;HDR=No"
Provider=Microsoft.Jet.OLEDB.4.0;Data Source="ExcelFile.xlsx";Persist Security Info=False;Extended Properties="Excel 8.0;HDR=No;IMEX=1"
Looking at the reader.GetSchemaTable().Rows[7].ItemArray[5], it's dataType is Double.
Row 7 in this schema correlates with the specific column in Excel I am having issues with. ItemArray[5] is its DataType column
Is it possible to create a custom TableSchema for the reader so when accessing the ExcelFiles, I can treat all cells as text instead of letting the system attempt to infer the datatype?
I found some good info at this page: Tips for reading Excel spreadsheets using ADO.NET
The main quirk about the ADO.NET interface is how datatypes are handled. (You'll notice I've been carefully avoiding the question of which datatypes are returned when reading the spreadsheet.) Are you ready for this? ADO.NET scans the first 8 rows of data, and based on that guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!
Thank you,
Keith
Here is a reduced version of my code:
using (OleDbConnection connection = new OleDbConnection(BuildConnectionString(dataMapper).ToString()))
{
connection.Open();
using (OleDbCommand cmd = new OleDbCommand())
{
cmd.Connection = connection;
cmd.CommandText = SELECT * from [Sheet1$];
using (OleDbDataReader reader = cmd.ExecuteReader())
{
using (DataTable dataTable = new DataTable("TestTable"))
{
dataTable.Load(reader);
base.SourceDataSet.Tables.Add(dataTable);
}
}
}
}

As you have discovered, OLEDB uses Jet which is limited in the manner in which it can be tweaked. If you are set on using an OleDbConnection to read from an Excel file, then you need to set the HKLM\...\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows value to zero so that the system will scan the entire resultset.
That said, if you are open to using an alternative engine to read from an Excel file, you might consider trying the ExcelDataReader. It reads all columns as strings but will let you use dataReader.Getxxx methods to get typed values. Here's a sample that fills a DataSet:
DataSet result;
const string path = #"....\Test.xlsx";
using ( var fileStream = new FileStream( path, FileMode.Open, FileAccess.Read ) )
{
using ( var excelReader = ExcelReaderFactory.CreateOpenXmlReader( fileStream ) )
{
excelReader.IsFirstRowAsColumnNames = true;
result = excelReader.AsDataSet();
}
}

Note for 64bit OS it is here:
My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel

Check out the final answer on this page.
Just noticed the page you refer to says the same thing ...
Update:
The problem seems to be with the JET engine itself and not ADO. Once JET decides on the type, it sticks to it. Anything done after that has no effect; like casting the values to string in the SQL (e.g. Cstr([Column])) just results in an empty string being returned.
At this point (if there are no other answers) I'd opt for other methods: modifying the spreadsheet; modifying registry (not ideal since you will be messing with the settings for every other app the uses JET); Excel automation or a third party component that does not use JET.
If Automation option is to slow then maybe just use it to save the spreadsheet in a different format which is easier to handle.

I have faced the same issue and determined that this is something that many people commonly experience. Here are a number of solutions that have been suggested, many of which I have attempted to implement:
Add the following to your connection string(Source):
TypeGuessRows=0;ImportMixedTypes=Text
Add the following to your connection string(Source, More Discussion, Even More):
IMEX=1;HDR=NO;
Edit the following registry settings, disable "TypeGuessRows", and "ImportMixedTypes" set to "Text"(Source, Not Recommended, More Documentation):
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/TypeGuessRows
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes
Consider using an alternative library for reading the excel file:
EPPlus
ExcelDataReader (also suggested be #Thomas)
OpenXml
Format all data in the source file as Text(at least the first 8 rows), though I understand that's typically impractical(Source, though this is relation to SSIS, but it's the same concepts)
Use a Schema.ini file to define the data type before importing the file, I found this in relation to using "Jet.OleDb" directly, maybe requiring you to modifying your connection string. This may only be applicable to CSV's I have not tried this approach.(Source, Related Post)
None of these have worked for me(though I believe they have worked for others). I am of the opinion expressed by #Asher that there is really no good solution to this problem. In my software I simply display an error message to the user(if any required column contain empty values) instructing them to format all columns as "Text".
Honestly, I think this book is more applicable to situation. The issue, already stated multiple times is:
"The data type at the destination is varchar but the assumed data
type of "double" nullifies any data that doesn't fit."(Source)
"But the problem is actually with the OLEDBDataReader. The problem
is that if it sees mostly numbers in a column, it assumes everything
is a number - if a row item being read is not a number, it simply
sets it to null! Ouch!"(Source)
"The problem seems to be with the JET engine itself and not ADO. Once
JET decides on the type, it sticks to it."(#Asher)
While I haven't found any of this documented in an official capacity I think that it's very clear that this is an intentional design decision and simply how the Jet Database Library works. I hesitate to call this library entirely useless because I think for many people some of these solutions do work, but so far for my project, I have come to the conclusion that this library cannot read multiple data types in a single column and is ill suited for general data retrieval.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.