I am trying to read uploaded Excel content, but with both of the following setups I cannot use reader.GetInt32(1):
I set the Excel field to "General" format and then input the number 1.
--> Throws InvalidCastException
I set the Excel field to "Number" format and then input the number 1.
--> Throws InvalidCastException
So I checked the Excel field type and found out it was a double:
var fieldType = reader.GetFieldType(1);
I have two questions:
Is it a setting in Excel itself that causes the field type to come back as a double?
How can I read the Excel fields as int in my case?
The values in these fields must be inserted into the database as int.
Try this
try
{
    // Unbox to double first; a direct (int) unbox of a boxed double throws InvalidCastException
    int aaa = (int)(double)reader.GetValue(1);
}
catch
{
    // Do something here to log conversion errors
}
Be aware that ADO.NET reads the top rows of the Excel file to decide the columns' data types. If you have text at, say, row 200, but the first rows are numeric, the column will be typed as numeric and you'll get an error when you read the 200th row.
Also, you can use EPPlus as an Excel reader and avoid ADO.NET. In your case, reading an integer value could be:
int aaa = sheet.Cells[r, 1].GetValue<int>();
The following solves my second problem, but I'm still curious about the first question:
Convert.ToInt32(reader.GetValue(1));
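As to why GetInt32 and a direct cast fail while Convert.ToInt32 works: the provider hands the value back as a boxed double, and unboxing only succeeds for the exact runtime type. A minimal standalone sketch (the boxed 1.0 stands in for what the reader returns):

```csharp
using System;

class BoxedDoubleDemo
{
    static void Main()
    {
        object value = 1.0; // what the Excel provider hands back: a boxed double

        // int bad = (int)value;  // would throw InvalidCastException: unboxing requires the exact type

        int viaConvert = Convert.ToInt32(value);   // Convert inspects the runtime type and converts
        int viaUnbox = (int)(double)value;         // unbox to double first, then cast

        Console.WriteLine(viaConvert);   // 1
        Console.WriteLine(viaUnbox);     // 1
    }
}
```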
Related
I've been using LinqToExcel to import data from .xlsx files successfully for a while. Recently, however, I was sent a .csv file that I'm unable to read the data of.
Let's say that the file contains the following data:
Col1 Col2 Col3
A B C
D E F
I've created a class for mapping the columns as such:
public class Test
{
    [ExcelColumn("Col1")]
    public string Col1 { get; set; }

    [ExcelColumn("Col2")]
    public string Col2 { get; set; }

    [ExcelColumn("Col3")]
    public string Col3 { get; set; }
}
Then I try to read the data like so:
var test = from c in excel.Worksheet<Test>()
select c;
The query successfully returns two Test-objects, but all property values are null.
I even tried to read the data without class and header:
var test = from c in excel.WorksheetNoHeader()
select c;
In this case the query also returns two rows, each with three cells/values, but again all of the values are null. What could be the issue here?
I should also note that the file opens and looks perfectly fine in Excel. Furthermore, using StreamReader I'm able to read all of its rows and values.
What type of data is in each of those columns? (string, numeric, ...)
According to Initializing the Microsoft Excel driver
TypeGuessRows
The number of rows to be checked for the data type. The data type is
determined given the maximum number of kinds of data found. If there
is a tie, the data type is determined in the following order: Number,
Currency, Date, Text, Boolean. If data is encountered that does not
match the data type guessed for the column, it is returned as a Null
value. On import, if a column has mixed data types, the entire column
will be cast according to the ImportMixedTypes setting. The default
number of rows to be checked is 8. Values are of type REG_DWORD.
See post Can I specify the data type for a column rather than letting linq-to-excel decide?
The post Setting TypeGuessRows for excel ACE Driver states how to change the value for TypeGuessRows.
When the driver determines that an Excel column contains text data,
the driver selects the data type (string or memo) based on the longest
value that it samples. If the driver does not discover any values
longer than 255 characters in the rows that it samples, it treats the
column as a 255-character string column instead of a memo column.
Therefore, values longer than 255 characters may be truncated. To
import data from a memo column without truncation, you must make sure
that the memo column in at least one of the sampled rows contains a
value longer than 255 characters, or you must increase the number of
rows sampled by the driver to include such a row. You can increase the
number of rows sampled by increasing the value of TypeGuessRows under
the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel
registry key.
One more thing we need to keep in mind is that the registry
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows
only applies to Excel 97-2003. For Excel 2007 and higher versions,
Excel Open XML (.XLSX extension) actually uses the ACE OLE DB provider
rather than the JET provider. If you want to keep the file extension as .XLSX,
you need to modify the following registry key according to your Excel
version:
Excel 2007: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
Excel 2010: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
Excel 2013: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\15.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
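For reference, adjusting the sampled-row count via a .reg file might look like the following sketch (shown for the Excel 2010 / ACE key above; a value of 0 is commonly documented to mean "scan all rows", at a cost in import speed):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel]
"TypeGuessRows"=dword:00000000
```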
Did you try to materialize your query by calling ToList or ToArray at the end?
I tried to recreate your case and had no trouble reading the data from the Excel file using the following code snippet:
var excel = new ExcelQueryFactory(FilePath);
var excel = new ExcelQueryFactory(FilePath);
List<Test> tests = (from c in excel.Worksheet<Test>()
                    select c).ToList();
It returns two objects with all properties filled properly.
One minor thing, when I added ToList initially, I got the following exception:
The 'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine.
Which, according to what they say in the official docs, seems reasonable since I was missing the Microsoft Access Database Engine 2010 Redistributable on my machine.
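A side note on why the ToList call can change the outcome: LINQ queries are deferred, so nothing is actually read until you enumerate. A tiny sketch of deferred execution in general (plain LINQ-to-objects, not LinqToExcel itself):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static IEnumerable<int> Source()
    {
        Console.WriteLine("reading...");   // runs only when the query is enumerated
        yield return 1;
        yield return 2;
    }

    static void Main()
    {
        var query = from n in Source() select n;   // nothing has executed yet
        Console.WriteLine("query built");

        List<int> items = query.ToList();          // enumeration (and "reading...") happens here
        Console.WriteLine(items.Count);            // 2
    }
}
```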
This is my first post, so sorry if it doesn't look good or if the formatting is weird.
Anyways, I need to find a way to get the correct value of numeric (non-string) cells using OpenXML, but with some spreadsheets my current method doesn't seem to work.
There seems to be a difference in how I need to code for varying spreadsheets.
Once I've accessed an Excel file and have opened one of its Sheets I get the number of columns and rows for the current sheet and begin my loops. Within these loops is the code below that is used to add each cell to the datatable using the Cell Reference (made using the parent loops) as the guide. This is the looped through code:
// Build the cell index/location of the cell (for example: B10, E17, AE14, ...) based on our row/column indexing so far
cellRef = ConvertColNumToLetter(startColNum + column) + (y + StartRow).ToString();

// Get the specific cell at the cell index of our cellRef
workingCell = cells.Where(c => c.CellReference == cellRef).FirstOrDefault();

// Use a try-catch to solve the issue of the program failing whenever the cell has no inner text to take
try
{
    if (workingCell.InnerText == null)
        value = "";
    else
        value = workingCell.InnerText;

    // If the cell's data is a string
    if (workingCell.DataType != null && workingCell.DataType.Value == CellValues.SharedString)
    {
        var stringTable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
        if (stringTable != null)
            value = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
        dataRow[colIterator] = value.ToString();
        colIterator++;
    }
    // If the cell's data is numeric
    else if (workingCell.CellValue != null)
    {
        value = workingCell.CellValue.InnerText.ToString();
        dataRow[colIterator] = value;
        colIterator++;
    }
    // If the cell doesn't have any data
    else
    {
        dataRow[colIterator] = "";
        colIterator++;
    }
}
// This will only be reached if the cell didn't have any data / the data is null
catch
{
    dataRow[colIterator] = "";
    colIterator++;
}
My code works just fine for 99% of the spreadsheets it encounters. My only problem is with numeric values not being accessed correctly on one specific spreadsheet but it seems to get the correct numeric values on all other spreadsheets.
With the normal spreadsheets, whatever numeric value is stored in the formula is returned. As an example, here is a cell's value from one such spreadsheet: "31313308.87". When my program reaches that cell, here are the workingCell's details (the workingCell is the cell at the current cell reference location):
which as you can see are the values that I would want and would go through the program and store the correct values into the datatable.
But the spreadsheet that's giving me issues is strange. When I get to one of the numeric cells I end up just getting the completely wrong values. Opening up the actual spreadsheet, there is a numeric cell with a value of "502028.6", but my program gets this in the workingCell's details:
I'm not sure where the "279907" comes from or how I should go about getting the correct values.
Here's another strange thing. The next three values in the spreadsheet are zeros, which all return 0 in the CellValue and InnerText of the workingCell. But the next value in the spreadsheet is "-160", and the workingCell's details show:
1500 in the CellValue and InnerText?
Again, this is the only spreadsheet that this is happening on (or at least the only one that I know of). The program returns the correct string values as well, but on this one specific spreadsheet, none of the numeric values seem to be picked up with my program (despite it working and returning the correct numeric values for the hundreds of other spreadsheets I've used it on so far).
Is there some simple fix that I'm just not seeing? I would love any pointers or suggestions you may have for me, and I will be happy to clarify anything if you need some additional information.
Thank you for all of your help in advance!
--EDIT:
I forgot to specify when I asked the question, but the incorrect data (the incorrect CellValues and InnerTexts) is actually successfully processed by the else-if branch and then added to the datatable. I don't have any problem as far as adding values to the datatable goes; I just can't seem to get the correct CellValues and InnerTexts for numeric cells in the one specific spreadsheet.
I am reading an .xlsx file via OLEDB. There are some rows where a column (containing a date-string) returns null, and some rows where the column (also containing a date-string) returns the date-string. In Excel the column type is set to "date".
Here is my connection-string:
$"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={PATH_TO_FILE};Extended Properties=\"Excel 12.0 Xml;HDR=NO\""
Here is the command-text to query the data:
$"SELECT * FROM [SHEET_NAME$A4:BC] WHERE F1 IS NOT NULL"
Here is how I read the data from the data record:
var test = dataRecord.GetValue(dataRecord.GetOrdinal("F39"));
Here are some examples of what the inspector shows me when test contains the date-string:
{07.01.1975 00:00:00}
{03.08.1987 00:00:00}
{03.10.1988 00:00:00}
{01.05.1969 00:00:00}
{20.12.2016 00:00:00}
{18.07.2011 00:00:00}
In other cases the inspector only shows:
{}
Here is a screenshot from the xlsx document where I have marked in red a line where the return value is empty, and in green where the actual date-string is returned:
The date-strings are formatted like dd.mm.yyyy
Why do these rows return an empty value instead of the date-string?
As suggested by AndyG, I have checked whether the date-string values might fail depending on the format ("dd.mm.yyyy" vs. "mm.dd.yyyy"). But there are cases which are invalid for "mm.dd.yyyy" that don't fail.
I was not able to solve the problem, but was able to bypass it by changing the column type in Excel to text.
I had to copy the whole xlsx file, delete the content of the copy, set the column type to text, then copy the content from the first file and paste it into the second file. Otherwise Excel was changing the date-strings to the numbers it uses internally to store dates.
Now I can read the cells correctly.
Two years too late but after struggling with this for the better part of several hours I hope this might help someone:
It sounds likely that the first row in your Excel document contains column names and not actual data, which means those cells are of a different Excel data type (General/Text vs. DateTime).
The fix to handle is pretty simple - adjust your connection string to reflect this using the HDR property in Extended Properties:
$"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={PATH_TO_FILE};Extended Properties=\"Excel 12.0 Xml;HDR=YES\""
HDR=YES means the first row contains field names.
You can read more about it here:
https://www.connectionstrings.com/ace-oledb-12-0/
Additionally, if you are encountering this on odd lines in your document like the OP, ensure that the data type is identical for the entire column, except your column titles if using HDR=YES.
Excel can sometimes flip DateTime fields to General fields, which would cause this behavior.
I am using the LinqToExcel library to get access to one of my Excel sheets. The problem I ran into is that my call can't find a column with a specific name.
public IQueryable<Cell> getExcel()
{
    var excel = new ExcelQueryFactory();
    excel.FileName = @"C:\Users\Timsen\Desktop\QUOTATION.CSV";
    var indianaCompanies = from c in excel.Worksheet() select c["ARTICLE TEXT"];
    return indianaCompanies;
}
Error :
base {System.SystemException} = {"'ARTICLE TEXT' column name does not exist. Valid column names are 'QUOT NO;DEBTOR;ITEM;ART NO;HWS NO#;BRANCH PRICE;QTY;PR;ARTICLE T', 'F2', 'F3', 'F4', 'F5'"}
Column names in the Excel file:
QUOT NO
DEBTOR
ITEM
ART NO
HWS NO.
BRANCH PRICE
QTY
PR
ARTICLE TEXT
TYPE NAME
SALES PRICE
QT%
DIS
AMOUNT
UNI
B
ARTG
SUPPL
DUTY
UPDATE: sample of the Excel file:
Can you show us the first line or two of the csv file?
If I'm interpreting the error message correctly, the header line has semicolons instead of commas for separators.
Specifically, the error message appears to list these as the column names (note that it's using single quotes and commas to try and make it clear, which seems useful).
'QUOT NO;DEBTOR;ITEM;ART NO;HWS NO#;BRANCH PRICE;QTY;PR;ARTICLE T'
'F2'
'F3'
'F4'
'F5'
Since that first column name is 64 characters, I'm assuming it cut off at that point and the rest of the columns would be in there as well (still semicolon-delimited) if that limit wasn't in place.
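Splitting that reported name on the semicolons recovers the columns the header line was meant to define (a quick standalone check):

```csharp
using System;

class HeaderSplitDemo
{
    static void Main()
    {
        // The single fused column name from the error message (truncated by the driver at 64 chars)
        string fused = "QUOT NO;DEBTOR;ITEM;ART NO;HWS NO#;BRANCH PRICE;QTY;PR;ARTICLE T";

        foreach (var name in fused.Split(';'))
            Console.WriteLine(name);   // QUOT NO, DEBTOR, ITEM, ... ARTICLE T
    }
}
```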
Not sure off-hand if you can specify a different delimiter with the linq-to-excel project or not, since it appears to use Jet for csv files, as per https://github.com/paulyoder/LinqToExcel/blob/792e0807b2cf2cb6b74f55565ad700d2fcf31e19/src/LinqToExcel/Query/ExcelUtilities.cs
If making it a 'real' csv isn't an option and the library doesn't support specifying an alternate delimiter, you might just be able to get the articles text by going through the lines in the file (except the first) and pull out the 12th column (since that appears to be the article text).
So, something like:
var articleTextValues =
    // Skip(1) since we don't want the header
    from line in File.ReadAllLines(@"C:\Users\Timsen\Desktop\QUOTATION.CSV").Skip(1)
    select line.Split(';')[11];
Change your code to this:
public IQueryable<Cell> getExcel()
{
    var excel = new ExcelQueryFactory();
    excel.FileName = @"C:\Users\Timsen\Desktop\QUOTATION.CSV";
    var indianaCompanies = from c in excel.Worksheet() select c["ARTICLE T"];
    return indianaCompanies;
}
It lists the valid column names. It is having problems with some of the column headers, indicated by the fact that QTY;PR; means it parsed two different columns as one. F2 indicates it does not know what the header should actually be called.
The simplest solution is to verify that the data being imported by the following query matches your Excel document:
var indianaCompanies = from c in excel.Worksheet() select c;
I believe that will work.
I have a C#/.Net job that imports data from Excel and then processes it. Our client drops off the files and we process them. I don't have any control over the original file.
I use the OleDb library to fill a DataSet. The file contains some numbers like 30829300, 30071500, etc. The data type for those columns is "Text".
Those numbers are converted to scientific notation when I import the data. Is there any way to prevent this from happening?
One workaround for this issue is to change your SELECT statement; instead of SELECT * do this:
"SELECT Format([F1], 'General Number') From [Sheet1$]"
-or-
"SELECT Format([F1], \"#####\") From [Sheet1$]"
However, doing so will blow up if your cells contain more than 255 characters with the following error:
"Multiple-step OLE DB operation generated errors. Check each OLE DB status value, if available. No work was done."
Fortunately my customer didn't care about erroring out in this scenario.
This page has a bunch of good things to try as well:
http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/
The OleDb library will, more often than not, mess up your data in an Excel spreadsheet. This is largely because it forces everything into a fixed-type column layout, guessing at the type of each column from the values in the first 8 cells in each column. If it guesses wrong, you end up with digit strings converted to scientific-notation. Blech!
To avoid this you're better off skipping the OleDb and reading the sheet directly yourself. You can do this using the COM interface of Excel (also blech!), or a third-party .NET Excel-compatible reader. SpreadsheetGear is one such library that works reasonably well, and has an interface that's very similar to Excel's COM interface.
Using this connection string:
Provider=Microsoft.ACE.OLEDB.12.0; data source={0}; Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\"
with Excel 2010 I have noticed the following: if the Excel file is open when you run the OLEDB SELECT, you get the current version of the cells, not the saved file values. Furthermore, the string values returned for a long number, a decimal value and a date look like this:
5.0130370071e+012
4.08
36808
If the file is not open then the returned values are:
5013037007084
£4.08
Monday, October 09, 2000
If you look at the actual .XLSX file using the Open XML SDK 2.0 Productivity Tool (or simply unzip the file and view the XML in Notepad) you will see that Excel 2007 actually stores the raw data in scientific format.
For example 0.00001 is stored as 1.0000000000000001E-5
<x:c r="C18" s="11" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:v>1.0000000000000001E-5</x:v>
</x:c>
Looking at the cell in Excel, it is displayed as 0.00001 in both the cell and the formula bar. So it is not always true that OleDB is causing the issue.
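The stored string is just the round-trip decimal form of the IEEE double, so parsing it back yields exactly the value 0.00001 (a small standalone check):

```csharp
using System;
using System.Globalization;

class RoundTripDemo
{
    static void Main()
    {
        // The raw value Excel stores in the XLSX XML for a cell displaying 0.00001
        string stored = "1.0000000000000001E-5";

        double d = double.Parse(stored, CultureInfo.InvariantCulture);

        // The parsed value is the same double as the literal 0.00001
        Console.WriteLine(d == 0.00001);   // True
    }
}
```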
I have found that the easiest way is to choose Zip format, rather than text format for columns with large 'numbers'.
Have you tried casting the value of the field to (int) or perhaps (Int64) as you are reading it?
Look up the IMEX=1 connection string option and TypeGuessRows registry setting on google.
In truth, there is no easy way around this, because the reader infers column data types by looking at the first few rows (8 by default). If those rows contain all numbers, then you're out of luck.
An unfortunate workaround which I've used in the past is to use the HDR=NO connection string option and set the TypeGuessRows registry setting value to 1, which forces it to read the first row as valid data to make its datatype determination, rather than a header.
It's a hack, but it works. The code reads the first row (containing the header) as text, and then sets the datatype accordingly.
Changing the registry is a pain (and not always possible) but I'd recommend restoring the original value afterwards.
If your import data doesn't have a header row, then an alternative option is to pre-process the file and insert a ' character before each of the numbers in the offending column. This causes the column data to be treated as text.
So all in all, there are a bunch of hacks to work around this, but nothing really foolproof.
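The apostrophe pre-processing idea above might be sketched like this (the comma delimiter and the index of the offending column are assumptions for illustration):

```csharp
using System;
using System.IO;
using System.Linq;

class PrefixNumericColumn
{
    static void Main()
    {
        // Stand-in for File.ReadAllLines(path) on the import file
        string[] lines =
        {
            "Id,Code",
            "1,30829300",
            "2,30071500",
        };

        var fixedLines = lines.Select((line, i) =>
        {
            if (i == 0) return line;            // keep the header untouched
            var cells = line.Split(',');
            cells[1] = "'" + cells[1];          // force the Code column to be treated as text
            return string.Join(",", cells);
        });

        foreach (var l in fixedLines)
            Console.WriteLine(l);   // Id,Code / 1,'30829300 / 2,'30071500
    }
}
```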
I had this same problem, but was able to work around it without resorting to the Excel COM interface or 3rd party software. It involves a little processing overhead, but appears to be working for me.
First, read in the data to get the column names.
Then create a new DataSet with each of these columns, setting each of their DataTypes to string.
Read the data in again into this new DataSet. Voila - the scientific notation is now gone and everything is read in as a string.
Here's some code that illustrates this, and as an added bonus, it's even StyleCopped!
public void ImportSpreadsheet(string path)
{
    string extendedProperties = "Excel 12.0;HDR=YES;IMEX=1";
    string connectionString = string.Format(
        CultureInfo.CurrentCulture,
        "Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1}\"",
        path,
        extendedProperties);

    using (OleDbConnection connection = new OleDbConnection(connectionString))
    using (OleDbCommand command = connection.CreateCommand())
    {
        command.CommandText = "SELECT * FROM [Worksheet1$]";
        connection.Open();

        using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
        using (DataSet columnDataSet = new DataSet())
        using (DataSet dataSet = new DataSet())
        {
            columnDataSet.Locale = CultureInfo.CurrentCulture;
            adapter.Fill(columnDataSet);

            if (columnDataSet.Tables.Count == 1)
            {
                var worksheet = columnDataSet.Tables[0];

                // Now that we have a valid worksheet read in, with column names, we can create a
                // new DataSet with a table that has preset columns that are all of type string.
                // This fixes a problem where the OLEDB provider is trying to guess the data types
                // of the cells and strange data appears, such as scientific notation on some cells.
                dataSet.Tables.Add("WorksheetData");
                DataTable tempTable = dataSet.Tables[0];
                foreach (DataColumn column in worksheet.Columns)
                {
                    tempTable.Columns.Add(column.ColumnName, typeof(string));
                }

                adapter.Fill(dataSet, "WorksheetData");
                if (dataSet.Tables.Count == 1)
                {
                    worksheet = dataSet.Tables[0];
                    foreach (DataRow row in worksheet.Rows)
                    {
                        // TODO: Consume some data.
                    }
                }
            }
        }
    }
}
I got this solution from somewhere else, but it worked perfectly for me.
No need to make any code change: just format the Excel columns' cells as "General" instead of any other formatting like "Number" or "Text". Then even SELECT * FROM [Sheet1$] or SELECT Column_Name FROM [Sheet1$] will read the values correctly, even large numeric values of more than 9 digits.
I googled around this issue.
Here are my solution steps.
For the template Excel file:
1. Format the Excel column as Text.
2. Write a macro to disable error warnings for the number-to-text conversion:
Private Sub Workbook_BeforeClose(Cancel As Boolean)
    Application.ErrorCheckingOptions.BackgroundChecking = True
End Sub

Private Sub Workbook_Open()
    Application.ErrorCheckingOptions.BackgroundChecking = False
End Sub
In the code-behind:
3. While reading the data to import, try to parse the incoming data to Int64 or Int32.
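Step 3 might look like the following sketch (the cell value is assumed to arrive as a string):

```csharp
using System;

class ParseDemo
{
    static void Main()
    {
        string raw = "30829300"; // value as read from the sheet

        // TryParse avoids exceptions on non-numeric cells
        if (long.TryParse(raw, out long asLong))
            Console.WriteLine(asLong);              // 30829300
        else
            Console.WriteLine("not an integer: " + raw);
    }
}
```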