I'm working with a CSV that contains characters like:
” and •
I am reading the CSV via OleDb with the Microsoft.Jet.OLEDB.4.0 provider. When the data is loaded into the OleDbCommand, the characters are converted to the following, respectively:
“ and •
I suspected there might be a collation setting in the connection string but I was unable to find anything about this.
I can confirm the following:
I can see the original character in the CSV when I open it.
If I run a select on the file through OleDb with WHERE [field] LIKE '%•%' I get 0 rows, but with WHERE [field] LIKE '%“%' I get rows returned.
Any thoughts?
Finally! Thanks to @HABJAN I was able to get to the resolution, which is as simple as setting the CharacterSet in the Extended Properties of the connection string. For my situation it was UTF-8, commonly used by default in phpMyAdmin, which is where my data was retrieved from.
Resulting working connection string:
"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"{0}\";Extended Properties=\"text;HDR=Yes;FMT=Delimited;CharacterSet=65001;\""
The key is CharacterSet=65001 (the code page identifier for UTF-8), which might have been obvious to some collation-savvy individuals, but I've somehow managed to avoid these issues over the years and had never come across it in this respect.
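For context, a minimal sketch of how that connection string can be used (the folder, file name, and column index are placeholders; with the Jet text driver, Data Source is the directory and the file name acts as the table name):
using System;
using System.Data.OleDb;

class CsvRead
{
    static void Main()
    {
        string folder = @"C:\data"; // directory that contains the CSV
        string connStr = string.Format(
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"{0}\";" +
            "Extended Properties=\"text;HDR=Yes;FMT=Delimited;CharacterSet=65001;\"",
            folder);
        using (var cn = new OleDbConnection(connStr))
        using (var cmd = new OleDbCommand("SELECT * FROM [file.csv]", cn))
        {
            cn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader[0]); // • arrives intact instead of as •
            }
        }
    }
}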
I was also able to get HABJAN's solution to work when following the documentation at http://msdn.microsoft.com/en-us/library/ms709353%28v=vs.85%29.aspx and setting the CharacterSet to the same value as above.
For my situation, this is the better method as it is a simpler/more maintainable solution, but +1 to HABJAN for helping me get there!
Thanks
You can create a schema.ini file and play with the Format and CharacterSet properties.
Take a look at this sample: How to read data from Unicode formatted text file and import to Data Table using .Net
And here is another sample that will show you how to read csv file with schema.ini: Importing CSV file into Database with Schema.ini
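For illustration, a schema.ini along these lines (the file name and settings are placeholders), saved in the same folder as the CSV, covers both of those properties:
[file.csv]
Format=CSVDelimited
ColNameHeader=True
CharacterSet=65001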
Related
This is partly a question for the Microsoft forums too, but I think there might be some coding involved.
We have a system built in C# .NET that generates CSV files. However, we have problems with special characters like "æÆøØåÅ". The thing is, when I open the file in Notepad, everything is correct. But when I open the file in Excel, these characters are wrong. If I open it in Notepad and save without actually making any changes, it works in Excel. But I don't understand why. Is there some hidden information added to the file that we can adjust in our C# code to make it correct in the first place?
There are other questions like this, but all the answers I could find are workarounds for when you already have a wrong CSV file. In our case, we create this file, and the people we send the files to are usually not computer people capable of changing encodings, etc.
Edit:
Here is the code we tried to use at the end, after generating our result CSV-string:
string result = "some;æøå;string";
byte[] bytes = System.Text.Encoding.GetEncoding(65001).GetBytes(result.ToString());
return System.Text.Encoding.GetEncoding(65001).GetString(bytes);
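Note that this round-trip encodes the string to UTF-8 bytes and immediately decodes it again, so the result is identical to the input. What the Notepad re-save most likely adds is a byte order mark, which is what Excel uses to detect UTF-8; a minimal sketch of writing the file with one (the path is a placeholder):
string result = "some;æøå;string";
// UTF8Encoding(true) emits the BOM (EF BB BF) at the start of the file.
System.IO.File.WriteAllText(@"C:\out\data.csv", result, new System.Text.UTF8Encoding(true));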
I need to import sheets which look like the following:
March Orders
***Empty Row
Week  Order #   Date     Cust #
3.1   271356    3/3/10   010572
3.1   280353    3/5/10   022114
3.1   290822    3/5/10   010275
3.1   291436    3/2/10   010155
3.1   291627    3/5/10   011840
The column headers are actually in row 3. I can use an Excel Source to import them, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
Have a look below. The links have more details, but I've included some text from the pages (just in case the links go dead).
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading a text file to SQL Server via SSIS, we have the provision to skip any number of leading rows from the source and load the data to SQL Server. Is there any provision to do the same for an Excel file?
The source Excel file for me has some description in the leading 5 rows; I want to skip them and start the data load from row 6. Please provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in
SQL Server) and then use a conditional split to filter out everything
where the number <=5
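For what it's worth, a sketch of that row-numbering idea as an SSIS Script Component (the RowNumber output column is an assumption; you would add it to the component's outputs yourself, then filter on RowNumber <= 5 in the Conditional Split):
public class ScriptMain : UserComponent
{
    private int rowCounter = 0;

    // Runs once per row; stamps each row with a running count.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        rowCounter++;
        Row.RowNumber = rowCounter;
    }
}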
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible during import data from Excel to DB table skip first 6 rows for example?
Also Excel data divided by sections with headers. Is it possible for example to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
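As an aside, the same Sheet1$A6:M range syntax that OpenRowset accepts also works in a plain ADO.NET query; a sketch, where the file path and range are assumptions:
string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;" +
                 @"Data Source=C:\data\book.xlsx;" +
                 "Extended Properties='Excel 12.0 Xml;HDR=YES;'";
using (var cn = new OleDbConnection(connStr))
// With HDR=YES, the first row of the range (worksheet row 6) is the header.
using (var cmd = new OleDbCommand("SELECT * FROM [Sheet1$A6:M]", cn))
{
    cn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // each row here comes from worksheet row 7 onward
        }
    }
}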
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were 2 fields in the upper 5 rows that had some data that I think prevented the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel source object, I used 'SQL Command' as the 'Data Access Mode' (it's a drop-down when you double-click the Excel Source object). From there I was able to build a query ('Build Query' button) that only grabbed the records I needed. Something like this:
SELECT F4, F5, F6 FROM [Spreadsheet$] WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')
Note: I initially tried an ISNUMERIC instead of 'IS NOT NULL', but
that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't
NULL (and fortunately F4 didn't contain any junk in the first 5
rows). I could skip the whole header row (row 6) with the 2nd WHERE
clause.
So that cleaned up my data source perfectly. All I needed to do now
was add a Data Conversion object in between the source and destination
(everything needed to be converted from Unicode in the spreadsheet),
and it worked.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet those guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, and not repeat the headers multiple times. Most important of all, they must have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, as you will get the file in a different format every time depending on the mood of the person who maintains the Excel spreadsheet.
Incidentally, we push really hard to never receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate). They also must know that any changes they make to the spreadsheet format will result in a change to the import package, and that they will be charged for those development changes (assuming these are outside clients and not internal ones). These changes must be communicated in advance and developer time scheduled; a file with the wrong format will fail and be returned to them to fix if not.
If that doesn't work, may I suggest that you open the file, delete the first two rows, and save it as a text file. Then write a data flow that will process the text file. SSIS does a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
Not entirely correct.
SSIS forces you to use the format, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit
You can just use the OpenRowset property you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
Regards.
I've got something interesting happening, and I can't seem to find anything on it, though I have found some interesting information on similar cases.
I'm trying to create an Excel workbook on a server (which is Windows 2003 Server 32-bit) using only OLEDB and ADO.NET. (I tried ADO, but it didn't work at all - different story.) I'm using a CREATE TABLE statement to generate the Excel workbook and then I'm using an INSERT INTO statement to add the values to the spreadsheet.
The problem I'm having occurs on my development machine - a Windows 7 64-bit laptop. I have NOT moved this to the production environment (the server) yet.
The data I pull is stored in a datatable (System.Data), and comes from an MS Access database. The connection strings to the spreadsheet are different depending on whether I'm trying to create the newer format or the older one.
Here's the thing: When I use Microsoft.Jet.OLEDB.4.0 as the provider, I get a spreadsheet which is unattractive to say the least, but has all the correct data in the correct formats. In particular, I'm speaking about date columns here - they're formatted as m/d/yyyy, which is correct.
When I use Microsoft.ACE.OLEDB.12.0 as the provider, the date formats - from the exact same datatable - come in formatted as strings with format yyyy-mm-dd hh:mm:ss.
I have no idea why the data isn't in the same format in both. I checked the datatable's datatype for one of the date columns, and it checks in as DateTime no matter which provider I use.
Here are the connection strings in use. The first is:
OleDbConnection cn = new OleDbConnection(string.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data source={0}{1};" +
"Extended Properties='Excel 12.0 Xml;HDR=YES;'", ssPth, ssXlsx));
The second is:
OleDbConnection cn = new OleDbConnection(string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data source={0}{1};" +
"Extended Properties='Excel 8.0;HDR=YES;';", ssPth, ssXls));
I use the exact same CREATE TABLE and INSERT INTO statements with both versions. The OLEDB.4.0 spreadsheet comes out with dates in the date columns. The OLEDB.12.0 version doesn't.
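(For illustration, a hypothetical pair of statements of the shape described; the table and column names are placeholders, not the originals, and cn is an open OleDbConnection:)
using (var create = new OleDbCommand(
    "CREATE TABLE [Orders] ([OrderName] VARCHAR(50), [OrderDate] DATETIME)", cn))
{
    create.ExecuteNonQuery();
}
using (var insert = new OleDbCommand(
    "INSERT INTO [Orders] ([OrderName], [OrderDate]) VALUES (?, ?)", cn))
{
    insert.Parameters.AddWithValue("@name", "Widget");
    // Passing the date as a typed parameter (rather than formatting it into
    // the SQL text) is one way to keep the provider treating it as a date.
    insert.Parameters.Add("@date", OleDbType.Date).Value = new DateTime(2010, 3, 5);
    insert.ExecuteNonQuery();
}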
Anyone have any ideas why this might be happening? While I can simply use the Jet provider, it's not able to create the xlsx format of recent Excel versions. (It does produce a file, but there's a nasty warning message when opening it.) But I can't use the newer provider if the date columns are strings instead of dates.
I've tried adding "IMEX=1" as part of the Extended Properties section of the connection string and this causes an error during the execution of the INSERT INTO statement. Researching that error revealed removing the "IMEX=1" solves that problem, so I can't use that in the connection string.
I'm stumped as to how to handle this, and most of the posts here and in other forums I've read indicate how to solve this when the data is coming FROM Excel, but nothing I've found indicates why data from a database looks different going TO Excel.
Any assistance is greatly appreciated.
I am attempting to create a UDL file programmatically in C#. In my program, I want to show the user the Data Link properties window but with my own default values for the connection string. I initially thought to do the following:
string[] lines = new string[]
{
"[oledb]",
"; Everything after this line is an OLE DB initstring",
"Provider=SQLOLEDB.1;Persist Security Info=False"
};
File.WriteAllLines("Test.udl", lines);  // write the UDL file
Process p = Process.Start("Test.udl");  // open it in the Data Link properties dialog
p.WaitForExit();
However, I get this error when trying to open the file:
File cannot be opened. Ensure it is a valid Data Link file.
This is strange because I created an empty file, named it something.udl, opened it, clicked OK, and then looked at the contents of the file, which were:
[oledb]
; Everything after this line is an OLE DB initstring
Provider=SQLOLEDB.1;Persist Security Info=False
But there was a newline character at the end of the connection string. I used KDiff to compare this file and the file I created in my program, and it said "Files are equal text but they are not binary equal", or something to that effect.
I believe it has to do with how the File.WriteAllLines method writes the strings. So I attempted to use different encodings with the method but with no success. Any ideas on where I am going wrong?
I am using this MSDN link as a reference about UDL files. It's also interesting to note that if I open a new text file and paste in all of the lines from my lines array, I arrive at the same error.
All you need to do is use the Unicode encoding:
File.WriteAllLines("Test.udl", lines, Encoding.Unicode);
When creating the file in a plain-text editor, use the UTF-16 Little Endian encoding and include a Byte Order Mark (since Microsoft started on the Intel platform they consider that the "default" when they talk about UTF-16).
When using a program, make sure to use that particular encoding as well, programming languages might still default to a legacy codepage or use UTF-8, in which case opening the UDL file will trigger the error shown in the question.
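If in doubt, you can inspect the first two bytes of the generated file; a UDL written as UTF-16 LE should begin with the byte order mark FF FE:
byte[] head = new byte[2];
using (var fs = System.IO.File.OpenRead("Test.udl"))
    fs.Read(head, 0, 2);
Console.WriteLine("{0:X2} {1:X2}", head[0], head[1]); // expect "FF FE"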
I have a webpage where the user can upload an excel file. I'm trying 2 different files - one works without a problem, and the other one gives me this error:
Error: Length cannot be less than zero. Parameter name: length
I know that sometimes this occurs when the file size is zero, but that is not the case here.
Can anyone shed some light on this issue? Please let me know if you need more info.
As noted, more info is needed. It's not clear if you are opening the Excel file and processing directly from it or reading the data from Excel directly to a DataTable via ODBC, or something else.
Most of my problems reading Excel files are caused either by column titles, or by data in a particular column being different types. Check first to see if your two Excel files have the same columns, all columns have names, etc.
When you read to a DataTable, the program takes a guess at the data type of each column. If the first several cells are empty, the guess may be wrong. If your data is like mine, a column that looks like it's all numbers may be half actual numbers, half strings. Or, a column of dates may have an illegal value.
I have better luck writing the data from Excel to a .csv file, and having the program write a schema.ini and read it with the Microsoft Text Driver, but that may not suit your data.
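For what that looks like in code, here's a sketch (the folder, file name, and column definitions are placeholders): it writes a schema.ini next to the CSV so the Text Driver doesn't have to guess the column types, then fills a DataTable:
string folder = @"C:\data";       // placeholder folder
string fileName = "export.csv";   // placeholder file name
File.WriteAllLines(Path.Combine(folder, "schema.ini"), new[]
{
    "[" + fileName + "]",
    "Format=CSVDelimited",
    "ColNameHeader=True",
    "Col1=Id Long",               // declare each column's type explicitly
    "Col2=Amount Currency",
    "Col3=Comment Char Width 255"
});
string connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + folder +
                 ";Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";
var table = new DataTable();
using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + fileName + "]", connStr))
{
    adapter.Fill(table);
}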
Is there an UpdatePanel in your page?
I had a problem when I was trying to use a FileUpload with an UpdatePanel in the page.
Very weird case. I suggest you make a copy of the file that works and try with that copy to see if it works too.
Or maybe verify whether both files were saved with the same version of Excel.