Excel opening CSV with wrong encoding - c#

This is partly a question for the Microsoft forums too, but I think there might be some coding involved.
We have a system built in C# .NET that generates CSV files. However, we have problems with the special characters "æÆøØåÅ". When I open the file in Notepad, everything is correct, but when I open the file in Excel, these characters are wrong. If I open it in Notepad and save without making any changes, it works in Excel. I don't understand why. Is there some hidden information added to the file that can be adjusted in our C# code to make it correct in the first place?
There are other questions like this, but all the answers I could find are workarounds for when you already have a wrong CSV file. In our case, we create this file, and the people we send the files to are usually not computer people capable of changing the encoding, etc.
Edit:
Here is the code we tried to use at the end, after generating our result CSV-string:
string result = "some;æøå;string";
byte[] bytes = System.Text.Encoding.GetEncoding(65001).GetBytes(result.ToString());
return System.Text.Encoding.GetEncoding(65001).GetString(bytes);
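Note that the round trip above encodes the string to UTF-8 bytes and decodes the same bytes again, so it returns the original string unchanged. A likely explanation for the Notepad behaviour is that Notepad saves the file with a UTF-8 byte order mark (the bytes EF BB BF), which Excel uses to recognise UTF-8 when a CSV is opened directly. The following is a minimal sketch of writing that mark from C#, not a confirmed fix for this system; the class name, method name and the file name export.csv are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class CsvBomSketch
{
    // Returns the CSV text as UTF-8 bytes prefixed with the BOM (EF BB BF).
    public static byte[] ToUtf8WithBom(string csv)
    {
        var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        byte[] preamble = utf8.GetPreamble();   // EF BB BF
        byte[] body = utf8.GetBytes(csv);       // GetBytes does not add the BOM itself
        byte[] result = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, result, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, result, preamble.Length, body.Length);
        return result;
    }

    public static void Main()
    {
        string result = "some;æøå;string";
        // "export.csv" is a hypothetical output path.
        File.WriteAllBytes("export.csv", ToUtf8WithBom(result));
    }
}
```

The point of the sketch is that the encoding and the BOM only exist at the byte level, so the method that produces the CSV would have to hand back bytes (or write them to the file or response itself) rather than return a string.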

Related

Two files that are binary identical, yet exhibit different behavior

I'm posting with tags asp.net and excel because that is the origination of my problem, but I'm not really sure this is the right place - ultimately, my problem is that I have two files (served by an ASP.Net application) which are identical based on a binary file compare using
fc /B A.xls B.xls
However, they exhibit different behavior: the first one opens fine in Excel; the second one does not. I conclude, then, that there is something different about the files beyond what the FC utility checks.
I have tried sending these two files to a friend to ask for his help, but discovered that when I do so, the problem file gets "fixed". In fact, if I do just about anything with this file, it gets "fixed". By fixed, I mean that it then opens fine in Excel. For example, if I zip it, then extract it from the zip, it is fine. If I open in Notepad++ and "Save As", it is fine. Same with Wordpad. Using plain old Notepad does NOT fix it.
So, obviously, there is some difference about these two files that I am missing.
I'm not sure if I will have any luck asking people to visit a random website, but if you want to see an example of the behavior, I have created a minimal page to duplicate the problem at http://rodj.me/ExcelTest
Click on the link for "MinimalHtml.aspx", and the app will serve an HTML based xls file using the following in the Page Load:
protected void Page_Load(object sender, EventArgs e)
{
    Response.ContentType = "application/vnd.ms-excel";
    Response.AddHeader("Content-Disposition", "filename=MinimalHtml.xls");
}
Depending on your browser and browser settings (my tests have been in Chrome), you may get Excel opened with a blank page. Regardless, you should get the file MinimalHtml.xls downloaded. It is a plain text file. You should find that this file will NOT open in Excel. However, if you zip the file, then extract it from zip, it WILL open.
I'm curious about what other file differences I'm missing when just doing an FC compare, but ultimately, I need to get the ASP.Net application corrected to serve the HTML version of the Excel file correctly. Interestingly, if I create an XML version of the spreadsheet, it downloads/opens fine. That is what the "MinimalXml.aspx" link does.
Can anyone help with either 1) how to figure out what is different about the two files; or 2) what must change in the ASP.Net application to get it to serve the file correctly?
I think your problem might be a Microsoft security patch. See this article:
Infoworld article
When you open the file directly, the patch blocks it, which results in a blank page because the file content is HTML, not Excel. When you download the file in a zip file and unzip it, it is deemed safe and opens correctly.

Universal Data Link - File cannot be opened. Ensure it is a valid Data Link file

I am attempting to create a UDL file programmatically in C#. In my program, I want to show the user the Data Link properties window but with my own default values for the connection string. I initially thought to do the following:
string[] lines = new string[]
{
    "[oledb]",
    "; Everything after this line is an OLE DB initstring",
    "Provider=SQLOLEDB.1;Persist Security Info=False"
};
File.WriteAllLines("Test.udl", lines);
Process p = Process.Start("Test.udl");
p.WaitForExit();
However, I get this error when trying to open the file:
File cannot be opened. Ensure it is a valid Data Link file.
This is strange because I created an empty file, named it something.udl, opened it, clicked OK, and then opened the contents of the file which was:
[oledb]
; Everything after this line is an OLE DB initstring
Provider=SQLOLEDB.1;Persist Security Info=False
But there was a newline character at the end of the connection string. I used KDiff to compare this file with the file I created in my program, and it said "Files are equal text but they are not binary equal", or something to that effect.
I believe it has to do with how the File.WriteAllLines method writes the strings. So I attempted to use different encodings with the method but with no success. Any ideas on where I am going wrong?
I am using this MSDN link as a reference about UDL files. It's also interesting to note that if I open a new text file and paste in all of the lines in my lines array, I arrive at the same error.
All you need to do is use the Unicode encoding:
File.WriteAllLines("Test.udl", lines, Encoding.Unicode);
When creating the file in a plain-text editor, use the UTF-16 Little Endian encoding and include a Byte Order Mark (since Microsoft started on the Intel platform they consider that the "default" when they talk about UTF-16).
When using a program, make sure to use that particular encoding as well. Programming languages might still default to a legacy code page or use UTF-8, in which case opening the UDL file will trigger the error shown in the question.
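As a rough sketch of the advice above: in .NET, Encoding.Unicode is UTF-16 little endian, and File.WriteAllLines writes its byte order mark (FF FE) at the start of the file. The class and method names below are made up for illustration; the file content is the one from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class UdlSketch
{
    // Writes a UDL file as UTF-16 LE with a BOM.
    public static void WriteUdl(string path, string initString)
    {
        string[] lines =
        {
            "[oledb]",
            "; Everything after this line is an OLE DB initstring",
            initString
        };
        // WriteAllLines also ends the last line with a newline,
        // matching the file produced by the Data Link dialog.
        File.WriteAllLines(path, lines, Encoding.Unicode);
    }

    public static void Main()
    {
        WriteUdl("Test.udl", "Provider=SQLOLEDB.1;Persist Security Info=False");

        // Check the first two bytes of the file.
        byte[] head = new byte[2];
        using (FileStream fs = File.OpenRead("Test.udl"))
        {
            fs.Read(head, 0, 2);
        }
        Console.WriteLine(BitConverter.ToString(head)); // prints "FF-FE"
    }
}
```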

Generating proper CSV files

I'm having a problem programmatically generating a proper CSV file that is then downloaded by the user and opened in Excel in my ASP.NET project. Excel seems to open the file properly, but when I go to “Save As” it defaults to Unicode text. I understand that CSV is basically a text file, but if you create a CSV in Excel, save it, and then go to “Save As”, the type defaults to CSV. Therefore I believe something extra is being saved along with the file. I’ve made sure the HTTP Content-Type header is set to “text/csv”, so I am sure that the response sent to the user is correct.
We generate a lot of CSV where I work, and I've noticed this a lot. There's a really good chance that your file is just fine.
The problem with CSV is that it's not defined by any standard, so every app interprets it slightly differently. Excel probably does this for any CSV file which isn't precisely in its preferred format.
Maybe Excel expects CSV to be ASCII, and you've got a UTF BOM in the file which makes it decide tab-delimited "Unicode text" is a better fit.
This should work:
protected void btnDownload_Click(object sender, EventArgs e)
{
    Response.AddHeader("Content-Disposition", "attachment;filename=myfilename.csv");
    Response.ContentType = "text/csv";
    Response.Write("1;computer;1000");
    Response.End();
}
Have you looked at a binary dump of the file to make sure the file being downloaded is identical to the file you're looking at locally? There could be different line terminators being used (e.g. ) that might be causing Excel to tolerantly read it in and display it, but default to saving it as unicode text.
On a Linux (or cygwin) system, using "od -a -x" will tell you how the file is made up.
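Where od is not available, the same inspection can be approximated in C#. The following is a minimal sketch with made-up class and method names; it reports whether the file starts with a byte order mark, counts the line terminators, and prints the first bytes in hex. The line-ending count assumes a single-byte or UTF-8 encoded file.

```csharp
using System;
using System.IO;

public static class CsvInspectSketch
{
    // Reports which byte order mark, if any, the data starts with.
    public static string DescribeBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return "UTF-8 BOM";
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16 LE BOM";
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16 BE BOM";
        return "no BOM";
    }

    // Counts CRLF pairs, lone LF bytes and lone CR bytes.
    public static string DescribeLineEndings(byte[] data)
    {
        int crlf = 0, loneLf = 0, loneCr = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x0D)
            {
                if (i + 1 < data.Length && data[i + 1] == 0x0A) { crlf++; i++; }
                else loneCr++;
            }
            else if (data[i] == 0x0A)
            {
                loneLf++;
            }
        }
        return $"CRLF={crlf}, LF={loneLf}, CR={loneCr}";
    }

    public static void Main(string[] args)
    {
        // args[0] is the path of the file to inspect.
        byte[] data = File.ReadAllBytes(args[0]);
        Console.WriteLine(DescribeBom(data));
        Console.WriteLine(DescribeLineEndings(data));
        Console.WriteLine(BitConverter.ToString(data, 0, Math.Min(32, data.Length)));
    }
}
```

Running it once against the downloaded file and once against a CSV saved by Excel makes the two outputs easy to compare.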

How to extract text from Word files using C#?

I am trying to convert a large number (100,000) of Word DOC files. These are quite old, from around the 1995 to 2000 versions of Word, I suppose. I keep going around in circles from what I see here on Stack Overflow and in the MS documentation.
What I want to do is simply read the file, stick the text into a string, parse the string, and take out the structure stuff (the file is actually a structured report that looks like Patient: Jon Doe). From that point on, I know what I am doing: I can parse the string data, stick it into useful variables, then stick this data into a database. But I do not know how to actually put the text into a string. Any help?
PPS: I found this reference, which supposedly puts a DOC file into a text file. It's a start, but I'd rather avoid doing a bunch of file manipulations.
If you try to use the Word object model, you must always instantiate a certain version of Word on the client (since running Word on a server is not recommended). Unfortunately, you'll depend on that version's restrictions concerning older files; e.g. in Word 2010 you can open files from Office 95 only in sandbox mode (i.e. you're not able to access the file content programmatically). Additionally, you'll have to deal with unknown template content (documents with macros attached, for example).
In your case I'd rather look for a third-party component that gives access to the content.
I know from document management systems like OpenText eDocs and Autonomy iManage that they use other tools to full-text index documents of all types and can present the content in a viewer application. So if you look in this direction, maybe you will find something useful.
A word file is just a normal file as far as your code goes.
Try this:
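using System.IO;

// Note: this reads the raw bytes as text. A binary .doc file will yield
// non-text data as well, and the document text may not be readable.
string text;
using (StreamReader streamReader = new StreamReader(filePath))
{
    text = streamReader.ReadToEnd();
}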

Is there an easy way to determine the type of a file without knowing the file's extension?

I have a table with a binary column which stores files of a number of different possible filetypes (PDF, BMP, JPEG, WAV, MP3, DOC, MPEG, AVI etc.), but no columns that store either the name or the type of the original file. Is there any easy way for me to process these rows and determine the type of each file stored in the binary column? Preferably it would be a utility that only reads the file headers, so that I don't have to fully extract each file to determine its type.
Clarification: I know that the approach here involves reading just the beginning of each file. I'm looking for a good resource (aka links) that can do this for me without too much fuss. Thanks.
Also, just C#/.NET on Windows, please. I'm not using Linux and can't use Cygwin (doesn't work on Windows CE, among other reasons).
You can use these tools to find the file format.
File Analyser
http://www.softpedia.com/get/Programming/Other-Programming-Files/File-Analyzer.shtml
What Format
http://www.jozy.nl/whatfmt.html
PE file format analyser
http://peid.has.it/
This website may be helpful for you.
http://mark0.net/onlinetrid.aspx
Note:
I have included the download links to make sure that you get the right tool name and information.
Please verify the source before you download them.
I have used a tool in the past, I think it was File Analyser, which will tell you the closest match.
Happy tooling.
This is not a complete answer, but a place to start would be a "magic numbers" library. This examines the first few bytes of a file to determine a "magic number", which is compared against a known list of them. This is at least part of how the file command on Linux systems works.
Someone else asked a similar question and posted the code used to do exactly this. You should be able to take what is posted here, and slightly modify it so that it pulls from your database.
https://stackoverflow.com/questions/58510
In addition to that, it looks like someone has written a library based on magic numbers to do this. However, it looks like the site requires registration, and some form of alternate access, in order to download this library. The documentation is available for free without registration, which may be helpful.
http://software.topcoder.com/catalog/c_component.jsp?comp=13249160&ver=2
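To illustrate the magic-number idea, here is a minimal C# sketch that checks only the leading bytes of a blob against a few well-known signatures. It is not the code from the linked question or library; the class and method names are made up, and the signature list is deliberately short, so treat the result as a guess (for example, an MP3 without an ID3 tag or a short "BM" prefix in other data would be misjudged).

```csharp
using System;
using System.Text;

public static class FileTypeSniffer
{
    // True if data contains the given bytes at the given offset.
    private static bool BytesAt(byte[] data, int offset, params byte[] sig)
    {
        if (data.Length < offset + sig.Length) return false;
        for (int i = 0; i < sig.Length; i++)
        {
            if (data[offset + i] != sig[i]) return false;
        }
        return true;
    }

    // Same check for an ASCII marker such as "RIFF".
    private static bool AsciiAt(byte[] data, int offset, string text)
    {
        return BytesAt(data, offset, Encoding.ASCII.GetBytes(text));
    }

    // Guesses the type from the first bytes of a file (12 bytes are enough here).
    public static string Guess(byte[] header)
    {
        if (AsciiAt(header, 0, "%PDF")) return "pdf";
        if (BytesAt(header, 0, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "png";
        if (BytesAt(header, 0, 0xFF, 0xD8, 0xFF)) return "jpeg";
        if (AsciiAt(header, 0, "BM")) return "bmp";
        if (AsciiAt(header, 0, "RIFF") && AsciiAt(header, 8, "WAVE")) return "wav";
        if (AsciiAt(header, 0, "RIFF") && AsciiAt(header, 8, "AVI ")) return "avi";
        // OLE2 compound file: used by old Word, Excel and PowerPoint documents.
        if (BytesAt(header, 0, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) return "ole2 (doc/xls/ppt)";
        if (AsciiAt(header, 0, "ID3")) return "mp3";
        // MPEG program stream pack header.
        if (BytesAt(header, 0, 0x00, 0x00, 0x01, 0xBA)) return "mpeg";
        return "unknown";
    }
}
```

Since only the first few bytes are needed, the binary column would not have to be read in full; with ADO.NET, for instance, a data reader's GetBytes method can copy just the leading bytes of the column into a small buffer that is then passed to Guess.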
The easiest way I know is to use the file command, which is also available on Windows with Cygwin.
A lot of file types have well-defined headers that begin the file. You could check the first few bytes to see how the file begins.
Easiest way to do this would be through access to a *nix (or cygwin) system that has the 'file' command:
$ file visitors.*
visitors.html: HTML document text
visitors.png: PNG image data, 5360 x 2819, 8-bit colormap, non-interlaced
You could write a C# application that pipes the first X bytes of each binary column to the file command (using - as the file name).
You need to use some p/invoke interop code to call the SHGetFileInfo method from the Win32 API. This article may also help.
