What all types of file can be read with File class c#? - c#

I have tried reading Text File and XML File with File Class, it works fine.I was wondering if we can read excel or word or other types.
var str = File.ReadAllLines("Test.xlsx");
While debugging ,str shows special characters.
Hope I had made question clear.Kindly Advise
Down votes are welcomed,if accompanied by proper comment to improve :).
Thanks in advance.

XML and Text Files are plain files, where text on screen appear like they are in file. That's why File.ReadAllLines work.
With Excel, it is different. It has encoded logic in file, which when read by a special programs (read MSExcel) decodes it and displays it correctly on screen.
Think of it as a encoded or obfuscated file read by programs specially defined to decrypt them.
To read Excel file in DotNet, you can use them to be transferred into DataSet/DataTable like this Read Excel File in C# (Example)

With File.ReadAllLines you can read text files (and XML is -as we know- as well a text file).
Of course then function reads other kind of data files as well - but you will not get meaningful results. The binary data is interpreted as characters. This will not work for Office files.

The MSDN documentation for File.ReadAllLines() states that:
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
Therefore you can read text files with one of the UTF encodings it supports. To read files that use other encodings (e.g. Windows ANSI, non-Latin text) you should use the overload that takes an Encoding parameter.

Related

How to convert ANSI text to Unicode correctly without installing language pack? [duplicate]

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But, some of this output is becoming mangled, specifically the symbol '£' is output as the Unicode FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows 1252 ('ANSI') encoded.
The question is,
How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and
How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that endcoding.
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat it as a stream of bytes, which Encoding.Default does well.

Determine File Encoding for Large Files in SSIS / C#

I have very large text files that are being imported via SSIS into our database.These files come from hundreds of companies and a variety of different source systems. Most of these files are fine importing with code page 1252, but in some files, buried somewhere in one of the rows, there might be some oddball characters that don't fit in the 1252 code page.
I've implemented a solution based on this SO answer, which allows me to proceed with code page 1252 on one path if the file's encoding is ANSI/ASCII, OR it can go down another path with a 65001 code page. This seems to work in a lot of cases, but is not reliable enough to be something we could use in production.
using(var r = new StreamReader(filename, Encoding.Default))
{
richtextBox1.Text = r.ReadToEnd();
var encoding = r.CurrentEncoding;
}
I'm far from an expert on file encoding, but I'm guessing that it's because it only reads a certain portion of the file and if everything looks like ANSI characters, it will assume it is ANSI (these files are almost guaranteed not to have a BOM)?
Would I have to read the entire file into memory and examine every character to come to a mostly accurate file encoding? How can I do this when reading an extremely large file into memory would cause huge problems?
Is there a way to accomplish this with a reasonable level of certainty? I don't need to account for any kind of foreign languages as these are all English, but we've encountered the occasional strange character included in these files. I'm thinking we need to allow for ASCII, UTF-8 and UTF-16.
Is there a way to just be able to determine whether to use code page 1252 or 65001 in SSIS?
On a related note, if ASCII is a subset of UTF-8, why is it that when I import ALL the files as code page 65001, some of the characters don't translate correctly? Shouldn't UTF-8 work for everything if it encompasses ASCII?

Opening a Unix file in Windows Notepad++?

I receive a file from a supplier that I download per SFTP. Our systems are all working on Windows.
When I open the File in Notepad++ the status bar says "UNIX" and "UTF-8"
The special characters aren't displayed correctly.
I tried to convert the file to the different formats Notepad++ allows but no one converted the char 'OSC' to the german letter 'ä'. Is this a known Unix-Windows-thing? My google-foo obviously isn't good enough.
Which kind of conversion should I try to display the file correctly?
How can I achieve the same programmatically in C#?
It is common on windows that a file's encoding doesn't match what the editor or even its xml header say it is. People are sloppy. Maybe it's really UTF-16, or the unstandard windows extended ascii thing which I think is probably cp-1252. (It's not common on *nix since we all usually just use utf-8, no need for others... not saying *nix users are much less sloppy)
To figure out which encoding it is, I would make a copy of the file, then delete the bits that are not a problem (leaving Mägenwil as the entire file) and then save, and use the linux command "file" which will tell what the right encoding is (reliable only for small files... it doesn't read the whole file; maybe notepad++ will do the exact same thing). The reason for deleting the other bits is that it might be a mix of UTF-8 which the editor has used for detection, plus something else.
I would try the iconv command in linux to test. For example:
iconv -f UTF-16 -t UTF-8 -o outfile infile
And any encoding conversion should be possible in C# or any featureful language, as long as you know how it was mutilated so you can reverse it. And if you find that it is part utf-8 and part something else, then remember not to convert the whole file, but only the important parts.

string encoding in C# - strange characters

I have a file that i need to import.
The problem is that I have problems with a lot of characters in that file.
For example these names are wrong:
Björn (in file) - Should be Björn
Ã…ke (in file) - Should be Åke
Unfortunately I can't recreate the file with the correct encoding.
Also there are a lot of characters that are wrong (these was just examples). I can't do a search and replace on all (if there isn't a dictionary with all conversions).
Can I decode the strings in some way?
thanks Patrik
Edit:
Just some more info that I should added before (I blame my tiredness).
The file is an .xlsx file.
I debugged this with Notepad++. I copied the correct strings into Notepad++. I used Encoding | Convert to UTF-8. Then I selected Encoding | Encode as ANSI. This has the effect of interpreting the UTF-8 bytes as if they were ANSI. And when I did this I end up with the same erroneous values as you. So clearly when you read the file you are interpreting is as ANSI rather than UTF-8.
The solution then is that your file has been encoded as UTF-8. Make sure that the file is interpreted as UTF-8 when you read it. I can't tell you exactly how to do that since you didn't show how you were reading the file in the first place.
It's possible that your file does not contain a byte-order-mark (BOM). If so then specify the encoding when you read the file by passing Encoding.UTF8.
I've just tried your first example, and it definitely looks like that's UTF-8.
It's unclear what you're using to look at the file in the first place, but if you load it with a text editor which understands UTF-8 and tell it that it's a UTF-8 file, it should be fine.
When you load it with .NET, you should just be able to use File.OpenText, File.ReadAllText etc - most IO dealing with encodings in .NET defaults to UTF-8 anyway.

asp.net character encoding problem utf8

I'm storing some html-encoded data in a sql server database and I've written a script to output the data in a csv format minus the html tags and I'm getting a weird issue when html-decoding the remaining data. For example the data contains a quote character (which is html-encoded as ’), but when I try to html-decode it the data comes out as a series of weird characters (’). Does anyone know how to solve this issue? The output encoding of the page is UTF-8 if that helps.
Any advice would be much appreciated!
Cheers
Tim
Those 3 weird characters are how UTF-8 encodes the HTML entity ’. (They're actually the octets 0xE2 0x80 0x99, and those bytes render as "’" in your computer's default charset windows-1252.) So I don't think you've got an issue with your encoding.
It's evidently a known problem that Excel 2000 has problems with .csv files in UTF-8 encoding. The solution, bizarrely enough, is to switch the filename extension to .txt, at which point Excel 2000 will evidently import the file correctly.
If the data is read from the CSV files, open the csv file in notepad press Save As in the fiile menu, save the file as Encoding-UTF8.

Categories