I have very large text files that are being imported via SSIS into our database. These files come from hundreds of companies and a variety of different source systems. Most of these files import fine with code page 1252, but in some files, buried somewhere in one of the rows, there might be some oddball characters that don't fit in the 1252 code page.
I've implemented a solution based on this SO answer, which allows me to proceed with code page 1252 on one path if the file's encoding is ANSI/ASCII, OR it can go down another path with a 65001 code page. This seems to work in a lot of cases, but is not reliable enough to be something we could use in production.
using (var r = new StreamReader(filename, Encoding.Default))
{
    richtextBox1.Text = r.ReadToEnd();
    // CurrentEncoding only changes from the constructor value if a BOM was found during the read
    var encoding = r.CurrentEncoding;
}
I'm far from an expert on file encoding, but I'm guessing it's because it only reads a portion of the file, and if everything it sees looks like ANSI characters it assumes the file is ANSI (these files are almost guaranteed not to have a BOM)?
Would I have to read the entire file into memory and examine every character to arrive at a mostly accurate determination of the file's encoding? How can I do that when reading an extremely large file into memory would cause huge problems?
Is there a way to accomplish this with a reasonable level of certainty? I don't need to account for any kind of foreign languages as these are all English, but we've encountered the occasional strange character included in these files. I'm thinking we need to allow for ASCII, UTF-8 and UTF-16.
Is there a way to just be able to determine whether to use code page 1252 or 65001 in SSIS?
On a related note, if ASCII is a subset of UTF-8, why is it that when I import ALL the files as code page 65001, some of the characters don't translate correctly? Shouldn't UTF-8 work for everything if it encompasses ASCII?
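For what it's worth, one way to make the 1252-vs-65001 decision without holding the whole file in memory is to stream it through a strict UTF-8 decoder and fall back to 1252 the moment decoding fails. This is only a rough sketch (the helper name, buffer size and file handling are placeholders, not tested production code):

using System;
using System.IO;
using System.Text;

// Sketch: returns true if the file decodes cleanly as strict UTF-8 (code page 65001 is safe),
// false if an invalid byte sequence is hit (fall back to code page 1252).
static bool IsProbablyUtf8(string path)
{
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        using (var reader = new StreamReader(path, strictUtf8, detectEncodingFromByteOrderMarks: true))
        {
            var buffer = new char[8192];
            // read in chunks so the whole file never has to fit in memory
            while (reader.Read(buffer, 0, buffer.Length) > 0) { }
        }
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}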
Related
I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But some of this output is getting mangled; specifically, the symbol '£' is output as the Unicode U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows-1252 ('ANSI') encoded.
The questions are:
1. How do I determine whether the file is encoded as Windows-1252 or UTF-8? It could be either.
2. How do I convert it to UTF-8 if it is in Windows-1252, preserving the symbol £ etc.?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
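For example, a minimal sketch of that conversion might look like this (file names are placeholders, and it assumes the input really is Windows-1252):

using (var reader = new StreamReader("input.txt", Encoding.GetEncoding("Windows-1252"), true))
using (var writer = new StreamWriter("output.txt", false, new UTF8Encoding(true)))  // UTF-8 with a BOM
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        writer.WriteLine(line);  // £ and friends survive because the bytes are decoded correctly first
    }
}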
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are ones that I create with .NET's StreamWriter, and they're in UTF-8 with a BOM. Other files I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat them as a stream of bytes, which Encoding.Default handles well.
I have tried reading a text file and an XML file with the File class, and it works fine. I was wondering if we can read Excel, Word or other file types.
var str = File.ReadAllLines("Test.xlsx");
While debugging, str shows special characters.
Hope I have made the question clear. Kindly advise.
Downvotes are welcome, if accompanied by a proper comment on how to improve :).
Thanks in advance.
XML and text files are plain files, where the text on screen appears just as it is stored in the file. That's why File.ReadAllLines works.
With Excel, it is different. The file has an encoded structure, which a special program (read: MS Excel) decodes and displays correctly on screen.
Think of it as an encoded or obfuscated file read by programs specifically designed to decode it.
To read an Excel file in .NET, you can load it into a DataSet/DataTable, like this: Read Excel File in C# (Example).
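For illustration, a rough sketch of that approach using OLE DB might look like this (it assumes the Microsoft ACE OLE DB provider is installed, and the worksheet name is a placeholder):

using System.Data;
using System.Data.OleDb;

var connectionString =
    "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=Test.xlsx;" +
    "Extended Properties='Excel 12.0 Xml;HDR=YES'";

var table = new DataTable();
using (var connection = new OleDbConnection(connectionString))
using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", connection))
{
    adapter.Fill(table);  // each worksheet row becomes a DataRow with its cell values already decoded
}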
With File.ReadAllLines you can read text files (and XML is, as we know, also a text file).
Of course the function reads other kinds of data files as well, but you will not get meaningful results: the binary data is interpreted as characters. This will not work for Office files.
The MSDN documentation for File.ReadAllLines() states that:
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
Therefore you can read text files with one of the UTF encodings it supports. To read files that use other encodings (e.g. Windows ANSI, non-Latin text) you should use the overload that takes an Encoding parameter.
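For example (the file name is a placeholder):

// read an ANSI (Windows-1252) text file by passing the encoding explicitly
var lines = File.ReadAllLines("Test.txt", Encoding.GetEncoding(1252));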
I receive a file from a supplier that I download via SFTP. Our systems all run on Windows.
When I open the File in Notepad++ the status bar says "UNIX" and "UTF-8"
The special characters aren't displayed correctly.
I tried converting the file to the different formats Notepad++ allows, but none of them converted the char 'OSC' to the German letter 'ä'. Is this a known Unix-Windows thing? My google-fu obviously isn't good enough.
Which kind of conversion should I try to display the file correctly?
How can I achieve the same programmatically in C#?
It is common on Windows that a file's encoding doesn't match what the editor or even its XML header says it is. People are sloppy. Maybe it's really UTF-16, or the nonstandard Windows extended-ASCII thing, which I think is probably cp-1252. (It's not common on *nix, since we all usually just use UTF-8, no need for others... not saying *nix users are much less sloppy.)
To figure out which encoding it is, I would make a copy of the file, delete the bits that are not a problem (leaving Mägenwil as the entire file), save it, and use the Linux command "file", which will tell you what the right encoding is (reliable only for small files... it doesn't read the whole file; maybe Notepad++ does the exact same thing). The reason for deleting the other bits is that the file might be a mix of UTF-8, which the editor has used for detection, plus something else.
I would try the iconv command in Linux to test. For example:
iconv -f UTF-16 -t UTF-8 -o outfile infile
And any encoding conversion should be possible in C# or any featureful language, as long as you know how it was mutilated so you can reverse it. And if you find that it is part UTF-8 and part something else, then remember not to convert the whole file, but only the important parts.
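A rough C# equivalent of that iconv call might be the following (file names are placeholders, and you still have to know what the source encoding actually is):

using System.IO;
using System.Text;

byte[] inputBytes = File.ReadAllBytes("infile");
// decode as UTF-16 little-endian, re-encode as UTF-8
byte[] outputBytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, inputBytes);
File.WriteAllBytes("outfile", outputBytes);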
I have been trying to figure out the difference for quite some time now. The issue is with a file in ANSI encoding that has Japanese characters like: ‚È‚‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. Its equivalent in Shift-JIS is 少なくとも1つのINCREMENT行が必要です, which is the Japanese text it is expected to be.
I need to display these characters on a webpage after reading them from the file (in ANSI). There are some other files in UTF-8 that display their characters correctly and don't show this problem. I am finding it difficult to figure out what the difference is and how I should change the encoding to do the right thing here.
I use C# for reading this file and displaying it, and I also need to write the string back into the file if it's modified on the web. Any encoding and decoding schemes to use here?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
For the system that text comes from, "ANSI" would appear to mean Shift-JIS - so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
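A minimal sketch of that round trip (the file name is a placeholder; on .NET Core / .NET 5+ you may first need to register the System.Text.Encoding.CodePages provider):

// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);  // only needed on .NET Core / .NET 5+
var shiftJis = Encoding.GetEncoding("shift_jis");

string text;
using (var reader = new StreamReader("input.txt", shiftJis))
{
    text = reader.ReadToEnd();  // now a normal .NET string, safe to render on the web page
}

// ... modify text on the web side ...

using (var writer = new StreamWriter("input.txt", false, shiftJis))
{
    writer.Write(text);  // re-encoded back to Shift-JIS on save
}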
I'm reading a file using:
var source = File.ReadAllText(path);
and the character © wasn't being loaded correctly.
Then, I changed it to:
var source = File.ReadAllText(path, Encoding.UTF8);
and nothing.
I decided to try using
var source = File.ReadAllText(path, Encoding.Default);
and it worked perfectly.
Then I debugged it and tried to find which Encoding did the trick, and I found that it was UTF-7.
What I want to know is:
Is it recommended to use Encoding.Default, and can it guarantee all the characters of the file will be read without problems?
Encoding.Default will only guarantee that all UTF-7 character sets will be read correctly (google for the whole set). On the other hand, if you try to read a file not encoded with UTF-8 in the UTF-8 mode, you'll get corrupted characters like you did.
For instance if the file is encoded UTF-16 and if you read it in UTF-16 mode, you'll be fine even if the file does not contain a single UTF-16 specific character. It all boils down to the file's encoding.
You'll need to do the save - reopen stuff with the same encoding to be safe from corruptions. Otherwise, try to use UTF-7 as much as you can since it is the most compact yet 'email safe' encoding possible, which is why it is default in most .NET framework setups.
It is not recommended to use Encoding.Default.
Quote from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
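As a concrete illustration of what the quote recommends (the file name is a placeholder):

// write with UTF-8 plus a preamble (BOM) so later readers can detect the encoding...
using (var writer = new StreamWriter("data.txt", false, new UTF8Encoding(true)))
{
    writer.WriteLine("© and £ survive the round trip");
}

// ...and read it back: the BOM lets StreamReader pick the right encoding automatically
using (var reader = new StreamReader("data.txt", Encoding.UTF8, true))
{
    var text = reader.ReadToEnd();
}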
It sounds like you are interested in auto-detecting the encoding of a file, in some sort of situation where you are not in control of the encoding used to save it. There are several questions on StackOverflow addressing this; some cursory browsing points to Determine a string's encoding in C# as a pretty good one. My favorite answer is the one pointing to a C# port of Mozilla's universal charset detector.
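If you go the charset-detector route, the sketch below shows roughly how the Ude NuGet package (one C# port of Mozilla's detector) is commonly used; treat the exact member names as an assumption and check the package's documentation:

using System.IO;
using Ude;  // assumed namespace of the Ude NuGet package

byte[] sample = File.ReadAllBytes("unknown.txt");  // or just the first chunk of a huge file
var detector = new CharsetDetector();
detector.Feed(sample, 0, sample.Length);
detector.DataEnd();

// Charset is a name such as "UTF-8" or "windows-1252"; null means it could not decide
string charset = detector.Charset ?? "windows-1252";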
I think your file is in UTF-7 encoding, nothing more.