I am writing a program that detects the character encoding of a file and then converts it to Unicode. To test the application I would like to have a collection of text files with different character encodings, at least the most common ones. I've been googling for a while but didn't find any. I'm sure I'm not the first one with this problem, so I don't understand why I can't find anything like that.
Does anyone know an easy way to get many differently encoded text files?
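If there really is nothing ready-made, I suppose I could generate the samples myself from a single string. A minimal sketch of what I have in mind in C# (the sample text, file names and encoding list are just placeholders I made up), though real-world files would still be preferable:

using System.IO;
using System.Text;

// Sketch: write the same sample text out in several common encodings to get
// a small set of differently encoded test files. The sample string, file
// names and encoding list are just examples.
string sample = "Sample text with non-ASCII characters: æÆøØåÅ äöü";

var encodings = new (string Name, Encoding Value)[]
{
    ("utf8-bom", new UTF8Encoding(encoderShouldEmitUTF8Identifier: true)),
    ("utf8",     new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)),
    ("utf16le",  Encoding.Unicode),
    ("utf16be",  Encoding.BigEndianUnicode),
    ("cp1252",   Encoding.GetEncoding(1252)), // needs CodePagesEncodingProvider on .NET Core+
    ("latin1",   Encoding.GetEncoding("iso-8859-1"))
};

foreach (var (name, encoding) in encodings)
{
    File.WriteAllText($"sample-{name}.txt", sample, encoding);
}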
Related
I have very large text files that are being imported via SSIS into our database. These files come from hundreds of companies and a variety of different source systems. Most of these files import fine with code page 1252, but in some files, buried somewhere in one of the rows, there might be some oddball characters that don't fit in the 1252 code page.
I've implemented a solution based on this SO answer, which lets me proceed with code page 1252 on one path if the file's encoding is ANSI/ASCII, or go down another path with the 65001 code page. This seems to work in a lot of cases, but it is not reliable enough for us to use in production.
using (var r = new StreamReader(filename, Encoding.Default))
{
    richtextBox1.Text = r.ReadToEnd();
    // CurrentEncoding only changes from the encoding passed in if the reader
    // detected a byte order mark; otherwise it stays Encoding.Default.
    var encoding = r.CurrentEncoding;
}
I'm far from an expert on file encodings, but I'm guessing it's because the reader only looks at a portion of the file, and if everything there looks like ANSI characters it assumes the file is ANSI (these files are almost guaranteed not to have a BOM)?
Would I have to read the entire file into memory and examine every character to come to a mostly accurate file encoding? How can I do this when reading an extremely large file into memory would cause huge problems?
Is there a way to accomplish this with a reasonable level of certainty? I don't need to account for any kind of foreign languages as these are all English, but we've encountered the occasional strange character included in these files. I'm thinking we need to allow for ASCII, UTF-8 and UTF-16.
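To make it concrete, this is roughly what I imagine by "examining the file": stream it in chunks and test whether the bytes are valid UTF-8, falling back to 1252 if not. This is only a sketch of the idea (the buffer size, the fallback rule and the method name are my own assumptions), not something we've proven reliable:

using System.IO;
using System.Text;

static class EncodingProbe // name is made up
{
    // Sketch: stream the file and check whether its bytes form valid UTF-8,
    // without loading the whole file into memory. If decoding ever fails we
    // would fall back to code page 1252.
    public static bool LooksLikeValidUtf8(string path)
    {
        // A decoder that throws on invalid byte sequences and keeps its state
        // across buffer boundaries, so multi-byte characters split between
        // two reads are still handled correctly.
        var decoder = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                       throwOnInvalidBytes: true).GetDecoder();

        var bytes = new byte[64 * 1024]; // arbitrary chunk size
        var chars = new char[64 * 1024];

        try
        {
            using (var stream = File.OpenRead(path))
            {
                int read;
                while ((read = stream.Read(bytes, 0, bytes.Length)) > 0)
                {
                    decoder.GetChars(bytes, 0, read, chars, 0, flush: false);
                }
                // Flush to catch a multi-byte sequence truncated at end of file.
                decoder.GetChars(bytes, 0, 0, chars, 0, flush: true);
            }
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false; // not valid UTF-8 -> fall back to code page 1252
        }
    }
}

A UTF-16 file without a BOM wouldn't be caught by this, which is part of why I'm not sure the approach is sound.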
Is there a way to just be able to determine whether to use code page 1252 or 65001 in SSIS?
On a related note, if ASCII is a subset of UTF-8, why is it that when I import ALL the files as code page 65001, some of the characters don't translate correctly? Shouldn't UTF-8 work for everything if it encompasses ASCII?
This is partly a question for the Microsoft forums too, but I think there might be some coding involved.
We have a system built in C# .NET that generates CSV files. However, we have problems with the special characters "æÆøØåÅ". The thing is, when I open the file in Notepad, everything is correct. But when I open the file in Excel, these characters are wrong. If I open the file in Notepad and save it without actually making any changes, it then works in Excel. But I don't understand why. Is there some hidden information added to the file that we can adjust in our C# code to make it correct in the first place?
There are other questions like this, but all the answers I could find are workarounds for when you already have a wrong CSV file. In our case, we create the file, and the people we send the files to are usually not computer people capable of changing the encoding, etc.
Edit:
Here is the code we tried to use at the end, after generating our result CSV-string:
string result = "some;æøå;string";
// This just encodes the string to UTF-8 bytes and decodes them straight
// back, so the returned string is identical to the input.
byte[] bytes = System.Text.Encoding.GetEncoding(65001).GetBytes(result);
return System.Text.Encoding.GetEncoding(65001).GetString(bytes);
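One thing we are considering, since re-saving in Notepad apparently adds something to the file, is writing the CSV with an explicit UTF-8 byte order mark so Excel can detect the encoding. A rough sketch (the output path is a placeholder), though we haven't confirmed this is the right fix:

using System.IO;
using System.Text;

// Sketch: write the CSV with a UTF-8 encoding that emits a BOM (EF BB BF)
// at the start of the file. "output.csv" is a placeholder path.
string result = "some;æøå;string";

var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
File.WriteAllText("output.csv", result, utf8WithBom);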
I receive a file from a supplier that I download via SFTP. Our systems all run on Windows.
When I open the File in Notepad++ the status bar says "UNIX" and "UTF-8"
The special characters aren't displayed correctly.
I tried converting the file to the different formats Notepad++ allows, but none of them converted the character 'OSC' to the German letter 'ä'. Is this a known Unix/Windows thing? My google-fu obviously isn't good enough.
Which kind of conversion should I try to display the file correctly?
How can I achieve the same programmatically in C#?
It is common on Windows that a file's encoding doesn't match what the editor or even its XML header says it is. People are sloppy. Maybe it's really UTF-16, or the non-standard Windows "extended ASCII" thing, which I think is probably cp-1252. (It's not common on *nix since we usually all just use UTF-8, no need for others... not that *nix users are much less sloppy.)
To figure out which encoding it is, I would make a copy of the file, delete everything that isn't a problem (leaving "Mägenwil" as the entire file), save it, and run the Linux command "file" on it, which will tell you what the encoding is (reliable only for small files, since it doesn't read the whole file; Notepad++ may well do the exact same thing). The reason for deleting the other bits is that the file might be a mix of UTF-8, which the editor used for its detection, plus something else.
I would try the iconv command in linux to test. For example:
iconv -f UTF-16 -t UTF-8 -o outfile infile
And any encoding conversion should be possible in C# or any featureful language, as long as you know how the file was mutilated so you can reverse it. And if you find that it is part UTF-8 and part something else, remember not to convert the whole file, only the affected parts.
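As a sketch of the C# side (assuming the file turns out to be, say, Windows-1252; substitute whatever "file"/iconv tell you, and note that the hard part is identifying the source encoding, not the conversion itself):

using System.IO;
using System.Text;

// Sketch: the C# equivalent of the iconv call above — decode the file with
// its real source encoding and write it back out as UTF-8. Windows-1252 is
// only an assumption here; use whatever the file actually is.
// (On .NET Core / .NET 5+ you need the System.Text.Encoding.CodePages package
// and Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) before
// GetEncoding(1252) works.)
Encoding source = Encoding.GetEncoding(1252);

string text = File.ReadAllText("infile", source);
File.WriteAllText("outfile", text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));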
I have some documents in MHTML format and in PDF format. I want to know whether the content in the MHTML and PDF versions is the same. How can I compare them?
You will need an MHTML parser as well as a PDF parser library. Then you traverse both documents in parallel and compare the contents. Note that this is definitely non-trivial, as you will have to build a mapping between elements in the two different file formats.
If you want to take into account that content can be written in different ways (e.g. tables vs. tabs) and still look exactly the same to the user things get very complicated quickly.
My gut feeling from the way you are asking your questions is that this project is way larger and more complex than you are ready for.
I've been looking on certain sites for some time now, but I can't seem to find anything usable about file formats.
There is a certain file format on my computer which I want to re-create so I can make add-ons for a program. Unfortunately, I would be the first to do so for that particular format, which makes it all the harder. There are programs that can add information to the file, but those programs are unfortunately not open source. That does mean, though, that it's possible to figure out the format somehow.
The closest I came to finding usable information about re-creating a file format was "open it in Notepad or a hex editor, and see if you can find anything usable".
The format in question contains data, so nothing like music files or images, in case you're wondering.
I'm just wondering if there is any guide on how to create a file format, or on figuring out how an existing file format works. I believe this sort of format is called a tabulated data format?
It really does depend on the file format.
Ideally, you find some documentation on how the file format works and use that. This is easy if the file uses a public format, so for HTML or PNG files you can easily find that information. Proprietary formats often have published specs too, or at least a publicly available API for manipulating them, depending on the company's policy on actively encouraging this sort of extension.
Next best is using examples of working code (whether published source or reverse engineered in itself) that deal with the file as a reference implementation.
Otherwise, reverse engineering is as good as you can do. Opening it in Notepad and a hex editor is indeed the way to go (even with a binary format, looking at it parsed as text can tell you something; even with a text-based format, looking at it in a hex editor can tell you whether it makes use of non-printable characters). It's a detective job and, while sometimes easy, often very hard, especially since you may miss ways they deal with edge cases that aren't hit in the samples you use.
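If you don't have a hex editor handy, a dump like this is trivial to script. A rough C# sketch (the file name is a placeholder) that prints offset, hex bytes and a printable-ASCII view of the first 256 bytes:

using System;
using System.IO;
using System.Linq;

// Sketch: a quick hex + ASCII dump of the first 256 bytes of an unknown
// file — usually enough to spot magic numbers, embedded text and obvious
// record structure. "unknown.dat" is a placeholder.
byte[] data = File.ReadAllBytes("unknown.dat");
int length = Math.Min(data.Length, 256);

for (int offset = 0; offset < length; offset += 16)
{
    byte[] row = data.Skip(offset).Take(Math.Min(16, length - offset)).ToArray();
    string hex   = string.Join(" ", row.Select(b => b.ToString("X2")));
    string ascii = new string(row.Select(b => b >= 0x20 && b < 0x7F ? (char)b : '.').ToArray());
    Console.WriteLine($"{offset:X8}  {hex,-47}  {ascii}");
}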
The difficulty with obscure formats distributed with games is that they are often compiled from either a declarative definition language, a scripting language or directly from a set of resources like textures and meshes.
In some games, one compiled file will contain bits and pieces of all of the above, with no available documentation on the tools and formats used to piece it together. Some people call that "fun".
If you can't get anything from the hex, can't find any documentation and can't find a tool to read the file, you're probably best off asking the community to see if anyone is familiar with the technology.