I have a project where everything that is stored in the database is encrypted. For encoding we use System.Text.Encoding.Default.GetBytes(text).
The problem is that the client now wants to add support for Polish (and other Nordic) characters, and with the Default encoding this doesn't work: the Polish characters get converted to English characters (e.g. Ą gets converted to A).
I can't change the encoding (Unicode seems to work), as the previous data would be lost.
Is there any way to get around this and add support for new characters while keeping the old data?
To be clear: you realise that "encoding" is not "encrypting", but I suppose you encrypt the byte array you get from encoding your string data?
Then I'd suggest either decrypting, re-encoding, and re-encrypting all existing data using UTF-8 (the most efficient encoding for Western alphabets), or adding a "version" or "encoding" column indicating which encoding the data was encrypted with.
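If you take the first route, the migration for each stored value is essentially a decode/re-encode round-trip. A minimal sketch, where Decrypt and Encrypt are hypothetical stand-ins for whatever cipher the project already uses (not a real API):

using System;
using System.Text;

class MigrationSketch
{
    // One-time migration for a single stored value.
    static byte[] MigrateValue(byte[] stored,
                               Func<byte[], byte[]> decrypt,
                               Func<byte[], byte[]> encrypt)
    {
        byte[] legacyBytes = decrypt(stored);                   // ciphertext -> legacy bytes
        string text = Encoding.Default.GetString(legacyBytes);  // decode with the old ANSI code page
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);        // re-encode losslessly as UTF-8
        return encrypt(utf8Bytes);                              // re-encrypt for storage
    }
}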
Related
Well, when using IO.File.ReadAllText(path) or ReadAllText(path, System.Text.Encoding.UTF8) to read a text file that is saved in ANSI encoding, non-Latin characters aren't displayed correctly.
So, I decided to use Encoding.Default. It worked just fine, but I see recommendations against using it everywhere (like here and here) because it "will only guarantee that all UTF-7 character sets will be read correctly". Also, Microsoft says:
Gets an encoding for the operating system's current ANSI code page.
However, it seems to me that it can recognize a file with any encoding. I tested that on a file containing Chinese, Japanese, and Arabic characters (the file is saved in UTF-8 encoding), and I was able to display the file correctly.
Code used:
Dim loadedText As String = IO.File.ReadAllText(path, System.Text.Encoding.Default)
MessageBox.Show(loadedText, "utf8")
Output: (screenshot of the MessageBox showing the characters displayed correctly)
So my question in points:
Is there something I'm missing here?
Why is it not recommended to use Encoding.Default when reading a file?
I know that a file with ANSI encoding would be displayed incorrectly if the default system encoding/system locale is changed, which is something I don't care about in my current case. But:
Is there another way to prevent this from happening?
Side note: Please don't mind me using the c# tag. Although my code is in VB, any answer with C# code is welcome.
File.ReadAllText actually tries to auto-detect the encoding. If the encoding cannot be determined from a BOM, then the encoding argument is used to decode the file.
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
If you used Encoding.UTF8 to write the file, then it would include a BOM. Your Encoding.Default is likely being ignored.
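A quick way to see this for yourself (a small sketch; the sample text is arbitrary):

using System;
using System.IO;
using System.Text;

class BomDetectionDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        // Passing Encoding.UTF8 explicitly writes an EF BB BF byte order mark.
        File.WriteAllText(path, "中文 العربية 日本語", Encoding.UTF8);
        // The BOM is detected first, so the Encoding.Default argument is ignored.
        string text = File.ReadAllText(path, Encoding.Default);
        Console.WriteLine(text); // the characters come back intact
    }
}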
Using Encoding.Default is not recommended because it is the operating system's ANSI code page, which is limited to that code page's character set. In other words, a text file created in Notepad (ANSI encoding) on Czech Windows will be displayed incorrectly on English Windows. For this reason, everything should be saved and opened in UTF-8 encoding. In short (a sketch of the last case follows the list):
Saved in ANSI and opened in Unicode may not work
Saved in Unicode and opened in ANSI will not work
Saved in ANSI and opened in another ANSI may not work
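A minimal sketch of that last case, assuming Czech (code page 1250) and Western European (code page 1252) Windows:

using System;
using System.Text;

class AnsiMismatchDemo
{
    static void Main()
    {
        // (On .NET Core/5+, the code-page encodings first require
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).)
        byte[] bytes = Encoding.GetEncoding(1250).GetBytes("čeština");  // saved on Czech Windows
        Console.WriteLine(Encoding.GetEncoding(1250).GetString(bytes)); // "čeština" - same ANSI, fine
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes)); // "èeština" - wrong ANSI, mojibake
    }
}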
I've been using the DOMDocument object from VB6 (MSXML) to create and save an XML file that has an encrypted string. However, this string I think has certain special characters...
<EncryptedPassword>ÆÔ¤ïÎ
߯8KHÖN›¢)Þ,qiãÔÙ</EncryptedPassword>
With this, I go into my C# project, and de-serialise this XML file in UTF-8 encoding, and it fails on this string. I've tried serialisation via ASCII and this gets a couple of characters more, but still fails. If I put a plain text string in this place, all is OK! :(
I'm thinking that maybe I'm better off converting the string into an MD5-type string from VB6 first, decoding the MD5 string in .NET, and then decrypting the actual string with special characters, but it's an extra step to code all this up and I was hoping someone might have a better idea for me here?
Thanks in advance!
The best thing for you to do is to encode your encrypted string in something that uses the ASCII charset. The easiest way to do this is to take your encrypted string, encode it into Base64, and write the encoded value to the XML element.
And in .NET, simply take the value of the XML element and decode it from Base64 and 'voila', you have your encrypted string.
.NET can easily decode a Base64 string; see: http://msdn.microsoft.com/en-us/library/system.text.encoding.ascii.aspx. (This page may make it look a bit more complicated than it really is.)
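For example (a sketch with made-up cipher bytes):

using System;

class Base64RoundTrip
{
    static void Main()
    {
        byte[] cipher = { 0xC6, 0xD4, 0xA4, 0x9B };          // stand-in for the encrypted bytes
        string forXml = Convert.ToBase64String(cipher);       // "xtSkmw==" - plain ASCII, XML-safe
        byte[] roundTrip = Convert.FromBase64String(forXml);  // original bytes restored exactly
        Console.WriteLine(forXml);
    }
}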
VB6 does not have native support for Base64 encoding, but a quick trawl on Google throws up some examples of how it can be achieved quite easily:
http://www.vbforums.com/showthread.php?t=379072
http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
http://www.mcmillan.org.nz/Programming/post/Base64-Class.aspx
http://www.freevbcode.com/ShowCode.asp?ID=2038
I've concluded that storing these characters in the XML file is wrong. VB6 allows this, but .NET doesn't! Therefore I have converted the string to a Base64 array in line with this link:
http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Now, on the .NET side, the file will de-serialise back into my class, where I now store the password as a byte array. I then convert this back to the string I need to decrypt, which now appears to raise another problem!
string password = Encoding.UTF7.GetString(this.EncryptedPassword);
With this encoding conversion, I get the string almost exactly back to how I want it, but there is a small greater-than character ('›') that is just not translating properly! Then a colleague found a Stack Overflow post that had the final answer: there's a discrepancy between VB6 and .NET on this type of conversion. Doing the following instead did the trick:
string password = Encoding.GetEncoding(1252).GetString(this.EncryptedPassword);
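Code page 1252 works here because it maps every byte value to a single character (in .NET, even the unassigned slots round-trip as control characters), so a byte string coming out of VB6 survives intact. A small sketch using bytes from the sample above; the 0x9B byte is the '›' that UTF-7 mangled:

using System;
using System.Text;

class Cp1252RoundTrip
{
    static void Main()
    {
        byte[] raw = { 0xC6, 0xD4, 0xA4, 0x9B };                 // 'Æ', 'Ô', '¤', '›' in 1252
        string text = Encoding.GetEncoding(1252).GetString(raw);  // "ÆÔ¤›"
        byte[] back = Encoding.GetEncoding(1252).GetBytes(text);  // identical to raw
        Console.WriteLine(text);
    }
}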
Thanks for all the help, much appreciated. The original post about this is: .Net unicode problem, vb6 legacy
I have a file that I need to import.
The problem is that a lot of characters in that file are wrong.
For example these names are wrong:
BjÃ¶rn (in file) - Should be Björn
Ã…ke (in file) - Should be Åke
Unfortunately I can't recreate the file with the correct encoding.
Also, there are a lot of characters that are wrong (these were just examples). I can't do a search and replace on all of them (unless there is a dictionary with all the conversions).
Can I decode the strings in some way?
Thanks, Patrik
Edit:
Just some more info that I should have added before (I blame my tiredness).
The file is an .xlsx file.
I debugged this with Notepad++. I copied the correct strings into Notepad++. I used Encoding | Convert to UTF-8. Then I selected Encoding | Encode as ANSI. This has the effect of interpreting the UTF-8 bytes as if they were ANSI. And when I did this I end up with the same erroneous values as you. So clearly when you read the file you are interpreting is as ANSI rather than UTF-8.
The conclusion, then, is that your file has been encoded as UTF-8. Make sure that the file is interpreted as UTF-8 when you read it. I can't tell you exactly how to do that, since you didn't show how you were reading the file in the first place.
It's possible that your file does not contain a byte-order-mark (BOM). If so then specify the encoding when you read the file by passing Encoding.UTF8.
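If some strings have already been read the wrong way, the round-trip described above can often be reversed in memory. A sketch, assuming the ANSI code page involved was Windows-1252 (this only works while every mis-decoded byte survived that round-trip):

using System;
using System.Text;

class MojibakeRepair
{
    static void Main()
    {
        string damaged = "BjÃ¶rn";                                       // UTF-8 bytes mis-read as ANSI
        byte[] utf8Bytes = Encoding.GetEncoding(1252).GetBytes(damaged); // recover the raw bytes
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));           // "Björn"
    }
}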
I've just tried your first example, and it definitely looks like that's UTF-8.
It's unclear what you're using to look at the file in the first place, but if you load it with a text editor which understands UTF-8 and tell it that it's a UTF-8 file, it should be fine.
When you load it with .NET, you should just be able to use File.OpenText, File.ReadAllText etc - most IO dealing with encodings in .NET defaults to UTF-8 anyway.
I have a website where a user can upload a txt file of data and the data will be imported into the db. However, some users are uploading the data in UTF-8, and others are uploading it in UTF-16.
byte[] fileData = new byte[uploader.PostedFile.ContentLength]; // allocate a buffer for the whole upload
uploader.PostedFile.InputStream.Read(fileData, 0, fileData.Length);
data = TLCommon.EncodeJsString(System.Text.Encoding.UTF8.GetString(fileData));
When the file is saved in UTF-16 and uploaded, the data is garbage. How can I handle this situation?
There are various heuristics you can employ, such as checking for a high percentage of 00 bytes in the stream. (These won't be present in UTF-8, but are common in UTF-16 text that contains ASCII characters.)
This, however, can't distinguish between UTF-8 and Windows-1252, which are incompatible 8-bit encodings that are both very common on U.S. English Windows systems. You can add more checks, such as looking for byte sequences that are invalid in one encoding but not another, but this starts to get very complex and typically doesn't distinguish between different single-byte encodings.
Microsoft provides a library named MLang, which can automatically detect UTF-8, UTF-16, and many 8-bit codepages using statistical analysis of the bytes in the stream. Its accuracy is quite good if it has a large-enough sample of text to work with. I blogged about how to use this method, and posted the full source code on GitHub.
There are a few options you can use: check the Content-Type header to see if it includes a charset parameter, which would indicate the encoding (e.g. Content-Type: text/plain; charset=utf-16); check whether the uploaded data has a BOM (the first few bytes in the file, which map to the Unicode character U+FEFF: 2 bytes for UTF-16, 3 for UTF-8); or, if you know something about the file (e.g. the first character is supposed to be ASCII, as in XML, which starts with '<'), use that to work out the encoding. If you don't have any of those pieces of information, you'll have to guess using some heuristics.
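A crude sketch combining both suggestions (honour a BOM when present, otherwise fall back to the zero-byte heuristic; nowhere near as robust as MLang):

using System.Text;

static class EncodingGuesser
{
    public static Encoding Guess(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;                 // UTF-8 BOM
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;              // UTF-16 little-endian BOM
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;     // UTF-16 big-endian BOM
        // No BOM: a high proportion of 0x00 bytes suggests UTF-16 holding ASCII text.
        int zeros = 0;
        foreach (byte b in data)
            if (b == 0) zeros++;
        return zeros > data.Length / 4 ? Encoding.Unicode : Encoding.UTF8;
    }
}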
I'm reading a file using:
var source = File.ReadAllText(path);
and the character © wasn't being loaded correctly.
Then, I changed it to:
var source = File.ReadAllText(path, Encoding.UTF8);
and nothing.
I decided to try using
var source = File.ReadAllText(path, Encoding.Default);
and it worked perfectly.
Then I debugged it and tried to find which Encoding did the trick, and I found that it was UTF-7.
What I want to know is:
Is it recommended to use Encoding.Default, and can it guarantee all the characters of the file will be read without problems?
Encoding.Default will only guarantee that all UTF-7 character sets will be read correctly (google for the whole set). On the other hand, if you try to read a file not encoded with UTF-8 in the UTF-8 mode, you'll get corrupted characters like you did.
For instance if the file is encoded UTF-16 and if you read it in UTF-16 mode, you'll be fine even if the file does not contain a single UTF-16 specific character. It all boils down to the file's encoding.
You'll need to do the save/reopen with the same encoding to be safe from corruption. Otherwise, try to use UTF-7 as much as you can, since it is the most compact yet 'email safe' encoding possible, which is why it is the default in most .NET Framework setups.
It is not recommended to use Encoding.Default.
Quote from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
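Following that advice, a sketch of writing with a preamble so the encoding can always be identified on the way back in:

using System;
using System.IO;
using System.Text;

class PreambleDemo
{
    static void Main()
    {
        // UTF8Encoding with the identifier flag emits a BOM (the preamble).
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        File.WriteAllText("data.txt", "© and friends", utf8WithBom);
        // ReadAllText detects the BOM, so no encoding has to be guessed.
        Console.WriteLine(File.ReadAllText("data.txt"));
    }
}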
It sounds like you are interested in auto-detecting the encoding of a file, in some sort of situation where you are not in control of the encoding used to save it. There are several questions on StackOverflow addressing this; some cursory browsing points to Determine a string's encoding in C# as a pretty good one. My favorite answer is the one pointing to a C# port of Mozilla's universal charset detector.
I think your file is in UTF-7 encoding, nothing more.