FromBase64String/UTF Encoding - c#

My issue is based around a string of data that I'm getting back from an API call. I'm passing the raw data into FromBase64String and then decoding the byte array back to a string. I'm expecting a valid pdfsharp return that I'm saving to a file. None of the decoded string values below contain the correct data. I know the original Base64-encoded API return string is valid, since I can open it in Notepad++ and use a Base64 decoder to create the properly formatted PDF document.
byte[] todecode_byte = Convert.FromBase64String(data);
string decodedUTF7 = Encoding.UTF7.GetString(todecode_byte);
string decodedUTF8 = Encoding.UTF8.GetString(todecode_byte);
The closest representation to what I think it should be (the Notepad++ converted version) is the UTF7 one. But some data seems to be missing from the embedded images within the document. The UTF8 version has some structural differences compared to the working document.
For example...
My control...
%PDF-1.7
%ÓôÌá
1 0 obj
<<
UTF7...
%PDF-1.7
%ÓôÌá
1 0 obj
<<
UTF8...
%PDF-1.7
%����
1 0 obj
<<
But, again, the UTF7 version seems to have issues with the images that are embedded further down in the document. Either way, both versions create an 88k PDF document that opens as a blank page. The control (using Notepad++), when saved as a PDF document, is about half of that size and opens displaying all of the correct information.

I'm expecting a valid pdfsharp return that I'm saving to a file.
If it's meant to be a PDF file, I wouldn't try to convert that to a string at all. It's simply not text - it's binary data. It should be as simple as:
byte[] binaryData = Convert.FromBase64String(data);
File.WriteAllBytes("file.pdf", binaryData);
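To see why the string step does the damage, here is a minimal sketch written as a C# 9+ top-level program. The sample bytes are hypothetical: they stand in for the first two lines of a PDF like the control shown in the question, not for the actual API response.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical sample: "%PDF-1.7\n" followed by the binary comment line "%ÓôÌá\n".
byte[] original = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x37, 0x0A,
                    0x25, 0xD3, 0xF4, 0xCC, 0xE1, 0x0A };
string base64 = Convert.ToBase64String(original);

// Decoding Base64 gives back exactly the original bytes.
byte[] decoded = Convert.FromBase64String(base64);
bool bytesIntact = decoded.SequenceEqual(original);

// Forcing those bytes through a text encoding does not round-trip:
// 0xD3, 0xF4, 0xCC and 0xE1 do not form valid UTF-8 here, so each one
// is replaced by U+FFFD, which is written back out as three bytes.
string asText = Encoding.UTF8.GetString(decoded);
byte[] viaText = Encoding.UTF8.GetBytes(asText);
bool textIntact = viaText.SequenceEqual(original);

Console.WriteLine($"bytes intact: {bytesIntact}, via UTF-8 text: {textIntact}");
Console.WriteLine($"original length: {original.Length}, via text: {viaText.Length}");
```

The replacement characters also make the output longer than the input, which may be part of the reason the broken file in the question was larger than the working one.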

Related

C# - String stored as Base64, but the retrieved string is not a valid base64 string

I'm using Base64 encoding to store values from my data structure into a string.
Basically what I do is convert a byte array into base64 string
string StoredData = Convert.ToBase64String(ByteArray);
I then divide StoredData into strings with a maximum length of 256 characters and store them as ASCII strings (in AutoCAD XData as a DxfCode.ExtendedDataAsciiString).
When I want to retrieve my data I do the following:
First I combine the 256-character strings using StoredData = string1 + string2 + ...
Then I convert StoredData back into ByteArray using
var ByteArray = Convert.FromBase64String(StoredData);
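For reference, the split-and-rejoin scheme described above can be sketched like this (C# 9+ top-level program). The payload and the names `ChunkSize`, `chunks` and `rejoined` are made up for the illustration, and the AutoCAD XData storage step is left out entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int ChunkSize = 256; // maximum chunk length from the question

// Hypothetical payload standing in for the real data structure.
byte[] payload = Enumerable.Range(0, 1000).Select(i => (byte)(i % 256)).ToArray();
string storedData = Convert.ToBase64String(payload);

// Split into pieces of at most ChunkSize characters.
var chunks = new List<string>();
for (int i = 0; i < storedData.Length; i += ChunkSize)
{
    chunks.Add(storedData.Substring(i, Math.Min(ChunkSize, storedData.Length - i)));
}

// Recombine in the original order and decode.
string rejoined = string.Concat(chunks);
byte[] restored = Convert.FromBase64String(rejoined);

Console.WriteLine(restored.SequenceEqual(payload)); // True
```

As long as every chunk comes back unchanged and in order, the round trip is lossless, so a failure points at the storage layer rather than at the Base64 step.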
This worked great for me and my clients until a month ago, when one of my clients started getting crashes and errors.
I asked him to send me his stored data, and I was surprised to see that it contained invalid Base64 characters (see sample below):
tM7x24QLLLALr5ivAx3XFAM7uciYXrCjKXSFd3XOL/KGIc3C+JMO8QjHT/4c+puYrNLq5r9Is0vpDKyuxw9I6R3f1LuOYSdHS6XgZJEyMvGwSHNRSYJ/a0IoumQftB3XspQRwp4QSd7qcUVsrXw0+2RS/sd2vAvUFxEQgwsHaabb01YjchGeyxr1f78A4qy2BL/oHAsRak9UYN0mDzhZgbhpahlgdK3eWd8b2BTM01lWh74pYUrJR+JfQ0tw0Eu㿔
Z/1JxBMUv2cB6NrFehSuNF9l4dhAaZQ+TcIClZmk/ZC8TJ0rKka/J+HqhLDAwWExB3nXoIi00uJnE7J4R6rU+Q==
As you can see, the first 256-character string contains an invalid Base64 character (㿔).
Why is that happening? Can this be related to the user's computer? I tried to replicate this error without any success, and because I don't have access to their computers, I'm starting to think it might be something on their side.
The application uses .Net framework version 4.5.
Edit: it turned out the client had sent me a recovered document in which the text strings were not recovered properly, which explains the corrupted string.
It turns out the app had crashed and the client had recovered the drawing document, which left it with a corrupted string.
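If you want to catch this kind of corruption before decoding, one option is to scan the recombined string for characters outside the Base64 alphabet. This is only a sketch (C# 9+ top-level program); `FirstInvalidBase64Index` is a hypothetical helper, not a framework method, and it checks the alphabet only, not padding or length. Convert.FromBase64String itself throws a FormatException on such input.

```csharp
using System;

// Returns the index of the first character outside the standard Base64
// alphabet (A-Z, a-z, 0-9, '+', '/', '='), or -1 if there is none.
static int FirstInvalidBase64Index(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        char c = s[i];
        bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                  (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
        if (!ok) return i;
    }
    return -1;
}

string good = "SGVsbG8=";   // Base64 for "Hello"
string bad = "SGVs㿔G8=";    // same string with one character replaced by a stray one

Console.WriteLine(FirstInvalidBase64Index(good)); // -1
Console.WriteLine(FirstInvalidBase64Index(bad));  // 4
```

Reporting the index makes it easier to tell which stored chunk was damaged.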

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've added encoding conversions all over and tried getting byte arrays and passing them around, but nothing seems to work.
I went to the very start of where I create the JsonData object and tried to run the following test:
public JsonData CreateJSONDataObject()
{
Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");
string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
return jsonDataObject;
}
I made sure my jsonString is using UTF-8, however the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but as this is making sure to encode right when creating a JsonData object I can't think of what I am doing wrong as I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
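// Reproduces the problem when Encoding.Default is a legacy ANSI code page (as on .NET Framework).
string file = @"c:\temp\test.txt";
string text = "Wöden";
File.WriteAllText(file, text, Encoding.Default);      // write with the system default encoding
string text2 = File.ReadAllText(file, Encoding.UTF8); // read back as UTF-8: the "ö" is corrupted
Debug.WriteLine(text2);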
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
Encoding.GetEncoding("Windows-1252"));
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
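Here is a self-contained version of that conversion as a C# 9+ top-level program. It assumes, purely for the sake of the example, that the original encoding was ISO-8859-1 and the target is UTF-8 without a BOM. It works on a temporary file rather than your real JSON file.

```csharp
using System;
using System.IO;
using System.Text;

// Assumption for this sketch: the file was originally saved as ISO-8859-1.
Encoding originalEncoding = Encoding.GetEncoding("iso-8859-1");
Encoding newEncoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

string file = Path.GetTempFileName();
File.WriteAllText(file, "W\u00F6den", originalEncoding); // simulate the legacy file

// Read with the encoding the file was saved in, then write with the new one.
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);

// The file can now be read correctly as UTF-8.
string reread = File.ReadAllText(file, Encoding.UTF8);
long utf8Length = new FileInfo(file).Length;
Console.WriteLine(reread == "W\u00F6den"); // True
File.Delete(file);
```

The file grows from five bytes to six because "ö" takes one byte in ISO-8859-1 and two in UTF-8.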
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, can write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically (although UTF-8 files are often saved without one). That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
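As a rough illustration of detecting a BOM programmatically (C# 9+ top-level program): `DetectBom` is a hypothetical helper that looks only for the UTF-8 and UTF-16 marks and ignores UTF-32. A result of "none" does not tell you which encoding the file uses, only that it carries no marker.

```csharp
using System;

// Returns a name for the BOM at the start of the data, or "none" if no
// UTF-8 or UTF-16 BOM is present. UTF-32 is deliberately not handled here.
static string DetectBom(byte[] data)
{
    if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return "UTF-8";
    if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16 LE";
    if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16 BE";
    return "none";
}

// "Wöden" as UTF-8 with a BOM, and as ISO-8859-1 (which has no marker).
byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, 0x57, 0xC3, 0xB6, 0x64, 0x65, 0x6E };
byte[] legacy = { 0x57, 0xF6, 0x64, 0x65, 0x6E };

Console.WriteLine(DetectBom(utf8WithBom)); // UTF-8
Console.WriteLine(DetectBom(legacy));      // none
```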
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

C# Unrecognized characters while reading from binary file

I have some items whose information is split into two parts: one is the contents of a binary file, and the other is a textual entry inside a .txt file. I am trying to make an app that will pack this info into one text file (a text file because I want it to be human-readable as well), with the ability to later unpack that file by creating a new binary file and text entry.
The first problem I ran into so far: some info is lost when converting binary into string (or perhaps sooner, during reading of bytes), and I'm not sure if the file is in weird format or I'm doing something wrong. Some characters get shown as question marks.
Example of characters which are replaced with question marks:
ýÿÿ
This is the part where info is read from the binary file and encoded into a string (which is how I intended to store it inside a text file).
byte[] binaryFile = File.ReadAllBytes(pathBinary);
// I also tried this for some reason: byte[] binaryFile = Encoding.ASCII.GetBytes(File.ReadAllText(pathBinary));
string binaryFileText = Convert.ToBase64String(binaryFile); //this is the coded string that goes into joined file to hold binary file information, when decoded the result shows question marks instead of some characters
MessageBox.Show("binary file text: " + Encoding.ASCII.GetString(binaryFile), "debug", MessageBoxButtons.OK, MessageBoxIcon.Information); //this also shows question marks
I expect a few more caveats along the way with second functionality of the app (unpacking back into text and binary), but so far my main problem is unrecognized characters during reading of the binary file or converting it into string, which makes this data unusable in storing as text for purpose of reproducing the file. Any help would be appreciated.
There is no universal conversion of binary data to a string. A string is a series of Unicode characters and as such can hold any character in the Unicode range.
Binary data is a series of bytes and as such can be anything from video to a string in various formats.
Since there are multiple binary representations of a string, you need an Encoding to convert one into the other. The encoding you choose has to match the binary string format. If it doesn't, you will get the wrong result.
You are using ASCII encoding for the conversion, which is obviously incorrect. ASCII cannot encode the full Unicode range. That means even if you use it for encoding, the result of the decoding will not always match the original text.
If you have both encoding and decoding under your control, use an Encoding that can do the full round trip, such as UTF8 or Unicode. If you don't encode the string yourself, use the Encoding that was actually used to produce the bytes.
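The advice above is about bytes that really are encoded text. When the bytes are arbitrary binary data, as in the question, the Base64 string the question already builds is the part that round-trips. The ASCII decoding in the debug message box is what produces the question marks. Here is a small sketch (C# 9+ top-level program), with sample bytes chosen on the assumption that "ýÿÿ" corresponds to 0xFD 0xFF 0xFF:

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical sample: 0xFD 0xFF 0xFF ("ýÿÿ" in a single-byte Western
// encoding), followed by a zero byte and the letter 'A'.
byte[] binary = { 0xFD, 0xFF, 0xFF, 0x00, 0x41 };

// ASCII decoding replaces every byte above 0x7F with '?', so data is lost.
string viaAscii = Encoding.ASCII.GetString(binary);
bool asciiRoundTrips = Encoding.ASCII.GetBytes(viaAscii).SequenceEqual(binary);

// Base64 maps any byte sequence to plain ASCII text and back without loss.
string viaBase64 = Convert.ToBase64String(binary);
bool base64RoundTrips = Convert.FromBase64String(viaBase64).SequenceEqual(binary);

Console.WriteLine($"ASCII: {asciiRoundTrips}, Base64: {base64RoundTrips}");
```

So for unpacking, the Base64 text would be passed to Convert.FromBase64String and the resulting bytes written straight to the new binary file, with no text encoding in between.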

Byte array to text for ScintillaNET

I'm writing a Windows Forms application in C#. The application allows the user to select source code files from a listbox and displays them with syntax coloring using ScintillaNET. The files are saved as byte arrays in a database. I've managed to convert a file on my hard drive to a byte array and store it. The user should also be able to edit the code and then save it to the database without having to download the file to their local hard drive first, but I don't know how to approach this.
Basically I want to take the text from the ScintillaNET control and convert it to a byte array.
And the other way around, take a byte array and print out the text as it originally appeared in ScintillaNET.
You can use the "Encoding" class from System.Text.
System.Text.Encoding.Unicode.GetBytes("Example");
This will return a byte array with the bytes equivalent to the text "Example" using the Unicode (UTF-16) encoding. There are other encodings available, but I suggest using Unicode since it supports every character (anything you find in the Windows charmap, for example). In my case that matters because I write in a Latin language and certain letters aren't available in ASCII; UTF-8 would also cover them.
Now to convert from the byte array to string use:
byte[] exampleByteArray = MemStream.ToArray();
System.Text.Encoding.Unicode.GetString(exampleByteArray);
This code will return the string saved previously as a byte array in a memory stream. You can load the byte array in other ways; in your case you would load it from the database and call System.Text.Encoding.Unicode.GetString().
I believe you are looking for the System.Text.Encoding class...
// a sample string...
string example = "A string example...";
// convert string to bytes
byte[] bytes = Encoding.UTF8.GetBytes(example);
// convert bytes to string
string str = System.Text.Encoding.UTF8.GetString(bytes);

Converting SQL Server varBinary data into string C#

I need help figuring out how to convert data that comes from a SQL Server table column defined as varbinary(max) into a string in order to display it in a label.
This is in C# and I'm using a DataReader.
I can pull the data in using:
var BinaryString = reader[1];
I know that this column holds text that was previously converted to binary.
It really depends on which encoding was used when you originally converted from string to binary:
byte[] binaryString = (byte[])reader[1];
// if the original encoding was ASCII
string x = Encoding.ASCII.GetString(binaryString);
// if the original encoding was UTF-8
string y = Encoding.UTF8.GetString(binaryString);
// if the original encoding was UTF-16
string z = Encoding.Unicode.GetString(binaryString);
// etc
The binary data must be encoded text - and you need to know which encoding was used in order to accurately convert it back to text. So for example, you might use:
byte[] binaryData = (byte[])reader[1]; // the indexer returns object, so a cast is needed
string text = Encoding.UTF8.GetString(binaryData);
or
byte[] binaryData = (byte[])reader[1];
string text = Encoding.Unicode.GetString(binaryData);
or various other options... but you need to know the right encoding. Otherwise it's like trying to load a JPEG file into an image viewer which only reads PNG... but worse, because if you get the wrong encoding it may appear to work for some strings.
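The "may appear to work for some strings" part is easy to demonstrate. This is a minimal sketch (C# 9+ top-level program) using made-up sample strings and ISO-8859-1 as the stand-in wrong encoding:

```csharp
using System;
using System.Text;

Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

// Pure ASCII text has the same bytes in UTF-8 and ISO-8859-1,
// so decoding with the wrong one of the two still looks correct.
byte[] plain = Encoding.UTF8.GetBytes("Woden");
bool plainSurvives = latin1.GetString(plain) == "Woden";

// A non-ASCII character exposes the mismatch: the two UTF-8 bytes of "ö"
// are decoded as two separate ISO-8859-1 characters.
byte[] accented = Encoding.UTF8.GetBytes("W\u00F6den");
string wrong = latin1.GetString(accented);        // "WÃ¶den"
string right = Encoding.UTF8.GetString(accented); // "Wöden"

Console.WriteLine($"{plainSurvives} | {wrong} | {right}");
```

Testing only with unaccented sample data can therefore hide the bug until real data arrives.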
The next thing to work out is why it's being stored as binary in the first place... if it's meant to be text, why isn't it being stored that way?
You need to know what encoding was used to create the binary. Then you can use
System.Text.Encoding.UTF8.GetString((byte[])reader[1]);
and replace UTF8 with whatever encoding was used.
