Read file into byte array is different to string

I have a file in Visual Studio with the following contents: {"Name":"Pete"}
If I read the file with the following code it appears to create a string with the original value:
byte[] byteArray = System.IO.File.ReadAllBytes(filePath);
string jsonResponse = System.Text.Encoding.UTF8.GetString(byteArray);
However, the string is actually different to the version that exists if I use the following code:
string jsonResponse = "{\"Name\":\"Pete\"}";
Why? (The reason I think it is different is because when I pass each version to a json deserializer it behaves differently)
Thanks.

Given your final comment in the question, I suspect the problem is that you've got a byte-order mark at the start of the file. Try loading the file like this instead:
string jsonResponse = File.ReadAllText(filePath);
I believe that will strip the BOM for you. Alternatively, you could try explicitly trimming it yourself:
jsonResponse = jsonResponse.TrimStart('\uFEFF');
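To see the difference the answer describes, here is a minimal sketch that decodes the same bytes two ways. The class and method names are made up for illustration; it only assumes the file is UTF-8 with a byte-order mark (bytes EF BB BF) at the start.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Decodes the bytes the way the question does. Encoding.UTF8.GetString
    // does not skip a leading BOM, so it survives as the character U+FEFF.
    public static string DecodeRaw(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decodes through a StreamReader, which consumes a leading BOM
    // instead of returning it as part of the text.
    public static string DecodeWithReader(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Comparing `DecodeRaw(bytes)[0]` against `'\uFEFF'` is a quick way to confirm whether a BOM is what makes the two strings differ.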

My guess would be that you have a terminating newline in your file.
You can easily verify if two strings have the same content in C# by just comparing them with a == b.
Here's a short code sample that might help you identify the problem. The strings are output surrounded by < >, which should help you identify surrounding whitespace (which, by the way, can be removed using String.Trim).
byte[] byteArray = System.IO.File.ReadAllBytes(filePath);
string fromFile = System.Text.Encoding.UTF8.GetString(byteArray);
string fromString = "{\"Name\":\"Pete\"}";
if (fromFile == fromString) {
Console.WriteLine("Strings are the same.");
} else {
Console.WriteLine("Strings are different!");
Console.WriteLine("fromFile: <" + fromFile + ">");
Console.WriteLine("fromString: <" + fromString + ">");
}

Related

How do I use Xamarin's System.JSON?

I'm trying to deserialize a JSON string into an object. I am using the System.JSON library within Xamarin and this is what I have so far:
ServerConnection.Receive (bb);
data = Encoding.ASCII.GetString (bb);
try{
MemoryStream stream = new MemoryStream(bb);
JsonValue jsonCounters = JsonObject.Load(stream);
}
catch(Exception error){
Console.WriteLine ("ERROR: " + error.Message);
}
The problem I am having is that jsonCounters is always null. I understand that JSON.NET is a better library, but using it would require upgrading my account, and that's something I'm not really ready to do just yet.
EDIT:
I followed the link supplied by JamesMontemagno. I then wrote the following into my application:
ServerConnection.Receive (bb);
data = Encoding.ASCII.GetString (bb);
try{
JsonValue value = JsonValue.Parse(data);
JsonObject jsonCounters = value as JsonObject;
}
catch(Exception error){
Console.WriteLine ("ERROR: " + error.Message);
}
The only problem is that when I create the byte array that I receive on I do:
byte[] bb = new byte[1024];
and the issue is that when I receive and try to parse the JSON, the unused part of the byte array isn't dropped; it ends up as what looks like white space at the end of the JSON string, which causes JsonValue.Parse to fail with ERROR: extra characters in JSON input. At line 1, column 642. I tried data = data.Trim(), but that did a whole lot of nothing.

JamesMontemagno was right with his link. It showed me the correct usage of JsonObject. Once I was able to get the JSON to at least attempt to parse, I needed to remove the trailing white space (the unused zero bytes at the end of the buffer). I did that by using the following code to create a new byte array that would parse.
ServerConnection.Receive (rawRecieve);
// Walk back from the end of the buffer past the unused zero bytes.
int i = rawRecieve.Length - 1;
while (i >= 0 && rawRecieve [i] == 0) {
    --i;
}
// Copy only the bytes that were actually filled in.
byte[] cleanRecieve = new byte[i+1];
Array.Copy(rawRecieve, cleanRecieve, i+1);
I stole that bit of code from: Removing trailing nulls from byte array in C#
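An alternative that avoids scanning for zero bytes is to decode only the bytes that were actually received. The sketch below assumes ServerConnection is a System.Net.Sockets.Socket, whose Receive(byte[]) returns the number of bytes read; the helper name is made up for illustration.

```csharp
using System;
using System.Text;

public static class ReceiveHelper
{
    // Decodes only the first `count` bytes of the buffer, so the unused
    // zero bytes at the end never reach the JSON parser.
    public static string DecodeReceived(byte[] buffer, int count)
    {
        if (buffer == null)
            throw new ArgumentNullException(nameof(buffer));
        if (count < 0 || count > buffer.Length)
            throw new ArgumentOutOfRangeException(nameof(count));

        return Encoding.ASCII.GetString(buffer, 0, count);
    }
}
```

Under that assumption the call site would be `int count = ServerConnection.Receive(bb);` followed by `data = ReceiveHelper.DecodeReceived(bb, count);`. A single Receive call may not return a whole message, so a longer response could still need a read loop.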

Encoding detection for a string-data in a byte[] succeed and after that all string comparisons failed

How it is all setup:
I receive a byte[] which contains CSV data
I don't know the encoding (should be unicode / utf8)
I need to detect the encoding or fallback to a default (the text may contain umlauts, so the encoding is important)
I need to read the header line and compare it with defined strings
After a short search on how to get a string out of the byte[], I found How to convert byte[] to string? which suggested using something like
string result = System.Text.Encoding.UTF8.GetString(byteArray);
I now use this helper to detect the encoding and afterwards the Encoding.GetString method to read the string, like so:
string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);
But when I now try to compare values from this result string with static strings in my code, all comparisons fail!
// header is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++) {
switch (headers[i].Trim().ToLower()) {
case "number":
// do
break;
default:
throw new Exception();
}
}
// where (headers[i].Trim().ToLower()) => "number"
While this seems to be a problem with the encoding of both strings my question is:
How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?
Edit
The code supplied above was working as long as the string data came from a file that was saved this way:
string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;
while ((line = reader.ReadLine()) != null)
{
if (line.Length > 1)
{
tw.WriteLine(line);
}
}
tw.Close();
and afterwards read out with
File.ReadAllText()
This
A. Forces the file to be unicode (ANSI format kills all umlauts)
B. requires the written file be accessible
Now I only got the inputStream and tried what I posted above. And as I mentioned this worked before and the strings look identical. But they are not.
Note: If I use an ANSI-encoded file, which uses Encoding.Default, everything works fine.
Edit 2
While ANSI-encoded data works, the UTF-8 encoded data (Notepad++ shows only "UTF-8", not "UTF-8 w/o BOM") starts with char [0]: 65279
So where is my error? I assume System.Text.Encoding.UTF8.GetString(byteArray) is working the right way.

Yes, Encoding.GetString doesn't strip the BOM (see https://stackoverflow.com/a/11701560/613130). You could:
string result;
using (var memoryStream = new MemoryStream(byteArray))
using (var reader = new StreamReader(memoryStream))
{
    result = reader.ReadToEnd();
}
The StreamReader will autodetect the encoding (your encoding detector is a copy of the StreamReader.DetectEncoding())

The input is not a valid Base-64 string as it contains a non-base 64 character

I have a REST service that reads a file and sends it to another console application after converting it to Byte array and then to Base64 string. This part works, but when the same stream is received at the application, it gets manipulated and is no longer a valid Base64 string. Some junk characters are getting introduced into the stream.
The exception received when converting the stream back to Byte is
The input is not a valid Base-64 string as it contains a non-base 64
character, more than two padding characters, or a non-white space
character among the padding characters
At Service:
[WebGet(UriTemplate = "ReadFile/Convert", ResponseFormat = WebMessageFormat.Json)]
public string ExportToExcel()
{
string filetoexport = "D:\\SomeFile.xls";
byte[] data = File.ReadAllBytes(filetoexport);
var s = Convert.ToBase64String(data);
return s;
}
At Application:
var client = new RestClient("http://localhost:56877/User/");
var request = new RestRequest("ReadFile/Convert", RestSharp.Method.GET);
request.AddHeader("Accept", "application/Json");
request.AddHeader("Content-Type", "application/Json");
request.OnBeforeDeserialization = resp => {resp.ContentType = "application/Json";};
var result = client.Execute(request);
byte[] d = Convert.FromBase64String(result.Content);

Check if your image data contains some header information at the beginning:
imageCode = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAMgAAABkC...
This will cause the above error.
Just remove everything in front of and including the first comma, and you are good to go.
imageCode = "iVBORw0KGgoAAAANSUhEUgAAAMgAAABkC...

Very possibly it's getting converted to a modified Base64, where the + and / characters are changed to - and _. See http://en.wikipedia.org/wiki/Base64#Implementations_and_history
If that's the case, you need to change it back:
string converted = base64String.Replace('-', '+');
converted = converted.Replace('_', '/');
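If the string really is URL-safe Base64, the trailing = padding is often missing as well, and Convert.FromBase64String rejects input whose length is not a multiple of 4. The sketch below combines the two replacements above with re-padding; the class and method names are made up for illustration.

```csharp
using System;

public static class Base64Normalizer
{
    // Converts URL-safe Base64 ('-' and '_') back to the standard alphabet
    // ('+' and '/') and restores any missing '=' padding.
    public static string FromUrlSafe(string input)
    {
        if (input == null)
            throw new ArgumentNullException(nameof(input));

        string converted = input.Replace('-', '+').Replace('_', '/');

        int remainder = converted.Length % 4;
        if (remainder == 1)
            throw new FormatException("Length is not valid for Base64 data.");
        if (remainder != 0)
            converted += new string('=', 4 - remainder);

        return converted;
    }
}
```

The result can then be passed to Convert.FromBase64String as usual.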

We can remove unnecessary string input in front of the value.
string convert = hdnImage.Replace("data:image/png;base64,", String.Empty);
byte[] image64 = Convert.FromBase64String(convert);

Remove the unnecessary string through Regex
Regex regex = new Regex(@"^[\w/\:.-]+;base64,");
base64File = regex.Replace(base64File, string.Empty);

Since you're returning a string as JSON, that string will include the opening and closing quotes in the raw response. So your response should probably look like:
"abc123XYZ=="
or whatever...You can try confirming this with Fiddler.
My guess is that the result.Content is the raw string, including the quotes. If that's the case, then result.Content will need to be deserialized before you can use it.
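One way to do that deserialization is sketched below. It assumes System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package), which the original question does not use; the class and method names are made up for illustration. Parsing the content as a JSON string removes the surrounding quotes and also unescapes \/ back to /.

```csharp
using System;
using System.Text.Json;

public static class Base64JsonResponse
{
    // Treats the raw response body as a JSON string literal, unwraps it,
    // and then decodes the Base64 payload it contains.
    public static byte[] Decode(string rawContent)
    {
        string base64 = JsonSerializer.Deserialize<string>(rawContent);
        if (base64 == null)
            throw new FormatException("Response did not contain a JSON string.");

        return Convert.FromBase64String(base64);
    }
}
```

With the client code from the question, the call would be `byte[] d = Base64JsonResponse.Decode(result.Content);`.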

Just in case you don't know the type of the uploaded image, and you just need to remove its base64 header:
var imageParts = model.ImageAsString.Split(',').ToList<string>();
//Exclude the header from base64 by taking second element in List.
byte[] Image = Convert.FromBase64String(imageParts[1]);

Probably the string would be like this data:image/jpeg;base64,/9j/4QN8RXh...
First split for / and get the second token.
var StrAfterSlash = Face.Split('/')[1];
Then Split for ; and get the first token which will be the format. In my case it's jpeg.
var ImageFormat =StrAfterSlash.Split(';')[0];
Then remove the line data:image/jpeg;base64, for the collected format
CleanFaceData = Face.Replace($"data:image/{ImageFormat};base64,", string.Empty);

I arranged a similar context as you described and I faced the same error. I managed to get it working by removing the " from the beginning and the end of the content and by replacing \/ with /.
Here is the code snippet:
var result = client.Execute(request);
var response = result.Content
    .Substring(1, result.Content.Length - 2)
    .Replace(@"\/", "/");
byte[] d = Convert.FromBase64String(response);
As an alternative, you might consider using XML for the response format:
[WebGet(UriTemplate = "ReadFile/Convert", ResponseFormat = WebMessageFormat.Xml)]
public string ExportToExcel() { /* ... */ }
On the client side:
request.AddHeader("Accept", "application/xml");
request.AddHeader("Content-Type", "application/xml");
request.OnBeforeDeserialization = resp => { resp.ContentType = "application/xml"; };
var result = client.Execute(request);
var doc = new System.Xml.XmlDocument();
doc.LoadXml(result.Content);
var xml = doc.InnerText;
byte[] d = Convert.FromBase64String(xml);

var spl = item.Split('/')[1];
var format = spl.Split(';')[0];
string convert = item.Replace($"data:image/{format};base64,", String.Empty);

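Check whether the == padding is missing from the end of the string; if it is, append it. A Base64 string's length must be a multiple of 4, so add as many = characters as are needed to reach that.
// "........V/XeAeH/wALVWKtD8lz/AAAAABJRU5ErkJggg"
"........V/XeAeH/wALVWKtD8lz/AAAAABJRU5ErkJggg==" /* yes */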

As Alex Filipovici mentioned the issue was a wrong encoding. The file I read in was UTF-8-BOM and threw the above error on Convert.FromBase64String(). Changing to UTF-8 did work without problems.

And sometimes it starts with double quotes, most often when you call an API from .NET Core 2 to get a file:
string64 = string64.Replace(@"""", string.Empty);
byte[] bytes = Convert.FromBase64String(string64);

I got this error because a field was varbinary in the SQL Server table instead of varchar.

Strip the byte order mark from string in C#

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.
So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using
if (xml.StartsWith(ByteOrderMarkUtf8))
{
xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}
but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

I recently had issues with the .NET 4 upgrade, but until then the simple answer is
String.Trim()
removes the BOM up until .NET 3.5.
However, in .NET 4 you need to change it slightly:
String.Trim(new char[]{'\uFEFF'});
That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):
String.Trim(new char[]{'\uFEFF','\u200B'});
This you could also use to remove other unwanted characters.
Some further information is from
String.Trim Method:
The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:
private readonly string _byteOrderMarkUtf8 =
Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
public string GetXmlResponse(Uri resource)
{
string xml;
using (var client = new WebClient())
{
client.Encoding = Encoding.UTF8;
xml = client.DownloadString(resource);
}
if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
{
xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
}
return xml;
}
Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

This works as well
int index = xmlResponse.IndexOf('<');
if (index > 0)
{
xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

A quick and simple method to remove it directly from a string:
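private static string RemoveBom(string p)
{
    string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    // Use an ordinal comparison: a culture-sensitive StartsWith ignores the
    // BOM character and can match even when no BOM is present.
    if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
        p = p.Remove(0, BOMMarkUtf8.Length);
    return p.Replace("\0", "");
}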
How to use it:
string yourCleanString=RemoveBom(yourBOMString);

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.
Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:
var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);
It's that simple.
If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):
var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

I wrote the following post after coming across this issue.
Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.
Usage:
string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);
var xElem = XElement.Parse(feed); // now does not fail
/// <summary>
/// You can get this or test it originally with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// But no need, this way we have a constant. As these three bytes `[239, 187, 191]` (a BOM) evaluate to a single C# char.
/// </summary>
public const char BOMChar = (char)65279;
public static bool FixBOMIfNeeded(ref string str)
{
if (string.IsNullOrEmpty(str))
return false;
bool hasBom = str[0] == BOMChar;
if (hasBom)
str = str.Substring(1);
return hasBom;
}

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.
Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):
public static string GetUTF8String(byte[] data)
{
byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
if (data.StartsWith(utf8Preamble))
{
return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
}
else
{
return Encoding.UTF8.GetString(data);
}
}
Where StartsWith(byte[]) is the logical extension:
public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
    // Handle invalid/unexpected input (nulls, or a prefix longer than the array).
    if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
    {
        return false;
    }
    for (int i = 0; i < otherArray.Length; ++i)
    {
        if (thisArray[i] != otherArray[i])
        {
            return false;
        }
    }
    return true;
}

StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

Yet another generic variation to get rid of the UTF-8 BOM preamble:
var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);

Use a regex replace to filter out any other characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:
certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");
And there you go. Voila!! It worked for me.

I solved the issue with the following code
using System.IO;
using System.Xml.Linq;
void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    // Loading from a stream lets the XML parser handle the BOM itself.
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}

Encoding Conversion problem

I've got a little problem changing the encoding of a string. I read strings from a DB that are encoded using code page 850, and I have to prepare them to be suitable for an interoperable WCF service.
From the DB I read the characters \x10 and \x11 (triangular shapes) and I want to convert them to Unicode in order to prevent serialization/deserialization problems during the WCF call. (The chars \x10 and \x11 are not valid according to the XML spec, even though WCF serializes them.)
Now, I use the following code to convert the string encoding, but nothing happens. The result string is in fact identical to the original one.
I'm probably missing something...
Please help me!!!
Emanuele
static class UnicodeEncodingExtension
{
public static string Convert(this Encoding sourceEncoding, Encoding targetEncoding, string value)
{
string reEncodedString = null;
byte[] sourceBytes = sourceEncoding.GetBytes(value);
byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);
reEncodedString = sourceEncoding.GetString(targetBytes);
return reEncodedString;
}
}
class Program
{
private static Encoding Cp850Encoding = Encoding.GetEncoding(850);
private static Encoding UnicodeEncoding = Encoding.UTF8;
static void Main(string[] args)
{
string value;
string resultValue;
value = "\x10";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\x11";
resultValue = Cp850Encoding.Convert(UnicodeEncoding, value);
value = "\u25b6";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
value = "\u25c0";
resultValue = UnicodeEncoding.Convert(Cp850Encoding, value);
}
}

It seems you think there is a problem based on an incorrect understanding. But jmservera is correct - all strings in .NET are encoded internally as unicode.
You didn't say exactly what you want to accomplish. Are you experiencing a problem at the other end of the wire?
Just FYI, you can set the text encoding on a WCF binding with the textMessageEncoding element in the config file.

I suspect this line may be your culprit
reEncodedString = sourceEncoding.GetString(targetBytes);
which seems to take your target encoded string of bytes and asks your sourceEncoding to make a string out of them. I've not had a chance to verify it but I suspect the following might be better
reEncodedString = targetEncoding.GetString(targetBytes);

All the strings stored in string are in fact Unicode. Read: Strings in .Net and C# and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Edit: I suppose that you want the Convert function to automatically change \x11 to \u25c0, but the problem here is that \x11 is valid in almost any encoding, the differences usually start in character \x80, so the Convert function will maintain it even if you do that:
string reEncodedString = null;
byte[] unicodeBytes = UnicodeEncoding.Unicode.GetBytes(value);
byte[] sourceBytes = Encoding.Convert(Encoding.Unicode,
sourceEncoding, unicodeBytes);
You can see in unicode.org the mappings from CP850 to Unicode. So, for this conversion to happen you will have to change these characters manually.
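The manual replacement suggested above might look like the sketch below. The target code points \u25b6 and \u25c0 are taken from the question's own test values, not from a verified mapping table, and the class and method names are made up for illustration.

```csharp
using System;

public static class ControlCharMapper
{
    // Replaces the control characters U+0010 and U+0011 with the triangle
    // symbols used in the question (U+25B6 and U+25C0).
    public static string ReplaceArrows(string value)
    {
        if (value == null)
            return null;

        return value
            .Replace('\u0010', '\u25B6')
            .Replace('\u0011', '\u25C0');
    }
}
```

Because .NET strings are already Unicode, this is a plain character substitution and no Encoding.Convert call is involved.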

byte[] sourceBytes = Encoding.Default.GetBytes(value);
string result = Encoding.UTF8.GetString(sourceBytes);
This sequence is useful when downloading a Unicode file from a service (for example, an XML file that contains Persian characters).

You should try this:
byte[] sourceBytes = sourceEncoding.GetBytes(value);
var convertedString = Encoding.UTF8.GetString(sourceBytes);
