Converting string from MemoryStream to byte[] contains leading crap - C#

--Edit with more background information--
A (black box) COM object returns me a string.
A 2nd COM object expects this same string as a byte[] as input and returns a byte[] with the processed data.
This will be fed to a browser as a downloadable, non-human-readable file that will be loaded in a client-side stand-alone application.
So I get the string inputString from the 1st COM object and convert it into a byte[] as follows:
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream();
bf.Serialize(ms, inputString);
byte[] obj = ms.ToArray();
I feed it to the 2nd COM object and read the result back out.
The result gets written to the browser.
Response.ContentType = "application/octet-stream";
Response.AddHeader("content-disposition", "attachment; filename=\"test.dat\"");
Response.BinaryWrite(obj);
The error occurs in the 2nd COM object because the formatting is incorrect.
I went to check the original string, and that was perfectly fine. I then pumped the result from the 1st COM object directly to the browser and watched what came out. It appeared that somewhere along the road extra unreadable characters were added. What are these characters, what are they used for, and how can I prevent them from making my 2nd COM object grind to a halt?
The unreadable characters are of this kind:
NUL/SOH/NUL/NUL/NUL/FF/FF/FF/FF/SOH/NUL/NUL/NUL etc
Any ideas?
--Answer--
Use
System.Text.Encoding.UTF8.GetBytes(theString)
rather than
BinaryFormatter.Serialize()
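A minimal sketch of the fix (assuming inputString still holds the text from the 1st COM object):

// Encode the text directly; unlike BinaryFormatter, this adds no serialization header.
byte[] obj = System.Text.Encoding.UTF8.GetBytes(inputString);
// ... feed obj to the 2nd COM object, then write the processed result:
Response.BinaryWrite(obj);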

BinaryFormatter is almost certainly not what you want to use.
If you just need to convert a string to bytes, use Encoding.GetBytes for a suitable encoding, of course. UTF-8 is usually correct, but check whether the document specifies an encoding.

Okay, with your updated information: your 2nd COM object expects binary data, but you want to create that binary data from a string. Does it treat it as plain binary data?
My guess is that something is going to reverse this process on the client side. If it's eventually going to want to reconstruct the data as a string, you need to pick the right encoding to use, and use it on both sides. UTF-8 is a good bet in most cases, but if the client side is just going to write out the data to a file and use it as an XML file, you need to choose the appropriate encoding based on the XML.
You said before that the first few characters of the string were just "<foo>" (or something similar) - does that mean there's no XML declaration? If so, choose UTF-8. Otherwise, you should look at the XML declaration and use that to determine your encoding (again defaulting to UTF-8 if the declaration doesn't specify the encoding).
Once you've got the right encoding, use Encoding.GetBytes as mentioned in earlier answers.
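For illustration, a rough sketch of that decision (the helper name is made up, and sniffing the declaration with a regex is an assumption, not a full XML parser):

using System.Text;
using System.Text.RegularExpressions;

// Pick the encoding named in the XML declaration, defaulting to UTF-8.
static Encoding GuessXmlEncoding(string xml)
{
    Match m = Regex.Match(xml, @"^<\?xml[^>]*encoding\s*=\s*[""']([^""']+)[""']");
    return m.Success ? Encoding.GetEncoding(m.Groups[1].Value) : Encoding.UTF8;
}

// Usage: byte[] bytes = GuessXmlEncoding(formulaXml).GetBytes(formulaXml);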

I think you are missing the point of BinarySerialization.
For starters, what Type is formulaXml?
Binary serialization will compact that into a machine-represented value, NOT XML! The content will look like:
ÿÿÿÿ AIronScheme, Version=1.0.0.0, Culture=neutral, Public
Perhaps you should be looking at the XML serializer instead.
Update:
You want to write out some XML as a 'content-disposition' stream.
To do this, do something like:
byte[] buffer = Encoding.Default.GetBytes(formulaXml);
Response.BinaryWrite(buffer);
That should work like you hoped (I think).

The BinaryFormatter's job is to convert the objects into some opaque serialisation format that can only be understood by another BinaryFormatter at the other end.
(Just about to mention Encoding.GetBytes as well, but Jon beat me to it.)
You'll probably want to use System.Text.Encoding.UTF8.GetBytes().

Is the crap at the beginning two bytes long?
This could be the byte order mark of a Unicode encoded string.
http://en.wikipedia.org/wiki/Byte-order_mark
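If you want to check, here is a small sketch that looks for the most common BOMs at the start of a buffer (byte values taken from the Wikipedia page above):

// Returns the BOM-indicated encoding name, or null if no BOM is present.
static string DetectBom(byte[] bytes)
{
    if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return "UTF-8";
    if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return "UTF-16 LE";
    if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return "UTF-16 BE";
    return null;
}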

Related

I have many byte arrays; each is a string. How do I find the encoding each uses?

I have an application that reads binary data from a database. Each byte array retrieved represents a string. The strings, though, have all come from different encodings (most commonly ASCII, UTF-8 BOM, and UTF-16 LE, but there are others). In my own application, I'm trying to convert the byte array back to a string, but the encoding that was used to go from string to bytes is not stored with the bytes. Is it possible in C# to determine or infer the encoding used from the byte array?
The use case is simplified below. Assume the byte array is always a string. Also assume the string can use any encoding.
byte[] bytes = Convert.FromBase64String(stringAsBytesAsBase64);
string originalString = Encoding.???.GetString(bytes);
For text that is XML, the XML specification gives requirements and how to determine the encoding.
In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The
Text Declaration) containing an encoding declaration:
…
In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8.
—https://www.w3.org/TR/xml/#charencoding
It seems that the storage design was to drop any "information provided by an external transport protocol". It is possible that what was stored meets the specification. You can inspect your data.
If the data is complete, just let your XML processing do the job:
byte[] bytes = Convert.FromBase64String(stringAsBytesAsBase64);
using (var stream = new MemoryStream(bytes))
{
var doc = XDocument.Load(stream);
}
If you do need the XML back as text with a known encoding, you can then serialize it using whichever encoding you need.
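Continuing from the block above, a minimal sketch of serializing it back out with an explicit encoding (the file name is a placeholder):

var settings = new System.Xml.XmlWriterSettings { Encoding = System.Text.Encoding.UTF8 };
using (var writer = System.Xml.XmlWriter.Create("output.xml", settings))
{
    doc.Save(writer); // the writer's encoding wins over the original one
}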
Someone downvoted this. Perhaps because it didn't start out with a clear answer:
Is it possible in C# to determine or infer the encoding used from the byte array?
No.
Below is the best you can do and you'll see why it's problematic:
You can start with the list of known encodings from Encoding.GetEncodings() and eliminate possibilities. In the end, you will have many known possibilities, many known impossibilities, and potentially unknown possibilities (for encodings that aren't supported in .NET, if any). That is all as far as hard facts go.
You could then apply heuristics or some knowledge of expected content to narrow the list further. And if the results of applying each of the remaining encodings are all the same, then you've very probably got the correct text even if you didn't identify the original encoding.
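A sketch of that elimination step (assuming bytes holds the array; an encoding that throws on these bytes could not have produced them):

using System.Collections.Generic;
using System.Text;

var possible = new List<Encoding>();
foreach (EncodingInfo info in Encoding.GetEncodings())
{
    // Use exception fallbacks so invalid input throws instead of substituting '?'.
    Encoding candidate = Encoding.GetEncoding(
        info.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    try
    {
        candidate.GetString(bytes);
        possible.Add(candidate); // a candidate, not a confirmed match
    }
    catch (DecoderFallbackException)
    {
        // bytes are invalid in this encoding; eliminate it
    }
}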

Converting byte[] to string while keeping the original byte format

I have a large amount of data which consists of tables, fonts, bold text, sizes, etc. That data is stored as byte[] in a database.
When I retrieve that data I need to convert the byte[] into a string, because I need to do some find & replace on that string. But after I convert the string back into a byte[], I lose the original data structure, which means I can no longer see the tables, fonts, bold text, etc. properly. So how can I find and replace in a byte[] by converting it to a string, while keeping the data in its original format?
The short answer is: don't. Figure out the format of the data and see what you can do to do the manipulation. If the data is actually text, just stored as byte[], your approach would work, provided you encode the string correctly (i.e. if your DB expects UTF-8, use UTF-8 encoding; if it's windows-1251, use that).
If you have a structure where a part of it is a string, what you're doing can't really work well. First, you probably want to modify just the relevant parts of the field. On MS SQL, you have handy functions for that. But even then, you should know what's actually stored there, not just assume that a string replace will magically work.
Now, a hack could be to use an explicit encoding that doesn't break the non-string data. That would be some single-byte encoding that doesn't do anything fancy. This is OK as long as you use the same encoding while reading the text data - however, if you use any variant of unicode, you're out of luck; due to features like string normalization, you can't really guarantee that what comes in comes out the same way, per-byte. It's generally a bad practice anyway.
Don't forget that it's quite possible the string you are looking for is actually somewhere outside of the text fields - even by pure chance, it can happen, and certain practices make that even more likely.
Again: figure out the data format inside that data field - then you can decide how to do what you want.
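If you do go the single-byte-encoding route, a minimal sketch (the blob and search strings are placeholders; ISO-8859-1 maps every byte value 0-255 to exactly one char, so it round-trips losslessly):

using System.Text;

Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

string text = latin1.GetString(blobFromDb);            // bytes -> string, 1:1
string patched = text.Replace("oldValue", "newValue"); // the find & replace
byte[] back = latin1.GetBytes(patched);                // string -> bytes, 1:1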
Try this:
string result = System.Text.Encoding.UTF8.GetString(byteArray);
To make a byte[] into a string:
byte[] byteArray = new byte[10]; // put your byte array here
string stringTemp;

public void ByteToString()
{
    // BitConverter yields "0A-FF-..."; strip the dashes to get plain hex.
    stringTemp = BitConverter.ToString(byteArray).Replace("-", "");
}
And your data is still in byteArray. :)
If the byte array contains binary data and is not a string, try converting it to base64:
Convert.ToBase64String(yourByteArray);
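A quick round trip of that idea (yourByteArray as above):

string base64 = Convert.ToBase64String(yourByteArray);
byte[] original = Convert.FromBase64String(base64); // byte-for-byte identical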

C# issue with reading XML containing chars of different encodings

I faced a problem with reading XML. A solution was found, but there are still some questions. The incorrect XML file is encoded in UTF-8 and has the appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read the XML file for validating its content:
var xDoc = XDocument.Load(taxFile);
It raises an exception for the specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise an exception for the incorrect file. But the 'é' character is loaded as �. My first question is "why does it work?".
Another point: using XmlReader doesn't raise an exception until the node with 'é' is loaded.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workaround with StreamReader helps. The same question applies.
It seems like this fix is not good enough, because one day :) XML encoded in another format may appear, and it could be processed in the wrong way. BUT I've tried to process a UTF-16 formatted XML file and it worked fine (with the reader configured for UTF-8).
The final question is whether there are any options for XDocument/XmlReader to ignore character encoding problems, or something like that.
Looking forward to your replies. Thanks in advance.
The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
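For instance, a minimal sketch of a strict reader (the two constructor arguments are the standard UTF8Encoding ones; taxFile is the path from the question):

// false = don't emit a BOM when writing; true = throw on invalid bytes when reading.
var strictUtf8 = new UTF8Encoding(false, true);
using (var reader = new StreamReader(taxFile, strictUtf8))
{
    var xDoc = XDocument.Load(reader); // now fails loudly on the stray 'é' byte
}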
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.

byte[] buffer handling in C#

I'm writing a class which is used to work against a byte[] buffer. It contains methods like char Peek() and string ReadRestOfLine().
The problem is that I would like to add support for Unicode, and I don't really know how I should change those methods (they only support ASCII now).
How do I detect that the next bytes in the buffer are a Unicode sequence (UTF-8 or UTF-16)? And how do I convert them to a char?
Update
Yes, the class is a bit similar to StreamReader, but with the difference that it will avoid creating objects (like string, char[], etc.) until the entire wanted string has been found. It's used in a high-performance socket framework.
For instance: let's say that I want to write a proxy that will only check the URI in an HTTP request. If I were to use StreamReader I would have to build a temp char array each time a receive completed, just to see if a newline character had been received.
By using a class that works directly against the byte[] buffer that socket.ReceiveAsync uses, I just have to traverse the buffer in my parser to know if the next step can be completed. No temporary objects are created.
For most protocols ASCII is used in the header area, and UTF-8 will not be a problem (the request body can be parsed using StreamReader). I'm just interested in how it can be solved without creating unnecessary objects.
I don't think you want to go there. There are tons of things that can go wrong. First of all: what encoding are you using? Then, does the buffer contain the entire encoded string? Or does it start at some random position, possibly inside such a sequence?
Your classes sound a bit like a StreamReader for a MemoryStream. Maybe you can use those?
From the documentation:
Implements a TextReader that reads characters from a byte stream in a particular encoding.
If the point of your exercise is to figure out how to do this yourself... take a peek into how the library did it. I think you'll find the method StreamReader.Read() interesting:
Reads the next character from the input stream and advances the character position by one character.
There is a one-to-one correspondence between bytes and ASCII characters, making it easy to treat bytes as characters. Modifying your code to handle various encodings of Unicode may not be easy. However, to answer part of your question:
How do I detect that the next bytes in the buffer are a Unicode sequence (UTF-8 or UTF-16)? And how do I convert them to a char?
You can use the System.Text.Encoding class. You can use the predefined encoding objects Encoding.Unicode and Encoding.UTF8 and use methods like GetCharCount, GetChars and GetString.
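For a buffer that arrives in chunks, the stateful Decoder variant is probably what you want, since it remembers a multi-byte sequence that is split across two receives (buffer and count are assumed to come from your socket code):

Decoder decoder = Encoding.UTF8.GetDecoder();

// Call per received chunk; the decoder carries partial sequences over to the next call.
char[] chars = new char[decoder.GetCharCount(buffer, 0, count)];
int n = decoder.GetChars(buffer, 0, count, chars, 0);
string piece = new string(chars, 0, n);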
I've created a BufferSlice class which wraps the byte[] buffer and makes sure that only the assigned slice is used. I've also created a custom reader to parse the buffer.
UTF-8 turned out not to be a problem, since I only parse the buffer for characters that are never multi-byte (space, minus, semicolon, etc.). I then use Encoding.GetString from the last delimiter to the current one to get a proper string back.

How do I safely create an XPathNavigator against a Stream in C#?

Given a Stream as input, how do I safely create an XPathNavigator against an XML data source?
The XML data source:
May possibly contain invalid hexadecimal characters that need to be removed.
May contain characters that do not match the declared encoding of the document.
As an example, some XML data sources in the cloud will have a declared encoding of utf-8, but the actual encoding is windows-1252 or ISO 8859-1, which can cause an invalid character exception to be thrown when creating an XmlReader against the Stream.
From the StreamReader.CurrentEncoding property documentation: "The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method." This seems to indicate that CurrentEncoding can be checked after the first read, but are we stuck storing this encoding when we need to write out the XML data to a Stream?
I am hoping to find a best practice for safely creating an XPathNavigator/IXPathNavigable instance against an XML data source that will gracefully handle encoding and invalid character issues (in C#, preferably).
I had a similar issue when some XML fragments were imported into a CRM system using the wrong encoding (there was no encoding stored along with the XML fragments).
In a loop I created a wrapper stream using the current encoding from a list. The encoding was constructed using the DecoderExceptionFallback and EncoderExceptionFallback options (as mentioned by @Doug). If a DecoderFallbackException was thrown during processing, the original stream was reset and the next-most-likely encoding was used.
Our encoding list was something like UTF-8, Windows-1252, GB-2312 and US-ASCII. If you fell off the end of the list then the stream was really bad and was rejected/ignored/etc.
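A minimal sketch of that loop (the encoding names and a seekable stream are assumptions, not the original code):

string text = null;
foreach (string name in new[] { "utf-8", "windows-1252", "gb2312", "us-ascii" })
{
    Encoding enc = Encoding.GetEncoding(
        name, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    stream.Position = 0; // rewind before each attempt
    try
    {
        using (var reader = new StreamReader(stream, enc, false, 4096, leaveOpen: true))
        {
            text = reader.ReadToEnd();
        }
        break; // decoded cleanly; accept this encoding as the best guess
    }
    catch (DecoderFallbackException)
    {
        // invalid bytes for this encoding; try the next-most-likely one
    }
}
// if text is still null here, the stream was really bad and gets rejected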
EDIT:
I whipped up a quick sample and basic test files (source here). The code doesn't have any heuristics to choose between code pages that both match the same set of bytes, so a Windows-1252 file may be detected as GB2312, and vice-versa, depending on file content, and encoding preference ordering.
It's possible to use the DecoderFallback class (and a few related classes) to deal with bad characters, either by skipping them or by doing something else (restarting with a new encoding?).
When using an XmlTextReader or something similar, the reader itself will figure out the encoding declared in the XML file.
