I'm writing a class which is used to work against a byte[] buffer. It contains methods like char Peek() and string ReadRestOfLine().
The problem is that I would like to add support for unicode and I don't really know how I should change those methods (they only support ASCII now).
How do I detect that the next bytes in the buffer are a Unicode sequence (UTF-8 or UTF-16)? And how do I convert them to a char?
Update
Yes, the class is a bit similar to the StreamReader, but with the difference that it avoids creating objects (like string, char[], etc.) until the entire wanted string has been found. It's used in a high-performance socket framework.
For instance: let's say that I want to write a proxy that will only check the URI in an HTTP request. If I were to use the StreamReader I would have to build a temporary char array each time a receive completes, just to see if a newline character has been received.
By using a class that works directly against the byte[] buffer that socket.ReceiveAsync uses, I just have to traverse the buffer in my parser to know if the next step can be completed. No temporary objects are created.
For most protocols ASCII is used in the header area and UTF-8 will not be a problem (the request body can be parsed using StreamReader). I'm just interested in how it can be solved without creating unnecessary objects.
I don't think you want to go there. There is a ton of stuff that can go wrong. First of all: what encoding are you using? Then, does the buffer contain the entire encoded string? Or does it start at some random position, possibly inside such a sequence?
Your classes sound a bit like a StreamReader for a MemoryStream. Maybe you can use those?
From the documentation:
Implements a TextReader that reads characters from a byte stream in a particular encoding.
If the point of your exercise is to figure out how to do this yourself... take a peek into how the library did it. I think you'll find the method StreamReader.Read() interesting:
Reads the next character from the input stream and advances the character position by one character.
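For illustration (a sketch, not part of the original answer), wrapping the byte[] in a MemoryStream and letting StreamReader handle the decoding might look like this:
byte[] buffer = Encoding.UTF8.GetBytes("GET /index.html HTTP/1.1\r\n");
using (var reader = new StreamReader(new MemoryStream(buffer), Encoding.UTF8))
{
    int next;
    while ((next = reader.Read()) != -1)   // decodes one char at a time
    {
        char c = (char)next;
        // inspect c here, e.g. look for '\n'
    }
}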
There is a one-to-one correspondence between bytes and ASCII characters, making it easy to treat bytes as characters. Modifying your code to handle the various Unicode encodings may not be easy. However, to answer part of your question:
How do I detect that the next bytes in the buffer are a Unicode sequence (UTF-8 or UTF-16)? And how do I convert them to a char?
You can use the System.Text.Encoding class: the predefined encoding objects Encoding.Unicode and Encoding.UTF8 provide methods like GetCharCount, GetChars and GetString.
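A hedged illustration of that (not from the original answer; buffer and bytesReceived are assumed names): Encoding.GetDecoder() is handy for a socket buffer, because a Decoder keeps state between calls, so a multi-byte sequence split across two receives is decoded correctly.
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[buffer.Length];                        // worst case: one char per byte
int charCount = decoder.GetChars(buffer, 0, bytesReceived, chars, 0);
// charCount chars are now available; the trailing bytes of an incomplete
// sequence stay buffered inside the decoder until the next call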
I've created a BufferSlice class which wraps the byte[] buffer and makes sure that only the assigned slice is used. I've also created a custom reader to parse the buffer.
UTF-8 turned out not to be a problem, since I only parse the buffer for characters that are not multi-byte (space, minus, semicolon, etc.). I then use Encoding.GetString from the last delimiter to the current one to get a proper string back.
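A minimal sketch of that approach (the variable names and delimiter set here are assumptions, not the actual BufferSlice code): in UTF-8 every byte of a multi-byte sequence has its high bit set, so scanning for plain ASCII delimiters can never split a character.
// Scan the raw buffer for an ASCII delimiter, then decode only the finished token.
int start = position;
while (position < count
       && buffer[position] != (byte)' '
       && buffer[position] != (byte)';'
       && buffer[position] != (byte)'\n')
{
    position++;
}
if (position < count)
{
    // Safe for UTF-8: continuation bytes are always >= 0x80,
    // so they can never be mistaken for an ASCII delimiter.
    string token = Encoding.UTF8.GetString(buffer, start, position - start);
}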
Related
I have an application that reads binary data from a database. Each byte array retrieved represents a string. The strings, though, have all come from different encodings (most commonly ASCII, UTF-8 BOM, and UTF-16 LE, but there are others). In my own application, I'm trying to convert the byte array back to a string, but the encoding that was used to go from string to bytes is not stored with the bytes. Is it possible in C# to determine or infer the encoding used from the byte array?
The use case is simplified below. Assume the byte array is always a string. Also assume the string can use any encoding.
byte[] bytes = Convert.FromBase64String(stringAsBytesAsBase64);
string originalString = Encoding.???.GetString(bytes);
For text that is XML, the XML specification gives requirements and how to determine the encoding.
In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The
Text Declaration) containing an encoding declaration:
…
In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8.
—https://www.w3.org/TR/xml/#charencoding
It seems that the storage design was to drop any "information provided by an external transport protocol". It is possible that what was stored meets the specification. You can inspect your data.
If the data is complete, just let your XML processing do the job:
byte[] bytes = Convert.FromBase64String(stringAsBytesAsBase64);
using (var stream = new MemoryStream(bytes))
{
var doc = XDocument.Load(stream);
}
If you do need the XML back as text with a known encoding, you can then serialize it using whichever encoding you need.
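For example (a sketch, not from the original answer), XmlWriterSettings lets you pick the output encoding explicitly:
var settings = new XmlWriterSettings { Encoding = Encoding.UTF8 };
using (var output = new MemoryStream())
using (var writer = XmlWriter.Create(output, settings))
{
    doc.Save(writer);
    writer.Flush();
    byte[] utf8Bytes = output.ToArray();   // the XML re-encoded as UTF-8
}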
Someone downvoted this. Perhaps because it didn't start out with a clear answer:
Is it possible in C# to determine or infer the encoding used from the byte array?
No.
Below is the best you can do and you'll see why it's problematic:
You can start with the list of known encodings from Encoding.GetEncodings() and eliminate possibilities. In the end, you will have many known possibilities, many known impossibilities and potentially unknown possibilities (for encodings that aren't supported in .NET, if any). That is all as far as hard facts go.
You could then apply heuristics or some knowledge of expected content to narrow the list further. And if the results of applying each of the remaining encodings are all the same, then you've very probably got the correct text even if you didn't identify the original encoding.
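One hedged sketch of that elimination step (an illustration, not from the original answer): re-create each candidate with an exception-throwing decoder fallback and drop the encodings that cannot decode the bytes at all.
var possibilities = new List<Encoding>();
foreach (EncodingInfo info in Encoding.GetEncodings())
{
    Encoding strict = Encoding.GetEncoding(info.CodePage,
        EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    try
    {
        strict.GetString(bytes);        // throws if the bytes are invalid for this encoding
        possibilities.Add(strict);      // still a known possibility
    }
    catch (DecoderFallbackException)
    {
        // known impossibility for these particular bytes
    }
}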
Below is what the text looks like when viewed in Notepad++.
I need to get the IndexOf for that piece of the string, for use in the code below. And I can't figure out how to use the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"'s represent the strange characters.
All these characters have corresponding ASCII codes; you can insert them in a string by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
\xXXXX specifies a single character, where XXXX is the hexadecimal code corresponding to that character.
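Worth noting (an addition, not part of the original answer): C#'s \x escape accepts one to four hex digits and greedily consumes any hex digits that follow, which is why the shorter form above splits the string before "DB" (otherwise \x00 would swallow the D and B). The fixed-length \u escape avoids that ambiguity:
int start = text.IndexOf("App\u0000\u0001\u0000\u0003\u0000\u0000\u0000DB INFO");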
Notepad++ simply wants to make it a bit more convenient by rendering these characters by printing the abbreviation in a "bubble". But that's just rendering.
These characters originated as directives for printers and other devices: for instance, instructing a printer to move to the next line or to stop the print job. They are still used today; some terminals use them to communicate color changes, etc. The best known is \n, or \x000A, which starts a new line. For text, they are thus characters that specify how the text should be handled, somewhat like modern HTML (although the equivalence is limited). \n is only a new line because there is a consensus about that; if someone defines their own encoding, they can invent a new system.
Echoing @JonSkeet's warning: when you read a file into a string, the file's bytes are decoded according to a character encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
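As an illustration of the exception-throwing option (a sketch, not from the original answer; UTF-8 and the path variable are assumptions):
// throwOnInvalidBytes: true makes the decoder throw DecoderFallbackException
// instead of silently substituting U+FFFD for invalid byte sequences.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
string text = strictUtf8.GetString(File.ReadAllBytes(path));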
If you insist on reading the file as text, I suggest using the 437 encoding because it has 256 characters, one for every byte value, no restrictions on byte sequences, and each 437 character is also in Unicode. The bytes that represent text will likely decode to the same characters you want to search for as strings, but you have to check by comparing 437 and Unicode in this table.
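A hedged sketch of the code-page-437 route (on .NET Core and later, Encoding.GetEncoding(437) requires registering the System.Text.Encoding.CodePages provider first; path is an assumed variable):
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed outside .NET Framework
string raw = File.ReadAllText(path, Encoding.GetEncoding(437));
int start = raw.IndexOf("DB INFO");   // search the decoded text as usual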
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.
I am trying to pass a block of text to a system I do not own, which will pass the data to a system I do own.
Unfortunately, when the first system talks to the second system, it uses a TSV format. Thus, I wonder if there's a convenient way to take my block of text and encode it in an ASCII format without any kind of whitespace (mostly newlines and tabs, of course), and then later decode it.
When I'm doing the encoding, I'm working in C#. When I'm doing the decoding, I'm working in JavaScript.
I realize that I can write my own code to essentially "manually" perform the encoding and decoding by creating my own scheme, but I wonder if there already exists one for this purpose.
One option which would blow up the size of your data but be really simple to implement: UTF-8 encode all the text, base64-encode that:
byte[] utf8 = Encoding.UTF8.GetBytes(text);
string base64 = Convert.ToBase64String(utf8);
That won't contain any whitespace, and can be converted back. It'll be significantly larger than the original string, and unreadable... but it'll work.
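For completeness (a sketch, not part of the original answer), the reverse in C# is just the mirror image; in your setup the decode would actually happen on the JavaScript side:
string roundTripped = Encoding.UTF8.GetString(Convert.FromBase64String(base64));
// roundTripped is identical to the original text, newlines and tabs included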
You could try using HttpUtility.UrlEncode(string) or Uri.EscapeDataString(string), which would percent-encode any whitespace in the passed in text (as well as other special characters, which means the encoded text may be much larger than the original).
On the JavaScript side you could then use decodeURIComponent(string) to decode it back to the original text.
I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128-255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, when reading a string, it could come up as "S?meStr?n?" if the string contained the extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know in java you could define the default character set from command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, then you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.
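The same applies when reading with StreamReader (a sketch, not from the original answer):
using (var reader = new StreamReader(someFilePath, Encoding.GetEncoding(1252)))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // characters in the 128-255 range now decode to the intended characters
    }
}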
--Edit with more bgnd information--
A (black box) COM object returns me a string.
A 2nd COM object expects this same string as byte[] as input and returns a byte[] with the processed data.
This will be fed to a browser as downloadable, non-human-readable file that will be loaded in a client side stand-alone application.
So I get the string inputString from the 1st COM and convert it into a byte[] as follows:
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream();
bf.Serialize(ms, inputString);
byte[] obj = ms.ToArray();
I feed it to the 2nd COM and read it back out.
The result gets written to the browser.
Response.ContentType = "application/octet-stream";
Response.AddHeader("content-disposition", "attachment; filename=\"test.dat\"");
Response.BinaryWrite(obj);
The error occurs in the 2nd COM because the formatting is incorrect.
I went to check the original string and that was perfectly fine. I then pumped the result from the 1st COM directly to the browser and watched what came out. It appeared that somewhere along the road extra unreadable characters were added. What are these characters, what are they used for, and how can I prevent them from making my 2nd COM grind to a halt?
The unreadable characters are of this kind:
NUL/SOH/NUL/NUL/NUL/FF/FF/FF/FF/SOH/NUL/NUL/NUL etc
Any ideas?
--Answer--
Use
System.Text.Encoding.UTF8.GetBytes(theString)
rather than
BinaryFormatter.Serialize()
BinaryFormatter is almost certainly not what you want to use.
If you just need to convert a string to bytes, use Encoding.GetBytes with a suitable encoding. UTF-8 is usually correct, but check whether the document specifies an encoding.
Okay, with your updated information: your 2nd COM object expects binary data, but you want to create that binary data from a string. Does it treat it as plain binary data?
My guess is that something is going to reverse this process on the client side. If it's eventually going to want to reconstruct the data as a string, you need to pick the right encoding to use, and use it on both sides. UTF-8 is a good bet in most cases, but if the client side is just going to write out the data to a file and use it as an XML file, you need to choose the appropriate encoding based on the XML.
You said before that the first few characters of the string were just "<foo>" (or something similar) - does that mean there's no XML declaration? If not, choose UTF-8. Otherwise, you should look at the XML declaration and use that to determine your encoding (again defaulting to UTF-8 if the declaration doesn't specify the encoding).
Once you've got the right encoding, use Encoding.GetBytes as mentioned in earlier answers.
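A minimal sketch of that replacement (assuming the XML text is in inputString, as in the question):
// Encode the raw XML text directly; no BinaryFormatter header is prepended.
byte[] payload = Encoding.UTF8.GetBytes(inputString);
Response.ContentType = "application/octet-stream";
Response.AddHeader("content-disposition", "attachment; filename=\"test.dat\"");
Response.BinaryWrite(payload);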
I think you are missing the point of BinarySerialization.
For starters, what Type is formulaXml?
Binary serialization will compact that into a machine-represented value, NOT XML! The content will look like:
ÿÿÿÿ AIronScheme, Version=1.0.0.0, Culture=neutral, Public
Perhaps you should be looking at the XML serializer instead.
Update:
You want to write out some XML as a 'content-disposition' stream.
To do this, do something like:
byte[] buffer = Encoding.Default.GetBytes(formulaXml);
Response.BinaryWrite(buffer);
That should work like you hoped (I think).
The BinaryFormatter's job is to convert the objects into some opaque serialisation format that can only be understood by another BinaryFormatter at the other end.
(Just about to mention Encoding.GetBytes as well, but Jon beat me to it.)
You'll probably want to use System.Text.Encoding.UTF8.GetBytes().
Is the crap in the beginning two bytes long?
This could be the byte order mark of a Unicode encoded string.
http://en.wikipedia.org/wiki/Byte-order_mark
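A hedged sketch of checking for a BOM at the start of the buffer (the byte values are the standard BOM signatures; the helper itself is an illustration, not from the original answer):
// Returns the encoding implied by a leading BOM, or null if none is present.
static Encoding DetectBom(byte[] b)
{
    if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Encoding.UTF8;
    if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Encoding.Unicode;           // UTF-16 LE
    if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Encoding.BigEndianUnicode;  // UTF-16 BE
    return null;
}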