Calculating length of string that contains Unicode characters - c#

We have been struggling to calculate the length of a string that contains Unicode characters (e.g. Word characters such as ’) that get pasted into our system's text areas and then saved. When we return that saved string and need to calculate the Content-Length of our response, the normal C# string.Length does not give the correct length, since a Unicode character can take up more than one byte.
We have tried using the System.Globalization.StringInfo class to count the characters, but to no avail: it still comes up short of the correct length. For instance, the last closing curly bracket of the JSON response object gets cut off by the browser because the length is too short.
If someone can shed any light on something they have used that works, it would be greatly appreciated. What we have tried so far:
Response.AddHeader("content-length", content.GetType() == typeof(string)
    ? new System.Globalization.StringInfo(content.ToString()).LengthInTextElements.ToString()
    : ((byte[])content).Length.ToString());

Given an arbitrary block of bytes that you know represents UTF-8 encoded text, the only way to know the actual character count is to decode the text, e.g. by passing it to Encoding.UTF8.GetString(). Then you just look at the length of the string returned.
That said, the Content-Length field of an HTTP response is supposed to indicate the length of the response in bytes. If you must set the length yourself, you should just use the total byte count. But if you are using e.g. HttpResponse I would expect this field to be set automatically on your behalf.
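As a minimal sketch of the byte-count approach, assuming the response body is sent as UTF-8 (the class name and the sample JSON are made up for illustration), Encoding.UTF8.GetByteCount gives the number to put in Content-Length. It also shows why string.Length and StringInfo both come up short: they count UTF-16 code units and text elements, not bytes.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the original question's code.
public static class ContentLengthSketch
{
    // Number of bytes the string occupies once encoded as UTF-8.
    public static int Utf8ByteLength(string content)
    {
        return Encoding.UTF8.GetByteCount(content);
    }

    public static void Main()
    {
        // Sample body containing U+2019, the Word-style apostrophe.
        string json = "{\"note\":\"it\u2019s\"}";

        Console.WriteLine(json.Length);                               // 15 UTF-16 code units
        Console.WriteLine(new StringInfo(json).LengthInTextElements); // 15 text elements
        Console.WriteLine(Utf8ByteLength(json));                      // 17 bytes on the wire
    }
}
```

The two-byte gap comes from the apostrophe alone, which takes three bytes in UTF-8, and that is enough to cut off the closing bracket.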

Related

How do I use C#'s IndexOf when strange characters are in the string

Below is what the text looks like when viewed in NotePad++.
I need to get the IndexOf for that piece of the string, for use in the code below, and I can't figure out how to put the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"'s represent the strange characters.
All these characters have corresponding ASCII codes, so you can insert them in a string by escaping them.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
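\xXXXX specifies one character, where XXXX is the hexadecimal number corresponding to the character. The escape takes between one and four hex digits, which is why the shorter form ends the literal before "DB INFO": written as one literal, \x00DB would be read as a single character.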
Notepad++ simply wants to make it a bit more convenient by rendering these characters by printing the abbreviation in a "bubble". But that's just rendering.
These characters originated as directives for printers (and other media). For instance, you needed to instruct a printer to move to the next line or to stop the print job. They are still used today: some terminals use them to communicate color changes, etc. The best known is \n or \x000A, which starts a new line. Within text they are thus characters that specify how to handle the text, a bit like modern HTML (although the equivalence is limited). \n is only a new line because there is a consensus about that. Someone who defines their own encoding can invent a new system.
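A rough sketch of the search itself, assuming the control characters really are the ones listed above (NUL, SOH, NUL, ETX, NUL, NUL, NUL) and that the file has already been read into a string. The class and method names are invented for illustration. It uses \u escapes, which are always exactly four hex digits, so nothing that follows can be absorbed into the escape, and an ordinal comparison so that culture rules cannot skip over the control characters.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class ControlCharSearch
{
    // Fixed-width \u escapes: the "DB" after the last \u0000 stays literal text.
    public const string Marker = "App\u0000\u0001\u0000\u0003\u0000\u0000\u0000DB INFO";

    public static int FindMarker(string text)
    {
        // Ordinal compares raw char values, one by one.
        return text.IndexOf(Marker, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string text = "header " + Marker + " trailer";
        Console.WriteLine(FindMarker(text)); // 7
    }
}
```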
Echoing @JonSkeet's warning, when you read a file into a string, the file's bytes are decoded according to a character set encoding. The decoder has to do something with byte values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
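One way to get the throwing behavior, sketched here on the assumption that the file is supposed to be UTF-8 (the class and method names are made up), is to construct UTF8Encoding with throwOnInvalidBytes set to true. The bytes are kept as bytes until decoding succeeds, so nothing is silently replaced.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class StrictDecoding
{
    // throwOnInvalidBytes: true makes GetString throw instead of substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true and the decoded text if the bytes are valid UTF-8, otherwise false.
    public static bool TryDecode(byte[] bytes, out string text)
    {
        try
        {
            text = StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }

    public static void Main()
    {
        string text;
        Console.WriteLine(TryDecode(new byte[] { 0x41, 0x42 }, out text)); // True
        Console.WriteLine(TryDecode(new byte[] { 0x41, 0xFF }, out text)); // False: 0xFF never occurs in UTF-8
    }
}
```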
If you insist on reading the file as text, I suggest using code page 437, because it has 256 characters, one for every byte value, has no restrictions on byte sequences, and each 437 character is also in Unicode. The bytes that represent text will possibly decode to the same characters that you want to search for as strings, but you have to check by comparing 437 and Unicode in this table.
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.

byte[] buffer handling on c-sharp

I'm writing a class which is used to work against a byte[] buffer. It contains methods like char Peek() and string ReadRestOfLine().
The problem is that I would like to add support for unicode and I don't really know how I should change those methods (they only support ASCII now).
How do I detect that the next bytes in the buffer is a unicode sequence (utf8 or utf16)? And how do I convert them to a char?
Update
Yes, the class is a bit similar to the StreamReader, but with the difference that it will avoid creating objects (like string, char[]) etc until the entire wanted string has been found. It's used in a high performance socket framework.
For instance: let's say that I want to write a proxy that only checks the URI in an HTTP request. If I were to use the StreamReader, I would have to build a temporary char array each time a receive completes, just to see whether a newline character has been received.
By using a class that works directly against the byte[] buffer that socket.ReceiveAsync uses, I just have to traverse the buffer in my parser to know if the next step can be completed. No temporary objects are created.
For most protocols ASCII is used in the header area and UTF8 will not be a problem (the request body can be parsed using StreamReader). I'm just interested in how it can be solved avoiding to create unnecessary objects.
I don't think you want to go there. There are tons of stuff that can go wrong. First of all: What encoding are you using? Then, does the buffer contain the entire encoded string? Or does it start at some random position, possibly inside such a sequence?
Your classes sound a bit like a StreamReader for a MemoryStream. Maybe you can use those?
From the documentation:
Implements a TextReader that reads characters from a byte stream in a particular encoding.
If the point of your exercise is to figure out how to do this yourself... take a peek into how the library did it. I think you'll find the method StreamReader.Read() interesting:
Reads the next character from the input stream and advances the character position by one character.
There is a one-to-one correspondence between bytes and ASCII characters, making it easy to treat bytes as characters. Modifying your code to handle the various encodings of Unicode may not be easy. However, to answer part of your question:
How do I detect that the next bytes in the buffer is a unicode sequence (utf8 or utf16)? And how do I convert them to a char?
You can use the System.Text.Encoding class. You can use the predefined encoding objects Encoding.Unicode and Encoding.UTF8 and use methods like GetCharCount, GetChars and GetString.
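For the case where a buffer may end in the middle of a multi-byte sequence, a stateful Decoder obtained from Encoding.UTF8.GetDecoder() is one option. The sketch below is only an illustration (the class name and the sample bytes are assumptions, and it allocates freely, which a high-performance parser would want to avoid): the decoder remembers a trailing partial sequence and completes it with the next chunk.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class SplitSequenceDecoding
{
    // Decodes chunks in order with one stateful Decoder, so a multi-byte
    // sequence cut at a chunk boundary is completed by the following chunk.
    public static string DecodeChunks(params byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            result.Append(chars, 0, written);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // U+00E9 is C3 A9 in UTF-8; here it is split across two receives.
        byte[] first = { 0x63, 0x61, 0x66, 0xC3 };
        byte[] second = { 0xA9, 0x21 };
        Console.WriteLine(DecodeChunks(first, second) == "caf\u00E9!"); // True
    }
}
```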
I've created a BufferSlice class which wraps the byte[] buffer and makes sure that only the assigned slice is used. I've also created a custom reader to parse the buffer.
UTF-8 turned out not to be a problem, since I only parse the buffer to find characters that are not multi-byte (space, minus, semicolon etc). I then use Encoding.GetString from the last delimiter to the current one to get a proper string back.
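The reason this works is that every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter byte can never appear inside one. A minimal sketch of the idea follows. It is not the BufferSlice class described above; the names and the line-based protocol are assumptions made for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class LineSlicer
{
    // Returns the text from offset up to (not including) the line terminator,
    // or null if the buffer does not yet hold a complete line.
    // consumed is the number of bytes used, terminator included.
    public static string ReadLine(byte[] buffer, int offset, int count, out int consumed)
    {
        // Scan raw bytes; no string or char[] is created while searching.
        int newline = Array.IndexOf(buffer, (byte)'\n', offset, count);
        if (newline < 0)
        {
            consumed = 0;
            return null;
        }

        int length = newline - offset;
        if (length > 0 && buffer[newline - 1] == (byte)'\r')
        {
            length--; // drop the CR of a CRLF pair
        }

        consumed = newline - offset + 1;
        // Decode only once, when the whole line is known to be present.
        return Encoding.UTF8.GetString(buffer, offset, length);
    }

    public static void Main()
    {
        byte[] buffer = Encoding.UTF8.GetBytes("GET /caf\u00E9 HTTP/1.1\r\nHost: x");
        int consumed;
        string line = ReadLine(buffer, 0, buffer.Length, out consumed);
        Console.WriteLine(line == "GET /caf\u00E9 HTTP/1.1"); // True
    }
}
```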

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What I am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of chars and save it to a txt file. I saw somewhere that converting to UTF8 should help, but I don't know how to. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using UTF-16 in .NET. That encoding uses two bytes per code unit, providing 65536 distinct values. Unicode however has over one million codepoints. To make that work, the Unicode codepoints above \uffff (above the BMP, Basic Multilingual Plane) are encoded with a surrogate pair. The first one (the high surrogate) has a value between 0xd800 and 0xdbff, the second (the low surrogate) between 0xdc00 and 0xdfff. That provides 2^(10 + 10) = 1,048,576, or about one million, additional codes.
You can perhaps see where this leads: in your case the code detects a low surrogate value (0xdfff) that isn't preceded by a high surrogate. That's illegal. There are many more possible mishaps: several codepoints are unassigned, and several are diacritics that get mangled when the string is normalized.
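To see the problem in a string before trying to write it, a check along these lines can help. It is only a sketch with invented names: it walks the UTF-16 code units and reports the first surrogate that is not part of a valid high-then-low pair, using char.IsHighSurrogate and char.IsLowSurrogate.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class SurrogateCheck
{
    // Index of the first surrogate code unit that is not part of a valid
    // high+low pair, or -1 if the string is well-formed UTF-16.
    public static int IndexOfLoneSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                if (i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    i++; // skip the low half of a valid pair
                    continue;
                }
                return i; // high surrogate with no low surrogate after it
            }
            if (char.IsLowSurrogate(s[i]))
            {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    public static void Main()
    {
        string broken = "ab" + (char)0xDFFF; // same code unit as in the error message
        Console.WriteLine(IndexOfLoneSurrogate(broken));         // 2
        Console.WriteLine(IndexOfLoneSurrogate("\uD83D\uDE00")); // -1, a valid pair
    }
}
```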
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, so 3 bytes require 4 characters. The character set is ASCII, so the odds of the receiving program decoding the characters back to binary incorrectly are minimal. Only a decades-old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data to a text stream this SO question already contains an answer to the question: "How do I encode something as base64?" From there plain ASCII/ANSI text is fine for the output encoding.
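A minimal round-trip sketch with Convert.ToBase64String and Convert.FromBase64String. The class name and the sample bytes are made up, and it assumes the algorithm's output is available as a byte[] rather than as chars. The resulting string is plain ASCII, so it can be written with the StreamWriter from the question without any encoding trouble.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class Base64RoundTrip
{
    public static string Encode(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    public static byte[] Decode(string text)
    {
        return Convert.FromBase64String(text);
    }

    public static void Main()
    {
        // Arbitrary binary data, including byte values that form a lone surrogate as UTF-16.
        byte[] data = { 0xDF, 0xFF, 0x00, 0x5D };

        string text = Encode(data);
        Console.WriteLine(text); // 3/8AXQ==

        byte[] back = Decode(text);
        Console.WriteLine(BitConverter.ToString(back)); // DF-FF-00-5D
    }
}
```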

How to use strange characters in a query string

I am using silverlight / ASP .NET and C#. What if I want to do this from silverlight for instance,
// I have left out the quotes to show you literally what the characters
// are that I want to use
string password = vtakyoj#"5
string encodedPassword = HttpUtility.UrlEncode(password, Encoding.UTF8);
// encoded password now = vtakyoj%23%225
Uri uri = new Uri("http://www.url.com/page.aspx#password=vtakyoj%23%225");
HtmlPage.Window.Navigate(uri);
If I debug and look at the value of uri it shows up as this (we are still inside the silverlight app),
http://www.url.com?password=vtakyoj%23"5
So the %22 has become a quote for some reason.
If I then debug inside the page.aspx code (which of course is ASP .NET) the value of Request["password"] is actually this,
vtakyoj#"5
Which is the original value. How does that work? I would have thought that I would have to go,
HttpUtility.UrlDecode(Request["password"], Encoding.UTF8)
To get the original value.
Hope this makes sense?
Thanks.
First let's start with the UTF8 business. Essentially, in this case there isn't any. When a string contains only characters within the standard ASCII character range (as your password does), a UTF8 encoding of that string is identical to a single-byte ASCII string.
You start with this:-
vtakyoj#"5
The HttpUtility.UrlEncode somewhat aggressively encodes it to:-
vtakyoj%23%225
It has encoded both the # and the ", but only # has special meaning in a URL. Hence when you view the string value of the Uri object in Silverlight you see:-
vtakyoj%23"5
Edit (answering supplementary questions)
How does it know to decode it?
All data in a URL must be properly encoded; that's part of its being a valid URL. Hence the web server can rightly assume that all data in the query string has been appropriately encoded.
What if I had a real string which had %23 in it?
The correct encoding for "%23" would be "%2523", where %25 is the encoding of %.
Is that a documented feature of Request["Password"] that it decodes it?
Well I dunno, you'd have to check the documentation I guess. BTW use Request.QueryString["Password"]; the presence of this same indexer directly on Request was for the convenience of porting classic ASP to .NET. It doesn't make any real difference, but it's better for clarity since it's easier to make the distinction between QueryString values and Form values.
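The encode and decode steps can be tried outside ASP.NET with Uri.EscapeDataString and Uri.UnescapeDataString. This is only a sketch with invented names, and it uses the Uri methods rather than HttpUtility so that it runs without a reference to System.Web. The two are not identical in every detail (spaces, for example, are handled differently), but for the values discussed here they agree.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class QueryValueEncoding
{
    public static string Encode(string value)
    {
        return Uri.EscapeDataString(value);
    }

    public static string Decode(string value)
    {
        return Uri.UnescapeDataString(value);
    }

    public static void Main()
    {
        Console.WriteLine(Encode("vtakyoj#\"5")); // vtakyoj%23%225
        Console.WriteLine(Encode("%23"));         // %2523, the % itself becomes %25
        Console.WriteLine(Decode("%2523"));       // %23, decoding is applied once only
    }
}
```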
if I don't use UFT8 the characters are being filtered out.
Are you sure that non-ASCII characters may be present in the password? Can you provide an example? Your current example does not need encoding with UTF-8.
If Request["password"] is to work, you need "http://url.com?password=" + HttpUtility.UrlEncode("abc%$^##"), i.e. you need a ? to separate the query string from the rest of the URL.
Also, the @ syntax is username:password@hostname, but it has been disabled in IE7 and above IIRC.
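The difference between ? and # can be checked with the Uri class. The sketch below uses made-up names and a simplified password value: text after ? ends up in Uri.Query, while text after # ends up in Uri.Fragment, which is not part of the query string and is not sent to the server.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class QueryVersusFragment
{
    public static string QueryOf(string url)
    {
        return new Uri(url).Query;
    }

    public static string FragmentOf(string url)
    {
        return new Uri(url).Fragment;
    }

    public static void Main()
    {
        Console.WriteLine(QueryOf("http://www.url.com/page.aspx?password=abc"));        // ?password=abc
        Console.WriteLine(FragmentOf("http://www.url.com/page.aspx#password=abc"));     // #password=abc
        Console.WriteLine(QueryOf("http://www.url.com/page.aspx#password=abc").Length); // 0
    }
}
```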

Can a Base64 String contain tabs?

Simple yes or no question, and I'm 90% sure that it is no... but I'm not sure.
Can a Base64 string contain tabs?
It depends on what you're asking. If you are asking whether or not tabs can be base-64 encoded, then the answer is "yes" since they can be treated the same as any other ASCII character.
However, if you are asking whether or not base-64 output can contain tabs, then the answer is no. The following link is for an article detailing base-64, including which characters are considered valid:
http://en.wikipedia.org/wiki/Base64
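Both halves of that answer can be checked in a few lines. This is a sketch with invented names that assumes the text is converted to bytes as UTF-8 first: a tab goes in as the single byte 9 and comes out as Base64 alphabet characters only.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class TabInBase64
{
    public static string EncodeText(string text)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(EncodeText("\t"));    // CQ==
        Console.WriteLine(EncodeText("a\tb"));  // YQli
        Console.WriteLine(EncodeText("a\tb").IndexOf('\t')); // -1, no tab in the output
    }
}
```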
The short answer is no - but Base64 cannot contain carriage returns either.
That is why, if you have multiple lines of Base64, you strip out any carriage returns, line feeds, and anything else that is not in the Base64 alphabet. That includes tabs.
From wikipedia.com:
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
As you can see, tab characters are not included. However, you can of course encode a tab character into a base64 string.
Sure. Tab is just ASCII character 9, and that has a base64 representation just like any other integer.
Base64 specification (RFC 4648) states in Section 3.3 that any encountered non-alphabet characters should be rejected unless explicitly allowed by another specification:
Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may instead state, as MIME does, that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept"). Note that this means that any adjacent carriage return/ line feed (CRLF) characters constitute "non-alphabet characters" and are ignored.
Specs such as PEM (RFC 1421) and MIME (RFC 2045) specify that Base64 strings can be broken up by whitespace. Per the referenced RFC 822, a tab (HTAB) is considered a whitespace character.
So, when Base64 is used in context of either MIME or PEM (and probably other similar specifications), it can contain whitespace, including tabs, which should be handled (stripped out) while decoding the encoded content.
Haha, as you see from the responses, this is actually not such a simple yes/no answer.
A resulting Base64 string after conversion cannot contain a tab character, but it seems to me that you are not asking that. It seems to me that you are asking whether you can represent a string (before conversion) containing a tab in Base64, and the answer to that is yes.
I would add though that really what you should do is make sure that you take care to preserve the encoding of your string, i.e. convert it to an array of bytes with your correct encoding (Unicode, UTF-8 whatever) then convert that array of bytes to base64.
EDIT: A simple test.
private void button2_Click(object sender, EventArgs e)
{
    StringBuilder sb = new StringBuilder();
    string test = "The rain in spain falls \t mainly on the plain";
    sb.AppendLine(test);
    UTF8Encoding enc = new UTF8Encoding();
    byte[] b = enc.GetBytes(test);
    string cvtd = Convert.ToBase64String(b);
    sb.AppendLine(cvtd);
    byte[] c = Convert.FromBase64String(cvtd);
    string backAgain = enc.GetString(c);
    sb.AppendLine(backAgain);
    MessageBox.Show(sb.ToString());
}
It seems that there is lots of confusion here; and surprisingly most answers are of "No" variety. I don't think that is a good canonical answer.
The reason for confusion is probably the fact that Base64 is not strictly specified; multiple practical implementations and interpretations exist.
In general, however, conforming base64 codecs SHOULD understand linefeeds, as they are mandated by some base64 definitions (76 character segments, then linefeed etc).
Because of this, most decoders also allow for indentation whitespace, and quite commonly any whitespace between 4-character "triplets" (so named since they encode 3 bytes).
So there's a good chance that in practice you can use tabs and other white space.
But I would not add tabs myself when generating base64 content to send to a service: be conservative in what you send, (more) liberal in what you receive.
Convert.FromBase64String() in the .NET framework does not seem to mind them. I believe all whitespace in the string is ignored.
string xxx = "ABCD\tDEFG"; //simulated Base64 encoded string w/added tab
Console.WriteLine(xxx);
byte[] xx = Convert.FromBase64String(xxx); // convert string back to binary
Console.WriteLine(BitConverter.ToString(xx));
output:
ABCD DEFG
00-10-83-0C-41-46
The relevant clause of RFC 2045 (section 6.8):
The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.
YES!
Base64 is used to encode ANY 8bit value (Decimal 0 to 255) into a string using a set of safe characters. TAB is decimal 9.
Base 64 uses one of the following character sets:
Data: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
URLs: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
Binary Attachments (eg: email) in text are also encoded using this system.
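The two character sets differ only in the last two symbols, so converting between them is a pair of character replacements. A small sketch with invented names follows. It keeps the = padding as it is, although some URL-safe variants drop it.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class Base64Alphabets
{
    // Standard alphabet -> URL-safe alphabet: '+' becomes '-', '/' becomes '_'.
    public static string ToUrlSafe(string base64)
    {
        return base64.Replace('+', '-').Replace('/', '_');
    }

    // URL-safe alphabet -> standard alphabet, ready for Convert.FromBase64String.
    public static string FromUrlSafe(string urlSafe)
    {
        return urlSafe.Replace('-', '+').Replace('_', '/');
    }

    public static void Main()
    {
        string standard = Convert.ToBase64String(new byte[] { 0xFB, 0xFF });
        Console.WriteLine(standard);            // +/8=
        Console.WriteLine(ToUrlSafe(standard)); // -_8=
    }
}
```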
