Once I register a nick with an IRC server (in this case Freenode), I receive a message like this:
:NickServ!NickServ@services. NOTICE IRCLIBtester :*IRCLIBtester* is not a registered nickname.
I have inserted asterisks (*) where the odd 0x02 byte is received. Since the server's 005 ISUPPORT reply contained
CASEMAPPING=rfc1459 CHARSET=ascii
I assumed the messages would be pure ASCII, but in ASCII 0x02 is the start-of-text marker. Looking at how other clients (in this case HexChat) parse the string, I noticed they treat it as a "bold font" toggle, so the nick is shown in bold. Is this common practice? And if so, which format is this?
My first thought was RTF, but since I display the text in a RichTextBox (C#), it should have parsed the 0x02 byte itself, right?
0x02 is the code for bold in IRC. These formatting codes are not documented in the RFCs (1459 and onwards) but can be found via Google.
You may find other format codes here: http://forum.egghelp.org/viewtopic.php?t=3867
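A RichTextBox will not interpret these bytes by itself, so the client has to handle them. Below is a minimal sketch that simply strips the commonly used codes before display. The class and method names are made up for illustration, and the code values (0x02 bold, 0x03 colour, 0x0F reset, 0x16 reverse, 0x1D italic, 0x1F underline) are client conventions rather than anything taken from the RFC, so treat the list as an assumption to check against the clients you care about.

```csharp
using System;
using System.Text;

// Hypothetical helper: removes common IRC formatting codes from a message.
static class IrcFormatting
{
    const char Bold = '\u0002';
    const char Color = '\u0003';
    const char Reset = '\u000F';
    const char Reverse = '\u0016';
    const char Italic = '\u001D';
    const char Underline = '\u001F';

    public static string Strip(string message)
    {
        var sb = new StringBuilder(message.Length);
        int i = 0;
        while (i < message.Length)
        {
            char c = message[i];
            if (c == Color)
            {
                i++;
                // Optional foreground colour: up to two ASCII digits.
                int afterForeground = SkipDigits(message, i, 2);
                bool hasForeground = afterForeground > i;
                i = afterForeground;
                // Optional background colour: a comma followed by up to two digits.
                if (hasForeground && i + 1 < message.Length
                    && message[i] == ',' && IsAsciiDigit(message[i + 1]))
                {
                    i = SkipDigits(message, i + 1, 2);
                }
            }
            else if (c == Bold || c == Reset || c == Reverse || c == Italic || c == Underline)
            {
                i++; // simple toggles carry no parameters
            }
            else
            {
                sb.Append(c);
                i++;
            }
        }
        return sb.ToString();
    }

    static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    static int SkipDigits(string s, int start, int max)
    {
        int i = start;
        while (i < s.Length && i - start < max && IsAsciiDigit(s[i])) i++;
        return i;
    }
}
```

To render bold instead of stripping it, the same loop could flip a flag on 0x02 and apply a bold font to the text appended while the flag is set.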
Related
I have code that reads data from a textbox.text control into a byte array. It uses UTF8 encoding and there has not been any issues. The code reads, say, M number of bytes from the textbox, and adds it to output, as bytes. That all works fine.
When the data is written back, there are often problems if the text is not English. For instance, say the text is the Chinese character 南 repeated a few times, which the text box encodes as 0xE5, 0x8D, 0x97.
If the first write ends on 0xE5 and the next batch of data starts with 0x8D 0x97, that second batch is somehow transformed to 0xEF 0xBF 0xBD when it is written back to the text box.
I'm just using Array.Copy. Nothing special. With English, no problem. With Chinese (and Japanese as well), the first write goes OK but the second write has some of these "corrupted" chars.
The problem is probably not related to reading from or writing to the textbox. The problem is how you convert text to bytes and back. You have not provided any code, so my code may not be exactly what you want, but to convert a string to UTF-8 bytes you can do:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(textBox1.Text);
To convert byte[] to string:
textbox1.Text = System.Text.Encoding.UTF8.GetString(bytes);
If you ignore the encoding and just use ASCII, you will lose data when converting to bytes.
There is also a question related to converting Chinese to byte[]:
How to encode and decode Broken Chinese/Unicode characters?
First, thanks for that information. I only used Chinese as an example. The code will not know the language and should not care. It could be Hindi or Japanese. Your conversion byte[] to string is what I use.
After I posted the question I realized that the code seems to correctly handle data, just not writing back to the Textbox text control. I'm not sure what the control is doing, perhaps it "detects" the language or detects it's not UTF8 and tries some kind of encoding.
BUT in any case I deferred writing the bytes back into the text box until the end and that seems to work just fine. That is to say, I keep adding the bytes back into an array using Array.Copy(...) and at the end write the whole thing back into the text box using UTF8, as you mentioned.
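For what it's worth, 0xEF 0xBF 0xBD is the UTF-8 encoding of U+FFFD, the replacement character that Encoding.UTF8.GetString emits when a chunk starts or ends in the middle of a multi-byte sequence. If deferring the write until the end is not an option, a stateful Decoder can carry the incomplete bytes over to the next chunk. A minimal sketch, with a made-up class name and the chunk boundaries from the question:

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes UTF-8 that arrives in arbitrary chunks.
static class ChunkedUtf8
{
    public static string DecodeChunks(params byte[][] chunks)
    {
        // The Decoder remembers trailing bytes of an incomplete sequence
        // between calls, which Encoding.GetString cannot do.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }
}
```

Calling ChunkedUtf8.DecodeChunks(new byte[] { 0xE5 }, new byte[] { 0x8D, 0x97 }) yields 南, whereas decoding each chunk separately with GetString produces replacement characters.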
I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
مدل-رنگ-موی-جدید-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided us with an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (u2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in Unicode (as in Latin-1) Ø is U+00D8 and ± is U+00B1. So, the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes( garbledUnicodeString ); // or GetEncoding(1252), depending on how the text was mis-decoded
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
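Here is a small round-trip sketch of that idea, assuming the wrong interpretation was Latin-1 (ISO-8859-1). If the text was actually mis-decoded as Windows-1252 or another code page, the same approach applies with that encoding swapped in. The class and method names are only for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper: reproduces and undoes "UTF-8 read as Latin-1" garbling.
static class MojibakeRepair
{
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Simulates the bug: UTF-8 bytes interpreted as Latin-1 characters.
    public static string Garble(string text)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Undoes it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(Latin1.GetBytes(garbled));
    }
}
```

With this, MojibakeRepair.Garble("ر") gives "Ø±" and MojibakeRepair.Repair("Ø±") gives back "ر". The repair only works while every garbled character still maps to a single byte in the assumed encoding.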
I'm looking for Encoding/Decoding algorithm.
I have tried this:
http://codeproblem.hamaraquetta.com/articles/languages/81-net-framework/76-encoding-sms-in-pdu-format-in-net?showall=&start=1
and no luck. :(
Here is what I'm trying
This is the text:
This is a long text message greater than 160 characters. You can encode it to PDU format using the SMS-PDU lib for .NET, It also supports UCS-2 encoding, and special characters like { [ ] } are also supported. Its quite simple to use in your code.
From this text there should be 2 messages encoded to septets, after which I should be able to submit the message.
This is the result I get:
Part 1:
0041000C917952205197720000A00500033F0201A8E8F41C949E83C220F6DB7D06D1CB783AA85D9ECFC3E732E82C2F87E96539888E0EBB41311B0C344687E5E131BD2C9FBB40D9771D340EBB4165F7F84D2E83D27410FD0D8212AB20F35BDE0ED341F579DA7D06D1D165D0B4396D418955103B2D0699DF7290CB59A4B240493A28CC9EBF41F33A1CFE96D3E7A0EA70DA9281CAEEF19B9C769F59
Part 2:
0041000C917952205197720000690500033F020240613719348797C7E9301B344687E5E131BD2C9F83D8E97519B44181363CD0C607DAA4406179191466CFDFA0791D0E7FCBE965B20B94A4CF41F17A9A5E06CDD36D38BB0CA2BF41F57919947683F2EFBA1C347E93CB2E
This doesn't work.
How do I solve this?
Btw: this is the phone number; I know it's important.
+972502157927
The library works correctly. ComposeLongSms() returns a string array of PDUs, and you should send ("submit", as you said) each of these PDUs to your GSM modem as a separate SMS. Concatenating them won't work; notice that each PDU starts with the same part, which contains the encoded additional information for the outgoing SMS. You can verify your PDUs here
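To see that the two PDUs really are separate parts of one message, you can decode their headers by hand. The sketch below is not part of the SMS-PDU library; it is a hypothetical parser that assumes an SMS-SUBMIT PDU whose user data header, if present, uses the 8-bit concatenation element (IEI 0x00), and it ignores the message body entirely.

```csharp
using System;
using System.Text;

// Hypothetical helper: reads the header fields of an SMS-SUBMIT PDU.
static class SubmitPduHeader
{
    public sealed class Info
    {
        public string Destination;
        public int Reference;
        public int TotalParts;
        public int PartNumber;
    }

    // Reads one octet (two hex digits) and advances the cursor.
    static int ReadOctet(string pdu, ref int pos)
    {
        int value = Convert.ToInt32(pdu.Substring(pos, 2), 16);
        pos += 2;
        return value;
    }

    public static Info Parse(string pdu)
    {
        var info = new Info();
        int pos = 0;

        int smscLength = ReadOctet(pdu, ref pos);
        pos += smscLength * 2;                        // skip SMSC info, if any

        int firstOctet = ReadOctet(pdu, ref pos);
        bool hasUserDataHeader = (firstOctet & 0x40) != 0;
        int validityFormat = (firstOctet >> 3) & 0x03;
        ReadOctet(pdu, ref pos);                      // message reference

        int digitCount = ReadOctet(pdu, ref pos);
        int typeOfAddress = ReadOctet(pdu, ref pos);
        int addressHexLength = (digitCount + 1) / 2 * 2;
        var number = new StringBuilder();
        for (int i = 0; i < addressHexLength; i += 2)
        {
            // Digits are stored as swapped nibbles: "79" means "97".
            number.Append(pdu[pos + i + 1]).Append(pdu[pos + i]);
        }
        pos += addressHexLength;
        info.Destination = (typeOfAddress == 0x91 ? "+" : "") + number.ToString(0, digitCount);

        ReadOctet(pdu, ref pos);                      // protocol identifier
        ReadOctet(pdu, ref pos);                      // data coding scheme
        if (validityFormat == 2) pos += 2;            // relative validity period: 1 octet
        else if (validityFormat != 0) pos += 14;      // enhanced or absolute: 7 octets
        ReadOctet(pdu, ref pos);                      // user data length

        if (!hasUserDataHeader) return info;

        int headerLength = ReadOctet(pdu, ref pos);
        int headerEnd = pos + headerLength * 2;
        while (pos < headerEnd)
        {
            int elementId = ReadOctet(pdu, ref pos);
            int elementLength = ReadOctet(pdu, ref pos);
            if (elementId == 0x00 && elementLength == 3)
            {
                info.Reference = ReadOctet(pdu, ref pos);
                info.TotalParts = ReadOctet(pdu, ref pos);
                info.PartNumber = ReadOctet(pdu, ref pos);
            }
            else
            {
                pos += elementLength * 2;             // skip elements we don't understand
            }
        }
        return info;
    }
}
```

Run against the two PDUs in the question, this reports destination +972502157927, reference 63, total parts 2, and part numbers 1 and 2 respectively, which is what a phone needs in order to reassemble the parts after they are submitted one by one.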
I read some data from a device and then send this data to a web server as XML. Because the data has to be represented in XML, I need to convert the characters in the range 0-31, since these characters cannot be represented in XML.
The question is: how can I convert the characters between 0 and 31 (decimal) in a string into something like [00]abcde[01]fgh[02]...
Is there a built-in function in the .NET Framework, or any accepted pattern, for this?
Thanks
You should use standard XML encoding:
Your XML API will do that for you, so you don't need to worry about anything.
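You can simply encode the number as an XML numeric character reference: write &# followed by the number and a semicolon,
so 1 becomes &#1; and 13 becomes &#13;,
and so on and so forth. Note that among the control characters XML 1.0 only permits tab, line feed and carriage return, even as character references; references such as &#1; are only well-formed in XML 1.1.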
However as noted by dan04 you can't represent 0 as a numeric character reference, so in the case where your data might include 0 you will have to use a different encoding. You could encode the entire binary data as base64
Most XML toolboxes will do the encoding to NCRs for you though so you really shouldn't have to worry about that
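If you do want the bracketed form from the question, I'm not aware of a built-in .NET function that produces exactly that, but it only takes a few lines. A minimal sketch with a made-up class name follows. It assumes the receiver knows the [XX] convention and that a literal "[00]" never occurs in the real data; otherwise the format is ambiguous and base64 (Convert.ToBase64String) is the safer choice.

```csharp
using System;
using System.Text;

// Hypothetical helper: writes control characters 0-31 as [XX] (two hex digits).
static class ControlCharEscaper
{
    public static string Escape(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c < 32)
            {
                sb.Append('[').Append(((int)c).ToString("X2")).Append(']');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, ControlCharEscaper.Escape("\0abcde\u0001fgh\u0002") returns "[00]abcde[01]fgh[02]". The result contains no control characters, so the XML API can then handle the remaining escaping of &, < and > as usual.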
I'm not entirely sure if the question even makes sense. I'm taking a byte array from an ID3 tag and converting it to a string. Most text frames in an ID3 tag use ISO 8859-1 encoding, but it depends on the frame. In any case, if you look up what 0x00 is in the ISO 8859-1 codes, it is invalid.
To further complicate things, either due to programmer error or just poor formatting, some of the strings end in 0x00 and some do not.
When converting a series of bytes into a string using ISO 8859-1 encoding, do you have to manually check the end of the string to see if it is a null? Or will the encoding object, through whatever method it uses to convert in the first place, deal with the null properly? Furthermore, is there some sort of function that could normalize or "fix" the null-terminated string?
When you try to display these strings they do not display properly.
I am using C# for this particular project.
Some extra info here about ID3 Tags: ID3 Specs
Or am I completely misunderstanding the whole thing? Is a null terminator simply a way a particular language handles strings and it has nothing to do with encoding?
Edit: I used System.Text.Encoding.GetEncoding("iso-8859-1") followed by a GetString call
If you use Encoding.GetEncoding(28591), it just converts a byte 0 to the Unicode U+0000. Encodings generally assume that they have to convert all the bytes - they don't look for terminators.
This treatment of 0 as Unicode 0 is in line with the Wikipedia description:
In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.
The C0 and C1 control characters page includes:
0: Originally used to allow gaps to be left on paper tape for edits. Later used for padding after a code that might take a terminal some time to process (e.g. a carriage return or line feed on a printing terminal). Now often used as a string terminator, especially in the C programming language.
Sample code:
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        byte[] data = { 0, 0 };
        Encoding latin1 = Encoding.GetEncoding(28591);
        string text = latin1.GetString(data);
        Console.WriteLine(text.Length);     // 2
        Console.WriteLine((int) text[0]);   // 0
        Console.WriteLine((int) text[1]);   // 0
    }
}
Happily, ASCII, ISO-8859-1 and Unicode all agree on codepoints in the range 0..127. Thus your character '\0' will be encoded identically in ASCII, ISO-8859-1 and UTF-8.
If your program assigns special semantics to the zero byte, you have to take care of that appropriately.
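In practice, "taking care of it" for ID3 text frames can be as simple as cutting the decoded string at the first U+0000. A minimal sketch, with a made-up class name, assuming the frame bytes are ISO-8859-1 and that anything after the first null is padding (this is a simplification; frames that hold several null-separated values would need to be split instead):

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes an ISO-8859-1 text frame with an optional null terminator.
static class Id3Text
{
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    public static string Decode(byte[] frame)
    {
        // The encoding converts every byte, including 0x00, so trim afterwards.
        string text = Latin1.GetString(frame);
        int terminator = text.IndexOf('\0');
        return terminator >= 0 ? text.Substring(0, terminator) : text;
    }
}
```

Both the terminated and unterminated variants then normalize to the same string, so Id3Text.Decode(new byte[] { 0x41, 0x42, 0x00 }) and Id3Text.Decode(new byte[] { 0x41, 0x42 }) each return "AB".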