Whenever i send any ASCII value greater than 127 to the com port i get junk output values at the serial port.
ComPort.Write(data);
Strictly speaking ASCII only contains 128 possible symbols.
You will need to use a different character set to send anything other than the 33 control characters and 95 printable characters that make up ASCII.
To make things more confusing, ASCII is used as a starting point for several larger (conflicting) character sets.
This list is not comprehensive, but the most common are:
Extended ASCII, which is ASCII with 128 more characters in it.
ISO 8859-1 is ASCII with all characters required to represent all Western European languages.
Windows-1252 is ISO 8859-1 with a few alterations (including printable characters in the 0x80-0x9F range).
ISO 8859-2 which is the equivalent for Eastern European languages.
Windows-1250 is also for Eastern european languages, but bears little resemblance to 8859-2.
UTF-8 also matches ASCII for the first 128 values, but uses the byte values 128 to 255 as parts of multi-byte sequences rather than as single characters.
Back to your problem: your encoding is set to ASCII, so you are prevented from sending any characters outside of that character set. You need to set your encoding to an appropriate character set for the data you are sending. In my experience (which is admittedly in USA and Western Europe) Windows-1252 is the most used.
In C#, you can send either a string or bytes through a Serial Port. If you send a string, it uses the SerialPort.Encoding to convert that string into bytes to send. You can set this to the appropriate encoding for your purposes, or you can manually convert a string with a System.Text.Encoding object.
Set the encoder on the com port:
ComPort.Encoding = Encoding.GetEncoding("Windows-1252");
or manually encode the string:
System.Text.Encoding enc = System.Text.Encoding.GetEncoding("Windows-1252");
byte[] sendBuffer = enc.GetBytes(command);
ComPort.Write(sendBuffer, 0, sendBuffer.Length);
Both do functionally the same thing.
EDIT:
0x96 is a valid character in Windows-1252.
The code below outputs an en dash (a "long" hyphen); the normal hyphen is 0x2D.
System.Text.Encoding enc = System.Text.Encoding.GetEncoding("windows-1252");
byte[] buffer = new byte[]{0x96};
Console.WriteLine(enc.GetString(buffer));
The issue was resolved when the encoding on the serial port was changed.
this.ComPort.Encoding = Encoding.GetEncoding(28591);
I am having an issue trying to convert a string containing Simplified Chinese to a double-byte encoding (GB2312). This is for printing Chinese characters to a Zebra printer.
The specs I am looking at show an example with the text of "冈区色呆" which they show as converting to a hex value of 38_54_47_78_49_2b_34_74.
In my C# code I am trying to convert this using the below code as a test. My result seems to be off by 7 in the leading hex value. What am I missing here?
private const string SimplifiedChineseChars = "冈区色呆";
[TestMethod]
public void GetBackCorrectHexValues()
{
byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
string hex = BitConverter.ToString(bytes).Replace("-", "_");
//I get the following: B8_D4_C7_F8_C9_AB_B4_F4
//I am expecting: 38_54_47_78_49_2b_34_74
}
The only thing that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding.
Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.
Here is the Wikipedia entry on HZ. Interesting parts:
The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.
the HZ code uses only printable, 7-bit characters to represent Chinese characters.
And, according to this Microsoft reference page on EncodingInfo.GetEncoding, this character encoding is supported in .NET:
52936 hz-gb-2312 Chinese Simplified (HZ)
If I try your code, and replace the character encoding to use HZ, I get:
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
string hex = BitConverter.ToString(bytes).Replace("-", "_");
Console.WriteLine(hex);
}
Output:
7E_7B_38_54_47_78_49_2B_34_74_7E_7D
So, you basically get exactly what you are looking for, except that it adds the escape sequences ~{ and ~} before and after the Chinese character bytes. Those escape sequences are necessary because this encoding supports mixing ASCII character bytes (single-byte encoding) with GB Chinese character bytes (double-byte encoding). The escape sequences mark the areas that should not be interpreted as ASCII.
If you choose to use the hz-gb-2312 encoding, you would have to strip any unwanted escape sequences yourself, if you think you don't need them. But, perhaps you do need them. You'll have to figure out exactly what your printer is expecting.
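If you do need to strip them, here is a minimal sketch of that idea (an assumption on my part, not something your printer documentation prescribes): it removes the ~{ (0x7E 0x7B) and ~} (0x7E 0x7D) pairs, and it assumes the text is purely Chinese, so no 0x7E byte occurs inside the GB data itself.
// needs: using System; using System.Collections.Generic; using System.Text;
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] raw = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);

    var stripped = new List<byte>();
    for (int i = 0; i < raw.Length; i++)
    {
        // Skip the two-byte escape sequences ~{ and ~} (the mode switches).
        if (raw[i] == 0x7E && i + 1 < raw.Length && (raw[i + 1] == 0x7B || raw[i + 1] == 0x7D))
        {
            i++; // also skip the second byte of the escape sequence
            continue;
        }
        stripped.Add(raw[i]);
    }

    Console.WriteLine(BitConverter.ToString(stripped.ToArray()).Replace("-", "_"));
    // Prints: 38_54_47_78_49_2B_34_74
}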
Alternatively, if you really don't want those escape sequences, are not worried about handling ASCII characters, and are confident that you only have to deal with Chinese double-byte characters, then you could stick with the vanilla GB2312 encoding and drop the most significant bit of every byte yourself to essentially convert the results to a 7-bit encoding.
Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] bytes = Encoding.GetEncoding("gb2312") // vanilla gb2312 encoding
.GetBytes(SimplifiedChineseChars)
.Select(b => (byte)(b & 0x7F)) // retain 7 bits only
.ToArray();
string hex = BitConverter.ToString(bytes).Replace("-", "_");
Console.WriteLine(hex);
}
Output:
38_54_47_78_49_2B_34_74
In C# I need to get the ASCII code of some characters.
So I convert the char to a byte or an int, then print the result.
String sample="A";
int AsciiInt = sample[0];
byte AsciiByte = (byte)sample[0];
For characters with ASCII code 128 and less, I get the right answer.
But for characters greater than 128 I get irrelevant answers!
I am sure all characters are less than 0xFF.
Also, I have tested System.Text.Encoding and got the same results.
For example, I get 172 for a char with an actual byte value of 129!
Actually, ASCII characters like ƒ, ‡, ‹, “, ¥, ©, Ï, ³, ·, ½, », Á each take 1 byte and go up to values above 193.
I guess there is a Unicode equivalent for them and .NET returns that because it interprets strings as Unicode!
What if someone needs to access the actual value of a byte, whether it is a valid known ASCII character or not?
But for characters greater than 128 I get irrelevant answers
No you don't. You get the bottom 8 bits of the UTF-16 code unit corresponding to the char.
Now if your text were all ASCII, that would be fine - because ASCII only goes up to 127 anyway. It sounds like you're actually expecting the representation in some other encoding - so you need to work out which encoding that is, at which point you can use:
Encoding encoding = ...;
byte[] bytes = encoding.GetBytes(sample);
// Now extract the bytes you want. Note that a character may be represented by more than
// one byte.
If you're essentially looking for an encoding which treats bytes 0 to 255 as U+0000 to U+00FF respectively, you should use ISO-8859-1, which you can access using Encoding.GetEncoding(28591).
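For example, a quick sketch of that byte-for-byte behaviour (the sample string here is just an assumption for illustration):
// needs: using System; using System.Text;
Encoding latin1 = Encoding.GetEncoding(28591);   // ISO-8859-1

string sample = "A©ÿ";                           // U+0041, U+00A9, U+00FF
byte[] bytes = latin1.GetBytes(sample);

foreach (byte b in bytes)
    Console.WriteLine(b);                        // 65, 169, 255

Console.WriteLine(latin1.GetString(bytes) == sample);   // True: lossless round trip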
You can't just ignore the issue of encoding. There is no inherent mapping between bytes and characters - that's defined by the encoding.
If I use your example of 131, on my system, this produces â. However, since you're obviously on an arabic system, you most likely have Windows-1256 encoding, which produces ƒ for 131.
In other words, you need to use the correct encoding when converting characters to bytes and vice versa. In your case,
var sample = "ƒ";
var byteValue = Encoding.GetEncoding("windows-1256").GetBytes(sample)[0];
Which produces 131, as you seem to expect. Most importantly, this will work on all computers - if you want to have this system locale-specific, Encoding.Default can also work for you.
The only reason your method seems to work for values under 128 is that the first 128 Unicode code points coincide with the ASCII mapping, so casting the char gives you the value you expect. However, you're misusing the term ASCII - it really only refers to those 7-bit characters. What you're calling ASCII is actually an extended 8-bit charset - all characters with the 8th bit set are charset-dependent.
We're no longer in a world where you can assume your application will only run on computers with the same locale as yours - .NET is designed for this, which is why all strings are Unicode. At the very least, read http://www.joelonsoftware.com/articles/Unicode.html for an explanation of how encodings work, and to get rid of some serious and dangerous misconceptions.
Theoretical question :
Let's say there is one source which knows only how to transmit ASCII chars. (0..127)
And let's say there is an endpoint which receives these chars .
Can the endpoint decode those chars as utf8 ?
ascii chars
...
...
|
|
V
read as UTF-8?
Something like this pseudo code :
var txt = "אבג";
var _bytes = Encoding.ASCII.GetBytes(txt);                       // <= it won't recognize [א] here
// ...transmit...
var myUtfString = Encoding.UTF8.GetString(getBytesFromWire());   // <= some magic has to be done here
That is possible, but not using UTF8.
UTF-8 works by encoding non-ASCII characters into sequences of bytes whose values are between 128 and 255.
Your ASCII protocol will not be able to transmit those bytes.
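To illustrate (a throwaway sketch, with the Hebrew letter taken from your example): the UTF-8 bytes for a non-ASCII character are all above 127, so an ASCII-only channel cannot carry them.
// needs: using System; using System.Text;
byte[] utf8 = Encoding.UTF8.GetBytes("א");        // U+05D0 HEBREW LETTER ALEF
Console.WriteLine(BitConverter.ToString(utf8));   // D7-90  (both bytes > 127)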
Instead, you need some mechanism to store arbitrary Unicode codepoints or bytes in pure ASCII text:
You can encode the Unicode text using any encoding to get a stream of (non-ASCII) bytes, then transmit those bytes using Base64 encoding
You can use the UTF7 encoding to encode Unicode code points using pure ASCII characters (see the sketch after the Base64 example below).
This will be substantially more space-efficient than Base64 if your text is mostly ASCII.
var txt = "אבג";
var str = Convert.ToBase64String(Encoding.UTF8.GetBytes(txt)); //<--ASCII
//Transmit
var txt2 = Encoding.UTF8.GetString(Convert.FromBase64String(str));
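And if you go the UTF-7 route mentioned above instead, a minimal sketch could look like this (note that UTF-7 is considered obsolete and insecure in newer .NET versions, so treat it purely as an illustration):
var txt = "אבג";
var wire = Encoding.UTF7.GetBytes(txt);      // every byte is <= 127, so an ASCII channel can carry it
//Transmit
var txt2 = Encoding.UTF7.GetString(wire);    // back to the original string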
In C# we can use the classes below to do encoding:
System.Text.Encoding.UTF8
System.Text.Encoding.Unicode (UTF-16)
System.Text.Encoding.ASCII
Why is there no System.Text.Encoding.Base64?
We can only use the Convert.ToBase64String and Convert.FromBase64String methods; what's special about Base64?
Can I say Base64 is the same kind of encoding as UTF-8? Or is UTF-8 a form of Base64?
UTF-8 and UTF-16 are methods to encode Unicode strings to byte sequences.
See: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Base64 is a method to encode a byte sequence to a string.
So, these are widely different concepts and should not be confused.
Things to keep in mind:
Not every byte sequence represents a Unicode string encoded in UTF-8 or UTF-16.
Not every Unicode string represents a byte sequence encoded in Base64.
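A short sketch to make the distinction concrete (the sample text is my own): the character encoding turns the string into bytes, and only then does Base64 turn those bytes into a printable string.
// needs: using System; using System.Text;
string text = "héllo";

byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);        // character encoding: string -> bytes
string base64 = Convert.ToBase64String(utf8Bytes);      // Base64: bytes -> printable ASCII string
Console.WriteLine(base64);                              // aMOpbGxv

// Reversing both steps, in order, restores the original text.
string restored = Encoding.UTF8.GetString(Convert.FromBase64String(base64));
Console.WriteLine(restored == text);                    // True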
Base64 is a way to encode binary data, while UTF8 and UTF16 are ways to encode Unicode text. Note that in a language like Python 2.x, where binary data and strings are mixed, you can encode strings into base64 or utf8 the same way:
u'abc'.encode('utf16')
u'abc'.encode('base64')
But in languages where there's a more well-defined separation between the two types of data, the two ways of representing data generally have quite different utilities, to keep the concerns separate.
UTF-8 is, like the other UTF encodings, a character encoding for the characters of the Unicode character set (UCS).
Base64 is an encoding to represent any byte sequence by a sequence of printable characters (i.e. A–Z, a–z, 0–9, +, and /).
There is no System.Text.Encoding.Base64 because Base64 is not a text encoding but rather a base conversion, like hexadecimal, which uses 0–9 and A–F (or a–f) to represent numbers.
Simply speaking, a character encoding like UTF-8 or UTF-16 is used to match numbers (i.e. bytes) to characters and vice versa; for example, in ASCII 65 is matched to "A". A base encoding, on the other hand, is used mainly to translate bytes to bytes so that the resulting bytes are printable and form a subset of the ASCII character set, so you can also see Base64 as a bytes-to-text encoding mechanism. The main reason to use Base64 is to transmit data over a channel that doesn't allow binary data transfer.
That said, it should now be clear that you can have a Base64-encoded stream that represents a UTF-8 encoded stream.
I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters with Umlauts and Accents. Is this even possible?
string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>();
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();
When I get the reportString back the characters are represented faithfully. When I save the text file I have ? in place of the special characters.
As I understand it, the ASCII standard is for American English only and something like UTF-8 would be for an international audience. Is this correct?
I'm going to make the statement that if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly.
Or, am I way off and doing/saying something stupid?
You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
Before Unicode this was handled by "code pages", you can think of a code page as a mapping between Unicode characters and the 256 values that can fit into a single byte (obviously, in every code page most of the Unicode characters are missing).
The original ASCII code page includes only English letters - but it's unlikely someone really wants the original 7-bit code page, they probably call any 8-bit character set ASCII.
The Western European code page known as Latin-1 is ISO-8859-1 or Windows-1252 (the first is the ISO standard, the second is the closest code page supported by Windows).
To support characters not in Latin-1 you need to encode using different code pages, for example:
874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese
UTF-8 is something completely different: it encodes the entire Unicode character set by using a variable number of bytes per character. Numbers and English letters are encoded the same as in ASCII (and Windows-1252), while most other languages are encoded at 2 to 4 bytes per character.
UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.
Converting between .NET strings (UTF-16LE) and other encodings is done with the System.Text.Encoding class.
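For example, a hedged sketch of that conversion (Windows-1252 and the sample string are my own assumptions; this code page covers the Western European characters, but not Greek, which would need code page 1253 or Unicode):
// needs: using System; using System.Text;
string report = "Grüße, café, naïve";

Encoding win1252 = Encoding.GetEncoding(1252);      // Windows-1252 (Latin, Western European)
byte[] encoded = win1252.GetBytes(report);          // one byte per character in this code page

// The receiving system must decode with the same code page, or it sees gibberish.
string decoded = win1252.GetString(encoded);
Console.WriteLine(decoded == report);               // True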
IMPORTANT NOTE: the most important thing is that the system on the receiving end uses the same code page as the system on the sending end - otherwise you will get gibberish.
The ASCII character set only contains A-Z in upper and lower case, digits, and some punctuation. No Greek characters, no umlauts, no accents.
You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.
The problem with using a different character set than ASCII is that you have to use the correct one, i.e. the one that the receiving part is expecting, or it will fail to interpret any of the extended characters correctly.
You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
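For instance, a small sketch (just an illustration) that lists every encoding the runtime knows about, so you can pick the code page the receiving side expects:
// needs: using System; using System.Text;
foreach (EncodingInfo info in Encoding.GetEncodings())
{
    Console.WriteLine("{0,6}  {1,-20}  {2}", info.CodePage, info.Name, info.DisplayName);
}

// Then create the one the receiver expects, e.g.:
Encoding enc = Encoding.GetEncoding("windows-1252");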
You are correct.
Pure US ASCII is a 7-bit encoding, featuring English characters only.
You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.
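A tiny sketch of that compatibility (the sample text is my own): for pure-ASCII text the UTF-8 bytes are identical to the ASCII bytes.
// needs: using System; using System.Linq; using System.Text;
string plain = "Hello, world";

byte[] ascii = Encoding.ASCII.GetBytes(plain);
byte[] utf8 = Encoding.UTF8.GetBytes(plain);

Console.WriteLine(ascii.SequenceEqual(utf8));   // True - identical byte sequences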
There's no way to get all the accents you want in ASCII; some accented characters (like ü) are, however, available in "extended ASCII" (8-bit) character sets.
Various of the encodings mentioned by other answers can be loosely described as extended ASCII.
When your users are asking for ASCII encoding, they are probably asking for one of these.
A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.
The above is only partially correct. While it's true that you can't encode those characters in ASCII, you can still represent them with substitutions (a code sketch of applying them follows the list). These substitutions exist because some typewriters and early computers couldn't handle those characters:
Ä=Ae
ä=ae
ö=oe
Ö=Oe
ü=ue
Ü=Ue
ß=sz
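A hedged sketch of applying those substitutions before ASCII-encoding the report (the dictionary name mirrors the one declared in the question; the mapping and sample string here cover only the German examples above):
// needs: using System; using System.Collections.Generic; using System.Text;
var replaceCharacters = new Dictionary<string, string>
{
    { "Ä", "Ae" }, { "ä", "ae" },
    { "Ö", "Oe" }, { "ö", "oe" },
    { "Ü", "Ue" }, { "ü", "ue" },
    { "ß", "sz" },
};

string reportString = "Müller grüßt Äpfel";
foreach (var pair in replaceCharacters)
    reportString = reportString.Replace(pair.Key, pair.Value);

// Every remaining character now fits into plain 7-bit ASCII.
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Console.WriteLine(Encoding.ASCII.GetString(encodedReport));   // Mueller grueszt Aepfel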
Edit:
Andy Raddatz has already written code that replaces lots of Unicode characters with ASCII representations. They might not be correct for some languages/cultures, but at least you won't have encoding errors.
https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c