Given a hexadecimal string, how do I convert it so that it matches the characters in the CharacterMap table? I have tried splitting the hex string into consecutive two-character codes and then decoding each code using System.Globalization.NumberStyles.HexNumber. But sometimes this goes wrong, and in those cases the above logic fails; splitting the hex string into four-character substrings and decoding those produces the correct result.
For example:
In one case, the hex string <030402>, split into the substrings 03, 04, and 02, decodes correctly.
In another case, the hex string <0000>, split as 00, 00, decodes incorrectly; here, converting the hex string as a whole, 0000, produces the correct result.
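Here is roughly what my two-character splitting looks like (a simplified sketch, not my exact code):
// Simplified sketch of the two-character splitting described above.
string hex = "030402";
for (int i = 0; i < hex.Length; i += 2)
{
    string pair = hex.Substring(i, 2);
    int code = int.Parse(pair, System.Globalization.NumberStyles.HexNumber);
    Console.WriteLine(code); // prints 3, then 4, then 2
}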
Could anyone help me with this? Thanks in advance.
Seeing some of your specific breaking examples with expected and actual results would help. But it sounds like you may not be accounting for Unicode's variable encoding lengths. Under UTF-8, the range of a character's first byte indicates the character's size:
00-7f : one byte
c2-df : two bytes
e0-ef : three bytes
f0-f4 : four bytes
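If that's what you're up against, you can determine the code length from the lead byte. A minimal sketch (my own illustration, built directly from the ranges above):
// Returns the number of bytes in a UTF-8 sequence, given its lead byte.
static int Utf8SequenceLength(byte lead)
{
    if (lead <= 0x7F) return 1;                  // ASCII range
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    throw new ArgumentException("Not a valid UTF-8 lead byte.");
}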
I am having an issue trying to convert a string containing Simplified Chinese into a double-byte encoding (GB2312). This is for printing Chinese characters on a Zebra printer.
The spec I am looking at shows an example with the text "冈区色呆", which it shows converting to the hex value 38_54_47_78_49_2b_34_74.
In my C# code I am trying to convert this using the code below as a test. Each byte of my result seems to have its leading hex digit off by 8. What am I missing here?
private const string SimplifiedChineseChars = "冈区色呆";

[TestMethod]
public void GetBackCorrectHexValues()
{
    byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    // I get the following: B8_D4_C7_F8_C9_AB_B4_F4
    // I am expecting:      38_54_47_78_49_2b_34_74
}
The only thing that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding.
Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.
Here is the Wikipedia entry on HZ. Interesting parts:
The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.
the HZ code uses only printable, 7-bit characters to represent Chinese characters.
And, according to this Microsoft reference page on EncodingInfo.GetEncoding, this character encoding is supported in .NET:
52936 hz-gb-2312 Chinese Simplified (HZ)
If I take your code and change the character encoding to HZ, I get:
// requires using System and System.Text
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    Console.WriteLine(hex);
}
Output:
7E_7B_38_54_47_78_49_2B_34_74_7E_7D
So you basically get exactly what you are looking for, except that it adds the escape sequences ~{ and ~} before and after the Chinese character bytes. Those escape sequences are necessary because this encoding supports mixing ASCII bytes (single-byte encoding) with GB Chinese character bytes (double-byte encoding). The escape sequences mark the regions that should not be interpreted as ASCII.
If you choose to use the hz-gb-2312 encoding, you will have to strip any unwanted escape sequences yourself, if you decide you don't need them. But perhaps you do need them: you'll have to figure out exactly what your printer is expecting.
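If it turns out you don't need them, a hypothetical helper to strip them could look like this (it only handles the ~{ and ~} sequences shown above; HZ defines a few more, such as ~~ for a literal tilde):
// Hypothetical helper: removes the HZ escape sequences ~{ (0x7E 0x7B)
// and ~} (0x7E 0x7D) from the encoded bytes.
// Requires using System.Collections.Generic.
static byte[] StripHzEscapes(byte[] bytes)
{
    var result = new List<byte>();
    for (int i = 0; i < bytes.Length; i++)
    {
        if (bytes[i] == 0x7E && i + 1 < bytes.Length &&
            (bytes[i + 1] == 0x7B || bytes[i + 1] == 0x7D))
        {
            i++; // skip both bytes of the escape sequence
            continue;
        }
        result.Add(bytes[i]);
    }
    return result.ToArray();
}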
Alternatively, if you really don't want those escape sequences, are not worried about handling ASCII characters, and are confident that you only have to deal with Chinese double-byte characters, then you can stick with the vanilla GB2312 encoding and drop the most significant bit of every byte yourself, essentially converting the result to a 7-bit encoding.
Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.
// requires using System, System.Linq and System.Text
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] bytes = Encoding.GetEncoding("gb2312")      // vanilla gb2312 encoding
        .GetBytes(SimplifiedChineseChars)
        .Select(b => (byte)(b & 0x7F))                 // retain 7 bits only
        .ToArray();
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    Console.WriteLine(hex);
}
Output:
38_54_47_78_49_2B_34_74
In C# I need to get the ASCII code of some characters.
So I convert the char to a byte or an int, then print the result.
String sample="A";
int AsciiInt = sample[0];
byte AsciiByte = (byte)sample[0];
For characters with ASCII codes of 128 or less, I get the right answer.
But for characters greater than 128, I get irrelevant answers!
I am sure all characters are less than 0xFF.
Also, I have tested System.Text.Encoding and got the same results.
For example, I get 172 for a char with an actual byte value of 129!
Actually, ASCII characters like ƒ, ‡, ‹, ", ¥, ©, Ï, ³, ·, ½, », Á each take one byte, and their values go up to more than 193.
I guess there is a Unicode equivalent for them, and .NET returns that because it interprets strings as Unicode!
What if someone needs to access the actual value of a byte, whether or not it is a valid, known ASCII character?
"But for characters greater than 128, I get irrelevant answers"
No, you don't. You get the bottom 8 bits of the UTF-16 code unit corresponding to the char.
Now if your text were all ASCII, that would be fine - because ASCII only goes up to 127 anyway. It sounds like you're actually expecting the representation in some other encoding - so you need to work out which encoding that is, at which point you can use:
Encoding encoding = ...;
byte[] bytes = encoding.GetBytes(sample);
// Now extract the bytes you want. Note that a character may be represented by more than
// one byte.
If you're essentially looking for an encoding which treats bytes 0 to 255 as U+0000 to U+00FF respectively, you should use ISO-8859-1, which you can access using Encoding.GetEncoding(28591).
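For example, a quick sketch of that last suggestion (requires using System and System.Text):
Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1
byte[] bytes = latin1.GetBytes("A¥Á");
foreach (byte b in bytes)
{
    Console.WriteLine(b); // 65, 165, 193 - one byte per character
}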
You can't just ignore the issue of encoding. There is no inherent mapping between bytes and characters - that's defined by the encoding.
If I use your example of 131, on my system, this produces â. However, since you're obviously on an arabic system, you most likely have Windows-1256 encoding, which produces ƒ for 131.
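To illustrate, here is a quick sketch of how the same byte decodes differently under different code pages (on .NET Core, these code pages require registering CodePagesEncodingProvider first):
byte[] data = { 131 };
Console.WriteLine(Encoding.GetEncoding(437).GetString(data));  // â  (OEM United States)
Console.WriteLine(Encoding.GetEncoding(1256).GetString(data)); // ƒ  (Windows Arabic)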
In other words, you need to use the correct encoding when converting characters to bytes and vice versa. In your case:
var sample = "ƒ";
var byteValue = Encoding.GetEncoding("windows-1256").GetBytes(sample)[0];
This produces 131, as you seem to expect. Most importantly, it will work on all computers. If you want this to be system locale-specific, Encoding.Default can also work for you.
The only reason your method seems to work for bytes under 128 is that the first 128 characters of Unicode (and of UTF-8) correspond to the ASCII standard mapping. However, you're misusing the term ASCII - it really only refers to these 7-bit characters. What you're calling ASCII is actually an extended 8-bit charset, and all characters with the 8th bit set are charset-dependent.
We're no longer in a world where you can assume your application will only run on computers with the same locale as yours - .NET is designed for this, which is why all strings are Unicode. At the very least, read http://www.joelonsoftware.com/articles/Unicode.html for an explanation of how encodings work, and to get rid of some of the serious and dangerous misconceptions you seem to have.
I am parsing values from a binary file. One value I am parsing is a 16-bit number which represents the UCS-2 encoding of a unicode character. I'm converting it to a character like this:
char c = (char)myInteger;
Is this safe?
Yes, as long as there are no byte-ordering issues this should be fine.
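For example, a sketch of reading such a value (the file name is made up; requires using System and System.IO):
using (var reader = new BinaryReader(File.OpenRead("data.bin")))
{
    ushort codeUnit = reader.ReadUInt16(); // BinaryReader reads little-endian
    // For a big-endian file, swap the bytes first:
    // codeUnit = (ushort)((codeUnit >> 8) | (codeUnit << 8));
    char c = (char)codeUnit;
    Console.WriteLine(c);
}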
From my previous question, Converting chinese character to Unicode, I had a good answer but with some code I didn't understand:
Console.WriteLine("U+{0:x4}", (int)myChar);
Could anyone explain this?
Console.WriteLine("U+{0:x4}", (int)myChar);
is the equivalent to the call:
Console.WriteLine("U+{0}", ((int)myChar).ToString("x4"));
In a format string, the : indicates that the item should be displayed using the provided format. The x4 part indicates that the integer should be printed in its hexadecimal form using 4 characters. Refer to standard numeric format strings for more information.
The 0 indicates which positional argument to substitute. The x displays a hexadecimal number, the 4 has it display four digits.
For example, the character ȿ (LATIN SMALL LETTER S WITH SWASH TAIL, code point 575) is printed as U+023F, since 575 in decimal is 23F in hexadecimal.
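Putting it together as a runnable snippet:
char myChar = 'ȿ';
Console.WriteLine("U+{0:x4}", (int)myChar); // U+023f
Console.WriteLine("U+{0:X4}", (int)myChar); // U+023F (uppercase X gives uppercase hex digits)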
That will simply create a literal string like "U+1234"... if instead you want to convert a Unicode code point into a char, you want Convert.ToChar(myChar).
http://msdn.microsoft.com/en-us/library/3hkfdkcx.aspx
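For instance:
int codePoint = 0x023F;
char c = Convert.ToChar(codePoint); // 'ȿ'
// For code points above U+FFFF, use char.ConvertFromUtf32, which returns a string.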
How do I build an escape sequence string in hexadecimal notation?
Example:
string s = "\x1A"; // this will create the hex-value 1A or dec-value 26
I want to be able to build strings with hex values between 00 and FF, like this (in this example, 1B):
string s = "\x" + "1B"; // Unrecognized escape sequence
Maybe there's another way of making hexadecimal strings...
Please try to avoid the \x escape sequence. It's difficult to read because where it stops depends on the data. For instance, how much difference is there at a glance between these two strings?
"\x9Good compiler"
"\x9Bad compiler"
In the former, the "\x9" is tab - the escape sequence stops there because 'G' is not a valid hex character. In the second string, "\x9Bad" is all an escape sequence, leaving you with some random Unicode character and " compiler".
I suggest you use the \u escape sequence instead:
"\u0009Good compiler"
"\u0009Bad compiler"
(Of course for tab you'd use \t but I hope you see what I mean...)
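To see the difference concretely, here is a small demonstration of the two strings above:
string good = "\x9Good compiler"; // "\x9" stops at 'G': tab + "Good compiler"
string bad = "\x9Bad compiler";   // "\x9Bad" is a single char, U+9BAD, + " compiler"
Console.WriteLine((int)good[0]);  // 9 (tab)
Console.WriteLine((int)bad[0]);   // 39853 (0x9BAD)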
This is somewhat aside from the original question of course, but that's been answered already :)
You don't store hexadecimal values in strings.
You can, but it would just be that - a string - and it would have to be parsed into an integer or a byte to actually read its value.
You can assign a hexadecimal value as a literal to an int or a byte though:
byte byteValue = 0xFF;
int intValue = 0x1B;
So it's easy to pass a hexadecimal literal into your string:
string foo = String.Format("{0} hex test", 0x0BB);
That would create the string "187 hex test", since 0x0BB is 187 in decimal.
But I don't think that's what you wanted?
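If the goal is to build the string from a hex value that is only known at runtime (which seems to be what the original question is after), one sketch is to parse the hex first and then cast to char:
string hexCode = "1B";
char c = (char)Convert.ToByte(hexCode, 16); // 0x1B = 27
string s = c.ToString();                    // same content as "\x1B"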
There's a \u escape sequence for 16-bit hexadecimal Unicode character codes:
Console.WriteLine("Look, I'm so happy: \u263A");