How can I deserialize emoji in JSON in C#?

I have a JSON file that includes emoji. When I try to deserialize it, the emoji cannot be deserialized to a string.
My code is:
var mystring ={"message":"jjasdajdasjdj laslla aasdasd ssdfdsf!!! 🙌\u{1F3FD}", "updated_time":"2015-04-14T22:37:13+0000", "id":"145193995506_148030368559"}
FaceBookIdea ideaDetails = JsonConvert.DeserializeObject<FaceBookIdea>((mystring).ToString());
The error is:
{"Input string was not in a correct format."}
When I remove the emoji it works well.
Thanks a lot for your help.

Your problem is that this portion of your message string does not conform to the JSON standard:
"\u{1F3FD}"
According to the standard, \u followed by exactly four hex digits represents a UTF-16 code unit (for characters in the Basic Multilingual Plane, this is the code point itself). Your string \u{1F3FD}, with its curly braces, does not follow this convention, so Json.NET throws an exception when it tries to parse it. You will see a similar error if you upload your JSON to https://jsonformatter.curiousconcept.com/.
To make your JSON conform to the standard, you need to write the character as \uXXXX using four hex digits. However, your character, U+1F3FD, is larger than 0xFFFF and lies outside the Unicode Basic Multilingual Plane, so it cannot be represented as a single 4-digit hex number. C# (and UTF-16 in general) represents such characters as surrogate pairs: two 16-bit chars. JSON does the same. The UTF-16 (hex) representation of your character is
0xD83C 0xDFFD
Thus your JSON character needs to be:
\uD83C\uDFFD
And for your entire string:
{"message":"jjasdajdasjdj laslla aasdasd ssdfdsf!!! 🙌\uD83C\uDFFD", "updated_time":"2015-04-14T22:37:13+0000", "id":"145193995506_148030368559"}
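If the JSON arrives with many such \u{...} escapes, the conversion described above can be automated before the text is handed to the parser. The following is only a sketch: the class and method names (JsonEscapeRepair, ToJsonEscape, FixBraceEscapes) are made up for illustration, and the regex assumes the braces form never appears after an escaped backslash (\\u{...}), which it would rewrite incorrectly.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class JsonEscapeRepair
{
    // Format one Unicode code point as JSON \uXXXX escapes.
    // char.ConvertFromUtf32 yields two chars (a surrogate pair) above U+FFFF
    // and throws for values that are not valid code points.
    public static string ToJsonEscape(int codePoint)
    {
        string utf16 = char.ConvertFromUtf32(codePoint);
        var sb = new StringBuilder();
        foreach (char c in utf16)
        {
            sb.Append("\\u").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Rewrite every non-standard \u{...} escape in raw JSON text
    // into standard four-digit escapes.
    public static string FixBraceEscapes(string json)
    {
        return Regex.Replace(
            json,
            @"\\u\{([0-9A-Fa-f]{1,6})\}",
            m => ToJsonEscape(Convert.ToInt32(m.Groups[1].Value, 16)));
    }
}
```

The repaired text can then be passed to JsonConvert.DeserializeObject as in the question.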

Related

Decode UTF-8 bytes as Latin-1 characters

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This tool converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?

It is hard to tell exactly what is going on from the description in your question. It would help if you provided an example using a single character instead of a whole string, ideally a simple and well-known character such as the bullet (U+2022).
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±". That is because in Latin-1 the byte D8 is Ø (U+00D8) and the byte B1 is ± (U+00B1). So the incoming text was originally in UTF-8, but in the process of importing it into a .NET string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
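I cannot easily reproduce your situation, so here is something to try. First recover the original bytes using the 8-bit encoding that was wrongly applied. Windows-1252 is a guess here; Encoding.GetEncoding("ISO-8859-1") or, on .NET Framework, Encoding.Default are the other candidates. (There is no Encoding.Ansi property, and on .NET Core code page 1252 is only available after registering CodePagesEncodingProvider.)
byte[] bytes = System.Text.Encoding.GetEncoding(1252).GetBytes( garbledUnicodeString );
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );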
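To make the round trip concrete, here is a small sketch using the single-character example from the explanation above. The class name MojibakeRepair is hypothetical, and the code assumes the text was mis-decoded as ISO-8859-1 (Latin-1). The sample in the question contains characters such as "…" that only exist in Windows-1252, so for that exact sample code page 1252 would have to be substituted.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Undo "UTF-8 bytes decoded as Latin-1": map each char back to the
    // single byte it came from, then decode those bytes as UTF-8.
    public static string Repair(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] originalBytes = latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

For example, MojibakeRepair.Repair("Ø±") recovers the bytes D8 B1 and decodes them as the single letter "ر".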

custom encode ascii chars between 0-31 in C#

I read some data from a device and then send it to a web server as XML. Because the data has to be represented in XML, I need to convert the characters between 0 and 31, since these characters cannot be represented in XML directly.
The question is: how can I convert the characters between 0 and 31 (decimal) in a string into something like [00]abcde[01]fgh[02]...
Is there a built-in function in the .NET Framework, or an accepted pattern, for this?
Thanks

You should use standard XML encoding (numeric character references).
Your XML API will do that for you, so you normally don't need to handle it yourself.

You can simply encode the number as an XML entity: write &# followed by the number and a semicolon,
so 1 becomes &#1; and 13 becomes &#13; and so on.
However, as noted by dan04, you can't represent 0 as a numeric character reference, so if your data might include 0 you will have to use a different encoding. Note also that XML 1.0 only permits &#9;, &#10; and &#13; in this range; references to the other control characters are allowed only in XML 1.1, so a strict XML 1.0 parser will reject them. You could instead encode the entire binary data as Base64.
Most XML toolkits will do the encoding to NCRs for you, though, so you usually shouldn't have to worry about it.
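If a custom scheme like the [00] notation from the question is still wanted, a minimal sketch could look like the following. The class name ControlCharEncoder is hypothetical, and the two-digit decimal format is an assumption based on the question's "0-31 decimal". The scheme is ambiguous if the data can itself contain text such as "[01]", so the receiving side would need an agreed escape for literal brackets.

```csharp
using System;
using System.Text;

public static class ControlCharEncoder
{
    // Replace every character below 32 with "[NN]", where NN is the
    // two-digit decimal character code; all other characters pass through.
    public static string Encode(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c < 32)
            {
                sb.Append('[').Append(((int)c).ToString("D2")).Append(']');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```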

Decoding Base64 / Quoted Printable encoded UTF8 string

In my ASP.NET application I need to process a string that looks something like
=?utf-8?B?SWhyZSBCZXN0ZWxsdW5nIC0gVmVyc2FuZGJlc3TDpHRpZ3VuZyAtIDExMDU4OTEyNDY=?=
How can I decode it into readable text?
Thanks in advance!
Update:
Convert.FromBase64String() does not work for the string
=?UTF-8?Q?Bestellbest=C3=A4tigung?=
I get the exception: "The format of s is invalid. s contains a non-base-64 character, more than two padding characters, or a non-white space-character among the padding characters."
Update:
Solution Here
Alternative solution
Update:
What kind of string encoding is that: Nweiß ???

It's actually a base-64 string:
string zz = "SWhyZSBCZXN0ZWxsdW5nIC0gVmVyc2FuZGJlc3TDpHRpZ3VuZyAtIDExMDU4OTEyNDY=";
byte[] dd = Convert.FromBase64String(zz);
// Returns Ihre Bestellung - Versandbestätigung - 1105891246
string yy = System.Text.Encoding.UTF8.GetString(dd);

I've written a library that will decode these sorts of strings. You can find it at http://github.com/jstedfast/MimeKit
Specifically, take a look at MimeKit.Utils.Rfc2047.DecodeText()

This seems to be MIME Header Encoding. The Q in your second example indicates that it is Quoted Printable.
This question seems to cover the variants fairly well. In a quick search I didn't find any .NET libraries to decode this automatically, but it shouldn't be hard to do manually if you need to.

That's not UTF-8. That's a Base64-encoded string.
The "UTF-8" part only indicates that the decoded string is in UTF-8.
After decoding the Base64 string:
SWhyZSBCZXN0ZWxsdW5nIC0gVmVyc2FuZGJlc3TDpHRpZ3VuZyAtIDExMDU4OTEyNDY=
You'll get the following result:
Ihre Bestellung - Versandbestätigung - 1105891246
See Base64 online decode/encode

Looks like a base64 string.
Try Convert.FromBase64String
http://msdn.microsoft.com/en-us/library/system.convert.frombase64string.aspx

This is an encoded word, which is used in email headers when there is non-ASCII content. Encoded words are defined in RFC 2047:
https://www.rfc-editor.org/rfc/rfc2047#section-2
The BNF for an encoded word is:
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
So the correct way to interpret this is:
The data is the stuff between the 3rd and 4th question marks.
It has been Base64 encoded (the 'B' stands for Base64; if it were a 'Q' then it would be quoted-printable).
Once you decode the data, it will be in the UTF-8 character set.
The result, as @Shai correctly pointed out, is:
Ihre Bestellung - Versandbestätigung - 1105891246
This is German. The umlaut is obviously the reason for the UTF-8 and thus the need for an encoded word. The translation is:
Your order - Delivery confirmation - 1105891246
Apparently it's a tracking number for an order.
All modern email clients (and Outlook) transparently support encoded words.
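The grammar above can be turned into a small decoder. This is a rough sketch rather than a full RFC 2047 implementation: the class name EncodedWordDecoder is made up, it does not merge adjacent encoded words or handle language suffixes in the charset, and it assumes the charset name is one that Encoding.GetEncoding recognises. For production use, a library such as the MimeKit one mentioned in another answer is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodedWordDecoder
{
    // encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
    private static readonly Regex Pattern =
        new Regex(@"=\?([^?]+)\?([bBqQ])\?([^?]*)\?=");

    public static string Decode(string input)
    {
        return Pattern.Replace(input, m =>
        {
            Encoding charset = Encoding.GetEncoding(m.Groups[1].Value);
            string text = m.Groups[3].Value;
            byte[] bytes = m.Groups[2].Value.ToUpperInvariant() == "B"
                ? Convert.FromBase64String(text)
                : DecodeQ(text);
            return charset.GetString(bytes);
        });
    }

    // "Q" encoding: "_" means space, "=XX" is a hex-encoded byte,
    // every other character stands for itself.
    private static byte[] DecodeQ(string text)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '_')
            {
                bytes.Add(0x20);
            }
            else if (c == '=' && i + 2 < text.Length)
            {
                bytes.Add(Convert.ToByte(text.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
            {
                bytes.Add((byte)c);
            }
        }
        return bytes.ToArray();
    }
}
```

Calling EncodedWordDecoder.Decode on either of the two strings from the question handles both the B and the Q variant with the same entry point.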

This is a bit of guesswork, but let's try:
Remove =? from the start and ?= from the end.
Keep the part up to the next ? as the character set.
Remove the B? that follows (I don't know what it is).
Convert the rest to a byte[] via System.Convert.FromBase64String().
Convert this to the final string via Encoding.GetString(), using the character set remembered in the second step.

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I wrote an algorithm whose results are binary values of different lengths. I convert each one to a uint, then to a char, and append it to a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values to a .txt file I get the error mentioned above.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What i am saving is this: "忿췾᷿］볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿￦ﾏ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of chars and save it to a .txt file. I saw somewhere that converting to UTF-8 should help, but I don't know how to do that. Would changing the file's encoding help too?

You cannot transform binary data to a string directly. Strings in .NET are encoded as UTF-16, which uses 16-bit code units and so provides 65536 distinct values. Unicode, however, has over one million code points. To make that work, code points above \uffff (above the BMP, the Basic Multilingual Plane) are encoded with a surrogate pair. The first one (the high surrogate) has a value between 0xd800 and 0xdbff, the second (the low surrogate) between 0xdc00 and 0xdfff. That provides 2 ^ (10 + 10) = about 1 million additional codes.
You can perhaps see where this leads: in your case the encoder found a low surrogate value (0xdfff) that isn't preceded by a high surrogate. That's illegal. There are many more possible mishaps: several code points are unassigned, and several are diacritics that get mangled when the string is normalized.
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, 3 bytes require 4 characters. The character set is ASCII so the odds of the receiving program decoding the character back to binary incorrectly are minimal. Only a decades old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
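As a sketch of the Base64 route, the 16-bit values from the question can be packed into bytes and written out as plain ASCII text. The class and method names (BinaryTextCodec, Encode, Decode) are hypothetical, and the code assumes each input string holds at most 16 binary digits, as in the question's Convert.ToUInt16(tmp_chars, 2) call.

```csharp
using System;
using System.Collections.Generic;

public static class BinaryTextCodec
{
    // Pack each group of binary digits into two bytes (big-endian)
    // and return the whole buffer as Base64 text.
    public static string Encode(IEnumerable<string> bitGroups)
    {
        var bytes = new List<byte>();
        foreach (string bits in bitGroups)
        {
            ushort value = Convert.ToUInt16(bits, 2);
            bytes.Add((byte)(value >> 8));
            bytes.Add((byte)(value & 0xFF));
        }
        return Convert.ToBase64String(bytes.ToArray());
    }

    // Recover the original bytes from the Base64 text.
    public static byte[] Decode(string base64Text)
    {
        return Convert.FromBase64String(base64Text);
    }
}
```

The resulting string contains only ASCII characters, so it can be written with a StreamWriter in any common encoding without triggering the surrogate error.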

Since you're trying to encode binary data to a text stream this SO question already contains an answer to the question: "How do I encode something as base64?" From there plain ASCII/ANSI text is fine for the output encoding.

Can a Base64 String contain tabs?

Simple yes or no question, and I'm 90% sure that it is no... but I'm not sure.
Can a Base64 string contain tabs?

It depends on what you're asking. If you are asking whether or not tabs can be base-64 encoded, then the answer is "yes" since they can be treated the same as any other ASCII character.
However, if you are asking whether or not base-64 output can contain tabs, then the answer is no. The following link is for an article detailing base-64, including which characters are considered valid:
http://en.wikipedia.org/wiki/Base64

The short answer is no - but Base64 cannot contain carriage returns either.
That is why, if you have multiple lines of Base64, you strip out any carriage returns, line feeds, and anything else that is not in the Base64 alphabet
That includes tabs.

From Wikipedia:
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.
As you can see, tab characters are not included. However, you can of course encode a tab character into a base64 string.

Sure. Tab is just ASCII character 9, and that has a base64 representation just like any other integer.

Base64 specification (RFC 4648) states in Section 3.3 that any encountered non-alphabet characters should be rejected unless explicitly allowed by another specification:
Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may instead state, as MIME does, that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept"). Note that this means that any adjacent carriage return/ line feed (CRLF) characters constitute "non-alphabet characters" and are ignored.
Specs such as PEM (RFC 1421) and MIME (RFC 2045) specify that Base64 strings can be broken up by whitespaces. Per referenced RFC 822, a tab (HTAB) is considered a whitespace character.
So, when Base64 is used in context of either MIME or PEM (and probably other similar specifications), it can contain whitespace, including tabs, which should be handled (stripped out) while decoding the encoded content.
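The difference between the strict reading of RFC 4648 and the lenient MIME/PEM reading can be sketched as follows. The class and method names (Base64Whitespace, IsStrictBase64, DecodeLenient) are invented for this illustration, and the strict check only covers the standard alphabet with padding, not the URL-safe variant.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Base64Whitespace
{
    // Strict view: only alphabet characters plus up to two '=' at the end,
    // with a total length that is a multiple of four. \z (not $) is used so
    // that a trailing newline is not silently accepted.
    public static bool IsStrictBase64(string s)
    {
        return s.Length % 4 == 0
            && Regex.IsMatch(s, @"^[A-Za-z0-9+/]*={0,2}\z");
    }

    // Lenient view (MIME/PEM style): drop all whitespace, including tabs,
    // before decoding.
    public static byte[] DecodeLenient(string s)
    {
        string stripped = new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray());
        return Convert.FromBase64String(stripped);
    }
}
```

With this sketch, "ABCD\tDEFG" fails the strict check but still decodes under the lenient one.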

Haha, as you can see from the responses, this is actually not such a simple yes-or-no question.
A Base64 string produced by conversion cannot contain a tab character. It seems to me, though, that you are not asking that; you are asking whether a string containing a tab (before conversion) can be represented in Base64, and the answer to that is yes.
I would add that you should take care to preserve the encoding of your string, i.e. convert it to an array of bytes with the correct encoding (Unicode, UTF-8, whatever) and then convert that array of bytes to Base64.
EDIT: A simple test.
private void button2_Click(object sender, EventArgs e)
{
    StringBuilder sb = new StringBuilder();
    string test = "The rain in spain falls \t mainly on the plain";
    sb.AppendLine(test);
    UTF8Encoding enc = new UTF8Encoding();
    byte[] b = enc.GetBytes(test);
    string cvtd = Convert.ToBase64String(b);
    sb.AppendLine(cvtd);
    byte[] c = Convert.FromBase64String(cvtd);
    string backAgain = enc.GetString(c);
    sb.AppendLine(backAgain);
    MessageBox.Show(sb.ToString());
}

It seems that there is lots of confusion here; and surprisingly most answers are of "No" variety. I don't think that is a good canonical answer.
The reason for confusion is probably the fact that Base64 is not strictly specified; multiple practical implementations and interpretations exist.
In general, however, conforming base64 codecs SHOULD understand linefeeds, as they are mandated by some base64 definitions (76 character segments, then linefeed etc).
Because of this, most decoders also allow for indentation whitespace, and quite commonly any whitespace between 4-character "triplets" (so named since they encode 3 bytes).
So there's a good chance that in practice you can use tabs and other white space.
But I would not add tabs myself if generating base64 content sent to a service -- be conservative at what you send, (more) liberal at what you receive.

Convert.FromBase64String() in the .NET framework does not seem to mind them. I believe all whitespace in the string is ignored.
string xxx = "ABCD\tDEFG"; //simulated Base64 encoded string w/added tab
Console.WriteLine(xxx);
byte[] xx = Convert.FromBase64String(xxx); // convert string back to binary
Console.WriteLine(BitConverter.ToString(xx));
output:
ABCD DEFG
00-10-83-0C-41-46
The relevant clause of RFC 2045 (section 6.8):
The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.

YES!
Base64 is used to encode ANY 8bit value (Decimal 0 to 255) into a string using a set of safe characters. TAB is decimal 9.
Base 64 uses one of the following character sets:
Data: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
URLs: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
Binary Attachments (eg: email) in text are also encoded using this system.
