I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
مدل-رنگ-موی-جدید-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link convert this correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How I can do it in c#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided us with an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (u2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as a byte sequence of D8 B1, but what you see is "ر", and that's because in UTF-16 Ø is u00D8 and ± is u00B1. So, the incoming text was originally in UTF-8, but in the process of importing it to a dotNet Unicode String in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode String which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.Ansi.GetBytes( garbledUnicodeString );
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
Hello thank you for reading;
I've tried to decrypt thise Base64 string in anyway possible. Also tried searching Stackoverflow and tried other methods. Every response it's just jibberish.
v6kEwElTQI%2fNlQc87zM7Od2%2fsaAghvSbCVyYaJRTf4U%3d
Hope you've got any ideas!
It's not exactly readable, but it does successfully Base64 decode:
Firstly, I manually de-URL-encoded it. There are three URL escape sequences in there:
%2f : '/' (two of these)
%3d : '='
Giving me v6kEwElTQI/NlQc87zM7Od2/saAghvSbCVyYaJRTf4U=
With those changed, it's Base64.
Then, you can use the old methods of decoding in C# .NET:
byte[] arr = Convert.FromBase64String('v6kEwElTQI/NlQc87zM7Od2/saAghvSbCVyYaJRTf4U=');
Console.WriteLine(System.Text.Encoding.Default.GetString(arr));
// As a note, I'm using Default encoding here. Your system may use
// a different encoding by default. Alternatively, you can replace
// 'Default' here with ASCII or Unicode.
Which gave me ¿©ÀIS#Í•<ï3;9Ý¿± †ô› \˜h”S…
Using unicode I get 쀄卉轀闍㰇㏯㤻뿝ꂱ蘠鯴尉梘厔蕿 instead.
I have ruled out that it's not an image (at least in a System.Drawing.Image format.)
While decoding it was fun, I have to agree with the comments. It's important to know what the data is going to be after it's decoded since it'll only be a nameless stream of bytes at that point.
I have a URL like the following
http://mysite.com/default.aspx?q=%E1
Where %E1 is supposed to be á. When I call Request.QueryString from my C# page I receive
http://mysite.com/default.aspx?q=%ufffd
It does this for any accented character. %E1, %E3, %E9, %ED etc. all get passed as %ufffd. Normal encoded values (%2D, %2E, %27) all get passed correctly.
The config file already has the responseEncoding/requestEncoding in the globalization section set to UTF-8.
How could I read the correct values?
Please note that I'm not the one generating the query string and I have no control over it.
While it's true that á is encoded as U+00E1, the UTF-8 encoding (which is relevant for URL parameters) is 0xC3 0xA1.
You can verify by called a Wikipedia entry on an accented letter, such as http://en.wikipedia.org/wiki/%C3%81
U+FFFD is the Unicode Replacement Character which indicates the a given character value cannot be correctly encoded in Unicode.
Update:
Your question has two points.
First: How do I encode a Unicode string as parameter. Use
"?q=" + HttpUtility.UrlEncode(value)
Second: How do I retrieve a Unicode value? Use:
Request["q"]
If you receive the %E1 from some other source you do not control, maybe the RawUrl can help you. (I have not tried)
Using HttpUtility.UrlEncode and passing via the URL the receiving page sees the variables as:
brand new -> brand+new
Airconaire+Ltd -> Airconaire+Ltd
Can you see how the first and the second both have a + in them where they didn't at the start? I'm assuming this is something to do with the encoding (specifically RFC3986 or RFC2396) but how do I solve this?
I think ideally the spaces should be converted to %20 but is this the best way forward?
Try using HttpUtility.UrlPathEncode rather than URLEncode.
The UrlEncode() method can be used to encode the entire URL, including query-string values. If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. URL encoding converts characters that are not allowed in a URL into character-entity equivalents; URL decoding reverses the encoding. For example, when the characters < and > are embedded in a block of text to be transmitted in a URL, they are encoded as %3c and %3e.
You can encode a URL using with the UrlEncode() method or the UrlPathEncode() method. However, the methods return different results. The UrlEncode() method converts each space character to a plus character (+). The UrlPathEncode() method converts each space character into the string "%20", which represents a space in hexadecimal notation. Use the UrlPathEncode() method when you encode the path portion of a URL in order to guarantee a consistent decoded URL, regardless of which platform or browser performs the decoding.
http://msdn.microsoft.com/en-us/library/4fkewx0t.aspx
Simple yes or no question, and I'm 90% sure that it is no... but I'm not sure.
Can a Base64 string contain tabs?
It depends on what you're asking. If you are asking whether or not tabs can be base-64 encoded, then the answer is "yes" since they can be treated the same as any other ASCII character.
However, if you are asking whether or not base-64 output can contain tabs, then the answer is no. The following link is for an article detailing base-64, including which characters are considered valid:
http://en.wikipedia.org/wiki/Base64
The short answer is no - but Base64 cannot contain carriage returns either.
That is why, if you have multiple lines of Base64, you strip out any carriage returns, line feeds, and anything else that is not in the Base64 alphabet
That includes tabs.
From wikipedia.com:
The current version of PEM (specified
in RFC 1421) uses a 64-character
alphabet consisting of upper- and
lower-case Roman alphabet characters
(A–Z, a–z), the numerals (0–9), and
the "+" and "/" symbols. The "="
symbol is also used as a special
suffix code. The original
specification, RFC 989, additionally
used the "*" symbol to delimit encoded
but unencrypted data within the output
stream.
As you can see, tab characters are not included. However, you can of course encode a tab character into a base64 string.
Sure. Tab is just ASCII character 9, and that has a base64 representation just like any other integer.
Base64 specification (RFC 4648) states in Section 3.3 that any encountered non-alphabet characters should be rejected unless explicitly allowed by another specification:
Implementations MUST reject the
encoded data if it contains
characters outside the base alphabet
when interpreting base-encoded
data, unless the specification
referring to this document explicitly
states otherwise. Such specifications
may instead state, as MIME does,
that characters outside the base
encoding alphabet should simply be
ignored when interpreting data ("be
liberal in what you accept").
Note that this means that any
adjacent carriage return/ line feed
(CRLF) characters constitute
"non-alphabet characters" and are
ignored.
Specs such as PEM (RFC 1421) and MIME (RFC 2045) specify that Base64 strings can be broken up by whitespaces. Per referenced RFC 822, a tab (HTAB) is considered a whitespace character.
So, when Base64 is used in context of either MIME or PEM (and probably other similar specifications), it can contain whitespace, including tabs, which should be handled (stripped out) while decoding the encoded content.
Haha, as you see from the responses, this is actually not such a simple yes no answer.
A resulting Base64 string after conversion cannot contain a tab character, but It seems to me that you are not asking that, seems to me that you are asking can you represent a string (before conversion) containing a tab in Base64, and the answer to that is yes.
I would add though that really what you should do is make sure that you take care to preserve the encoding of your string, i.e. convert it to an array of bytes with your correct encoding (Unicode, UTF-8 whatever) then convert that array of bytes to base64.
EDIT: A simple test.
private void button2_Click(object sender, EventArgs e)
{
StringBuilder sb = new StringBuilder();
string test = "The rain in spain falls \t mainly on the plain";
sb.AppendLine(test);
UTF8Encoding enc = new UTF8Encoding();
byte[] b = enc.GetBytes(test);
string cvtd = Convert.ToBase64String(b);
sb.AppendLine(cvtd);
byte[] c = Convert.FromBase64String(cvtd);
string backAgain = enc.GetString(c);
sb.AppendLine(backAgain);
MessageBox.Show(sb.ToString());
}
It seems that there is lots of confusion here; and surprisingly most answers are of "No" variety. I don't think that is a good canonical answer.
The reason for confusion is probably the fact that Base64 is not strictly specified; multiple practical implementations and interpretations exist.
You can check out link text for more discussion on this.
In general, however, conforming base64 codecs SHOULD understand linefeeds, as they are mandated by some base64 definitions (76 character segments, then linefeed etc).
Because of this, most decoders also allow for indentation whitespace, and quite commonly any whitespace between 4-character "triplets" (so named since they encode 3 bytes).
So there's a good chance that in practice you can use tabs and other white space.
But I would not add tabs myself if generating base64 content sent to a service -- be conservative at what you send, (more) liberal at what you receive.
Convert.FromBase64String() in the .NET framework does not seem to mind them. I believe all whitespace in the string is ignored.
string xxx = "ABCD\tDEFG"; //simulated Base64 encoded string w/added tab
Console.WriteLine(xxx);
byte[] xx = Convert.FromBase64String(xxx); // convert string back to binary
Console.WriteLine(BitConverter.ToString(xx));
output:
ABCD DEFG
00-10-83-0C-41-46
The relevant clause of RFC-2045 (6:8)
The encoded output stream must be
represented in lines of no more
than 76 characters each. All line
breaks or other characters not
found in Table 1 must be ignored by
decoding software. In base64 data,
characters other than those in Table
1, line breaks, and other white
space probably indicate a transmission
error, about which a warning
message or even a message rejection
might be appropriate under some
circumstances.
YES!
Base64 is used to encode ANY 8bit value (Decimal 0 to 255) into a string using a set of safe characters. TAB is decimal 9.
Base 64 uses one of the following character sets:
Data: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
URLs: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
Binary Attachments (eg: email) in text are also encoded using this system.