How can I insert ASCII special characters (e.g. with the ASCII value 0x01) into a string?
I ask because I am using the following:
str = str.Replace( "<TAG1>", Convert.ToChar(0x01).ToString() );
and I feel that there must be a better way than this. Any ideas?
Update:
Also, if I use this methodology, do I need to worry about Unicode and ASCII clashing?
I believe you can use \uXXXX to insert specified codes into your string.
ETA: I just tested it and it works. :-)
using System;

class Uxxxx {
    public static void Main() {
        Console.WriteLine("\u20AC");
    }
}
Also, if I use this methodology, do I need to worry about Unicode and ASCII clashing?
Your first problem will be your tags clashing with ASCII. Once you get to TAG10 (which would map to 0x0A), you will clash with the line feed character. If you ensure that you never have more than nine tags, you should be safe. Unicode encoding (or rather, UTF-8) is identical to ASCII encoding for byte values between 0 and 127; they only differ when the top bit is set.
and I feel that there must be a better way than this. Any ideas?
It looks as if you're trying to manipulate a binary chunk using textual tools. If you want to insert the byte 0x01, for example, you're not manipulating text anymore, since you don't care what that byte might represent, and since it looks like you don't even care which encoding you'll be outputting.
A better way would be to treat the thing you're manipulating as a binary chunk of data, which would let you insert bits and bytes easily, without using brittle workarounds and worrying about side effects.
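For example, a minimal sketch of the byte-oriented approach (the payload text here is made up for illustration):

using System.IO;
using System.Text;

var output = new MemoryStream();
byte[] payload = Encoding.ASCII.GetBytes("some data");
output.Write(payload, 0, payload.Length);   // write the text portion as bytes
output.WriteByte(0x01);                     // insert the raw control byte directly
byte[] result = output.ToArray();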
I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (U+2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in UTF-16 Ø is U+00D8 and ± is U+00B1. So the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there; that's why the UTF-8 tool that you linked to can still kind of make sense of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.GetEncoding(1252).GetBytes( garbledUnicodeString ); // there is no Encoding.Ansi; Windows-1252 is the usual "ANSI" codepage
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
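To see the repair work on a single character (a sketch; Windows-1252 is an assumption about which 8-bit codepage was involved):

// "Ø±" is the garbled form of "ر" described above
string garbled = "\u00D8\u00B1";
byte[] raw = System.Text.Encoding.GetEncoding(1252).GetBytes(garbled); // D8 B1
string repaired = System.Text.Encoding.UTF8.GetString(raw);            // "ر"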
I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128-255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, when reading a string, it could come up as "S?meStr?n?" if the string contained extended ASCII characters.
Now, is there any way to change the default encoding for my application? I know in Java you can define the default character set from the command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
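You would then pass that encoding to the reader, e.g. (a minimal sketch; the file name is hypothetical):

using (var reader = new StreamReader("data.txt", Encoding.GetEncoding(1252)))
{
    string contents = reader.ReadToEnd();
    // bytes 128-255 are now decoded as Windows-1252 characters, not '?'
}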
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of the functions for reading text - for example, File.ReadAllText(string, Encoding).
So if you know a file is encoded using Windows-1252, you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.
I have an Excel spreadsheet with lab data which looks like this:
µg/L (ppb)
I want to test for the presence of the Greek letter "µ" and if found I need to do something special.
Normally, I would write something like this:
if (cell.StartsWith(matchSequence)) {
    //.. <-- universal symbol for "magic" :)
}
I know there is an Encoding API in the Framework, but should I use it for just this one edge-case or just copy the Greek micro symbol from the character map?
How would I test for the presence of this Unicode character? The character map seems like a "cheap" fix that will bite me later (I work for a company which is multinational).
I want to do something that is maintainable and not just some crazy math-voodoo conversion that only works for this edge case.
I guess I'm asking for best practice advice here.
Thanks!
You need to work out the Unicode character you're interested in; then you can represent it in code with an escape sequence.
For example, µ is U+00B5, so you just need:
if (text.Contains("\u00b5"))
You can find out the Unicode value from charmap or from the Unicode code charts.
The Unicode code point for micro µ is U+00B5 and is different from the "Greek letter mu" µ, which is at U+03BC. So you can use "\u00b5" to find it, and possibly also look for "\u03bc" as well - they look the same, so whoever created the spreadsheet could have used the wrong one!
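So a check covering both look-alikes might be (a sketch, with cell as the cell text from the question):

if (cell.Contains("\u00b5") || cell.Contains("\u03bc"))
{
    // found either a micro sign or a Greek small mu
}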
You can create a char from the numeric equivalent shown in the Character Map (displayed as U+0050 for 'P'). To do this, simply check with Contains:

string value = cell; // the text to test, e.g. the cell from the question
if (value.Contains(Char.ConvertFromUtf32(0x0050)))
{
    // found it
}
C# source files are usually encoded in UTF-8, since that is the encoding the language uses. All strings and string literals in C# (and other .NET languages) are encoded in UTF-16. So you can safely copy the micro character from the character map.
You can also use the character's integer value with a cast, e.g. (char)0x00B5.
I'm building sitemaps and I need a way to quickly check how many UTF-8 encoded bytes a StringBuilder currently contains.
The naive way to do this would be to:
Encoding.UTF8.GetBytes(builder.ToString()).Length
But isn't this a bit bloated?
Using builder.Length doesn't work, as certain characters resolve to 2 bytes, such as ÅÄÖ.
You could use this:
Encoding.UTF8.GetByteCount(builder.ToString());
Unfortunately, unlike Java where there is a CharSequence interface, you cannot directly process the StringBuilder without first converting it to a string.
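A quick sketch of the difference between the two counts:

var builder = new System.Text.StringBuilder("ÅÄÖ");
int charCount = builder.Length;                                              // 3 chars
int byteCount = System.Text.Encoding.UTF8.GetByteCount(builder.ToString()); // 6 bytes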
I've got lots of text that I need to output, which includes all sorts of characters from many languages. Sometimes I need to output the text in character encodings other than Unicode (eg, Shift-JIS, or ISO-8859-2), in order to match the page it's going to.
If the text has characters that the encoding can't handle (eg, Japanese characters in ISO-8859-2 encoded output) I end up with odd characters in the output. I can escape them, but I'd rather do that only if it's really necessary.
So, my question is this: Is there a way I can tell ahead of time if an encoding can handle all the characters in my string?
EDIT:
I think the EncoderFallback is probably the right answer to the question I asked. Unfortunately it doesn't seem to work in my particular situation. My thought was to convert the characters to their HTML entity equivalents (eg, &#12514; instead of モ). However, the encoder only converts the first such character it finds, and if I set the Response.ContentEncoding it never calls my EncoderFallback at all.
You can write your own EncoderFallback class and assign it to the encoding before encoding.
With this approach you need to do nothing in advance (which would otherwise likely mean scanning the output string looking for problems).
Instead, your fallback class only needs to handle replacements where the encoding has no value for a character.
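For instance, here is a sketch of such a fallback that writes HTML numeric entities (e.g. &#12514; for モ) for characters the target encoding can't represent; the class names are illustrative, not framework types:

using System.Text;

class HtmlEntityFallback : EncoderFallback
{
    // "&#1114111;" (U+10FFFF) is the longest replacement we can emit
    public override int MaxCharCount => 10;

    public override EncoderFallbackBuffer CreateFallbackBuffer()
        => new HtmlEntityFallbackBuffer();
}

class HtmlEntityFallbackBuffer : EncoderFallbackBuffer
{
    private string replacement = "";
    private int index;

    // called for a single unencodable char
    public override bool Fallback(char charUnknown, int idx)
    {
        replacement = "&#" + (int)charUnknown + ";";
        index = 0;
        return true;
    }

    // called for an unencodable surrogate pair
    public override bool Fallback(char high, char low, int idx)
    {
        replacement = "&#" + char.ConvertToUtf32(high, low) + ";";
        index = 0;
        return true;
    }

    public override char GetNextChar()
        => index < replacement.Length ? replacement[index++] : '\0';

    public override bool MovePrevious()
    {
        if (index == 0) return false;
        index--;
        return true;
    }

    public override int Remaining => replacement.Length - index;
}

It would be wired up the same way as an exception fallback, e.g. Encoding.GetEncoding("ISO-8859-2", new HtmlEntityFallback(), DecoderFallback.ReplacementFallback).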
Try to encode the string with an Encoding whose EncoderFallback is set to EncoderExceptionFallback, e.g.:

Encoding e = Encoding.GetEncoding(932, new EncoderExceptionFallback(), new DecoderExceptionFallback());

Then catch EncoderFallbackException when you call GetBytes().
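In full (a sketch; 932 is the Shift-JIS codepage and text stands in for your string):

Encoding shiftJis = Encoding.GetEncoding(932,
    new EncoderExceptionFallback(), new DecoderExceptionFallback());
try
{
    shiftJis.GetBytes(text);   // throws if any character has no mapping
    // every character is representable in this encoding
}
catch (EncoderFallbackException)
{
    // at least one character can't be encoded; escape it or choose another encoding
}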
I think the methods already posted should work. (The EncoderFallback solution seems quite nice.) Here's an alternative, however, in case you prefer it.
Create an encoder for the encoding you want to test by calling encoding.GetEncoder().
You can then call the Convert method of the Encoder object, passing in your text, and look at the value of the completed out parameter to determine whether it succeeded or not.
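A sketch of that approach (ISO-8859-2 and text are placeholders; note that with the default replacement fallback, unencodable characters are silently replaced rather than flagged, so combining this with EncoderExceptionFallback is more reliable for pure detection):

Encoding enc = Encoding.GetEncoding("ISO-8859-2");
Encoder encoder = enc.GetEncoder();
char[] chars = text.ToCharArray();
byte[] bytes = new byte[enc.GetMaxByteCount(chars.Length)];
encoder.Convert(chars, 0, chars.Length, bytes, 0, bytes.Length, true,
    out int charsUsed, out int bytesUsed, out bool completed);
// completed reports whether the whole input was converted into the buffer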
If speed is an issue, you may want to benchmark the various methods, but I suspect they would all have quite similar performance profiles.
Convert it to the target encoding, convert it back and compare it with the original?
Try Encoding.GetBytes() and Encoding.GetString() to convert back and forth.
As an optimization, you could collect the set of distinct Unicode characters used in your original string and try the encoding on just those.
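A sketch of the round trip (target and original are placeholders):

Encoding target = Encoding.GetEncoding("ISO-8859-2");
byte[] encoded = target.GetBytes(original);
string roundTripped = target.GetString(encoded);
bool lossless = roundTripped == original;   // false if anything was replaced with '?'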