Convert string to normal text - c#

When I open one file it contains something like this:
It&#39;s that
What is this and how do I convert it to ASCII?

This is HTML encoding; use WebUtility.HtmlDecode (in the System.Net namespace):
string encoded = "It&#39;s that";
string decoded = System.Net.WebUtility.HtmlDecode(encoded);

HttpUtility.HtmlDecode() will do the trick.

Those are HTML entities. They represent ASCII characters. You can decode them using HttpUtility.HtmlDecode().
If you're just trying to read this one line, you could also rename the file to a .html file and open it in your browser of choice. There are even tools that do this online.

The number between the &# and the ; is likely an ASCII code.
Convert the numbers manually, or use HtmlDecode to save yourself some time...
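As a rough sketch of the manual route (assuming the entities are decimal character references like &#39;), a regex with a match evaluator can do the conversion; HtmlDecode is still the safer choice because it also handles named and hexadecimal entities:

using System;
using System.Text.RegularExpressions;

class ManualEntityDecode
{
    static void Main()
    {
        string encoded = "It&#39;s that";

        // Replace each decimal entity &#NN; with the character it encodes.
        string decoded = Regex.Replace(encoded, @"&#(\d+);",
            m => ((char)int.Parse(m.Groups[1].Value)).ToString());

        Console.WriteLine(decoded); // prints: It's that
    }
}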

If you're using .NET Framework 4.0 or higher, then System.Net.WebUtility.HtmlDecode(s) will work.
I needed this solution for an SSRS report where only 3.5 was supported. Since the namespace above wasn't available, I took the alternate route of
System.Web.HttpUtility.HtmlDecode(rawString)
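A minimal sketch of that .NET 3.5 route (it assumes the project has a reference to System.Web.dll, which is what provides HttpUtility):

using System;
using System.Web; // requires a reference to System.Web.dll

class HtmlDecodeDemo
{
    static void Main()
    {
        string rawString = "It&#39;s that &amp; more";
        string decoded = HttpUtility.HtmlDecode(rawString);
        Console.WriteLine(decoded); // prints: It's that & more
    }
}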

Related

Wrong characters for accents in one Windows-1252 encoded XML

In the XML I need to read in C#, I find characters such as
Ã©, Ã‰.
As far as I know, I should not find those characters in a Windows-1252 encoded XML. Can I fix the problem in C#, or must the XML itself be updated?
Thanks in advance.
It does look like the XML needs to be updated.
You could certainly write something that reads it in as the UTF-8 it really is and writes it back out as the Windows-1252 it claimed to be, but why bother? XML in Windows-1252 is like someone using their smart-phone while dressed as ye olde knight at a Renaissance Faire anyway. Just drop the incorrect declaration from the first line and away you go.
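If you did want the re-encode route instead of just fixing the declaration, a minimal sketch (file path assumed) would be:

using System.IO;
using System.Text;

class ReencodeXml
{
    static void Main()
    {
        string path = "data.xml"; // hypothetical path

        // Read the bytes as the UTF-8 they really are...
        string text = File.ReadAllText(path, Encoding.UTF8);

        // ...and write them back as the Windows-1252 the declaration claims.
        // Characters with no Windows-1252 equivalent get best-fit mapped or replaced.
        File.WriteAllText(path, text, Encoding.GetEncoding(1252));
    }
}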
The simple answer is: you're probably using the wrong encoding. From this I'd say you should be using UTF-8. You can force it by downloading the document before parsing it.
I should note that downloading URLs is tricky: web servers often report the wrong encoding. That is also the reason why the HTML5 standard includes a section on encoding detection. I'm afraid there's no easy generic solution for this -- we ended up implementing our own encoding detection algorithms for our web crawlers.
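A sketch of the "download it yourself and force the encoding" approach mentioned above (the URL is a placeholder); downloading raw bytes sidesteps whatever charset the server reports:

using System.Net;
using System.Text;
using System.Xml.Linq;

class ForcedUtf8Download
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Download raw bytes so the server's charset header is ignored...
            byte[] raw = client.DownloadData("http://example.com/data.xml");

            // ...then decode explicitly as UTF-8 before parsing.
            // (A leading UTF-8 BOM character, if present, is trimmed off.)
            string xml = Encoding.UTF8.GetString(raw);
            XDocument doc = XDocument.Parse(xml.TrimStart('\uFEFF'));
        }
    }
}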

VB6 to C# XML String Conversion Special Characters

I've been using the DOMDocument object from VB6 (MSXML) to create and save an XML file that has an encrypted string. However, this string I think has certain special characters...
<EncryptedPassword>ÆÔ¤ïÎ
߯8KHÖN›¢)Þ,qiãÔÙ</EncryptedPassword>
With this, I go into my C# Project, and de-serialise this XML file in UTF-8 encoding and it fails on this string. I've tried serialisation via ASCII and this gets a couple characters more, but still fails. If I put a plain text string in this place, all is ok! :(
I'm thinking that maybe I'm better off converting the string into an MD5-type string from VB6 first, decoding that string in .NET, and then decrypting the actual string with the special characters, but it's an extra step to code all this up and I was hoping someone might have a better idea for me here?
Thanks in advance!
The best thing for you to do is to encode your encrypted string in something that will use the ASCII charset. The easiest way to do this is to take your encrypted string and then encode it into Base64 and write this encoded value to the XML element.
And in .NET, simply take the value of the XML element and decode it from Base64 and 'voila', you have your encrypted string.
.NET can easily decode a Base64 string, see: http://msdn.microsoft.com/en-us/library/system.text.encoding.ascii.aspx. (This page may make it look a bit more complicated than it really is.)
VB6 does not have native support for Base64 encoding, but a quick trawl on Google throws up some examples of how it can be achieved quite easily:
http://www.vbforums.com/showthread.php?t=379072
http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
http://www.mcmillan.org.nz/Programming/post/Base64-Class.aspx
http://www.freevbcode.com/ShowCode.asp?ID=2038
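On the .NET side, a minimal sketch of pulling the Base64 text back out of the XML and decoding it into the encrypted bytes (file and element names assumed from the question):

using System;
using System.Xml.Linq;

class ReadEncryptedPassword
{
    static void Main()
    {
        XDocument doc = XDocument.Load("settings.xml"); // hypothetical file
        string base64 = (string)doc.Root.Element("EncryptedPassword");

        // 'encrypted' now holds the original cipher bytes, ready for decryption.
        byte[] encrypted = Convert.FromBase64String(base64);
    }
}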
I've concluded that storing these characters in the XML file is wrong. VB6 allows this, but .NET doesn't! Therefore I have converted the string to a Base64 array in line with this link:
http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Now, on the .NET side, the file de-serialises back into my class, where I now store the password as a byte array. I then convert this back to the string I need to decrypt, which raised another problem!
string password = Encoding.UTF7.GetString(this.EncryptedPassword);
With this encoding conversion, I get the string almost exactly back to how I want, but there is a small greater-than character that is just not translating properly! Then a colleague found a Stack Overflow post that had the final answer: there's a discrepancy between VB6 and .NET on this type of conversion. Doing the following instead did the trick:
string password = Encoding.GetEncoding(1252).GetString(this.EncryptedPassword);
Thanks for all the help, much appreciated. The original post about this is ".Net unicode problem, vb6 legacy".

string encoding in C# - strange characters

I have a file that I need to import.
The problem is that I have problems with a lot of characters in that file.
For example these names are wrong:
Björn (in file) - Should be Björn
Ã…ke (in file) - Should be Åke
Unfortunately I can't recreate the file with the correct encoding.
Also there are a lot of characters that are wrong (these were just examples). I can't do a search and replace on them all (unless there is a dictionary with all the conversions).
Can I decode the strings in some way?
thanks Patrik
Edit:
Just some more info that I should have added before (I blame my tiredness).
The file is an .xlsx file.
I debugged this with Notepad++. I copied the correct strings into Notepad++. I used Encoding | Convert to UTF-8. Then I selected Encoding | Encode as ANSI. This has the effect of interpreting the UTF-8 bytes as if they were ANSI, and when I did this I ended up with the same erroneous values as you. So clearly when you read the file you are interpreting it as ANSI rather than UTF-8.
The conclusion, then, is that your file is encoded as UTF-8. Make sure the file is interpreted as UTF-8 when you read it. I can't tell you exactly how to do that since you didn't show how you were reading the file in the first place.
It's possible that your file does not contain a byte-order mark (BOM). If so, then specify the encoding when you read the file by passing Encoding.UTF8.
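The question doesn't show the reading code, so this is only a sketch of what passing the encoding explicitly looks like (the file name is a placeholder):

using System.IO;
using System.Text;

// Passing Encoding.UTF8 makes the decoding unambiguous even when
// the file carries no byte-order mark.
string text = File.ReadAllText("import.txt", Encoding.UTF8);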
I've just tried your first example, and it definitely looks like that's UTF-8.
It's unclear what you're using to look at the file in the first place, but if you load it with a text editor which understands UTF-8 and tell it that it's a UTF-8 file, it should be fine.
When you load it with .NET, you should just be able to use File.OpenText, File.ReadAllText etc - most IO dealing with encodings in .NET defaults to UTF-8 anyway.

How does Encoding.Default work in .NET?

I'm reading a file using:
var source = File.ReadAllText(path);
and the character © wasn't being loaded correctly.
Then, I changed it to:
var source = File.ReadAllText(path, Encoding.UTF8);
and nothing.
I decided to try using
var source = File.ReadAllText(path, Encoding.Default);
and it worked perfectly.
Then I debugged it and tried to find which Encoding did the trick, and I found that it was UTF-7.
What I want to know is:
Is it recommended to use Encoding.Default, and can it guarantee all the characters of the file will be read without problems?
Encoding.Default will only guarantee that all UTF-7 character sets will be read correctly (Google for the whole set). On the other hand, if you try to read a file not encoded with UTF-8 in UTF-8 mode, you'll get corrupted characters like you did.
For instance, if the file is encoded UTF-16 and you read it in UTF-16 mode, you'll be fine even if the file does not contain a single UTF-16-specific character. It all boils down to the file's encoding.
You'll need to save and reopen with the same encoding to be safe from corruption. Otherwise, try to use UTF-7 as much as you can, since it is the most compact yet 'email safe' encoding possible, which is why it is the default in most .NET Framework setups.
It is not recommended to use Encoding.Default.
Quote from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
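Following that advice, here is a small sketch of writing and reading back with an explicit UTF-8 encoding that emits a preamble (BOM); the file name is a placeholder:

using System.IO;
using System.Text;

class Utf8WithPreamble
{
    static void Main()
    {
        // true => write the UTF-8 preamble (BOM) so readers can identify the encoding.
        var utf8Bom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        File.WriteAllText("sample.txt", "© 2013", utf8Bom);

        // ReadAllText detects the BOM and decodes correctly without guessing.
        string roundTripped = File.ReadAllText("sample.txt");
    }
}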
It sounds like you are interested in auto-detecting the encoding of a file, in some sort of situation where you are not in control of the encoding used to save it. There are several questions on StackOverflow addressing this; some cursory browsing points to Determine a string's encoding in C# as a pretty good one. My favorite answer is the one pointing to a C# port of Mozilla's universal charset detector.
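If all you need is to tell apart files that carry a byte-order mark (a far weaker check than a real charset detector such as the Mozilla port mentioned above), a small sketch like this works:

using System.IO;
using System.Text;

static class BomSniffer
{
    // Checks only for a BOM; files without one need a real detector or prior knowledge.
    public static Encoding DetectByBom(string path)
    {
        var bom = new byte[4];
        using (var fs = File.OpenRead(path))
            fs.Read(bom, 0, 4);

        if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) return Encoding.UTF8;
        if (bom[0] == 0xFF && bom[1] == 0xFE) return Encoding.Unicode;          // UTF-16 LE
        if (bom[0] == 0xFE && bom[1] == 0xFF) return Encoding.BigEndianUnicode; // UTF-16 BE
        return null; // no BOM - fall back to a default or a real detector
    }
}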
I think your file is in UTF-7 encoding, nothing more.

Writing c# source code to files

I'm having a stupid problem. I'm reading some .cs files from disk, doing lots of regex and other operations on them with a .NET program I've made, then writing them back to disk.
The resulting files somehow get the wrong encoding. What encoding are C# source files in? And then there is the byte-order mark at the start, is that needed?
Does it get written when I use File.WriteAllText()?
The program changing the files is a simple .NET application, and the code is simply
string text = System.IO.File.ReadAllText(fn);
string newText = Regex.Replace(text, regexStr, replaceStr);
System.IO.File.WriteAllText(fn, newText);
The C# files have comments and strings with characters that don't seem to be part of the standard codepage.
One of the problematic characters is "ä"
Solution:
this seems to work correctly
string text = System.IO.File.ReadAllText(fn, Encoding.GetEncoding(1252));
string newText = Regex.Replace(text, regexStr, replaceStr);
System.IO.File.WriteAllText(fn, newText, Encoding.GetEncoding(1252));
System.IO.File.ReadAllText(fn) tries to guess the encoding of the input file. This can go horribly wrong.
Visual Studio 2008 creates files by default in UTF-8. Similarly, you should try to use UTF-8 wherever possible, by specifying Encoding.UTF8 when writing the files to disk.
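For example, the same round trip as in the question, but pinned to UTF-8 on both the read and the write (file name and pattern are placeholders):

using System.IO;
using System.Text;
using System.Text.RegularExpressions;

string fn = "Program.cs";
string text = File.ReadAllText(fn, Encoding.UTF8);
string newText = Regex.Replace(text, "ä", "ae");
// Writing with an explicit encoding avoids any guessing on the way back out.
File.WriteAllText(fn, newText, Encoding.UTF8);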
By default the files should be encoded with the code page set in the machine's regional settings; in Visual Studio the default is 'Unicode (UTF-8 with signature) - Codepage 65001', but you can use any code page you wish, for example 'Western European (Windows) - Codepage 1252'.
I've written a few code gens in my time and always used ASCII encoding (plain windows text). What language are you using to do the regex ops on the CS files?
