I created a simple Dictionary<char, char> that contains character combinations to replace local characters to ascii characters (ē -> e), but it does not work - when I see this dictionary in debug mode - I see, that local characters are wrong (instead of my local characters (latvian) I see some different characters)
I suspect it's something to do with encoding, although I don't know why is this happening and how to fix it...
if I make a simple string text = "with some local characters ā ē ū"; - if I check this in debug mode, encoding seems to be correct...
here is the instantiation of my dictionary:
and here is what values appear in this dictionary after instantiation:
Check that your source is encoding per the C# language specification. It must use one of the allowed Unicode encodings. UTF-8 is always allowed. (I'd say preferred.) Your editor should be able to tell you which your using and/or allow you to re-save it with a specific encoding.
In Visual Studio, you can re-save the file with a specific encoding using File » Advanced Save Options…, then File » Save.
Related
I am trying to do some kind of sentence processing in turkish, and I am using text file for database. But I can not read turkish characters from text file, because of that I can not process the data correctly.
string[] Tempdatabase = File.ReadAllLines(#"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:
It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
File.ReadAllLines(#"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));
You can fiddle around using Encoding as much as you like. This might eventually yield the expected result, but bear in mind that this may not work with other files.
Usually, C# processes strings and files using Unicode by default. So unless you really need something else, you should try this instead:
Open your text file in notepad (or any other program) and save it as an UTF-8 file. Then, you should get the expected results without any modifications in your code. This is because C# reads the file using the encoding you saved it with. This is default behavior, which should be preferred.
When you save your text file as UTF-8, then C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed with ASCII)
The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .net will assume Unicode text when reading text from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example utf8) using an external text editor.
Or specify a specific character set to read for example:
string[] Tempdatabase = File.ReadAllLines(#"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system.
string[] Tempdatabase = File.ReadAllLines(#"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254");
This will use the Turkish character set defined by Microsoft.
Below is what the text looks like when viewed in NotePad++.
I need to get the IndexOf for that peice of the string. for use the the below code. And I can't figure out how to use the odd characters in my code.
int start = text.IndexOf("AppxxxxxxDB INFO");
Where the "xxxxx"'s represent the strange characters.
All these characters have corresponding ASCII codes, you can insert them in a string by escaping it.
For instance:
"App\x0000\x0001\x0000\x0003\x0000\x0000\x0000DB INFO"
or shorter:
"App\x00\x01\x00\x03\x00\x00\x00"+"DB INFO"
\xXXXX means you specify one character with XXXX the hexadecimal number corresponding to the character.
Notepad++ simply wants to make it a bit more convenient by rendering these characters by printing the abbreviation in a "bubble". But that's just rendering.
The origin of these characters is printer (and other media) directives. For instance you needed to instruct a printer to move to the next line, stop the printing job, nowadays they are still used. Some terminals use them to communicate color changes, etc. The most well known is \n or \x000A which means you start a new line. For text they are thus characters that specify how to handle text. A bit equivalent to modern html, etc. (although it's only a limited equivalence). \n is thus only a new line because there is a consensus about that. If one defines his/her own encoding, he can invent a new system.
Echoing #JonSkeet's warning, when you read a file into a string, the file's bytes are decoded according to a character set encoding. The decoder has to do something with bytes values or sequences that are invalid per the encoding rules. Typical decoders substitute a replacement character and attempt to go on.
I call that data corruption. In most cases, I'd rather have the decoder throw an exception.
You can use a standard decoder, customize one or create a new one with the Encoding class to get the behavior you want. Or, you can preserve the original bytes by reading the file as bytes instead of as text.
If you insist on reading the file as text, I suggest using the 437 encoding because it has 256 characters, one for every byte value, no restrictions on byte sequences and each 437 character is also in Unicode. The bytes that represent text will possibly decode the same characters that you want to search for as strings, but you have to check, comparing 437 and Unicode in this table.
Really, you should have and follow the specification for the file type you are reading. After all, there is no text but encoded text, and you have to know which encoding it is.
Line:
IList<string> text = await FileIO.ReadLinesAsync(file);
causes exception No mapping for the Unicode character exists in the target multi-byte code page
When I remove chars like ąśźćóż from my file it runs ok, but the problem is that I can't guarantee that those chars won't happen in future.
I tried changing the encoding in advanced save options but it is already
Unicode (UTF-8 with signature) - Codepage 65001
I have a hard time trying to figure this one out.
Make FileIO.ReadLinesAsync use a matching encoding. I don't know what you custom class does but according to the error message it does not use any Unicode encoding.
I think those characters ąśźćóż are UTF-16 encoded.So, it's better to use UTF-16. Use the overload ReadLinesAsync(IStorageFile, UnicodeEncoding) and set UnicodeEncdoing parameter to UnicodeEncoding.Utf16BE
From MSDN :
This method uses the character encoding of the specified file. If you
want to specify different encoding, call ReadLinesAsync(IStorageFile,
UnicodeEncoding) instead.
I have a problem with using Directory.Exists() on a string that contains an accented character.
This is the directory path: D:\ést_test\scenery. It is coming in as a simple string in a file that I am parsing:
[Area.121]
Title=ést_test
local=D:\AITests\ést_test
Layer=121
Active=FALSE
Required=FALSE
My code is taking the local value and adding \scenery to it. I need to test that this exists (which it does) and am simply using:
if (!Directory.Exists(area.Path))
{
// some handling code
area.AreaIsValid = false;
}
This returns false. It seems that the string handling that I am doing is replacing the accented character. The text visualizer in VS2012 is showing this (directoryManager is just a wrap around System.IO.Directory):
And the warning message as displayed is showing this:
So it seems that the accented character is not being recognized. Searching for this issue does turn up but mostly about removing or replacing the accented character. I am currently using 'normal' string handling. I tried using FileInfo but the path seems to get mangled anyway.
So my first question is how do I get the path stored into a string so that it will pass the Directory.Exists test?
This raises a wider question of non latin characters in path names. I have users all over the world so I can see arabic. Russian, Chinese and so on in paths. How can I handle all of these?
The problem is almost certainly that you're loading the file with the wrong encoding. The fact that it's a filename is irrelevant - the screenshots show that you've lost the relevant data before you call Directory.Exists.
You should make sure you know the file encoding (e.g. UTF-8, Cp1252 etc) and then pass that in as an argument into however you're loading the file (e.g. File.ReadAllText). If this isn't enough information to get you going, you'll need to tell us more about the file (to work out what encoding it's in) and more about your code (how you're reading it).
Once you've managed to load the correct data, I'd hope that the file aspect just handles itself automatically.
I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128-256 range (extended ascii), and all of these characters show up as question marks , instead of the proper character. For example, when reading a string ,it could come up as "S?meStr?n?" if the string contained the extended ascii characters.
Now, is there any way to change the default encoding for my application? I know in java you could define the default character set from command line.
There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.
You need to find out what encoding your files actually use, and specific that when reading the data with StreamReader (or whatever else you're using). For example, you may want encoding Windows-1252:
Encoding encoding = Encoding.GetEncoding(1252);
.NET strings are always sequences of UTF-16 code points. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc unless that's what you really, really mean.)
An Encoding can be specified in at least one overload of functions for reading text - for example, ReadAllText(string, Encoding).
So if you no a file's encoded using Windows-1252, then you can specify it like so:
string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));
Of course, doing this requires knowing ahead of time which code page is being used.