I am creating a tool that replaces some text in a text file. My problem is that `File.ReadAllLines` turns the Hebrew characters into gibberish (weird question marks: �).
Does anyone know why this is happening? Note that I also have problems with Hebrew in games and such, and in Notepad I can't save Hebrew documents: I can type Hebrew letters, but when I save, it tells me there's a problem.
EDIT - I tried this, but it only turned the Hebrew into regular question marks, not "special" ones:
string[] lines = File.ReadAllLines(fullFilenameDir);
// Note: Encoding.ASCII.GetBytes replaces every non-ASCII character with '?',
// so the Hebrew is already lost before the conversion to Unicode happens.
byte[] htmlBytes = Encoding.Convert(Encoding.ASCII, Encoding.Unicode, Encoding.ASCII.GetBytes(String.Join("\r\n", lines)));
char[] htmlChars = new char[Encoding.Unicode.GetCharCount(htmlBytes)];
Encoding.Unicode.GetChars(htmlBytes, 0, htmlBytes.Length, htmlChars, 0);
Try using the Windows-1255 code page to get the encoding:
var myLines = File.ReadAllLines(@"C:\MyFile.txt", Encoding.GetEncoding("Windows-1255"));
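As a fuller sketch of this fix (the file path and sample text here are invented for illustration): read with the legacy code page, then re-save as UTF-8 so the default overloads work from then on. On .NET Core / .NET 5+, legacy code pages must be registered first.

```csharp
using System;
using System.IO;
using System.Text;

// .NET Core / .NET 5+ need the code-pages provider registered
// (may require the System.Text.Encoding.CodePages package);
// on .NET Framework this line can be omitted.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var hebrew = Encoding.GetEncoding("Windows-1255");

// Hypothetical file path; substitute your own.
string path = Path.Combine(Path.GetTempPath(), "hebrew-demo.txt");

// Simulate a file saved by a legacy editor in Windows-1255.
File.WriteAllText(path, "שלום עולם", hebrew);

// Read with the code page the file was actually saved in.
string[] lines = File.ReadAllLines(path, hebrew);

// Re-save as UTF-8 so future reads need no special encoding.
File.WriteAllLines(path, lines, Encoding.UTF8);

// The default overload now decodes the file correctly.
string[] migrated = File.ReadAllLines(path);
Console.WriteLine(migrated[0]);
```

The one-time re-save means the rest of the tool can keep using the plain `File.ReadAllLines(path)` overload.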
Hello there. I am creating a video player with subtitle support using the MediaElement class and the SubtitlesParser library. I faced an issue with 7 Arabic subtitle files (.srt) being displayed as ???? or like this:
I tried multiple different encodings, but with no luck:
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream);
subLine = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(subLine)); // no-op: encodes and decodes with the same encoding
or
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream,Encoding.UTF8);
Then I found this and, based on the answer, used Encoding.Default ("ANSI") to parse the subtitles and then re-interpret the encoded text:
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream, Encoding.Default);
var arabic = Encoding.GetEncoding(1256);
var latin = Encoding.GetEncoding(1252);
foreach (var item in SubtitlesList)
{
    List<string> lines = new List<string>();
    lines.AddRange(item.Lines.Select(line => arabic.GetString(latin.GetBytes(line))));
    item.Lines = lines;
}
This worked on only 4 of the files; the rest still show ?????? and nothing I have tried so far works on them. This is what I found so far:
exoplayer weird arabic persian subtitles format (this gave me a hint about the real problem).
C# Converting encoded string IÜÜæØÜÜ?E? to readable arabic (Same answer).
convert string from Windows 1256 to UTF-8 (Same answer).
How can I transform string to UTF-8 in C#? (It works for Spanish language but not arabic).
I am also hoping to find a single solution that displays all the files correctly. Is this possible?
Please forgive my simple language; English is not my native language.
I think I found the answer to my question. As a beginner I only had a basic knowledge of encodings until I found this article:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
"Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken, and there's no magic you need to perform; you simply need to select the right encoding to display the document."
I hope this helps anyone else confused about the correct way to handle this: there is no reliable way to detect a file's encoding from its bytes alone; only the user can know it.
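For completeness, the re-interpretation trick used above can be sketched in isolation. The Arabic sample string is made up; Windows-1256 and Windows-1252 are the code pages from the answer:

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ to make legacy code pages available.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var arabic = Encoding.GetEncoding(1256); // Windows-1256 (Arabic)
var latin = Encoding.GetEncoding(1252);  // Windows-1252 (Western)

string original = "مرحبا"; // sample Arabic text

// Simulate the mojibake: Windows-1256 bytes mis-decoded as Windows-1252.
byte[] rawBytes = arabic.GetBytes(original);
string garbled = latin.GetString(rawBytes);

// Repair: recover the raw bytes via the wrong code page,
// then decode them with the correct one.
string repaired = arabic.GetString(latin.GetBytes(garbled));
Console.WriteLine(repaired);
```

This only works when the bytes really were Windows-1256 and the mis-decode really was Windows-1252, which matches the observation that it fixed 4 files but not the others: those files are presumably in yet another encoding.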
I am trying to do some sentence processing in Turkish, using a text file as the database. But I cannot read the Turkish characters from the text file, and because of that I cannot process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:
It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
    File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));
You can fiddle around using Encoding as much as you like. This might eventually yield the expected result, but bear in mind that this may not work with other files.
Usually, C# processes strings and files using Unicode by default. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other program) and save it as a UTF-8 file. Then you should get the expected results without any modifications to your code. This works because C# reads the file using the encoding you saved it with; this default behavior should be preferred.
When you save your text file as UTF-8, then C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed as ASCII).
The file contains the text in a specific Turkish character set, not Unicode. If you don't specify otherwise, .NET assumes UTF-8 when reading a text file. You have two possible solutions:
Either change the text file to use Unicode (for example utf8) using an external text editor.
Or specify a specific character set to read for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254"));
This will use the Turkish character set defined by Microsoft.
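A one-time migration along the lines of the first suggestion might look like this (the file path and sample lines are hypothetical; Windows-1254 is the Turkish code page named above):

```csharp
using System;
using System.IO;
using System.Text;

// Required on .NET Core / .NET 5+ for legacy code pages.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var turkish = Encoding.GetEncoding("Windows-1254");
string path = Path.Combine(Path.GetTempPath(), "dialogs-demo.txt");

// Simulate a legacy file saved with the Turkish code page.
File.WriteAllLines(path, new[] { "Gülüşün", "İyi günler" }, turkish);

// One-time migration: read with the legacy encoding, write back as UTF-8.
string[] lines = File.ReadAllLines(path, turkish);
File.WriteAllLines(path, lines, Encoding.UTF8);

// From now on, the default overload decodes the file correctly.
string[] migrated = File.ReadAllLines(path);
Console.WriteLine(migrated[1]);
```

After the migration, `File.ReadAllLines(path)` with no encoding argument works, because .NET defaults to UTF-8.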
I have a folder name that contains German special characters such as äÄéöÖüß. The following screenshot displays the contents of the LiveLink server.
I want to extract the folder from the LiveLink server using C#.
`value` is obtained from the LiveLink server:
var bytes = new List<byte>(value.Length);
foreach (var c in value)
{
    bytes.Add((byte)c); // narrowing cast: assumes every char fits in a single byte
}
var result = Encoding.UTF8.GetString(bytes.ToArray());
Finally, the result is äÄéöÖü�x, where ß appears as the box character '�x'. All the other characters in the folder name are decoded properly; only ß fails.
I am just wondering why the same code works for all other German special characters but not for ß.
Could anybody help to fix this problem in C#?
Thanks in advance.
Go to the admin panel of the server (Livelink/livelink.exe?func=admin.sysvars), set Character Set: UTF-8, and change the code section as follows:
byte[] bytes = Encoding.Default.GetBytes(value);
var retValue = Encoding.UTF8.GetString(bytes);
It works fine.
You guessed your encoding to be UTF-8, and it obviously is not. You will need to find out what encoding the byte stream really uses and use that instead. We cannot help you with that; you will have to ask the sender of said bytes.
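One plausible explanation for why only ß breaks (an assumption, not confirmed by the question): the LiveLink value is UTF-8 bytes that were decoded as Windows-1252. In UTF-8, ß is 0xC3 0x9F, and Windows-1252 maps byte 0x9F to 'Ÿ' (U+0178), a char value that no longer fits in one byte, so the `(byte)c` cast silently truncates it to 0x78 ('x'). Using the code page's `GetBytes` instead of a cast avoids the truncation:

```csharp
using System;
using System.Text;

// Required on .NET Core / .NET 5+ for legacy code pages.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var win1252 = Encoding.GetEncoding(1252);

// UTF-8 bytes for "ß", mis-decoded as Windows-1252:
byte[] utf8Bytes = Encoding.UTF8.GetBytes("ß");   // 0xC3 0x9F
string misDecoded = win1252.GetString(utf8Bytes); // "ÃŸ" (0x9F became U+0178)

// The narrowing cast truncates 'Ÿ' (U+0178) to 0x78 ('x'),
// which would explain why ß alone came out broken:
byte truncated = (byte)misDecoded[1];

// Safer repair: let the same code page convert chars back to bytes.
string repaired = Encoding.UTF8.GetString(win1252.GetBytes(misDecoded));
Console.WriteLine(repaired);
```

The other German characters (äÄéöÖü) survive the cast because their UTF-8 bytes all decode in Windows-1252 to chars below 0x100.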
I'm writing a program that reads all the text in a file into a string, loops over that string looking at the characters, and then appends the characters back to another string using a StringBuilder. The issue I'm having is that when it's written back out, special characters such as “ and ” come out looking like � characters instead. I don't need to do a conversion; I just want it written back out the way I read it in:
StringBuilder sb = new StringBuilder();
string text = File.ReadAllText(filePath);
for (int i = 0; i < text.Length; ++i) {
    if (text[i] != '{') { // looking for opening curly brace
        sb.Append(text[i]);
        continue;
    }
    // Do stuff
}
File.WriteAllText(destinationFile, sb.ToString());
I tried using different Encodings (UTF-8, UTF-16, ASCII), but then it just came out even worse; I started getting question mark symbols and Chinese characters (yes, a bit of a shotgun approach, but I was just experimenting).
I did read this article: http://www.joelonsoftware.com/articles/Unicode.html
...but it didn't really explain why I was seeing what I saw, unless in C#, the reader starts cutting off bits when it hits weird characters like that. Thanks in advance for any help!
TL;DR: that is definitely not UTF-8, and you are not even using UTF-8 to read the resulting file. Read as Windows-1252 and write as Windows-1252 (if you are going to use the same viewer to look at the resulting file).
Well, let's first say that a file made by a regular user is very unlikely to be in UTF-8. Many Windows programs (Excel, Notepad, ...) don't use it as their default encoding, and even most developer tools don't default to UTF-8 (which drives me insane). Since a lot of developers don't understand that such a thing as encoding even exists, what chance do regular users have of saving their files correctly in a UTF-8-hostile environment?
This is where your problems start. According to the documentation, the File.ReadAllText(filePath) overload you are using can only detect UTF-8 or UTF-32 (via their byte-order marks).
Indeed, simply reading a file normally encoded in Windows-1252 that contains "a”a" results in the string "a�a", where � is the Unicode replacement character (read the Wikipedia section; it describes exactly the situation you are in!) used to replace invalid bytes. When the replacement character is then encoded as UTF-8 and interpreted as Windows-1252, you will see ï¿½, because the UTF-8 bytes for � are 0xEF, 0xBF, 0xBD, which are the bytes for ï¿½ in Windows-1252.
So read it as Windows-1252 and you're half-way there:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
String result = File.ReadAllText(@"C:\myfile.txt", windows1252);
Console.WriteLine(result); //Correctly prints "a”a" now
Because you saw ï¿½, the tool you are using to view the newly made file is also assuming Windows-1252. So if the goal is to have the file show the correct characters in that tool, you must encode the output as Windows-1252:
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
File.WriteAllText(@"C:\myFile", sb.ToString(), windows1252);
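Putting both halves together, a minimal round trip might look like this. The file path and sample text are invented, and the '{' filtering stands in for the question's "Do stuff" placeholder:

```csharp
using System;
using System.IO;
using System.Text;

// Required on .NET Core / .NET 5+ for legacy code pages.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
string path = Path.Combine(Path.GetTempPath(), "quotes-demo.txt");

// Simulate the input: curly quotes saved in Windows-1252.
File.WriteAllText(path, "He said \u201Chello\u201D.", windows1252);

// Read and write with the same encoding so “ and ” survive untouched.
string text = File.ReadAllText(path, windows1252);
var sb = new StringBuilder();
foreach (char c in text)
{
    if (c != '{') // same filtering idea as in the question
        sb.Append(c);
}
File.WriteAllText(path, sb.ToString(), windows1252);

string result = File.ReadAllText(path, windows1252);
Console.WriteLine(result);
```

Because the same encoding is used on both sides, the curly quotes pass through the StringBuilder loop byte-for-byte intact.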
Chances are the text will be UTF8.
File.ReadAllText(filePath, Encoding.UTF8)
coupled with
File.WriteAllText(destinationFile, sb.ToString(), Encoding.UTF8)
should cover off dealing with the Unicode characters. If you do one without the other, you're going to get garbage output; do both or neither.
I'm reading a CSV file that was created in MS Excel. When I open it in Notepad it looks OK, but when I change the encoding in Notepad++ from ANSI to UTF-8, a few non-printing characters show up.
Specifically 0xFF (hex value).
In my C# app this character causes an issue when reading the file, so is there a way I can do a String.Replace('\xFF', ' ') on it?
Update
I found this link on SO; as it turns out, it is the answer to my question but not my problem.
Link
Instead of String.Replace, specify the encoding while reading the file.
Example:
File.ReadAllText("test.csv", System.Text.Encoding.UTF8)
I guess your Unicode escape is wrong. Try this:
string foo = "foo\xff";
foo = foo.Replace('\xff', ' '); // strings are immutable: Replace returns a new string