Hello there, I am creating a video player with subtitle support using the MediaElement class and the SubtitlesParser library. I ran into an issue with 7 Arabic subtitle files (.srt) being displayed as ???? or like this:
I tried multiple different encodings, but with no luck:
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream);
subLine = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(subLine));
or
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream,Encoding.UTF8);
Then I found this, and based on the answer I used Encoding.Default ("ANSI") to parse the subtitles and then re-interpret the encoded text:
SubtitlesList = new SubtitlesParser.Classes.Parsers.SubParser().ParseStream(fileStream, Encoding.Default);
var arabic = Encoding.GetEncoding(1256);
var latin = Encoding.GetEncoding(1252);
foreach (var item in SubtitlesList)
{
List<string> lines = new List<string>();
lines.AddRange(item.Lines.Select(line => arabic.GetString(latin.GetBytes(line))));
item.Lines = lines;
}
This worked on only 4 of the files; the rest still show ?????? and nothing I have tried so far has worked on them. This is what I found:
exoplayer weird arabic persian subtitles format (this gave me a hint about the real problem).
C# Converting encoded string IÜÜæØÜÜ?E? to readable arabic (Same answer).
convert string from Windows 1256 to UTF-8 (Same answer).
How can I transform string to UTF-8 in C#? (It works for Spanish language but not arabic).
I am also hoping to find a single solution that correctly displays all the files. Is this possible?
Please forgive my simple language; English is not my native language.
I think I found the answer to my question. As a beginner I had only a basic knowledge of encodings until I found this article:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken, there's no magic you need to perform, you simply need to select the right encoding to display the document.
I hope this helps anyone else who is confused about the correct way to handle this: there is no reliable way to detect a file's encoding automatically; only the user can know it.
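If the user is the one who has to pick, one pragmatic pattern is to generalize the re-interpretation trick above into a helper that takes a candidate code page and expose the candidates (1256, 1252, UTF-8, ...) in the player's UI. This is only a sketch, not the library's API: the code-page numbers are examples, and on .NET Core / .NET 5+ the Windows code pages require the System.Text.Encoding.CodePages package and a provider registration first.

```csharp
using System.Text;

static class SubtitleEncodingFixer
{
    // On .NET Core / .NET 5+, Windows code pages (1252, 1256, ...) are not
    // available until the code-page provider is registered once.
    static SubtitleEncodingFixer()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Re-interpret text that was decoded with the wrong single-byte code page:
    // turn it back into raw bytes with 'assumed', then decode with 'actual'.
    public static string Reinterpret(string garbled, Encoding assumed, Encoding actual)
        => actual.GetString(assumed.GetBytes(garbled));
}

// Example: a line parsed as Windows-1252 that is really Windows-1256 Arabic.
// var fixedLine = SubtitleEncodingFixer.Reinterpret(
//     garbledLine,
//     Encoding.GetEncoding(1252),   // what the parser assumed
//     Encoding.GetEncoding(1256));  // what the user selected
```

This is exactly the bytes-roundtrip from the accepted snippet, just parameterized so each problem file can be retried with a different user-selected source encoding.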
I have been trying to print a simple card with a Japanese character, but it keeps displaying boxes and unknown characters.
This is how I create my adaptive card; I get the params and data from a JSON file, just to keep it neat.
string[] paths = { ".", "Cards", "pickLanguageCard.json" };
string fullPath = Path.Combine(paths);
var adaptiveCard = File.ReadAllText(fullPath);
return new Attachment()
{
ContentType = "application/vnd.microsoft.card.adaptive",
Content = JsonConvert.DeserializeObject(adaptiveCard),
};
Picture of the printed output:
As you can see, the returned JSON data is also wrong, which pinned the problem down to the main source of the bot. I tried modifying the JSON file containing the Japanese character and changing the encoding in web.config, but it didn't solve my problem. Back in Bot Framework v3 there was no problem printing/displaying Japanese characters, but when I tried v4 the Japanese characters came out like that.
Any fix, solution, or workaround will be appreciated. Thanks.
Edit:
Tried passing an encoding to ReadAllText (Encoding.UTF8, Encoding.UTF32, Encoding.Unicode). With UTF8, some other Japanese characters do print, but the format of the JSON breaks and it can no longer be parsed; the same happens with UTF32 and Unicode. With the default, the character is unchanged.
Edit:
So after researching relentlessly, I found out that JSON normally encodes data as standard UTF-8 to keep it light. I tried converting the characters to UTF-16 and it printed successfully, but that seems wrong to me. Is there another way to print the Japanese characters correctly?
When you edit JSON in Visual Studio 2019 and try to save the file with Japanese characters, Visual Studio will automatically offer to fix the format for you:
If you want to manually save your file with a specific encoding instead of relying on an automatic dialog box, you can use the Save with Encoding... option in the File > Save As... dialog:
If you select Codepage 65001, which is Unicode (UTF-8) with or without signature, your Japanese characters should display correctly:
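If you'd rather not depend on the IDE dialog, you can normalize the card file yourself before the bot reads it. A minimal sketch (the path is the one from the question; `new UTF8Encoding(true)` writes the BOM, i.e. the "signature" Visual Studio offers). Note this only helps if the file's current bytes are already valid UTF-8; it cannot repair text that was saved in the wrong encoding to begin with.

```csharp
using System.IO;
using System.Text;

// Re-save the adaptive card as UTF-8 with a BOM, then read it back as UTF-8.
var path = Path.Combine(".", "Cards", "pickLanguageCard.json");
var text = File.ReadAllText(path); // BOM-aware, assumes UTF-8 by default

File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

var adaptiveCard = File.ReadAllText(path, Encoding.UTF8);
```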
I am trying to do some sentence processing in Turkish, using a text file as the database. But I cannot read Turkish characters from the text file, and because of that I cannot process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:
It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
    File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));
You can fiddle around with Encoding as much as you like. This might eventually yield the expected result, but bear in mind that it may not work with other files.
Usually, C# processes strings and files as Unicode by default. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other editor) and save it as a UTF-8 file. Then you should get the expected results without any modifications to your code. This works because File.ReadAllLines assumes UTF-8 unless told otherwise, so the file's encoding now matches what C# expects. This default behavior should be preferred.
When you save your text file as UTF-8, C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed as ASCII).
The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .NET will assume Unicode text when reading from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example utf8) using an external text editor.
Or specify a specific character set to read for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("windows-1254"));
This will use the Turkish character set defined by Microsoft.
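One caveat worth adding: on .NET Core and .NET 5+, "windows-1254" (like the other Windows ANSI code pages) is not available out of the box, and Encoding.GetEncoding will throw until the code-pages provider is registered. A sketch, assuming the System.Text.Encoding.CodePages NuGet package is referenced:

```csharp
using System.IO;
using System.Text;

// Required once at startup on .NET Core / .NET 5+ before
// Encoding.GetEncoding("windows-1254") will succeed.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

string[] lines = File.ReadAllLines(
    @"C:\Users\dialogs.txt",
    Encoding.GetEncoding("windows-1254"));
```

On the classic .NET Framework the registration line is unnecessary; the code pages are built in.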
Is there a way to use https://translate.google.co.za/ from code?
Maybe by using Encoding, WebClient, and Uri, but I'm not sure of the correct way to do this.
In code I can get the translate-to language, the translate-from language, and the content, but how can I incorporate those parameters into the URL and then display the end result?
Please help.
Code attempt:
UnicodeEncoding tmpEncoding = new UnicodeEncoding();
string url = String.Format("http://translate.google.co.za/#{0}/{1}/{2}", languageFrom, languageTo, content);
WebClient tmpClient = new WebClient();
tmpClient.Encoding = System.Text.Encoding.ASCII;
string result = tmpEncoding.GetString(tmpClient.DownloadData(url));
The result it gives me is a list of Chinese or Japanese characters. I don't know what I'm doing wrong. Maybe the encoding?
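CJK-looking garbage is a classic sign of UTF-8 (or ASCII) bytes being decoded as UTF-16: each pair of 8-bit units gets fused into one 16-bit character. In the snippet above, DownloadData returns raw bytes that UnicodeEncoding (UTF-16) then misreads. A sketch with the decoding fixed and the content URL-escaped; whether anything useful can be scraped from this URL without running its JavaScript is a separate question:

```csharp
using System;
using System.Net;
using System.Text;

string languageFrom = "en", languageTo = "fr", content = "hello world";

string url = string.Format(
    "http://translate.google.co.za/#{0}/{1}/{2}",
    languageFrom, languageTo, Uri.EscapeDataString(content));

using (var client = new WebClient())
{
    // Decode the response body as UTF-8, not ASCII or UTF-16.
    client.Encoding = Encoding.UTF8;
    string html = client.DownloadString(url);
    Console.WriteLine(html.Length);
}
```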
Take a look at the following website click here
You can use the official Google Translate API for this
Take note that it will cost money to translate. Also take a look at other translate api which can be used inside .net
I did some searching for you: the Bing translator service is a free API for up to 2M characters a month; beyond that you have to pay for it. It also has a nice SDK to go with it.
I found an answer courtesy of Rick Strahl's Web Log (http://weblog.west-wind.com/posts/2011/Aug/06/Translating-with-Google-Translate-without-API-and-C-Code).
Although I didn't use the JavaScriptSerializer, it gave me what I wanted, in the form of (\"content\").
So just a bit of string manipulation and I'm golden.
EDIT:
I ended up using the serializer after all, as the other way didn't return the special characters that form some words; i.e. French words would lose the characters that make them look French. Instead it would give a question mark inside a white diamond.
I've searched everywhere for this answer so hopefully it's not a duplicate. I decided I'm just finally going to ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++ I see all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results or just a big "MZ". I've tried all the encodings supported in C#. How can Notepad-like programs read a file like this while I simply can't? I've tried converting the bytes to a string, reading line by line, and even reading it as binary, and none of it works.
Thanks for the help! :)
Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = @"C:\mystuff\program.exe";
using (var sw = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
var s = sw.ReadToEnd();
s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the Windows textbox control (which is what the TextBox ultimately uses) treats the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shown as \0 in the debugger). You'll have to replace the 0's in the string with spaces. I edited the code example above to show how you'd do that.
The exe is a binary file, and if you try to read it as a text file you'll get the effect you describe. Try using something like a FileStream instead, which does not care about the structure of the file but treats it as just a series of bytes.
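To treat the file purely as bytes, you can skip encodings entirely. A minimal sketch (path reused from the earlier example) that reads the raw bytes and prints the first few in hex; for a valid .exe the first two bytes are 0x4D 0x5A, the "MZ" DOS header signature:

```csharp
using System;
using System.IO;

byte[] bytes = File.ReadAllBytes(@"C:\mystuff\program.exe");

// Dump the first 16 bytes in hex; no text decoding involved.
for (int i = 0; i < Math.Min(16, bytes.Length); i++)
    Console.Write("{0:X2} ", bytes[i]);
Console.WriteLine();
```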
I have a problem reading a .txt file in my Windows Phone app.
I made a simple app that reads a stream from a .txt file and prints it.
Unfortunately I'm from Italy and we have many letters with accents, and here's the problem: all accented letters are printed as question marks.
Here's the sample code:
var resourceStream = Application.GetResourceStream(new Uri("frasi.txt", UriKind.RelativeOrAbsolute));
if (resourceStream != null)
{
    //System.Text.Encoding.Default, true
    using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.UTF8))
    {
        string line = reader.ReadLine();
        while (line != null)
        {
            frasi.Add(line);
            line = reader.ReadLine();
        }
    }
}
So, I'm asking you how to avoid this issue.
All the best.
[EDIT] Solution: I hadn't made sure the file was encoded in UTF-8. I saved it with the correct encoding and it worked like a charm. Thank you, Oscar.
You need to use Encoding.Default. Change:
using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.UTF8))
to
using (var reader = new StreamReader(resourceStream.Stream, System.Text.Encoding.Default))
What you have commented out is what you should be using if you do not know the exact encoding of your source data. System.Text.Encoding.Default uses the encoding for the operating system's current ANSI code page, which gives the best chance of a correct decoding: it picks up the current regional settings and uses those.
However, note this warning from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
Despite this, in my experience with data coming from a number of different sources and various cultures, this is the option that provides the most consistent results out of the box, especially for diacritic marks, which turn into question marks when ANSI data is read as UTF-8.
I hope this helps.
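One way to get some of both worlds is to let StreamReader sniff a Unicode byte-order mark and fall back to the ANSI code page only when no BOM is present. A sketch, self-contained with a local file standing in for resourceStream.Stream:

```csharp
using System.IO;
using System.Text;

// If the file starts with a UTF-8/UTF-16/UTF-32 BOM, that encoding wins;
// otherwise the fallback (here the OS ANSI code page) is used.
using (var fs = File.OpenRead("frasi.txt"))
using (var reader = new StreamReader(
    fs,
    Encoding.Default,
    detectEncodingFromByteOrderMarks: true))
{
    string text = reader.ReadToEnd();
}
```

This only helps for files saved with a BOM; a BOM-less UTF-8 file will still be read with the fallback encoding, which is why re-saving the source file in a known encoding remains the most reliable fix.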