How to read Cyrillic symbols from a .txt file with C#

I saw similar topics but could not find a solution. My problem is that I have a .txt file in which the symbols are in Bulgarian (which is Cyrillic), but every attempt to read them has been unsuccessful. I tried to read the file with this code:
if (File.Exists(fileName))
{
    // Create the reader only after confirming the file exists, and dispose it when done.
    using (StreamReader reader = new StreamReader(fileName, Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Console.WriteLine(line);
        }
    }
}
I also changed the Encoding value to every option available, including GetEncoding(1251), which I read is for Cyrillic. When saving the .txt file I tried each encoding offered (Unicode, UTF-8, BigEndianUnicode, ANSI) in every combination with the Encoding I set in the code, but again with no success.
Any ideas on how to read the Cyrillic symbols correctly will be appreciated.
And here is sample text for this: "Ето примерен текст."
Thanks in advance! :)

Your problem is that the console can't show cyrillic characters. Try putting a breakpoint on the Console.WriteLine and inspect the line variable. Clearly you'll need to know the correct encoding first! :-)
If you don't trust me, try this: make a console program that does this:
string line = "Ето примерен текст";
Console.WriteLine(line);
return 0;
Put a breakpoint on the return 0;, then watch the console and watch the line variable.
I'll add that Unicode consoles should be one of the "new" things in .NET 4.5.
And you can try to read this page: c# unicode string output

The problem you are having is not reading the text, but displaying it.
If your real intention is to display Unicode text in a console window, then you'll have to make a few changes. If however, you will be displaying the text in a WinForms or WPF app for instance, then you will not have problems - they work with Unicode by default.
By default, the console neither handles Unicode nor uses a font that has Unicode glyphs. You need to do the following:
Save your text file as UTF-8.
Start a console which is Unicode-enabled: cmd /u
Change the font to a TrueType font such as "Lucida Console": console window menu -> Properties -> Font
Change the code page to UTF-8: chcp 65001
Run your app.
Your characters will now be displayed correctly.
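As a rough sketch of the same idea from inside the program: setting Console.OutputEncoding may serve a similar purpose to running chcp 65001 by hand. The class and method names here are made up for illustration, and the console font still has to contain Cyrillic glyphs for the text to show up.

```csharp
using System;
using System.Text;

static class ConsoleUnicodeSketch
{
    // Hypothetical helper: switches the console output to UTF-8
    // and prints the sample text from the question.
    public static void PrintSample()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("Ето примерен текст.");
    }
}
```

Call ConsoleUnicodeSketch.PrintSample() from Main. If the output is still garbled, the font is the next thing to check.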

Related

C# .csv-file in WinForm with Ä, Ö, Ü [duplicate]

I'm using the code below to read a text file that contains foreign characters, the file is encoded ANSI and looks fine in notepad. The code below doesn't work, when the file values are read and shown in the datagrid the characters appear as squares, could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all encodings found under System.Text.Encoding. and all fail to show the file correctly.
Update 2: I've changed the file encoding (resaved the file) to unicode and used System.Text.Encoding.Unicode and it worked just fine. So why did notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you try using the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding notepad guessed is used by the file.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be a more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too if your OS default encoding matches file encoding:
Encoding.Default;
Yes, it could be the actual encoding of the file, which is probably Unicode. Try UTF-8, as that is the most common form of Unicode encoding. Otherwise, if the file is ASCII, then standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
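To make the mismatch concrete, here is a small sketch that decodes the same byte with two different encodings. The class and method names are invented for the example. It uses ISO-8859-1 because that encoding is available by name on every .NET runtime I know of; on .NET Core and later, code pages such as 1252 may additionally need the code pages encoding provider registered.

```csharp
using System;
using System.Text;

static class DecodeSketch
{
    // Hypothetical helper: decodes the given bytes with the caller's choice of encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Demo()
    {
        // "Ä" (U+00C4) is stored as the single byte 0xC4 in ISO-8859-1 and Windows-1252.
        byte[] ansiBytes = { 0xC4 };

        string asLatin1 = Decode(ansiBytes, Encoding.GetEncoding("iso-8859-1"));
        string asUtf8 = Decode(ansiBytes, Encoding.UTF8);

        // The Latin-1 decoder recovers the letter.
        Console.WriteLine(asLatin1 == "\u00C4");
        // The UTF-8 decoder sees an incomplete sequence and substitutes U+FFFD.
        Console.WriteLine(asUtf8 == "\uFFFD");
    }
}
```

The bytes never change; only the decoder's interpretation does, which is why the squares appear.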
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
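A minimal sketch of what "letting StreamReader guess" looks like, with a made-up helper name. Keep in mind that the detection only looks for a byte order mark: a file saved as "Unicode" by Notepad has one, but a plain ANSI file does not, so for ANSI files the fallback encoding you pass in is what actually gets used.

```csharp
using System.IO;
using System.Text;

static class BomDetectionSketch
{
    // Hypothetical helper: reads the whole file, letting StreamReader switch
    // encodings if it finds a byte order mark at the start.
    // The fallback encoding is used only when the file has no BOM.
    public static string ReadWithBomDetection(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```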
For Swedish Å Ä Ö, the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like this:
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
    // ...
}
I solved my problem reading Portuguese characters by changing the source file in Notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8,true))
{
s = sr.ReadToEnd();
}
I'm also reading an exported file which contains French and German text. I used Encoding.GetEncoding("iso-8859-1"), true, which worked without any problems.
For Arabic, I used Encoding.GetEncoding(1256). It works well.
I had a similar problem with ProcessStartInfo and the property StandardOutputEncoding. I set it for German language console output to code page 850. This way I could read the output like ausführen instead of ausf�hren.

Saving source code formatting to TXT file

I am trying to extract the source code from a webpage and save it to a text file. However, I want to keep the formatting of the source code.
My code is below.
// this block fetches the source code from the URL entered.
private void buttonFetch_Click(object sender, EventArgs e)
{
using (WebClient webClient = new WebClient())
{
string s = webClient.DownloadString("http://www.ebay.com");
Clipboard.SetText(s, TextDataFormat.Text);
string[] lines = { s };
System.IO.File.WriteAllLines(@"C:\Users\user\Dropbox\Personal Projects\WriteLines.txt", lines);
MessageBox.Show(s.ToString(), "Source code",
MessageBoxButtons.OKCancel, MessageBoxIcon.Asterisk);
}
}
I would like the text file to show the source code as it is formatted in the Messagebox.
Messagebox screenshot:
Text file screenshot:
How would I go about getting the text document's formatting to be the same as in the Messagebox?
I agree with the comment, but I'll add just a note. If you open it in Notepad++, N++ will detect the line endings and display the file nicely for you. In Notepad++ you can go into the menu and change the line endings to Windows. If you then re-save it and open it in Notepad itself, it will look correct. The problem is that the base Notepad doesn't understand different line endings.
Hope it helps.
The problem is that the string you're downloading has LF-only line endings. The Windows standard is CRLF line endings. Windows Notepad is notoriously adamant about supporting only CRLF line endings. Other editors, including Visual Studio, correctly handle the LF-only versions.
You can convert the text to CRLF line endings easily enough:
string s = webClient.DownloadString("http://www.ebay.com");
string fixedString = s.Replace("\n", "\r\n");
System.IO.File.WriteAllText("filename", fixedString);
MessageBox.Show(fixedString, "Source code",
MessageBoxButtons.OKCancel, MessageBoxIcon.Asterisk);
Note also that it is not necessary to call ToString on a string.
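One caveat with a plain Replace("\n", "\r\n"): if the downloaded page already contains some CRLF line endings, each of those becomes CR CR LF. A sketch of a conversion that avoids this, using a regular expression (the helper name is invented for the example):

```csharp
using System.Text.RegularExpressions;

static class LineEndingSketch
{
    // Hypothetical helper: converts CRLF, lone CR, and lone LF to CRLF.
    // The alternation tries "\r\n" first, so existing CRLF pairs are
    // matched as a unit and are not doubled.
    public static string ToCrLf(string text)
    {
        return Regex.Replace(text, "\r\n|\r|\n", "\r\n");
    }
}
```

You would then pass LineEndingSketch.ToCrLf(s) to File.WriteAllText instead of the Replace result.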
Try this:
string[] lines = s.Split('\n');
System.IO.File.WriteAllLines(@"C:\Users\user\Dropbox\Personal Projects\WriteLines.txt", lines);

Irregular character/text encoding issue with writing back to file

I'm using this function to read text lines from a file:
string[] postFileLines = System.IO.File.ReadAllLines(pstPathTextBox.Text);
Inserting a few additional lines at strategic spots, then writing the text lines back to a file with:
TextWriter textW = new StreamWriter(filePath);
for (int i = 0; i < linesToWrite.Count; i++)
{
textW.WriteLine(linesToWrite[i]);
}
textW.Close();
This works perfectly well until the text file I am reading in contains an international or special character. When writing back to the file, I don't get the same character - it is a box.
Ex:
Before = W:\Contrat à faire aujourdhui\
After = W:\Contrat � faire aujourdhui\
This webpage shows it as a question mark, but in the text file it's a white rectangular box.
Is there a way to include the correct encoding in my application to be able to handle these characters? Or, if not, throw a warning saying it was not able to properly write given line?
Add the encoding like this:
File.ReadAllLines(path, Encoding.UTF8);
and
new StreamWriter(filePath, false, Encoding.UTF8);
Hope it helps.
Use this; it works for me:
string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
You can also use UTF-8 encoding when writing to the file. The encoding is set when the writer is created, not on each WriteLine call:
TextWriter textW = new StreamWriter(filePath, false, Encoding.UTF8);
You may need to use a single-byte character set.
Encoding.GetEncodings() lists every available encoding. (The "DOS" encodings are of type System.Text.SBCSCodePageEncoding.)
In your case you may need to use
File.ReadAllLines(path, Encoding.GetEncoding("IBM850"));
and
new StreamWriter(filePath, false, Encoding.GetEncoding("IBM850"));
Bonne journée! ;)
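Whichever encoding turns out to match the file, the important part is using the same one for reading and for writing. A sketch of that round trip, with a hypothetical helper name; which encoding to pass in depends on the file and is not something the code can know:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class RoundTripSketch
{
    // Hypothetical helper: reads all lines, inserts one line at the given
    // index, and writes everything back using the same encoding for both steps.
    public static void InsertLine(string path, int index, string newLine, Encoding encoding)
    {
        var lines = new List<string>(File.ReadAllLines(path, encoding));
        lines.Insert(index, newLine);
        File.WriteAllLines(path, lines, encoding);
    }
}
```

For example, RoundTripSketch.InsertLine(filePath, 0, "header", Encoding.GetEncoding("iso-8859-1")) keeps an "à" stored as a single byte intact.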

StreamReader is unable to correctly read extended character set (UTF8)

I am having an issue where I am unable to read a file that contains foreign characters. The file, I have been told, is encoded in UTF-8 format.
Here is the core of my code:
using (FileStream fileStream = fileInfo.OpenRead())
{
using (StreamReader reader = new StreamReader(fileStream, System.Text.Encoding.UTF8))
{
string line;
while (!string.IsNullOrEmpty(line = reader.ReadLine()))
{
hashSet.Add(line);
}
}
}
The file contains the word "achôcre" but when examining it during debugging it is adding it as "ach�cre".
(This is a profanity file so I apologize if you speak French. I for one, have no idea what that means)
The evidence clearly suggests that the file is not in UTF-8 format. Try System.Text.Encoding.Default and see if you get the correct text then — if you do, you know the file is in Windows-1252 (assuming that is your system default codepage). In that case, I recommend that you open the file in Notepad, then re-“Save As” it as UTF-8, and then you can use Encoding.UTF8 normally.
Another way to check what encoding the file is actually in is to open it in your browser. If the accents display correctly, then the browser has detected the correct character set — so look at the “View / Character set” menu to find out which one is selected. If the accents are not displaying correctly, then change the character set via that menu until they do.
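You can also check in code whether a file is really UTF-8. This is only a sketch with made-up names: it uses a UTF8Encoding that throws on invalid bytes instead of silently substituting the replacement character. Note that it cannot tell you which encoding the file is in if the check fails, and a file containing only ASCII will pass the check too.

```csharp
using System.IO;
using System.Text;

static class Utf8CheckSketch
{
    // Hypothetical helper: returns true if the bytes decode as UTF-8
    // without any invalid sequences.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // The second argument (throwOnInvalidBytes) makes the decoder throw
        // instead of substituting U+FFFD for bad input.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Convenience overload for a file on disk.
    public static bool IsValidUtf8File(string path)
    {
        return IsValidUtf8(File.ReadAllBytes(path));
    }
}
```

If IsValidUtf8File returns false for the word list, the file was most likely saved in a single-byte code page rather than UTF-8.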

Writing PHP in C# with a String Builder problem

I have the following C# code to produce a small PHP file. The reason I am doing this is to update 400 plus sites automatically. The sites are in PHP on a Windows Environment so using C# for utility apps is the easiest for me.
fileContents.AppendFormat("<?php{0}",Environment.NewLine);
fileContents.AppendFormat("# FileName=\"clientsite.php\"{0}",Environment.NewLine);
fileContents.AppendFormat("# HTTP=\"true\"{0}",Environment.NewLine);
fileContents.AppendFormat("$clientname = \"{0}\";{1}", clientsiteName, Environment.NewLine);
fileContents.AppendFormat("$version = \"v6.2i\";{0}",Environment.NewLine);
fileContents.Append("?>");
The end result of this file causes a strange character to appear on the PHP page that includes this page. When I manually open the created PHP file - press backspace on the last line then enter it works. Is there something better than Environment.NewLine to use for this? Or is there another problem I am missing?
EDIT: The character looks like something I can't reproduce on the keyboard (a squiggly line), but it ends with ?
You could just try "\n", I believe Environment.NewLine is "\r\n".
But it could also be about how you write the StringBuilder (I assume fileContents is a StringBuilder) to the file. If you e.g. use WriteAllText, you could try using different encoding.
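On the encoding point: the thread doesn't confirm the cause, but one possibility worth ruling out is a UTF-8 byte order mark. Passing Encoding.UTF8 to File.WriteAllText or a StreamWriter writes the three bytes EF BB BF at the start of the file, and PHP can send those to the browser as stray characters. A sketch of writing without the BOM, assuming fileContents is a StringBuilder and using an invented helper name:

```csharp
using System.IO;
using System.Text;

static class NoBomSketch
{
    // Hypothetical helper: writes text as UTF-8 without the three-byte
    // byte order mark (EF BB BF) at the start of the file.
    public static void WriteUtf8NoBom(string path, string contents)
    {
        File.WriteAllText(path, contents, new UTF8Encoding(false));
    }
}
```

Usage would be something like NoBomSketch.WriteUtf8NoBom(outputPath, fileContents.ToString()), where outputPath is whatever path the utility writes to.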
