Problem with Encoding in FileOpen and StringBuilder in i18n - C#

When I look at the file I have saved with i18n, it is fine: for example, it contains "Fïll Âll ülle~", which is what I want. But the code I am using to read the contents of this file and return them as a string looks like this:
string sLine = String.Empty;
StringBuilder sHTMLText = new StringBuilder();
int nFileHandle = FileSystem.FreeFile();
sHTMLText.Append(String.Empty);
FileSystem.FileOpen(nFileHandle, sFileName, OpenMode.Input, OpenAccess.Default, OpenShare.Default, -1);
while (!FileSystem.EOF(nFileHandle))
{
    sLine = FileSystem.LineInput(nFileHandle);
    sHTMLText.Append(sLine);
}
FileSystem.FileClose(nFileHandle);
return sHTMLText.ToString();
But when I debug it, the correctly translated text like "Fïll Âll ülle~" comes back corrupted, so I think my method is not reading the file in a way that honors its encoding (my computer's Regional/Language Settings are set to French). How can I correct my existing code, or write something similar, so that it respects the encoding and the language set on my computer?
Thanks

Have a look at http://msdn.microsoft.com/en-us/library/ms143456.aspx and use a StreamReader with the correct encoding.
hth
Mario

If you are trying to read a file that was saved in a non-Unicode encoding, then you must specify exactly what that encoding was when you open the file.
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        // Open the file with the encoding it was actually saved in (Windows-1252 here).
        using (StreamReader reader = new StreamReader(@"C:\myfile.txt", Encoding.GetEncoding(1252)))
        {
            // read the file with the reader object.
        }
    }
}
Once you specify the encoding, the file will automatically be translated into the internal string format (UTF-16 LE) when it is read. Note that the conversion of a valid file in a legacy character encoding into Unicode will always succeed with no difficulties if the encoding is specified correctly. Saving a file in a legacy encoding is more problematic: it requires that all of the source characters map to the legacy encoding, or that an appropriate fallback mechanism is in place.
Using Unicode exclusively throughout the system will tend to make things easier going forward. Relying on the default system encoding to be set correctly creates a hidden configuration dependency that can cause problems during migrations, in distributed applications, and in other circumstances.
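As an illustration of the fallback point above, here is a minimal sketch (the file name and sample text are placeholders) of saving text to a legacy code page with an explicit replacement fallback, so characters that don't map are substituted with "?" rather than silently mishandled:

using System.IO;
using System.Text;

class SaveLegacyExample
{
    static void Main()
    {
        // Windows-1252 encoder that replaces unmappable characters with '?'.
        // Swap in EncoderExceptionFallback to fail loudly instead.
        Encoding legacy = Encoding.GetEncoding(
            1252,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));

        using (StreamWriter writer = new StreamWriter(@"C:\myfile.txt", false, legacy))
        {
            writer.WriteLine("Fïll Âll ülle~"); // this sample maps cleanly to 1252
        }
    }
}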

Related

C# .csv-file in WinForm with Ä, Ö, Ü [duplicate]

I'm using the code below to read a text file that contains foreign characters. The file is ANSI-encoded and looks fine in Notepad. The code below doesn't work: when the file's values are read and shown in the DataGrid, the characters appear as squares. Could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all encodings found under System.Text.Encoding and all fail to show the file correctly.
Update 2: I've changed the file encoding (re-saved the file) to Unicode and used System.Text.Encoding.Unicode, and it worked just fine. So why did Notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI code page.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you try Notepad's "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding Notepad guessed the file uses.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be a more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too, if your OS default encoding matches the file encoding:
Encoding.Default;
Yes, it could be an issue with the actual encoding of the file, probably Unicode. Try UTF-8, as that is the most common form of Unicode encoding. Otherwise, if the file is ASCII, then the standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file, just as a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
For Swedish Å Ä Ö, the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
    // ...
}
I solved my problem reading Portuguese characters by changing the source file's encoding in Notepad++.
C#
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8, true))
{
    s = sr.ReadToEnd();
}
I'm also reading an exported file which contains French and German text. I used Encoding.GetEncoding("iso-8859-1") with detectEncodingFromByteOrderMarks set to true, which worked without any problems.
For Arabic, I used Encoding.GetEncoding(1256). It works well.
I had a similar problem with ProcessStartInfo and its StandardOutputEncoding property. For German console output I set it to code page 850; this way I could read output like ausführen instead of ausf�hren.
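A minimal sketch of that approach (the command line is just an example; code page 850 is the OEM code page commonly used by German consoles):

using System;
using System.Diagnostics;
using System.Text;

class ConsoleEncodingExample
{
    static void Main()
    {
        ProcessStartInfo psi = new ProcessStartInfo("cmd.exe", "/c dir")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false, // required when redirecting streams
            // Decode the child process's output as code page 850 (OEM Latin-1).
            StandardOutputEncoding = Encoding.GetEncoding(850)
        };

        using (Process process = Process.Start(psi))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            Console.WriteLine(output);
        }
    }
}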

Using File.ReadAllLines from embedded text file

I have been applying what I have learned so far in Bob Tabor's absolute beginners series, and I wrote a small console word game for my daughter that requires me to generate a random 5-letter word.
I was previously using File.ReadAllLines(path) to generate a string array from a text file (wordlist.txt) on my system and Random.Next to generate the index I would pull from the array.
I learned from some posts here how to embed the file as a resource, but now I am unable to find the syntax to point to it (the path). Or do I have to access it differently now that it is embedded?
Thanks in advance
Without a good, minimal, complete code example it is impossible to offer specific advice.
However, the basic issue is this: when you embed a file as a resource, it is no longer a file. That is, the original file still exists, but the resource itself is not a file in any way. It is stored as some specific kind of data in your assembly; resources embedded from file sources generally wind up as binary data objects.
How to use this data depends on what you mean by "embed". There are actually two common ways to store resources in a C# program: you can use the "Resources" object in the project, which exposes the resource via the project's ...Properties.Resources class (which in turn uses the ResourceManager class in .NET). Or you can simply add the file to the project itself, and select the "Embedded Resource" build option.
If you are using the "Resources" designer, then there are a couple of different ways you might have added the file. One is to use the "New Text File..." option, which allows you to essentially copy/paste or type new text into a resource. This is exposed in code as a string property on the Properties.Resources object. The same thing will happen if you add the resource using the "Existing File..." option and select a file that Visual Studio recognizes as a text file.
Otherwise, the file will be included as a byte[] object exposed by a property in the Properties.Resources class.
If you have used the "Embedded Resource" build option instead of the "Resources" designer, then your data will be available by calling Assembly.GetManifestResourceStream(string) method, which returns a Stream object. This can be wrapped in StreamReader to allow it to be read line-by-line.
Direct replacements for the File.ReadAllLines(string) approach would look something like the following…
Using "Embedded Resource":
string[] ReadAllResourceLines(string resourceName)
{
    // Manifest resource names are namespace-qualified, e.g. "MyApp.wordlist.txt".
    using (Stream stream = Assembly.GetEntryAssembly()
        .GetManifestResourceStream(resourceName))
    using (StreamReader reader = new StreamReader(stream))
    {
        return EnumerateLines(reader).ToArray();
    }
}

IEnumerable<string> EnumerateLines(TextReader reader)
{
    // Yield one line at a time until the end of the stream.
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        yield return line;
    }
}
Using Properties.Resources:
You can do something similar when using the Properties.Resources class. It looks almost identical:
string[] ReadAllResourceLines(string resourceText)
{
    using (StringReader reader = new StringReader(resourceText))
    {
        return EnumerateLines(reader).ToArray();
    }
}
called like string[] allLines = ReadAllResourceLines(Properties.Resources.MyTextFile);, where MyTextFile is the property name for the resource you added in the designer (i.e. the string you pass in that second example is the text of the file itself, not the name of the resource).
If you added an existing file that Visual Studio didn't recognize as a text file, then the property type will be byte[] instead of string and you'll need yet another slightly different approach:
string[] ReadAllResourceLines(byte[] resourceData)
{
    using (Stream stream = new MemoryStream(resourceData))
    using (StreamReader reader = new StreamReader(stream))
    {
        return EnumerateLines(reader).ToArray();
    }
}
Note that in all three examples, the key is that the data winds up wrapped in a TextReader implementation, which is then used to read each line individually, to populate an array. These all use the same EnumerateLines() helper method I show above.
Of course, now that you see how the data can be retrieved, you can adapt that to use the data in a variety of other ways, in case for example you don't really want or need the text represented as an array of string objects.
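To tie this back to the original question, picking a random five-letter word from the embedded list might look like this (the resource name "MyApp.wordlist.txt" is hypothetical; manifest resource names are prefixed with the project's default namespace):

// Load all words from the embedded resource, then pick one at random.
string[] words = ReadAllResourceLines("MyApp.wordlist.txt");
Random random = new Random();
string word = words[random.Next(words.Length)];
Console.WriteLine(word);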
If you are using the Resources file and added a text file, you could use
string text = Properties.Resources.<ResourceName>;
Here Resources is the default resource class for your project. If you have added a custom resource file, you can use its name instead of Properties.Resources.
If your content is a file, it is represented as a byte[]; in your case, since you included a text file, it will be a string.
For any other file you can convert the content to text (if it is text) with
string text = Encoding.ASCII.GetString(Properties.Resources.<ResourceName>);
If your file has some other encoding (such as UTF-8 or another Unicode encoding), you can use UTF8 or the corresponding class under Encoding.
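For instance, a UTF-8 variant of that conversion might look like this (the resource name MyTextFile is hypothetical; note that GetString does not strip a byte order mark, so a stray leading character can appear if the file was saved with a BOM):

// Decode a byte[] resource as UTF-8 instead of ASCII.
string text = Encoding.UTF8.GetString(Properties.Resources.MyTextFile);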

non-English character not translating correctly in console app

Environment: Visual Studio 2008 SP1
I have the following line in my text file:
using (var reader = File.OpenText(@"c:\temp\DATA.txt"))
{
    ...
    string textLine = "ist where [name]='Curaçao')"
}
Please notice the non-English character.
Whenever reader.ReadLine() gets to this point, it turns the character into a question mark in my console application.
Any ideas how to preserve that?
You should specify the character set when constructing the reader. The console's default code page, however, may not support non-ASCII characters!
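If the console is the problem, one option (a sketch; it assumes the console font can actually render the characters) is to switch the console's output encoding before writing:

using System;
using System.Text;

class ConsoleOutputExample
{
    static void Main()
    {
        // Emit UTF-8 instead of the console's default OEM code page.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("ist where [name]='Curaçao')");
    }
}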
This is most likely an encoding issue - the reader is using a different encoding than the one the file is in.
Make sure both are using the same encoding.
File.OpenText will use the UTF8Encoding - if your file is in a different encoding, this may very well be the issue.
To specify an encoding, construct StreamReader with a constructor that takes an Encoding parameter:
using (var reader = new StreamReader(@"c:\temp\DATA.txt",
    Encoding.GetEncoding(860)))
{
    ...
    string textLine = "ist where [name]='Curaçao')"
}
In the above example, I am using the Portuguese encoding.

StreamReader is unable to correctly read extended character set (UTF8)

I am having an issue where I am unable to read a file that contains foreign characters. The file, I have been told, is encoded in UTF-8 format.
Here is the core of my code:
using (FileStream fileStream = fileInfo.OpenRead())
{
    using (StreamReader reader = new StreamReader(fileStream, System.Text.Encoding.UTF8))
    {
        string line;
        while (!string.IsNullOrEmpty(line = reader.ReadLine()))
        {
            hashSet.Add(line);
        }
    }
}
The file contains the word "achôcre" but when examining it during debugging it is adding it as "ach�cre".
(This is a profanity file so I apologize if you speak French. I for one, have no idea what that means)
The evidence clearly suggests that the file is not in UTF-8 format. Try System.Text.Encoding.Default and see if you get the correct text then; if you do, you know the file is in Windows-1252 (assuming that is your system default code page). In that case, I recommend that you open the file in Notepad, re-save it as UTF-8 via "Save As", and then you can use Encoding.UTF8 normally.
Another way to check what encoding the file is actually in is to open it in your browser. If the accents display correctly, then the browser has detected the correct character set — so look at the “View / Character set” menu to find out which one is selected. If the accents are not displaying correctly, then change the character set via that menu until they do.

C# - Detecting encoding in a file, write change to file using the found encoding

I wrote a small program for iterating through a lot of files and applying some changes where a certain string match is found. The problem I have is that different files have different encodings, so what I would like to do is check the encoding, then overwrite the file in its original encoding.
What would be the prettiest way of doing that in C# / .NET 2.0?
My code looks very simple as of now:
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen did an excellent blog post on this subject that is worth reading:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is:
If the BOM (byte order mark) exists, then you're golden
Else it's guesswork and heuristics
However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than re-inventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i]))
{
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will read it as UTF-8. I always have problems with ANSI.
My trick is to read the file as a stream, force it to be read as UTF-8, and check for the usual characters that should be in the text. If they are found, it's UTF-8; otherwise it's ANSI. Then I tell the user they can use just two encodings, either ANSI or UTF-8. Auto-detection doesn't quite work for my language :p
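A minimal sketch of that kind of check (the general approach, not anyone's exact code): decode the raw bytes with a strict UTF-8 decoder that throws on invalid sequences, and fall back to the ANSI code page if decoding fails. Note that pure ASCII text passes the UTF-8 check too, which is harmless since the bytes mean the same thing in both.

using System.IO;
using System.Text;

class DetectUtf8Example
{
    // Returns UTF-8 if the bytes form valid UTF-8, otherwise the system ANSI code page.
    static Encoding GuessEncoding(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // throwOnInvalidBytes = true makes GetString throw a
        // DecoderFallbackException on any invalid UTF-8 sequence.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return Encoding.Default; // system ANSI code page
        }
    }
}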
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here.
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself). Both these methods will only help auto-detect UTF-based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
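As a sketch of the Encoding.GetPreamble route mentioned above (the candidate list is illustrative), you can compare the first bytes of the file against each candidate encoding's preamble:

using System.IO;
using System.Text;

class BomSniffExample
{
    // Returns the encoding whose BOM matches the start of the file, or null if none does.
    static Encoding DetectByBom(string path)
    {
        // UTF-32 LE must be checked before UTF-16 LE because its BOM (FF FE 00 00)
        // starts with the UTF-16 LE BOM (FF FE).
        Encoding[] candidates =
        {
            Encoding.UTF8,             // EF BB BF
            Encoding.UTF32,            // FF FE 00 00 (UTF-32 LE)
            Encoding.Unicode,          // FF FE (UTF-16 LE)
            Encoding.BigEndianUnicode  // FE FF (UTF-16 BE)
        };

        byte[] head = new byte[4];
        int read;
        using (FileStream stream = File.OpenRead(path))
        {
            read = stream.Read(head, 0, head.Length);
        }

        foreach (Encoding candidate in candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (bom.Length == 0 || bom.Length > read)
                continue;

            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match)
                return candidate;
        }
        return null; // no BOM; fall back to heuristics or a configured default
    }
}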
Probably a bit late, but I encountered the same problem myself. Using the previous answers, I found a solution that works for me: it reads in the text using StreamReader's default encoding, extracts the encoding used for that file, and uses StreamWriter to write it back with the changes using the found encoding. It also removes and re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";

var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);

using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding;
}

bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}

if (changedValue)
{
    using (StreamWriter write = new StreamWriter(file, false, encoding))
    {
        write.Write(text);
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding from the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
