How do I read chars from other countries such as ß ä? - c#

The following code reads all chars, including chars such as 0x0D.
StreamReader srFile = new StreamReader(gstPathFileName);
int iReadLength = 100;
char[] acBuf = new char[iReadLength];
while (srFile.Peek() >= 0) {
    // Read returns the number of chars actually read; use it so the last
    // chunk doesn't pick up stale buffer contents.
    int iCharsRead = srFile.Read(acBuf, 0, iReadLength);
    string s = new string(acBuf, 0, iCharsRead);
}
But it does not correctly interpret chars such as ß and ä.
I don't know what encoding the file uses. It was exported (into a .txt file) by code written more than 20 years ago against a C-Tree database.
The ß ä display fine with Notepad.

By default, the StreamReader constructor assumes the UTF-8 encoding (which is the de facto universal standard today). Since that's not decoding your file correctly, your characters (ß, ä) suggest that it's probably encoded using Windows-1252 (Western European):
var encoding = Encoding.GetEncoding("Windows-1252");
using (StreamReader srFile = new StreamReader(gstPathFileName, encoding))
{
    // ...
}
A closely-related encoding is ISO/IEC 8859-1. If the above gives some unexpected results, use Encoding.GetEncoding("ISO-8859-1") instead.
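One caveat worth knowing: on .NET Core and .NET 5+, Windows-1252 and the other legacy codepages are not available by default; you need the System.Text.Encoding.CodePages NuGet package and a one-time provider registration (on classic .NET Framework this step is unnecessary). A minimal sketch:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Without this registration, Encoding.GetEncoding("Windows-1252")
        // throws on .NET Core / .NET 5+ (the provider lives in the
        // System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var encoding = Encoding.GetEncoding("Windows-1252");
        Console.WriteLine(encoding.CodePage); // 1252
    }
}
```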

Related

Get binary representation of ASCII symbol (C#)

Sorry for asking a question like that, but I'm really stuck.
I have this method for reading data from file:
public void ReadFromFile()
{
    string fileName = @"my .txt file path";
    List<char> encoded = new List<char>();
    List<byte> converted = new List<byte>();
    using (StreamReader sr = new StreamReader(fileName))
    {
        string line = sr.ReadToEnd();
        string[] lines = line.Split('\n');
        foreach (var v in lines[2])
        {
            encoded.Add(v); // just get data I need
        }
    }
}
Now in encoded I have the F and # symbols.
I want to get 01000110 (the F representation) and 01000000 (the # representation).
I tried to convert every item in List<char> encoded into a byte and then use Convert.ToString(value, 2).
But that didn't work; I get the error "Value was either too large or too small for an unsigned byte."
In the output file I have something like this:
s,01;w,000;e,1;t,001; // dictionary of character and its code
6 // number of zeros
F# // encoded string
So what I want to do is to DECODE this thing into the input string (that is 'sweet'). For this, I need to decode F# into 0100011001000000
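No answer was captured for this question, but the conversion described can be done by treating each character's code point as an int and formatting it in base 2, which avoids the unsigned-byte range error. A sketch (note one detail: '#' is 0x23, which is 00100011; the bit pattern 01000000 actually corresponds to '@'):

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string encoded = "F#";
        // Cast each char to its integer code point, format it in base 2,
        // and left-pad to 8 bits: 'F' (0x46) -> "01000110".
        string bits = string.Concat(
            encoded.Select(c => Convert.ToString((int)c, 2).PadLeft(8, '0')));
        Console.WriteLine(bits); // 0100011000100011
    }
}
```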

UTF8 Character lost when written to file

I am creating an application to scan and merge CSV files. I am having an issue when writing the data to a new file. One of the fields has the ö character, which is maintained until I write it to the new file. It then becomes the "actual" value ö instead of the "expected" value ö.
I am suspecting that UTF8 Encoding is not the best thing to use but have yet to find a better working method. Any help with this would be much appreciated!
byte[] nl = new UTF8Encoding(true).GetBytes("\n");
using (FileStream file = File.Create(filepath))
{
    string text;
    byte[] info;
    for (int r = 0; r < data.Count; r++)
    {
        int c = 0;
        for (; c < data[r].Count - 1; c++)
        {
            text = data[r][c] + @",";
            text = text.Replace("\n", @"");
            text = text.Replace(@"☼", @"""");
            info = new UTF8Encoding(true).GetBytes(text);
            file.Write(info, 0, text.Length);
        }
        text = data[r][c];
        info = new UTF8Encoding(true).GetBytes(text);
        file.Write(info, 0, text.Length);
        file.Write(nl, 0, nl.Length);
    }
}
I might be mistaken and this should probably go in a comment, but I can't comment yet. Text editors decode the file's binary data using some particular encoding, so to verify the bytes you are actually writing out, check them in a hex editor. Notepad++ has a hex editor plugin you could use.
BinaryWriter is easier to work with when it comes to writing bytes to a file. You can also set the encoding of the BinaryWriter; you'll want to set it to UTF-8.
Edit
I forgot to mention. When you write out to bytes you are going to want to read in as bytes as well. Use BinaryReader and set the encoding to UTF-8.
Once you read the Bytes in use Encoding.UTF8.GetString() to convert the bytes into a string.
You might be truncating the output, since UTF-8 is a multibyte encoding.
Don't do this:
info = new UTF8Encoding(true).GetBytes(text);
file.Write(info, 0, text.Length);
Instead use info.Length.
info = new UTF8Encoding(true).GetBytes(text);
file.Write(info, 0, info.Length); // change this line

UTF-8 File data to ANSII

I have UTF-8 files (with Swedish äåö characters). I read those as:
List<MyData> myDataList = new List<MyData>();
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.Default);
foreach (string line in allLines)
{
    MyData myData = new MyData();
    string[] words = line.Split(';');
    myData.ID = words[0];
    myData.Name = words[1];
    myData.Age = words[2];
    myData.Date = words[3];
    myData.Score = words[4];
    //Do something...
    myDataList.Add(myData);
}
StringBuilder sb = new StringBuilder();
foreach (MyData data in myDataList)
{
    sb.AppendLine(string.Format("{0},{1},{2},{3},{4}",
        data.ID,
        data.Name,
        data.Age,
        data.Date,
        data.Score));
}
File.WriteAllText("output.txt", sb.ToString(), Encoding.ASCII);
I get output.txt file in ansii but not with Swedish characters. Can someone help me to know how can I save file data from UTF-8 to Ansii? Thanks.
What you probably mean by "ANSII"¹ is the codepage Windows-1252, used by most Western European countries.
At the moment, you are reading the file in your system default encoding, which is probably Windows-1252, and writing it as ASCII, which defines only the first 128 characters and does not include any non-English characters (such as äåö):
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.Default);
...
File.WriteAllText("output.txt", sb.ToString(), Encoding.ASCII);
Both of these are wrong. If you want to convert your file from UTF-8 to Windows-1252, you need to read it as UTF-8 and write it as Windows-1252, i.e.
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.UTF8);
...
File.WriteAllText("output.txt", sb.ToString(), Encoding.GetEncoding(1252));
¹ It is spelled ANSI; but even that is not entirely correct (quote from Wikipedia):
Historically, the phrase “ANSI Code Page” (ACP) is used in Windows to refer to various code pages considered as native. The intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard. Microsoft-affiliated bloggers now state that “The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community.”
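As a quick illustration of why the ASCII output loses the letters entirely: the default ASCII encoder replaces every character above code point 127 with '?'. A sketch:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Encoding.ASCII uses a '?' replacement fallback for any character
        // outside the 7-bit range, so the Swedish letters are destroyed.
        byte[] bytes = Encoding.ASCII.GetBytes("åäö");
        string roundTripped = Encoding.ASCII.GetString(bytes);
        Console.WriteLine(roundTripped); // ???
    }
}
```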
Currently you are writing the file in ASCII, which is very limited and cannot represent those Swedish characters. I would recommend trying this:
System.IO.File.WriteAllText(path, text, Encoding.GetEncoding(28603));
This writes the file with codepage 28603 (ISO 8859-13, Latin-7), which also covers the Swedish letters. I would recommend the Wikipedia article: ISO 8859

Reading a CSV file containing greek characters

I am trying to read the data from a CSV file using the following:
var lines = File.ReadAllLines(@"c:\test.csv").Select(a => a.Split(';'));
It works but the fields that contain words are written with Greek charactes and they are presented as symbols.
How can I set the Encoding correctly in order to read those greek characters?
ReadAllLines has an overload that takes an Encoding along with the file path:
var lines = File.ReadAllLines(@"c:\test.csv", Encoding.Unicode)
                .Select(line => line.Split(';'));
Testing:
File.WriteAllText(@"c:\test.csv", "ϗϡϢϣϤ", Encoding.Unicode);
Console.WriteLine(File.ReadAllText(@"c:\test.csv", Encoding.Unicode));
will print:
ϗϡϢϣϤ
To find out which encoding the file was actually written in, use the following snippet:
using (var r = new StreamReader(@"c:\test.csv", detectEncodingFromByteOrderMarks: true))
{
    r.Peek(); // the encoding is only detected after the first read
    Console.WriteLine(r.CurrentEncoding.BodyName);
}
for my scenario it will print
utf-8

How to read utf-8 encoded string in C#?

My scenario is:
Create an email in Outlook Express and save it as .eml file;
Read the file as string in C# console application;
I'm saving the .eml file encoded in utf-8. An example of text I wrote is:
'Goiânia é badalação.'
There are special characters like â, é, ç and ã; these are Portuguese characters.
When I open the file with notepad++ the text is shown like this:
'Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.'
If I open it in Outlook Express again, it is shown normally, like the first way.
When I read the file in my console application using UTF-8 decoding, the string comes out the second way.
The code I using is:
string text = File.ReadAllText(@"C:\fromOutlook.eml", Encoding.UTF8);
Console.WriteLine(text);
I tried all the Encoding options and a lot of methods I found on the web, but nothing works.
Can someone help me do this simple conversion?
'Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.'
to
'Goiânia é badalação.'
string text = "Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.";
byte[] bytes = new byte[text.Length * sizeof(char)];
System.Buffer.BlockCopy(text.ToCharArray(), 0, bytes, 0, bytes.Length);
char[] chars = new char[bytes.Length / sizeof(char)];
System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
Console.WriteLine(new string(chars));
In this UTF-8 table you can see the hex values of these characters; for example, 'é' is 'c3 a9':
http://www.utf8-chartable.de/
Thanks.
var input = "Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.";
var buffer = new List<byte>();
var i = 0;
while (i < input.Length)
{
    var character = input[i];
    if (character == '=')
    {
        var part = input.Substring(i + 1, 2);
        buffer.Add(byte.Parse(part, System.Globalization.NumberStyles.HexNumber));
        i += 3;
    }
    else
    {
        buffer.Add((byte)character);
        i++;
    }
}
var output = Encoding.UTF8.GetString(buffer.ToArray());
Console.WriteLine(output); // prints: Goiânia é badalação.
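One caveat about the loop above: it handles only =XX escapes. Real quoted-printable mail also uses soft line breaks (a '=' immediately before the line ending) to wrap lines longer than 76 characters. A variant that strips those first, per RFC 2045 (the helper name is mine):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static string DecodeQuotedPrintable(string input)
    {
        // Per RFC 2045 §6.7, "=" followed by a line break is a soft line
        // break and is simply deleted before decoding.
        input = input.Replace("=\r\n", "").Replace("=\n", "");
        var buffer = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            if (input[i] == '=' && i + 2 < input.Length)
            {
                // "=XX" encodes the byte with hex value XX.
                buffer.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 3;
            }
            else
            {
                buffer.Add((byte)input[i]);
                i++;
            }
        }
        return Encoding.UTF8.GetString(buffer.ToArray());
    }

    static void Main()
    {
        string wrapped = "Goi=C3=A2nia =C3=A9 badala=C3=A7=\r\n=C3=A3o.";
        Console.WriteLine(DecodeQuotedPrintable(wrapped)); // prints: Goiânia é badalação.
    }
}
```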
Knowing the problem is quoted printable, I found a good decoder here:
http://www.dpit.co.uk/2011/09/decoding-quoted-printable-email-in-c.html
This works for me.
Thanks folks.
Update:
The above link is dead; here is a working alternative:
How to convert Quoted-Print String
