How to read utf-8 encoded string in C#?

My scenario is:
Create an email in Outlook Express and save it as an .eml file;
Read the file as a string in a C# console application.
I'm saving the .eml file encoded in UTF-8. An example of the text I wrote is:
'Goiânia é badalação.'
It contains special characters such as âéçã, which are Portuguese characters.
When I open the file with Notepad++, the text is shown like this:
'Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.'
If I open it in Outlook Express again, it is shown normally, as in the first form.
When I read the file in the console application using UTF-8 decoding, the string is shown in the second form.
The code I'm using is:
string text = File.ReadAllText(@"C:\fromOutlook.eml", Encoding.UTF8);
Console.WriteLine(text);
I tried all the Encoding options and a lot of methods I found on the web, but nothing works.
Can someone help me do this simple conversion?
'Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.'
to
'Goiânia é badalação.'
string text = "Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.";
byte[] bytes = new byte[text.Length * sizeof(char)];
System.Buffer.BlockCopy(text.ToCharArray(), 0, bytes, 0, bytes.Length);
char[] chars = new char[bytes.Length / sizeof(char)];
System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
Console.WriteLine(new string(chars));
In this UTF-8 table you can see the hex value of these characters, for example 'é' == 'c3 a9':
http://www.utf8-chartable.de/
Thanks.

var input = "Goi=C3=A2nia =C3=A9 badala=C3=A7=C3=A3o.";
var buffer = new List<byte>();
var i = 0;
while (i < input.Length)
{
    var character = input[i];
    if (character == '=')
    {
        // "=XX" is one byte written as two hex digits
        var part = input.Substring(i + 1, 2);
        buffer.Add(byte.Parse(part, System.Globalization.NumberStyles.HexNumber));
        i += 3;
    }
    else
    {
        // any other character is plain ASCII and is copied as-is
        buffer.Add((byte)character);
        i++;
    }
}
// the collected bytes are UTF-8, so decode them as such
var output = Encoding.UTF8.GetString(buffer.ToArray());
Console.WriteLine(output); // prints: Goiânia é badalação.
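
The loop above assumes every '=' is followed by two hex digits. In a real .eml file, quoted-printable also uses '=' at the end of a line as a soft line break, which would make that loop throw. The following is a minimal sketch that handles both cases; the class and method names are made up for illustration, and it assumes the input has already been read as text and contains only ASCII characters, as quoted-printable requires.

```csharp
using System;
using System.IO;
using System.Text;

public static class QuotedPrintableSketch
{
    // Decodes quoted-printable text whose payload bytes use the given encoding.
    public static string Decode(string input, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            int i = 0;
            while (i < input.Length)
            {
                char c = input[i];
                if (c != '=')
                {
                    // Literal character: assumed to be plain ASCII.
                    buffer.WriteByte((byte)c);
                    i++;
                }
                else if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n')
                {
                    // Soft line break "=\r\n": produces no output.
                    i += 3;
                }
                else if (i + 1 < input.Length && input[i + 1] == '\n')
                {
                    // Soft line break with a bare "\n".
                    i += 2;
                }
                else if (i + 2 < input.Length
                         && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    // "=XX": one byte written as two hex digits.
                    buffer.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 3;
                }
                else
                {
                    // Malformed escape: keep the '=' literally instead of throwing.
                    buffer.WriteByte((byte)'=');
                    i++;
                }
            }
            return encoding.GetString(buffer.ToArray());
        }
    }
}
```

Calling QuotedPrintableSketch.Decode(text, Encoding.UTF8) on the sample string from the question should give back the original Portuguese text. A complete solution would also read the charset from the message's Content-Type header rather than assuming UTF-8.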

Knowing the problem is quoted printable, I found a good decoder here:
http://www.dpit.co.uk/2011/09/decoding-quoted-printable-email-in-c.html
This works for me.
Thanks folks.
Update:
The above link is dead, here is a workable application:
How to convert Quoted-Print String

Related

UTF8 Character lost when written to file

I am creating an application to scan and merge CSV files. I am having an issue when writing the data to a new file. One of the fields contains the ö character, which is preserved until I write it to the new file. There, the actual value is Ã¶ instead of the expected value ö.
I suspect that UTF-8 encoding is not the best thing to use, but I have yet to find a better working method. Any help with this would be much appreciated!
byte[] nl = new UTF8Encoding(true).GetBytes("\n");
using (FileStream file = File.Create(filepath))
{
    string text;
    byte[] info;
    for (int r = 0; r < data.Count; r++)
    {
        int c = 0;
        for (; c < data[r].Count - 1; c++)
        {
            text = data[r][c] + @",";
            text = text.Replace("\n", @"");
            text = text.Replace(@"☼", @"""");
            info = new UTF8Encoding(true).GetBytes(text);
            file.Write(info, 0, text.Length);
        }
        text = data[r][c];
        info = new UTF8Encoding(true).GetBytes(text);
        file.Write(info, 0, text.Length);
        file.Write(nl, 0, nl.Length);
    }
}

I might be mistaken, and this should probably be a comment, but I can't comment yet. Text editors decode the binary data using a particular encoding, so what you see depends on the encoding the editor picks. To check what you are actually writing to the file, look at the raw bytes in a hex editor. Notepad++ has a hex editor plug-in that you could use.
BinaryWriter is easier to work with when it comes to writing bytes to a file. You can also set the encoding of the BinaryWriter; you'll want to set it to UTF-8.
Edit
I forgot to mention: when you write the data out as bytes, you will want to read it back in as bytes as well. Use BinaryReader and set the encoding to UTF-8.
Once you have read the bytes in, use Encoding.UTF8.GetString() to convert them into a string.
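
To illustrate the point about editors and encodings, here is a small sketch (the helper names are hypothetical) that dumps the bytes of a string as hex and decodes the same bytes with two different encodings. It suggests that Ã¶ is what correctly written UTF-8 bytes for ö look like when a viewer decodes them as a single-byte Western European encoding; ISO-8859-1 is used here because it is available without extra setup.

```csharp
using System;
using System.Text;

public static class MojibakeSketch
{
    // Formats bytes as hyphen-separated upper-case hex pairs.
    public static string HexDump(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Decodes the same bytes using the named encoding.
    public static string DecodeAs(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

For example, with byte[] raw = Encoding.UTF8.GetBytes("ö"), HexDump(raw) gives "C3-B6", DecodeAs(raw, "utf-8") gives "ö", and DecodeAs(raw, "ISO-8859-1") gives "Ã¶". If the hex dump of the output file shows C3 B6, the file is fine and the viewer is using the wrong encoding.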

You might be truncating the output since UTF-8 is multibyte.
Don't do this:
info = new UTF8Encoding(true).GetBytes(text);
file.Write(info, 0, text.Length);
Instead use info.Length.
info = new UTF8Encoding(true).GetBytes(text);
file.Write(info, 0, info.Length); // change this line
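
The reason info.Length matters is that a string's Length counts UTF-16 chars, while the stream needs the number of encoded bytes. A minimal sketch of the difference, with hypothetical helper names:

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8LengthSketch
{
    // Encodes one field as UTF-8 bytes.
    public static byte[] EncodeField(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Writes the whole encoded field, using the byte count rather than the char count.
    public static void WriteField(Stream output, string text)
    {
        byte[] info = EncodeField(text);
        output.Write(info, 0, info.Length);
    }
}
```

For the text "Köln," the string has 5 chars but EncodeField returns 6 bytes, because ö takes two bytes in UTF-8. Writing only text.Length bytes would therefore drop the trailing comma.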

CR/LF Encoding to PDF file

I just ran into a problem I haven't seen before. In short, I need to send two strings to a method that validates whether they are the same.
One of the strings looks like this:
Sample 1
JVBERi0xLjQNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFu
ZyhkYS1ESykgL1N0cnVjdFRyZWVSb290IDU3IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4v
The second one looks like this:
Sample 2 ZyhkYS1ESykgL1N0cnVjdFRyZWVSb290IDU3IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4vZyhkYS1ESykgL1N0cnVjdFRyZWVSb290IDU3IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4v
The actual string is a PDF document encoded as Base64 (this is only part of it).
I opened Sample 1 in Notepad++ with "show all characters" enabled, and it shows CRLF at the end of each line.
Now I need Sample 2 to look like Sample 1, so I need to produce the same line-broken format when reading a file. Is this possible?
To sum up, here is what I want to do:
1. Take a PDF.
2. Convert it to Base64 with CR/LF line breaks.
3. Pass it to a validation method in another library, which requires this format.

Well, I didn't find a nice way to create the CR/LF split, so I did it manually:
byte[] bytes = System.IO.File.ReadAllBytes(@"C:\Testdata\SSVALID.pdf");
string temp_inBase64 = Convert.ToBase64String(bytes);
string returnString = "";
int maxLength = 76; // characters per line
int counts = temp_inBase64.Length / maxLength; // number of full lines
for (int i = 0; i < counts; i++)
{
    returnString += temp_inBase64.Substring(i * maxLength, maxLength);
    returnString += "\r\n";
}
// append the remaining partial line, if any
returnString += temp_inBase64.Substring(maxLength * counts);
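
As an alternative to the manual loop, Convert.ToBase64String has an overload that takes Base64FormattingOptions.InsertLineBreaks, which inserts a CR/LF after every 76 characters. A minimal sketch, with a hypothetical wrapper name; whether the result matches what the other library's validation expects (for example, regarding a trailing line break) is an assumption to check.

```csharp
using System;

public static class Base64LinesSketch
{
    // Encodes bytes as Base64 with "\r\n" inserted after every 76 characters.
    public static string Encode(byte[] bytes)
    {
        return Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
    }
}
```

Usage would be Base64LinesSketch.Encode(System.IO.File.ReadAllBytes(path)). Removing every "\r\n" from the result gives the same string as the plain Convert.ToBase64String(bytes) call.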

Put string into textbox -> not complete

I clicked together a small WinForms app for testing. It has two multiline textboxes and a single button, which on press sends a request to a server and posts response headers and content into the textboxes like this:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
int len = 0;
foreach (var header in response.Headers)
{
    var str = header.ToString();
    textBox1.AppendText(str + "=" + response.Headers[str] + "\n");
    if (str == "Content-Length") len = Convert.ToInt32(response.Headers[str]);
}
Stream respStream = response.GetResponseStream();
byte[] x = new byte[len];
respStream.Read(x, 0, len);
var s = new string(ascii.GetChars(x, 0, len));
// textBox2.Text = s;
textBox2.Clear();
textBox2.AppendText(s);
MessageBox.Show(textBox2.TextLength.ToString(), s.Length.ToString());
But no matter whether I use AppendText or whether I assign the string, the MessageBox always shows the caption 7653 with message 3964, and the headers textbox contains the line Content-length=7653.
So it seems that the string is not completely appended to the TextBox. Why would that be?
Btw: I am requesting an HTML document; the last two chars shown are ".5", and the first two chars missing are "16", so it does not break at some special characters.

Check out this Post
Your problem is that a single call to Stream.Read may return fewer bytes than you asked for, because the rest may not have arrived from the network yet.
So your string already contains only the first part of the text. s.Length reports the full size of the byte array x, since every byte is copied over, but the bytes that were never read are 0 (char '\0'). textBox2.TextLength then reports the number of characters that were actually read; I suppose the TextBox cuts the text off at the '\0' characters.
You should call Read in a while loop instead and check its return value each time.
Also check the encoding of your HTML page. For UTF-8 (the default in HTML5), one byte doesn't necessarily correspond to one character.
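
The while loop suggested above could look roughly like this. It is a sketch with a hypothetical helper name: it keeps calling Read until the requested number of bytes has arrived or the stream ends, and returns only the bytes that were actually read.

```csharp
using System;
using System.IO;

public static class StreamReadSketch
{
    // Reads up to count bytes, looping because Read may return fewer bytes per call.
    public static byte[] ReadUpTo(Stream stream, int count)
    {
        var buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, total, count - total);
            if (read == 0)
            {
                break; // end of stream reached before count bytes arrived
            }
            total += read;
        }
        if (total < count)
        {
            Array.Resize(ref buffer, total); // drop the unread, zero-filled tail
        }
        return buffer;
    }
}
```

In the question's code this would replace the single respStream.Read(x, 0, len) call, for example byte[] x = StreamReadSketch.ReadUpTo(respStream, len). For text responses, wrapping the stream in a StreamReader with the right encoding and calling ReadToEnd is another option that avoids manual byte handling.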

How do I read chars from other countries such as ß ä?

How do I read chars from other countries such as ß ä?
The following code reads all chars, including chars such as 0x0D.
StreamReader srFile = new StreamReader(gstPathFileName);
char[] acBuf = null;
int iReadLength = 100;
while (srFile.Peek() >= 0) {
    acBuf = new char[iReadLength];
    srFile.Read(acBuf, 0, iReadLength);
    string s = new string(acBuf);
}
But it does not interpret correctly chars such as ß ä.
I don't know what encoding the file uses. It is exported (into a .txt file) by code that was written more than 20 years ago and reads from a C-Tree database.
The ß ä display fine with Notepad.

By default, the StreamReader constructor assumes the UTF-8 encoding (which is the de facto universal standard today). Since that's not decoding your file correctly, your characters (ß, ä) suggest that it's probably encoded using Windows-1252 (Western European):
var encoding = Encoding.GetEncoding("Windows-1252");
using (StreamReader srFile = new StreamReader(gstPathFileName, encoding))
{
    // ...
}
A closely-related encoding is ISO/IEC 8859-1. If the above gives some unexpected results, use Encoding.GetEncoding("ISO-8859-1") instead.
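
When the encoding is unknown, one possible approach is to try strict UTF-8 first and fall back to a single-byte encoding only if the bytes are not valid UTF-8. The sketch below uses a hypothetical helper name and takes the fallback encoding as a parameter. One caveat worth noting: on .NET Core and .NET 5+, "Windows-1252" is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance), whereas "ISO-8859-1" works without that step.

```csharp
using System;
using System.Text;

public static class LegacyEncodingSketch
{
    // Tries strict UTF-8; if the bytes are not valid UTF-8, decodes with the fallback.
    public static string DecodeWithFallback(byte[] raw, Encoding fallback)
    {
        try
        {
            // The second argument makes invalid byte sequences throw instead of
            // being silently replaced with U+FFFD.
            var strictUtf8 = new UTF8Encoding(false, true);
            return strictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(raw);
        }
    }
}
```

Usage would be along the lines of DecodeWithFallback(File.ReadAllBytes(gstPathFileName), Encoding.GetEncoding("ISO-8859-1")). This is a heuristic, not a detector: it cannot tell different single-byte encodings apart.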

How to read a file starting at a specific cursor point in C#?

I want to read a file, not from the beginning, but starting at a specific point. For example, I want to skip the first 977 characters of the file and then read the next 200 characters at once. Thanks.

If you want to read the file as text, skipping characters (not bytes):
using (var textReader = System.IO.File.OpenText(path))
{
    // read and disregard the first 977 chars
    var buffer = new char[977];
    textReader.Read(buffer, 0, buffer.Length);
    // read 200 chars
    buffer = new char[200];
    textReader.Read(buffer, 0, buffer.Length);
}
If you merely want to skip a certain number of bytes (not characters):
using (var fileStream = System.IO.File.OpenRead(path))
{
    // seek to starting point
    fileStream.Seek(977, SeekOrigin.Begin);
    // read 200 bytes
    var buffer = new byte[200];
    fileStream.Read(buffer, 0, buffer.Length);
}
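
One detail the character-based version glosses over is that Read may return fewer chars than requested. A sketch of the same idea using TextReader.ReadBlock, which keeps reading until the requested count is reached or the input ends; the helper name is hypothetical.

```csharp
using System;
using System.IO;

public static class TextSliceSketch
{
    // Skips `skip` chars, then returns up to `count` chars from the reader.
    public static string ReadSlice(TextReader reader, int skip, int count)
    {
        // Read and discard the leading characters.
        var discard = new char[skip];
        reader.ReadBlock(discard, 0, skip);

        // Read the wanted characters; fewer come back if the input ends early.
        var buffer = new char[count];
        int read = reader.ReadBlock(buffer, 0, count);
        return new string(buffer, 0, read);
    }
}
```

For the case in the question this would be called as TextSliceSketch.ReadSlice(File.OpenText(path), 977, 200), with the reader wrapped in a using block. With the text "hello how are you ?", ReadSlice(reader, 6, 3) returns "how".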

You can use LINQ and convert an array of chars to a string.
Add these namespaces:
using System.Linq;
using System.IO;
Then you can get an array of b characters, starting at index a, from your text file:
char[] c = File.ReadAllText(FilePath).ToCharArray().Skip(a).Take(b).ToArray();
Then you can build a string from the consecutive chars in c:
string r = new string(c);
For example, I have this text in a file:
hello how are you ?
I use this code:
char[] c = File.ReadAllText(FilePath).ToCharArray().Skip(6).Take(3).ToArray();
string r = new string(c);
MessageBox.Show(r);
and it shows: how
Way 2
Very simple: use the Substring method.
string s = File.ReadAllText(FilePath);
string r = s.Substring(6,3);
MessageBox.Show(r);
Good luck.

using (var fileStream = System.IO.File.OpenRead(path))
{
    // seek to starting point
    fileStream.Position = 977;
    // read
}

If you want to read specific data types from files, System.IO.BinaryReader is the best choice.
If you are not sure about the file encoding, use:
using (var binaryreader = new BinaryReader(File.OpenRead(path)))
{
    // skip to the starting point by reading and discarding 977 chars
    binaryreader.ReadChars(977);
    // read
    char[] data = binaryreader.ReadChars(200);
    // do what you want with data
}
Otherwise, if you know that every character in the source file is 1 or 2 bytes wide, use:
using (var binaryreader = new BinaryReader(File.OpenRead(path)))
{
    // seek to the starting point
    binaryreader.BaseStream.Position = 977 * X; // X is 1 or 2, depending on the character size in the source file
    // read
    char[] data = binaryreader.ReadChars(200);
    // do what you want with data
}
