C#: letters like å not shown correctly in output file

I work in C# and this is my code:
Encoding encoding;
StringBuilder output = new StringBuilder();
//somePath is string
using (StreamReader sr = new StreamReader(somePath))
{
    string line;
    encoding = sr.CurrentEncoding;
    while ((line = sr.ReadLine()) != null)
    {
        //make some changes to line
        output.AppendLine(line);
    }
}
using (StreamWriter writer = new StreamWriter(someOtherPath, false))//encoding
{
    writer.Write(output);
}
The file at somePath contains Norwegian characters like å, but in the file at someOtherPath I get question marks instead. I think it's an encoding problem, so I tried reading the input file's encoding and applying it to the output file. That had no effect. I also tried opening the output file in Google Chrome and selecting every available encoding, but the letters never matched those in the input file.

StreamReader can only make guesses with regard to certain encodings. Ideally, you should find out what the encoding of the file really is, then use that to read the file. What created the file, and what allows you to read it correctly? Does the latter program expose which encoding it's using? (For example, it may be using something like Windows-1252.)
I would personally recommend using UTF-8 as your output encoding if you can, but it depends on whether you're in control of whatever's then reading the output.
EDIT: Okay, now I've seen the file, I can confirm it's not UTF-8. The word "direktør" is represented as these bytes:
64 69 72 65 6b 74 f8 72
So the non-ASCII character is a single byte (F8) which is not a valid UTF-8 representation of a character.
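One way to check this for yourself, as a minimal sketch: decode the bytes with a strict UTF-8 decoder, which throws on invalid input, and compare with an ISO-8859-1 decode. The class and method names below are illustrative and not part of the original code.

```csharp
using System;
using System.Text;

static class EncodingCheck
{
    // Hypothetical helper: returns true if the bytes are well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for invalid sequences.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes the bytes as ISO-8859-1 (code page 28591).
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding(28591).GetString(bytes);
    }
}
```

For the bytes 64 69 72 65 6b 74 f8 72, IsValidUtf8 returns false and DecodeLatin1 returns "direktør".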
It could be ISO-Latin-1 - it's not clear (there are multiple encodings which would match). If it is, you can use:
Encoding encoding = Encoding.GetEncoding(28591);
using (TextReader reader = new StreamReader(filename, encoding))
{
    ...
}
(Alternatively, use File.ReadAllLines to make life simpler.)
You'll need to separately work out what output encoding you want.
EDIT: Here's a short but complete program which I've run against the file you provided, and which has correctly converted the character to UTF-8:
using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        Encoding encoding = Encoding.GetEncoding(28591);
        StringBuilder output = new StringBuilder();
        using (TextReader reader = new StreamReader("file.html", encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                output.AppendLine("Read line: " + line);
            }
        }
        using (StreamWriter writer = new StreamWriter("output.html", false))
        {
            writer.Write(output);
        }
    }
}

Try this to save your text (the StreamWriter constructor that takes a path and an encoding also takes the append flag):
using (StreamWriter writer = new StreamWriter(someOtherPath, false, Encoding.UTF8)) { ... }

Related

How to remove BOM from an encoded base64 UTF string?

I have a file encoded in base64 using openssl base64 -in en -out en1 in a command line in MacOS and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this, any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF-8 text (with a BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF-8 text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF-8 text that has been encoded to a Base64 string, so ReadAllText() can no longer handle the BOM for you. You can either do it yourself by checking for and removing the first 3 bytes of the byte array, or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
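The other option mentioned above, checking for the three BOM bytes and skipping them by hand, could look roughly like this. It is a sketch only; Base64Text and DecodeUtf8 are made-up names for illustration.

```csharp
using System;
using System.Text;

static class Base64Text
{
    // Hypothetical helper: decodes Base64 to UTF-8 text, skipping a leading BOM.
    public static string DecodeUtf8(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF

        int offset = 0;
        if (bytes.Length >= bom.Length &&
            bytes[0] == bom[0] && bytes[1] == bom[1] && bytes[2] == bom[2])
        {
            // The data starts with the UTF-8 BOM, so decode from just after it.
            offset = bom.Length;
        }
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Input without a BOM is decoded unchanged, so the helper can be used whether or not the original file had one.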

Read encoding identifier with StreamReader

I am reading a C# book and in the chapter about streams it says:
If you explicitly specify an encoding, StreamWriter will, by default,
write a prefix to the start of the stream to identify the encoding.
This is usually undesirable and you can prevent it by constructing the
encoding as follows:
var encoding = new UTF8Encoding (encoderShouldEmitUTF8Identifier:false, throwOnInvalidBytes:true);
I'd like to actually see how the identifier looks so I came up with this code:
using (FileStream fs = File.Create ("test.txt"))
using (TextWriter writer = new StreamWriter (fs,new UTF8Encoding(true,false)))
{
    writer.WriteLine ("Line1");
}
using (FileStream fs = File.OpenRead ("test.txt"))
using (TextReader reader = new StreamReader (fs))
{
    for (int b; (b = reader.Read()) > -1;)
        Console.WriteLine (b + " " + (char)b); // identifier not printed
}
To my dissatisfaction, no identifier was printed. How do I read the identifier? Am I missing something?
By default, .NET will try very hard to insulate you from encoding errors. If you want to see the byte-order-mark, aka "preamble" or "BOM", you need to be very explicit with the objects to disable the automatic behavior. This means that you need to use an encoding that does not include the preamble, and you need to tell StreamReader to not try to detect the encoding.
Here is a variation of your original code that will display the BOM:
using (MemoryStream stream = new MemoryStream())
{
    Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
    using (TextWriter writer = new StreamWriter(stream, encoding, bufferSize: 8192, leaveOpen: true))
    {
        writer.WriteLine("Line1");
    }
    stream.Position = 0;
    encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
    using (TextReader reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
    {
        for (int b; (b = reader.Read()) > -1;)
            Console.WriteLine(b + " " + (char)b); // identifier is now printed as the first character
    }
}
Here, encoderShouldEmitUTF8Identifier: true is passed to the encoder used to create the stream, so that the BOM is written when the stream is created, but encoderShouldEmitUTF8Identifier: false is passed to the encoder used to read the stream, so that the BOM will be treated as a normal character when the stream is being read back. The detectEncodingFromByteOrderMarks: false parameter is passed to the StreamReader constructor as well, so that it won't consume the BOM itself.
This produces this output, just like you wanted:
65279 ?
76 L
105 i
110 n
101 e
49 1
13
10
It is worth mentioning that use of the BOM as a means of identifying UTF-8 encoding is generally discouraged. The BOM mainly exists so that the two variations of UTF-16 can be distinguished (i.e. UTF-16LE and UTF-16BE, "little endian" and "big endian", respectively). It's been co-opted as a means of identifying UTF-8 as well, but really it's better to just know what the encoding is (which is why formats like XML and HTML state the encoding explicitly, in ASCII-compatible text near the start of the file, and why MIME's charset parameter exists). A single character isn't nearly as reliable as other, more explicit means.
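If the goal is only to see what the identifier looks like on disk, another option is to skip the reader entirely and look at the raw bytes. A minimal sketch follows; the BomBytes class and its method are illustrative names, not part of the code above.

```csharp
using System.IO;
using System.Text;

static class BomBytes
{
    // Hypothetical helper: writes text through a StreamWriter using UTF-8
    // with the identifier enabled, and returns the raw bytes produced.
    public static byte[] WriteWithBom(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true)))
            {
                writer.Write(text);
            }
            // MemoryStream.ToArray still works after the stream has been closed.
            return stream.ToArray();
        }
    }
}
```

BitConverter.ToString(BomBytes.WriteWithBom("Line1")) yields EF-BB-BF-4C-69-6E-65-31: the three-byte identifier followed by the five ASCII bytes of the text. For a file on disk, File.ReadAllBytes gives the same kind of view.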

Opening a file with French characters using StreamReader shows incorrect data

I have a file with the following text in: SignOut,déconnectez.
When I use the following code:
List<string> list = new List<string>();
using (StreamReader reader = new StreamReader(FileName, Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        list.Add(line); // Add to list.
}
I get this back: "Sign Out,d�connectez,"
I thought that opening the file with Encoding.UTF8 would be enough, but it doesn't seem to do anything. Could someone point me in the right direction to open a file that may contain non-ASCII characters, please?
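The file is probably not UTF-8. If it is ISO-8859-1 (Latin-1), pass this encoding to the StreamReader instead of Encoding.UTF8:
Encoding.GetEncoding("iso-8859-1");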
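When it is not known in advance which of the two encodings a file uses, one heuristic (an assumption on my part, not something the answer prescribes) is to try a strict UTF-8 decode first and fall back to ISO-8859-1 only if that fails. A sketch with hypothetical names:

```csharp
using System.Text;

static class TextDecoding
{
    // Hypothetical helper: tries strict UTF-8, then falls back to ISO-8859-1.
    public static string DecodeUtf8OrLatin1(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not well-formed UTF-8; ISO-8859-1 maps every byte to a character,
            // so this decode cannot fail (though it may still be the wrong guess).
            return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        }
    }
}
```

It could be fed with File.ReadAllBytes(FileName). The fallback is only a guess: other single-byte encodings would decode without error too, so knowing the real encoding remains the better fix.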

Cannot write to rtf file after replacing inside string with utf8 characters

I have an RTF file in which I have to replace some text with language-specific characters (UTF-8). After the replacements I try to save to a new RTF file, but either the characters come out wrong (strange characters) or the file is saved showing all the raw RTF code and formatting.
Here is my code:
var fs = new FileStream(@"F:\projects\projects\RtfEditor\Test.rtf", FileMode.Open, FileAccess.Read);
//reads the file in a byte[]
var sb = FileWorker.ReadToEnd(fs);
var enc = Encoding.GetEncoding(1250);
//var enc = Encoding.UTF8;
var sbs = enc.GetString(sb);
var sbsNew = sbs.Replace("#test/#", "ă î â șșțț");
//first writing approach
var fsw = new FileStream(@"F:\projects\projects\RtfEditor\diac.rtf", FileMode.Create, FileAccess.Write);
fsw.Write(enc.GetBytes(sbsNew), 0, enc.GetBytes(sbsNew).Length);
fsw.Flush();
fsw.Close();
With this approach, the resulting file is the right one, but the characters "șșțț" are shown as "????".
//second writing approach
using (StreamWriter sw = new StreamWriter(fsw, Encoding.UTF8))
{
    sw.Write(sbsNew);
    sw.Flush();
}
With this approach, the resulting file is an RTF file but it shows all the raw RTF code and formatting, and the special characters are saved correctly (șșțț appear correctly, no more ????).
An RTF file can directly contain 7-bit characters only. Everything else needs to be encoded into escape sequences. More detailed information can be found in, e.g., the Wikipedia article on RTF.
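As a rough sketch of what such escaping could look like: RTF's Unicode escape has the form \uN followed by a fallback character, where N is the UTF-16 code unit written as a signed 16-bit decimal. The helper below is illustrative (RtfText.Escape is a made-up name) and assumes the document uses the default of one fallback character per escape.

```csharp
using System.Globalization;
using System.Text;

static class RtfText
{
    // Hypothetical helper: escapes text so it can be inserted into RTF content.
    public static string Escape(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                // RTF control characters must be backslash-escaped.
                sb.Append('\\').Append(c);
            }
            else if (c < 0x80)
            {
                // 7-bit characters can be written directly.
                sb.Append(c);
            }
            else
            {
                // \uN takes a signed 16-bit value, so code units above 32767
                // become negative; '?' is the fallback for older readers.
                short code = unchecked((short)c);
                sb.Append("\\u")
                  .Append(code.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }
}
```

In the question's code, the replacement text would be passed through the helper first, for example sbs.Replace("#test/#", RtfText.Escape("ă î â șșțț")), and the result written with the first writing approach, since the escaped text is pure ASCII.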

C# Streamreader: Handling of special characters \" \' etc

I'm reading and then writing a text file. Before and after the data of interest, the file contains many lines that should remain unaltered. But StreamReader seems to convert the special characters ( " ' — ) into other characters that appear as funky diamonds in both C# textboxes and in Notepad. How can text be passed through file read/write operations completely unaltered? Thanks.
StreamWriter sw = new StreamWriter(sOutputFileName);
using (StreamReader sr = new StreamReader(sTempFileName))
{
    while (sr.Peek() >= 0)
    {
        rdBuffer = sr.ReadLine();
        txtProgressDisplay.Text += rdBuffer + "\r\n";
        // parse and process some lines here
        wrBuffer = rdBuffer;
        sw.WriteLine(wrBuffer);
        txtProgressDisplay.Text += wrBuffer + "\r\n";
    }
    sr.Close();
}
sw.Close();
I am almost certain the issue is related to character encoding, e.g. UTF-8, ASCII, UTF-7, etc. Try creating your StreamReader with the correct encoding passed in:
StreamReader sr = new StreamReader(sTempFileName, System.Text.Encoding.ASCII);
You can use Encoding.ASCII, Encoding.UTF7, etc., whichever matches how the file was actually written. Note that Encoding.ASCII can only represent 7-bit characters.
Your problem seems to be something with encoding.
1) Check that your text viewer is using the same encoding as your .NET application (maybe UTF-8?).
2) Check if the file itself has been created using the same encoding as your .NET application too (are you mixing characters in different encodings?).
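Putting both answers together, a minimal sketch of a pass-through copy that uses one explicitly chosen encoding for both reading and writing. The helper name is hypothetical, and which encoding to pass is an assumption that depends on how the file was created.

```csharp
using System.IO;
using System.Text;

static class TextCopy
{
    // Hypothetical helper: copies text line by line, decoding and encoding
    // with the same explicitly chosen encoding.
    public static void CopyLines(Stream input, Stream output, Encoding encoding)
    {
        using (var reader = new StreamReader(input, encoding))
        using (var writer = new StreamWriter(output, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // parse and process the line here
                writer.WriteLine(line);
            }
        }
    }
}
```

With a single-byte encoding such as Encoding.GetEncoding(28591) (ISO-8859-1), every input byte maps to exactly one character and back, so unrecognized characters survive the round trip. Line endings are still rewritten to the writer's NewLine value, so this is not a byte-for-byte copy.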
