.NET StreamReader encoding behaviour - c#

I am trying to understand Unicode encoding behaviour and came across the following.
I write a string to a file using Encoding.Unicode:
StreamWriter(fileName, false, Encoding.Unicode);
I then read from the same file, intentionally using ASCII:
StreamReader(fileName, Encoding.ASCII);
When I read the string using ReadLine, to my surprise it gives back the same Unicode string.
I expected the string to contain ? or other characters, with double the length of the original string.
What is happening here?
Code Snippet
string test = "سشصضطظع"; // some random Arabic characters
StreamWriter writer = new StreamWriter(fileName, false, Encoding.UTF8);
writer.Write(test);
writer.Flush();
writer.Close();
StreamReader reader = new StreamReader(fileName, Encoding.ASCII);
string ss = reader.ReadLine();
reader.Close();
// I expect ss to be an ASCII string with double the length of test
If I call StreamReader reader = new StreamReader(fileName, Encoding.ASCII, false);
then it gives the expected result.
Thanks

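The StreamReader(string, Encoding) constructor defaults detectEncodingFromByteOrderMarks to true, so the reader sees the byte order mark that StreamWriter wrote and switches to that encoding, ignoring the ASCII you asked for. Set detectEncodingFromByteOrderMarks to false when creating the StreamReader object to force the encoding you specify.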
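A minimal sketch of the difference, assuming a throwaway temp file; the class and method names (BomDetectionDemo, WriteThenReadAsAscii) are hypothetical and only serve the illustration:

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Writes `text` as UTF-8 with a BOM, then reads it back while asking for ASCII.
    public static string WriteThenReadAsAscii(string text, bool detectBom)
    {
        string path = Path.GetTempFileName(); // throwaway file for the demo
        try
        {
            using (var writer = new StreamWriter(path, false, Encoding.UTF8))
                writer.Write(text);

            using (var reader = new StreamReader(path, Encoding.ASCII, detectBom))
                return reader.ReadToEnd();
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        string test = "سشصضطظع";
        // Detection on: the BOM wins and the original text comes back.
        Console.WriteLine(WriteThenReadAsAscii(test, true) == test);
        // Detection off: the bytes are decoded as ASCII, so the text differs.
        Console.WriteLine(WriteThenReadAsAscii(test, false) == test);
    }
}
```

With detection switched off, every byte outside the 7-bit range (including the three BOM bytes) is replaced by the ASCII decoder, which is the garbled result the question expected.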

Related

How to remove BOM from an encoded base64 UTF string?

I have a file encoded in base64 using openssl base64 -in en -out en1 in a command line in MacOS and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this, any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF-8 text (with a BOM) from a text file, the BOM is handled for you behind the scenes. For example, both of the following calls read UTF-8 text correctly whether or not the file starts with a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF-8 text that has been encoded to a Base64 string. The file that ReadAllText() sees is plain Base64 text, so it can no longer handle the BOM for you; the BOM only reappears in the decoded bytes, and Encoding.UTF8.GetString() does not strip it. You can either do it yourself by checking for and removing the first 3 bytes (EF BB BF) of the byte array, or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
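The manual alternative mentioned above (checking for and removing the first 3 bytes) might look like the following sketch. The helper name DecodeUtf8WithoutBom is hypothetical; the Base64 string in Main is the encoded example from the question.

```csharp
using System;
using System.Text;

static class Base64Bom
{
    // Decodes Base64 to UTF-8 text, skipping a leading UTF-8 BOM if one is present.
    public static string DecodeUtf8WithoutBom(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF

        bool hasBom = bytes.Length >= bom.Length;
        for (int i = 0; hasBom && i < bom.Length; i++)
            hasBom = bytes[i] == bom[i];

        int offset = hasBom ? bom.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    static void Main()
    {
        string encoded = "77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K";
        string text = DecodeUtf8WithoutBom(encoded);
        Console.WriteLine(text[0] == '\uFEFF'); // False: the BOM character is gone
        Console.Write(text);
    }
}
```

The StreamReader approach is shorter and also copes with other BOMs; the manual version is only worth it when you want to avoid the extra stream objects.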

read encoding identifier with StreamReader

I am reading a C# book and in the chapter about streams it says:
If you explicitly specify an encoding, StreamWriter will, by default,
write a prefix to the start of the stream to identify the encoding.
This is usually undesirable and you can prevent it by constructing the
encoding as follows:
var encoding = new UTF8Encoding (encoderShouldEmitUTF8Identifier:false, throwOnInvalidBytes:true);
I'd like to actually see how the identifier looks so I came up with this code:
using (FileStream fs = File.Create ("test.txt"))
using (TextWriter writer = new StreamWriter (fs,new UTF8Encoding(true,false)))
{
writer.WriteLine ("Line1");
}
using (FileStream fs = File.OpenRead ("test.txt"))
using (TextReader reader = new StreamReader (fs))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine (b + " " + (char)b); // identifier not printed
}
To my dissatisfaction, no identifier was printed. How do I read the identifier? Am I missing something?
By default, .NET will try very hard to insulate you from encoding errors. If you want to see the byte-order-mark, aka "preamble" or "BOM", you need to be very explicit with the objects to disable the automatic behavior. This means that you need to use an encoding that does not include the preamble, and you need to tell StreamReader to not try to detect the encoding.
Here is a variation of your original code that will display the BOM:
using (MemoryStream stream = new MemoryStream())
{
    Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
    using (TextWriter writer = new StreamWriter(stream, encoding, bufferSize: 8192, leaveOpen: true))
    {
        writer.WriteLine("Line1");
    }
    stream.Position = 0;
    encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
    using (TextReader reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
    {
        for (int b; (b = reader.Read()) > -1;)
            Console.WriteLine(b + " " + (char)b); // the identifier is printed first, as 65279
    }
}
Here, encoderShouldEmitUTF8Identifier: true is passed to the encoding used to write the stream, so that the BOM is written at the start of the stream, but encoderShouldEmitUTF8Identifier: false is passed to the encoding used to read the stream, so that the BOM will be treated as a normal character when the stream is being read back. The detectEncodingFromByteOrderMarks: false parameter is passed to the StreamReader constructor as well, so that it won't consume the BOM itself.
This produces this output, just like you wanted:
65279 ?
76 L
105 i
110 n
101 e
49 1
13
10
It is worth mentioning that use of the BOM as a form of identifying UTF-8 encoding is generally discouraged. The BOM mainly exists so that the two variations of UTF-16 can be distinguished (i.e. UTF-16LE and UTF-16BE, "little endian" and "big endian", respectively). It's been co-opted as a means of identifying UTF-8 as well, but really it's better to just know what the encoding is (which is why things like XML and HTML explicitly declare the encoding, written in plain ASCII characters, in the first part of the file, and why MIME's charset parameter exists). A single character isn't nearly as reliable as other more explicit means.
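If the goal is to see the identifier as raw bytes rather than as the decoded character U+FEFF, one option is to skip StreamReader entirely and look at the bytes the writer produced. This is a sketch only; PreambleDemo and WriteAndGetRawBytes are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class PreambleDemo
{
    // Writes `text` through a StreamWriter and returns the raw bytes it produced.
    public static byte[] WriteAndGetRawBytes(Encoding encoding, string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding, bufferSize: 1024, leaveOpen: true))
                writer.Write(text);
            return stream.ToArray();
        }
    }

    static void Main()
    {
        // With the identifier: the three preamble bytes come first.
        Console.WriteLine(BitConverter.ToString(
            WriteAndGetRawBytes(new UTF8Encoding(true), "Line1")));  // EF-BB-BF-4C-69-6E-65-31

        // Without the identifier: only the text bytes.
        Console.WriteLine(BitConverter.ToString(
            WriteAndGetRawBytes(new UTF8Encoding(false), "Line1"))); // 4C-69-6E-65-31
    }
}
```

The same three bytes are also available without writing anything, via the GetPreamble() method of an encoding that emits the identifier.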

Which character encoding should I use for Tab-delimited flat file?

We are requesting the report type ‘_GET_MERCHANT_LISTINGS_DATA_’ from the MWS API in a C# web application.
Sometimes we get the � character instead of a single quote, a space, or other special characters when decoding the data.
We have used Encoding.GetEncoding(1252) to re-encode the text read by the StreamReader.
We are using the code below.
Stream s = reportRequest.Report;
StreamReader stream_reader = new StreamReader(s);
string reportResponseText = stream_reader.ReadToEnd();
byte[] byteArray = Encoding.GetEncoding(1252).GetBytes(reportResponseText);
MemoryStream stream = new MemoryStream(byteArray);
StreamReader filestream = new StreamReader(stream);
We have also tried ‘Encoding.UTF8.GetBytes(reportResponseText)’, but it did not help.
Could anyone please suggest the correct way to decode the data?
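One thing worth noting about the code in this question: new StreamReader(s) decodes the bytes as UTF-8 by default, and any byte that is not valid UTF-8 becomes � at that point. Re-encoding the resulting string afterwards cannot bring the original bytes back. The sketch below illustrates this with a stand-in byte array; that the report is actually in a single-byte encoding is an assumption, and ISO-8859-1 is used here only because it is available on every .NET runtime (on .NET Core and later, code page 1252 additionally requires registering CodePagesEncodingProvider).

```csharp
using System;
using System.IO;
using System.Text;

static class DecodeOnceDemo
{
    // Decodes the raw bytes exactly once, with the encoding given.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        using (var ms = new MemoryStream(raw))
        using (var reader = new StreamReader(ms, encoding))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        // Hypothetical stand-in for the report bytes: "café" in a single-byte encoding.
        byte[] raw = latin1.GetBytes("caf\u00E9");

        string wrong = Decode(raw, Encoding.UTF8); // the byte E9 is not valid UTF-8 here
        string right = Decode(raw, latin1);

        Console.WriteLine(wrong.Contains("\uFFFD")); // True: the replacement character appears
        Console.WriteLine(right == "caf\u00E9");     // True
    }
}
```

If the assumption holds, passing the right encoding to the first StreamReader over reportRequest.Report would replace the whole decode, re-encode, decode chain.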

reading stream with right encoding in C#

I'm trying to read a stream with iso-8859-1 encoding with C#:
using (var reader = new StreamReader(stream, System.Text.Encoding.GetEncoding("iso-8859-1")))
{
    var current_enc = reader.CurrentEncoding; // value is UTF8
}
I set the encoding to iso-8859-1, but it does not seem to be applied afterwards.
Has anyone seen this behaviour?
I found the StreamReader parameter detectEncodingFromByteOrderMarks.
If it is set to false, the reader does not try to detect the encoding from a byte order mark and uses the one you passed in:
using (StreamReader reader = new StreamReader(stream,System.Text.Encoding.GetEncoding("iso-8859-1"), false))
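A small sketch of how CurrentEncoding reacts to that flag, assuming the stream starts with a UTF-8 byte order mark (which would explain the UTF8 value seen in the question). The names CurrentEncodingDemo and EncodingNameAfterRead are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class CurrentEncodingDemo
{
    // Returns the name of the encoding the reader ended up using.
    public static string EncodingNameAfterRead(byte[] data, Encoding requested, bool detectBom)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, requested, detectBom))
        {
            reader.ReadToEnd(); // detection only happens once data has been read
            return reader.CurrentEncoding.WebName;
        }
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        // Hypothetical input: a UTF-8 BOM followed by plain ASCII bytes.
        byte[] data = Encoding.UTF8.GetPreamble()
            .Concat(Encoding.ASCII.GetBytes("abc"))
            .ToArray();

        Console.WriteLine(EncodingNameAfterRead(data, latin1, true));  // utf-8
        Console.WriteLine(EncodingNameAfterRead(data, latin1, false)); // iso-8859-1
    }
}
```

If the data really is ISO-8859-1 and still begins with a BOM, turning detection off means those BOM bytes come through as ordinary characters at the start of the text.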

C# Streamreader: Handling of special characters \" \' etc

I'm reading then writing a text file. Before and after the data of interest the file contains many lines that should remain unaltered. But streamreader seems to convert the special characters ( " ' — ) into other characters that appear as funky diamonds in both C# textboxes and in notepad. How can text get passed through file read/write operations completely unaltered? Thanks.
StreamWriter sw = new StreamWriter(sOutputFileName);
using (StreamReader sr = new StreamReader(sTempFileName))
{
while (sr.Peek() >= 0)
{
rdBuffer = sr.ReadLine();
txtProgressDisplay.Text += rdBuffer + "\r\n";
// parse and process some lines here
wrBuffer = rdBuffer;
sw.WriteLine(wrBuffer);
txtProgressDisplay.Text += wrBuffer + "\r\n";
}
sr.Close();
}
sw.Close();
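I am almost certain the issue is related to character encoding (UTF-8, ASCII, Windows-1252, etc.). Try creating your StreamReader with the encoding the file was actually saved in, for example:
StreamReader sr = new StreamReader(sTempFileName, System.Text.Encoding.ASCII);
Other choices include Encoding.UTF8, Encoding.Unicode and Encoding.GetEncoding(1252). Note that Encoding.ASCII only covers 7-bit characters, so it cannot represent curly quotes or an em dash; a file containing those is more likely UTF-8 or Windows-1252. Pass the same encoding to the StreamWriter so the text is written back the same way.
Your problem seems to be related to encoding.
1) Check that your text viewer is using the same encoding as your .NET application (maybe UTF-8?).
2) Check whether the file itself was created using the same encoding as your .NET application (are you mixing characters in different encodings?).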
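As a sketch of both answers combined: read and write with one explicitly chosen encoding so that characters pass through unchanged. Which encoding the file really uses is an assumption you have to verify; ISO-8859-1 below is only a placeholder, and PassThrough and CopyLines are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class PassThrough
{
    // Copies a text file line by line, decoding and encoding with the same encoding.
    public static void CopyLines(string inputPath, string outputPath, Encoding encoding)
    {
        using (var reader = new StreamReader(inputPath, encoding))
        using (var writer = new StreamWriter(outputPath, false, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Parse and process selected lines here.
                writer.WriteLine(line);
            }
        }
    }

    static void Main()
    {
        Encoding encoding = Encoding.GetEncoding("iso-8859-1"); // placeholder choice
        string input = Path.GetTempFileName();
        string output = Path.GetTempFileName();
        File.WriteAllBytes(input, encoding.GetBytes("caf\u00E9 \u00ABquoted\u00BB\n"));

        CopyLines(input, output, encoding);
        Console.WriteLine(File.ReadAllText(output, encoding).TrimEnd() == "caf\u00E9 \u00ABquoted\u00BB");

        File.Delete(input);
        File.Delete(output);
    }
}
```

ReadLine and WriteLine still normalise line endings, so this keeps the characters intact but not necessarily every byte; sections that must stay byte-for-byte identical are safer copied as raw bytes.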
