read encoding identifier with StreamReader

I am reading a C# book and in the chapter about streams it says:
If you explicitly specify an encoding, StreamWriter will, by default,
write a prefix to the start of the stream to identify the encoding.
This is usually undesirable and you can prevent it by constructing the
encoding as follows:
var encoding = new UTF8Encoding (encoderShouldEmitUTF8Identifier:false, throwOnInvalidBytes:true);
I'd like to actually see how the identifier looks so I came up with this code:
using (FileStream fs = File.Create ("test.txt"))
using (TextWriter writer = new StreamWriter (fs,new UTF8Encoding(true,false)))
{
writer.WriteLine ("Line1");
}
using (FileStream fs = File.OpenRead ("test.txt"))
using (TextReader reader = new StreamReader (fs))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine (b + " " + (char)b); // identifier not printed
}
To my dissatisfaction, no identifier was printed. How do I read the identifier? Am I missing something?

By default, .NET will try very hard to insulate you from encoding errors. If you want to see the byte-order-mark, aka "preamble" or "BOM", you need to be very explicit with the objects to disable the automatic behavior. This means that you need to use an encoding that does not include the preamble, and you need to tell StreamReader to not try to detect the encoding.
Here is a variation of your original code that will display the BOM:
using (MemoryStream stream = new MemoryStream())
{
Encoding encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
using (TextWriter writer = new StreamWriter(stream, encoding, bufferSize: 8192, leaveOpen: true))
{
writer.WriteLine("Line1");
}
stream.Position = 0;
encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using (TextReader reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: false))
{
for (int b; (b = reader.Read()) > -1;)
Console.WriteLine(b + " " + (char)b); // the identifier is printed first
}
}
Here, encoderShouldEmitUTF8Identifier: true is passed to the encoder used to create the stream, so that the BOM is written when the stream is created, but encoderShouldEmitUTF8Identifier: false is passed to the encoder used to read the stream, so that the BOM will be treated as a normal character when the stream is being read back. The detectEncodingFromByteOrderMarks: false parameter is passed to the StreamReader constructor as well, so that it won't consume the BOM itself.
This produces this output, just like you wanted:
65279 ?
76 L
105 i
110 n
101 e
49 1
13
10
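Another way to look at the identifier is to skip the reader entirely and dump the raw bytes. The following is only a minimal sketch: the class name BomBytesDemo and the buffer size are illustrative choices, not part of the original code. Because no decoder touches the data, nothing can strip the preamble.

```csharp
using System;
using System.IO;
using System.Text;

class BomBytesDemo
{
    static void Main()
    {
        using (var stream = new MemoryStream())
        {
            // Write some text with an encoding that emits the UTF-8 identifier.
            using (var writer = new StreamWriter(stream, new UTF8Encoding(true), 1024, leaveOpen: true))
            {
                writer.Write("Line1");
            }

            // Dump the raw bytes; the first three are the UTF-8 BOM.
            Console.WriteLine(BitConverter.ToString(stream.ToArray()));
            // prints EF-BB-BF-4C-69-6E-65-31
        }
    }
}
```

The three bytes EF BB BF are the UTF-8 encoding of U+FEFF, which is the value 65279 shown in the character-level output above.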
It is worth mentioning that using the BOM to identify UTF-8 is generally discouraged. The BOM mainly exists so that the two variants of UTF-16 can be distinguished (UTF-16LE and UTF-16BE, "little endian" and "big endian" respectively). It has been co-opted as a means of identifying UTF-8 as well, but it is better to simply know what the encoding is, which is why XML and HTML can declare the encoding in ASCII-compatible text near the start of the file, and why MIME's charset parameter exists. A single character isn't nearly as reliable as these more explicit means.

Related

reading stream with right encoding in C#

I'm trying to read a stream with iso-8859-1 encoding with C#:
using (var reader = new StreamReader(stream, System.Text.Encoding.GetEncoding("iso-8859-1")))
{
var current_enc = reader.CurrentEncoding; // value is UTF8
}
I set the encoding to iso-8859-1, but it does not seem to stay set afterwards.
Has anyone seen this behaviour?

I found the StreamReader parameter detectEncodingFromByteOrderMarks.
If it is set to false, the reader does not try to detect the encoding and uses the one you passed in.
using (StreamReader reader = new StreamReader(stream,System.Text.Encoding.GetEncoding("iso-8859-1"), false))
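To see the effect of that flag in isolation, here is a small sketch. The input bytes and the class name DetectEncodingDemo are made up for illustration: a UTF-8 BOM followed by the letter 'A'. It reads the same data twice, once with detection on and once with it off, and reports what CurrentEncoding says after reading.

```csharp
using System;
using System.IO;
using System.Text;

class DetectEncodingDemo
{
    static void Main()
    {
        // Hypothetical input: UTF-8 BOM (EF BB BF) followed by 'A'.
        byte[] data = { 0xEF, 0xBB, 0xBF, 0x41 };
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        foreach (bool detect in new[] { true, false })
        {
            using (var reader = new StreamReader(new MemoryStream(data), latin1, detect))
            {
                // CurrentEncoding can only change once the reader has looked at the data.
                string text = reader.ReadToEnd();
                Console.WriteLine("detect=" + detect + ": "
                    + reader.CurrentEncoding.WebName + ", " + text.Length + " char(s)");
            }
        }
    }
}
```

With detection on, the reader should switch to UTF-8 and consume the BOM; with detection off, it should keep iso-8859-1 and return the three BOM bytes as ordinary characters.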

Decompress file with wrong size

I have a method that decompresses *.gz file:
using (FileStream originalFileStream = new FileStream(gztempfilename, FileMode.Open, FileAccess.Read))
{
using (FileStream decompressedFileStream = new FileStream(outputtempfilename, FileMode.Create, FileAccess.Write))
{
using (GZipStream decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress))
{
decompressionStream.CopyTo(decompressedFileStream);
}
}
}
It worked perfectly, but recently I received pack of files with wrong size:
When I open them with 7-zip they have Packed Size ~ 1,600,000 and Size = 7 (it should be ~20,000,000).
So when I extract them using this code I get only a part of the file. But when I extract this file using 7-zip I get full file.
How can I handle this situation in my code?

My guess is that the other end makes a mistake when GZipping the files. It looks like it does not set the ISIZE bytes correctly.
The ISIZE bytes are the last four bytes of a valid GZip file and come after a 32-bit CRC value which in turn comes directly after the compressed data bytes.
7-Zip seems to be robust against such mistakes whereas the GZipStream is not. It is odd however that 7-Zip is not showing you any errors. It should show you (tested with 7-Zip 16.02 x64/Win7)...
CRC error in case the size is simply wrong,
"Unexpected end of data" in case some or all of the ISIZE bytes are cut off,
"There are some data after end of the payload data" in case there is more data following the ISIZE bytes.
7-Zip always uses the last four bytes of the packed file to determine the size of the original unpacked file without checking if the file is valid and whether the bytes read for that are actually the ISIZE bytes.
You can verify this by checking those last four bytes of the GZipped file with a hex viewer. For your example they should be exactly 07 00 00 00.
If you know the exact size of the unpacked original file you could replace those bytes so that they specify the correct size. For instance, if the unpacked file's size is 20,000,078, which is 01312D4E in hex (0-padded to eight digits), those bytes should be 4E 2D 31 01.
In case you don't know the exact size you can try replacing them with the maximum value, i.e. FF FF FF FF.
After that try your unpack code again.
This is obviously only a hacky solution to your problem. Better try fixing the code that GZips the files you receive or try to find a library that is more robust than GZipStream.
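Instead of a hex viewer, the trailing four bytes can also be read in code. This is a rough sketch, not a validation routine: ReadIsize and GzipIsizeDemo are hypothetical names, and the method simply interprets the last four bytes as a little-endian unsigned integer without checking that the data is a valid GZip file.

```csharp
using System;
using System.IO;
using System.IO.Compression;

class GzipIsizeDemo
{
    // Interprets the last four bytes of a seekable stream as little-endian ISIZE.
    static uint ReadIsize(Stream gz)
    {
        var tail = new byte[4];
        gz.Seek(-4, SeekOrigin.End);
        int read = 0;
        while (read < 4)
        {
            int n = gz.Read(tail, read, 4 - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return (uint)(tail[0] | tail[1] << 8 | tail[2] << 16 | tail[3] << 24);
    }

    static void Main()
    {
        byte[] payload = new byte[1000];
        using (var packed = new MemoryStream())
        {
            // Dispose the GZipStream first so the trailer (CRC and ISIZE) is written.
            using (var gzip = new GZipStream(packed, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(payload, 0, payload.Length);
            }
            Console.WriteLine(ReadIsize(packed)); // prints 1000
        }
    }
}
```

For a file on disk, the same method could be called on a FileStream opened for reading; a value of 7 for a file that unpacks to roughly 20 MB would confirm the diagnosis above.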

I've used ICSharpCode.SharpZipLib.GZip.GZipInputStream from this library instead of System.IO.Compression.GZipStream and it helped.

Did you try this to check the size? For example:
byte[] bArray;
using (FileStream f = new FileStream(tempFile, FileMode.Open))
{
bArray = new byte[f.Length];
f.Read(bArray, 0, (int)f.Length);
}
Regards
Try:
GZipStream uncompressed = new GZipStream(streamIn, CompressionMode.Decompress, true);
FileStream streamOut = new FileStream(tempDoc[0], FileMode.Create, FileAccess.Write, FileShare.None);

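This looks like a problem with how GZipStream is used when compressing: the original file length is not written at the end of the .gz data. GZipStream appears to write that trailer only when it is closed or disposed, so the compressed bytes must be taken from the underlying stream after that point.
You need to change the way you compress your files using GZipStream.
This way works:
byte[] inputBytes = Encoding.UTF8.GetBytes(output);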
using (var outputStream = new MemoryStream())
{
using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
gZipStream.Write(inputBytes, 0, inputBytes.Length);
System.IO.File.WriteAllBytes("file.xml.gz", outputStream.ToArray());
}
And this way causes the error you have (whether or not you call Flush()):
byte[] inputBytes = Encoding.UTF8.GetBytes(output);
using (var outputStream = new MemoryStream())
{
using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
{
gZipStream.Write(inputBytes, 0, inputBytes.Length);
System.IO.File.WriteAllBytes("file.xml.gz", outputStream.ToArray());
}
}

You might need to call decompressedStream.Seek() after closing the gZip stream.

Encode a string to UTF-8 with BOM in C# [duplicate]

I'm having an issue with StreamWriter and byte order marks. The documentation seems to state that the Encoding.UTF8 encoding has byte order marks enabled, but when files are being written some have the marks while others don't.
I'm creating the stream writer in the following way:
this.Writer = new StreamWriter(this.Stream, System.Text.Encoding.UTF8);
Any ideas on what could be happening would be appreciated.

As someone pointed that out already, calling without the encoding argument does the trick.
However, if you want to be explicit, try this:
using (var sw = new StreamWriter(this.Stream, new UTF8Encoding(false)))
To disable the BOM, the key is to construct the writer with a new UTF8Encoding(false) instead of just Encoding.UTF8. This is the same as calling StreamWriter without the encoding argument; internally it does the same thing.
To enable BOM, use new UTF8Encoding(true) instead.
Update: Since Windows 10 v1903, when saving as UTF-8 in notepad.exe, BOM byte is now an opt-in feature instead.

The issue is due to the fact that you are using the static UTF8 property on the Encoding class.
When the GetPreamble method is called on the instance of the Encoding class returned by the UTF8 property, it returns the byte order mark (a byte array of three bytes), which is written to the stream before any other content (assuming a new stream).
You can avoid this by creating the instance of the UTF8Encoding class yourself, like so:
// As before.
this.Writer = new StreamWriter(this.Stream,
// Create the encoding yourself; the parameterless constructor does not emit a BOM.
new System.Text.UTF8Encoding());
As per the documentation for the parameterless constructor:
This constructor creates an instance that does not provide a Unicode byte order mark and does not throw an exception when an invalid encoding is detected.
This means that the call to GetPreamble will return an empty array, and therefore no BOM will be written to the underlying stream.
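The difference between the static property and a hand-built instance can be checked directly. This is a small sketch with an illustrative class name, PreambleDemo; it only calls GetPreamble and prints what comes back.

```csharp
using System;
using System.Text;

class PreambleDemo
{
    static void Main()
    {
        // The static property returns an encoding that provides the BOM.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // prints EF-BB-BF

        // The parameterless constructor provides no BOM.
        Console.WriteLine(new UTF8Encoding().GetPreamble().Length);            // prints 0

        // Passing true opts back in.
        Console.WriteLine(new UTF8Encoding(true).GetPreamble().Length);        // prints 3
    }
}
```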

My answer is based on HelloSam's answer, which contains all the necessary information.
However, I believe what the OP is asking for is how to make sure that the BOM is emitted into the file.
So instead of passing false to the UTF8Encoding constructor, you need to pass true.
using (var sw = new StreamWriter("text.txt", new UTF8Encoding(true)))
Try the code below, open the resulting files in a hex editor and see which one contains BOM and which doesn't.
class Program
{
static void Main(string[] args)
{
const string nobomtxt = "nobom.txt";
File.Delete(nobomtxt);
using (Stream stream = File.OpenWrite(nobomtxt))
using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
{
writer.WriteLine("HelloПривет");
}
const string bomtxt = "bom.txt";
File.Delete(bomtxt);
using (Stream stream = File.OpenWrite(bomtxt))
using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
{
writer.WriteLine("HelloПривет");
}
}
}

The only time I've seen that constructor not add the UTF-8 BOM is if the stream is not at position 0 when you call it. For example, in the code below, the BOM isn't written:
using (var s = File.Create("test2.txt"))
{
s.WriteByte(32);
using (var sw = new StreamWriter(s, Encoding.UTF8))
{
sw.WriteLine("hello, world");
}
}
As others have said, if you're using the StreamWriter(stream) constructor, without specifying the encoding, then you won't see the BOM.

Do you use the same constructor of the StreamWriter for every file? Because the documentation says:
To create a StreamWriter using UTF-8 encoding and a BOM, consider using a constructor that specifies encoding, such as StreamWriter(String, Boolean, Encoding).
I was in a similar situation a while ago. I ended up using the Stream.Write method instead of the StreamWriter and wrote the result of Encoding.GetPreamble() before writing the Encoding.GetBytes(stringToWrite)

I found this answer useful (thanks to @Philipp Grathwohl and @Nik), but in my case I'm using FileStream to accomplish the task, so the code that generates the BOM goes like this:
using (FileStream vStream = File.Create(pfilePath))
{
// Creates the UTF-8 encoding with parameter "encoderShouldEmitUTF8Identifier" set to true
Encoding vUTF8Encoding = new UTF8Encoding(true);
// Gets the preamble in order to attach the BOM
var vPreambleByte = vUTF8Encoding.GetPreamble();
// Writes the preamble first
vStream.Write(vPreambleByte, 0, vPreambleByte.Length);
// Gets the bytes from text
byte[] vByteData = vUTF8Encoding.GetBytes(pTextToSaveToFile);
vStream.Write(vByteData, 0, vByteData.Length);
vStream.Close();
}

It seems that if the file already existed and didn't contain a BOM, then it won't contain a BOM when overwritten; in other words, StreamWriter preserves the BOM (or its absence) when overwriting a file.

Could you please show a situation where it doesn't produce it? The only case I can find where the preamble isn't present is when nothing is ever written to the writer (Jim Mischel seems to have found another one, which is logical and more likely to be your problem; see his answer).
My test code:
var stream = new MemoryStream();
using(var writer = new StreamWriter(stream, System.Text.Encoding.UTF8))
{
writer.Write('a');
}
Console.WriteLine(stream.ToArray()
.Select(b => b.ToString("X2"))
.Aggregate((i, a) => i + " " + a)
);

Having read the source code of StreamWriter, you need to make sure you are writing from the start of the stream, for example by creating a new file; only then will the byte order mark be added to the file.
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L267
Code in the Flush method
if (!_haveWrittenPreamble)
{
_haveWrittenPreamble = true;
ReadOnlySpan<byte> preamble = _encoding.Preamble;
if (preamble.Length > 0)
{
_stream.Write(preamble);
}
}
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L129
Code that sets the value of _haveWrittenPreamble
// If we're appending to a Stream that already has data, don't write
// the preamble.
if (_stream.CanSeek && _stream.Position > 0)
{
_haveWrittenPreamble = true;
}

Using Encoding.Default instead of Encoding.UTF8 solved my problem.

binary file to string

I'm trying to read a binary file (for example an executable) into a string, then write it back:
FileStream fs = new FileStream("C:\\tvin.exe", FileMode.Open);
BinaryReader br = new BinaryReader(fs);
byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));
System.Text.Encoding enc = System.Text.Encoding.ASCII;
string myString = enc.GetString(bin);
fs.Close();
br.Close();
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
byte[] rebin = encoding.GetBytes(myString);
FileStream fs2 = new FileStream("C:\\tvout.exe", FileMode.Create);
BinaryWriter bw = new BinaryWriter(fs2);
bw.Write(rebin);
fs2.Close();
bw.Close();
This does not work (the result has exactly the same size in bytes but can't run).
If I do bw.Write(bin) the result is OK, but I must save it to a string.

When you decode the bytes into a string and re-encode them back into bytes, you're losing information. ASCII in particular is a very bad choice for this, since it throws away every byte value above 127 on the way, but you risk losing information when decoding and encoding regardless of the Encoding you pick, so you're not on the right path.
What you need is one of the BaseXX routines, which encode binary data as printable characters, typically for storage or transmission over a medium that only allows text (email and Usenet come to mind).
Ascii85 is one such algorithm, and the page contains links to different implementations. It has a ratio of 4:5 meaning that 4 bytes will be encoded as 5 characters (a 25% increase in size.)
If nothing else, there's already a Base64 encoding routine built into .NET. It has a ratio of 3:4 (a 33% increase in size), here:
Convert.ToBase64String Method
Convert.FromBase64String Method
Here's what your code can look like with these methods:
string myString;
using (FileStream fs = new FileStream("C:\\tvin.exe", FileMode.Open))
using (BinaryReader br = new BinaryReader(fs))
{
byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));
myString = Convert.ToBase64String(bin);
}
byte[] rebin = Convert.FromBase64String(myString);
using (FileStream fs2 = new FileStream("C:\\tvout.exe", FileMode.Create))
using (BinaryWriter bw = new BinaryWriter(fs2))
bw.Write(rebin);
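The loss is easy to reproduce with a handful of bytes. The following sketch uses made-up sample data and an illustrative class name, AsciiRoundTripDemo, and pushes the same bytes through an ASCII round trip and a Base64 round trip.

```csharp
using System;
using System.Text;

class AsciiRoundTripDemo
{
    static void Main()
    {
        // Sample data containing two bytes above 0x7F, which ASCII cannot represent.
        byte[] original = { 0x4D, 0x5A, 0x90, 0x00, 0xFF };

        // ASCII round trip: unrepresentable bytes come back as '?' (0x3F).
        string asAscii = Encoding.ASCII.GetString(original);
        byte[] viaAscii = Encoding.ASCII.GetBytes(asAscii);
        Console.WriteLine(BitConverter.ToString(viaAscii));  // prints 4D-5A-3F-00-3F

        // Base64 round trip: every byte survives.
        byte[] viaBase64 = Convert.FromBase64String(Convert.ToBase64String(original));
        Console.WriteLine(BitConverter.ToString(viaBase64)); // prints 4D-5A-90-00-FF
    }
}
```

This also explains why the broken output file had the same size as the input: each damaged byte is replaced by exactly one other byte.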

I don't think you can represent all bytes with ASCII in that way. Base64 is an alternative, but with a ratio between bytes and text of 3:4.
