I'm trying to parse a crg-file in C#. The file is mixed with plain text and binary data. The first section of the file contains plain text while the rest of the file is binary (lots of floats), here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
var f = reader.ReadSingle();
Debug.WriteLine(f);
}
When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for CR/LF/CRLF bytes as sentinels, although in your case it sounds like you could just look for the $$$$ using binary APIs, then decode the entire block before it, and scan forwards). When you think you have an entire line, you can use Encoding to parse it, the most convenient API being encoding.GetString(). When you've finished looking through the text data as binary, you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over the more direct APIs. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem, BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use a byte[] for the data, either via File.ReadAllBytes, or by renting an oversized byte[] from the array pool and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at data, it has gone, so you need to maintain your own back-buffers. The pipelines API is ideal for this when dealing with large data, but it is an advanced topic.
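For the "modest size" case, the idea above could look roughly like the sketch below. It is only an illustration under some assumptions: the whole file fits in a byte[], the binary section starts right after the line break that ends the first line containing $$$$, the header is ASCII, and the floats are stored in the machine's byte order (which is what BitConverter uses). The class and method names (CrgSketch, FindBinaryStart, DecodeHeader, ReadSingles) are made up for the example.

```csharp
using System;
using System.Text;

public static class CrgSketch
{
    // Returns the index of the first byte after the marker line,
    // or -1 if no "$$$$" marker (or no line break after it) is found.
    public static int FindBinaryStart(byte[] data)
    {
        const int markerLength = 4;
        int run = 0; // length of the current run of '$' bytes
        for (int i = 0; i < data.Length; i++)
        {
            run = data[i] == (byte)'$' ? run + 1 : 0;
            if (run >= markerLength)
            {
                // Skip the rest of the marker line, up to and including '\n'.
                int newLine = Array.IndexOf(data, (byte)'\n', i);
                return newLine < 0 ? -1 : newLine + 1;
            }
        }
        return -1;
    }

    // Decodes everything before the binary section as text (assumed ASCII).
    public static string DecodeHeader(byte[] data, int binaryStart) =>
        Encoding.ASCII.GetString(data, 0, binaryStart);

    // Reads consecutive 4-byte floats in host byte order (an assumption);
    // trailing bytes that do not form a whole float are ignored.
    public static float[] ReadSingles(byte[] data, int binaryStart)
    {
        int count = (data.Length - binaryStart) / sizeof(float);
        var values = new float[count];
        for (int i = 0; i < count; i++)
            values[i] = BitConverter.ToSingle(data, binaryStart + i * sizeof(float));
        return values;
    }
}
```

You would load the file with File.ReadAllBytes, call FindBinaryStart, and pass the result to the other two methods. If the file's byte order differs from the machine's, each 4-byte group would need to be reversed before conversion.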
UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
{
var line = sr.ReadLine();
while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
{
line = sr.ReadLine();
}
}
using (BinaryReader reader = new BinaryReader(fs))
{
// TODO: Start reading the binary data
}
}
Solution
I know this is far from the most optimized solution, but in my case it did the trick, and since the plain text section of the file was known to be fairly small, it didn't cause any noticeable performance issues. Here's the code:
using var fileStream = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);
var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;
while (!foundNewLine)
{
var c = reader.ReadChar();
if (!foundMarker)
{
currentString += c;
if (currentString.Length > markerString.Length)
currentString = currentString.Substring(1);
if (currentString == markerString)
foundMarker = true;
}
else
{
if (c == newLine)
foundNewLine = true;
}
}
if (foundNewLine)
{
// Read binary
}
Note: If you're dealing with larger or more complex files, you should probably take a look at Marc Gravell's answer and the comment sections.
I'm having an issue with StreamWriter and byte order marks. The documentation seems to state that the Encoding.UTF8 encoding has byte order marks enabled, but when files are being written, some have the marks while others don't.
I'm creating the stream writer in the following way:
this.Writer = new StreamWriter(this.Stream, System.Text.Encoding.UTF8);
Any ideas on what could be happening would be appreciated.
As someone already pointed out, calling the constructor without the encoding argument does the trick.
However, if you want to be explicit, try this:
using (var sw = new StreamWriter(this.Stream, new UTF8Encoding(false)))
To disable the BOM, the key is to construct the writer with a new UTF8Encoding(false) instead of just Encoding.UTF8. This is the same as calling StreamWriter without the encoding argument; internally it does the same thing.
To enable BOM, use new UTF8Encoding(true) instead.
Update: Since Windows 10 v1903, the BOM is opt-in when saving as UTF-8 in notepad.exe.
The issue is due to the fact that you are using the static UTF8 property on the Encoding class.
When the GetPreamble method is called on the Encoding instance returned by the UTF8 property, it returns the byte order mark (a byte array of three bytes), which is written to the stream before any other content (assuming a new stream).
You can avoid this by creating the instance of the UTF8Encoding class yourself, like so:
// As before.
this.Writer = new StreamWriter(this.Stream,
// Create the encoding yourself; the parameterless constructor does not emit a BOM.
new System.Text.UTF8Encoding());
As per the documentation for the default parameterless constructor (emphasis mine):
This constructor creates an instance that does not provide a Unicode byte order mark and does not throw an exception when an invalid encoding is detected.
This means that the call to GetPreamble will return an empty array, and therefore no BOM will be written to the underlying stream.
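To see the difference between the encodings without writing any files, you can print the preamble directly. This is just a small sketch; the PreambleDemo class and DescribePreamble method are names invented for the example, and only Encoding.GetPreamble and BitConverter.ToString are real framework calls.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats the preamble (BOM) bytes an encoding would emit, e.g. "EF BB BF".
    // An encoding without a BOM yields an empty string.
    public static string DescribePreamble(Encoding encoding) =>
        BitConverter.ToString(encoding.GetPreamble()).Replace('-', ' ');
}
```

DescribePreamble(Encoding.UTF8) and DescribePreamble(new UTF8Encoding(true)) should both give EF BB BF, while new UTF8Encoding() and new UTF8Encoding(false) give an empty string.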
My answer is based on HelloSam's one which contains all the necessary information.
Only I believe what OP is asking for is how to make sure that BOM is emitted into the file.
So instead of passing false to UTF8Encoding ctor you need to pass true.
using (var sw = new StreamWriter("text.txt", false, new UTF8Encoding(true)))
Try the code below, open the resulting files in a hex editor and see which one contains BOM and which doesn't.
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        // Written without a BOM.
        const string nobomtxt = "nobom.txt";
        File.Delete(nobomtxt);
        using (Stream stream = File.OpenWrite(nobomtxt))
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            writer.WriteLine("HelloПривет");
        }
        // Written with a BOM.
        const string bomtxt = "bom.txt";
        File.Delete(bomtxt);
        using (Stream stream = File.OpenWrite(bomtxt))
        using (var writer = new StreamWriter(stream, new UTF8Encoding(true)))
        {
            writer.WriteLine("HelloПривет");
        }
    }
}
The only time I've seen that constructor not add the UTF-8 BOM is if the stream is not at position 0 when you call it. For example, in the code below, the BOM isn't written:
using (var s = File.Create("test2.txt"))
{
s.WriteByte(32);
using (var sw = new StreamWriter(s, Encoding.UTF8))
{
sw.WriteLine("hello, world");
}
}
As others have said, if you're using the StreamWriter(stream) constructor, without specifying the encoding, then you won't see the BOM.
Do you use the same constructor of the StreamWriter for every file? Because the documentation says:
To create a StreamWriter using UTF-8 encoding and a BOM, consider using a constructor that specifies encoding, such as StreamWriter(String, Boolean, Encoding).
I was in a similar situation a while ago. I ended up using the Stream.Write method instead of the StreamWriter and wrote the result of Encoding.GetPreamble() before writing the Encoding.GetBytes(stringToWrite).
I found this answer useful (thanks to @Philipp Grathwohl and @Nik), but in my case I'm using FileStream to accomplish the task, so the code that generates the BOM goes like this:
using (FileStream vStream = File.Create(pfilePath))
{
// Creates the UTF-8 encoding with parameter "encoderShouldEmitUTF8Identifier" set to true
Encoding vUTF8Encoding = new UTF8Encoding(true);
// Gets the preamble in order to attach the BOM
var vPreambleByte = vUTF8Encoding.GetPreamble();
// Writes the preamble first
vStream.Write(vPreambleByte, 0, vPreambleByte.Length);
// Gets the bytes from text
byte[] vByteData = vUTF8Encoding.GetBytes(pTextToSaveToFile);
vStream.Write(vByteData, 0, vByteData.Length);
vStream.Close();
}
It seems that if the file already existed and didn't contain a BOM, then it won't contain a BOM when overwritten; in other words, StreamWriter preserves the BOM (or its absence) when overwriting a file.
Could you please show a situation where it doesn't produce it? The only case I can find where the preamble isn't present is when nothing is ever written to the writer (Jim Mischel seems to have found another one, which is logical and more likely to be your problem; see his answer).
My test code:
var stream = new MemoryStream();
using(var writer = new StreamWriter(stream, System.Text.Encoding.UTF8))
{
writer.Write('a');
}
Console.WriteLine(stream.ToArray()
.Select(b => b.ToString("X2"))
.Aggregate((i, a) => i + " " + a)
);
Having read the source code of StreamWriter, you need to make sure you are creating a new file; only then is the byte order mark added to the file.
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L267
Code in the Flush method:
if (!_haveWrittenPreamble)
{
_haveWrittenPreamble = true;
ReadOnlySpan<byte> preamble = _encoding.Preamble;
if (preamble.Length > 0)
{
_stream.Write(preamble);
}
}
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L129
Code that sets the value of _haveWrittenPreamble:
// If we're appending to a Stream that already has data, don't write
// the preamble.
if (_stream.CanSeek && _stream.Position > 0)
{
_haveWrittenPreamble = true;
}
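The effect of that position check can be reproduced in memory. The sketch below is only an illustration: BomPositionDemo and its Write method are names invented for the example, and a MemoryStream stands in for the file.

```csharp
using System.IO;
using System.Text;

public static class BomPositionDemo
{
    // Writes "a" through a StreamWriter using Encoding.UTF8, optionally after
    // putting one byte (a space) into the stream first, and returns all bytes.
    public static byte[] Write(bool prefillStream)
    {
        using (var stream = new MemoryStream())
        {
            if (prefillStream)
                stream.WriteByte(32); // the stream position is now 1
            using (var writer = new StreamWriter(stream, Encoding.UTF8))
            {
                writer.Write('a');
            }
            // MemoryStream.ToArray still works after the stream is closed.
            return stream.ToArray();
        }
    }
}
```

Write(false) should return EF BB BF 61 (BOM plus 'a'), while Write(true) should return 20 61, with no BOM, because the stream position was already past 0 when the writer was created.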
Using Encoding.Default instead of Encoding.UTF8 solved my problem.
I have a large amount of data, around 5 GB, in the form of bytes.
I need to store this data in a file, ServerData.xml. The data should first be converted into a string and then saved to the file so that we can perform operations on the file.
I used the code below to convert the stream of bytes to a string and then save it to a file.
private const string fileName = "ServerData.xml";
public void ProcessBuffer(byte[] receiveBuffer, int bytes)
{
if (!File.Exists(fileName))
{
using (File.Create(fileName)) { };
}
TextWriter tw = new StreamWriter(fileName, true);
tw.Write(Encoding.UTF8.GetString(receiveBuffer).TrimEnd((Char)0));
tw.Close();
}
Is this the right way?
Or can you suggest a better way, so that there won't be any memory issues in the future?
The code in your question can only work if ProcessBuffer is always called with a UTF-8 encoded text that is broken on code point boundaries. That seems pretty unlikely to me, so I would expect that you encounter errors when decoding to text.
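To illustrate the boundary problem, here is a small sketch that decodes the same bytes two ways: chunk by chunk with Encoding.UTF8.GetString (as in the question), and with a single stateful Decoder that carries incomplete sequences over to the next chunk. The class and method names are invented for the example, and the Decoder variant is shown only for comparison, not as a recommendation to decode at all.

```csharp
using System.Text;

public static class ChunkDecodingDemo
{
    // Decodes each chunk independently, as the code in the question does.
    public static string DecodeIndependently(byte[][] chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
            sb.Append(Encoding.UTF8.GetString(chunk));
        return sb.ToString();
    }

    // Decodes with one stateful Decoder, which remembers an incomplete
    // multi-byte sequence until the following chunk completes it.
    public static string DecodeWithState(byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            // GetMaxCharCount gives a safe upper bound for the output buffer.
            var chars = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }
}
```

For example, "é" is encoded as the two bytes C3 A9. If a receive buffer ends after C3, decoding the chunks independently produces replacement characters instead of "é", while the stateful decoder reassembles it.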
However, decoding to text and then writing, is rather pointless and indeed counter-productive. The bytes are already UTF-8 encoded. Write them directly to file as they arrive from the socket. Don't perform any processing of them. When you come to read the XML using XmlReader, the parser will read the encoding as UTF-8 from the document's XML declaration, and be able to decode the rest of the document. I am assuming that the document's XML declaration specifies UTF-8 but that seems highly likely. You should check.
You should get rid of the text writer which is no use to you for writing bytes. Write the bytes directly to a file stream. And try to avoid opening and closing the file repeatedly. That's very inefficient. Open and close the file exactly once.
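A minimal sketch of that advice might look like the following. RawFileSink is a hypothetical helper name, and how it fits into your socket code (who creates it, when it is disposed) is an assumption on my part.

```csharp
using System;
using System.IO;

// Hypothetical helper: opens the target file once and appends raw chunks.
public sealed class RawFileSink : IDisposable
{
    private readonly FileStream _stream;

    public RawFileSink(string path)
    {
        // FileMode.Create starts a fresh file; FileMode.Append would keep old data.
        _stream = new FileStream(path, FileMode.Create, FileAccess.Write);
    }

    // Writes only the first `count` bytes of the buffer, with no decoding.
    public void ProcessBuffer(byte[] buffer, int count) =>
        _stream.Write(buffer, 0, count);

    // Flushes and closes the file exactly once.
    public void Dispose() => _stream.Dispose();
}
```

You would create one RawFileSink before the receive loop starts, call ProcessBuffer for every chunk that arrives, and dispose it when the transfer is complete.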
Why do you need to convert it to a string?
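using System.IO;
public static void WriteBytes(byte[] bytes, string filename)
{
    // FileMode.Append adds each chunk to the end of the file, creating it if needed.
    // (FileMode.OpenOrCreate would start at position 0 and overwrite earlier chunks.)
    using (FileStream fs = new FileStream(filename, FileMode.Append))
    using (BinaryWriter writer = new BinaryWriter(fs))
    {
        writer.Write(bytes);
    }
}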
You can simply write these bytes to a file using FileStream:
public void ProcessBuffer(byte[] receivedBuffer, int bytes)
{
using (var fileStream = new FileStream(fileName, FileMode.Create)) // overwrites file
{
fileStream.Write(receivedBuffer, 0, bytes);
}
}
Update: You won't be able to work with such a big XML document if you don't have enough resources. I would suggest reformatting this file. For example, I would parse this XML and insert data into a SQL database. Then, you can easily operate with such amounts of data.
I would prefer to write all the bytes to the file, and when reading, convert them to a string and then to XML using XDocument, XElement etc. Writing bytes to the file saves space and is efficient.
Instead of using FileStream, I would prefer the File.WriteAllBytes method.
private const string fileName = "ServerData.xml";
public void ProcessBuffer(byte[] receiveBuffer, int bytes)
{
    // Write only the valid part of the buffer (the first `bytes` bytes).
    var data = new byte[bytes];
    Array.Copy(receiveBuffer, data, bytes);
    File.WriteAllBytes(fileName, data);
    // And when reading
    byte[] stored = File.ReadAllBytes(fileName);
    // Decode to a string, then parse the XML (e.g. with XDocument.Parse)
    string xml = Encoding.UTF8.GetString(stored);
}
I'm trying to serialize/deserialize a string, using this code:
private byte[] StrToBytes(string str)
{
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream();
bf.Serialize(ms, str);
ms.Seek(0, 0);
return ms.ToArray();
}
private string BytesToStr(byte[] bytes)
{
BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
return Convert.ToString(bfx.Deserialize(msx));
}
These two methods work fine if I only work with string variables.
But if I serialize a string and save it to a file, then read it back and deserialize it again, I end up with only the first portion of the string.
So I believe I have a problem with my file save/read operation. Here is the code for my save/read:
private byte[] ReadWhole(string fileName)
{
try
{
using (BinaryReader br = new BinaryReader(new FileStream(fileName, FileMode.Open)))
{
return br.ReadBytes((int)br.BaseStream.Length);
}
}
catch (Exception)
{
return null;
}
}
private void WriteWhole(byte[] wrt,string fileName,bool append)
{
FileMode fm = FileMode.OpenOrCreate;
if (append)
fm = FileMode.Append;
using (BinaryWriter bw = new BinaryWriter(new FileStream(fileName, fm)))
{
bw.Write(wrt);
}
return;
}
Any help will be appreciated.
Many thanks
Sample Problematic Run:
WriteWhole(StrToBytes("First portion of text"),"filename",true);
WriteWhole(StrToBytes("Second portion of text"),"filename",true);
byte[] readBytes = ReadWhole("filename");
string deserializedStr = BytesToStr(readBytes); // here deserializedStr becomes "First portion of text"
Just use
Encoding.UTF8.GetBytes(string s)
Encoding.UTF8.GetString(byte[] b)
and don't forget to add System.Text in your using statements
BTW, why do you need to serialize a string and save it that way?
You can just use File.WriteAllText() or File.WriteAllBytes. The same way you can read it back, File.ReadAllBytes() and File.ReadAllText()
The problem is that you are writing two strings to the file, but only reading one back.
If you want to read back multiple strings, then you must deserialize multiple strings. If there are always two strings, then you can just deserialize two strings. If you want to store any number of strings, then you must first store how many strings there are, so that you can control the deserialization process.
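As a rough sketch of the "store the count first" idea: the example below swaps BinaryFormatter for BinaryWriter and BinaryReader, which is my substitution rather than anything in the question, and StringListFile, Save and Load are made-up names. It writes all strings in one go instead of appending to an existing file.

```csharp
using System.Collections.Generic;
using System.IO;

public static class StringListFile
{
    // Writes the number of strings first, then each string
    // (BinaryWriter length-prefixes every string it writes).
    public static void Save(string path, IReadOnlyList<string> items)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write(items.Count);
            foreach (string item in items)
                writer.Write(item);
        }
    }

    // Reads the count back and then exactly that many strings.
    public static List<string> Load(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            int count = reader.ReadInt32();
            var items = new List<string>(count);
            for (int i = 0; i < count; i++)
                items.Add(reader.ReadString());
            return items;
        }
    }
}
```

With this layout the reader knows how many strings to expect, so both portions of text come back instead of only the first one.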
If you are trying to hide data (as indicated by your comment to another answer), then this is not a reliable way to accomplish that goal. On the other hand, if you are storing data on a user's hard drive, and the user is running your program on their local machine, then there is no way to hide the data from them, so this is as good as anything else.