I want to read/write a binary file which has the following structure:
The file is composed of "RECORDS". Each "RECORD" has the following structure (I will use the first record as an example):
START byte: 0x5A (always 1 byte, fixed value 0x5A)
LENGTH bytes: 0x00 0x16 (always 2 bytes; the value can range from 0x00 0x02 to 0xFF 0xFF)
CONTENT: the number of bytes indicated by the decimal value of the LENGTH field minus 2. In this case the LENGTH value is 22 (0x00 0x16 converted to decimal), so the CONTENT contains 20 (22 - 2) bytes.
My goal is to read each record one by one, and write it to an output file.
Currently I have a read function and a write function (some pseudocode):
private void Read(BinaryReader binaryReader, BinaryWriter binaryWriter)
{
    const byte START = 0x5A;

    while (binaryReader.PeekChar() != -1)
    {
        // Check the first byte, which should be equal to 0x5A
        if (binaryReader.ReadByte() != START)
        {
            throw new Exception("0x5A Expected");
        }
        // Extract the length field value
        byte[] length = binaryReader.ReadBytes(2);
        // Convert the length field to decimal
        int decimalLength = GetLength(length);
        // Extract the content field value
        byte[] content = binaryReader.ReadBytes(decimalLength - 2);
        // DO WORK
        // modifying the content
        // Writing the record
        Write(binaryWriter, content, length, START);
    }
}
private void Write(BinaryWriter binaryWriter, byte[] content, byte[] length, byte START)
{
    binaryWriter.Write(START);
    binaryWriter.Write(length);
    binaryWriter.Write(content);
}
This approach is actually working. However, since I am dealing with very large files, I find it performs poorly, because I read and write three times for each record. I would like to read big chunks of data instead of small numbers of bytes, and maybe work in memory, but my experience with streams stops at BinaryReader and BinaryWriter. Thanks in advance.
FileStream is already buffered, so I'd expect it to work pretty well. You could always create a BufferedStream around the original stream to add extra buffering if you really need to, but I doubt it would make a significant difference.
You say it's "not performing at all" - how fast is it working? How sure are you that the IO is where your time is going? Have you performed any profiling of the code?
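If you do want to try extra buffering, a minimal sketch of the wrapping (the file names and the 64 KB buffer size are placeholders, not tuned values):
// Wrap the underlying FileStreams in BufferedStreams before handing them
// to BinaryReader/BinaryWriter; Read is the method from the question.
using (var input = new BufferedStream(File.OpenRead("input.bin"), 64 * 1024))
using (var output = new BufferedStream(File.Create("output.bin"), 64 * 1024))
using (var binaryReader = new BinaryReader(input))
using (var binaryWriter = new BinaryWriter(output))
{
    Read(binaryReader, binaryWriter);
}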
I might also suggest that you read 3 (or 6?) bytes initially, instead of 2 separate reads. Put the initial bytes in a small array, check the 0x5A check byte, then the 2-byte length indicator, then the 3-byte AFP op-code, THEN read the remainder of the AFP record.
It's a small difference, but it gets rid of one of your read calls.
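For example, the body of the while loop in the question could become something like this (a sketch; it assumes the big-endian length shown in the question and reads only the 3 header bytes, but the same idea extends to 6 with the op-code):
// Read the 1-byte start marker and the 2-byte length in a single call.
byte[] header = binaryReader.ReadBytes(3);
if (header.Length < 3)
    break;   // end of file
if (header[0] != START)
    throw new Exception("0x5A Expected");

// Big-endian length, as in the example (0x00 0x16 == 22).
int decimalLength = (header[1] << 8) | header[2];
byte[] content = binaryReader.ReadBytes(decimalLength - 2);
Write(binaryWriter, content, new[] { header[1], header[2] }, START);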
I'm no Jon Skeet, but I did work at one of the biggest print & mail shops in the country for quite a while, and we did mostly AFP output :-)
(usually in C, though)
When writing a string to a binary file using C#, the length (in bytes) is automatically prepended to the output. According to the MSDN documentation this is an unsigned integer, but it is also a single byte in their example. The example they give is that a single UTF-8 character would be written as three bytes: 1 size byte and 2 bytes for the character. This is fine for strings up to length 255, and matches the behaviour I've observed.
However, if your string is longer than 255 bytes, the size of the unsigned integer grows as necessary. As a simple example, consider 1024 characters as:
string header = "ABCDEFGHIJKLMNOP";
for (int ii = 0; ii < 63; ii++)
{
    header += "ABCDEFGHIJKLMNOP";
}
fileObject.Write(header);
results in 2 bytes prepending the string. Creating a string of length 2^17 results in a somewhat maddening 3-byte prefix.
The question, therefore, is how to know how many bytes to read to get the size of what follows when reading? I wouldn't necessarily know a priori the header size. Ultimately, can I force the Write(string) method to always use a consistent size (say 2 bytes)?
A possible workaround is to write my own write(string) method, but I would like to avoid that for obvious reasons (similar questions here and here accept this as an answer). Another more palatable workaround is to have the reader look for a specific character that starts the ASCII string information (maybe an unprintable character?), but that is not infallible. A final workaround (that I can think of) would be to force the string to be within the range of sizes for a particular number of size bytes; again, that is not ideal.
While forcing the size of the byte array to be consistent is the easiest, I have control over the reader so any clever reader solutions are also welcome.
BinaryWriter and BinaryReader aren't the only way of writing binary data; simply: they provide a convention that is shared between that specific reader and writer. No, you can't tell them to use another convention - unless of course you subclass both of them and override the ReadString and Write(string) methods completely.
If you want to use a different convention, then simply: don't use BinaryReader and BinaryWriter. It is pretty easy to talk to a Stream directly using any text Encoding you want to get hold of the bytes and the byte count. Then you can use whatever convention you want. If you only ever need to write strings up to 65k then sure: use fixed 2 bytes (unsigned short). You'll also need to decide which byte comes first, of course (the "endianness").
As for the size of the prefix: it is essentially using:
int byteCount = this._encoding.GetByteCount(value);
this.Write7BitEncodedInt(byteCount);
with:
protected void Write7BitEncodedInt(int value)
{
    uint num = (uint) value;
    while (num >= 0x80)
    {
        this.Write((byte) (num | 0x80));
        num = num >> 7;
    }
    this.Write((byte) num);
}
This type of encoding of lengths is pretty common - it is the same idea as the "varint" that "protobuf" uses, for example (base-128, least significant group first, retaining bit order in 7-bit groups, 8th bit as continuation)
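For completeness, the matching decoder is essentially the following (a reconstruction of BinaryReader.Read7BitEncodedInt based on the encoding just described):
// Accumulate 7-bit groups, least significant first, until a byte arrives
// without the continuation (8th) bit set.
static int Read7BitEncodedInt(BinaryReader reader)
{
    int result = 0;
    int shift = 0;
    byte b;
    do
    {
        if (shift == 5 * 7)
            throw new FormatException("Invalid 7-bit encoded Int32.");
        b = reader.ReadByte();
        result |= (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);
    return result;
}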
If you want to write the length yourself:
using (var bw = new BinaryWriter(fs))
{
    bw.Write(length); // Use a byte, a short...
    bw.Write(Encoding.Unicode.GetBytes("Your string"));
}
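For completeness, the matching read side under that convention might look like this (assuming the byte count was written as a 2-byte unsigned short, which BinaryReader reads little-endian):
using (var br = new BinaryReader(fs))
{
    ushort byteCount = br.ReadUInt16();           // fixed 2-byte length prefix
    byte[] payload = br.ReadBytes(byteCount);     // raw string bytes
    string text = Encoding.Unicode.GetString(payload);
}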
I have a raw byte stream stored on a file (rawbytes.txt) that I need to parse and output to a CSV-style text file.
The input of raw bytes (when read as characters/long/int etc.) looks something like this:
A2401028475764B241102847576511001200C...
Parsed it should look like:
OutputA.txt
(Field1,Field2,Field3) - heading
A,240,1028475764
OutputB.txt
(Field1,Field2,Field3,Field4,Field5) - heading
B,241,1028475765,1100,1200
OutputC.txt
C,...//and so on
Essentially, it's a hex-dump-style input of bytes that is continuous without any line terminators or gaps between data that needs to be parsed. The data, as seen above, consists of different data types one after the other.
Here's a snippet of my code - because there are no commas within any field, and no need arises to use "" (i.e. a CSV wrapper), I'm simply using TextWriter to create the CSV-style text file as follows:
if (File.Exists(fileName))
{
    using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
    {
        inputCharIdentifier = reader.ReadChar();
        switch (inputCharIdentifier)
        {
            case 'A':
                field1 = reader.ReadUInt64();
                field2 = reader.ReadUInt64();
                field3 = reader.ReadChars(10);
                string strtmp = new string(field3);
                //and so on
                using (TextWriter writer = File.AppendText("outputA.txt"))
                {
                    writer.WriteLine(field1 + "," + field2 + "," + strtmp);
                }
                break;
            case 'B':
                //code...
My question is simple - how do I use a loop to read through the entire file? Generally, it exceeds 1 GB (which rules out File.ReadAllBytes and the methods suggested at Best way to read a large file into a byte array in C#?) - I considered using a while loop, but PeekChar is not suitable here. Also, cases A, B and so on have different-sized input - in other words, A might be 40 bytes total, while B is 50 bytes. So the use of a fixed-size buffer, say inputBuf[1000], or [50] for instance - if they were all the same size - wouldn't work well either, AFAIK.
Any suggestions? I'm relatively new to C# (2 months) so please be gentle.
You could read the file byte by byte, appending each byte to a currentBlock byte array until you find the next block. If the byte identifies a new block, you can parse the currentBlock using your case trick and then start a new currentBlock with the character just read.
This approach works even if the id of the next block is longer than 1 byte - in this case you just parse currentBlock[0, currentBlock.Length - lenOfCurrentIdInBytes] - in other words you read a little too much, but then you parse only what is needed and use what is left as the base for the next currentBlock.
If you want more speed you can read the file in chunks of X bytes, but apply the same logic (a rough sketch follows below).
You said "The issue is that the data is not 100% kosher - i.e. there are situations where I need to separately deal with the possibility that the character I expect to identify each block is not in the right place." but building a currentBlock still should work. The code surely will have some complications, maybe something like nextBlock, but I'm guessing here without knowing what incorrect data you have to deal with.
I have some data whose exact structure I know. It has to be written to files second by second.
The structs contain fields of type double, but with different names. The same number of structs has to be written to the file every second.
The thing is ..
Which is the better approach when it comes to reading the data back:
1 - Convert the structs to bytes, then write them while indexing the byte that marks the end of each second
2 - Write CSV data and index the byte that marks the end of each second
The data is requested from the file on a random basis.
So in both cases I will set the position of the FileStream to the byte where that second starts.
In the first case I will use the following for each of the structs in that second to get the whole data:
_filestream.Read(buffer, 0, buffer.Length);
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
oReturn = Marshal.PtrToStructure(handle.AddrOfPinnedObject(), _oType);
handle.Free(); // release the pinned handle
The previous approach is applied X times because there are around 100 structs every second.
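For reference, that pin-and-marshal step can be wrapped in a small helper; a sketch, where the Sample struct is made up (the real structs just contain doubles with other names) and System.Runtime.InteropServices is needed:
// Hypothetical example struct; layout must be sequential for marshalling.
[StructLayout(LayoutKind.Sequential)]
struct Sample
{
    public double Value1;
    public double Value2;
}

// Reads one struct's worth of bytes from the stream and marshals it to T.
static T ReadStruct<T>(FileStream stream) where T : struct
{
    byte[] buffer = new byte[Marshal.SizeOf(typeof(T))];
    stream.Read(buffer, 0, buffer.Length);
    GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        return (T)Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(T));
    }
    finally
    {
        handle.Free();   // always release the pinned handle
    }
}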
In the second case I will use string.Split(','), then fill in the data accordingly, since I know the exact order of my data:
file.Read(buffer, 0, buffer.Length);
string val = System.Text.ASCIIEncoding.ASCII.GetString(buffer);
string[] row = val.Split(',');
Edit: using the profiler does not show a difference, but I cannot simulate the exact real-life scenario because the file size might get really huge. I am looking for theoretical information for now.
I have a binary file. It consists of 4 messages, each 100 bytes in size.
I want to read the last 2 messages again. I am using a BinaryReader object.
I seek to position 200 and then I read: BinaryReaderObject.Read(charBuffer, 0, 10000),
where charBuffer is big enough.
The amount read is always short by 1. Instead of getting 200 I get 199; instead of getting 400 I get 399.
I checked and saw that the size of the file is correct and the data that I get starts at the right place.
Thanks,
Try this code and see what happens with your file.
String message = @"Read {0} bytes into the buffer.";
String fileName = @"TEST.DAT";
Int32 recordSize = 100;
Byte[] buffer = new Byte[recordSize];
using (BinaryReader br = new BinaryReader(File.OpenRead(fileName)))
{
    br.BaseStream.Seek(2 * recordSize, SeekOrigin.Begin);
    Console.WriteLine(message, br.Read(buffer, 0, recordSize));
    Console.WriteLine(message, br.Read(buffer, 0, recordSize));
}
Console.ReadLine();
I get the following output with a 400 byte test file.
Read 100 bytes into the buffer.
Read 100 bytes into the buffer.
If I seek to 2 * recordSize + 1 or use a 399 byte file, I get the following output.
Read 100 bytes into the buffer.
Read 99 bytes into the buffer.
So it works as expected.
Hint: zero-based array indexes, and zero-based positions ...
First byte will start at position zero.
Seek to the end and print position. Is it as expected?
Print the position after reading the 199 bytes -- is it as expected?
Try to read 1 more byte from the position after you get 199 -- do you get EOF?
How are you checking the size of the file?
Diff the 199 bytes with the expected ones -- what is different? (A quick way to run the position checks is sketched below.)
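A diagnostic sketch only; fileName and charBuffer are the ones from the question:
using (var br = new BinaryReader(File.OpenRead(fileName)))
{
    Console.WriteLine("Length: {0}", br.BaseStream.Length);
    br.BaseStream.Seek(200, SeekOrigin.Begin);
    int count = br.Read(charBuffer, 0, charBuffer.Length);
    Console.WriteLine("Read {0}, position now {1}", count, br.BaseStream.Position);
}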
Two things I would check
CR/LF transformations
That the size is what you think it is.
The problem was that I used a wrapper around the BinaryReader object.
The Read method has several overloads; instead of the char[] signature, I switched to byte[]. Until now it had worked fine because the data was only UTF-8 text, but when real binary data appeared at the beginning of each message, the char[] overload caused the problem.
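The distinction matters because the char[] overload decodes bytes through the reader's Encoding (UTF-8 by default), so the value returned is a character count and a multi-byte or invalid sequence throws the numbers off, while the byte[] overload returns the raw data untouched. A small sketch of the two overloads side by side (fileName is assumed):
using (var br = new BinaryReader(File.OpenRead(fileName)))
{
    // Raw bytes: the return value is the number of bytes taken from the stream.
    byte[] bytes = new byte[100];
    int bytesRead = br.Read(bytes, 0, bytes.Length);

    // Characters: decoded through the reader's Encoding, so the count returned
    // is in characters and need not match the number of bytes consumed.
    char[] chars = new char[100];
    int charsRead = br.Read(chars, 0, chars.Length);
}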
I'm writing a C# application that reads data from an SQL database generated by VB6 code. The data is an array of Singles. I'm trying to convert them to a float[]
Below is the VB6 code that wrote the data in the database (cannot change this code):
Set fso = New FileSystemObject
strFilePath = "c:\temp\temp.tmp"
' Output the data to a temporary file
intFileNr = FreeFile
Open strFilePath For Binary Access Write As #intFileNr
Put #intFileNr, , GetSize(Data, 1)
Put #intFileNr, , GetSize(Data, 2)
Put #intFileNr, , Data
Close #intFileNr
' Read the data back AS STRING
Open strFilePath For Binary Access Read As #intFileNr
strData = String$(LOF(intFileNr), 32)
Get #intFileNr, 1, strData
Close #intFileNr
Call Field.AppendChunk(strData)
As you can see, the data is put in a temporary file, then read back as a VB6 String and written to the database (a field of type dbLongBinary).
I've tried the following:
Doing a BlockCopy
byte[] source = databaseValue as byte[];
float [,] destination = new float[BitConverter.ToInt32(source, 0), BitConverter.ToInt32(source, 4)];
Buffer.BlockCopy(source, 8, destination, 0, 50 * 99 * 4);
The problem here is the VB6 binary to string conversion. The VB6 string char is 2 bytes wide and I don't know how to transform this back to a binary format I can handle.
Below is a dump of the temp file that the VB6 code generates: http://robbertdam.nl/share/dump%20of%20text%20file%20generated%20by%20VB6.png
And here is a dump of the data as I read it from the database (i.e. the VB6 string): http://robbertdam.nl/share/dump%20of%20database%20field.png
One possible way I see is to:
Read the data back as a System.Char[], which is Unicode just like VB BSTRs.
Convert it to an ASCII byte array via Encoding.ASCII.GetBytes(). Effectively this removes all the interleaved 0s.
Copy this ASCII byte array to your final float array.
Something like this:
char[] destinationAsChars = new char[BitConverter.ToInt32(source, 0)* BitConverter.ToInt32(source, 4)];
byte[] asciiBytes = Encoding.ASCII.GetBytes(destinationAsChars);
float[] destination = new float[notSureHowLarge];
Buffer.BlockCopy(asciiBytes, 0, destination, 0, asciiBytes.Length);
Now destination should contain the original floats. CAVEAT: I am not sure whether the internal format of VB6 Singles is binary-compatible with the internal format of System.Single. If not, all bets are off.
This is the solution I derived from the answer above.
Reading the file in as a unicode char[], and then re-encoding to my default system encoding produced readable files.
internal void FixBytes()
{
    // Convert the bytes from VB6-style BSTR to a standard byte[].
    char[] destinationAsChars =
        System.Text.Encoding.Unicode.GetString(File).ToCharArray();
    byte[] asciiBytes = Encoding.Default.GetBytes(destinationAsChars);
    byte[] newFile = new byte[asciiBytes.Length];
    Buffer.BlockCopy(asciiBytes, 0, newFile, 0, asciiBytes.Length);
    File = newFile;
}
As you probably know, that's very bad coding on the VB6 end. What it's trying to do is to cast the Single data -- which is the same as float in C# -- as a String. But while there are better ways to do that, it's a really bad idea to begin with.
The main reason is that reading the binary data into a VB6 BSTR will convert the data from 8-bit bytes to 16-bit characters, using the current code page. So this can produce different results in the DB depending on what locale it's running in. (!)
So when you read it back from the DB, unless you specify the same code page used when writing, you'll get different floats, possibly even invalid ones.
It would help to see examples of data both in binary (single) and DB (string) form, in hex, to verify that this is what's happening.
From a later post:
Actually that is not "bad" VB6 code.
It is, because it takes binary data into the string domain, which violates a prime rule of modern VB coding. It's why the Byte data type exists. If you ignore this, you may well wind up with undecipherable data when a DB you create crosses locale boundaries.
What he is doing is storing the array in a compact binary format and saving it as a "chunk" into the database. There are lots of valid reasons to do this.
Of course he has a valid reason for wanting this (although your definition of 'compact' is different from the conventional one). The ends are fine: the means chosen are not.
To the OP:
You probably can't change what you're given as input data, so the above is mostly academic. If there's still time to change the method used to create the blobs, let us suggest methods that don't involve strings.
In applying any provided solution, do your best to avoid strings, and if you can't, decode them using the specific code page that matches the one that created them.
Can you clarify what the contents of the file are (i.e. an example)? Either as binary (perhaps hex) or characters? If the data is a VB6 string, then you'll have to use float.Parse() to read it. .NET strings are also 2-bytes per character, but when loading from a file you can control this using the Encoding.
Actually that is not "bad" VB6 code. What he is doing is storing the array in a compact binary format and saving it as a "chunk" into the database. There are lots of valid reasons to do this.
The reason for the VB6 code saving it to disk and reading it back is because VB6 doesn't have native support for reading and writing files in memory only. This is the common algorithm if you want to create a chunk of binary data and stuff it somewhere else like a database field.
It is not an issue to deal with this in .NET. The code I have is in VB.NET, so you will have to convert it to C#.
Modified to handle bytes and the Unicode problem.
Public Function DataArrayFromDatabase(ByVal dbData As Byte()) As Single(,)
    Dim bData(UBound(dbData) \ 2) As Byte
    Dim I As Long
    Dim J As Long
    J = 0
    ' Keep the low byte of each 2-byte Unicode character
    For I = 0 To UBound(dbData) Step 2
        bData(J) = dbData(I)
        J = J + 1
    Next I
    Dim sM As New IO.MemoryStream(bData)
    Dim bR As IO.BinaryReader = New IO.BinaryReader(sM)
    Dim Dim1 As Integer = bR.ReadInt32
    Dim Dim2 As Integer = bR.ReadInt32
    Dim newData(Dim1, Dim2) As Single
    For I = 0 To Dim2
        For J = 0 To Dim1
            newData(J, I) = bR.ReadSingle
        Next
    Next
    bR.Close()
    sM.Close()
    Return newData
End Function
The key trick is to read in the data just like if you were in VB6. We have the ability to use MemoryStreams in .NET so this is fairly easy.
First we skip every other byte to eliminate the Unicode padding.
Then we create a MemoryStream from the array of bytes, and a BinaryReader initialized with the MemoryStream.
We read in the first dimension of the array, a VB6 Long or .NET Int32.
We read in the second dimension of the array, a VB6 Long or .NET Int32.
The read loops are constructed in reverse order of the array's dimensions. Dim2 is the outer loop and Dim1 is the inner. The reason for this is that this is how VB6 stores arrays in binary format.
Return newData and you have successfully restored the original array that was created in VB6!
Now you could try to use some math trick. The two dimensions are 4 bytes/characters each and each array element is 4 bytes/characters. But for long-term maintainability I find using byte manipulation with MemoryStreams a lot more explicit. It takes a little more code but is a lot clearer when you revisit it 5 years from now.
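Since the answer notes the code will need converting to C#, here is a rough equivalent of the same idea (it mirrors the VB.NET version above, including its assumption that dropping every other byte recovers the original data; see the caveat in the next answer):
// Rough C# equivalent of the VB.NET function above (needs System.IO).
static float[,] DataArrayFromDatabase(byte[] dbData)
{
    // Keep the low byte of each 2-byte Unicode character.
    byte[] bData = new byte[dbData.Length / 2];
    for (int i = 0, j = 0; i + 1 < dbData.Length; i += 2, j++)
        bData[j] = dbData[i];

    using (var br = new BinaryReader(new MemoryStream(bData)))
    {
        int dim1 = br.ReadInt32();
        int dim2 = br.ReadInt32();
        // VB arrays declared as (Dim1, Dim2) include the upper bounds.
        var newData = new float[dim1 + 1, dim2 + 1];

        // Reversed loop order: VB6 stores the leftmost dimension fastest.
        for (int i = 0; i <= dim2; i++)
            for (int j = 0; j <= dim1; j++)
                newData[j, i] = br.ReadSingle();

        return newData;
    }
}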
First we skip every other byte to eliminate the Unicode padding.
Hmmm... if that were a valid strategy, then every other column in the DB string dump would consist of nothing but zeros. But a quick scan down the first one shows that this isn't the case. In fact there are a lot of non-zero bytes in those columns. Can we afford to just discard them?
What this shows is that the conversion to Unicode caused by the use of Strings does not simply add 'padding', but changes the character of the data. What you call padding is a coincidence of the fact that the ASCII range (00-7F binary) is mapped onto the same Unicode range. But this is not true of binary 80-FF.
Take a look at the first stored value, which has an original byte value of 94 9A 27 3A. When converted to Unicode, these DO NOT become 94 00 9A 00 27 00 3A 00. They become 1D 20 61 01 27 00 3A 00.
Discarding every other byte gives you 1D 61 27 3A -- not the original 94 9A 27 3A.
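Those values are exactly what an ANSI-to-Unicode conversion under a specific code page produces: in Windows-1252, byte 0x94 maps to U+201D (1D 20 in little-endian UTF-16) and 0x9A maps to U+0161 (61 01). So if the string form cannot be avoided, the only safe way back is to re-encode with the same code page that did the original conversion; a sketch, assuming Windows-1252 (a guess that has to match the writing machine's ANSI code page):
// dbString is the Unicode string read back from the dbLongBinary field.
// Code page 1252 is an assumption; it must match the ANSI code page of the
// machine that wrote the data, or the bytes will come back wrong.
Encoding ansi = Encoding.GetEncoding(1252);
byte[] originalBytes = ansi.GetBytes(dbString);   // e.g. 94 9A 27 3A ... again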