System.IO.BinaryWriter outfile;
System.IO.FileStream fs = new System.IO.FileStream(some_object.text, System.IO.FileMode.Create);
outfile = new System.IO.BinaryWriter(fs);
outfile.Write('A'); // Line 1
outfile.Write('B'); // Line 2
outfile.Write('C'); // Line 3
outfile.Write(Convert.ToUInt16(some_object.text, 16)); // Line 4
outfile.Write((ushort)0); // Line 5
Here I declare a BinaryWriter for creating my output file.
What I need to know is exactly how the file is being written.
Do Lines 1, 2 and 3 write the file byte by byte, i.e. 1 byte at a time? Am I correct?
This some_object.text holds the value 2000.
How many bytes does Line 4 write exactly? (2 bytes / 16 bits, since a UInt16 is 16 bits?)
Take a look at the chart from MSDN to see how many bytes are written:
BinaryWriter.Write Method
The BinaryWriter uses the BitConverter class to create sequences of bytes that are written to the underlying stream. A great way to understand what is going on, at the lowest level, is to use .NET Reflector. It can decompile assemblies and easily be used to figure out framework implementation details.
Most of the binary write methods write the value's native representation, which is little endian on Windows (byte order is architecture specific and varies between platforms such as the XBOX and Windows). The only exception to this is strings: strings are by default encoded using UTF-8.
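To see the sizes for yourself, here is a minimal sketch (it writes to a MemoryStream rather than a file purely for the demo; the byte counts are the same for a FileStream):

using System;
using System.IO;

class WriteSizesDemo
{
    static void Main()
    {
        using (var ms = new MemoryStream())
        {
            var writer = new BinaryWriter(ms); // default encoding is UTF-8
            writer.Write('A');                 // 1 byte ('A' encodes to a single UTF-8 byte)
            writer.Write('B');                 // 1 byte
            writer.Write('C');                 // 1 byte
            writer.Write((ushort)2000);        // 2 bytes, little-endian: D0 07
            writer.Write((ushort)0);           // 2 bytes: 00 00
            writer.Flush();

            Console.WriteLine(ms.Length);                           // 7
            Console.WriteLine(BitConverter.ToString(ms.ToArray())); // 41-42-43-D0-07-00-00
        }
    }
}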
Alright, so I basically want to read any file with a specific extension. Going through all the bytes and reading the file is easy, but how do I get the type of the next byte? For example:
while ((int)reader.BaseStream.Position != RecordSize * RecordsCount)
{
    // How do I check what type the next byte is going to be?
    // Example:
    // In every file, the first byte is always a uint:
    uint id = reader.ReadUInt32();
    // However, now I need to check the next byte's type:
    // How do I check the next byte's type?
}
Bytes don't have a type. When data of some language type, such as a char or string or long, is converted to bytes and written to a file, there is no strict way to tell what the type was: all bytes look alike, just numbers from 0-255.
In order to know, and to convert back from bytes to structured language types, you need to know the format that the file was written in.
For example, you might know that the file was written as an ascii text file, and hence every byte represents one ascii character.
Or you might know that your file was written with the format {uint}{50 byte string}{linefeed}, where the first 2 bytes represent a uint, the next 50 a string, followed by a linefeed.
Because all bytes look the same, if you don't know the file format you can't read the file in a semantically correct way. For example, I might send you a file I created by writing out some ascii text, but I might tell you that the file is full of 2-byte uints. You would write a program to read those bytes as 2-byte uints and it would work: any 2 bytes can be interpreted as a uint. I could tell someone else that the same file was composed of 4-byte longs, and they could read it as 4-byte longs: any 4 bytes can be interpreted as a long. I could tell someone else the file was a 2 byte uint followed by 6 ascii characters. And so on.
Many types of files will have a defined format: for example, a Windows executable, or a Linux ELF binary.
You might be able to guess the types of the bytes in the file if you know something about the reason the file exists. But somehow you have to know, and then you interpret those bytes according to the file format description.
You might think "I'll write the bytes with a token describing them, so the reading program can know what each byte means". For example, a byte with a '1' might mean the next 2 bytes represent a uint, a byte with a '2' might mean the following byte tells the length of a string, and the bytes after that are the string, and so on. Sure, you can do that. But (a) the reading program still needs to understand that convention, so everything I said above is true (it's turtles all the way down), (b) that approach uses a lot of space to describe the file, and (c) The reading program needs to know how to interpret a dynamically described file, which is only useful in certain circumstances and probably means there is a meta-meta format describing what the embedded meta-format means.
Long story short, all bytes look the same, and a reading program has to be told what those bytes represent before it can use them meaningfully.
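As a concrete sketch of what "knowing the format" means in code, here is a hypothetical reader for the {uint}{50 byte string}{linefeed} layout described above (assuming, as in that description, a 2-byte uint and an ASCII string field; the file name is made up):

using System;
using System.IO;
using System.Text;

class RecordReaderDemo
{
    static void Main()
    {
        // Nothing in the bytes says what they mean; this code simply assumes
        // each record is {2-byte uint}{50-byte ASCII string}{1-byte linefeed}.
        using (var reader = new BinaryReader(File.OpenRead("records.dat")))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                ushort id = reader.ReadUInt16();           // the 2-byte "uint"
                byte[] text = reader.ReadBytes(50);        // the 50-byte string field
                string name = Encoding.ASCII.GetString(text).TrimEnd('\0', ' ');
                reader.ReadByte();                         // the trailing linefeed
                Console.WriteLine("{0}: {1}", id, name);
            }
        }
    }
}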
I've been reading about this topic and didn't get the specific info for my question:
(maybe the following is incorrect - but please do correct me)
Every file (text/binary) stores BYTES.
A byte is 8 bits, hence the max value is 2^8 - 1 = 255, i.e. 256 possible codes.
Those 256 codes divide into 2 groups:
0..127: textual chars
128..255: special chars.
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
1) Correct?
2) Now, let's say I'm saving one INT in a binary file (4 bytes on a 32-bit system).
How does the file tell the program that reads it: these are not 4 single unrelated bytes but an int which is 4 bytes?
Underneath, all files are stored as bytes, so in a sense what you're saying is correct. However, if you open a file that's intended to be read as binary and try to read it in a text editor, it will look like gibberish.
How does a program know whether to read a file as text or as binary (i.e. as ASCII or otherwise encoded bytes, or just as the underlying bytes with a different representation)?
Well, it doesn't know - it just does what it's told.
In Windows, you open .txt files in notepad - notepad expects to be reading text. Try opening a binary file in notepad. It will open, you will see stuff, but it will be rubbish.
If you're writing your own program you can write using BinaryWriter and read using BinaryReader if you want to store everything as binary. What would happen if you wrote using BinaryWriter and read using StreamReader?
To answer your specific example:
using (var test = new BinaryWriter(new FileStream(@"c:\test.bin", FileMode.Create)))
{
    test.Write(10);
    test.Write("hello world");
}
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadInt32();
    var out2 = test.ReadString();
    Console.WriteLine("{0} {1}", out1, out2);
}
See how you have to read in the same order that's written? The file doesn't tell you anything.
Now switch the second part around:
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadString();
    var out2 = test.ReadInt32();
    Console.WriteLine("{0} {1}", out1, out2);
}
You'll get gibberish out (if it works at all). Yet there is nothing you can read in the file that will tell you that beforehand. There is no special information there. The program must know what to do based on some out of band information (a specification of some sort).
so a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
No, a binary file just contains bytes - values between 0 and 255. They should only be considered as characters at all if you decide to ascribe that meaning to them. If it's a binary file (e.g. a JPEG) then you shouldn't do that - a byte 65 in image data isn't logically an 'A' - it's whatever byte 65 means at that point in the file.
(Note that even text files aren't divided into "ASCII characters" and "special characters" - it depends on the encoding. In UTF-16, each code unit takes two bytes regardless of its value. In UTF-8 the number of bytes depends on the character you're trying to represent.)
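A quick sketch of that last point, using Encoding.GetByteCount:

using System;
using System.Text;

class EncodingSizeDemo
{
    static void Main()
    {
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));     // 1
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"));     // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));     // 3
        Console.WriteLine(Encoding.Unicode.GetByteCount("A"));  // 2 (one UTF-16 code unit)
        Console.WriteLine(Encoding.Unicode.GetByteCount("€"));  // 2
    }
}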
how does the file tell the program that reads it: these are not 4 single unrelated bytes but an int which is 4 bytes?
The file doesn't tell the program. The program has to know how to read the file. If you ask Notepad to open a JPEG file, it won't show you an image - it will show you gibberish. Likewise if you try to force an image viewer to open a text file as if it were a JPEG, it will complain that it's broken.
Programs reading data need to understand the structure of the data they're going to read - they have to know what to expect. In some cases the format is quite flexible, like XML: there are well-specified layers, but then the program reads the values with higher-level meaning - elements, attributes etc. In other cases, the format is absolutely precise: first you'll start with a 4 byte integer, then two 2-byte integers or whatever. It depends on the format.
EDIT: To answer your specific (repeated) comment:
I'm in a cmd shell... you've written your binary file. I have no clue what you did there. How am I supposed to know whether to read 4 single bytes or 4 bytes at once?
Either the program reading the data needs to know the meaning of the data or it doesn't. If it's just copying the file from one place to another, it doesn't need to know the meaning of the data. It doesn't matter whether it copies it one byte at a time or all four bytes at once.
If it does need to know the meaning of the data, then just knowing that it's a four byte integer doesn't really help much - it would need to know what that integer meant to do anything useful with it. So your file written from the command shell... what does it mean? If I don't know what it means, what does it matter whether I know to read one byte at a time or four bytes as an integer?
(As I mentioned above, there's an intermediate option where code can understand structure without meaning, and expose that structure to other code which then imposes meaning - XML is a classic example of that.)
It's all a matter of interpretation. Neither the file nor the system know what's going on in your file, they just see your storage as a sequence of bytes that has absolutely no meaning in itself. The same thing happens in your brain when you read a word (you attempt to choose a language to interpret it in, to give the sequence of characters a meaning).
It is the responsibility of your program to interpret the data the way you want it, as there is no single valid interpretation. For example, the sequence of bytes 48 65 6C 6C 6F 20 53 6F 6F 68 6A 75 6E can be interpreted as:
A string (Hello Soohjun)
A sequence of 13 one-byte characters (H, e, l, l, o, , S, o, o, h, j, u, n)
A sequence of 3 unsigned ints followed by a character (1214606444, 1864389487, 1869113973, 110)
A character followed by a float followed by an unsigned int followed by a float (72, 6.977992E22, 542338927, 4.4287998E24), and so on...
You are the one choosing the meaning of those bytes; another program would make a different interpretation of the very same data, much the same as a combination of letters has different interpretations in, say, English and French.
PS: By the way, that's the goal of reverse engineering file formats: find the meaning of each byte.
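To make that concrete, here is a small sketch applying two of those interpretations to the same bytes (note that the unsigned-int values listed above read each 4-byte group big-endian, so the sketch reverses each group before handing it to BitConverter on a little-endian machine):

using System;
using System.Linq;
using System.Text;

class InterpretationDemo
{
    static void Main()
    {
        byte[] data = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20,
                        0x53, 0x6F, 0x6F, 0x68, 0x6A, 0x75, 0x6E };

        // Interpretation 1: an ASCII string.
        Console.WriteLine(Encoding.ASCII.GetString(data));  // Hello Soohjun

        // Interpretation 2: three unsigned ints (big-endian) followed by one byte.
        for (int i = 0; i < 12; i += 4)
        {
            byte[] group = data.Skip(i).Take(4).Reverse().ToArray();
            Console.WriteLine(BitConverter.ToUInt32(group, 0));  // 1214606444, 1864389487, 1869113973
        }
        Console.WriteLine(data[12]);  // 110, i.e. 'n'
    }
}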
I've been a programmer for a few years now, but I've never had to understand low-level operations involving bytes. It interests me however, and I would like to understand more about working with bytes.
In the below code I'm reading a text file that contains only the words "hi there".
FileStream fileStream = new FileStream(@"C:\myfile.txt", FileMode.Open);
byte[] mybyte = new byte[fileStream.Length];
fileStream.Read(mybyte, 0, (int)fileStream.Length);
foreach (byte b in mybyte)
    Console.Write(b);
Console.ReadLine();
In this case, the mybyte variable contains numeric values that appear to be the ASCII decimal codes. However, I thought bytes represent bits, which in turn represent binary values. When reading a byte I would expect to see a binary value like '01101000', not '104', which is the ASCII code for 'h'.
In the case of reading an image, when reading the image into a byte array I once again see numbers in the array, and from a low-level perspective I would expect binary values. I know that these numbers obviously don't map to ASCII, but I'm confused why, when reading a string, they map to ASCII codes, while when reading an image stream they represent something else (I'm not actually sure what the numbers represent in the case of reading an image).
I know understanding what the numbers mean in a byte array isn't critical, but it greatly interests me.
Could someone please shed some light on bytes in the .NET framework when reading from a text file and when reading binary (i.e. an image)? Thank you.
(Image: the byte array holding the text "hi there" read from myfile.txt)
(Image: a byte array holding an image stream)
01101000 is the 8-bit representation of the value 104. Since a C# byte stores 8 bits (0-255), it is shown to you as something more readable. Open up the Windows calculator and change the view to "Programmer", then set it to "Bin". That might clear things up a bit.
It is not showing you a decimal number; it is showing you a C# byte, a number from 0 to 255.
A byte is literally an 8-bit integer that is represented there as an integer from 0 to 255 - in other words, in decimal notation. You were expecting it to be represented in binary notation, but it would mean the same thing. As best I can tell, that's just how Visual Studio represents it in this case, but someone may be able to shed more light on the details.
An image file is just a sequential set of bytes, again, all represented here as decimal numbers.
Hope that helps.
A byte consists of 8 bits. Those can be written in different ways, for example as a decimal value (104), as a binary value (1101000) or as a hexadecimal value (68). They all mean exactly the same thing; they are just different representations of the same value.
This has nothing to do with ASCII characters. They just happen to fit in a byte, too (7 bits, to be precise).
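For example, in C# you can print the same byte in all three notations:

using System;

class RepresentationDemo
{
    static void Main()
    {
        byte b = 104;
        Console.WriteLine(b);                       // 104     (decimal)
        Console.WriteLine(Convert.ToString(b, 2));  // 1101000 (binary)
        Console.WriteLine(b.ToString("X"));         // 68      (hexadecimal)
    }
}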
Of course, everything at a low level is stored as a collection of binary values. What you are seeing in the debugger is its decimal representation. Binary values don't mean anything unless we interpret them, and the same goes for the decimal numbers you're seeing in the debugger in both cases (string and image).
For example, when you read a byte from a FileStream and then parse it with an encoding, like:
FileStream fs = new FileStream(@"<Filename>", FileMode.Open, FileAccess.Read, FileShare.Read);
byte[] bt = new byte[8];
fs.Read(bt, 0, 1);
string str = System.Text.Encoding.ASCII.GetString(bt, 0, 1);
you will get an ASCII character, even if you're reading from an image file. If you pass the same image file to the Image class, like
Bitmap bmp = (Bitmap)Image.FromFile(@"<Filename>");
and assign this bmp to a picture box, you will see an image.
Summary:
Your interpreters give the meaning to your 0's and 1's or your decimal numbers. By themselves they don't mean anything.
I'm looking at the C# library called BitStream, which allows you to write and read any number of bits to a standard C# Stream object. I noticed what seemed to me a strange design decision:
When adding bits to an empty byte, the bits are added to the MSB of the byte. For example:
var s = new BitStream();
s.Write(true);
Debug.Assert(s.ToByteArray()[0] == 0x80); // and not 0x01
var s = new BitStream();
s.Write(0x7,0,4);
s.Write(0x3,0,4);
Debug.Assert(s.ToByteArray()[0] == 0x73); // and not 0x37
However, when referencing bits in a number as the input, the first bit of the input number is the LSB. For example
//s.Write(int input,int bit_offset, int count_bits)
//when referencing the LSB and the next bit we'll write
s.Write(data,0,2); //and not s.Write(data,data_bits_number,data_bits_number-2)
It seems inconsistent to me, since in this case, when "gradually" copying a byte as in the previous example (the first four bits, and then the last four bits), we do not get the original byte back. We need to copy it "backwards" (first the last four bits, then the first four bits).
Is there a reason for that design that I'm missing? Any other implementation of bits stream with this behaviour? What are the design considerations for that?
It seems that ffmpeg's bitstream writer behaves in a way I consider consistent. Look at how far it shifts the byte before ORing it with the src pointer in the put_bits function.
As a side note:
The first byte added is the first byte in the byte array. For example:
var s = new BitStream();
s.Write(0x1,0,4);
s.Write(0x2,0,4);
s.Write(0x3,0,4);
Debug.Assert(s.ToByteArray()[0] == 0x12); // and not s.ToByteArray()[1] == 0x12
Here are some additional considerations:
In the case of the boolean, only one bit is required to represent true or false. When that bit gets added to the beginning of the stream, the bit stream is "1". When you extend that stream to byte length it forces the padding of zero bits to the end of the stream, even though those bits did not exist in the stream to begin with. Position in the stream is important information, just like the values of the bits, and a bit stream of "10000000" (0x80) safeguards the expectation that subsequent readers of the stream may have that the first bit they read is the first bit that was added.
Second, other data types like integers require more bits to represent so they are going to take up more room in the stream than booleans. Mixing different size data types in the same stream can be very tricky when they aren't aligned on byte boundaries.
Finally, if you are on Intel x86 your CPU architecture is "little-endian" which means LSB first like you are describing. If you need to store values in the stream as big-endian you'll need to add a conversion layer in your code - similar to what you've shown above where you push one byte at a time into the stream in the order you want. This is annoying, but commonly required if you need to interop with big-endian Unix boxes or as may be required by a protocol specification.
Hope that helps!
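To make the MSB-first packing concrete, here is a minimal, self-contained sketch of a writer with that behaviour (an independent illustration only - it is not the BitStream library's actual implementation):

using System;
using System.Collections.Generic;

class MsbFirstBitWriter
{
    private readonly List<byte> _bytes = new List<byte>();
    private int _bitPos; // bits already used in the current (last) byte, 0..7

    // Appends the lowest 'count' bits of 'value', most significant of those bits first.
    public void Write(int value, int count)
    {
        for (int i = count - 1; i >= 0; i--)
        {
            if (_bitPos == 0)
                _bytes.Add(0);
            int bit = (value >> i) & 1;
            // Fill each byte from its most significant bit downwards.
            _bytes[_bytes.Count - 1] |= (byte)(bit << (7 - _bitPos));
            _bitPos = (_bitPos + 1) % 8;
        }
    }

    public byte[] ToByteArray() { return _bytes.ToArray(); }
}

class BitPackingDemo
{
    static void Main()
    {
        var w = new MsbFirstBitWriter();
        w.Write(0x7, 4);
        w.Write(0x3, 4);
        Console.WriteLine(w.ToByteArray()[0].ToString("X2")); // 73, matching the question's assertion
    }
}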
Is there a reason for that design that I'm missing? Any other implementation of bits stream with this behaviour? What are the design considerations for that?
I doubt there was any significant meaning behind the decision. Technically it just does not matter, as long as the writer and reader agree on the ordering.
I agree with Elazar.
As he/she points out, this is a case where the reader and writer do NOT agree on the bit ordering. In fact, they're incompatible.
I'm writing a C# application that reads data from an SQL database generated by VB6 code. The data is an array of Singles. I'm trying to convert them to a float[]
Below is the VB6 code that wrote the data in the database (cannot change this code):
Set fso = New FileSystemObject
strFilePath = "c:\temp\temp.tmp"
' Output the data to a temporary file
intFileNr = FreeFile
Open strFilePath For Binary Access Write As #intFileNr
Put #intFileNr, , GetSize(Data, 1)
Put #intFileNr, , GetSize(Data, 2)
Put #intFileNr, , Data
Close #intFileNr
' Read the data back AS STRING
Open strFilePath For Binary Access Read As #intFileNr
strData = String$(LOF(intFileNr), 32)
Get #intFileNr, 1, strData
Close #intFileNr
Call Field.AppendChunk(strData)
As you can see, the data is put in a temporary file, then read back as a VB6 String and written to the database (a field of type dbLongBinary).
I've tried the following:
Doing a BlockCopy
byte[] source = databaseValue as byte[];
float [,] destination = new float[BitConverter.ToInt32(source, 0), BitConverter.ToInt32(source, 4)];
Buffer.BlockCopy(source, 8, destination, 0, 50 * 99 * 4);
The problem here is the VB6 binary to string conversion. The VB6 string char is 2 bytes wide and I don't know how to transform this back to a binary format I can handle.
Below is a dump of the temp file that the VB6 code generates:
(Image: http://robbertdam.nl/share/dump%20of%20text%20file%20generated%20by%20VB6.png)
And here is the dump of the data as I read it from the database (i.e. the VB6 string):
(Image: http://robbertdam.nl/share/dump%20of%20database%20field.png)
One possible way I see is to:
Read the data back as a System.Char[], which is Unicode just like VB BSTRs.
Convert it to an ASCII byte array via Encoding.ASCII.GetBytes(). Effectively this removes all the interleaved 0s.
Copy this ASCII byte array to your final float array.
Something like this:
char[] destinationAsChars = new char[BitConverter.ToInt32(source, 0) * BitConverter.ToInt32(source, 4)];
byte[] asciiBytes = Encoding.ASCII.GetBytes(destinationAsChars);
float[] destination = new float[notSureHowLarge];
Buffer.BlockCopy(asciiBytes, 0, destination, 0, asciiBytes.Length);
Now destination should contain the original floats. CAVEAT: I am not sure whether the internal format of VB6 Singles is binary-compatible with the internal format of System.Single. If not, all bets are off.
This is the solution I derived from the answer above.
Reading the file in as a unicode char[], and then re-encoding to my default system encoding produced readable files.
internal void FixBytes()
{
    // Convert the bytes from VB6 style BSTR to standard byte[].
    char[] destinationAsChars =
        System.Text.Encoding.Unicode.GetString(File).ToCharArray();
    byte[] asciiBytes = Encoding.Default.GetBytes(destinationAsChars);
    byte[] newFile = new byte[asciiBytes.Length];
    Buffer.BlockCopy(asciiBytes, 0, newFile, 0, asciiBytes.Length);
    File = newFile;
}
As you probably know, that's very bad coding on the VB6 end. What it's trying to do is to cast the Single data -- which is the same as float in C# -- as a String. But while there are better ways to do that, it's a really bad idea to begin with.
The main reason is that reading the binary data into a VB6 BSTR will convert the data from 8-bit bytes to 16-bit characters, using the current code page. So this can produce different results in the DB depending on what locale it's running in. (!)
So when you read it back from the DB, unless you specify the same code page used when writing, you'll get different floats, possibly even invalid ones.
It would help to see examples of data both in binary (single) and DB (string) form, in hex, to verify that this is what's happening.
From a later post:
Actually that is not "bad" VB6 code.
It is, because it takes binary data into the string domain, which violates a prime rule of modern VB coding. It's why the Byte data type exists. If you ignore this, you may well wind up with undecipherable data when a DB you create crosses locale boundaries.
What he is doing is storing the array in a compact binary format and saving it as a "chunk" into the database. There are lots of valid reasons to do this.
Of course he has a valid reason for wanting this (although your definition of 'compact' is different from the conventional one). The ends are fine: the means chosen are not.
To the OP:
You probably can't change what you're given as input data, so the above is mostly academic. If there's still time to change the method used to create the blobs, let us suggest methods that don't involve strings.
In applying any provided solution, do your best to avoid strings, and if you can't, decode them using the specific code page that matches the one that created them.
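As a hedged sketch of that last point, assuming the writing machine ran under the Windows-1252 ANSI code page (verify the real locale before relying on this):

using System;
using System.Text;

class BlobDecoder
{
    // Undo the byte -> String round trip, assuming Windows-1252 was the code page
    // in use when the blob was created (an assumption; substitute the real one).
    static byte[] RecoverOriginalBytes(byte[] dbBlob)
    {
        // On .NET Core / .NET 5+ the code-pages provider must be registered first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The blob stored in the database is the UTF-16 form of the VB6 String.
        string asVb6String = Encoding.Unicode.GetString(dbBlob);

        // Map each character back to the single ANSI byte it was decoded from.
        return Encoding.GetEncoding(1252).GetBytes(asVb6String);
    }
}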
Can you clarify what the contents of the file are (i.e. an example), either as binary (perhaps hex) or characters? If the data is a VB6 string, then you'll have to use float.Parse() to read it. .NET strings are also 2 bytes per character, but when loading from a file you can control this using the Encoding.
Actually that is not "bad" VB6 code. What he is doing is storing the array in a compact binary format and saving it as a "chunk" into the database. There are lots of valid reasons to do this.
The reason for the VB6 code saving it to disk and reading it back is because VB6 doesn't have native support for reading and writing files in memory only. This is the common algorithm if you want to create a chunk of binary data and stuff it somewhere else like a database field.
It is not an issue to deal with this in .NET. The code I have is in VB.NET, so you will have to convert it to C#.
Modified to handle bytes and the unicode problem.
Public Function DataArrayFromDatabase(ByVal dbData As Byte()) As Single(,)
    Dim bData(UBound(dbData) \ 2) As Byte
    Dim I As Long
    Dim J As Long
    J = 0
    For I = 0 To UBound(dbData) Step 2
        bData(J) = dbData(I)
        J = J + 1
    Next I

    Dim sM As New IO.MemoryStream(bData)
    Dim bR As IO.BinaryReader = New IO.BinaryReader(sM)

    Dim Dim1 As Integer = bR.ReadInt32
    Dim Dim2 As Integer = bR.ReadInt32
    Dim newData(Dim1, Dim2) As Single

    For I = 0 To Dim2
        For J = 0 To Dim1
            newData(J, I) = bR.ReadSingle
        Next
    Next

    bR.Close()
    sM.Close()
    Return newData
End Function
The key trick is to read in the data just like if you were in VB6. We have the ability to use MemoryStreams in .NET so this is fairly easy.
First we skip every other byte to eliminate the Unicode padding.
Then we create a memorystream from the array of bytes. Then a BinaryReader initialized with the MemoryStream.
We read in the first dimension of the array, a VB6 Long (a .NET Int32).
We read in the second dimension of the array, another VB6 Long (.NET Int32).
The read loops are constructed in reverse order of the array's dimensions: Dim2 is the outer loop and Dim1 is the inner. The reason for this is that this is how VB6 stores arrays in binary format.
Return newData and you have successfully restored the original array that was created in VB6!
Now you could try to use some math trick. The two dimensions are 4 bytes/characters each, and each array element is 4 bytes/characters. But for long-term maintainability I find byte manipulation with MemoryStreams a lot more explicit. It takes a little more code but is a lot clearer when you revisit it 5 years from now.
First we skip every other byte to eliminate the Unicode padding.
Hmmm... if that were a valid strategy, then every other column in the DB string dump would consist of nothing but zeros. But a quick scan down the first one shows that this isn't the case. In fact there are a lot of non-zero bytes in those columns. Can we afford to just discard them?
What this shows is that the conversion to Unicode caused by the use of Strings does not simply add 'padding', but changes the character of the data. What you call padding is a coincidence of the fact that the ASCII range (00-7F binary) is mapped onto the same Unicode range. But this is not true of binary 80-FF.
Take a look at the first stored value, which has an original byte value of 94 9A 27 3A. When converted to Unicode, these DO NOT become 94 00 9A 00 27 00 3A 00. They become 1D 20 61 01 27 00 3A 00.
Discarding every other byte gives you 1D 61 27 3A -- not the original 94 9A 27 3A.
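A small sketch demonstrates the point, assuming the Windows-1252 code page (which is consistent with the bytes quoted above):

using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        byte[] original = { 0x94, 0x9A, 0x27, 0x3A };

        // What the VB6 byte -> String read does: decode with the ANSI code page.
        // Under Windows-1252, 0x94 becomes U+201D and 0x9A becomes U+0161.
        string asString = Encoding.GetEncoding(1252).GetString(original);

        // What ends up in the dbLongBinary field: the UTF-16 bytes of that string.
        byte[] stored = Encoding.Unicode.GetBytes(asString);

        Console.WriteLine(BitConverter.ToString(stored)); // 1D-20-61-01-27-00-3A-00
    }
}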