I've been reading about this topic and didn't find the specific info for my question:
(maybe the following is incorrect - but please do correct me)
Every file (text/binary) stores BYTES.
A byte is 8 bits, hence the max value is 2^8 - 1 = 255, giving 256 possible codes.
Those 256 codes divide into 2 groups:
0..127: textual chars
128..255: special chars.
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
1) Correct?
2) Now, let's say I'm saving one INT in a binary file (4 bytes on a 32-bit system).
How does the file tell the program reading it that these are not 4 single unrelated bytes, but one int which is 4 bytes?
Underlying all files are being stored as bytes, so in a sense what you're saying is correct. However, if you open a file that's intended to be read as binary and try to read it in a text editor, it will look like gibberish.
How does a program know whether to read a file as text or as binary? (ie as special sets of ASCII or other encoded bytes, or just as the underlying bytes with a different representation)?
Well, it doesn't know - it just does what it's told.
In Windows, you open .txt files in notepad - notepad expects to be reading text. Try opening a binary file in notepad. It will open, you will see stuff, but it will be rubbish.
If you're writing your own program you can write using BinaryWriter and read using BinaryReader if you want to store everything as binary. What would happen if you wrote using BinaryWriter and read using StreamReader?
To answer your specific example:
using (var test = new BinaryWriter(new FileStream(@"c:\test.bin", FileMode.Create)))
{
test.Write(10);
test.Write("hello world");
}
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
var out1 = test.ReadInt32();
var out2 = test.ReadString();
Console.WriteLine("{0} {1}", out1, out2);
}
See how you have to read in the same order that's written? The file doesn't tell you anything.
Now switch the second part around:
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
var out1 = test.ReadString();
var out2 = test.ReadInt32();
Console.WriteLine("{0} {1}", out1, out2);
}
You'll get gibberish out (if it works at all). Yet there is nothing you can read in the file that will tell you that beforehand. There is no special information there. The program must know what to do based on some out of band information (a specification of some sort).
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
No, a binary file just contains bytes. Values between 0 and 255. They should only be considered as character at all if you decide to ascribe that meaning to them. If it's a binary file (e.g. a JPEG) then you shouldn't do that - a byte 65 in image data isn't logically an 'A' - it's whatever byte 65 means at that point in the file.
(Note that even text files aren't divided into "ASCII characters" and "special characters" - it depends on the encoding. In UTF-16, each code unit takes two bytes regardless of its value. In UTF-8 the number of bytes depends on the character you're trying to represent.)
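The encoding point is easy to demonstrate concretely. Here is a small Python sketch (Python just as a neutral way to count bytes; .NET's Encoding classes behave the same way):

```python
# The same text occupies different numbers of bytes depending on the encoding.
text = "Hi€"  # 'H' and 'i' are in the ASCII range; '€' is U+20AC

# UTF-8: 1 byte each for 'H' and 'i', but 3 bytes for '€'
print(len(text.encode("utf-8")))      # 5

# UTF-16 (little-endian, no BOM): 2 bytes per code unit, whatever the value
print(len(text.encode("utf-16-le")))  # 6
```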
How does the file tell the program reading it that these are not 4 single unrelated bytes, but one int which is 4 bytes?
The file doesn't tell the program. The program has to know how to read the file. If you ask Notepad to open a JPEG file, it won't show you an image - it will show you gibberish. Likewise if you try to force an image viewer to open a text file as if it were a JPEG, it will complain that it's broken.
Programs reading data need to understand the structure of the data they're going to read - they have to know what to expect. In some cases the format is quite flexible, like XML: there are well-specified layers, but then the program reads the values with higher-level meaning - elements, attributes etc. In other cases, the format is absolutely precise: first you'll start with a 4 byte integer, then two 2-byte integers or whatever. It depends on the format.
EDIT: To answer your specific (repeated) comment:
I'm in a cmd shell... you've written your binary file. I have no clue what you did there. How am I supposed to know whether to read 4 single bytes or 4 bytes at once?
Either the program reading the data needs to know the meaning of the data or it doesn't. If it's just copying the file from one place to another, it doesn't need to know the meaning of the data. It doesn't matter whether it copies it one byte at a time or all four bytes at once.
If it does need to know the meaning of the data, then just knowing that it's a four byte integer doesn't really help much - it would need to know what that integer meant to do anything useful with it. So your file written from the command shell... what does it mean? If I don't know what it means, what does it matter whether I know to read one byte at a time or four bytes as an integer?
(As I mentioned above, there's an intermediate option where code can understand structure without meaning, and expose that structure to other code which then imposes meaning - XML is a classic example of that.)
It's all a matter of interpretation. Neither the file nor the system knows what's going on in your file; they just see your storage as a sequence of bytes that has absolutely no meaning in itself. The same thing happens in your brain when you read a word (you attempt to choose a language to interpret it in, to give the sequence of characters a meaning).
It is the responsibility of your program to interpret the data the way you want it, as there is no single valid interpretation. For example, the sequence of bytes 48 65 6C 6C 6F 20 53 6F 6F 68 6A 75 6E can be interpreted as:
A string (Hello Soohjun)
A sequence of 13 one-byte characters (H, e, l, l, o, space, S, o, o, h, j, u, n)
A sequence of 3 unsigned ints followed by a character (1214606444, 1864389487, 1869113973, 110)
A character followed by a float followed by an unsigned int followed by a float (72, 6.977992E22, 542338927, 4.4287998E24), and so on...
You are the one choosing the meaning of those bytes; another program would make a different interpretation of the very same data, much the same as a combination of letters has different interpretations in, say, English and French.
PS: By the way, that's the goal of reverse engineering file formats: find the meaning of each byte.
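Those interpretations can be checked mechanically. A Python sketch using the struct module (the big-endian ">" format is what matches the numbers listed above):

```python
import struct

data = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x53,
              0x6F, 0x6F, 0x68, 0x6A, 0x75, 0x6E])

# Interpretation 1: a 13-character ASCII string
print(data.decode("ascii"))      # Hello Soohjun

# Interpretation 2: three big-endian unsigned ints, then one byte
ints = struct.unpack(">III", data[:12])
print(ints, data[12])            # (1214606444, 1864389487, 1869113973) 110

# Interpretation 3: one byte, then a big-endian float, and so on...
print(data[0], struct.unpack(">f", data[1:5])[0])
```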
Related
I can easily convert a character string into a Huffman tree, then encode it into a binary sequence.
How should I save these to be able to actually compress the original data and then recover it back?
I searched the web but I could only find guides and answers covering what I've already done. How can I use the Huffman algorithm further to actually achieve lossless compression?
I am using C# for this project.
EDIT: I've achieved these so far, might need rethinking.
I am attempting to compress a text file. I use the Huffman algorithm, but there are some key points I couldn't figure out:
"aaaabbbccdef" when compressed gives this encoding
Key = a, Value = 11
Key = b, Value = 01
Key = c, Value = 101
Key = d, Value = 000
Key = e, Value = 001
Key = f, Value = 100
11111111010101101101000001100 is the encoded version. It normally needs 12*8 = 96 bits but we've compressed it to 29 bits. This example might be a little unnecessary for a file this small, but let me explain what I tried to do.
We have 29 bits here but we need 8*n bits, so I pad the encoded string with zeros until its length becomes a multiple of eight. Since I can add 1 to 7 zeros, one byte is more than enough to record this count. In this case I've added 3 zeros:
11111111010101101101000001100000
Then I prepend, as a binary byte, how many extra bits I've added, and split it into 8-bit pieces:
00000011-11111111-01010110-11010000-01100000
Turn these into ASCII characters
ÿVÐ`
Now if I have the encoding table, I can look at the first 8 bits, convert that to an integer ignoreBits, and by ignoring the last ignoreBits bits turn it back into the original form.
The problem is I also want to include an uncompressed version of the encoding table in this file, to have a fully functional ZIP/UNZIP program, but I'm having trouble deciding where my ignoreBits ends and where my encodingTable and the encoded bits start/end.
I thought about using a null character as a separator, but there is no assurance that the Values cannot produce a null byte: "ddd" in this situation produces 00000000-0...
Your representation of the code needs to be self-terminating. Then you know the next bit is the start of the Huffman codes. One way is to traverse the tree that resulted from the Huffman code, writing a 0 bit for each branch, or a 1 bit followed by the symbol for each leaf. When the traversal is done, you know the next bit must be the start of the codes.
You also need to make your data self-terminating. Note that in the example you give, the added three zero bits will be decoded as another 'd', so you will incorrectly get "aaaabbbccdefd" as the result. You need to either precede the encoded data with a count of the symbols expected, or add a symbol to your encoded set, with frequency 1, that marks the end of the data.
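The self-terminating tree representation can be sketched like this (Python for brevity; the names are illustrative, and the tree below is just one tree consistent with the example codes):

```python
# Preorder traversal: write 0 for a branch, 1 + the 8-bit symbol for a leaf.

def write_tree(node, bits):
    if isinstance(node, str):                # leaf: 1, then the symbol's bits
        bits.append(1)
        bits.extend((ord(node) >> i) & 1 for i in range(7, -1, -1))
    else:                                    # branch: 0, then both children
        bits.append(0)
        write_tree(node[0], bits)
        write_tree(node[1], bits)

def read_tree(bits, pos):
    if bits[pos] == 1:                       # leaf: read the 8 symbol bits back
        sym = 0
        for b in bits[pos + 1:pos + 9]:
            sym = (sym << 1) | b
        return chr(sym), pos + 9
    left, pos = read_tree(bits, pos + 1)     # branch: decode both subtrees
    right, pos = read_tree(bits, pos)
    return (left, right), pos

# A tree matching the codes above: a=11, b=01, c=101, d=000, e=001, f=100
tree = ((("d", "e"), "b"), (("f", "c"), "a"))
bits = []
write_tree(tree, bits)
restored, end = read_tree(bits, 0)
# After position 'end', the next bit is the first bit of the encoded data.
print(restored == tree, end == len(bits))   # True True
```

With 6 leaves and 5 internal nodes this header costs 6*9 + 5 = 59 bits, and the reader knows exactly where it ends without any separator.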
I'm working with a MagTek DynaPro in a project to read credit card data and enter it into an accounting system (not my first post on this project). I've successfully leveraged Dukpt.NET to decrypt MSR data, so that's been good (https://github.com/sgbj/Dukpt.NET). So I started working on getting the EMV data, and I've used the following MagTek document for TLV structure reference: https://www.magtek.com/content/documentationfiles/d99875585.pdf (starting at page 89). However, I'm having trouble reading the data.
I tried using BerTlv.NET (https://github.com/kspearrin/BerTlv.NET) to handle parsing the data, but it always throws an exception when I pass the TLV byte array to it. Specifically, this is what I get:
System.OverflowException : Array dimensions exceeded supported range.
I've also tried running the data through some other utilities to parse it out, but they all seem to throw errors, too. So, I think I'm left with trying to parse it on my own, but I'm not sure about the most efficient way to get it done. In some instances I know how many bytes to read in to get the data length, but in other cases I don't know what to expect.
Also, when breaking some of the data, I get to the F9 tag, and between it and the DFDF54 tag the hex reads as 8201B3. Now, the 01B3 makes sense considering the leading two bytes for full message length are 01B7, but I don't understand the 82. I can't assume that's the tag for "EMV Application Interchange Profile" since that's listed under the F2 tag.
Also, there's some padding of zeros (I think up to eight bytes worth) and four bytes of something else at the end that are excluded from two-byte message length at the very beginning. I'm not certain if that data being passed into parsers is causing a problem or not.
Refer to the spec screenshot: as per the EMV specs you are supposed to read the tags like below.
E.g. for tag 9F26 [1001 1111], the low 5 bits of the first byte are all 1s, so the subsequent byte is also tag data: [0010 0110].
But when it is 9A [1001 1010], the tag is complete after one byte, and the length follows.
The spec also says to check bit 8 of the second tag byte to see whether a third tag byte follows, but practically you will not require it.
In real life you know upfront the tags you will encounter, so you parse through the data byte by byte: if you get 9F you look at the next byte to get the full tag and then one byte of length; if it is 9A, the next byte is the length.
Note that the length is also in hex, which means 09 means 9 bytes, whereas 10 means 16 bytes. For 10 bytes it is 0A.
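That byte-by-byte walk can be sketched as follows (a Python illustration, not MagTek-specific; it handles one-byte lengths only, while full BER-TLV also has multi-byte length forms such as 82 xx xx meaning "length in the next two bytes"):

```python
def read_tag(data, pos):
    tag = [data[pos]]
    pos += 1
    if tag[0] & 0x1F == 0x1F:         # low 5 bits all 1s: tag continues
        while True:
            tag.append(data[pos])
            pos += 1
            if not tag[-1] & 0x80:    # bit 8 clear: last byte of the tag
                break
    return bytes(tag).hex().upper(), pos

# Tag 9F26 (two bytes), one-byte length 0x08, then 8 value bytes
tlv = bytes.fromhex("9F2608AABBCCDDEE112233")
tag, pos = read_tag(tlv, 0)
length = tlv[pos]
value = tlv[pos + 1:pos + 1 + length]
print(tag, length, value.hex().upper())   # 9F26 8 AABBCCDDEE112233

# Tag 9A is complete after one byte, so the length follows immediately
print(read_tag(b"\x9a\x03", 0))           # ('9A', 1)
```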
I now bless you to fly!!
While @adarsh-nanu's answer provides the exact BER-TLV specs, I believe what @michael-mccauley was encountering was MagTek's invalid usage of TLV tags. I actually stumbled through this exact scenario with the IDTech VIVOpay, where they also used invalid tags.
I rolled my own TLV parsing routines and I specifically called out the non-conforming tags to force set a length when not in BER-TLV conformance. See example code below:
int TlvTagLen(uchar *tag)
{
int len = 0; // Tag length
// Check for non-conforming IDTech tags 0xFFE0 : 0xFFFF
if ((tag[0] == 0xFF) &&
((tag[1] >= 0xE0) && (tag[1] <= 0xFF)))
{
len = 2;
}
// Check if bits 0-4 in the first octet are all 1's
else if ((tag[len++] & 0x1F) == 0x1F)
{
// Remaining octets use bit 7 to indicate the tag includes an
// additional octet
while ((tag[len++] & 0x80) == 0x80)
{
// Include the next byte in the tag
}
}
return len;
}
Alright, so I basically want to read any file with a specific extension. Going through all the bytes and reading the file is basically easy, but what about getting the type of the next byte? For example:
while ((int)reader.BaseStream.Position != RecordSize * RecordsCount)
{
// How do I check what type is the next byte gonna be?
// Example:
// In every file, the first byte is always a uint:
uint id = reader.ReadUInt32();
// However, now I need to check for the next byte's type:
// How do I check the next byte's type?
}
Bytes don't have a type. When data in some language-level type, such as a char or string or Long, is converted to bytes and written to a file, there is no strict way to tell what the type was: all bytes look alike, a number from 0-255.
In order to know, and to convert back from bytes to structured language types, you need to know the format that the file was written in.
For example, you might know that the file was written as an ascii text file, and hence every byte represents one ascii character.
Or you might know that your file was written with the format {uint}{50 byte string}{linefeed}, where the first 2 bytes represent a uint, the next 50 a string, followed by a linefeed.
Because all bytes look the same, if you don't know the file format you can't read the file in a semantically correct way. For example, I might send you a file I created by writing out some ascii text, but I might tell you that the file is full of 2-byte uints. You would write a program to read those bytes as 2-byte uints and it would work : any 2 bytes can be interpreted as a uint. I could tell someone else that the same file was composed of 4-byte longs, and they could read it as 4-byte longs : any 4 bytes can be interpreted as a long. I could tell someone else the file was a 2 byte uint followed by 6 ascii characters. And so on.
Many types of files will have a defined format : for example, a Windows executable, or a Linux ELF binary.
You might be able to guess the types of the bytes in the file if you know something about the reason the file exists. But somehow you have to know, and then you interpret those bytes according to the file format description.
You might think "I'll write the bytes with a token describing them, so the reading program can know what each byte means". For example, a byte with a '1' might mean the next 2 bytes represent a uint, a byte with a '2' might mean the following byte tells the length of a string, and the bytes after that are the string, and so on. Sure, you can do that. But (a) the reading program still needs to understand that convention, so everything I said above is true (it's turtles all the way down), (b) that approach uses a lot of space to describe the file, and (c) The reading program needs to know how to interpret a dynamically described file, which is only useful in certain circumstances and probably means there is a meta-meta format describing what the embedded meta-format means.
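That token idea can be sketched quickly (Python, with made-up tag values; note the reader still needs to know this convention up front, which is exactly the point above):

```python
import struct

TAG_UINT, TAG_STR = 1, 2

def write_record(values):
    out = bytearray()
    for v in values:
        if isinstance(v, int):
            out += bytes([TAG_UINT]) + struct.pack("<H", v)   # tag + 2-byte uint
        else:
            raw = v.encode("ascii")
            out += bytes([TAG_STR, len(raw)]) + raw           # tag + length + bytes
    return bytes(out)

def read_record(data):
    values, pos = [], 0
    while pos < len(data):
        tag = data[pos]
        if tag == TAG_UINT:
            values.append(struct.unpack_from("<H", data, pos + 1)[0])
            pos += 3
        else:                                                 # TAG_STR
            n = data[pos + 1]
            values.append(data[pos + 2:pos + 2 + n].decode("ascii"))
            pos += 2 + n
    return values

blob = write_record([1000, "hi"])
print(read_record(blob))   # [1000, 'hi']
```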
Long story short, all bytes look the same, and a reading program has to be told what those bytes represent before it can use them meaningfully.
System.IO.BinaryWriter outfile;
System.IO.FileStream fs = new System.IO.FileStream(some_object.text, System.IO.FileMode.Create);
outfile = new System.IO.BinaryWriter(fs);
outfile.Write('A'); // Line 1
outfile.Write('B'); // Line 2
outfile.Write('C'); // Line 3
outfile.Write( Convert.ToUInt16(some_object.text, 16) ); // Line 4
outfile.Write((ushort)0); // Line 5
Here I declare a BinaryWriter for creating my output file.
What I need to know clearly is how the file is exactly being written.
Meaning, do Lines 1, 2, 3 write the file byte by byte, i.e. 1 byte at a time, if I am correct?
This some_object.text holds the value "2000".
How many bytes does Line 4 exactly write? (2 bytes/16 bits, since a UInt16 is 16 bits?)
Take a look at the chart from MSDN to see how many bytes are written:
BinaryWriter.Write Method
The BinaryWriter uses the BitConverter class to create sequences of bytes that are written to the underlying stream. A great way to understand what is going on, at the lowest level, is to use .NET Reflector. It can decompile assemblies and easily be used to figure out framework implementation details.
Most of the binary write methods use the native representation in little-endian form (though endianness is architecture-specific and varies between platforms such as Xbox and Windows). The only exception is strings: strings are by default encoded using UTF-8, prefixed with their length.
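The resulting file layout for the snippet above can be sketched in Python (assuming the default UTF-8 encoding and little-endian output; note that Convert.ToUInt16("2000", 16) parses the string as hex, giving 0x2000 = 8192):

```python
import struct

out = bytearray()
# 'A', 'B', 'C' are below U+0080, so Write(char) emits one UTF-8 byte each
out += "ABC".encode("utf-8")             # Lines 1-3 -> 41 42 43
value = int("2000", 16)                  # Convert.ToUInt16("2000", 16) -> 8192
out += struct.pack("<H", value)          # Line 4 -> 00 20 (little-endian)
out += struct.pack("<H", 0)              # Line 5 -> 00 00

print(len(out), out.hex().upper())       # 7 41424300200000
```

So Lines 1-3 write one byte each, and Lines 4-5 write two bytes each: 7 bytes in total.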
I'm writing a C# application that reads data from an SQL database generated by VB6 code. The data is an array of Singles. I'm trying to convert them to a float[]
Below is the VB6 code that wrote the data in the database (cannot change this code):
Set fso = New FileSystemObject
strFilePath = "c:\temp\temp.tmp"
' Output the data to a temporary file
intFileNr = FreeFile
Open strFilePath For Binary Access Write As #intFileNr
Put #intFileNr, , GetSize(Data, 1)
Put #intFileNr, , GetSize(Data, 2)
Put #intFileNr, , Data
Close #intFileNr
' Read the data back AS STRING
Open strFilePath For Binary Access Read As #intFileNr
strData = String$(LOF(intFileNr), 32)
Get #intFileNr, 1, strData
Close #intFileNr
Call Field.AppendChunk(strData)
As you can see, the data is put in a temporary file, then read back as a VB6 String and written to the database (a row of type dbLongBinary).
I've tried the following:
Doing a BlockCopy
byte[] source = databaseValue as byte[];
float [,] destination = new float[BitConverter.ToInt32(source, 0), BitConverter.ToInt32(source, 4)];
Buffer.BlockCopy(source, 8, destination, 0, 50 * 99 * 4);
The problem here is the VB6 binary to string conversion. The VB6 string char is 2 bytes wide and I don't know how to transform this back to a binary format I can handle.
Below is a dump of the temp file that the VB6 code generates:
http://robbertdam.nl/share/dump%20of%20text%20file%20generated%20by%20VB6.png
And here is the dump of the data as I read it from the database in (=the VB6 string):
http://robbertdam.nl/share/dump%20of%20database%20field.png
One possible way I see is to:
Read the data back as a System.Char[], which is Unicode just like VB BSTRs.
Convert it to an ASCII byte array via Encoding.ASCII.GetBytes(). Effectively this removes all the interleaved 0s.
Copy this ASCII byte array to your final float array.
Something like this:
char[] destinationAsChars = new char[BitConverter.ToInt32(source, 0)* BitConverter.ToInt32(source, 4)];
byte[] asciiBytes = Encoding.ASCII.GetBytes(destinationAsChars);
float[] destination = new float[notSureHowLarge];
Buffer.BlockCopy(asciiBytes, 0, destination, 0, asciiBytes.Length);
Now destination should contain the original floats. CAVEAT: I am not sure if the internal format of VB6 Singles is binary-compatible with that of .NET's System.Single. If not, all bets are off.
This is the solution I derived from the answer above.
Reading the file in as a Unicode char[], and then re-encoding to my default system encoding, produced readable files.
internal void FixBytes()
{
//Convert the bytes from VB6 style BSTR to standard byte[].
char[] destinationAsChars =
System.Text.Encoding.Unicode.GetString(File).ToCharArray();
byte[] asciiBytes = Encoding.Default.GetBytes(destinationAsChars);
byte[] newFile = new byte[asciiBytes.Length];
Buffer.BlockCopy(asciiBytes,0, newFile, 0, asciiBytes.Length);
File = newFile;
}
As you probably know, that's very bad coding on the VB6 end. What it's trying to do is to cast the Single data -- which is the same as float in C# -- as a String. But while there are better ways to do that, it's a really bad idea to begin with.
The main reason is that reading the binary data into a VB6 BSTR will convert the data from 8-bit bytes to 16-bit characters, based on the current code page. So this can produce different results in the DB depending on the locale it's running in. (!)
So when you read it back from the DB, unless you specify the same code page used when writing, you'll get different floats, possibly even invalid ones.
It would help to see examples of data both in binary (single) and DB (string) form, in hex, to verify that this is what's happening.
From a later post:
Actually that is not "bad" VB6 code.
It is, because it takes binary data into the string domain, which violates a prime rule of modern VB coding. It's why the Byte data type exists. If you ignore this, you may well wind up with undecipherable data when a DB you create crosses locale boundaries.
What he is doing is storing the array in a compact binary format and saving it as a "chunk" into the database. There are lots of valid reasons to do this.
Of course he has a valid reason for wanting this (although your definition of 'compact' is different from the conventional one). The ends are fine: the means chosen are not.
To the OP:
You probably can't change what you're given as input data, so the above is mostly academic. If there's still time to change the method used to create the blobs, let us suggest methods that don't involve strings.
In applying any provided solution, do your best to avoid strings, and if you can't, decode them using the specific code page that matches the one that created them.
Can you clarify what the contents of the file are (i.e. an example)? Either as binary (perhaps hex) or characters? If the data is a VB6 string, then you'll have to use float.Parse() to read it. .NET strings are also 2-bytes per character, but when loading from a file you can control this using the Encoding.
Actually that is not "bad" VB6 code. What he is doing is storing the array in a compact binary format and saving it as a "chunk" into the database. There are lots of valid reasons to do this.
The reason for the VB6 code saving it to disk and reading it back is because VB6 doesn't have native support for reading and writing files in memory only. This is the common algorithm if you want to create a chunk of binary data and stuff it somewhere else like a database field.
It is not an issue dealing with this in .NET. The code I have is in VB.NET, so you will have to convert it to C#.
Modified to handle bytes and the unicode problem.
Public Function DataArrayFromDatabase(ByVal dbData As byte()) As Single(,)
Dim bData(UBound(dbData) \ 2) As Byte
Dim I As Long
Dim J As Long
J = 0
For I = 0 To UBound(dbData) Step 2
bData(J) = dbData(I)
J = J + 1
Next I
Dim sM As New IO.MemoryStream(bData)
Dim bR As IO.BinaryReader = New IO.BinaryReader(sM)
Dim Dim1 As Integer = bR.ReadInt32
Dim Dim2 As Integer = bR.ReadInt32
Dim newData(Dim1, Dim2) As Single
For I = 0 To Dim2
For J = 0 To Dim1
newData(J, I) = bR.ReadSingle
Next
Next
bR.Close()
sM.Close()
Return newData
End Function
The key trick is to read in the data just like if you were in VB6. We have the ability to use MemoryStreams in .NET so this is fairly easy.
First we skip every other byte to eliminate the Unicode padding.
Then we create a memorystream from the array of bytes. Then a BinaryReader initialized with the MemoryStream.
We read in the first dimension of the array (a VB6 Long, i.e. a .NET Int32).
We read in the second dimension of the array (a VB6 Long, i.e. a .NET Int32).
The read loops are constructed in reverse order of the array's dimensions: Dim2 is the outer loop and Dim1 is the inner. The reason for this is that this is how VB6 stores arrays in binary format.
Return newData and you have successfully restored the original array that was created in VB6!
Now you could try to use some math trick: the two dimensions are 4 bytes/characters each, and each array element is 4 bytes/characters. But for long-term maintainability I find using byte manipulation with memory streams a lot more explicit. It takes a little more code but is a lot clearer when you revisit it 5 years from now.
First we skip every other byte to eliminate the Unicode padding.
Hmmm... if that were a valid strategy, then every other column in the DB string dump would consist of nothing but zeros. But a quick scan down the first one shows that this isn't the case. In fact there are a lot of non-zero bytes in those columns. Can we afford to just discard them?
What this shows is that the conversion to Unicode caused by the use of Strings does not simply add 'padding', but changes the character of the data. What you call padding is a coincidence of the fact that the ASCII range (00-7F binary) is mapped onto the same Unicode range. But this is not true of binary 80-FF.
Take a look at the first stored value, which has an original byte value of 94 9A 27 3A. When converted to Unicode, these DO NOT become 94 00 9A 00 27 00 3A 00. They become 1D 20 61 01 27 00 3A 00.
Discarding every other byte gives you 1D 61 27 3A -- not the original 94 9A 27 3A.
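That round trip is easy to reproduce (a Python sketch; cp1252 stands in for the Windows ANSI code page the VB6 string conversion would use):

```python
raw = bytes([0x94, 0x9A, 0x27, 0x3A])

# Reading the raw bytes as ANSI text maps 0x94 -> U+201D and 0x9A -> U+0161
as_text = raw.decode("cp1252")
stored = as_text.encode("utf-16-le")   # how a BSTR holds it: 2 bytes per char

print(stored.hex().upper())            # 1D20610127003A00, not 94009A0027003A00
print(stored[::2].hex().upper())       # dropping odd bytes gives 1D61273A
```

So the "padding" only appears for bytes in the 00-7F range; anything in 80-FF is remapped, and the original bytes are unrecoverable without knowing the code page.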