Parsing MagTek EMV TLV - C#

I'm working with a MagTek DynaPro in a project to read credit card data and enter it into an accounting system (not my first post on this project). I've successfully leveraged Dukpt.NET to decrypt the MSR data, so that's been good (https://github.com/sgbj/Dukpt.NET). So I started working on getting the EMV data, and I've used the following MagTek document for TLV structure reference: https://www.magtek.com/content/documentationfiles/d99875585.pdf (starting at page 89). However, I'm having trouble reading the data.
I tried using BerTlv.NET (https://github.com/kspearrin/BerTlv.NET) to handle parsing the data, but it always throws an exception when I pass the TLV byte array to it. Specifically, this is what I get:
System.OverflowException : Array dimensions exceeded supported range.
I've also tried running the data through some other utilities to parse it out, but they all seem to throw errors, too. So, I think I'm left with trying to parse it on my own, but I'm not sure about the most efficient way to get it done. In some instances I know how many bytes to read in to get the data length, but in other cases I don't know what to expect.
Also, when breaking down some of the data by hand, I get to the F9 tag, and between it and the DFDF54 tag the hex reads as 8201B3. Now, the 01B3 makes sense considering the leading two bytes for the full message length are 01B7, but I don't understand the 82. I can't assume that's the tag for "EMV Application Interchange Profile", since that's listed under the F2 tag.
Also, there's some padding of zeros (I think up to eight bytes' worth) and four bytes of something else at the end that are excluded from the two-byte message length at the very beginning. I'm not certain whether that data being passed into the parsers is causing a problem or not.

Refer to spec screenshot 1; as per the EMV specs you are supposed to read the tags like below.
E.g. in tag 9F26 the first byte [1001 1111] has its low five bits all set, so the subsequent byte [0010 0110] is also tag data.
But when it is 9A [1001 1010], the tag data is complete and the length follows.
The spec also says to check bit 8 of the second tag byte to see whether a third tag byte follows, but practically you will not require it.
In real life you know up front the tags you will encounter, so you parse through the data byte by byte: if you get 9F you look at the next byte to get the full tag and then the next byte is the length, and if it is 9A, the next byte is the length.
Note that the length is also in hex, which means 09 is 9 bytes whereas 10 is 16 bytes; for 10 bytes it is 0A.
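To make that concrete, here is a minimal C# sketch of the byte-by-byte reading just described (the method and parameter names are mine, not from any library). It also handles the long-form lengths from the BER-TLV spec, where a length byte of 0x81 means the real length is in the next byte and 0x82 means it is in the next two bytes, which appears to be exactly what the 82 in the question's 8201B3 is: a two-byte length of 0x01B3.

// Sketch only: reads one tag and its length starting at offset, advancing
// offset to the first byte of the value.
static void ReadTagAndLength(byte[] data, ref int offset, out int tagLength, out int valueLength)
{
    int start = offset;

    // Tag: if the low five bits of the first byte are all 1s, more tag bytes
    // follow, each flagged by bit 8 of the previous byte (9F 26, DF DF 54, ...).
    if ((data[offset++] & 0x1F) == 0x1F)
    {
        while ((data[offset++] & 0x80) == 0x80) { }
    }
    tagLength = offset - start;

    // Length: a single byte below 0x80 is the length itself; 0x81 puts the
    // length in the next byte; 0x82 puts it in the next two bytes.
    int first = data[offset++];
    if (first < 0x80)
    {
        valueLength = first;
    }
    else if (first == 0x81)
    {
        valueLength = data[offset++];
    }
    else if (first == 0x82)
    {
        valueLength = (data[offset] << 8) | data[offset + 1];
        offset += 2;
    }
    else
    {
        throw new InvalidOperationException("Length form not handled in this sketch");
    }
}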
I now bless you to fly!!

While @adarsh-nanu's answer provides the exact BER-TLV specs, I believe what @michael-mccauley was encountering was MagTek's invalid usage of TLV tags. I actually stumbled through this exact scenario for the IDTech VIVOpay, where they also used invalid tags.
I rolled my own TLV parsing routines, and I specifically called out the non-conforming tags to force-set a length when the data is not in BER-TLV conformance. See the example code below:
int TlvTagLen(unsigned char *tag)
{
    int len = 0; // Tag length

    // Check for non-conforming IDTech tags 0xFFE0 : 0xFFFF
    if ((tag[0] == 0xFF) &&
        ((tag[1] >= 0xE0) && (tag[1] <= 0xFF)))
    {
        len = 2;
    }
    // Check if bits 0-4 in the first octet are all 1's
    else if ((tag[len++] & 0x1F) == 0x1F)
    {
        // Remaining octets use bit 7 to indicate the tag includes an
        // additional octet
        while ((tag[len++] & 0x80) == 0x80)
        {
            // Include the next byte in the tag
        }
    }
    return len;
}
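Since the question is C#, a rough port of the routine above might look like the sketch below; the 0xFFE0..0xFFFF special case mirrors the C code's handling of the non-conforming vendor tags, and the rest is plain BER tag logic.

// Rough C# sketch of the C routine above; intended to behave the same way.
static int TlvTagLen(byte[] tag)
{
    // Non-conforming vendor tags 0xFFE0..0xFFFF are forced to a two-byte tag
    if (tag[0] == 0xFF && tag[1] >= 0xE0)
        return 2;

    int len = 1;
    // Low five bits all set means subsequent octets belong to the tag,
    // each one flagged by bit 8 of the octet before it
    if ((tag[0] & 0x1F) == 0x1F)
    {
        while ((tag[len++] & 0x80) == 0x80) { }
    }
    return len;
}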

Related

How to efficiently store Huffman Tree and Encoded binary string into a file?

I can easily convert a character string into a Huffman tree and then encode it into a binary sequence.
How should I save these so that I can actually compress the original data and then recover it?
I searched the web but could only find guides and answers covering what I have already done. How can I take the Huffman algorithm further to actually achieve lossless compression?
I am using C# for this project.
EDIT: I've achieved the following so far; it might need rethinking.
I am attempting to compress a text file. I use the Huffman algorithm, but there are some key points I couldn't figure out:
"aaaabbbccdef" when compressed gives this encoding
Key = a, Value = 11
Key = b, Value = 01
Key = c, Value = 101
Key = d, Value = 000
Key = e, Value = 001
Key = f, Value = 100
11111111010101101101000001100 is the encoded version. It normally needs 12*8 bits, but we've compressed it to 29 bits. This example might be a little unnecessary for a file this small, but let me explain what I tried to do.
We have 29 bits here, but we need 8*n bits, so I fill the encodedString with zeros until it becomes a multiple of eight. Since I can add 1 to 7 zeros, one byte is more than enough to represent how many. In this case I've added 3 zeros:
11111111010101101101000001100000 Then I add, as binary, how many extra bits I've added to the front and split into 8-bit pieces:
00000011-11111111-01010110-11010000-01100000
Turn these into ASCII characters
ÿVÐ`
Now if I have the encoding table, I can look at the first 8 bits, convert that to an integer ignoreBits, and by ignoring the last ignoreBits bits turn it back into the original form.
The problem is I also want to include an uncompressed version of the encoding table in this file to have a fully functional zip/unzip program, but I am having trouble deciding where my ignoreBits ends, where my encodingTable starts/ends, and where the encoded bits start/end.
I thought about using a null character as a separator, but there is no assurance that the values cannot produce a null character; "ddd" in this situation produces 00000000-0.....
Your representation of the code needs to be self-terminating. Then you know the next bit is the start of the Huffman codes. One way is to traverse the tree that resulted from the Huffman code, writing a 0 bit for each branch, or a 1 bit followed by the symbol for a leaf. When the traversal is done, you know the next bit must be the start of the codes.
You also need to make your data self-terminating. Note that in the example you give, the added three zero bits will be decoded as another 'd', so you will incorrectly get 'aaaabbbccdefd' as the result. You need to either precede the encoded data with a count of the symbols expected, or add a symbol to your encoded set, with frequency 1, that marks the end of the data.
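A minimal sketch of that tree serialization in C#, reusing the question's string-of-bits approach; Node, Left, Right and Symbol are hypothetical names for whatever tree type you already have:

using System;
using System.Text;

class Node
{
    public Node Left, Right;   // both null for a leaf
    public char Symbol;        // meaningful only for leaves
    public bool IsLeaf => Left == null && Right == null;
}

// Pre-order serialization: 0 announces a branch, 1 announces a leaf and is
// followed by the 8-bit symbol, so the reader always knows where the tree ends.
static void WriteTree(Node node, StringBuilder bits)
{
    if (node.IsLeaf)
    {
        bits.Append('1');
        bits.Append(Convert.ToString((byte)node.Symbol, 2).PadLeft(8, '0'));
    }
    else
    {
        bits.Append('0');
        WriteTree(node.Left, bits);
        WriteTree(node.Right, bits);
    }
}

The reader mirrors this: read one bit, and on a 1 take the next 8 bits as a leaf symbol, on a 0 recurse for two children; once the root is complete, the very next bit is the first Huffman code. Writing the symbol count (12 for "aaaabbbccdef") before the encoded data then tells the decoder when to stop, so the padding zeros are never misread as an extra 'd'.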

C# WPF Binary Reading

Alright, so I basically want to read any file with a specific extension. Going through all the bytes and reading the file is basically easy, but what about getting the type of the next byte? For example:
while ((int)reader.BaseStream.Position != RecordSize * RecordsCount)
{
    // How do I check what type the next byte is going to be?
    // Example:
    // In every file, the first value is always a uint:
    uint id = reader.ReadUInt32();
    // However, now I need to check for the next byte's type:
    // How do I check the next byte's type?
}
Bytes don't have a type. When data of some language type, such as a char or string or long, is converted to bytes and written to a file, there is no strict way to tell what the type was: all bytes look alike, each just a number from 0-255.
In order to know, and to convert back from bytes to structured language types, you need to know the format that the file was written in.
For example, you might know that the file was written as an ascii text file, and hence every byte represents one ascii character.
Or you might know that your file was written with the format {uint}{50 byte string}{linefeed}, where the first 2 bytes represent a uint, the next 50 a string, followed by a linefeed.
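For instance, a reader that knows that (made-up) {2-byte uint}{50-byte string}{linefeed} layout might look like this rough sketch, treating the 2-byte uint as a ushort; the file name is hypothetical:

using System;
using System.IO;
using System.Text;

// Sketch only: the record layout here is the hypothetical format from the
// paragraph above, not any real file format.
using (var reader = new BinaryReader(File.OpenRead(@"c:\records.dat")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        ushort id = reader.ReadUInt16();                                         // the 2-byte "uint"
        string name = Encoding.ASCII.GetString(reader.ReadBytes(50)).TrimEnd();  // the fixed 50-byte string
        reader.ReadByte();                                                       // consume the linefeed
        Console.WriteLine("{0}: {1}", id, name);
    }
}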
Because all bytes look the same, if you don't know the file format you can't read the file in a semantically correct way. For example, I might send you a file I created by writing out some ascii text, but I might tell you that the file is full of 2-byte uints. You would write a program to read those bytes as 2-byte uints and it would work : any 2 bytes can be interpreted as a uint. I could tell someone else that the same file was composed of 4-byte longs, and they could read it as 4-byte longs : any 4 bytes can be interpreted as a long. I could tell someone else the file was a 2 byte uint followed by 6 ascii characters. And so on.
Many types of files will have a defined format : for example, a Windows executable, or a Linux ELF binary.
You might be able to guess the types of the bytes in the file if you know something about the reason the file exists. But somehow you have to know, and then you interpret those bytes according to the file format description.
You might think "I'll write the bytes with a token describing them, so the reading program can know what each byte means". For example, a byte with a '1' might mean the next 2 bytes represent a uint, a byte with a '2' might mean the following byte tells the length of a string, and the bytes after that are the string, and so on. Sure, you can do that. But (a) the reading program still needs to understand that convention, so everything I said above is true (it's turtles all the way down), (b) that approach uses a lot of space to describe the file, and (c) The reading program needs to know how to interpret a dynamically described file, which is only useful in certain circumstances and probably means there is a meta-meta format describing what the embedded meta-format means.
Long story short, all bytes look the same, and a reading program has to be told what those bytes represent before it can use them meaningfully.

1:1 decoding of UTF-8 octets for visualization

I'm making a tool (C#, WPF) for viewing binary data which may contain embedded text. It's traditional for such data viewers to use two vertical columns, one displaying the hexadecimal value of each byte and the other displaying the ASCII character corresponding to each byte, if printable.
I've been thinking it would be nice to support display of embedded text using non-ASCII encodings as well, in particular UTF-8 and UTF-16. The issue is that UTF code points don't map 1:1 with octets. I would like to keep the output grid-aligned according to its location in the data, so I need every octet to map to something to appear in the corresponding cell in the grid. What I'm thinking is that the end octet of each code point will map to the resulting Unicode character, and lead bytes map to placeholders that vary with sequence length (perhaps circled forms and use color to distinguish them from the actual encoded characters), and continuation and invalid bytes similarly to placeholders.
struct UtfOctetVisualization
{
    enum Classification
    {
        Ascii,
        NonAscii,
        LeadByteOf2,
        LeadByteOf3,
        LeadByteOf4,
        Continuation,
        Error
    }

    Classification OctetClass;
    int CodePoint; // valid only when OctetClass == Ascii or NonAscii
}
The Encoding.UTF8.GetString() method doesn't provide any information about the location each resulting character came from.
I could use Encoding.UTF8.GetDecoder() and call Convert passing a single byte at a time so that the completed output parameter gives a classification for each octet.
But in both methods, in order to have handling of invalid characters, I would need to implement a DecoderFallback class? This looks complicated.
Is there a simple way to get this information using the APIs provided with .NET (in System.Text or otherwise)? Using System.Text.Decoder, what would the fallback look like that fills in an output array shared with the decoder?
Or is it more feasible to write a custom UTF-8 recognizer (finite state machine)?
How about decoding one character at a time so that you can capture the number of bytes each character occupies? Something like this:
string data = "hello????";
byte[] buffer = new byte[Encoding.UTF8.GetByteCount(data)];
int bufferIndex = 0;
for (int i = 0; i < data.Length; i++)
{
    int bytes = Encoding.UTF8.GetBytes(data, i, 1, buffer, bufferIndex);
    Console.WriteLine("Character: {0}, Position: {1}, Bytes: {2}", data[i], i, bytes);
    bufferIndex += bytes;
}
Fiddle: https://dotnetfiddle.net/poohHM
Those "????" in the string are supposed to be multi-byte characters, but SO doesn't let me paste them in. See the Fiddle.
I don't think this is going to work out the way you want when you mix binary stuff with characters, as @Jon has pointed out. I mean you'll see something, but it may not be what you expect, because the encoder won't be able to distinguish which bytes are supposed to be characters.
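For the custom-recognizer route the question mentions, classifying a single octet by its bit pattern is simple enough. A minimal sketch using the question's own Classification names (it only checks the shape of each byte; a small state machine on top of it would validate whole sequences):

// Sketch: assumes the Classification enum from the question is accessible here.
// Continuation bytes are 10xxxxxx, lead bytes start with 110/1110/11110,
// and 0xF8..0xFF never appear in valid UTF-8.
static Classification ClassifyOctet(byte b)
{
    if (b <= 0x7F) return Classification.Ascii;                   // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return Classification.Continuation;   // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return Classification.LeadByteOf2;    // 110xxxxx
    if ((b & 0xF0) == 0xE0) return Classification.LeadByteOf3;    // 1110xxxx
    if ((b & 0xF8) == 0xF0) return Classification.LeadByteOf4;    // 11110xxx
    return Classification.Error;
}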

How to reverse a Reed-Solomon algorithm? [duplicate]

I want to transmit binary data over a noisy channel.
I read that a good ECC algorithm to detect errors is Reed-Solomon.
The problem is I don't understand the input for this algorithm.
Here is my naive, failed attempt with zxing.net:
int[] toEncode = { 123,232,432};
var gf = GenericGF.AZTEC_DATA_12;
ReedSolomonEncoder rse = new ReedSolomonEncoder(gf);
rse.encode(toEncode, 2);
ReedSolomonDecoder rsd = new ReedSolomonDecoder(gf);
rse.encode(toEncode, 2);
Please explain to me the input for the encoder and decoder.
Is this the implementation you are using here: ReedSolomonEncoder.cs?
If so, to encode N integers with M data correction integers, you need to pass an array of length N+M. Your data should be in the first N indices and the codes look to be added at the end in the final M entries.
Also, note the following restriction in the encoder:
Update: a more recent version is here: http://zxingnet.codeplex.com/. Its most recent version of ReedSolomonEncoder.cs does not have this restriction.
This class implements Reed-Solomon encoding schemes used in processing QR codes. A very brief description of Reed Solomon encoding is here: Reed-Solomon Codes.
An encoding choice of "QR_CODE_FIELD_256" (which is probably a reasonable choice for you) means that error correction codes are being generated on byte-sized chunks ("symbols") of your message, which means your maximum message length (data to encode plus error correction codes) is 255 bytes long. If you are sending more data you will need to break it into chunks.
Update 2: Using QR_CODE_FIELD_256, your integers need to be between 0 and 255 as well, so to encode a general byte stream, you need to put each byte into a separate integer in the integer array, pass the int array (plus space for error correction codes) through the encoder, then reconvert to a (larger) byte array. And the reverse for decoding.
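Putting the N+M layout and QR_CODE_FIELD_256 together, a sketch might look like the following; the payload bytes and the choice of 8 error-correction symbols are arbitrary, and I'm assuming the ZXing.Net types from the question live in the usual ZXing.Common.ReedSolomon namespace:

using System;
using ZXing.Common.ReedSolomon;

// Sketch: encode N data bytes with M error-correction symbols over GF(256).
// With QR_CODE_FIELD_256 every value must fit in 0..255, so each byte of a
// larger payload goes into its own int.
byte[] payload = { 0x12, 0x34, 0x56, 0x78 };   // N = 4 arbitrary data bytes
const int eccCount = 8;                        // M = 8 correction symbols (arbitrary)

int[] block = new int[payload.Length + eccCount];
for (int i = 0; i < payload.Length; i++)
    block[i] = payload[i];                     // data first; encode() fills the last M slots

var field = GenericGF.QR_CODE_FIELD_256;
new ReedSolomonEncoder(field).encode(block, eccCount);

// ...transmit block; up to eccCount/2 symbols can be corrupted in transit...

// The decoder repairs the same N+M array in place.
new ReedSolomonDecoder(field).decode(block, eccCount);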

ASCII range as regards binary files?

I've been reading about this topic and didn't get the specific info for my question:
(Maybe the following is incorrect - but please do correct me.)
Every file (text/binary) is storing BYTES.
A byte is 8 bits, hence its max value is 2^8 - 1 = 255, giving 256 possible codes.
Those codes divide into 2 groups:
0..127 : textual chars
128..255 : special chars.
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
1) Correct?
2) Now, let's say I'm saving one INT in a binary file (4 bytes on a 32-bit system).
How does the file tell the program that reads it that it's not 4 single unrelated bytes but an int which is 4 bytes?
Underneath, all files are stored as bytes, so in a sense what you're saying is correct. However, if you open a file that's intended to be read as binary and try to read it in a text editor, it will look like gibberish.
How does a program know whether to read a file as text or as binary (i.e. as special sets of ASCII or other encoded bytes, or just as the underlying bytes with a different representation)?
Well, it doesn't know - it just does what it's told.
In Windows, you open .txt files in notepad - notepad expects to be reading text. Try opening a binary file in notepad. It will open, you will see stuff, but it will be rubbish.
If you're writing your own program you can write using BinaryWriter and read using BinaryReader if you want to store everything as binary. What would happen if you wrote using BinaryWriter and read using StringReader?
To answer your specific example:
using (var test = new BinaryWriter(new FileStream(@"c:\test.bin", FileMode.Create)))
{
    test.Write(10);
    test.Write("hello world");
}
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadInt32();
    var out2 = test.ReadString();
    Console.WriteLine("{0} {1}", out1, out2);
}
See how you have to read in the same order that's written? The file doesn't tell you anything.
Now switch the second part around:
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadString();
    var out2 = test.ReadInt32();
    Console.WriteLine("{0} {1}", out1, out2);
}
You'll get gibberish out (if it works at all). Yet there is nothing you can read in the file that will tell you that beforehand. There is no special information there. The program must know what to do based on some out of band information (a specification of some sort).
So a binary file contains char codes from the whole range: 0..255 (ASCII chars + special chars).
No, a binary file just contains bytes. Values between 0 and 255. They should only be considered as character at all if you decide to ascribe that meaning to them. If it's a binary file (e.g. a JPEG) then you shouldn't do that - a byte 65 in image data isn't logically an 'A' - it's whatever byte 65 means at that point in the file.
(Note that even text files aren't divided into "ASCII characters" and "special characters" - it depends on the encoding. In UTF-16, each code unit takes two bytes regardless of its value. In UTF-8 the number of bytes depends on the character you're trying to represent.)
How does the file tell the program that reads it that it's not 4 single unrelated bytes but an int which is 4 bytes?
The file doesn't tell the program. The program has to know how to read the file. If you ask Notepad to open a JPEG file, it won't show you an image - it will show you gibberish. Likewise if you try to force an image viewer to open a text file as if it were a JPEG, it will complain that it's broken.
Programs reading data need to understand the structure of the data they're going to read - they have to know what to expect. In some cases the format is quite flexible, like XML: there are well-specified layers, but then the program reads the values with higher-level meaning - elements, attributes etc. In other cases, the format is absolutely precise: first you'll start with a 4 byte integer, then two 2-byte integers or whatever. It depends on the format.
EDIT: To answer your specific (repeated) comment:
I'm in a cmd shell... you've written your binary file. I have no clue what you did there. How am I supposed to know whether to read 4 single bytes or 4 bytes at once?
Either the program reading the data needs to know the meaning of the data or it doesn't. If it's just copying the file from one place to another, it doesn't need to know the meaning of the data. It doesn't matter whether it copies it one byte at a time or all four bytes at once.
If it does need to know the meaning of the data, then just knowing that it's a four byte integer doesn't really help much - it would need to know what that integer meant to do anything useful with it. So your file written from the command shell... what does it mean? If I don't know what it means, what does it matter whether I know to read one byte at a time or four bytes as an integer?
(As I mentioned above, there's an intermediate option where code can understand structure without meaning, and expose that structure to other code which then imposes meaning - XML is a classic example of that.)
It's all a matter of interpretation. Neither the file nor the system knows what's going on in your file; they just see your storage as a sequence of bytes that has absolutely no meaning in itself. The same thing happens in your brain when you read a word (you attempt to choose a language to interpret it in, to give the sequence of characters a meaning).
It is the responsibility of your program to interpret the data the way you want it, as there is no single valid interpretation. For example, the sequence of bytes 48 65 6C 6C 6F 20 53 6F 6F 68 6A 75 6E can be interpreted as:
A string (Hello Soohjun)
A sequence of 12 one-byte characters (H, e, l, l, o, , S, o, o, h, j, u, n)
A sequence of 3 unsigned ints followed by a character (1214606444, 1864389487, 1869113973, 110)
A character followed by a float followed by an unsigned int followed by a float (72, 6.977992E22, 542338927, 4.4287998E24), and so on...
You are the one choosing the meaning of those bytes; another program would make a different interpretation of the very same data, much the same as a combination of letters has a different interpretation in, say, English and French.
PS: By the way, that's the goal of reverse engineering file formats: find the meaning of each byte.
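A quick sketch showing a couple of those interpretations of the same 13 bytes (the integer values listed above come from reading the bytes big-endian, so the sketch does the same):

using System;
using System.Buffers.Binary;
using System.Text;

byte[] bytes = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x53, 0x6F, 0x6F, 0x68, 0x6A, 0x75, 0x6E };

// Interpretation 1: a string of ASCII characters
Console.WriteLine(Encoding.ASCII.GetString(bytes));                        // Hello Soohjun

// Interpretation 2: three big-endian unsigned ints followed by a character
Console.WriteLine(BinaryPrimitives.ReadUInt32BigEndian(bytes.AsSpan(0)));  // 1214606444
Console.WriteLine(BinaryPrimitives.ReadUInt32BigEndian(bytes.AsSpan(4)));  // 1864389487
Console.WriteLine(BinaryPrimitives.ReadUInt32BigEndian(bytes.AsSpan(8)));  // 1869113973
Console.WriteLine(bytes[12]);                                              // 110, i.e. 'n'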
