Identifying repeating sequences of data in byte array

Identifying repeating sequences of data in byte array - c#

Given a sample of hexadecimal data, I would like to identify UNKNOWN sequences of bytes that are repeated throughout the sample. (Not searching for a known string or value) I am attempting to reverse engineer a network protocol, and I am working on determining data structures within the packet. As an example of what I'm trying to do (albeit on a smaller scale):
(af:b6:ea:3d:83:02:00:00):{21:03:00:00}:[b3:49:96:23:01]
{21:03:00:00}:(af:b6:ea:3d:83:02:00:00):01:42:00:00:00:00:01:57
And
(38:64:88:6e:83:02:00:00):{26:03:00:00}:[b3:49:96:23:01]
{26:03:00:00}:(38:64:88:6e:83:02:00:00):01:42:00:00:00:00:00:01
Obviously, these are easy to spot by eye, but patterns that are hundreds of chars into the data are not. I'm not expecting a magic bullet for the solution, just a nudge in the right direction, or even better, a premade tool.
I'm currently needing this for a C# project, but I am open to any and all tools.

If you have no idea what you are looking for, you could get an idea of the layout of the data by performing a negative entropy analysis on a reasonably large enough sample of conversations to see the length of the records/sub-records.
If the data is structured with repeated sequences of roughly the same length and content type you should see clusters of values with nearly the same negative entropy around the length of the record and sub records.
For example if you put a basic file with a lot of the same data through that, you should see values around the average record length with comparable negentropies (ex: if you use a CSV file with an average line length of 117 bytes, you might see 115, 116, 117 & 119 with the highest negentropy), and values around the most common field lengths with the same negentropy.
You might do a byte occurence scan, to see which byte values are likely separators.
There is a free hex editor with sources which does that for you (hexplorer, in the Crypto/Find Pattern menu). You may have to change the default font through Options to actually something in the UI.

Related

What is the formula to calculate a QR Code's maximum data?

I've Google'd and read quite a bit on QR codes and the maximum data that can be used based on the various settings, all of it being in tabular format. I can't seem to find anything giving a formula or a proper explanation of how these values are calculated.
What I would like to do is this:
Present the user with a form, allowing them to choose Format, EC & Version.
Then they can type in some data and generate a QR code.
Done deal. That part is easy.
The addition I would like to include is a "remaining character count" so that they (the user) can see how much more data they can type in, as well as what effect the properties have on the storage capacity of the QR code.
Does anyone know where I can find the formula(s)? Or do I need to purchase ISO 18004:2006?

A formula to calculate the amount of data you could put in a QRcode would be quite complex to make, not mentioning it would need some approximations for the calculation to be possible. The formula would have to calculate the amount of modules dedicated to the data in your QRCode based on its version, and then calculate how many codewords (which are sets of 8 modules) will be used for the error correction.
To calculate the amount of modules that will be used for the data, you need to know how many modules will be used for the function patterns. While this is not a problem for the three finder patterns, the timing or the version/format information, there will be a problem with the alignment patterns as their number is dependent on the QRCode's version, meaning you anyway would have to use a table at that point.
For the second part, I have to say I don't know how to calculate the number of error correcting codewords based on the correction capacity. For some reason, there are more error correcting codewords used that there should to match the error correction capacity, as for example a 6-H QRCode can correct up to 32.6% of the data, instead of the 30% set by the H correction level.
In any case, as you can see a formula would be quite complex to implement. Using a table like already suggested is probably the best thing you could do.

I wrote the original AIM specification for QR Code back in the '90s for Denso Corporation, and was also project editor for both editions of the ISO/IEC 18004 standard. It was felt to be much easier for people producing code printing software to use a look-up table rather than calculate capacities from a formula - no easy job as there are several independent variables that have to be taken into account iteratively when parsing the text to be encoded to minimise its length in bits, in order to achieve the smallest symbol. The most crucial factor is the mix of characters in the data, the sequence and lengths of sub-strings of numeric, alphanumeric, Kanji data, with the overhead needed to signal each change of character set, then the required level of error correction. I did produce a guidance section for this which is contained in the ISO standard.

The storage is calculated by the QR mode and the version/type that you are using. More specifically the calculation is based on how 'compressible' the characters are and what algorithm that the qr generator is allowed to use on the content present.
More information can be found http://en.wikipedia.org/wiki/QR_code#Storage

Data in a condensed format

I need a library which would help me to save and query data in a condensed format (a mini DSL in essence) here's a sample of what I want:
Update 1 - Please note, figures in the samples above are made small just to make is easier to follow the logic, the real figures are limited with c# long type capacity, ex:
1,18,28,29,39,18456789,18456790,18456792,184567896.
Sample Raw Data set: 1,2,3,8,11,12,13,14
Condensed Sample Data set:
1..3,8,11..14
What would be absolute nice to have is to be able to present 1,2,4,5,6,7,8,9,10 as 1..10-3.
Querying Sample Data set:
Query 1 (get range):
1..5 -> 1..3
Query 2 (check if the value exists)
?2 -> true
Query 3 (get multiple ranges and scalar values):
1..5,11..12,14 -> 1..3,11..12,14
I don't want to develop it from scratch and would highly prefer to use something which already exists.

Here are some ideas I've had over the days since I read your question. I can't be sure any of them really apply to your use case but I hope you'll find something useful here.
Storing your data compressed
Steps you can take to reduce the amount of space your numbers take up on disk:
If your values are between 1 and ~10M, don't use a long, use a uint. (4 bytes per number.)
Actually, don't use a uint. Store your numbers 7 bits to a byte, with the remaining bit used to say "there are more bytes in this number". (Then 1-127 will fit in 1 byte, 128-~16k in 2 bytes, ~16k-~2M in 3 bytes, ~2M-~270M in 4 bytes.)
This should reduce your storage from 8 bytes per number (if you were originally storing them as longs) to, say, on average 3 bytes. Also, if you end up needing bigger numbers, the variable-byte storage will be able to hold them.
Then I can think of a couple of ways to reduce it further, given you know the numbers are always increasing and may contain lots of runs. Which works best for you only you can know by trying it on your actual data.
For each of your actual numbers, store two numbers: the number itself, followed by the number of numbers contiguous after it (e.g. 2,3,4,5,6 => 2,4). You'll have to store lone numbers as e.g. 8,0 so will increase storage for those, but if your data has lots of runs (especially long ones) this should reduce storage on average. You could further store "single gaps" in runs as e.g. 1,2,3,5,6,7 => 1,6,4 (unambiguous as 4 is too small to be the start of the next run) but this will make processing more complex and won't save much space so I wouldn't bother.
Or, rather than storing the numbers themselves, store the deltas (so 3,4,5,7,8,9 => 3,1,1,2,1,1. This will reduce the number of bytes used for storing larger numbers (e.g. 15000,15005 (4 bytes) => 15000,5 (3 bytes)). Further, if the data contains a lot of runs (e.g. lots of 1 bytes), it will then compress (e.g. zip) nicely.
Handling in code
I'd simply advise you to write a couple of methods that stream a file from disk into an IEnumerable<uint> (or ulong if you end up with bigger numbers), and do the reverse, while handling whatever you've implemented from the above.
If you do this in a lazy fashion - using yield return to return the numbers as you read them from disk and calculate them, and streaming numbers to disk rather than holding them in memory and returning them at once, you can keep your memory usage down whatever the size of the stored data.
(I think, but I'm not sure, that even the GZipStream and other compression streams will let you stream your data without having it all in memory.)
Querying
If you're comparing two of your big data sets, I wouldn't advise using LINQ's Intersect method as it requires reading one of the sources completely into memory. However, as you know both sequences are increasing, you can write a similar method that needs only hold an enumerator for each sequence.
If you're querying one of your data sets against a user-input, small list of numbers, you can happily use LINQ's Intersect method as it is currently implemented, as it only needs the second sequence to be entirely in memory.

I'm not aware of any off-the-shelf library that does quite what you want, but I'm not sure you need one.
I suggest you consider using the existing BitArray class. If, as your example suggests, you're interested in compressing sets of small integers then a single BitArray with, say 256 bits, could represent any set of integers in the range [0..255]. Of course, if your typical set has only 5 integers in it then this approach would actually expand your storage requirements; you'll have to figure out the right size of such arrays from your own knowledge of your sets.
I'd suggest also looking at your data as sets of integers, so your example 1,2,3,8,11,12,13,14 would be represented by setting on the corresponding bits in a BitArray. Your query operations then reduce to intersection between a test BitArray and your data BitArray.
Incidentally, I think your example 2, which transforms 2 -> true, would be better staying in the domain of functions that map sets of integers to sets of integers, ie it should transform 2 -> 2. If you want to, write a different method which returns a boolean.
I guess you'd need to write code to pack integers into BitArrays and to unpack BitArrays into integers, but that's part of the cost of compression.

Compress small string

Maybe there are any way to compress small strings(86 chars) to something smaller?
#a#1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs
The only way I see is to replace the recurring characters on a unique character.
But i can't find something about that in google.
Thanks for any reply.

http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding would probably be pretty good start. In general the idea is to replace individual characters with the smallest bit pattern needed to replicate the original string or dataset.
You'll want to run statistical analysis on a variety of 'small strings' to find the most common characters so that the more common characters will be represented with the smallest unique bit patterns. And possibly makeup a 'example' small string with every character that will need to be represented (like a-z0-9#.0-)

I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it and there are numbers coded in it. You could develop a representation of the expected data that is shorter. You can encode it as a stream of bits, and the first bit could indicate that what follows is straight bytes in the case that the data you got was not what was expected.
Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can make a dictionary of previous messages to use as a basis for compression, which can be reconstructed at the other end by the previous messages received. That may offer dramatically improved compression if they messages really are similar.

You should look up RUN-LENGTH ENCODING. Here is a demonstration
rrrrrunnnnnn BECOMES 5r1u6n WHAT? truncate repetitions: for x consecutive r use xr
Now what if some of the characters are digits? Then instead of using x, use the character whose ASCII value is x. for example,
if you have 43 consecutive P, write +P because '+' has ASCII code 43. If you have 49 consecutive y, write 1y because '1' has ASCII code 49.
Now the catch, which you will find with all compression algorithms, is if you have a string with little or no repetitions. Then in that case your code may be longer than the original word. But that's true for all compression algorithms.
NOTE:
I don't encourage using Huffman coding because even if you use the Ziv-Lempel implementation, it's still a lot of work to get it right.

Dissolve string bytes into a fixed length formula based pattern by using keys, and even extract those bytes

Suppose there is a string containing 255 characters. And there is a fixed length assume 64-128 bytes a kind of byte pattern. I want to "dissolve" that string with 255 characters, byte by byte into the other fixed length byte pattern. The byte pattern is like a formula based "hash" or something similar into which a formula based algorithm dissolves the bytes into it. Later, when I am required to extract the dissolved bytes from that fixed length pattern, I would use the same algorithm's reverse, or extract function. The algorithm works through special keys or passwords and uses them to dissolve the bytes into the pattern, the same keys are used to extract the bytes in their original value from the pattern. I ask for help from the coders here. Please also guide me with steps so that I be able to understand what steps are to be taken, what to do. I only know VB .NET and C#.
For instance:
I have this three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve", "submerge" the characters "A", "B", "C", one by one into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like the hash:
EJS83HDLG89DB2G9L47
I would be able to extract the characters from the last dissolved character to the first by using an extraction algorhythm and the original keys which were used to dissolve the characters into this super pattern. After the extraction of all the characters, the super pattern resets to the original initial state. Each character insert and remove has a unique pattern state.
After extraction of all characters, the super pattern goes back to the original state. This happens upon the removal of the character by the extraction algo:
AJE83HDL389SB4VS9L3

This looks a lot like your previous question(s). The problem with them is that you seem to start asking from a half-baked solution.
So, what do you really want? Input , Output, Constraints?
To encrypt a string, use Encryption (Reijndael). To transform the resulting byte[] data to a string (for transport), use base64.

If you're happy having the 'keys' for the individual bits of data being determined for you, this can be done similarly to a one-time-pad (though it's not one-time!) - generate a random string as your 'base', then xor your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't result in output data that's any smaller than the input, however (and this is impossible in the general case anyway), if that's what you're going for.
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible.

Here are 2 cases
Lossless compression (exact bytes are decoded from compressed info)
In this case Shannon Entropy
clearly states that there can't be any algorithm which could compress data to rates greater than information entropy predicts.
Loosy compression (some original bytes are lost forever in compression scheme,- such as used in JPG image files (Do you remember setting of 'image quality' ??))
In this type of compression, you however can make better and better compression scheme with penalty that you loose more and more original bytes.
(Down to example of compression to zero bytes, where zero bytes are restored after, but this compression is invented either - magical button DELETE - moves information to black hole (sorry for sarcasm );)

Space-efficient in-memory structure for sorted text supporting prefix searches

I have a problem: I need space-efficient lookup of file-system data based of file path prefix. Prefix searching of sorted text, in other words. Use a trie, you say, and I thought the same thing. Trouble is, tries are not space-efficient enough, not without other tricks.
I have a fair amount of data:
about 450M in a plain-text Unix-format listing on disk
about 8 million lines
gzip default compresses to 31M
bzip2 default compresses to 21M
I don't want to be eating anywhere close to 450M in memory. At this point I'd be happy to be using somewhere around 100M, since there's lots of redundancy in the form of prefixes.
I'm using C# for this job, and a straightforward implementation of a trie will still require one leaf node for every line in the file. Given that every leaf node will require some kind of reference to the final chunk of text (32 bits, say an index into an array of string data to minimize string duplication), and CLR object overhead is 8 bytes (verified using windbg / SOS), I'll be spending >96,000,000 bytes in structural overhead with no text storage at all.
Let's look at some of the statistical attributes of the data. When stuffed in a trie:
total unique "chunks" of text about 1.1 million
total unique chunks about 16M on disk in a text file
average chunk length is 5.5 characters, max 136
when not taking into account duplicates, about 52 million characters total in chunks
Internal trie nodes average about 6.5 children with a max of 44
about 1.8M interior nodes.
Excess rates of leaf creation is about 15%, excess interior node creation is 22% - by excess creation, I mean leaves and interior nodes created during trie construction but not in the final trie as a proportion of the final number of nodes of each type.
Here's a heap analysis from SOS, indicating where the most memory is getting used:
[MT ]--[Count]----[ Size]-[Class ]
03563150 11 1584 System.Collections.Hashtable+bucket[]
03561630 24 4636 System.Char[]
03563470 8 6000 System.Byte[]
00193558 425 74788 Free
00984ac8 14457 462624 MiniList`1+<GetEnumerator>d__0[[StringTrie+Node]]
03562b9c 6 11573372 System.Int32[]
*009835a0 1456066 23297056 StringTrie+InteriorNode
035576dc 1 46292000 Dictionary`2+Entry[[String],[Int32]][]
*035341d0 1456085 69730164 System.Object[]
*03560a00 1747257 80435032 System.String
*00983a54 8052746 96632952 StringTrie+LeafNode
The Dictionary<string,int> is being used to map string chunks to indexes into a List<string>, and can be discarded after trie construction, though GC doesn't seem to be removing it (a couple of explicit collections were done before this dump) - !gcroot in SOS doesn't indicate any roots, but I anticipate that a later GC would free it.
MiniList<T> is a replacement for List<T> using a precisely-sized (i.e. linear growth, O(n^2) addition performance) T[] to avoid space wastage; it's a value type and is used by InteriorNode to track children. This T[] is added to the System.Object[] pile.
So, if I tot up the "interesting" items (marked with *), I get about 270M, which is better than raw text on disk, but still not close enough to my goal. I figured that .NET object overhead was too much, and created a new "slim" trie, using just value-type arrays to store data:
class SlimTrie
{
byte[] _stringData; // UTF8-encoded, 7-bit-encoded-length prefixed string data
// indexed by _interiorChildIndex[n].._interiorChildIndex[n]+_interiorChildCount[n]
// Indexes interior_node_index if negative (bitwise complement),
// leaf_node_group if positive.
int[] _interiorChildren;
// The interior_node_index group - all arrays use same index.
byte[] _interiorChildCount;
int[] _interiorChildIndex; // indexes _interiorChildren
int[] _interiorChunk; // indexes _stringData
// The leaf_node_index group.
int[] _leafNodes; // indexes _stringData
// ...
}
This structure has brought down the amount of data to 139M, and is still an efficiently traversable trie for read-only operations. And because it's so simple, I can trivially save it to disk and restore it to avoid the cost of recreating the trie every time.
So, any suggestions for more efficient structures for prefix search than trie? Alternative approaches I should consider?

Since there are only 1.1 million chunks, you can index a chunk using 24 bits instead of 32 bits and save space there.
You could also compress the chunks. Perhaps Huffman coding is a good choice. I would also try the following strategy: instead of using a character as a symbol to encode, you should encode character transitions. So instead of looking at the probability of a character appearing, look at the probability of the transition in a Markov chain where the state is the current character.

You can find a scientific paper connected to your problem here (citation of the authors: "Experiments show that our index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip, bzip or ppmdi." - but unfortunately the paper is payment only). I'm not sure how difficult these ideas are to implement. The authors of this paper have a website where you can find also implementations (under "Index Collection") of various compressed index algorithms.
If you want to go on with your approach, make sure to check out the websites about Crit-bit trees and Radix tree.

Off-the-wall idea: Instead of a trie a hash table. You'd have in memory just the hash and the string data, perhaps compressed.
Or can you afford one page read? Only hash and file position in memory, retrieve the "page" with lines matching that hash, presumably small number of ordered lines, hence very quick to search in the event of collisions.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.