Searching for strings in big binary files - c#

I have a big binary file (1 MB < size < 50 MB). I need to search for a string and extract the subsequent four bytes (which are the {size, offset} of the actual data in another file). What is the most efficient way to do this so that the search is as fast as possible?
EDIT: The strings in the index files are in sorted order.

Look up the Boyer–Moore string search algorithm.
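For a one-off scan of a 1-50 MB file you can read the whole thing into a byte[] and search it directly. Below is a minimal sketch of the simpler Boyer-Moore-Horspool variant (not a library API, just an illustration); the four {size, offset} bytes you want would start right after the match:
// Minimal Boyer-Moore-Horspool sketch over a byte buffer (illustrative only).
static int IndexOf(byte[] haystack, byte[] needle)
{
    if (needle.Length == 0 || haystack.Length < needle.Length)
        return -1;

    // Bad-character table: how far we can safely shift on a mismatch.
    int[] shift = new int[256];
    for (int i = 0; i < 256; i++)
        shift[i] = needle.Length;
    for (int i = 0; i < needle.Length - 1; i++)
        shift[needle[i]] = needle.Length - 1 - i;

    int pos = 0;
    while (pos <= haystack.Length - needle.Length)
    {
        int j = needle.Length - 1;
        while (j >= 0 && haystack[pos + j] == needle[j])
            j--;
        if (j < 0)
            return pos;                                   // full match found
        pos += shift[haystack[pos + needle.Length - 1]];  // shift by the last byte of the window
    }
    return -1;
}
With data = File.ReadAllBytes(path), a hit at index i would put the four bytes you need at i + needle.Length.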

Store the {string, size, offset} tuples in sorted order (by string) and use a binary search for the string.
You might also store, at the start of the file, offsets for each first letter of strings. For example if strings starting with 'a' began at position 120 and those starting with 'b' began at position 2000 in the file you could start the file with something like 120, 2000, ...
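A sketch of that layout, assuming fixed-size records of a padded key followed by two Int32 values; KeyLength and the record format here are assumptions, not something given in the question:
// Sketch: binary search over fixed-size {key, size, offset} records in the index file.
const int KeyLength = 16;                         // fixed-width, padded key (assumption)
const int RecordSize = KeyLength + 4 + 4;         // key + Int32 size + Int32 offset

static bool TryFind(FileStream index, byte[] key, out int size, out int offset)
{
    long lo = 0, hi = index.Length / RecordSize - 1;
    byte[] record = new byte[RecordSize];

    while (lo <= hi)
    {
        long mid = lo + (hi - lo) / 2;
        index.Seek(mid * RecordSize, SeekOrigin.Begin);
        index.Read(record, 0, RecordSize);

        int cmp = CompareKeys(record, key);       // byte-wise compare of the key part
        if (cmp == 0)
        {
            size   = BitConverter.ToInt32(record, KeyLength);
            offset = BitConverter.ToInt32(record, KeyLength + 4);
            return true;
        }
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    size = offset = 0;
    return false;
}

static int CompareKeys(byte[] record, byte[] key)  // key must be padded to KeyLength
{
    for (int i = 0; i < key.Length; i++)
        if (record[i] != key[i]) return record[i].CompareTo(key[i]);
    return 0;
}
The first-letter offset table mentioned above would simply narrow lo and hi before the loop starts.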

If the encoding is fixed (ASCII), it is relatively simple: open a binary stream, read byte by byte, and match against the first character of the target string.
If the strings use another encoding (UTF-8, for example), it gets trickier.

First, use memory mapping on the file. It will be a lot more efficient than reading it into RAM because instead of two copies (one in your program and one in file cache) there is only one copy.
If each string is a fixed length then a binary search is very easy because you can treat the memory as an array of character arrays.
If each string is variable length but 0-terminated, then you can use a variant of binary search where you jump to the middle of the string list, search for the next 0, then test the next string after that. Then jump forward or back to 1/4 or 3/4 of the string list and repeat (see the sketch at the end of this answer).
If each string is variable length in Pascal style, with a byte count at the beginning it is trickier. A linear search from the beginning is not too slow, for infrequent searches. If you're looking for exact string matches, do not forget that you can skip most strings just by checking that the lengths do not match.
If you have to search the list often then building an array of char pointers to the string list would again make binary search really easy. If this file is really an index file for fast searches then it probably already has this in it somewhere, unless the designer intended to build a char pointer array while loading the file.
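Here is a sketch of that 0-terminated variant: bisect on raw byte offsets and finish with a linear scan once the window is small. It assumes the file is just the sorted, 0-terminated ASCII strings described above, and all names are illustrative:
// Bisect on raw byte offsets over 0-terminated, sorted ASCII strings in a
// memory-mapped file, then finish linearly once the window is small.
static long? FindString(string path, string target)
{
    long length = new FileInfo(path).Length;
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
    {
        long lo = 0, hi = length;                  // if present, target starts in [lo, hi)
        while (hi - lo > 4096)
        {
            long mid = lo + (hi - lo) / 2;
            long start = mid;
            while (start < hi && view.ReadByte(start) != 0) start++;   // skip the partial string
            start++;                                                   // next string starts here
            if (start >= hi) { hi = mid + 1; continue; }

            string candidate = ReadZeroTerminated(view, start, length);
            int cmp = string.CompareOrdinal(candidate, target);
            if (cmp == 0) return start;            // found; caller reads whatever follows this entry
            if (cmp < 0) lo = start; else hi = mid + 1;
        }

        // Small window: lo is always a string boundary, so walk it linearly.
        for (long pos = lo; pos < hi; )
        {
            string s = ReadZeroTerminated(view, pos, length);
            if (s == target) return pos;
            if (string.CompareOrdinal(s, target) > 0) return null;     // passed it; not present
            pos += s.Length + 1;                   // 1 byte per char (ASCII assumption) + terminator
        }
        return null;
    }
}

static string ReadZeroTerminated(MemoryMappedViewAccessor view, long pos, long length)
{
    var sb = new StringBuilder();
    while (pos < length)
    {
        byte b = view.ReadByte(pos++);
        if (b == 0) break;
        sb.Append((char)b);                        // ASCII assumption
    }
    return sb.ToString();
}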

Related

Identifying repeating sequences of data in byte array

Given a sample of hexadecimal data, I would like to identify UNKNOWN sequences of bytes that are repeated throughout the sample. (Not searching for a known string or value) I am attempting to reverse engineer a network protocol, and I am working on determining data structures within the packet. As an example of what I'm trying to do (albeit on a smaller scale):
(af:b6:ea:3d:83:02:00:00):{21:03:00:00}:[b3:49:96:23:01]
{21:03:00:00}:(af:b6:ea:3d:83:02:00:00):01:42:00:00:00:00:01:57
And
(38:64:88:6e:83:02:00:00):{26:03:00:00}:[b3:49:96:23:01]
{26:03:00:00}:(38:64:88:6e:83:02:00:00):01:42:00:00:00:00:00:01
Obviously, these are easy to spot by eye, but patterns that are hundreds of chars into the data are not. I'm not expecting a magic bullet for the solution, just a nudge in the right direction, or even better, a premade tool.
I'm currently needing this for a C# project, but I am open to any and all tools.
If you have no idea what you are looking for, you can get an idea of the layout of the data by performing a negative entropy analysis on a reasonably large sample of conversations to see the lengths of the records/sub-records.
If the data is structured with repeated sequences of roughly the same length and content type you should see clusters of values with nearly the same negative entropy around the length of the record and sub records.
For example if you put a basic file with a lot of the same data through that, you should see values around the average record length with comparable negentropies (ex: if you use a CSV file with an average line length of 117 bytes, you might see 115, 116, 117 & 119 with the highest negentropy), and values around the most common field lengths with the same negentropy.
You might also do a byte occurrence scan to see which byte values are likely separators; a minimal sketch follows this answer.
There is a free hex editor with source code that does this for you (Hexplorer, in the Crypto/Find Pattern menu). You may have to change the default font through Options to actually see something in the UI.
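A minimal sketch of such a byte-occurrence scan: it counts every byte value in a captured sample (the file name is illustrative) and prints the ten most frequent values, which are often separators or padding:
// Count byte values in a captured sample and list the most frequent ones.
byte[] sample = File.ReadAllBytes("capture.bin");
long[] counts = new long[256];
foreach (byte b in sample)
    counts[b]++;

var mostFrequent = Enumerable.Range(0, 256)
    .OrderByDescending(v => counts[v])
    .Take(10);
foreach (int v in mostFrequent)
    Console.WriteLine("0x{0:X2}: {1}", v, counts[v]);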

Dissolve string bytes into a fixed length formula based pattern by using keys, and even extract those bytes

Suppose there is a string containing 255 characters, and there is a fixed-length byte pattern of, say, 64-128 bytes. I want to "dissolve" that 255-character string, byte by byte, into the other fixed-length byte pattern. The byte pattern is like a formula-based "hash" or something similar, into which a formula-based algorithm dissolves the bytes. Later, when I need to extract the dissolved bytes from that fixed-length pattern, I would use the same algorithm's reverse, or extract, function. The algorithm works with special keys or passwords and uses them to dissolve the bytes into the pattern; the same keys are used to extract the bytes with their original values from the pattern. I ask for help from the coders here. Please also guide me with the steps to take so that I can understand what needs to be done. I only know VB .NET and C#.
For instance:
I have these three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve", "submerge" the characters "A", "B", "C", one by one into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like the hash:
EJS83HDLG89DB2G9L47
I would be able to extract the characters, from the last dissolved character to the first, by using an extraction algorithm and the original keys which were used to dissolve the characters into this super pattern. After the extraction of all the characters, the super pattern resets to the original initial state. Each character insertion and removal has a unique pattern state.
After extraction of all characters, the super pattern goes back to the original state. This happens upon the removal of the character by the extraction algo:
AJE83HDL389SB4VS9L3
This looks a lot like your previous question(s). The problem with them is that you seem to start asking from a half-baked solution.
So, what do you really want? Input, output, constraints?
To encrypt a string, use encryption (Rijndael). To transform the resulting byte[] data to a string (for transport), use Base64.
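If you go that route, a rough sketch looks like this, using AES (the standardized form of Rijndael) plus Base64; key/IV management is deliberately glossed over and the method names are illustrative:
// Encrypt a string with AES and Base64-encode the ciphertext for transport.
static string EncryptToBase64(string plaintext, byte[] key, byte[] iv)
{
    using (var aes = Aes.Create())
    using (var encryptor = aes.CreateEncryptor(key, iv))
    {
        byte[] input = Encoding.UTF8.GetBytes(plaintext);
        byte[] cipher = encryptor.TransformFinalBlock(input, 0, input.Length);
        return Convert.ToBase64String(cipher);
    }
}

static string DecryptFromBase64(string base64, byte[] key, byte[] iv)
{
    using (var aes = Aes.Create())
    using (var decryptor = aes.CreateDecryptor(key, iv))
    {
        byte[] cipher = Convert.FromBase64String(base64);
        byte[] plain = decryptor.TransformFinalBlock(cipher, 0, cipher.Length);
        return Encoding.UTF8.GetString(plain);
    }
}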
If you're happy having the 'keys' for the individual bits of data being determined for you, this can be done similarly to a one-time-pad (though it's not one-time!) - generate a random string as your 'base', then xor your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't result in output data that's any smaller than the input, however (and this is impossible in the general case anyway), if that's what you're going for.
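A minimal sketch of that XOR scheme, purely illustrative and with no security claims (the 'base' must be at least as long as the data):
// XOR data against a fixed random 'base'; XOR is its own inverse, so the same
// call recovers the original: key = Xor(baseBytes, data); data = Xor(baseBytes, key).
static byte[] Xor(byte[] baseBytes, byte[] data)
{
    byte[] result = new byte[data.Length];
    for (int i = 0; i < data.Length; i++)
        result[i] = (byte)(data[i] ^ baseBytes[i]);
    return result;
}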
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible.
Here are 2 cases
Lossless compression (exact bytes are decoded from compressed info)
In this case, Shannon entropy clearly states that there cannot be any algorithm that compresses data to a rate better than the information entropy predicts (a quick way to estimate that bound for your own data is sketched after these two cases).
Lossy compression (some original bytes are lost forever in the compression scheme, as used in JPG image files; remember the 'image quality' setting?)
In this type of compression you can keep making the scheme better and better, with the penalty that you lose more and more of the original bytes.
(Taken down to the extreme example of compression to zero bytes, where the zero bytes are "restored" afterwards; that compression has already been invented too: the magical DELETE button, which moves the information into a black hole. Sorry for the sarcasm ;))
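For a concrete feel for the lossless bound mentioned above, here is a quick, illustrative way to compute the Shannon entropy of a byte buffer in bits per byte; a value near 8 means the data is already close to incompressible:
// Shannon entropy of a byte buffer, in bits per byte.
static double Entropy(byte[] data)
{
    long[] counts = new long[256];
    foreach (byte b in data)
        counts[b]++;

    double entropy = 0;
    foreach (long c in counts)
    {
        if (c == 0) continue;
        double p = (double)c / data.Length;
        entropy -= p * Math.Log(p, 2);
    }
    return entropy;
}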

C# code to perform Binary search in a very big text file

Is there a library that I can use to perform a binary search in a very big text file (which can be 10 GB)?
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary - I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
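A sketch of that approach, with illustrative names; ParseTimestamp stands in for whatever parsing your log format needs, '\n' line endings and ASCII-compatible text are assumed, and the loop simply stops once bisection no longer reaches a new record (the answer above suggests reading on to the next record instead):
// Seek to the middle, back up to the start of the line, compare its timestamp,
// and halve the window.
static long FindRecord(FileStream fs, DateTime target)
{
    long lo = 0, hi = fs.Length, lastStart = -1;
    while (lo < hi)
    {
        long mid = lo + (hi - lo) / 2;
        long start = StartOfLine(fs, mid);
        if (start == lastStart)
            break;                                 // bisection no longer reaches a new record
        lastStart = start;

        string line = ReadLineAt(fs, start);
        DateTime stamp = ParseTimestamp(line);     // stand-in for your log-specific parsing

        if (stamp < target) lo = mid + 1;
        else hi = mid;
    }
    return lastStart;                              // start offset of (roughly) the first match
}

static long StartOfLine(FileStream fs, long pos)
{
    while (pos > 0)
    {
        fs.Seek(pos - 1, SeekOrigin.Begin);
        if (fs.ReadByte() == '\n') break;          // the previous byte ends the preceding line
        pos--;
    }
    return pos;
}

static string ReadLineAt(FileStream fs, long start)
{
    fs.Seek(start, SeekOrigin.Begin);
    var sb = new StringBuilder();
    int b;
    while ((b = fs.ReadByte()) != -1 && b != '\n')
        sb.Append((char)b);
    return sb.ToString().TrimEnd('\r');
}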
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
// Every line is assumed to be exactly lineLength bytes, so record n starts at n * lineLength.
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the lines of text are on average; given 1000 bytes per line, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
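A sketch of those two steps; ParseTimestamp is again a stand-in for your log-specific parsing, and the byte-by-byte scan is kept simple rather than fast:
// Step 1: offsets of line starts - position 0 plus the byte after every '\n'.
var offsets = new List<long> { 0 };
using (var fs = new FileStream("big.log", FileMode.Open, FileAccess.Read))
{
    int b;
    while ((b = fs.ReadByte()) != -1)
        if (b == '\n' && fs.Position < fs.Length)
            offsets.Add(fs.Position);

    // Step 2: binary search with a comparer that reads the record at each offset.
    var comparer = new OffsetComparer(fs, target);   // target = the DateTime you want
    int index = offsets.BinarySearch(0, comparer);   // the second argument is ignored by the comparer
    // index >= 0: exact hit; otherwise ~index is the insertion point (first later record).
}

// Compares the record that starts at a given file offset against the target timestamp.
class OffsetComparer : IComparer<long>
{
    private readonly FileStream _fs;
    private readonly DateTime _target;
    public OffsetComparer(FileStream fs, DateTime target) { _fs = fs; _target = target; }

    public int Compare(long lineOffset, long ignored)
    {
        _fs.Seek(lineOffset, SeekOrigin.Begin);
        var sb = new StringBuilder();
        int b;
        while ((b = _fs.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);
        DateTime stamp = ParseTimestamp(sb.ToString());   // stand-in for your log format
        return stamp.CompareTo(_target);
    }
}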
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating "index" file:
Scan the initial file and take the datetime parts plus their positions in the original file (this is why it has to be pretty static). Encode them somehow, for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with consistent "lines".
perform a binary search on that file (you may need to be a bit creative in order to achieve a range search) and get the relevant location(s) in the original file
read directly from the original file starting from the given location / read the given range
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time on creating the index file than you save on queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file. This way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
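A sketch of that EOL index, under the assumptions in this answer (zero-filled 10-digit positions, '\n' line endings); the file names are illustrative:
const int Digits = 10;
const int EntrySize = Digits + 1;                  // 10 zero-filled digits + '\n'

// Build the index: one fixed-width entry per newline in the data file, holding
// the position just after it, i.e. where the next record starts.
using (var data = File.OpenRead("app.log"))
using (var index = new StreamWriter("app.log.idx"))
{
    int b;
    while ((b = data.ReadByte()) != -1)
        if (b == '\n')
            index.Write(data.Position.ToString().PadLeft(Digits, '0') + "\n");
}

// Because entries are fixed width, entry n can be read directly at n * EntrySize,
// which is what lets you binary search the data file by record number.
static long EntryAt(FileStream index, long n)
{
    var buffer = new byte[Digits];
    index.Seek(n * EntrySize, SeekOrigin.Begin);
    index.Read(buffer, 0, Digits);
    return long.Parse(Encoding.ASCII.GetString(buffer));
}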
The List object has a Binary Search method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx

Matching a string in a Large text file?

I have a list of strings containing about 7 million items in a text file of size 152MB. I was wondering what would be the best way to implement a function that takes a single string and returns whether it is in that list of strings.
Are you going to have to match against this text file several times? If so, I'd create a HashSet<string>. Otherwise, just read it line by line (I'm assuming there's one string per line) and see whether it matches.
152MB of ASCII will end up as over 300MB of Unicode data in memory - but modern machines have plenty of memory, so keeping the whole lot in a HashSet<string> will make repeated lookups very fast indeed.
The absolute simplest way to do this is probably to use File.ReadAllLines, although that will create an array which will then be discarded - not great for memory usage, but probably not too bad:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines("data.txt"));
...
if (strings.Contains(stringToCheck))
{
...
}
It depends what you want to do. If you want to repeat the search for matches again and again, I'd load the whole file into memory (into a HashSet); there it is very easy to search for matches.

Space-efficient in-memory structure for sorted text supporting prefix searches

I have a problem: I need space-efficient lookup of file-system data based on file path prefix. Prefix searching of sorted text, in other words. Use a trie, you say, and I thought the same thing. Trouble is, tries are not space-efficient enough, not without other tricks.
I have a fair amount of data:
about 450M in a plain-text Unix-format listing on disk
about 8 million lines
gzip default compresses to 31M
bzip2 default compresses to 21M
I don't want to be eating anywhere close to 450M in memory. At this point I'd be happy to be using somewhere around 100M, since there's lots of redundancy in the form of prefixes.
I'm using C# for this job, and a straightforward implementation of a trie will still require one leaf node for every line in the file. Given that every leaf node will require some kind of reference to the final chunk of text (32 bits, say an index into an array of string data to minimize string duplication), and CLR object overhead is 8 bytes (verified using windbg / SOS), I'll be spending >96,000,000 bytes in structural overhead with no text storage at all.
Let's look at some of the statistical attributes of the data. When stuffed in a trie:
total unique "chunks" of text about 1.1 million
total unique chunks about 16M on disk in a text file
average chunk length is 5.5 characters, max 136
when not taking into account duplicates, about 52 million characters total in chunks
Internal trie nodes average about 6.5 children with a max of 44
about 1.8M interior nodes.
Excess rates of leaf creation is about 15%, excess interior node creation is 22% - by excess creation, I mean leaves and interior nodes created during trie construction but not in the final trie as a proportion of the final number of nodes of each type.
Here's a heap analysis from SOS, indicating where the most memory is getting used:
[MT ]--[Count]----[ Size]-[Class ]
03563150 11 1584 System.Collections.Hashtable+bucket[]
03561630 24 4636 System.Char[]
03563470 8 6000 System.Byte[]
00193558 425 74788 Free
00984ac8 14457 462624 MiniList`1+<GetEnumerator>d__0[[StringTrie+Node]]
03562b9c 6 11573372 System.Int32[]
*009835a0 1456066 23297056 StringTrie+InteriorNode
035576dc 1 46292000 Dictionary`2+Entry[[String],[Int32]][]
*035341d0 1456085 69730164 System.Object[]
*03560a00 1747257 80435032 System.String
*00983a54 8052746 96632952 StringTrie+LeafNode
The Dictionary<string,int> is being used to map string chunks to indexes into a List<string>, and can be discarded after trie construction, though GC doesn't seem to be removing it (a couple of explicit collections were done before this dump) - !gcroot in SOS doesn't indicate any roots, but I anticipate that a later GC would free it.
MiniList<T> is a replacement for List<T> using a precisely-sized (i.e. linear growth, O(n^2) addition performance) T[] to avoid space wastage; it's a value type and is used by InteriorNode to track children. This T[] is added to the System.Object[] pile.
So, if I tot up the "interesting" items (marked with *), I get about 270M, which is better than raw text on disk, but still not close enough to my goal. I figured that .NET object overhead was too much, and created a new "slim" trie, using just value-type arrays to store data:
class SlimTrie
{
byte[] _stringData; // UTF8-encoded, 7-bit-encoded-length prefixed string data
// indexed by _interiorChildIndex[n].._interiorChildIndex[n]+_interiorChildCount[n]
// Indexes interior_node_index if negative (bitwise complement),
// leaf_node_group if positive.
int[] _interiorChildren;
// The interior_node_index group - all arrays use same index.
byte[] _interiorChildCount;
int[] _interiorChildIndex; // indexes _interiorChildren
int[] _interiorChunk; // indexes _stringData
// The leaf_node_index group.
int[] _leafNodes; // indexes _stringData
// ...
}
This structure has brought down the amount of data to 139M, and is still an efficiently traversable trie for read-only operations. And because it's so simple, I can trivially save it to disk and restore it to avoid the cost of recreating the trie every time.
So, any suggestions for more efficient structures for prefix search than trie? Alternative approaches I should consider?
Since there are only 1.1 million chunks, you can index a chunk using 24 bits instead of 32 bits and save space there.
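For example, packed 24-bit indexes could look something like this sketch (valid for values below 2^24, which covers the 1.1 million chunks):
// Pack chunk indexes three bytes apiece instead of a full Int32, saving one byte per reference.
static void Write24(byte[] buffer, int slot, int value)
{
    int pos = slot * 3;
    buffer[pos]     = (byte)(value & 0xFF);
    buffer[pos + 1] = (byte)((value >> 8) & 0xFF);
    buffer[pos + 2] = (byte)((value >> 16) & 0xFF);
}

static int Read24(byte[] buffer, int slot)
{
    int pos = slot * 3;
    return buffer[pos] | (buffer[pos + 1] << 8) | (buffer[pos + 2] << 16);
}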
You could also compress the chunks. Perhaps Huffman coding is a good choice. I would also try the following strategy: instead of using a character as a symbol to encode, you should encode character transitions. So instead of looking at the probability of a character appearing, look at the probability of the transition in a Markov chain where the state is the current character.
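As a starting point, gathering the transition statistics might look like this sketch; feeding them into a per-state Huffman (or arithmetic) coder is the part left out here:
// Count character transitions across the unique text chunks: for each previous
// character, how often each next character follows it.
static Dictionary<char, Dictionary<char, int>> CountTransitions(IEnumerable<string> chunks)
{
    var transitions = new Dictionary<char, Dictionary<char, int>>();
    foreach (string chunk in chunks)
    {
        char prev = '\0';                          // '\0' marks "start of chunk"
        foreach (char c in chunk)
        {
            if (!transitions.TryGetValue(prev, out var row))
                transitions[prev] = row = new Dictionary<char, int>();
            row[c] = row.TryGetValue(c, out int n) ? n + 1 : 1;
            prev = c;
        }
    }
    return transitions;
}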
You can find a scientific paper connected to your problem here (citation from the authors: "Experiments show that our index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip, bzip or ppmdi." - unfortunately the paper is behind a paywall). I'm not sure how difficult these ideas are to implement. The authors of this paper have a website where you can also find implementations (under "Index Collection") of various compressed index algorithms.
If you want to go on with your approach, make sure to check out the websites about Crit-bit trees and Radix tree.
Off-the-wall idea: instead of a trie, a hash table. You'd have in memory just the hash and the string data, perhaps compressed.
Or can you afford one page read? Keep only the hash and a file position in memory, then retrieve the "page" of lines matching that hash: presumably a small number of ordered lines, hence very quick to search in the event of collisions.
