Say I have a large byte[] and I'm not only looking to see if, but also where, a smaller byte[] is in the larger array. For example:
byte[] large = new byte[100];
for (byte i = 0; i < 100; i++) {
    large[i] = i;
}
byte[] small = new byte[] { 23, 24, 25 };
int loc = large.IndexOf(small); // this is what I want to write
I guess I'm asking about looking for a sequence of any type (primitive or otherwise) within a larger sequence.
I faintly remember reading about a specific approach to this in strings, but I don't remember the name of the algorithm. I could easily write some way to do this, but I know there's a good solution and it's on the tip of my tongue. If there's some .Net method that does this, I'll take that too (although I'd still appreciate the name of the searching algorithm for education's sake).
You can do it with LINQ, like this:
var res = Enumerable.Range(0, large.Length - small.Length + 1)
    .Cast<int?>()
    .FirstOrDefault(n => large.Skip(n.Value).Take(small.Length).SequenceEqual(small));
if (res != null) {
    Console.WriteLine("Found at {0}", res.Value);
} else {
    Console.WriteLine("Not found");
}
The approach is self-explanatory except for the Cast<int?> part: you need it to distinguish between finding the result at the very start of the large array, when zero is returned, and not finding the result at all, when null is returned.
Here is a demo on ideone.
The complexity of the above is O(M*N), where M and N are the lengths of the large and small arrays. If the large array is very long and contains a significant number of "almost right" sub-sequences that match long prefixes of small, you may be better off implementing an advanced sequence-searching algorithm such as the Knuth–Morris–Pratt (KMP) algorithm. KMP speeds up the search by exploiting the observation that when a mismatch occurs, the small sequence itself tells you how far ahead you can move in the large sequence, based on where in the small sequence the first mismatch occurred. A look-up table is prepared for the small sequence, and that table is then used throughout the search to decide how far to advance the search point. The complexity of KMP is O(N+M). See the Wikipedia article linked above for pseudocode of the KMP algorithm.
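For reference, here is a minimal KMP-style sketch for byte arrays (the method name IndexOfSequence is mine for illustration, not a .NET API):
static int IndexOfSequence(byte[] data, byte[] pattern)
{
    if (pattern.Length == 0) return 0;

    // Failure table: fail[i] = length of the longest proper prefix of
    // pattern[0..i] that is also a suffix of it.
    int[] fail = new int[pattern.Length];
    for (int i = 1, k = 0; i < pattern.Length; i++)
    {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) k++;
        fail[i] = k;
    }

    // Scan the data, using the table to decide how far to fall back on a mismatch.
    for (int i = 0, k = 0; i < data.Length; i++)
    {
        while (k > 0 && data[i] != pattern[k]) k = fail[k - 1];
        if (data[i] == pattern[k]) k++;
        if (k == pattern.Length) return i - pattern.Length + 1;
    }
    return -1;
}
With the example from the question, IndexOfSequence(large, small) would return 23.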
Are you thinking of lambda expressions? That is what came to my mind when you mentioned a specific approach for strings.
http://www.dotnetperls.com/array-find
I want to know if there is a fast way to convert a whole list of strings into a single unique SHA512 hash string.
For now I use this method to get a unique SHA512 hash, but it becomes slower and slower as the list contains more and more strings.
string hashDataList = string.Empty;
for (int i = 0; i < ListOfElement.Count; i++)
{
    if (i < ListOfElement.Count)
    {
        hashDataList += ListOfElement[i];
    }
}
hashDataList = MakeHash(hashDataList);
Console.WriteLine("Hash: "+hashDataList);
Edit:
Method for making the hash:
public static string MakeHash(string str)
{
    using (var hash = SHA512.Create())
    {
        var bytes = Encoding.UTF8.GetBytes(str);
        var hashedInputBytes = hash.ComputeHash(bytes);
        var hashedInputStringBuilder = new StringBuilder(128);
        foreach (var b in hashedInputBytes)
            hashedInputStringBuilder.Append(b.ToString("X2"));
        str = hashedInputStringBuilder.ToString();
        hashedInputStringBuilder.Clear();
        GC.SuppressFinalize(bytes);
        GC.SuppressFinalize(hashedInputBytes);
        GC.SuppressFinalize(hashedInputStringBuilder);
        return str;
    }
}
Try this, using the built-in SHA512 class:
StringBuilder sb = new StringBuilder();
foreach (string s in ListOfElement)
{
    sb.Append(s);
}
using (var sha512 = new System.Security.Cryptography.SHA512CryptoServiceProvider())
{
    hashDataList = BitConverter.ToString(sha512.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString())))
        .Replace("-", String.Empty).ToUpper();
}
Console.WriteLine("Hash: " + hashDataList);
Performance depends a lot on MakeHash() implementation as well.
I think the problem might be a bit misstated here. First, from a performance standpoint:
Any method of hashing a list of strings will take longer as the number (and length) of the strings increases. The only way to avoid this would be to ignore some of the data in (at least some of) the strings, and then you lose the assurances that a hash should give you.
So you can try to make the whole thing faster, so that you can process more (and/or longer) strings in an acceptable time frame. Without knowing the performance characteristics of the hashing function, we can't say if that's possible; but as farbiondriven's answer suggests, about the only plausible strategy is to assemble a single string and hash that once.
The potential objection to this, I suppose, would be: does it affect the uniqueness of the hash? There are two factors to consider:
First, if you just concatenate all the strings together, then you would get the same output hash for
["element one and ", "element two"]
as for
["element one ", "and element two"]
because the concatenated data is the same. One way to correct this is to insert each string's length before the string (with a delimiter to show the end of the length). For example you could build
"16:element one and 11:element two"
for the first array above, and
"12:element one 15:and element two"
for the second.
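A rough sketch of that length-prefixing step might look like this (the ':' delimiter and the helper name LengthPrefixedConcat are just for illustration):
using System.Collections.Generic;
using System.Text;

// Builds a single string where every element is preceded by its length,
// so ["element one and ", "element two"] and ["element one ", "and element two"]
// no longer produce the same concatenation.
static string LengthPrefixedConcat(IEnumerable<string> elements)
{
    var sb = new StringBuilder();
    foreach (string s in elements)
    {
        sb.Append(s.Length).Append(':').Append(s);
    }
    return sb.ToString();
}
The output of LengthPrefixedConcat is what you would then feed to the hash function once.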
The other possible concern (though it isn't really valid) could arise if the individual strings are never longer than a single SHA512 hash, but the total amount of data in the array is. In that case, your method (hashing each string and concatenating the hashes) might seem safer, because whenever you have data that's longer than the actual hash, it's mathematically possible for a hash collision to occur. But as I say, this concern is not valid for at least one, and possibly two, reasons.
The biggest reason is: hash collisions in a 512-bit hash are ridiculously unlikely. Even though the math says it could happen, it is beyond safe to assume that it literally never will. If you're going to worry about a hash collision at that level, you might as well also worry about your data being spontaneously corrupted due to RAM errors that occur in just such a pattern as to avoid detection. At that level of improbability, you simply can't program around a vast number of catastrophic things that "could" (but won't) happen, and you really might as well count hash collisions among them.
The second reason is: if you're paranoid enough not to buy the first reason, then how can you be sure that hashing shorter strings guarantees uniqueness?
What concatenating a hash per string does do, if the individual strings are shorter than 512 bits, is make the combined hash longer than the source data, which defeats the typical purpose of a hash. If that's acceptable, then you probably want an encryption algorithm instead of a hash.
I have to write a C# Forms program which has to load data from a file that looks something like this:
100ACTGGCTTACACTAATCAAG
101TTAAGGCACAGAAGTTTCCA
102ATGGTATAAACCAGAAGTCT
...
120GCATCAGTACGTACCCGTAC
20 lines, each formed of a number (ID) and 20 letters (DNA); the other file looks like this:
TGCAACGTGTACTATGGACC
In short, this is a game where a murder has been committed and there are 20 people; I have to load and split the letters, compare them, and in the end find the best match.
I have no idea how to do that: I don't know how to load the letters into an array, split them, and then compare them.
What you want to do here, is use something like a calculation of the Levenshtein distance between the strings.
In simple terms, that provides a count of how many single letters you have to change for a string to become equal to another. In the context of DNA or Proteins, this can be interpreted as representing the number of mutations between two individuals or samples. A shorter distance will therefore indicate a closer relationship between the two.
The algorithm can be fairly heavy computationally, but will give you a good answer. It's also quite fun and enlightening to implement. You can find a couple of ways of implementing it under the wikipedia article.
If you find it challenging to understand how it works, I recommend you set up an example grid by hand, with one short string horizontally along the top, and one vertically along the left side, and try going through the calculations manually, just to understand the concept properly (it can be confusing at first, but is really not that difficult).
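For reference, a straightforward (unoptimised) sketch of that dynamic-programming grid could look like this; the helper name Levenshtein is just for illustration:
using System;

// Classic dynamic-programming Levenshtein distance: the number of single-character
// edits (insertions, deletions, substitutions) needed to turn `a` into `b`.
static int Levenshtein(string a, string b)
{
    int[,] d = new int[a.Length + 1, b.Length + 1];

    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,          // deletion
                d[i, j - 1] + 1),         // insertion
                d[i - 1, j - 1] + cost);  // substitution
        }
    }
    return d[a.Length, b.Length];
}
For this game, you would compute the distance between the evidence string and the DNA part of each suspect line; the smallest distance is the best match.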
This is a simple match function. It might not be of the complexity your game requires. This solution does not require an explicit split on the strings in order to get an array of DNA "letters". The DNA is compared in place.
Compare each "suspect" entry to the "evidence one.
int idLength = 3;
string evidence = //read from file
List<string> suspects = //read from file
List<double> matchScores = new List<double>();
foreach (string suspect in suspects)
{
    int count = 0;
    for (int i = 0; i < evidence.Length; i++)
    {
        if (suspect[i + idLength] == evidence[i]) count++;
    }
    matchScores.Add(count * 100.0 / evidence.Length);
}
The matchScores list now contains all the individual match scores. I did not save the maximum match score in a separate variable as there can be several "suspects" with the same score. To find out which subject has the best match, just iterate the matchScores list. The index of the best match is the index of the suspect in the suspects list.
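For example, something like this sketch would pick out the index of the best score:
// Index of the suspect with the highest match score
// (the first one wins in case of a tie).
int bestIndex = 0;
for (int i = 1; i < matchScores.Count; i++)
{
    if (matchScores[i] > matchScores[bestIndex]) bestIndex = i;
}
string bestSuspect = suspects[bestIndex];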
Optimization notes:
you could check each "suspect" string to see where (i.e. at what index) the DNA sequence starts, as it could vary;
a dictionary could be used here instead of two lists, with the "suspect" string as key and the match score as value.
I have a .txt file that contains a 500 million digit binary representation of Pi.
I need to use a string representation of that in my program. I also need to be able to search it for substrings and the like - in other words, I need to be able to treat it like a normal sized string. I'll be trying to find a lot of substrings so speed is necessary.
My initial logic was to simply copy and paste the string directly into the program and use it as a static variable.. But I was unable to actually open the .txt file, so I couldn't copy and paste. My next attempt was to read the entire string from the file, but I can't do this in a static method and it takes WAAAY too long (I actually don't know exactly how long it takes, I closed the program eventually).
Is it possible to do this? Any help would be appreciated.
Edit: Potentially relevant information:
With this code:
/// <summary>
/// Gets a 500 million binary digit representation of Pi.
/// </summary>
public static string GetPi()
{
    //as per http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx
    StreamReader piStream = new StreamReader(@"C:\binaryPi.txt");
    string pi = "";
    string line;
    while ((line = piStream.ReadLine()) != null)
    {
        pi += line;
    }
    return pi;
}
I get an OutOfMemoryException, so reading the whole file into a string doesn't actually seem possible, unless I'm missing something.
I would suggest that you make a custom class that can handle that kind of data.
If the content of the file is a representation of the binary form of pi, then it's just zeroes and ones. If you store each bit in an actual bit, then each binary digit uses 1/8 of a byte, while if you store it as text, each bit will use two bytes. By storing in a more compact form, you will use 1/16 of the memory.
Your class would then have to handle how you search for bit patterns in the data. That would be the tricky part, but if you create eight different versions of the search pattern, shifted to match the eight possible positions in a byte, the search could be even more efficient than searching in a string.
Edit:
Here's a start...
public class BitList {
    private byte[] _data;   // packed bits, 8 per byte
    private int _count;     // number of bits stored so far

    public BitList(string fileName) {
        using (FileStream s = File.OpenRead(fileName)) {
            // One byte of storage per 8 characters in the file.
            _data = new byte[(s.Length + 7) / 8];
            _count = 0;
            int len;
            byte[] buffer = new byte[4096];
            while ((len = s.Read(buffer, 0, buffer.Length)) > 0) {
                for (int i = 0; i < len; i++) {
                    switch (buffer[i]) {
                        case 48: Add(0); break;  // '0'
                        case 49: Add(1); break;  // '1'
                    }
                }
            }
        }
    }

    public void Add(int bit) {
        _data[_count / 8] |= (byte)(bit << (_count % 8));
        _count++;
    }

    public int this[int index] {
        get {
            return (_data[index / 8] >> (index % 8)) & 1;
        }
    }
}
(Note: This code is NOT TESTED, but you should at least get the principle.)
With the information available, I would just declare a BitArray (initial size = the file length), then open the file and read it in chunks of maybe 4096 bytes, looping through those 4096 characters.
In the loop you just do a simple check: if the character is '1', set the bit to true, else set it to false.
Do this until you reach the end of the file, and then you have the whole thing in one huge BitArray variable.
From that point on you just need to find your pattern.
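Something along these lines, as an untested sketch (it assumes the file contains only '0' and '1' characters, plus possibly line breaks, and the helper name LoadBits is just for illustration):
using System.Collections;
using System.IO;

// Reads the '0'/'1' text file in 4096-byte chunks into one big BitArray.
static BitArray LoadBits(string path)
{
    var file = new FileInfo(path);
    var bits = new BitArray((int)file.Length);   // initial size = file length
    int count = 0;

    using (FileStream s = file.OpenRead())
    {
        byte[] buffer = new byte[4096];
        int len;
        while ((len = s.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < len; i++)
            {
                if (buffer[i] == '1') bits[count++] = true;
                else if (buffer[i] == '0') bits[count++] = false;
                // anything else (e.g. newlines) is ignored
            }
        }
    }

    bits.Length = count;   // trim if any characters were skipped
    return bits;
}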
Read the text file once in an application that converts it to an array of bits, one segment at a time, and then write a new file containing the array persisted in binary. Thereafter, just use the real binary file.
To search you can create a bit mask of the target pattern and slide it along the bit array, one bit at a time, performing a bitwise XOR to compare the bits and a bitwise AND to filter out bits you don't care about. If anything left is nonzero then you don't have a match.
Experiment to determine how the performance differs between datatypes. For example, you could use bytes and search 8-bits at a time or integers and search 32-bits at a time. If your pattern is smaller than the selected datatype then the bitwise AND discards the extra bits. Larger patterns are handled by finding an initial match, then trying to match the next segment of the pattern and so on.
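As a tiny illustration of that XOR/AND idea on a single 64-bit word (only a sketch; a real search would also have to handle patterns that straddle word boundaries):
// True if the low `patternLength` bits of `pattern` occur anywhere inside `word`.
static bool ContainsPattern(ulong word, ulong pattern, int patternLength)
{
    ulong mask = (patternLength == 64) ? ulong.MaxValue : (1UL << patternLength) - 1;
    for (int shift = 0; shift <= 64 - patternLength; shift++)
    {
        // XOR leaves zero bits where word and pattern agree;
        // AND throws away the bits we don't care about.
        if ((((word >> shift) ^ pattern) & mask) == 0) return true;
    }
    return false;
}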
EDIT: An optimization that may help. Let's say you have a long, e.g. greater than 128-bit, pattern. Construct an array of 64 64-bit values from the pattern: bits 0-63, 1-64, 2-65, ... . You can then make a fast pass through trying to match any of the array values to each long integer value in the pi array. Where matches occur, check any prior bits for matches as needed, then test the subsequent bits. The idea is to make the best use of aligned memory accesses.
Depending on the pattern length it may be worthwhile to assemble a two-dimensional array of shifted values such that you can easily continue matching a shifted pattern without recomputing the values. (Just make a turn at the match and pick up the next pattern value with the same shift.) You would need to allow for masking unused bits at each end. Note that you want the most frequent array access to occur on the shortest stride to make the best use of the cache.
The BigInteger structure may be of some use in fiddling about.
If speed of finding multiple different sub-strings is key, then here is an approach that may give you better results.
Leave it in the text file. Scan it and build a tree where the top of the tree has 10 nodes holding the digits 0..9. Each of those nodes holds 10 more nodes: its own digit sequence followed by 0..9. The top level is 0..9, the next is 00..09, .., 90..99, the next is 000..009, ..., 990..999, and so on.
And at each level you also store the offset in the text file of every occurrence that matches its sequence - only if it has no children. The no children rule is to save a lot of memory and by definition every child node contains offsets where the parent sequence exists. In other words, if you are looking for "123456" then an occurrence of "123456789" is a match.
This would use a horrendous amount of memory but it would be very fast on the lookups.
Adding: There are a lot of tricks you can implement to minimize the memory usage. Store the numbers as a nibble (4 bits). Instead of objects for elements in your tree, store offsets from a base and put everything in a small number of large fixed size arrays. You can do this because you create the tree once and then it is read only.
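A bare-bones shape for that tree could look like the following sketch (it stores offsets at the deepest node reached and leaves out the no-children rule and the nibble/array-offset tricks described above; the class names are just for illustration):
using System.Collections.Generic;

class DigitNode
{
    public DigitNode[] Children = new DigitNode[10];
    public List<long> Offsets = new List<long>();
}

class DigitTree
{
    private readonly DigitNode _root = new DigitNode();
    private readonly int _depth;

    public DigitTree(int depth) { _depth = depth; }

    // Record that the digits starting at `offset` in the file begin with `digits`.
    public void Add(string digits, long offset)
    {
        DigitNode node = _root;
        for (int i = 0; i < _depth && i < digits.Length; i++)
        {
            int d = digits[i] - '0';
            if (node.Children[d] == null) node.Children[d] = new DigitNode();
            node = node.Children[d];
        }
        node.Offsets.Add(offset);
    }

    // All recorded offsets whose digit sequence starts with `prefix`
    // (an occurrence of a longer sequence also matches, as described above).
    public IEnumerable<long> Find(string prefix)
    {
        DigitNode node = _root;
        foreach (char c in prefix)
        {
            node = node.Children[c - '0'];
            if (node == null) yield break;
        }
        foreach (long o in Collect(node)) yield return o;
    }

    private static IEnumerable<long> Collect(DigitNode node)
    {
        foreach (long o in node.Offsets) yield return o;
        foreach (DigitNode child in node.Children)
        {
            if (child == null) continue;
            foreach (long o in Collect(child)) yield return o;
        }
    }
}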
According to MSDN, the maximum size of a String object in memory is 2 GB, or about 1 billion characters. To allocate a string of that size you would probably need a 64-bit OS. Since you are dealing with digits only, try to use some other data type than strings.
I have a collection of strings in C#. My code looks like this:
string[] lines = System.IO.File.ReadAllLines(#"d:\SampleFile.txt");
What I want to do is find the max length of a string in that collection and store it in a variable. Currently, I code this manually, like this:
int nMaxLengthOfString = 0;
for (int i = 0; i < lines.Length; i++)
{
    if (lines[i].Length > nMaxLengthOfString)
    {
        nMaxLengthOfString = lines[i].Length;
    }
}
The code above does the work for me, but I am looking for some built-in function in order to maintain efficiency, because there will be thousands of lines in my file :(
A simpler way with LINQ would be:
int maxLength = lines.Max(x => x.Length);
Note that if you're using .NET 4, you don't need to read all the lines into an array first, if you don't need them later:
// Note call to ReadLines rather than ReadAllLines.
int maxLength = File.ReadLines(filename).Max(x => x.Length);
(If you're not using .NET 4, it's easy to write the equivalent of File.ReadLines.)
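For example, a minimal equivalent might look something like this sketch (not the exact BCL implementation):
using System.Collections.Generic;
using System.IO;

// Lazily yields lines one at a time instead of loading the whole file into memory.
static IEnumerable<string> ReadLines(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}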
That will be more efficient in terms of memory, but fundamentally you will have to read every line from disk, and you will need to iterate over those lines to find the maximum length. The disk access is likely to be the bottleneck of course.
The efficiency will certainly be no worse than your version, and may well be better.
But if you're looking to be succinct, try lambdas with LINQ:
int maxLength = lines.Aggregate(0, (max, line) => Math.Max(max, line.Length));
Btw, minor point: you can technically stop reading if the amount of data left is less than the longest line you've found. So you can technically save some steps, although it's probably not worth the code.
Completely irrelevant, but just because I feel like it, here's the (elegant!) Scheme version:
(reduce max (map length lines))
I need to sort a huge list of text strings of arbitrary length. I suppose radix sort is the best option here. The list is really huge, so padding the strings to the same length is completely impossible.
Is there any ready-made implementation for this task, preferably in C#?
Depending on what you need, you might find inserting all the strings into some form of Trie to be the best solution. Even a basic Ternary Search Trie will have a smaller memory footprint than an array/list of strings and will store the strings in sorted order.
Insertion, lookup and removal are all O(k * log(a)), where k is the length of the key and a is the size of your alphabet (the number of possible values for a character). Since a is constant, so is log(a), and you end up with an O(n * k) algorithm for sorting.
Edit: In case you are unfamiliar with Tries, they are basically n-ary trees where each edge represents a single character of the key. When inserting, you check if the root node contains an edge (or child, whatever) that matches the first character of your key. If so, you follow that path and repeat with the second character and so on. If not, you add a new edge. In a Ternary Search Trie, the edges/children are stored in a binary tree so the characters are in sorted order and can be searched in log(a) time. If you want to waste memory you can store the edges/children in an array of size a which gives you constant lookup at each step.
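To make that concrete, here is a small sketch of a trie insert in C#. It uses a SortedDictionary for the children, which keeps them ordered and gives the log(a) lookup mentioned above; to sort a list containing duplicates you would store a count per node rather than a flag. The class names are my own, for illustration only:
using System.Collections.Generic;

class TrieNode
{
    // Children kept sorted by character; lookup and insert are O(log a).
    public SortedDictionary<char, TrieNode> Children = new SortedDictionary<char, TrieNode>();
    public bool IsEndOfWord;
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string key)
    {
        TrieNode node = _root;
        foreach (char c in key)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.IsEndOfWord = true;
    }

    // Walking the trie depth-first yields the stored strings in sorted order.
    public IEnumerable<string> InSortedOrder()
    {
        return Walk(_root, "");
    }

    private static IEnumerable<string> Walk(TrieNode node, string prefix)
    {
        if (node.IsEndOfWord) yield return prefix;
        foreach (KeyValuePair<char, TrieNode> kv in node.Children)
        {
            foreach (string s in Walk(kv.Value, prefix + kv.Key))
            {
                yield return s;
            }
        }
    }
}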
See this thread on radix sort, or this one on a radix sort implementation.
How many are many? One million?
The built-in List<string>.Sort() takes O(n * log(n)) on average.
log2(10^6) ~= 20, which is not very much slower than O(n) for 10^6 elements. If your strings are more than 20 characters long, radix sort's O(n * k) will be "slower".
I doubt a radix sort will be significantly faster than the built-in sort. But it would be fun to measure and compare.
Edit: there is a point to the statements I originally made below, but overall the point is wrong.
Radix sort is the wrong sort to use on large numbers of strings. For things like
I really like squirrels. Yay, yay, yay!
I really like blue jays. Yay, yay, yay!
I really like registers. Yay, yay, yay!
you will have a bunch of entries falling in the same bucket. You could avoid this by hashing, but what use is sorting a hash code?
Use quicksort or mergesort or the like. (Quicksort generally performs better and takes less memory, but many examples have worst-case performance of O(N^2) which almost never occurs in practice; Mergesort doesn't perform quite as well but is usually implemented to be stable, and it's easy to do part in memory and part on disk.) That is, use the built-in sort function.
Edit: Well, it turns out that at least on very large files with long repeats at the beginning (e.g. source code) and with many lines exactly the same (100x repeats, in fact), radix sort does start becoming competitive with quicksort. I'm surprised! But, anyway, here is the code I used to implement radix sort. It's in Scala, not C#, but I've written it in fairly iterative style so it should be reasonably obvious how things work. The tricky bits are: (a(i)(ch) & 0xFF) extracts a 0-255 byte from an array of arrays of bytes (bytes are signed); counts.scanLeft(0)(_ + _) forms a cumulative sum of the counts, starting from zero (and then indices.clone.take(257) takes all but the last one); and Scala allows multiple parameter lists (so I split up the always-provided argument from the arguments that have defaults used in recursion). Here it is:
def radixSort(a: Array[Array[Byte]])(i0: Int = 0, i1: Int = a.length, ch: Int = 0) {
  val counts = new Array[Int](257)
  var i = i0
  while (i < i1) {
    if (a(i).length <= ch) counts(0) += 1
    else { counts((a(i)(ch)&0xFF)+1) += 1 }
    i += 1
  }
  val indices = counts.scanLeft(0)(_ + _)
  val starts = indices.clone.take(257)
  i = i0
  while (i < i1) {
    val bucket = if (a(i).length <= ch) 0 else (a(i)(ch)&0xFF)+1
    if (starts(bucket)+i0 <= i && i < starts(bucket)+i0+counts(bucket)) {
      if (indices(bucket) <= i) indices(bucket) = i+1
      i += 1
    }
    else {
      val temp = a(indices(bucket)+i0)
      a(indices(bucket)+i0) = a(i)
      a(i) = temp
      indices(bucket) += 1
    }
  }
  i = 1
  while (i < counts.length) {
    if (counts(i)>1) {
      radixSort(a)(i0+starts(i),i0+starts(i)+counts(i),ch+1)
    }
    i += 1
  }
}
And the timings are that with 7M lines of source code (100x duplication of 70k lines), the radix sort ties the built-in library sort, and wins thereafter.
The String.Compare() overloads already use this kind of string comparison. What you need is to feed this comparison to your sort algorithm.
UPDATE
This is the implementation:
[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int nativeCompareString(int lcid, string string1, int offset1, int length1, string string2, int offset2, int length2, int flags);
It is hard to beat this native, unmanaged implementation with your own implementation.