Binary search a byte array - c#

I have a file that contains a sorted list of key/value pairs. The key is a fixed-length string, and the value is the binary representation of an integer. So if the key size were 6, each record is exactly 10 bytes (6+sizeof(int)). The key size can be any (reasonable) value but is the same for every record in the file.
For example, the data file might look like this (where 'xxxx' would be binary representations of int values):
ABLE xxxxBAKER xxxxCARL xxxxDELTA xxxxEDGAR xxxxGEORGExxxx
I would like to load this table into memory (a byte[]) and binary search it. In C++, this is a snap, the bsearch() function takes an 'element size' parameter that tells it the size of the records in the buffer, but I cannot find a similar function in C#.
I've also looked at Array.BinarySearch, but I can't figure out what I'd use as a T.

You should probably use a binary reader to convert your binary data to proper .net types. Since c# is a managed language, and the memory representation of types may depend on the runtime, reading and writing data is generally more involved than just dumping some part of memory to disk.
You mention "fixed-length string", I would assume you mean it is ascii encoded. C# string are UTF16, so you will probably need to convert it to allow it to be used in any meaningful way. See also bitconverter for an alternative way to read integer data.
Once you have converted the data to proper c# types it should be fairly trivial to do a binary search, just put the key and value in separate arrays and use Array.BinarySearch. Or use a custom type that implements IComparable<T>.
Or put the data into some other data structure, like a dictionary, for fast searching.

We can modify standard binary-search algorithms to do this.
We need to modify it to refer to the array in Spans of the width you want. So we multiply the index by that width.
delegate int BytesComparer(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b);
int BinarySearchRecord(byte[] array, int recordWidth, ReadOnlySpan<byte> value, BytesComparer comparer)
{
return BinarySearchRecord(array, 0, array.Length / recordWidth, value, recordWidth, comparer);
}
int BinarySearchRecord(byte[] array, int startRecord, int count, int recordWidth, ReadOnlySpan<byte> value, BytesComparer comparer)
{
int lo = startRecord;
int hi = startRecord + count - 1;
while (lo <= hi)
{
int i = lo + ((hi - lo) >> 1);
int order = comparer(new ReadOnlySpan<byte>(array, i * recordWidth, recordWidth), value);
if (order == 0) return i;
if (order < 0)
lo = i + 1;
else
hi = i - 1;
}
return ~lo;
}
You can use it like this (it requires some bit-twiddling on little-endian systems, in order to sort strings correctly)
var yourArray = Encoding.ASCII.GetBytes("ABLE xxxxBAKER xxxxCARL xxxxDELTA xxxxEDGAR xxxxGEORGExxxx");
var valToFind = Encoding.ASCII.GetBytes("DELTA ").Concat(new byte[]{0x1, 0x2, 0x3, 0x4}).ToArray();
var foundIndex = BinarySearchRecord(yourArray, valToFind.Length, new ReadOnlySpan<byte>(valToFind), (a, b) =>
{
return BinaryPrimitives.ReverseEndianness(BitConverter.ToInt64(a) & 0xffffffffffff)
.CompareTo(
BinaryPrimitives.ReverseEndianness(BitConverter.ToInt64(b) & 0xffffffffffff));
});
dotnetfiddle

Related

c# - fastest method get integer from part of bits

I have byte[] byteArray, usually byteArray.Length = 1-3
I need decompose an array into bits, take some bits (for example, 5-17), and convert these bits to Int32.
I tried to do this
private static IEnumerable<bool> GetBitsStartingFromLSB(byte b)
{
for (int i = 0; i < 8; i++)
{
yield return (b % 2 != 0);
b = (byte)(b >> 1);
}
}
public static Int32 Bits2Int(ref byte[] source, int offset, int length)
{
List<bool> bools = source.SelectMany(GetBitsStartingFromLSB).ToList();
bools = bools.GetRange(offset, length);
bools.AddRange(Enumerable.Repeat(false, 32-length).ToList() );
int[] array = new int[1];
(new BitArray(bools.ToArray())).CopyTo(array, 0);
return array[0];
}
But this method is too slow, and I have to call it very often.
How can I do this more efficiently?
Thanx a lot! Now i do this:
public static byte[] GetPartOfByteArray( byte[] source, int offset, int length)
{
byte[] retBytes = new byte[length];
Buffer.BlockCopy(source, offset, retBytes, 0, length);
return retBytes;
}
public static Int32 Bits2Int(byte[] source, int offset, int length)
{
if (source.Length > 4)
{
source = GetPartOfByteArray(source, offset / 8, (source.Length - offset / 8 > 3 ? 4 : source.Length - offset / 8));
offset -= 8 * (offset / 8);
}
byte[] intBytes = new byte[4];
source.CopyTo(intBytes, 0);
Int32 full = BitConverter.ToInt32(intBytes);
Int32 mask = (1 << length) - 1;
return (full >> offset) & mask;
}
And it works very fast!
If you're after "fast", then ultimately you need to do this with bit logic, not LINQ etc. I'm not going to write actual code, but you'd need to:
use your offset with / 8 and % 8 to find the starting byte and the bit-offset inside that byte
compose however many bytes you need - quite possibly up to 5 if you are after a 32-bit number (because of the possibility of an offset)
; for example into a long, in whichever endianness (presumably big-endian?) you expect
use right-shift (>>) on the composed value to drop however-many bits you need to apply the bit-offset (i.e. value >>= offset % 8;)
mask out any bits you don't want; for example value &= ~(-1L << length); (the -1 gives you all-ones; the << length creates length zeros at the right hand edge, and the ~ swaps all zeros for ones and ones for zeros, so you now have length ones at the right hand edge)
if the value is signed, you'll need to think about how you want negatives to be handled, especially if you aren't always reading 32 bits
First of all, you're asking for optimization. But the only things you've said are:
too slow
need to call it often
There's no information on:
how much slow is too slow? have you measured current code? have you estimated how fast you need it to be?
how often is "often"?
how large are the source byt arrays?
etc.
Optimization can be done in a multitude of ways. When asking for optimization, everything is important. For example, if source byte[] is 1 or 2 bytes long (yeah, may be ridiculous, but you didn't tell us), and if it rarely changes, then you could get very nice results by caching results. And so on.
So, no solutions from me, just a list of possible performance problems:
private static IEnumerable<bool> GetBitsStartingFromLSB(byte b) // A
{
for (int i = 0; i < 8; i++)
{
yield return (b % 2 != 0); // A
b = (byte)(b >> 1);
}
}
public static Int32 Bits2Int(ref byte[] source, int offset, int length)
{
List<bool> bools = source.SelectMany(GetBitsStartingFromLSB).ToList(); //A,B
bools = bools.GetRange(offset, length); //B
bools.AddRange(Enumerable.Repeat(false, 32-length).ToList() ); //C
int[] array = new int[1]; //D
(new BitArray(bools.ToArray())).CopyTo(array, 0); //D
return array[0]; //D
}
A: LINQ is fun, but not fast unless done carefully. For each input byte, it takes 1 byte, splits that in 8 bools, passing them around wrapped it in a compiler-generated IEnumerable object *). Note that it all needs to be cleaned up later, too. Probably you'd get a better performance simply returning a new bool[8] or even BitArray(size=8).
*) conceptually. In fact yield-return is lazy, so it's not 8valueobj+1refobj, but just 1 enumerable that generates items. But then, you're doing .ToList() in (B), so me writing this in that way isn't that far from truth.
A2: the 8 is hardcoded. Once you drop that pretty IEnumerable and change it to a constant-sized array-like thing, you can preallocate that array and pass it via parameter to GetBitsStartingFromLSB to further reduce the amount of temporary objects created and later thrown away. And since SelectMany visits items one-by-one without ever going back, that preallocated array can be reused.
B: Converts whole Source array to stream of bytes, converts it to List. Then discards that whole list except for a small offset-length range of that list. Why covert to list at all? It's just another pack of objects wasted, and internal data is copied as well, since bool is a valuetype. You could have taken the range directly from IEnumerable by .Skip(X).Take(Y)
C: padding a list of bools to have 32 items. AddRange/Repeat is fun, but Repeat has to return an IEnumerable. It's again another object that is created and throw away. You're padding the list with false. Drop the list idea, make it an bool[32]. Or BitArray(32). They start with false automatically. That's the default value of a bool. Iterate over the those bits from 'range' A+B and write them into that array by index. Those written will have their value, those unwritten will stay false. Job done, no objects wasted.
C2: connect preallocating 32-item array with A+A2. GetBitsStartingFromLSB doesn't need to return anything, it may get a buffer to be filled via parameter. And that buffer doesn't need to be 8-item buffer. You may pass the whole 32-item final array, and pass an offset so that function knows exactly where to write. Even less objects wasted.
D: finally, all that work to return selected bits as an integer. new temporary array is created&wasted, new BitArray is effectively created&wasted too. Note that earlier you were already doing manual bit-shift conversion int->bits in GetBitsStartingFromLSB, why not just create a similar method that will do some shifts and do bits->int instead? If you knew the order of the bits, now you know them as well. No need for array&BitArray, some code wiggling, and you save on that allocations and data copying again.
I have no idea how much time/space/etc will that save for you, but that's just a few points that stand out at first glance, without modifying your original idea for the code too much, without doing-it-all via math&bitshifts in one go, etc. I've seen MarcGravell already wrote you some hints too. If you have time to spare, I suggest you try first mine, one by one, and see how (and if at all !) each change affects performance. Just to see. Then you'll probably scrap it all and try again new "do-it-all via math&bitshifts in one go" version with Marc's hints.

How the space complexity of this algorithm is O(1)

I found an algorithm here to remove duplicate characters from string with O(1) space complexity (SC). Here we see that the algorithm converts string to character array which is not constant, it will change depending on input size. They claim that it will run in SC of O(1). How?
// Function to remove duplicates
static string removeDuplicatesFromString(string string1)
{
// keeps track of visited characters
int counter = 0;
char[] str = string1.ToCharArray();
int i = 0;
int size = str.Length;
// gets character value
int x;
// keeps track of length of resultant String
int length = 0;
while (i < size) {
x = str[i] - 97;
// check if Xth bit of counter is unset
if ((counter & (1 << x)) == 0) {
str[length] = (char)('a' + x);
// mark current character as visited
counter = counter | (1 << x);
length++;
}
i++;
}
return (new string(str)).Substring(0, length);
}
It seems that I don't understand Space Complexity.
I found an algorithm here to remove duplicate characters from string with O(1) space complexity (SC). Here we see that the algorithm converts string to character array which is not constant, it will change depending on input size. They claim that it will run in SC of O(1). How?
It does not.
The algorithm takes as its input an arbitrary sized string consisting only of 26 characters, and therefore the output is only ever 26 characters or fewer, so the output array need not be of the size of the input.
You are correct to point out that the implementation given on the site allocates O(n) extra space unnecessarily for the char array.
Exercise: Can you fix the char array problem?
Harder Exercise: Can you describe and implement a string data structure that implements the contract of a string efficiently but allows this algorithm to be implemented actually using only O(1) extra space for arbitrary strings?
Better exercise: The fact that we are restricted to an alphabet of 26 characters is what enables the cheesy "let's just use an int as a set of flags" solution. Instead of saying that n is the size of the input string, what if we allow arbitrary sequences of arbitrary values that have an equality relation; can you come up with a solution to this problem that is O(n) in the size of the output sequence, not the input sequence?
That is, can you implement public static IEnumerable<T> Distinct<T>(this IEnumerable<T> t) such that the output is deduplicated but otherwise in the same order as the input, using O(n) storage where n is the size of the output sequence?
This is a better exercise because this function is actually implemented in the base class library. It's useful, unlike the toy problem.
I note also that the problem statement assumes that there is only one relevant alphabet with lowercase characters, and that there are 26 of them. This assumption is false.

Cryptography .NET, Avoiding Timing Attack

I was browsing crackstation.net website and came across this code which was commented as following:
Compares two byte arrays in length-constant time. This comparison method is used so that password hashes cannot be extracted from on-line systems using a timing attack and then attacked off-line.
private static bool SlowEquals(byte[] a, byte[] b)
{
uint diff = (uint)a.Length ^ (uint)b.Length;
for (int i = 0; i < a.Length && i < b.Length; i++)
diff |= (uint)(a[i] ^ b[i]);
return diff == 0;
}
Can anyone please explain how does this function actual works, why do we need to convert the length to an unsigned integer and how this method avoids a timing attack? What does the line diff |= (uint)(a[i] ^ b[i]); do?
This sets diff based on whether there's a difference between a and b.
It avoids a timing attack by always walking through the entirety of the shorter of the two of a and b, regardless of whether there's a mismatch sooner than that or not.
The diff |= (uint)(a[i] ^ (uint)b[i]) takes the exclusive-or of a byte of a with a corresponding byte of b. That will be 0 if the two bytes are the same, or non-zero if they're different. It then ors that with diff.
Therefore, diff will be set to non-zero in an iteration if a difference was found between the inputs in that iteration. Once diff is given a non-zero value at any iteration of the loop, it will retain the non-zero value through further iterations.
Therefore, the final result in diff will be non-zero if any difference is found between corresponding bytes of a and b, and 0 only if all bytes (and the lengths) of a and b are equal.
Unlike a typical comparison, however, this will always execute the loop until all the bytes in the shorter of the two inputs have been compared to bytes in the other. A typical comparison would have an early-out where the loop would be broken as soon as a mismatch was found:
bool equal(byte a[], byte b[]) {
if (a.length() != b.length())
return false;
for (int i=0; i<a.length(); i++)
if (a[i] != b[i])
return false;
return true;
}
With this, based on the amount of time consumed to return false, we can learn (at least an approximation of) the number of bytes that matched between a and b. Let's say the initial test of length takes 10 ns, and each iteration of the loop takes another 10 ns. Based on that, if it returns false in 50 ns, we can quickly guess that we have the right length, and the first four bytes of a and b match.
Even without knowing the exact amounts of time, we can still use the timing differences to determine the correct string. We start with a string of length 1, and increase that one byte at a time until we see an increase in the time taken to return false. Then we run through all the possible values in the first byte until we see another increase, indicating that it has executed another iteration of the loop. Continue with the same for successive bytes until all bytes match and we get a return of true.
The original is still open to a little bit of a timing attack -- although we can't easily determine the contents of the correct string based on timing, we can at least find the string length based on timing. Since it only compares up to the shorter of the two strings, we can start with a string of length 1, then 2, then 3, and so on until the time becomes stable. As long as the time is increasing our proposed string is shorter than the correct string. When we give it longer strings, but the time remains constant, we know our string is longer than the correct string. The correct length of string will be the shortest one that takes that maximum duration to test.
Whether this is useful or not depends on the situation, but it's clearly leaking some information, regardless. For truly maximum security, we'd probably want to append random garbage to the end of the real string to make it the length of the user's input, so the time stays proportional to the length of the input, regardless of whether it's shorter, equal to, or longer than the correct string.
this version goes on for the length of the input 'a'
private static bool SlowEquals(byte[] a, byte[] b)
{
uint diff = (uint)a.Length ^ (uint)b.Length;
byte[] c = new byte[] { 0 };
for (int i = 0; i < a.Length; i++)
diff |= (uint)(GetElem(a, i, c, 0) ^ GetElem(b, i, c, 0));
return diff == 0;
}
private static byte GetElem(byte[] x, int i, byte[] c, int i0)
{
bool ok = (i < x.Length);
return (ok ? x : c)[ok ? i : i0];
}

I need to take an int and convert it into its binary notation (C#)

I need to take an int and turn it into its byte form.
i.e. I need to take '1' and turn it into '00000001'
or '160' and turn it into '10100000'
Currently, I am using this
int x = 3;
string s = Convert.ToString(x, 2);
int b = int.Parse(s);
This is an awful way to do things, so I am looking for a better way.
Any Suggestions?
EDIT
Basically, I need to get a list of every number up to 256 in base-2. I'm going to store all the numbers in a list, and keep them in a table on my db.
UPDATE
I decided to keep the base-2 number as a string instead of parsing it back. Thanks for the help and sorry for the confusion!
If you need the byte form you should take a look at the BitConverter.GetBytes() method. It does not return a string, but an array of bytes.
The int is already a binary number. What exactly are you looking to do with the new integer? What you are doing is setting a base 10 number to a base 2 value. That's kind of confusing and I think you must be trying to do something that should happen a different way.
I don't know what you need at the end ... this may help:
Turn the int into an int array:
byte[] bytes = BitConverter.GetBytes(x);
Turn the int into a bit array:
BitArray bitArray = new BitArray(new[] {x});
You can use BitArray.
The code looks a bit clumsy, but that could be improved a bit.
int testValue = 160;
System.Collections.BitArray bitarray = new System.Collections.BitArray(new int[] { testValue });
var bitList = new List<bool>();
foreach (bool bit in bitarray)
bitList.Add(bit);
bitList.Reverse();
var base2 = 0;
foreach (bool bit in bitList)
{
base2 *= 10; // Shift one step left
if (bit)
base2++; // Add 1 last
}
Console.WriteLine(base2);
I think you are confusing the data type Integer with its textual representation.
int x = 3;
is the number three regardless of the representation (binary, decimal, hexadecimal, etc.)
When you parse the binary textual representation of an integer back to integer you are getting a different number. The framework assumes you are parsing the number represented in the decimal base and gives the corresponding integer.
You can try
int x = 1600;
string s = Convert.ToString(x, 2);
int b = int.Parse(s);
and it will throw an exception because the binary representation of 1600 interpreted as decimal
is too big to fit in an integer

Converting raw byte data to float[]

I have this code for converting a byte[] to float[].
public float[] ConvertByteToFloat(byte[] array)
{
float[] floatArr = new float[array.Length / sizeof(float)];
int index = 0;
for (int i = 0; i < floatArr.Length; i++)
{
floatArr[i] = BitConverter.ToSingle(array, index);
index += sizeof(float);
}
return floatArr;
}
Problem is, I usually get a NaN result! Why should this be? I checked if there is data in the byte[] and the data seems to be fine. If it helps, an example of the values are:
new byte[] {
231,
255,
235,
255,
}
But this returns NaN (Not a Number) after conversion to float. What could be the problem? Are there other better ways of converting byte[] to float[]? I am sure that the values read into the buffer are correct since I compared it with my other program (which performs amplification for a .wav file).
If endianness is the problem, you should check the value of BitConverter.IsLittleEndian to determine if the bytes have to be reversed:
public static float[] ConvertByteToFloat(byte[] array) {
float[] floatArr = new float[array.Length / 4];
for (int i = 0; i < floatArr.Length; i++) {
if (BitConverter.IsLittleEndian) {
Array.Reverse(array, i * 4, 4);
}
floatArr[i] = BitConverter.ToSingle(array, i * 4);
}
return floatArr;
}
Isn't the problem that the 255 in the exponent represents NaN (see Wikipedia in the exponent section), so you should get a NaN. Try changing the last 255 to something else...
If it's the endianness that is wrong (you are reading big endian numbers) try these (be aware that they are "unsafe" so you have to check the unsafe flag in your project properties)
public static unsafe int ToInt32(byte[] value, int startIndex)
{
fixed (byte* numRef = &value[startIndex])
{
var num = (uint)((numRef[0] << 0x18) | (numRef[1] << 0x10) | (numRef[2] << 0x8) | numRef[3]);
return (int)num;
}
}
public static unsafe float ToSingle(byte[] value, int startIndex)
{
int val = ToInt32(value, startIndex);
return *(float*)&val;
}
I assume that your byte[] doesn't contain the binary representation of floats, but either a sequence of Int8s or Int16s. The examples you posted don't look like audio samples based on float (neither NaN nor -2.41E+24 are in the range -1 to 1. While some new audio formats might support floats outside that range, traditionally audio data consists of signed 8 or 16 bit integer samples.
Another thing you need to be aware of is that often the different channels are interleaved. For example it could contain the first sample for the left, then for the right channel, and then the second sample for both... So you need to separate the channels while parsing.
It's also possible, but uncommon that the samples are unsigned. In which case you need to remove the offset from the conversion functions.
So you first need to parse current position in the byte array into an Int8/16. And then convert that integer to a float in the range -1 to 1.
If the format is little endian you can use BitConverter. Another possibility that works with both endiannesses is getting two bytes manually and combining them with a bit-shift. I don't remember if little or big endian is common. So you need to try that yourself.
This can be done with functions like the following(I didn't test them):
float Int8ToFloat(Int8 i)
{
return ((i-Int8.MinValue)*(1f/0xFF))-0.5f;
}
float Int16ToFloat(Int16 i)
{
return ((i-Int16.MinValue)*(1f/0xFFFF))-0.5f;
}
Depends on how you want to convert the bytes to float. Start out by trying to pin-point what is actually stored in the byte array. Perhaps there is a file format specification? Other transformations in your code?
In the case each byte should be converted to a float between 0 and 255:
public float[] ConvertByteToFloat(byte[] array)
{
return array.Select(b => (float)b).ToArray();
}
If the bytes array contains binary representation of floats, there are several representation and if the representation stored in your file does not match the c# language standard floating point representation (IEEE 754) weird things like this will happen.

Categories