I'm profiling some C# code. The method below is one of the most expensive ones. For the purpose of this question, assume that micro-optimization is the right thing to do. Is there an approach to improve performance of this method?
Changing the input parameter p to ulong[] would create a macro inefficiency.
static ulong Fetch64(byte[] p, int ofs = 0)
{
    unchecked
    {
        ulong result = p[0 + ofs] +
                       ((ulong) p[1 + ofs] << 8) +
                       ((ulong) p[2 + ofs] << 16) +
                       ((ulong) p[3 + ofs] << 24) +
                       ((ulong) p[4 + ofs] << 32) +
                       ((ulong) p[5 + ofs] << 40) +
                       ((ulong) p[6 + ofs] << 48) +
                       ((ulong) p[7 + ofs] << 56);
        return result;
    }
}
Why not use BitConverter? I've got to believe that Microsoft has spent some time tuning that code. Plus it deals with endian issues.
Here's how BitConverter turns a byte[] into a long/ulong (ToUInt64 converts it as signed and then casts the result to unsigned):
[SecuritySafeCritical]
public static unsafe long ToInt64(byte[] value, int startIndex)
{
    if (value == null)
    {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);
    }
    if (((ulong) startIndex) >= value.Length)
    {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.startIndex, ExceptionResource.ArgumentOutOfRange_Index);
    }
    if (startIndex > (value.Length - 8))
    {
        ThrowHelper.ThrowArgumentException(ExceptionResource.Arg_ArrayPlusOffTooSmall);
    }
    fixed (byte* numRef = &(value[startIndex]))
    {
        if ((startIndex % 8) == 0)
        {
            return *(((long*) numRef));
        }
        if (IsLittleEndian)
        {
            int num = ((numRef[0] | (numRef[1] << 8)) | (numRef[2] << 0x10)) | (numRef[3] << 0x18);
            int num2 = ((numRef[4] | (numRef[5] << 8)) | (numRef[6] << 0x10)) | (numRef[7] << 0x18);
            return ((long) ((uint) num)) | (((long) num2) << 0x20);
        }
        int num3 = (((numRef[0] << 0x18) | (numRef[1] << 0x10)) | (numRef[2] << 8)) | numRef[3];
        int num4 = (((numRef[4] << 0x18) | (numRef[5] << 0x10)) | (numRef[6] << 8)) | numRef[7];
        return ((long) ((uint) num4)) | (((long) num3) << 0x20);
    }
}
I suspect that doing the conversion one 32-bit word at a time is for 32-bit efficiency. With no 64-bit registers on a 32-bit CPU, dealing with 64-bit ints is a lot more expensive.
If you know for sure you're targeting 64-bit hardware, it might be faster to do the conversion in one fell swoop.
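For illustration, a one-fell-swoop version might look like the sketch below (my own illustration, not BitConverter's code: it assumes a little-endian host and that the caller guarantees ofs + 8 <= p.Length):
static unsafe ulong Fetch64OneRead(byte[] p, int ofs = 0)
{
    fixed (byte* bp = &p[ofs])
    {
        return *(ulong*)bp;   // one 8-byte load instead of eight 1-byte loads
    }
}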
Try using a for loop instead of unrolling it; you may be able to save time on the array bounds checks.
Try BitConverter.ToUInt64 (http://msdn.microsoft.com/en-us/library/system.bitconverter.touint64.aspx) if that's what you're looking for.
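Applied to the method in the question, that's a one-liner (note it reads in host byte order, i.e. little-endian on typical x86 hardware):
ulong result = BitConverter.ToUInt64(p, ofs);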
For reference, Microsoft's .NET 4.0 BitConverter.ToInt64 (Shared Source Initiative at http://referencesource.microsoft.com/netframework.aspx):
// Converts an array of bytes into a long.
[System.Security.SecuritySafeCritical] // auto-generated
public static unsafe long ToInt64 (byte[] value, int startIndex) {
    if( value == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);
    }
    if ((uint) startIndex >= value.Length) {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.startIndex, ExceptionResource.ArgumentOutOfRange_Index);
    }
    if (startIndex > value.Length - 8) {
        ThrowHelper.ThrowArgumentException(ExceptionResource.Arg_ArrayPlusOffTooSmall);
    }
    fixed( byte * pbyte = &value[startIndex]) {
        if( startIndex % 8 == 0) { // data is aligned
            return *((long *) pbyte);
        }
        else {
            if( IsLittleEndian) {
                int i1 = (*pbyte) | (*(pbyte + 1) << 8) | (*(pbyte + 2) << 16) | (*(pbyte + 3) << 24);
                int i2 = (*(pbyte + 4)) | (*(pbyte + 5) << 8) | (*(pbyte + 6) << 16) | (*(pbyte + 7) << 24);
                return (uint)i1 | ((long)i2 << 32);
            }
            else {
                int i1 = (*pbyte << 24) | (*(pbyte + 1) << 16) | (*(pbyte + 2) << 8) | (*(pbyte + 3));
                int i2 = (*(pbyte + 4) << 24) | (*(pbyte + 5) << 16) | (*(pbyte + 6) << 8) | (*(pbyte + 7));
                return (uint)i2 | ((long)i1 << 32);
            }
        }
    }
}
Why not go unsafe?
unsafe static ulong Fetch64(byte[] p, int ofs = 0)
{
    // No bounds check: the caller must guarantee ofs + 8 <= p.Length.
    // Reads in host byte order (little-endian on x86).
    fixed (byte* bp = p)
    {
        return *((ulong*)(bp + ofs));
    }
}
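A usage sketch with a caller-side guard (the guard is my addition, not part of the answer above):
if (ofs < 0 || ofs > p.Length - 8)
    throw new ArgumentOutOfRangeException(nameof(ofs));
ulong value = Fetch64(p, ofs);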
Related
Converting a SQL Server rowversion to long or ulong?
I can convert a SQL Server RowVersion to a ulong with this code:
static ulong BigEndianToUInt64(byte[] bigEndianBinary)
{
    return ((ulong)bigEndianBinary[0] << 56) |
           ((ulong)bigEndianBinary[1] << 48) |
           ((ulong)bigEndianBinary[2] << 40) |
           ((ulong)bigEndianBinary[3] << 32) |
           ((ulong)bigEndianBinary[4] << 24) |
           ((ulong)bigEndianBinary[5] << 16) |
           ((ulong)bigEndianBinary[6] << 8)  |
           bigEndianBinary[7];
}
Now I am faced with the reverse problem: how do I convert the ulong back to a byte[8]?
I save the rowversion value to a file, then read it back and use it in the query. The query parameter has to be byte[], not ulong; otherwise an error is raised.
I got my answer from the BigEndian.GetBytes Method (Int64).
Thanks all.
public static byte[] GetBytes(this ulong value)
{
    return new[]
    {
        (byte)(value >> 56),
        (byte)(value >> 48),
        (byte)(value >> 40),
        (byte)(value >> 32),
        (byte)(value >> 24),
        (byte)(value >> 16),
        (byte)(value >> 8),
        (byte)(value)
    };
}
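A quick round trip shows the two methods are inverses (a sketch with example values of my own, assuming both methods are in scope):
byte[] rowVersion = { 0, 0, 0, 0, 0, 0, 1, 44 };  // big-endian 300
ulong value = BigEndianToUInt64(rowVersion);      // 300
byte[] roundTrip = value.GetBytes();              // { 0, 0, 0, 0, 0, 0, 1, 44 } again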
I am a beginner at C# and would like some advice on how to solve the following problem:
My code reads 3 bytes from an ADC through an FTDI IC and calculates a value. The problem is that sometimes the reading is negative (below 0000 0000 0000 0000 0000 0000 in 24-bit two's complement), and my calculated value jumps to a big number. I need to handle the two's complement, but I don't know how.
byte[] readData1 = new byte[3];
ftStatus = myFtdiDevice.Read(readData1, numBytesAvailable, ref numBytesRead); // read data from FTDI
int vrednost1 = (readData1[2] << 16) | (readData1[1] << 8) | (readData1[0]); // combine the 3 bytes, LSB first
double v = (((4.096 / 16777216) * vrednost1) / 4) * 250; // calculate (double, not int: the expression is fractional)
Convert the data left-justified, so that your sign-bit lands in the spot your PC processor treats as a sign bit. Then right-shift back, which will automatically perform sign-extension.
int left_justified = (readData[2] << 24) | (readData[1] << 16) | (readData[0] << 8);
int result = left_justified >> 8;
Equivalently, one can mark the byte containing a sign bit as signed, so the processor will perform sign extension:
int result = (unchecked((sbyte)readData[2]) << 16) | (readData[1] << 8) | readData[0];
The second approach with a cast only works if the sign bit is already aligned left within any one of the bytes. The left-justification approach can work for arbitrary sizes, such as 18-bit two's complement readings. In this situation the cast to sbyte wouldn't do the job.
int left_justified18 = (readData[2] << 30) | (readData[1] << 22) | (readData[0] << 14);
int result18 = left_justified18 >> 14;
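As a quick check of the first snippet (example values of my own, not from the answer), the bytes {0xFF, 0xFF, 0xFF} are -1 in 24-bit two's complement:
byte[] readData = { 0xFF, 0xFF, 0xFF };  // -1 in 24-bit two's complement
int left_justified = (readData[2] << 24) | (readData[1] << 16) | (readData[0] << 8);
int result = left_justified >> 8;        // -1: the arithmetic shift sign-extends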
Well, the only subtle point here is endianness (big or little). Assuming that byte[0] holds the least significant byte, you can write:
private static int FromInt24(byte[] data) {
    int result = (data[2] << 16) | (data[1] << 8) | data[0];
    return data[2] < 128
        ? result                          // positive number
        : -((~result & 0xFFFFFF) + 1);    // negative number; its absolute value is the two's complement
}
Demo:
byte[][] tests = new byte[][] {
    new byte[] { 255, 255, 255 },
    new byte[] { 44, 1, 0 },
};
string report = string.Join(Environment.NewLine, tests
    .Select(test => $"[{string.Join(", ", test.Select(x => $"0x{x:X2}"))}] == {FromInt24(test)}"));
Console.Write(report);
Outcome:
[0xFF, 0xFF, 0xFF] == -1
[0x2C, 0x01, 0x00] == 300
If you have big endianness (e.g. 300 == {0, 1, 44}) you have to swap bytes:
private static int FromInt24(byte[] data) {
    int result = (data[0] << 16) | (data[1] << 8) | data[2];
    return data[0] < 128
        ? result
        : -((~result & 0xFFFFFF) + 1);
}
I have some code that manages data received from an array of sensors. The PIC that controls the sensors uses 8 SAR-ADCs in parallel to read 4096 data bytes. It means it reads the most significant bit for the first 8 bytes; then it reads their second bit and so on until the eighth (least significant bit).
Basically, for each 8 bytes it reads, it creates (and sends forth to the computer) 8 bytes as follows:
// rxData[0] = MSB[7] MSB[6] MSB[5] MSB[4] MSB[3] MSB[2] MSB[1] MSB[0]
// rxData[1] = B6[7] B6[6] B6[5] B6[4] B6[3] B6[2] B6[1] B6[0]
// rxData[2] = B5[7] B5[6] B5[5] B5[4] B5[3] B5[2] B5[1] B5[0]
// rxData[3] = B4[7] B4[6] B4[5] B4[4] B4[3] B4[2] B4[1] B4[0]
// rxData[4] = B3[7] B3[6] B3[5] B3[4] B3[3] B3[2] B3[1] B3[0]
// rxData[5] = B2[7] B2[6] B2[5] B2[4] B2[3] B2[2] B2[1] B2[0]
// rxData[6] = B1[7] B1[6] B1[5] B1[4] B1[3] B1[2] B1[1] B1[0]
// rxData[7] = LSB[7] LSB[6] LSB[5] LSB[4] LSB[3] LSB[2] LSB[1] LSB[0]
This pattern is repeated for all the 4096 bytes the system reads and I process.
Imagine that each 8 bytes read are taken separately, we can then see them as an 8-by-8 array of bits. I need to mirror this array around the diagonal going from its bottom-left (LSB[7]) to its top-right (MSB[0]). Once this is done, the resulting 8-by-8 array of bits contains in its rows the correct data bytes read from the sensors. I used to perform this operation on the PIC controller, using left shifts and so on, but that slowed down the system quite a lot. Thus, this operation is now performed on the computer where we process the data, using the following code:
BitArray ba = new BitArray(rxData);
BitArray ba2 = new BitArray(ba.Count);
for (int i = 0; i < ba.Count; i++)
{
    // Within each 64-bit (8-byte) block, pick the source bit that sits at the
    // mirrored position across the bottom-left/top-right diagonal.
    ba2[i] = ba[(i / 64 + 1) * 64 - 1 - (i % 8) * 8 - i / 8 + (i / 64) * 8];
}
byte[] data = new byte[rxData.Length];
ba2.CopyTo(data, 0);
Note that THIS CODE WORKS.
rxData is the received byte array.
The formula I use for the index of ba[] in the loop codes for the mirroring of the arrays I described above. The size of the array is checked elsewhere to make sure it always contains the correct number (4096) of bytes.
So far this was the background for my problem.
In each processing loop of my system I need to perform that mirroring twice, because my data processing is on the difference between two arrays acquired consecutively. Speed is important for my system (possibly the main constraint on the processing), and the mirroring accounts for between 10% and 30% of the execution time of my processing.
I would like to know if there are alternative solutions I might compare against my mirroring code that could improve performance. Using BitArray is the only way I found to address the individual bits in the received bytes.
Operating on separate bits is very slow, and creating two bit arrays and copying data back and forth adds further overhead.
The simplest obvious solution is just extracting the bits and combining them again. You can do it with a loop, but since the shift amount can be negative (both left and right shifts are needed), you need a helper to handle that; a sketch follows, and after it the unrolled version, which is easier to follow and faster.
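The loop form might look like this (a sketch of my own; the branch handles the negative shifts mentioned above):
static void Mirror8x8(byte[] rx, byte[] dst)
{
    for (int i = 0; i < 8; i++)
    {
        int acc = 0;
        int mask = 0x80 >> i;           // the bit of every input byte that feeds output byte i
        for (int j = 0; j < 8; j++)
        {
            int bit = rx[j] & mask;
            int shift = i - j;          // can be negative, hence the branch
            acc |= shift >= 0 ? bit << shift : bit >> -shift;
        }
        dst[i] = (byte)acc;
    }
}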
byte[] output = new byte[8]; // "out" is a reserved word in C#, so the buffer is named output here
output[0] = (byte)(((rxData[0] & 0x80) >> 0) | ((rxData[1] & 0x80) >> 1) | ((rxData[2] & 0x80) >> 2) | ((rxData[3] & 0x80) >> 3) |
                   ((rxData[4] & 0x80) >> 4) | ((rxData[5] & 0x80) >> 5) | ((rxData[6] & 0x80) >> 6) | ((rxData[7] & 0x80) >> 7));
output[1] = (byte)(((rxData[0] & 0x40) << 1) | ((rxData[1] & 0x40) >> 0) | ((rxData[2] & 0x40) >> 1) | ((rxData[3] & 0x40) >> 2) |
                   ((rxData[4] & 0x40) >> 3) | ((rxData[5] & 0x40) >> 4) | ((rxData[6] & 0x40) >> 5) | ((rxData[7] & 0x40) >> 6));
output[2] = (byte)(((rxData[0] & 0x20) << 2) | ((rxData[1] & 0x20) << 1) | ((rxData[2] & 0x20) >> 0) | ((rxData[3] & 0x20) >> 1) |
                   ((rxData[4] & 0x20) >> 2) | ((rxData[5] & 0x20) >> 3) | ((rxData[6] & 0x20) >> 4) | ((rxData[7] & 0x20) >> 5));
output[3] = (byte)(((rxData[0] & 0x10) << 3) | ((rxData[1] & 0x10) << 2) | ((rxData[2] & 0x10) << 1) | ((rxData[3] & 0x10) >> 0) |
                   ((rxData[4] & 0x10) >> 1) | ((rxData[5] & 0x10) >> 2) | ((rxData[6] & 0x10) >> 3) | ((rxData[7] & 0x10) >> 4));
output[4] = (byte)(((rxData[0] & 0x08) << 4) | ((rxData[1] & 0x08) << 3) | ((rxData[2] & 0x08) << 2) | ((rxData[3] & 0x08) << 1) |
                   ((rxData[4] & 0x08) >> 0) | ((rxData[5] & 0x08) >> 1) | ((rxData[6] & 0x08) >> 2) | ((rxData[7] & 0x08) >> 3));
output[5] = (byte)(((rxData[0] & 0x04) << 5) | ((rxData[1] & 0x04) << 4) | ((rxData[2] & 0x04) << 3) | ((rxData[3] & 0x04) << 2) |
                   ((rxData[4] & 0x04) << 1) | ((rxData[5] & 0x04) >> 0) | ((rxData[6] & 0x04) >> 1) | ((rxData[7] & 0x04) >> 2));
output[6] = (byte)(((rxData[0] & 0x02) << 6) | ((rxData[1] & 0x02) << 5) | ((rxData[2] & 0x02) << 4) | ((rxData[3] & 0x02) << 3) |
                   ((rxData[4] & 0x02) << 2) | ((rxData[5] & 0x02) << 1) | ((rxData[6] & 0x02) >> 0) | ((rxData[7] & 0x02) >> 1));
output[7] = (byte)(((rxData[0] & 0x01) << 7) | ((rxData[1] & 0x01) << 6) | ((rxData[2] & 0x01) << 5) | ((rxData[3] & 0x01) << 4) |
                   ((rxData[4] & 0x01) << 3) | ((rxData[5] & 0x01) << 2) | ((rxData[6] & 0x01) << 1) | ((rxData[7] & 0x01) >> 0));
Obviously this is still very slow because it operates byte-by-byte. An optimal solution would operate on multiple bytes at once, for example with SIMD and/or multithreading. Especially since you're doing that for lots of data, .NET SIMD intrinsics would be extremely useful
You will probably find that BitVector32 performs a good deal better than BitArray.
BitVector32 is more efficient than BitArray for Boolean values and small integers that are used internally. A BitArray can grow indefinitely as needed, but it has the memory and performance overhead that a class instance requires. In contrast, a BitVector32 uses only 32 bits.
http://msdn.microsoft.com/en-us/library/system.collections.specialized.bitvector32.aspx
If you initialize an array of BitVector32 and operate on those, it should be faster than operating on BitArray as you do now.
You may also get a performance boost if you use one thread to perform the mirroring and a second thread to perform the analysis of consecutive reads. The Task Parallel Library Dataflow provides a nice framework for that type of solution. You could have one Source Block to acquire the data buffer, one Transform Block to perform the mirroring, and one Target Block to perform the data processing.
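A minimal sketch of that pipeline (Mirror and Process stand in for your own mirroring and analysis routines; only the shape of the pipeline is the point here):
using System.Threading.Tasks.Dataflow;

var mirror  = new TransformBlock<byte[], byte[]>(buffer => Mirror(buffer));
var analyze = new ActionBlock<byte[]>(mirrored => Process(mirrored));
mirror.LinkTo(analyze, new DataflowLinkOptions { PropagateCompletion = true });

// The acquisition loop plays the role of the source block:
mirror.Post(rxData);   // post each 4096-byte buffer as it arrives
mirror.Complete();     // when acquisition ends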
This is actually the same as the "get a column of a bitboard" problem, so it can be solved even more efficiently by treating each 8-byte group as a 64-bit integer:
static byte get_byte(ulong matrix, int col)
{
    const ulong column_mask = 0x8080808080808080UL; // bit 7 of every byte
    const ulong magic = 0x2040810204081UL;          // gathers the 8 masked bits into the top byte
    ulong column = ((matrix << col) & column_mask) * magic;
    return (byte)(column >> 56);
}
// Actually the below step is not needed. You can read rxData directly into the `ulong`
// variable instead of a bit array. Remember to CHANGE THE ENDIANNESS if necessary
ulong matrix = ((ulong)rxData[7] << 56) | ((ulong)rxData[6] << 48) | ((ulong)rxData[5] << 40) | ((ulong)rxData[4] << 32)
             | ((ulong)rxData[3] << 24) | ((ulong)rxData[2] << 16) | ((ulong)rxData[1] << 8) | rxData[0];
for (int i = 0; i < 8; i++)
    data[i] = get_byte(matrix, i);
On newer x86 CPUs you can use the PEXT instruction from the BMI2 instruction set, which extracts the bits selected by a mask and packs them into the low bits. I'm not sure if there's a corresponding intrinsic in C#. If the intrinsic doesn't exist then you have to use native code like this
data[i] = _pext_u64(matrix, column_mask >> i);
Update:
The instruction has been exposed in .NET Core 3.0 as the ParallelBitExtract() intrinsic, so it's now much easier and faster to do this from C#
ulong matrix = BitConverter.ToUInt64(rxData, 0);
for (int i = 0; i < 8; i++)
    data[i] = (byte)Bmi2.X64.ParallelBitExtract(matrix, 0x8080808080808080UL >> i);
There's also ParallelBitDeposit() for the inverse PDEP instruction
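A sketch of applying this across the whole 4096-byte buffer, 8 bytes at a time (my own illustration; it assumes Bmi2.X64.IsSupported is true, with a fallback needed otherwise, and that rxData.Length is a multiple of 8):
using System.Runtime.Intrinsics.X86;

byte[] data = new byte[rxData.Length];
for (int block = 0; block < rxData.Length; block += 8)
{
    ulong m = BitConverter.ToUInt64(rxData, block);
    for (int i = 0; i < 8; i++)
        data[block + i] = (byte)Bmi2.X64.ParallelBitExtract(m, 0x8080808080808080UL >> i);
}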
Ok, so I have two methods
public static long ReadLong(this byte[] data)
{
    if (data.Length < 8) throw new ArgumentOutOfRangeException("Not enough data");
    long length = data[0] | data[1] << 8 | data[2] << 16 | data[3] << 24
        | data[4] << 32 | data[5] << 40 | data[6] << 48 | data[7] << 56;
    return length;
}
public static void WriteLong(this byte[] data, long i)
{
    if (data.Length < 8) throw new ArgumentOutOfRangeException("Not enough data");
    data[0] = (byte)((i >> (8*0)) & 0xFF);
    data[1] = (byte)((i >> (8*1)) & 0xFF);
    data[2] = (byte)((i >> (8*2)) & 0xFF);
    data[3] = (byte)((i >> (8*3)) & 0xFF);
    data[4] = (byte)((i >> (8*4)) & 0xFF);
    data[5] = (byte)((i >> (8*5)) & 0xFF);
    data[6] = (byte)((i >> (8*6)) & 0xFF);
    data[7] = (byte)((i >> (8*7)) & 0xFF);
}
So WriteLong works correctly (verified against BitConverter.GetBytes()). The problem is ReadLong. I have a fairly good understanding of this stuff, but I'm guessing the OR operations are happening as 32-bit ints, so at Int32.MaxValue it rolls over. I'm not sure how to avoid that. My first instinct was to make an int from the lower half and an int from the upper half and combine them, but I'm not quite knowledgeable enough to know where to start with that, so this is what I tried:
public static long ReadLong(byte[] data)
{
    if (data.Length < 8) throw new ArgumentOutOfRangeException("Not enough data");
    long l1 = data[0] | data[1] << 8 | data[2] << 16 | data[3] << 24;
    long l2 = data[4] | data[5] << 8 | data[6] << 16 | data[7] << 24;
    return l1 | l2 << 32;
}
This didn't work though, at least not for larger numbers; it seems to work for everything below zero.
Here's how I run it
void Main()
{
    var larr = new long[5] {
        long.MinValue,
        0,
        long.MaxValue,
        1,
        -2000000000
    };
    foreach (var l in larr)
    {
        var arr = new byte[8];
        arr.WriteLong(l); // arrays are reference types, so no ref is needed
        Console.WriteLine(ByteString(arr));
        var end = ReadLong(arr);
        var end2 = BitConverter.ToInt64(arr, 0);
        Console.WriteLine(l + " == " + end + " == " + end2);
    }
}
and here's what I get(using the modified ReadLong method)
0:0:0:0:0:0:0:128
-9223372036854775808 == -9223372036854775808 == -9223372036854775808
0:0:0:0:0:0:0:0
0 == 0 == 0
255:255:255:255:255:255:255:127
9223372036854775807 == -1 == 9223372036854775807
1:0:0:0:0:0:0:0
1 == 1 == 1
0:108:202:136:255:255:255:255
-2000000000 == -2000000000 == -2000000000
The problem is not the OR, it is the bit shift. This has to be done as longs; currently, the data[i] are implicitly converted to int. Just change that to long and that's it. I.e.
public static long ReadLong(byte[] data)
{
    if (data.Length < 8) throw new ArgumentOutOfRangeException("Not enough data");
    long length = (long)data[0] | (long)data[1] << 8 | (long)data[2] << 16 | (long)data[3] << 24
        | (long)data[4] << 32 | (long)data[5] << 40 | (long)data[6] << 48 | (long)data[7] << 56;
    return length;
}
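A quick round-trip sanity check against BitConverter (my own sketch; it assumes a little-endian host and the WriteLong extension from the question):
foreach (long v in new[] { long.MinValue, -1L, 0L, 1L, long.MaxValue })
{
    var buf = new byte[8];
    buf.WriteLong(v);
    Console.WriteLine(ReadLong(buf) == v && BitConverter.ToInt64(buf, 0) == v); // True
}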
You are doing int arithmetic and then assigning to long, try:
long length = data[0] | data[1] << 8L | data[2] << 16L | data[3] << 24L
| data[4] << 32L | data[5] << 40L | data[6] << 48L | data[7] << 56L;
This should define your constants as longs forcing it to use long arithmetic.
EDIT: Turns out this may not work; as the comments below point out, while a shift accepts several types on the left, the right operand must be an int. Georg's should be the accepted answer.
I'm trying to convert 4 bytes into a 32 bit unsigned integer.
I thought maybe something like:
UInt32 combined = (UInt32)((map[i] << 32) | (map[i+1] << 24) | (map[i+2] << 16) | (map[i+3] << 8));
But this doesn't seem to be working. What am I missing?
Your shifts are all off by 8. Shift by 24, 16, 8, and 0.
Use the BitConverter class. Specifically, this overload: BitConverter.ToInt32() (and since you want an unsigned result, there is BitConverter.ToUInt32() as well).
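For the example in the question, that would be something like this (it reads in host byte order, so little-endian on typical x86 hardware):
// Reads 4 bytes starting at offset i.
UInt32 combined = BitConverter.ToUInt32(map, i);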
You can always do something like this:
public static unsafe int ToInt32(byte[] value, int startIndex)
{
    fixed (byte* numRef = &(value[startIndex]))
    {
        if ((startIndex % 4) == 0)
        {
            return *(((int*) numRef));
        }
        if (IsLittleEndian)
        {
            return (((numRef[0] | (numRef[1] << 8)) | (numRef[2] << 0x10)) | (numRef[3] << 0x18));
        }
        return ((((numRef[0] << 0x18) | (numRef[1] << 0x10)) | (numRef[2] << 8)) | numRef[3]);
    }
}
But this would be reinventing the wheel, as this is actually how BitConverter.ToInt32() is implemented.