How do I Marshal out a Subset byte[] Array in Managed Code? - c#

Consider the following method (used in a factory method):
private Packet(byte[] rawBytes, int startIndex)
{
    m_packetId = BitConverter.ToUInt32(rawBytes, startIndex);
    m_dataLength = BitConverter.ToUInt16(rawBytes, startIndex + 4);
    if (this.Type != PacketType.Data)
        return;
    m_bytes = new byte[m_dataLength];
    Array.Copy(rawBytes, startIndex + Packet.HeaderSize, m_bytes, 0, m_dataLength);
}
The last two lines of code strike me as wasteful: allocating more memory and populating it with values that are already in memory seems silly.
With unmanaged code, something like this is possible:
m_bytes = (rawBytes + (startIndex + Packet.HeaderSize));
(I didn't run it through a compiler, so the syntax is probably off, but you can see it's just a matter of pointer manipulation.)
I ran into a similar problem the other day when an API returned a byte[] array that was really a short[] array.
Are these types of array permutations just the cost of using managed code or is there a new school of thinking that I'm just missing?
Thanks in advance.

Have you considered restructuring so that maybe the copy is not necessary?
First option: store a reference to rawBytes in m_bytes and store the offset that needs to be added to all accesses into this byte array.
Second option: make m_bytes a MemoryStream instead, constructed from the buffer, an offset and a length; this constructor also does not copy the byte buffer and just allows access to the specified sub-segment.
Keep in mind though that the price of avoiding the copy operation is that the rawBytes and m_bytes (array or stream) will be aliases, so changes to one will change the other too.
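For the second option, a minimal sketch (assuming the m_bytes field becomes a MemoryStream; the other names are from the question):

// No copy: the stream reads the sub-segment of rawBytes in place.
// Passing false for writable prevents modifying the shared buffer through the stream.
m_bytes = new MemoryStream(rawBytes, startIndex + Packet.HeaderSize,
                           m_dataLength, false);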

Yes, this is more costly in managed code but there is nothing stopping you from making this method unsafe and doing the pointer manipulation you wish to do.
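For instance, a hedged sketch of the unsafe variant (same field layout as the question's constructor; note the pointer is only valid inside the fixed block):

private unsafe Packet(byte[] rawBytes, int startIndex)
{
    fixed (byte* pRaw = &rawBytes[startIndex])
    {
        m_packetId = *(uint*)pRaw;           // 4-byte id
        m_dataLength = *(ushort*)(pRaw + 4); // 2-byte length
    }
    // rawBytes is unpinned after the fixed block, so the payload still
    // cannot be kept as a raw pointer; see the aliasing options above.
}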

Related

Optimizing several million char* to string conversions

I have an application that needs to take in several million char*'s as an input parameter (typically strings of fewer than 512 characters, in Unicode), and convert and store them as .NET strings.
It's turning out to be a real bottleneck in the performance of my application. I'm wondering if there's some design pattern or idea to make it more efficient.
There is a key part that makes me feel like it can be improved: There are a LOT of duplicates. Say 1 million objects are coming in, there might only be like 50 unique char* patterns.
For the record, here is the algorithm I'm using to convert char* to String (this method is C++/CLI, but the rest of the project is in C#):
String ^StringTools::MbCharToStr ( const char *Source )
{
    String ^str;
    if( (Source == NULL) || (Source[0] == '\0') )
    {
        str = gcnew String("");
    }
    else
    {
        // Find the number of UTF-16 characters needed to hold the
        // converted UTF-8 string, and allocate a buffer for them.
        const size_t max_strsize = 2048;
        int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
        if (wstr_size < max_strsize)
        {
            // Save the malloc/free overhead if it's a reasonable size.
            // Plus, KJN was having fits with exceptions within exception logging due
            // to a corrupted heap.
            wchar_t wstr[max_strsize];
            (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
            str = gcnew String (wstr);
        }
        else
        {
            wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
            if (wstr == NULL)
                throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);
            // Convert the UTF-8 string into the UTF-16 buffer, construct the
            // result String from the UTF-16 buffer, and then free the buffer.
            (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
            str = gcnew String ( wstr );
            free (wstr);
        }
    }
    return str;
}
You could use each character from the input string to feed a trie structure. At the leaves, have a single .NET string object. Then, when a char* comes in that you've seen previously, you can quickly find the existing .NET version without allocating any memory.
Pseudo-code:
start with an empty trie,
process a char* by searching the trie until you can go no further
add nodes until your entire char* has been encoded as nodes
at the leaf, attach an actual .NET string
The answer to this other SO question should get you started: How to create a trie in c#
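A minimal C# sketch of the trie-cache idea (hypothetical names; assumes the input is available as a byte[], and that convert is whatever conversion routine you already have):

using System;
using System.Collections.Generic;

class TrieCache
{
    private sealed class Node
    {
        public readonly Dictionary<byte, Node> Children = new Dictionary<byte, Node>();
        public string Value; // set only at nodes where a full input ends
    }

    private readonly Node _root = new Node();

    // Walks/extends the trie; convert runs only on a cache miss.
    public string GetOrAdd(byte[] bytes, Func<byte[], string> convert)
    {
        Node node = _root;
        foreach (byte b in bytes)
        {
            Node child;
            if (!node.Children.TryGetValue(b, out child))
                node.Children[b] = child = new Node();
            node = child;
        }
        return node.Value ?? (node.Value = convert(bytes));
    }
}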
There is a key part that makes me feel like it can be improved: There are a LOT of duplicates. Say 1 million objects are coming in, there might only be like 50 unique char* patterns.
If this is the case, you may want to consider storing the "found" patterns in a map (such as a std::map<const char*, gcroot<String^>>, though you'll need a custom comparator for the const char* keys), and use that to return the previously converted value.
There is an overhead to storing the map, doing the comparisons, etc. However, this may be mitigated by the dramatically reduced memory usage (you can reuse the managed string instances), as well as by the saved memory allocations (calloc/free). Also, using malloc instead of calloc would likely be a (very small) improvement, as you don't need to zero out the memory prior to calling MultiByteToWideChar.
I think the first optimization you could make here would be to have your first call to MultiByteToWideChar start with a buffer instead of a null pointer. Because you specified CP_UTF8, MultiByteToWideChar must walk over the whole string to determine the expected length. If there is some length which is longer than the vast majority of your strings, you might consider optimistically allocating a buffer of that size on the stack, and falling back to dynamic allocation only if that fails. That is, move the first branch of your if/else block outside of the if/else.
You might also save some time by calculating the length of the source string once and passing it in explicitly -- that way MultiByteToWideChar doesn't have to do a strlen every time you call it.
That said, it sounds like if the rest of your project is C#, you should use the .NET BCL class libraries designed to do this rather than having a side by side assembly in C++/CLI for the sole purpose of converting strings. That's what System.Text.Encoding is for.
I doubt any kind of caching data structure you could use here is going to make any significant difference.
Oh, and don't ignore the result of MultiByteToWideChar -- not only should you never cast anything to void, you've got undefined behavior in the event MultiByteToWideChar fails.
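For reference, a minimal sketch of the pure-C# route, assuming the raw bytes can be handed over as a byte[]:

using System.Text;

static string Utf8ToString(byte[] utf8)
{
    // Encoding.UTF8.GetString does the MultiByteToWideChar work,
    // including the length calculation and buffer management, in one call.
    return Encoding.UTF8.GetString(utf8);
}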
I would probably use a cache based on a ternary tree structure, or similar, and look up the input string to see if it's already converted before even converting a single character to .NET representation.

Better option for String Manipulation - .NET

I'm working with huge string data for a project in C#. I'm confused about which approach I should use to manipulate my string data.
First Approach:
StringBuilder myString = new StringBuilder().Append(' ', 1024);
while (someString[++counter] != someChar)
    myString[i++] = someString[counter];
Second Approach:
string myString;
int i = counter;
while (someString[++counter] != someChar) ;
myString = someString.Substring(i, counter - i);
Which one of the two would be faster (and more efficient), considering the strings I'm working with are huge?
The strings are already in the RAM.
The size of the string can vary from 32MB-1GB.
You should use IndexOf rather than doing individual character manipulations in a loop, and add whole chunks of string to the result:
StringBuilder myString = new StringBuilder();
int pos = someString.IndexOf(someChar, counter);
myString.Append(someString.Substring(counter, pos - counter));
For "huge" strings, it may make sense to take a streamed approach and not load the whole thing into memory. For the best raw performance, you can sometimes squeeze a little more speed out by using pointer math to search and capture pieces of strings.
To be clear, I'm stating two completely different approaches.
1 - Stream
The OP doesn't say how big these strings are, but it may be impractical to load them into memory. Perhaps they are being read from a file, from a data reader connected to a DB, from an active network connection, etc.
In this scenario, I would open a stream, read forward, buffering my input in a StringBuilder until the criteria was met.
2 - Unsafe Char Manipulation
This requires that you do have the complete string. You can obtain a char* to the start of a string quite simply:
// fix entire string in memory so that we can work w/ memory range safely
fixed (char* pStart = bigString)
{
    char* pChar = pStart; // movable copy of the fixed pointer
    char* pEnd = pStart + bigString.Length;
}
You can now increment pChar and examine each character. You can buffer it (e.g. if you want to examine multiple adjacent characters) or not as you choose. Once you determine the ending memory location, you now have a range of data that you can work with.
Unsafe Code and Pointers in c#
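To make that concrete, a minimal sketch of such a scan (the terminating character 'x' is a placeholder for your real criteria):

unsafe
{
    fixed (char* pStart = bigString)
    {
        char* pChar = pStart;
        char* pEnd = pStart + bigString.Length;
        while (pChar < pEnd && *pChar != 'x') // examine each character
            pChar++;
        // [pStart, pChar) is now the matched range
        int matchLength = (int)(pChar - pStart);
    }
}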
2.1 - A Safer Approach
If you are familiar with unsafe code, it is very fast, expressive, and flexible. If not, I would still use a similar approach, but without the pointer math. This is similar to the approach which @supercat suggested, namely:
Get a char[].
Read through it character by character.
Buffer where needed. StringBuilder is good for this; set an initial size and reuse the instance.
Analyze buffer where needed.
Dump buffer often.
Do something with the buffer when it contains the desired match.
And an obligatory disclaimer for unsafe code: the vast majority of the time the framework methods are a better solution. They are safe, tested, and invoked millions of times per second. Unsafe code puts all of the responsibility on the developer. It does not make any assumptions; it's up to you to be a good framework/OS citizen (e.g. not overwriting immutable strings, not allowing buffer overruns, etc.). Because it makes no assumptions and removes the safeguards, it will often yield a performance increase. It's up to the developer to determine if there is indeed a benefit, and to decide if the advantages are significant enough.
Per request from OP, here are my test results.
Assumptions:
Big string is already in memory, no requirement for reading from disk
Goal is to not use any native pointers/unsafe blocks
The "checking" process is simple enough that something like Regex is not needed. For now simplifying to a single char comparison. The below code can easily be modified to consider multiple chars at once, this should have no effect on the relative performance of the two approaches.
public static void Main()
{
    string bigStr = GenString(100 * 1024 * 1024);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10; i++)
    {
        int counter = -1;
        StringBuilder sb = new StringBuilder();
        while (bigStr[++counter] != 'x')
            sb.Append(bigStr[counter]);
        Console.WriteLine(sb.ToString().Length);
    }
    sw.Stop();
    Console.WriteLine("StringBuilder: {0}", sw.Elapsed.TotalSeconds);

    sw = Stopwatch.StartNew();
    for (int i = 0; i < 10; i++)
    {
        int counter = -1;
        while (bigStr[++counter] != 'x') ;
        Console.WriteLine(bigStr.Substring(0, counter).Length);
    }
    sw.Stop();
    Console.WriteLine("Substring: {0}", sw.Elapsed.TotalSeconds);
}

public static string GenString(int size)
{
    StringBuilder sb = new StringBuilder(size);
    for (int i = 0; i < size - 1; i++)
    {
        sb.Append('a');
    }
    sb.Append('x');
    return sb.ToString();
}
Results (release build, .NET 4):
StringBuilder ~7.9 sec
Substring ~1.9 sec
StringBuilder was consistently > 3x slower, with a variety of different sized strings.
There's an IndexOf operation which would search more quickly for someChar, but I'll assume your real function for finding the desired length is more complicated than that. In that scenario, I would recommend copying someString to a Char[], doing the search, and then using the new String(Char[], Int32, Int32) constructor to produce the final string. Indexing a Char[] is going to be so much more efficient than indexing a String or StringBuilder that, unless you expect to need only a small fraction of the string, copying everything to the Char[] will be a 'win' (unless, of course, you could simply use something like IndexOf).
Even if the length of the string will often be much larger than the length of interest, you may still be best off using a Char[]. Pre-initialize the Char[] to some size, and then do something like:
Char[] temp = new Char[1024];
int i = 0;
while (i < theString.Length)
{
    int subLength = theString.Length - i;
    if (subLength > temp.Length) // May impose other constraints on subLength,
        subLength = temp.Length; // provided it's greater than zero.
    theString.CopyTo(i, temp, 0, subLength);
    // ... do stuff with the array
    i += subLength;
}
Once you're all done, you may then use a single Substring call to construct a string with the necessary characters from the original. If your application requires building a string whose characters differ from the original, you could use a StringBuilder and, within the above loop, use the Append(Char[], Int32, Int32) method to add processed characters to it.
Note also that with the above loop construct, one may reduce subLength at any point in the loop, provided it is not reduced to zero. For example, if one is trying to find whether the string contains a prime number of sixteen or fewer digits enclosed by parentheses, one could start by scanning for an open-paren; if one finds it and the data one is looking for might extend beyond the array, set subLength to the position of the open-paren and re-loop. Such an approach will result in a small amount of redundant copying, but not much (often none), and will eliminate the need to keep track of parsing state between loops. A very convenient pattern.
You always want to use StringBuilder when manipulating strings. This is because strings are immutable, so every manipulation would otherwise create a new object.

Prepend to a C# Array

Given a populated byte[] values in C#, I want to prepend the value (byte)0x00 to the array. I assume this will require making a new array and adding the contents of the old array. Speed is an important aspect of my application. What is the best way to do this?
-- EDIT --
The byte[] is used to store DSA (Digital Signature Algorithm) parameters. The operation will only need to be performed once per array, but speed is important because I am potentially performing this operation on many different byte[]s.
If you are only going to perform this operation once, then there aren't a whole lot of choices. The code provided by Monroe's answer should do just fine.
byte[] newValues = new byte[values.Length + 1];
newValues[0] = 0x00; // set the prepended value
Array.Copy(values, 0, newValues, 1, values.Length); // copy the old values
If, however, you're going to be performing this operation multiple times you have some more choices. There is a fundamental problem that prepending data to an array isn't an efficient operation, so you could choose to use an alternate data structure.
A LinkedList can efficiently prepend data, but it's less efficient in general for most tasks, as it involves a lot more memory allocation/deallocation and also loses memory locality, so it may not be a net win.
A double ended queue (known as a deque) would be a fantastic data structure for you. You can efficiently add to the start or the end, and efficiently access data anywhere in the structure (but you can't efficiently insert somewhere other than the start or end). The major problem here is that .NET doesn't provide an implementation of a deque. You'd need to find a 3rd party library with an implementation.
You can also save yourself a lot of copying by keeping track of the "data that I need to prepend" (using a List/Queue/etc.) and waiting to actually prepend the data as long as possible, so that you minimize the creation of new arrays as much as possible, as well as limiting the number of copies of existing elements.
You could also consider whether you could adjust the structure so that you're adding to the end, rather than the start (even if you know that you'll need to reverse it later). If you are appending a lot in a short space of time, it may be worth storing the data in a List (which can efficiently add to the end) and adding to the end. Depending on your needs, it may even be worth making a class that is a wrapper for a List and that hides the fact that it is reversed. You could make an indexer that maps i to Count-1-i, etc., so that it appears, from the outside, as though your data is stored normally, even though the internal List actually holds the data backwards (see the sketch below).
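A minimal sketch of such a wrapper (a hypothetical class; the inner List stores the items in reverse, so a logical prepend is an amortized O(1) append):

using System.Collections.Generic;

class PrependList<T>
{
    private readonly List<T> _inner = new List<T>(); // held in reverse order

    public int Count { get { return _inner.Count; } }

    // Logical index i maps to physical index Count - 1 - i.
    public T this[int index]
    {
        get { return _inner[_inner.Count - 1 - index]; }
        set { _inner[_inner.Count - 1 - index] = value; }
    }

    // Appends internally, which is a prepend from the outside.
    public void Prepend(T item) { _inner.Add(item); }
}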
Ok guys, let's take a look at the performance issue regarding this question.
This is not an answer, just a microbenchmark to see which option is more efficient.
So, let's set the scenario:
A byte array of 1,000,000 items, randomly populated
We need to prepend item 0x00
We have 3 options:
Manually creating and populating the new array
Manually creating the new array and using Array.Copy (@Monroe)
Creating a list, loading the array, inserting the item and converting the list to an array
Here's the code:
byte[] byteArray = new byte[1000000];
for (int i = 0; i < byteArray.Length; i++)
{
    byteArray[i] = Convert.ToByte(DateTime.Now.Second);
}

Stopwatch stopWatch = new Stopwatch();

//#1 Manually creating and populating a new array
stopWatch.Start();
byte[] extendedByteArray1 = new byte[byteArray.Length + 1];
extendedByteArray1[0] = 0x00;
for (int i = 0; i < byteArray.Length; i++)
{
    extendedByteArray1[i + 1] = byteArray[i];
}
stopWatch.Stop();
Console.WriteLine(string.Format("#1: {0} ms", stopWatch.ElapsedMilliseconds));
stopWatch.Reset();

//#2 Using a new array and Array.Copy
stopWatch.Start();
byte[] extendedByteArray2 = new byte[byteArray.Length + 1];
extendedByteArray2[0] = 0x00;
Array.Copy(byteArray, 0, extendedByteArray2, 1, byteArray.Length);
stopWatch.Stop();
Console.WriteLine(string.Format("#2: {0} ms", stopWatch.ElapsedMilliseconds));
stopWatch.Reset();

//#3 Using a List
stopWatch.Start();
List<byte> byteList = new List<byte>();
byteList.AddRange(byteArray);
byteList.Insert(0, 0x00);
byte[] extendedByteArray3 = byteList.ToArray();
stopWatch.Stop();
Console.WriteLine(string.Format("#3: {0} ms", stopWatch.ElapsedMilliseconds));
stopWatch.Reset();

Console.ReadLine();
And the results are:
#1: 9 ms
#2: 1 ms
#3: 6 ms
I've run it multiple times and I got different numbers, but the proportion is always the same: #2 is always the most efficient choice.
My conclusion: arrays are more efficient than Lists (although they provide less functionality), and somehow Array.Copy is really optimized (I'd like to understand why, though).
Any feedback will be appreciated.
Best regards.
PS: this is not a swordfight post, we are at a Q&A site to learn and teach. And learn.
The easiest and cleanest way for .NET 4.7.1 and above is to use the side-effect free Prepend().
Adds a value to the beginning of the sequence.
Example
// Creating an array of numbers
var numbers = new[] { 1, 2, 3 };
// Trying to prepend any value of the same type
var results = numbers.Prepend(0);
// output is 0, 1, 2, 3
Console.WriteLine(string.Join(", ", results ));
As you surmised, the fastest way to do this is to create new array of length + 1 and copy all the old values over.
If you are going to be doing this many times, then I suggest using a List<byte> instead of byte[], as the cost of reallocating and copying while growing the underlying storage is amortized more effectively; in the usual case, the underlying vector in the List is grown by a factor of two each time an addition or insertion is made to the List that would exceed its current capacity.
...
byte[] newValues = new byte[values.Length + 1];
newValues[0] = 0x00; // set the prepended value
Array.Copy(values, 0, newValues, 1, values.Length); // copy the old values
When I need to append data frequently but also want O(1) random access to individual elements, I'll use an array that is over-allocated by some amount of padding for quickly adding new values. This means you need to store the actual content length in another variable, as array.Length will indicate the length plus the padding. A new value is appended by using one slot of the padding; no allocation and copy are necessary until you run out of padding. In effect, allocation is amortized over several append operations. There are speed/space trade-offs, as if you have many of these data structures you could have a fair amount of padding in use at any one time in the program.
This same technique can be used for prepending. Just as with appending, you can introduce an interface or abstraction between the users and the implementation: you can keep several slots of padding before the start of the data, so that a new memory allocation is only necessary occasionally (a sketch follows at the end of this answer). As some have suggested above, you can also implement a prepending interface with an appending data structure that reverses the indexes.
I'd package the data structure as an implementation of some generic collection interface, so that the interface appears fairly normal to the user (such as an array list or something).
(Also, if removal is supported, it's probably useful to clear elements as soon as they are removed to help reduce gc load.)
The main point is to consider the implementation and the interface separately, as decoupling them gives you the flexibility to choose varied implementations or to hide implementation details using a minimal interface.
There are many other data structures you could use depending on the applicability to your domain. Ropes or Gap Buffer; see What is best data structure suitable to implement editor like notepad?; Trie's do some useful things, too.
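A minimal sketch of the front-padding idea (a hypothetical class; the buffer is filled from the back, so a prepend normally just claims one slot of padding):

class PrependBuffer
{
    private byte[] _buffer = new byte[16];
    private int _start = 16; // first used slot; everything before it is padding

    public int Count { get { return _buffer.Length - _start; } }
    public byte this[int index] { get { return _buffer[_start + index]; } }

    public void Prepend(byte value)
    {
        if (_start == 0)
            Grow(); // out of padding: the occasional reallocation
        _buffer[--_start] = value;
    }

    private void Grow()
    {
        byte[] bigger = new byte[_buffer.Length * 2];
        int newStart = bigger.Length - Count;
        System.Array.Copy(_buffer, _start, bigger, newStart, Count);
        _buffer = bigger;
        _start = newStart;
    }
}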
I know this is a VERY old post, but I actually like using lambda. Sure, my code may NOT be the most efficient way, but it's readable and in one line. I use a combination of .Concat and ArraySegment.
string[] originalStringArray = new string[] { "1", "2", "3", "5", "6" };
int firstElementZero = 0;
int insertAtPositionZeroBased = 3;
string stringToPrepend = "0";
string stringToInsert = "FOUR"; // Deliberate !!!
originalStringArray = new string[] { stringToPrepend }
.Concat(originalStringArray).ToArray();
insertAtPositionZeroBased += 1; // BECAUSE we prepended !!
originalStringArray = new ArraySegment<string>(originalStringArray, firstElementZero, insertAtPositionZeroBased)
.Concat(new string[] { stringToInsert })
.Concat(new ArraySegment<string>(originalStringArray, insertAtPositionZeroBased, originalStringArray.Length - insertAtPositionZeroBased)).ToArray();
The best choice depends on what you're going to be doing with this collection later on down the line. If that's the only length-changing edit that will ever be made, then your best bet is to create a new array with one additional slot and use Array.Copy() to do the rest. No need to initialize the first value, since new C# arrays are always zeroed out:
byte[] PrependWithZero(byte[] input)
{
var newArray = new byte[input.Length + 1];
Array.Copy(input, 0, newArray, 1, input.Length);
return newArray;
}
If there are going to be other length-changing edits that might happen, the most performant option might be to use a List<byte> all along, as long as the additions aren't always to the beginning. (If that's the case, even a linked list might not be an option that you can dismiss out of hand.):
var list = new List<byte>(input);
list.Insert(0, 0);
I am aware this is an over-4-year-old post with an accepted answer, but for those to whom it might be relevant: Buffer.BlockCopy would be faster.
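A sketch of that variant (the same shape as the accepted answer; for a byte[], the offsets and count are in bytes):

byte[] newValues = new byte[values.Length + 1];
// newValues[0] is already 0x00, since new arrays are zero-initialized
Buffer.BlockCopy(values, 0, newValues, 1, values.Length);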

Scanning byte array as uint array

I have a large byte array with mostly 0's but some values that I need to process. If this were C++ or unsafe C#, I would use a 32-bit pointer and only look at the individual bytes if the current 32 bits were not 0. This enables much faster scanning through the all-zero blocks. Unfortunately this must be safe C# :-)
I could use a uint array instead of a byte array and then manipulate the individual bytes, but it makes what I'm doing much messier than I'd like. I'm looking for something simpler, like the pointer example (I miss pointers, sigh).
Thanks!
If the code must be safe, and you don't want to use a larger type and "shift", then you'll have to iterate each byte.
(edit) If the data is sufficiently sparse, you could use a dictionary to store the non-zero values; then finding the non-zeros is trivial (and enormous but sparse arrays become cheap).
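A minimal sketch of that idea, assuming the producer can populate the dictionary (keyed by array index) instead of, or alongside, the big array; process is a placeholder for whatever you do with each value:

using System;
using System.Collections.Generic;

static void ScanSparse(Dictionary<int, byte> nonZero, Action<int, byte> process)
{
    // Only non-zero entries exist, so the all-zero blocks are never visited.
    foreach (KeyValuePair<int, byte> kvp in nonZero)
        process(kvp.Key, kvp.Value);
}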
I'd follow what this guy said:
Using SSE in c# is it possible?
Basically, write a little bit of C/C++, possibly using SSE, to implement the scanning part efficiently, and call it from C#.
You can access the characters
string.ToCharArray()
Or you can access the raw byte[]
System.Text.Encoding.UTF8.GetBytes(stringValue)
Ultimately, what I think you'd need here is
MemoryStream stream;
stream.Write(...)
then you will be able to directly handle the memory's buffer
There is also UnmanagedMemoryStream, but I'm not sure whether it uses unsafe calls inside
You can use the BitConverter class:
byte[] byteArray = GetByteArray(); // or whatever
for (int i = 0; i < byteArray.Length; i += 4) // 4 bytes per 32-bit value
{
    uint x = BitConverter.ToUInt32(byteArray, i);
    // do what you want with x
}
Another option is to create a MemoryStream from the byte array, and then use a BinaryReader to read 32-bit values from it.
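A sketch of that route (assumes byteArray.Length is a multiple of 4; BinaryReader and MemoryStream live in System.IO):

using (var reader = new BinaryReader(new MemoryStream(byteArray)))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        uint x = reader.ReadUInt32();
        if (x != 0)
        {
            // at least one of these four bytes is non-zero: inspect them
        }
    }
}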

C# byte[] substring? (design)

I'm downloading some files asynchronously into a large byte array, and I have a callback that fires off periodically whenever some data is added to that array. If I want to give developers the ability to use the last chunk of data that was added to array, then... well how would I do that? In C++ I could give them a pointer to somewhere in the middle, and then perhaps tell them the number of bytes that were added in the last operation so they at least know the chunk they should be looking at... I don't really want to give them a 2nd copy of that data, that's just wasteful.
I'm just thinking if people want to process this data before the file has completed downloading. Would anyone actually want to do that? Or is it a useless feature anyway? I already have a callback for when the buffer (entire byte array) is full, and then they can dump the whole thing without worrying about start and end points...
.NET has a struct that does exactly what you want:
System.ArraySegment.
In any case, it's easy to implement it yourself too - just make a constructor that takes a base array, an offset, and a length. Then implement an indexer that offsets indexes behind the scenes, so your ArraySegment can be seamlessly used in the place of an array.
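A small usage sketch (buffer, offset and count are placeholders for the download state):

var chunk = new ArraySegment<byte>(buffer, offset, count);
// Consumers index relative to the segment, not the whole buffer:
for (int i = 0; i < chunk.Count; i++)
{
    byte b = chunk.Array[chunk.Offset + i];
    // process b
}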
You can't give them a pointer into the array, but you could give them the array and start index and length of the new data.
But I have to wonder what someone would use this for. Is this a known need, or are you just guessing that someone might want it someday? And if so, is there any reason why you couldn't wait to add the capability until someone actually needs it?
Whether this is needed or not depends on whether you can afford to accumulate all the data from a file before processing it, or whether you need to provide a streaming mode where you process each chunk as it arrives. This depends on two things: how much data there is (you probably would not want to accumulate a multi-gigabyte file), and how long it takes the file to completely arrive (if you are getting the data over a slow link you might not want your client to wait till it had all arrived). So it is a reasonable feature to add, depending on how the library is to be used. Streaming mode is usually a desirable attribute, so I would vote for implementing the feature. However, the idea of putting the data into an array seems wrong, because it fundamentally implies a non-streaming design, and because it requires an additional copy. What you could do instead is to keep each chunk of arriving data as a discrete piece. These could be stored in a container for which adding at the end and removing from the front is efficient.
Copying a chunk of a byte array may seem "wasteful," but then again, object-oriented languages like C# tend to be a little more wasteful than procedural languages anyway. A few extra CPU cycles and a little extra memory consumption can greatly reduce complexity and increase flexibility in the development process. In fact, copying bytes to a new location in memory to me sounds like good design, as opposed to the pointer approach which will give other classes access to private data.
But if you do want to use pointers, C# does support them. Here is a decent-looking tutorial. The author is correct when he states, "...pointers are only really needed in C# where execution speed is highly important."
I agree with the OP: sometimes you just plain need to pay some attention to efficiency. I don't think the example of providing an API is the best, because that certainly calls for leaning toward safety and simplicity over efficiency.
However, a simple example is when processing large numbers of huge binary files that have zillions of records in them, such as when writing a parser. Without using a mechanism such as System.ArraySegment, the parser becomes a big memory hog, and is greatly slowed down by creating a zillion new data elements, copying all the memory over, and fragmenting the heck out of the heap. It's a very real performance issue. I write these kinds of parsers all the time for telecommunications stuff which generate millions of records per day in each of several categories from each of many switches with variable length binary structures that need to be parsed into databases.
Using the System.ArraySegment mechanism versus creating new structure copies for each record tremendously speeds up the parsing, and greatly reduces the peak memory consumption of the parser. These are very real advantages because the servers run multiple parsers, run them frequently, and speed and memory conservation = very real cost savings in not having to have so many processors dedicated to the parsing.
System.ArraySegment is very easy to use. Here's a simple example of providing a basic way to track the individual records in a typical big binary file full of records with a fixed-length header and a variable-length record size (obvious exception control deleted):
public struct MyRecord
{
    public ArraySegment<byte> header;
    public ArraySegment<byte> data;
}

public class Parser
{
    const int HEADER_SIZE = 10;
    const int HDR_OFS_REC_TYPE = 0;
    const int HDR_OFS_REC_LEN = 4;

    byte[] m_fileData;
    List<MyRecord> records = new List<MyRecord>();

    bool Parse(FileStream fs)
    {
        int fileLen = (int)fs.Length;
        m_fileData = new byte[fileLen];
        fs.Read(m_fileData, 0, fileLen);
        fs.Close();
        fs.Dispose();
        int offset = 0;
        while (offset + HEADER_SIZE < fileLen)
        {
            int recType = (int)m_fileData[offset];
            switch (recType) { /* puke if not a recognized type */ }
            int varDataLen = ((int)m_fileData[offset + HDR_OFS_REC_LEN]) * 256
                           + (int)m_fileData[offset + HDR_OFS_REC_LEN + 1];
            if (offset + HEADER_SIZE + varDataLen > fileLen) { /* puke as file has odd bytes at end */ }
            MyRecord rec = new MyRecord();
            rec.header = new ArraySegment<byte>(m_fileData, offset, HEADER_SIZE);
            rec.data = new ArraySegment<byte>(m_fileData, offset + HEADER_SIZE, varDataLen);
            records.Add(rec);
            offset += HEADER_SIZE + varDataLen;
        }
        return true;
    }
}
The above example gives you a list with ArraySegments for each record in the file while leaving all the actual data in place in one big array per file. The only overhead are the two array segments in the MyRecord struct per record. When processing the records, you have the MyRecord.header.Array and MyRecord.data.Array properties which allow you to operate on the elements in each record as if they were their own byte[] copies.
I think you shouldn't bother.
Why on earth would anyone want to use it?
That sounds like you want an event.
public class ArrayChangedEventArgs : EventArgs
{
    public ArrayChangedEventArgs(byte[] array, int start, int length)
    {
        Array = array;
        Start = start;
        Length = length;
    }
    public byte[] Array { get; private set; }
    public int Start { get; private set; }
    public int Length { get; private set; }
}
// ...
// and in your class:
public event EventHandler<ArrayChangedEventArgs> ArrayChanged;
protected virtual void OnArrayChanged(ArrayChangedEventArgs e)
{
    // using a temporary variable avoids a common potential multithreading issue
    // where the multicast delegate changes midstream.
    // Best practice is to grab a copy first, then test for null
    EventHandler<ArrayChangedEventArgs> handler = ArrayChanged;
    if (handler != null)
    {
        handler(this, e);
    }
}
// finally, your code that downloads a chunk just needs to call OnArrayChanged()
// with the appropriate args
Clients hook into the event and get called when things change. This is what most client code in .NET expects to have in an API ("call me when something happens"). They can hook into the code with something as simple as:
yourDownloader.ArrayChanged += (sender, e) =>
    Console.WriteLine(String.Format("Just downloaded {0} byte{1} at position {2}.",
        e.Length, e.Length == 1 ? "" : "s", e.Start));
