This method gets:
IEnumerable<object[]> - every array has a fixed size (it represents a relational data structure).
DataEnumerable.Column[] - some metadata columns; mostly they will have the same value for all rows.
Expected outcome:
Each "row" should get a value for each of these columns (so the data structure remains relational).
private IEnumerable<object[]> BindExtraColumns(IEnumerable<object[]> baseData, int dataSize, DataEnumerable.Column[] columnsToAdd)
{
    int extraColumnsLength = columnsToAdd.Length;
    object[] row = new object[dataSize + extraColumnsLength];
    string columnName;
    int rowNumberColumnIndex = -1;
    for (int i = 0; i < extraColumnsLength; i++)
    {
        // Assign values that don't change between rows..
        // Assign rowNumberColumnIndex if a row number column exists
    }
    // Assign values that change per row here; since we currently support only row number
    // it's not generic enough
    if (rowNumberColumnIndex != -1)
    {
        int rowNumber = 1;
        foreach (var baseRow in baseData)
        {
            row[rowNumberColumnIndex] = rowNumber;
            Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
            yield return row;
            rowNumber++;
        }
    }
    else
    {
        foreach (var baseRow in baseData)
        {
            Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
            yield return row;
        }
    }
}
This method can be called from hundreds of threads with relatively big data sets, so performance here is critical, and I tried to create as few new objects as possible.
Please note - this is a private method, used ONLY BY a DataReader, which reads each row and copies it into another array immediately before reading the next row.
So - can the array copying here be optimized somehow, and should I (carefully) use more memory to boost things here?
Thanks
Your code is fundamentally broken. You're just returning a reference to the same array every time, which means that unless the caller uses the data within each item immediately, it effectively gets lost. For example, suppose I use:
List<object[]> rows = BindExtraColumns(data, size, toAdd).ToList();
Then when I iterate over the rows, I find the same data in every row. That's really not a good experience.
I think it would make much more sense to create a new array for each iteration. Yes, that's a lot of extra memory being used - but it doesn't surprise callers nearly as much.
If you really don't want to do that, I suggest you change the approach so that the caller has to pass in an Action<object[]> to be executed on each row, with the documented proviso that if the caller stashes a reference to the array, they may well be surprised by the results.
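For example, a minimal sketch of that callback-based shape (keeping your single reused array; the delegate parameter name is mine, and the extra-column setup is elided just as in your question):

private void BindExtraColumns(IEnumerable<object[]> baseData, int dataSize, DataEnumerable.Column[] columnsToAdd, Action<object[]> processRow)
{
    int extraColumnsLength = columnsToAdd.Length;
    object[] row = new object[dataSize + extraColumnsLength];
    // ... populate the fixed extra-column values exactly as before ...
    foreach (var baseRow in baseData)
    {
        Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
        // The caller must use or copy 'row' before this call returns - it will be overwritten.
        processRow(row);
    }
}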
You're obviously very concerned about performance, but if your data is coming from a database I'd expect the array creation/copying performance to be insignificant. You should write the simplest (and most reliable) code that works first, and then benchmark it to see whether it performs well enough. Unless you've got evidence that you need to make this surprising design choice, it feels like you're optimizing way too early.
EDIT: Now we know that it's a private method only used in one specific place, I would still avoid this reuse. It's simply fragile. I really would change to passing in an Action<object[]> or simply copying the data to a new array every time. I certainly wouldn't keep the current approach without strong evidence that it's a bottleneck: as I said before, I'd expect the database communication to be much more important. Leaving timebombs in your code like this very rarely works out well.
If you really, really want to keep doing this, you should document it very strongly, giving severe warnings that the result is non-idiomatic.
In terms of whether there's more optimization you could do - well... one alternative would be to avoid having to work with a single array in the first place. You could create a class which held references to both arrays (the current base row and the fixed data) and exposed an indexer which returned the value from one array or the other based on which index was being requested. We don't know what you're doing with the data ("passes it to another array" doesn't really mean anything) so we don't know whether that's feasible, but it would be efficient and could be implemented without the odd behaviour.
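A rough sketch of that idea (the type and member names here are invented purely for illustration):

// Exposes the fixed extra values plus the current base row as one logical row,
// without copying either array.
public sealed class CombinedRow
{
    private readonly object[] _extraValues; // the mostly-constant extra columns
    private object[] _baseRow;              // swapped in for each new base row

    public CombinedRow(object[] extraValues)
    {
        _extraValues = extraValues;
    }

    public void SetBaseRow(object[] baseRow)
    {
        _baseRow = baseRow;
    }

    public int Length => _extraValues.Length + _baseRow.Length;

    public object this[int index] =>
        index < _extraValues.Length
            ? _extraValues[index]
            : _baseRow[index - _extraValues.Length];
}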
Related
I was told that using a temp object was not the most effective way to swap elements in an array.
Such as:
Object[] objects = new Object[10];
// -- Assign the 10 objects some values
var temp = objects[2];
objects[2] = objects[4];
objects[4] = temp;
Is it really possible to swap the elements of the array without using another object?
I know that with numeric types you can (using arithmetic or XOR tricks), but I cannot figure out how this would be done with any other object type.
Swapping objects with a temporary is the most correct way of doing it. That should rank way higher in your priorities than speed. It's pretty easy to write fast software that outputs garbage.
When dealing with objects, you just cannot do it differently. And this is not at all inefficient. An extra reference variable that points to an already existing object is hardly going to be a problem.
But even with numerical values, most clever techniques fail to produce correct results at some point.
It's possible the person who told you this was thinking of something like this:
objects[2] = Interlocked.Exchange(ref objects[4], objects[2]);
Of course, just because this is one line doesn't mean it isn't also using a temporary variable. It's just hidden in the form of a method parameter (the reference to objects[2] is copied and passed to the Exchange method), making it less obvious.
You could do it with Interlocked.Exchange, but that won't be faster than using a temp variable... not that speed is likely to matter for this sort of problem outside of an interview.
Every example I can find online uses exactly this method to swap 2 elements in an array. In fact, if this were something I were doing often I would definitely consider a generic extension method, akin to this example:
http://w3mentor.com/learn/asp-dot-net-c-sharp/c-collections-and-generics/generically-swapping-two-elements-in-array-using-ccsharp/
The only time you should worry about the cost of a swap is when the element is massive, the time taken to copy it is significant, and you're doing it a lot. For example, if you're sorting 10k items and moving one from position 9999 to position 1, one step at a time, testing whether it should move at each step, then swapping it every single time would be a drain. A more efficient way would be to run all the tests first and move it once.
Swapping via deconstruction:
(first, second) = (second, first);
I've got an array of integers we're getting from a third-party provider. These are meant to be sequential, but for some reason they miss a number (something throws an exception, it's eaten, and the loop continues, missing that index). This causes our system some grief and I'm trying to ensure that the array we're getting is indeed sequential.
The numbers start from varying offsets (sometimes 1000, sometimes 5820, sometimes 0), but whatever the start, the sequence is meant to increase by one from there.
What's the fastest method to verify the array is sequential? Even though it seems to be a required step now, I also have to make sure it doesn't take too long to verify. I am currently starting at the first index, picking up the number, adding one, and making sure the next index contains that, and so on.
EDIT:
The reason why the system fails is because of the way people use the system: it may not always return the tokens the way they were picked initially - long story. The data can't be corrected until it gets to our layer, unfortunately.
If you're sure that the array is sorted and has no duplicates, you can just check:
array[array.Length - 1] == array[0] + array.Length - 1
I think it's worth addressing the bigger issue here: what are you going to do if the data doesn't meet your requirements (sequential, no gaps)?
If you're still going to process the data, then you should probably invest your time in making your system more resilient to gaps or missing entries in the data.
If you need to process the data and it must be clean, you should work with the vendor to make sure they send you well-formed data.
If you're going to skip processing and report an error, then asserting the precondition of no gaps may be the way to go. In C# there's a number of different things you could do:
If the data is sorted and has no dups, just check if LastValue == FirstValue + ArraySize - 1.
If the data is not sorted but dup free, just sort it and do the above.
If the data is not sorted, has dups and you actually want to detect the gaps, I would use LINQ.
List<int> gaps = Enumerable.Range(array.Min(), array.Length).Except(array).ToList();
or better yet (since the high-end value may be out of range):
int minVal = array.Min();
int maxVal = array.Max();
List<int> gaps = Enumerable.Range(minVal, maxVal-minVal+1).Except(array).ToList();
By the way, the whole concept of being passed a dense, gapless array of integers is a bit odd for an interface between two parties, unless there's some additional data associated with them. If there's no other data, why not just send a range {min,max} instead?
for (int i = a.Length - 2; 0 <= i; --i)
{
    if (a[i] + 1 != a[i+1]) return false; // not in sequence
}
return true; // in sequence
Gabe's way is definitely the fastest if the array is sorted. If the array is not sorted, then it would probably be best to sort the array (with merge/shell sort (or something of similar speed)) and then use Gabe's way.
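In other words, something along these lines (assuming the values contain no duplicates, as above):

Array.Sort(array);   // only needed if the input isn't already sorted
bool isSequential = array[array.Length - 1] == array[0] + array.Length - 1;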
I just got some code handed over to me. The code is written in C# and it inserts real-time data into a database every second. The data is accumulated over time, which makes the numbers big.
The data is updated many times within each second; then, at the end of the second, the result is taken and inserted.
We used to address the DataSet rows directly within the second through their properties, so many operations like datavaluerow.meanvalue += mean; could take place.
After running the profiler we figured out that this was degrading performance because of the internal casting involved, so we created a 2D array of decimals on which the updates are carried out; the values are assigned to the DataRows only at the end of the second.
I ran the profiler again and found out that it is still taking a lot of time (although less, when added up, than the time previously spent accessing the DataRows frequently).
The code that is executed at the end of the second is as follows:
public void UpdateDataRows(int tick)
{
    // _table1Values is of type decimal[][]
    for (int i = 0; i < _table1Values.Length; i++)
    {
        _table1Values[i][(int)table1Enum.barDateTime] = tick;
        table1Row[i].ItemArray = _table1Values[i].Cast<object>().ToArray();
    }
    // this process is done for 10 other tables
}
Is there a way to further improve this approach?
One obvious question: why do you have a 2D array of decimals when you're only updating them with integers? Could you get away with an int[][] instead?
Next, why are you accessing (int)table1Enum.barDateTime on each iteration? Given that there's a conversion involved there, you may find it helps if you extract that out of the loop.
However, I suspect the majority of the time is going to be spent in _table1Values[i].Cast<object>().ToArray(). Do you really need to do that? Taking a copy of the decimal[] (or int[]) would be faster than boxing every value on every iteration on every call - and then creating another array.
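Putting those suggestions together, a possible (untested) reshaping of the method - still using decimal[][], and assuming the DataRow columns really do line up one-to-one with the array values - might be:

public void UpdateDataRows(int tick)
{
    int dateTimeIndex = (int)table1Enum.barDateTime;   // hoisted out of the loop

    for (int i = 0; i < _table1Values.Length; i++)
    {
        decimal[] values = _table1Values[i];
        values[dateTimeIndex] = tick;

        // Box with a plain loop instead of Cast<object>().ToArray(),
        // avoiding the LINQ iterator overhead (the boxing itself still happens).
        object[] boxed = new object[values.Length];
        for (int j = 0; j < values.Length; j++)
        {
            boxed[j] = values[j];
        }

        table1Row[i].ItemArray = boxed;
    }
}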
I have a database table with a large number of rows and one numeric column, and I want to represent this data in memory. I could just use one big integer array and this would be very fast, but the number of rows could be too large for this.
Most of the rows (more than 99%) have a value of zero. Is there an effective data structure I could use that would only allocate memory for rows with non-zero values and would be nearly as fast as an array?
Update: as an example, one thing I tried was a Hashtable, reading the original table and adding any non-zero values, keyed by the row number in the original table. I got the value with a function that returned 0 if the requested index wasn't found, or else the value in the Hashtable. This works but is slow as dirt compared to a regular array - I might not be doing it right.
Update 2: here is sample code.
private Hashtable _rowStates;

private void SetRowState(int rowIndex, int state)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        if (state == 0)
        {
            _rowStates.Remove(rowIndex);
        }
        else
        {
            _rowStates[rowIndex] = state;
        }
    }
    else
    {
        if (state != 0)
        {
            _rowStates.Add(rowIndex, state);
        }
    }
}

private int GetRowState(int rowIndex)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        return (int)_rowStates[rowIndex];
    }
    else
    {
        return 0;
    }
}
This is an example of a sparse data structure, and there are multiple ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use it. Two possible strategies are:
1. Store only non-zero values. For each element different from zero, store a pair (index, value); all other values are known to be zero by default. You would also need to store the total number of elements.
2. Compress consecutive zero values. Store a number of (count, value) pairs. For example, if you have 12 zeros in a row followed by 200 and another 22 zeros, then store (12, 0), (1, 200), (22, 0). A minimal sketch of this follows below.
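Something like this, purely as an illustration of the second strategy (not tuned, and assuming lookups are infrequent enough that walking the runs is acceptable):

// Each entry says: "the next 'Count' rows all share this 'Value'".
// e.g. 12 zeros, then a single 200, then 22 more zeros:
var runs = new List<(int Count, int Value)>
{
    (12, 0), (1, 200), (22, 0)
};

// Lookup walks the runs from the start; O(number of runs) per access.
int GetValue(List<(int Count, int Value)> data, int rowIndex)
{
    foreach (var (count, value) in data)
    {
        if (rowIndex < count) return value;
        rowIndex -= count;
    }
    throw new ArgumentOutOfRangeException(nameof(rowIndex));
}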
I would expect that the map/dictionary/hashtable of the non-zero values should be a fast and economical solution.
In Java, using the Hashtable class would introduce locking because it is supposed to be thread-safe. Perhaps something similar has slowed down your implementation.
Update: a bit of Google-fu suggests that the C# Hashtable does incur an overhead for thread safety. Try a Dictionary instead.
How exactly you want to implement it depends on what your requirements are; it's a tradeoff between memory and speed. A pure integer array is the fastest, with constant-time lookups.
Using a hash-based collection such as Hashtable or Dictionary (Hashtable seems to be slower but thread-safe, as others have pointed out) will give you very low memory usage for a sparse data structure like yours, but can be somewhat more expensive when performing lookups. You store a key-value pair for each index with a non-zero value.
You can use ContainsKey to find out whether the key exists but it is significantly faster to use TryGetValue to make the check and fetch the data in one go. For dense data it can be worth it to catch exceptions for missing elements as this will only incur a cost in the exceptional case and not each lookup.
Edited again as I got myself confused - that'll teach me to post when I ought to be sleeping.
You're paying a boxing penalty by using Hashtable. Try switching to a Dictionary<int, int>. Also, how many rows are we talking about - and how fast do you need it?
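Roughly, a Dictionary<int, int> version of the two methods in the question (keeping the same semantics, and using TryGetValue rather than ContainsKey plus an indexer) might look like:

private Dictionary<int, int> _rowStates = new Dictionary<int, int>();

private void SetRowState(int rowIndex, int state)
{
    if (state == 0)
    {
        _rowStates.Remove(rowIndex);    // Remove is a no-op if the key isn't present
    }
    else
    {
        _rowStates[rowIndex] = state;   // adds or overwrites in one call
    }
}

private int GetRowState(int rowIndex)
{
    int state;
    return _rowStates.TryGetValue(rowIndex, out state) ? state : 0;
}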
Create an integer array for the non-zero values and a bit array holding indicators of whether a particular row contains a non-zero value.
You can then find the necessary element in the first array by summing up the bits in the second array from 0 up to the row index position.
I am not sure about the efficiency of this solution, but you can try it. It depends on the scenario you will use it in, but I will describe two that I have in mind. The first solution: if you have just one field of integers, you can simply use a generic list of integers:
List<int> myList = new List<int>();
The second one is almost the same, but you create a list of your own type. For example, if you have two fields, a count and a non-zero value, you can create a class with two properties, then create a list of that class and store the information in it. You could also try a generic linked list. The code for the second solution could look like this:
public class MyDbFields
{
    public MyDbFields(int count, int nonzero)
    {
        Count = count;
        NonZero = nonzero;
    }

    public int Count { get; set; }
    public int NonZero { get; set; }
}
Then you can create a list like this:
List<MyDbFields> fields_list = new List<MyDbFields>();
and then fill it with data:
fields_list.Add(new MyDbFields(100, 11));
I am not sure if this will fully solve your problem, but it is just my suggestion.
If I understand correctly, you cannot just select non-zero rows, because for each row index (aka PK value) your Data Structure will have to be able to report not only the value, but also whether or not it is there at all. So assuming 0 if you don't find it in your Data Structure might not be a good idea.
Just to make sure - exactly how many rows are we talking about here? Millions? A million integers would take up only 4MB RAM as an array. Not much really. I guess it must be at least 100'000'000 rows.
Basically I would suggest a sorted array of integer-pairs for storing non-zero values. The first element in each pair would be the PK value, and this is what the array would be sorted by. The second element would be the value. You can make a DB select that returns only these non-zero values, of course. Since the array will be sorted, you'll be able to use binary search to find your values.
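A sketch of that approach (using two parallel sorted arrays rather than literal pairs; the names and the query are invented for illustration):

// Filled once from something like:
//   SELECT pk, value FROM the_table WHERE value <> 0 ORDER BY pk
private int[] _keys;     // PK values, sorted ascending
private int[] _values;   // value for the PK at the same index in _keys

private int GetValue(int pk)
{
    int index = Array.BinarySearch(_keys, pk);
    return index >= 0 ? _values[index] : 0;   // not present => zero
}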
If there are no "holes" in the PK values, then the only thing you would need besides this would be the minimum and maximum PK values so that you can determine whether a given index belongs to your data set.
If there are unused PK values between the used ones, then you need some other mechanism to determine which PK values are valid. Perhaps a bitmask or another array of valid (or invalid, whichever are fewer) PK values.
If you choose the bitmask way, there is another idea. Use two bits for every PK value. First bit will show if the PK value is valid or not. Second bit will show if it is zero or not. Store all non-zero values in another array. This however will have the drawback that you won't know which array item corresponds to which bitmask entry. You'd have to count all the way from the start to find out. This can be mitigated with some indexes. Say, for every 1000 entries in the value array you store another integer which tells you where this entry is in the bitmask.
Perhaps you are looking in the wrong area - all you are storing for each value is the row number of the database row, which suggests that perhaps you are just using this to retrieve the row?
Why not try indexing your table on the numeric column - this will provide lightning fast access to the table rows for any given numeric value (which appears to be the ultimate objective here?) If it is still too slow you can move the index itself into memory etc.
My point here is that your database may solve this problem more elegantly than you can.
I was working on some code recently and came across a method that had 3 for-loops that worked on 2 different arrays.
Basically, what was happening was a foreach loop would walk through a vector and convert a DateTime from an object, and then another foreach loop would convert a long value from an object. Each of these loops would store the converted value into lists.
The final loop would go through these two lists and store those values into yet another list because one final conversion needed to be done for the date.
Then after all that is said and done, The final two lists are converted to an array using ToArray().
Ok, bear with me, I'm finally getting to my question.
So, I decided to make a single for loop to replace the first two foreach loops and convert the values in one fell swoop (the third loop is quasi-necessary, although I'm sure with some work I could also fold it into the single loop).
But then I read the article "What your computer does while you wait" by Gustav Duarte and started thinking about memory management and what the data was doing while it's being accessed in the for-loop where two lists are being accessed simultaneously.
So my question is, what is the best approach for something like this? Try to condense the for loops so the work happens in as few loops as possible, causing multiple data accesses across the different lists? Or allow the multiple loops and let the system bring in the data it's anticipating? These lists and arrays can be potentially large, and looping through 3 lists, perhaps 4 depending on how ToArray() is implemented, can get very costly (O(n^3)??). But from what I understood in said article and from my CS classes, having to fetch data can be expensive too.
Would anyone like to provide any insight? Or have I completely gone off my rocker and need to relearn what I have unlearned?
Thank you
The best approach? Write the most readable code, work out its complexity, and work out if that's actually a problem.
If each of your loops is O(n), then you've still only got an O(n) operation.
Having said that, it does sound like a LINQ approach would be more readable... and quite possibly more efficient as well. Admittedly we haven't seen the code, but I suspect it's the kind of thing which is ideal for LINQ.
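Without seeing the real code this is only a guess at its shape, but the kind of thing I mean (the names here are invented) is:

// Hypothetical: 'rawDates' and 'rawValues' stand in for the two vectors,
// and the conversions stand in for whatever the original loops actually do.
DateTime[] xArray = rawDates
    .Select(o => Convert.ToDateTime(o))
    .Select(d => d.Date)                 // the "final conversion" on the date
    .ToArray();

long[] yArray = rawValues
    .Select(o => Convert.ToInt64(o))
    .ToArray();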
For reference, the article is at:
What your computer does while you wait - Gustav Duarte
Also, there's a guide to big-O notation.
It's impossible to answer the question without being able to see code/pseudocode. The only reliable answer is "use a profiler". Assuming what your loops are doing is a disservice to you and anyone who reads this question.
Well, you've got complications if the two vectors are of different sizes. As has already been pointed out, this doesn't increase the overall complexity of the issue, so I'd stick with the simplest code - which is probably 2 loops, rather than 1 loop with complicated test conditions re the two different lengths.
Actually, avoiding those length tests could easily make the two loops quicker than a single loop. You might also get better memory fetch performance with 2 loops - i.e. you are looking at contiguous memory: A[0],A[1],A[2]... then B[0],B[1],B[2]..., rather than A[0],B[0],A[1],B[1],A[2],B[2]...
So in every way, I'd go with 2 separate loops ;-p
Am I understanding you correctly in this?
You have these loops:
for (...) {
    // Do A
}
for (...) {
    // Do B
}
for (...) {
    // Do C
}
And you converted it into
for (...) {
    // Do A
    // Do B
}
for (...) {
    // Do C
}
and you're wondering which is faster?
If not, some pseudocode would be nice, so we could see what you meant. :)
Impossible to say. It could go either way. You're right, fetching data is expensive, but locality is also important. The first version may be better for data locality, but on the other hand, the second has bigger blocks with no branches, allowing more efficient instruction scheduling.
If the extra performance really matters (as Jon Skeet says, it probably doesn't, and you should pick whatever is most readable), you really need to measure both options, to see which is fastest.
My gut feeling says the second, with more work being done between jump instructions, would be more efficient, but it's just a hunch, and it can easily be wrong.
Aside from cache thrashing on large functions, there may be benefits on tiny functions as well. This applies on any auto-vectorizing compiler (not sure if Java JIT will do this yet, but you can count on it eventually).
Suppose this is your code:
// if this compiles down to a raw memory copy with a bitmask...
Date morningOf(Date d) { return Date(d.year, d.month, d.day, 0, 0, 0); }
Date timestamps[N];
Date mornings[N];
// ... then this can be parallelized using SSE or other SIMD instructions
for (int i = 0; i != N; ++i)
    mornings[i] = morningOf(timestamps[i]);
// ... and this will just run like normal
for (int i = 0; i != N; ++i)
    doOtherCrap(mornings[i]);
For large data sets, splitting the vectorizable code out into a separate loop can be a big win (provided caching doesn't become a problem). If it was all left as a single loop, no vectorization would occur.
This is something that Intel recommends in their C/C++ optimization manual, and it really can make a big difference.
... working on one piece of data but with two functions can sometimes make it so that code to act on that data doesn't fit in the processor's low level caches.
for (int i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig1(); // pushes functionreallybig2 out of cache
    obj.functionreallybig2(); // pushes functionreallybig1 out of cache
}

vs

for (int i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig1(); // this stays in the cache next time through the loop
}
for (int i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig2(); // this stays in the cache next time through the loop
}
But it was probably a mistake (usually this type of trick is commented)
When data is cyclically loaded and unloaded like this, it is called cache thrashing, by the way.
This is a separate issue from the data these functions are working on, as typically the processor caches that separately.
I apologize for not responding sooner and providing any kind of code. I got sidetracked on my project and had to work on something else.
To answer anyone still monitoring this question;
Yes, like jalf said, the function is something like:
PrepareData(vectorA, vectorB, xArray, yArray):
    listA
    listB
    foreach (value in vectorA)
        convert value, insert into listA
    foreach (value in vectorB)
        convert value, insert into listB
    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]
    xArray = listC.ToArray()
    yArray = listD.ToArray()
I changed it to:
PrepareData(vectorA, vectorB, ref xArray, ref yArray):
    listA
    listB
    for (int i = 0; i < vectorA.count && i < vectorB.count; i++)
        convert value, insert into listA
        convert value, insert into listB
    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]
    xArray = listC.ToArray()
    yArray = listD.ToArray()
Keeping in mind that the vectors can potentially have a large number of items, I figured the second one would be better, so that the program wouldn't have to loop over n items 2 or 3 different times. But then I started to wonder about the effects of memory fetching, or prefetching, or what have you.
So, I hope this helps to clear up the question, although a good number of you have provided excellent answers.
Thank you everyone for the information. Thinking in terms of Big-O and how to optimize has never been my strong point. I believe I am going to put the code back to the way it was; I should have trusted the way it was written before instead of jumping on my novice instincts. Also, in the future I will provide more reference material so everyone can understand what the heck I'm talking about (clarity is also not a strong point of mine :-/).
Thank you again.