Enumerate SqlDataReader Columns - C#

Given the following code snippet:
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        var count = reader.FieldCount; // the first column is the name; all subsequent columns are treated as data
        var data = new List<double>();
        for (var i = 1; i < count; i++)
        {
            data.Add(Convert.ToDouble(reader[i].ToString()));
        }
        barChartSeries.Add(new ColumnBarChart.ColumnChartSeries((string)reader[0], data));
        columnChart.xAxis.categories.Add((string)reader[0]);
    }
}
Is there an easy way to eliminate the for loop? Perhaps using LINQ?
reader[0] will always be a string
reader[0+?] will be a double
I want to pull all the doubles into a list if possible.

Speed is somewhat of a concern I suppose.
Then you're focusing on entirely the wrong problem.
Out of the things you're doing here, looping is the least inefficient part. More worrying is that you're converting from a double to string and then back to double. At least fix your code to:
for (var i = 1; i < count; i++)
{
    data.Add(reader.GetDouble(i));
}
You could create the list with the known size, too:
List<double> data = new List<double>(count - 1);
Even then, I strongly suspect that's going to be irrelevant compared with the serialization and database access.
Whatever you do, something is going to be looping. You could go to great lengths to hide the looping (e.g. by writing an extension method to iterate over all the values in a row) - but really, I don't see that there's anything wrong with it here.
I strongly advise you to avoid micro-optimizing until you've proven there's a problem. I'd be utterly astonished if this bit of code you're worrying about is a bottleneck at all. Generally, you should write the simplest code which achieves what you want it to, decide on your performance criteria and then test against them. Only move away from simple code when you need to.
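For illustration only, here is a minimal sketch of the extension-method idea mentioned above; the GetDoubles name and the assumption that column 0 is the label are mine, not part of any library:
using System.Collections.Generic;
using System.Data;

public static class DataRecordExtensions
{
    // Iterates the numeric columns of the current row, skipping the label column.
    public static IEnumerable<double> GetDoubles(this IDataRecord record, int startIndex = 1)
    {
        for (int i = startIndex; i < record.FieldCount; i++)
        {
            yield return record.GetDouble(i);   // no string round-trip
        }
    }
}

// Usage inside the while (reader.Read()) loop:
// var data = reader.GetDoubles().ToList();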

The only way I can see to remove the for loop is to use Enumerable.Range.
But anyway, you could do something like:
var data = new List<double>(Enumerable.Range(1, count - 1).Select(i => reader.GetDouble(i)));
But I see no benefit to this approach, and it just makes the code unreadable.

I don't think you can loop over it, because
public class SqlDataReader : DbDataReader, IDataReader, IDisposable, IDataRecord
and this class does not implement IEnumerable

You could use the LINQ Cast method to get something you can use other LINQ methods on, but I agree with the other posters - there's nothing wrong with a for loop. LINQ methods will just use a less efficient loop in the background.
I believe this should work, but haven't set up a data set to test it.
reader.Cast<Double>().Skip(1).ToList();

Related

Is it wise to create a variable to avoid calling Count() several times?

I often come across code like this:
var maybe = 'some random linq query'
int maybeCount = maybe.Count();
List<KeyValuePair<int, Customer>> lst2scan = maybe.Take(8).ToList();

for (int k = 0; k < 8; k++)
{
    if (k + 1 <= maybeCount) custlst[k].Customer = lst2scan[k].Value;
    else custlst[k].Customer = new Customer();
}
Every time I see code like this, I ask myself: should I create a variable so that the loop doesn't recalculate Count()? Maybe for a single Count() call it's pointless.
Does anyone have advice on the "right way" to write this kind of loop? Do you decide case by case?
In this case, is my maybeCount variable useless?
Does Count() recount each time it's called, or does it just return a stored count value?
Thanks for any advice to improve my knowledge.
If the count is guaranteed not to change, then yes, this avoids calculating the count multiple times. It becomes more important when the count is resolved by a method that takes some time to execute.
If the count does change, then this can produce erroneous results.
In answer to your questions:
The way you have done it looks pretty reasonable.
As mentioned, your maybeCount variable means that you won't have to repeat your LINQ query on each iteration.
The count will be determined each time you call it; it has no memory of what the count was.
Note: I'm talking about the Enumerable.Count() method here. List<T>.Count is a property that can just return a stored value.
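A tiny illustration of that difference (my own example, assuming the usual System.Linq and System.Collections.Generic usings):
var source = new List<int> { 1, 2, 3, 4, 5 };
IEnumerable<int> query = source.Where(x => x > 2);   // deferred: nothing has run yet

int maybeCount = query.Count();   // walks the query and counts the matches (3 here)
// calling query.Count() again would walk the whole query a second time,
// whereas source.Count is just a property read on the List<int>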
Well, it depends ;)
I wouldn't use a loop. I would do something like this:
var maybe = 'some random linq query'

// if necessary clear list
custlst.Clear();
custlst.AddRange(maybe.Take(8).Select(p => p.Value));
custlst.AddRange(Enumerable.Range(0, 8 - custlst.Count).Select(i => new Customer()));
In this case, yes.
It depends on whether the underlying type is a List (a collection, in fact) or a plain enumerable. The implementation of Count() looks for a precomputed Count value and uses it if it's there. So if you're iterating over a list, the variable is almost useless (still a little faster, but I don't think it's relevant by any means); if you're iterating over a LINQ query, a variable is recommended, because Count() is going to walk the entire enumeration.
Just my 2c.
Edit: for reference, you can find the source for .Count() at https://github.com/dotnet/corefx/blob/master/src/System.Linq/src/System/Linq/Enumerable.cs (around line 1489); as you see, it checks for ICollection, and uses the precomputed Count property.
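Roughly what that fast path looks like (a simplified sketch of mine, not the actual framework source):
using System.Collections.Generic;

static class EnumerableCountSketch
{
    public static int CountSketch<TSource>(IEnumerable<TSource> source)
    {
        // Collections expose a precomputed Count: O(1) for List<T>, arrays, etc.
        if (source is ICollection<TSource> collection)
            return collection.Count;

        // Otherwise the whole sequence has to be walked.
        int count = 0;
        using (IEnumerator<TSource> e = source.GetEnumerator())
            while (e.MoveNext())
                count++;
        return count;
    }
}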

Adding fixed size array to IEnumerable

This method gets:
IEnumerable<object[]> - in which every array has a fixed size (it represents a relational data structure).
DataEnumerable.Column[] - some metadata columns; mostly they will have the same value for all rows.
Expected outcome:
Each "row" should get a value for each of these columns (so the data structure remains relational).
private IEnumerable<object[]> BindExtraColumns(IEnumerable<object[]> baseData, int dataSize, DataEnumerable.Column[] columnsToAdd)
{
    int extraColumnsLength = columnsToAdd.Length;
    object[] row = new object[dataSize + extraColumnsLength];
    string columnName;
    int rowNumberColumnIndex = -1;

    for (int i = 0; i < extraColumnsLength; i++)
    {
        // Assign values that don't change between rows...
        // Assign rowNumberColumnIndex if a row number column exists
    }

    // Assign values that change per row here; since we currently support only row number
    // it's not generic enough
    if (rowNumberColumnIndex != -1)
    {
        int rowNumber = 1;
        foreach (var baseRow in baseData)
        {
            row[rowNumberColumnIndex] = rowNumber;
            Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
            yield return row;
            rowNumber++;
        }
    }
    else
    {
        foreach (var baseRow in baseData)
        {
            Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
            yield return row;
        }
    }
}
This method can be called from hundreds of threads with relatively big data sets, so performance here is critical, and I tried to create as few new objects as possible.
Please note: this is a private method used only by a DataReader, which reads each line and copies it into another array immediately, before reading the next line.
So: can the array copying here be optimized somehow, and should I (carefully) trade memory to speed things up?
Thanks
Your code is fundamentally broken. You're just returning a reference to the same array every time, which means that unless the caller uses the data within each item immediately, it effectively gets lost. For example, suppose I use:
List<object[]> rows = BindExtraColumns(data, size, toAdd).ToList();
Then when I iterate over the rows, I find the same data in every row. That's really not a good experience.
I think it would make much more sense to create a new array for each iteration. Yes, that's a lot of extra memory being used - but it doesn't surprise callers nearly as much.
If you really don't want to do that, I suggest you change the approach so that the caller has to pass in an Action<object[]> to be executed on each row, with the documented proviso that if the caller stashes a reference to the array, they may well be surprised by the results.
You're obviously very concerned about performance, but if your data is coming from a database I'd expect the array creation/copying performance to be insignificant. You should write the simplest (and most reliable) code that works first, and then benchmark it to see whether it performs well enough. Unless you've got evidence that you need to make this surprising design choice, it feels like you're optimizing way too early.
EDIT: Now we know that it's a private method only used in one specific place, I would still avoid this reuse. It's simply fragile. I really would change to passing in an Action<object[]> or simply copying the data to a new array every time. I certainly wouldn't keep the current approach without strong evidence that it's a bottleneck: as I said before, I'd expect the database communication to be much more important. Leaving timebombs in your code like this very rarely works out well.
If you really, really want to keep doing this, you should document it very strongly, giving severe warnings that the result is non-idiomatic.
In terms of whether there's more optimization you could do - well... one alternative would be to avoid having to work with a single array in the first place. You could create a class which held references to both arrays (the current base row and the fixed data) and exposed an indexer which returned the value from one array or the other based on which index was being requested. We don't know what you're doing with the data ("passes it to another array" doesn't really mean anything) so we don't know whether that's feasible, but it would be efficient and could be implemented without the odd behaviour.
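For what it's worth, a rough sketch of the Action<object[]> variant suggested above (the parameter name and the elided setup are mine, based on the code in the question):
// Sketch only: the caller supplies the per-row delegate, and the reused buffer
// never leaves this method, so the aliasing is at least explicit by design.
private void BindExtraColumns(IEnumerable<object[]> baseData, int dataSize,
                              DataEnumerable.Column[] columnsToAdd,
                              Action<object[]> processRow)
{
    int extraColumnsLength = columnsToAdd.Length;
    object[] row = new object[dataSize + extraColumnsLength];
    // ... assign the fixed column values and rowNumberColumnIndex as before ...
    foreach (var baseRow in baseData)
    {
        Array.Copy(baseRow, 0, row, extraColumnsLength, dataSize);
        processRow(row); // caller must use the row immediately and not keep the reference
    }
}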

Convert loop to LINQ

I have a list of Equity Analytics (stock) objects and I'm doing a calculation for daily returns.
I was thinking there must be a pairwise solution to do this:
for (int i = 0; i < sdata.Count; i++)
{
    sdata[i].DailyReturn = (i > 0) ? (sdata[i - 1].AdjClose / sdata[i].AdjClose) - 1 : 0.0;
}
LINQ stands for: "Language-Integrated Query".
LINQ should not be used, and mostly can't be used, for assignments: LINQ doesn't change the given IEnumerable, it creates a new one.
As suggested in a comment below, there is a way to create a new IEnumerable with LINQ, but it will be slower and a lot less readable.
Though LINQ is nice, an important thing is to know when not to use it.
Just use the good old for loop.
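For completeness, the LINQ version alluded to above might look something like this untested sketch of mine; it builds a new sequence rather than assigning DailyReturn in place:
// Untested sketch: produces the returns as a new list instead of mutating sdata.
var dailyReturns = sdata
    .Select((s, i) => i > 0 ? (sdata[i - 1].AdjClose / s.AdjClose) - 1 : 0.0)
    .ToList();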
I'm new to LINQ (I started using it because of Stack Overflow), so I tried to play with your question. I know this may attract downvotes, but it does what you want, as long as there is at least one element in the list.
sdata[0].DailyReturn = 0.0;
sdata.GetRange(1, sdata.Count - 1).ForEach(c => c.DailyReturn = (sdata[sdata.IndexOf(c) - 1].AdjClose / c.AdjClose) - 1);
But I must say that avoiding for loops isn't a best practice in itself. From my point of view, LINQ should be used where convenient, not everywhere. Good old loops are sometimes easier to maintain.

What is the difference between a for (or foreach) loop and a LINQ query in terms of speed?

I'd like to know the difference between retrieving items from a list using a for (or foreach) loop and retrieving them using a LINQ query, especially in terms of speed, and any other differences.
Example:
List A = new List() contains 10000 rows. I need to copy/filter some rows out of list A. Which is better in terms of speed: a for loop or a LINQ query?
You could benchmark yourself and find out. (After all, only you know the particular circumstances in which you'll need to be running these loops and queries.)
My (very crude) rule-of-thumb -- which has so many caveats and exceptions as to be almost useless -- is that a for loop will generally be slightly faster than a foreach which will generally be slightly faster than a sensibly-written LINQ query.
You should use whatever construct makes the most sense for your particular situation. If what you want to do is best expressed with a for loop then do that; if it's best expressed as a foreach then do that; if it's best expressed as a query then use LINQ.
Only if and when you find that performance isn't good enough should you consider re-writing code that's expressive and correct into something faster and less expressive (but hopefully still correct).
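If you do benchmark, the three variants being compared would look roughly like this; MyItem, SomeValue, threshold and listA are invented stand-ins for your actual type and filter:
// for loop
var result1 = new List<MyItem>();
for (int i = 0; i < listA.Count; i++)
{
    if (listA[i].SomeValue > threshold) result1.Add(listA[i]);
}

// foreach loop
var result2 = new List<MyItem>();
foreach (var item in listA)
{
    if (item.SomeValue > threshold) result2.Add(item);
}

// LINQ
var result3 = listA.Where(item => item.SomeValue > threshold).ToList();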
If we're talking regular LINQ, then we're focusing on IEnumerable<T> (LINQ-to-Objects) and IQueryable<T> (LINQ-to-most-other-stuff). Since IQueryable<T> : IEnumerable<T>, it is automatic that you can use foreach - but what this means is very query-specific, since LINQ is generally lazily spooling data from an underlying source. Indeed, that source can be infinite:
public IEnumerable<int> Forever() {
    int i = 0;
    while (true) yield return i++;
}
...
foreach (int i in Forever()) {
    Console.WriteLine(i);
    if (Console.ReadLine() == "exit") break;
}
However, a for loop requires the length and an indexer. Which in real terms, typically means calling ToList() or ToArray():
var list = source.ToList();
for (int i = 0; i < list.Count; i++) { /* do something with list[i] */ }
This is interesting in various ways: firstly, it will die for infinite sequences ;p. However, it also moves the spooling earlier. So if we are reading from an external data source, the for/foreach loop over the list will be quicker, but simply because we've moved a lot of work to ToList() (or ToArray(), etc).
Another important feature of performing the ToList() earlier is that you have closed the reader. You might need to operate on data inside the list, and that isn't always possible while a reader is open; iterators break while enumerating, for example - or perhaps more notably, unless you use "MARS" SQL Server only allows one reader per connection. As a counterpoint, that reeks of "n+1", so watch for that too.
Over a local list/array/etc., it is largely irrelevant which loop strategy you use.

Back to basics; for-loops, arrays/vectors/lists, and optimization

I was working on some code recently and came across a method that had 3 for-loops that worked on 2 different arrays.
Basically, what was happening was a foreach loop would walk through a vector and convert a DateTime from an object, and then another foreach loop would convert a long value from an object. Each of these loops would store the converted value into lists.
The final loop would go through these two lists and store those values into yet another list because one final conversion needed to be done for the date.
Then after all that is said and done, The final two lists are converted to an array using ToArray().
Ok, bear with me, I'm finally getting to my question.
So, I decided to make a single for loop to replace the first two foreach loops and convert the values in one fell swoop (the third loop is quasi-necessary, although I'm sure with some work I could also fold it into the single loop).
But then I read the article "What your computer does while you wait" by Gustav Duarte and started thinking about memory management and what the data was doing while it's being accessed in the for-loop where two lists are being accessed simultaneously.
So my question is, what is the best approach for something like this? Try to condense the for loops so everything happens in as few loops as possible, causing multiple data accesses to the different lists? Or allow the multiple loops and let the system bring in the data it's anticipating? These lists and arrays can potentially be large, and looping through 3 lists, perhaps 4 depending on how ToArray() is implemented, can get very costly (O(n^3)??). But from what I understood in said article and from my CS classes, having to fetch data can be expensive too.
Would anyone like to provide any insight? Or have I completely gone off my rocker and need to relearn what I have unlearned?
Thank you
The best approach? Write the most readable code, work out its complexity, and work out if that's actually a problem.
If each of your loops is O(n), then you've still only got an O(n) operation.
Having said that, it does sound like a LINQ approach would be more readable... and quite possibly more efficient as well. Admittedly we haven't seen the code, but I suspect it's the kind of thing which is ideal for LINQ.
For reference, the article is "What your computer does while you wait" by Gustav Duarte. There's also a guide to big-O notation.
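As a very rough illustration of the LINQ shape being suggested above (rawDates, rawValues, ConvertDate and FinalDateFixup are invented stand-ins, since we haven't seen the real code):
// Invented names and conversions, just to show the shape of the LINQ version:
// each pipeline replaces one of the original conversion loops plus the final ToArray().
DateTime[] dates = rawDates.Select(o => ConvertDate(o))      // first foreach loop
                           .Select(d => FinalDateFixup(d))   // the "one final conversion"
                           .ToArray();
long[] values = rawValues.Select(o => Convert.ToInt64(o)).ToArray(); // second foreach loop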
It's impossible to answer the question without being able to see code/pseudocode. The only reliable answer is "use a profiler". Assuming what your loops are doing is a disservice to you and anyone who reads this question.
Well, you've got complications if the two vectors are of different sizes. As has already been pointed out, this doesn't increase the overall complexity of the issue, so I'd stick with the simplest code - which is probably 2 loops, rather than 1 loop with complicated test conditions re the two different lengths.
Actually, these length tests could easily make the two loops quicker than a single loop. You might also get better memory fetch performance with 2 loops - i.e. you are looking at contiguous memory - i.e. A[0],A[1],A[2]... B[0],B[1],B[2]..., rather than A[0],B[0],A[1],B[1],A[2],B[2]...
So in every way, I'd go with 2 separate loops ;-p
Am I understanding you correctly in this?
You have these loops:
for (...) {
    // Do A
}
for (...) {
    // Do B
}
for (...) {
    // Do C
}
And you converted it into
for (...) {
    // Do A
    // Do B
}
for (...) {
    // Do C
}
and you're wondering which is faster?
If not, some pseudocode would be nice, so we could see what you meant. :)
Impossible to say. It could go either way. You're right, fetching data is expensive, but locality is also important. The first version may be better for data locality, but on the other hand, the second has bigger blocks with no branches, allowing more efficient instruction scheduling.
If the extra performance really matters (as Jon Skeet says, it probably doesn't, and you should pick whatever is most readable), you really need to measure both options, to see which is fastest.
My gut feeling says the second, with more work being done between jump instructions, would be more efficient, but it's just a hunch, and it can easily be wrong.
Aside from cache thrashing with large functions, there may be benefits for tiny functions as well. This applies to any auto-vectorizing compiler (not sure if the Java JIT will do this yet, but you can count on it eventually).
Suppose this is your code:
// if this compiles down to a raw memory copy with a bitmask...
Date morningOf(Date d) { return Date(d.year, d.month, d.day, 0, 0, 0); }

Date timestamps[N];
Date mornings[N];

// ... then this can be parallelized using SSE or other SIMD instructions
for (int i = 0; i != N; ++i)
    mornings[i] = morningOf(timestamps[i]);

// ... and this will just run like normal
for (int i = 0; i != N; ++i)
    doOtherCrap(mornings[i]);
For large data sets, splitting the vectorizable code out into a separate loop can be a big win (provided caching doesn't become a problem). If it was all left as a single loop, no vectorization would occur.
This is something that Intel recommends in their C/C++ optimization manual, and it really can make a big difference.
... working on one piece of data but with two functions can sometimes make it so that code to act on that data doesn't fit in the processor's low level caches.
for (i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig1(); // pushes functionreallybig2 out of cache
    obj.functionreallybig2(); // pushes functionreallybig1 out of cache
}
vs
for (i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig1(); // this stays in the cache next time through the loop
}
for (i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig2(); // this stays in the cache next time through the loop
}
But it was probably a mistake (usually this type of trick is commented)
When data is cyclically loaded and unloaded like this, it is called cache thrashing, btw.
This is a separate issue from the data these functions are working on, as typically the processor caches that separately.
I apologize for not responding sooner or providing any kind of code. I got sidetracked on my project and had to work on something else.
To answer anyone still monitoring this question:
Yes, as jalf said, the function is something like:
PrepareData(vectorA, vectorB, xArray, yArray):
    listA
    listB
    foreach (value in vectorA)
        convert value, insert into listA
    foreach (value in vectorB)
        convert value, insert into listB

    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]

    xArray = listC.ToArray()
    yArray = listD.ToArray()
I changed it to:
PrepareData(vectorA, vectorB, ref xArray, ref yArray):
    listA
    listB
    for (int i = 0; i < vectorA.count && i < vectorB.count; i++)
        convert value, insert into listA
        convert value, insert into listB

    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]

    xArray = listC.ToArray()
    yArray = listD.ToArray()
Keeping in mind that the vectors can potentially have a large number of items, I figured the second one would be better, so that the program wouldn't have to loop over n items two or three separate times. But then I started to wonder about the effects of memory fetching, or prefetching, or what have you.
So, I hope this helps to clear up the question, although a good number of you have provided excellent answers.
Thank you everyone for the information. Thinking in terms of Big-O and how to optimize has never been my strong point. I believe I am going to put the code back to the way it was; I should have trusted the way it was written before instead of jumping on my novice instincts. Also, in the future I will provide more context so everyone can understand what the heck I'm talking about (clarity is also not a strong point of mine :-/).
Thank you again.
