Faster way to get distinct values in LINQ? - c#

I have a web part in SharePoint, and I am trying to populate a drop-down control with the unique/distinct values from a particular field in a list.
Unfortunately, due to the nature of the system, it is a text field, so there is no other definitive source for the data values (i.e., if it were a choice field, I could get the field definition and just read the values from there), and I am using the chosen value of the drop-down in a subsequent CAML query, so the values must accurately reflect what is present on the list items. Currently the list has approx. 4K items, and it is (and will continue) growing slowly.
Also, it's part of a sandbox solution, so it is restricted by the user code service time limit - and it's timing out more often than not. In my dev environment I stepped through the code in debug, and the line of LINQ where I actually get the distinct values seemed to be the most time-consuming. When I commented out the call to this method entirely, the timeouts stopped, so I am fairly certain this is where the problem is.
Here's my code:
private void AddUniqueValues(SPList list, SPField filterField, DropDownList dropDownControl)
{
    SPQuery query = new SPQuery();
    query.ViewFields = string.Format("<FieldRef Name='{0}' />", filterField.InternalName);
    query.ViewFieldsOnly = true;

    SPListItemCollection results = list.GetItems(query); // retrieves ~4K items

    // this takes too long with 4K items
    List<string> uniqueValues = results.Cast<SPListItem>()
        .Select(item => item[filterField.Id].ToString())
        .Distinct()
        .ToList();

    uniqueValues.Sort();
    dropDownControl.Items.AddRange(uniqueValues.Select(itm => new ListItem(itm)).ToArray());
}
As far as I am aware, there's no way to get "distinct" values directly in a CAML query, so how can I do this more quickly? Is there a way to restructure the LINQ to run faster?
Is there an easy/fast way to do this from the client side? (REST would be preferred, but I'd do JSOM if necessary).
Thought I'd add some extra information here since I did some further testing and found some interesting results.
First, to address the questions of whether the Cast() and Select() are needed: yes, they are.
SPListItemCollection is IEnumerable but not IEnumerable<T>, so we need the cast just to be able to use LINQ at all.
Then, after it's cast to IEnumerable<SPListItem>: SPListItem is a fairly complex object, and I am looking for the distinct values of just one property of that object. Using Distinct() directly on the IEnumerable<SPListItem> yields... all of them. So I have to Select() just the single values I want to compare.
So yes, the Cast() and Select() are absolutely necessary.
As noted in the comments by M.kazem Akhgary, in my original line of code, calling ToString() every time (for 4K items) did add some time. But in testing some other variations:
// original
List<string> uniqueValues = results.Cast<SPListItem>()
    .Select(item => item[filterField.Id].ToString()).Distinct().ToList();

// hash set alternative
HashSet<object> items = new HashSet<object>(
    results.Cast<SPListItem>().Select(itm => itm[filterField.Id]));

// don't call ToString(), just deal with the base objects
List<object> obs = results.Cast<SPListItem>()
    .Select(itm => itm[filterField.Id]).Distinct().ToList();

// alternate LINQ syntax from Pieter_Daems's answer; removes the explicit Cast()
var things = (from SPListItem item in results select item[filterField.Id])
    .Distinct().ToList();
I found that all of those methods took multiple tens of seconds to complete. Strangely, the DataTable/DataView method from Pieter_Daems's answer, to which I added a bit to extract the values I wanted:
DataTable dt = results.GetDataTable();
DataView vw = new DataView(dt);
DataTable udt = vw.ToTable(true, filterField.InternalName); // true = distinct rows only
List<string> rowValues = new List<string>();
foreach (DataRow row in udt.Rows)
{
    rowValues.Add(row[filterField.InternalName].ToString());
}
rowValues.Sort();
took only 1-2 seconds!
In the end, I am going with Thriggle's answer, because it deals nicely with SharePoint's 5,000-item list view threshold, which I will probably have to deal with some day, and it is only marginally slower (2-3 seconds) than the DataTable method - still much, much faster than all the LINQ variants.
Interesting to note, though, that the fastest way to get distinct values for a particular field from an SPListItemCollection seems to be the DataTable/DataView conversion method.

You're potentially introducing a significant delay by retrieving all items first before checking for distinctness.
An alternative approach would be to perform multiple CAML queries against SharePoint; this would result in one query per unique value (plus one final query that returns no results).
1. Make sure your list has column indexing applied to the field whose values you want to enumerate.
2. In your initial CAML query, sort by the field you want to enumerate and impose a row limit of one item.
3. Get the value of the field from the item returned by that query and add it to your collection of unique values.
4. Query the list again, sorting by the field and imposing a row limit of 1, but this time add a filter condition so that it only retrieves items where the field value is greater than the value you just detected.
5. Add the value of the field from the returned item to your collection of unique values.
6. Repeat steps 4 and 5 until the query returns an empty result set, at which point your collection of unique values contains all current values of the field (assuming more haven't been added since you started). A sketch of this loop follows below.
Will this be any faster? That depends on your data, and how frequently duplicate values occur.
If you have 4000 items and only 5 unique values, you'll be able to gather those 5 values in only 6 lightweight CAML queries, returning a total of 5 items. This makes a lot more sense than querying for all 4000 items and enumerating through them one at a time to look for unique values.
On the other hand, if you have 4000 items and 3000 unique values, you're looking at querying the list 3001 times. This might well be slower than retrieving all the items in a single query and using post-processing to find the unique values.
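For reference, here is a minimal sketch of that loop against the server object model. It's a sketch under assumptions, not the answerer's actual code: the Text value type, the SecurityElement.Escape call (from System.Security), and the method shape are mine; adjust them to your field.

private static List<string> GetUniqueValues(SPList list, SPField field)
{
    var uniqueValues = new List<string>();
    string lastValue = null;
    while (true)
    {
        SPQuery query = new SPQuery();
        query.RowLimit = 1;
        query.ViewFields = string.Format("<FieldRef Name='{0}' />", field.InternalName);
        query.ViewFieldsOnly = true;

        // After the first pass, only match items whose value is greater
        // than the one we just recorded.
        string whereClause = lastValue == null ? string.Empty : string.Format(
            "<Where><Gt><FieldRef Name='{0}' /><Value Type='Text'>{1}</Value></Gt></Where>",
            field.InternalName, System.Security.SecurityElement.Escape(lastValue));
        query.Query = whereClause + string.Format(
            "<OrderBy><FieldRef Name='{0}' /></OrderBy>", field.InternalName);

        SPListItemCollection items = list.GetItems(query);
        if (items.Count == 0)
            break; // nothing greater than the last value: we have them all

        lastValue = Convert.ToString(items[0][field.Id]);
        uniqueValues.Add(lastValue);
    }
    return uniqueValues;
}

Each iteration is a one-row indexed query, so it stays under the list view threshold regardless of list size.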

var distinctItems = (from SPListItem item in items select item["EmployeeName"]).Distinct().ToArray();
Or convert your results to DataView and do something like:
SPList oList = SPContext.Current.Web.Lists["ListName"];
SPQuery query = new SPQuery();
query.Query = "<OrderBy><FieldRef Name='Name' /></OrderBy>";
DataTable dtcamltest = oList.GetItems(query).GetDataTable();
DataView dtview = new DataView(dtcamltest);
DataTable dtdistinct = dtview.ToTable(true, "Name");
Source: https://sharepoint.stackexchange.com/questions/77988/caml-query-on-sharepoint-list-without-duplicates

Duplicate maybe?
.Distinct is an O(n) call.
You can't get any faster than that.
That said, you may want to check whether you really need the Cast() + Select() to get the unique values - I'd try a HashSet.
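For what it's worth, a minimal sketch of that HashSet suggestion against the question's own variables (as the question's edit shows, it still enumerates every SPListItem, so the cost of materializing 4K items remains):

// one pass, no Distinct() call; duplicates are rejected by the set itself
var unique = new HashSet<string>();
foreach (SPListItem item in results)
    unique.Add(Convert.ToString(item[filterField.Id]));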

Related

C# - Concatenate an in memory IList and IQueryable?

Suppose I have a List containing one string value. Suppose I also have an IQueryable that contains several strings from a database. I want to be able to concatenate these two containers into one list and then be able to call methods such as .Skip or .Take on the list. I want to be able to do this in such a way that when I combine the two containers I don't load all of the DB data into memory (only after I call .Skip and .Take). Basically, I want to do something like this (pseudocode):
IQueryable<string> someQuery = myEntities.GetDBQuery(); // gets "test2", "test3"
IList<string> inMemoryList = new List<string>();
inMemoryList.Add("test");

// Can I do something like this without loading the DB data into memory?
// finalList should contain all 3 strings.
IEnumerable<string> finalList = inMemoryList.Union(someQuery);

// At this point it is fine to load the filtered query into memory.
foreach (string myString in finalList.Skip(100).Take(200))
{
    // Do work...
}
How can I achieve this?
If I haven't misunderstood, you are trying to query data, part of which comes from memory and part from the database, like this:
//the following code will not compile, just for example
var dbQuery = BuildDbQuery();
var list = BuildListInMemory();
var myQuery = (dbQuery + list).OrderBy(aa).Skip(bb).Take(cc).Select(dd);
//and you don't want to load all records into memory by dbQuery
//because you only need some of them
The short answer is NO, you can't. Consider the .OrderBy method: all the data has to be in the same "place", otherwise the code can't sort it. So the code loads all the records produced by dbQuery into memory (now they are in the same place) and then sorts all of them, including those in list. That can easily cause a memory issue when dbQuery returns thousands of rows.
HOW TO RESOLVE
Pass the data in list into the database (as parameters of dbQuery) so that the query happens in the database. This is easy if your list has only a few items.
If list also has lots of records, which would make dbQuery too complex, you can try querying twice: once for dbQuery and once for list. For example, say you have 10,000 users in the database and 1,000 users in your in-memory list, and you want the 10 youngest users overall. You don't need to load 10,000 users into memory and then find the youngest 10. Instead, find the 10 youngest in dbQuery (ResultA) and load them into memory, find the 10 youngest in the memory list (ResultB), and then compare ResultA with ResultB. Both options are sketched below.
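Hedged sketches of both options; dbContext, User, Name, and BirthDate are invented placeholders, not the asker's actual types:

// Option 1: pass the in-memory values into the query as parameters.
// Contains on a local list of primitives is translated into a SQL IN (...) clause.
var matching = dbContext.Users
    .Where(u => inMemoryNames.Contains(u.Name))
    .ToList();

// Option 2: query each side separately, then merge in memory.
// The 10 youngest candidates from each side are enough to find the
// overall 10 youngest.
var fromDb = dbContext.Users
    .OrderByDescending(u => u.BirthDate).Take(10).ToList(); // ResultA
var fromMemory = memoryUsers
    .OrderByDescending(u => u.BirthDate).Take(10).ToList(); // ResultB
var tenYoungest = fromDb.Concat(fromMemory)
    .OrderByDescending(u => u.BirthDate)
    .Take(10)
    .ToList();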
I entirely agree with Danny's answer when he says you need to somehow get the in-memory user list into the db to achieve what you want. As for the example you asked for in your comment: without knowing the data structure of your User object, that seems difficult, but assuming you can connect the dots, here is my suggested approach:
Create a temporary table with a structure identical to your regular user table in your db, and insert all your in-memory users into it.
Write a query to Union the temporary and regular tables; both are identical in structure, so that should be easy.
Return the result to your application and use it, performing standard LINQ operations.
If you want exact code that you can use as-is, you will have to provide your User object structure - field types, etc. in the db - to enable me to write it.
You specify that your query and your list are both sequences of strings, so someQuery can be performed completely on the database side (not in-memory).
Let's make your sequences less generic:
IQueryable<string> someQuery = ...
IList<string> myList = ...
You also specify that myList contains only one element.
string myOneAndOnlyString = myList.Single();
As your list is in-memory, this has to be performed in-memory. But because the list has only one element, this won't take any time.
The query that you request:
IQueryable<string> correctQuery = someQuery
    .Where(item => item.Equals(myOneAndOnlyString))
    .Skip(skipCount)
    .Take(takeCount);
Use the SQL Server profiler to check the generated SQL and see that the request is performed completely in one SQL statement.

Is there any way to loop through my sql results and store certain name/value pairs elsewhere in C#?

I have a large result set coming from a pretty complex SQL query. Among the values are a string which represents a location (that will later help me determine the page location the value came from), an int which is a priority number calculated for each row based on other values from that row, and another string which contains a value I must remember for later display.
The problem is that the sql query is so complex (it has UNIONS, JOINS, and complex calculations with aliases) that I can't logically fit anything else into it without messing with the way it works.
Suffice it to say, after the query is done and the calculations are performed, I need something that aggregate functions might perhaps solve - but that IS NOT an option, as not all of the columns come from aggregate functions.
I have been wracking my brain for days now as to how I can iterate through the results and store a pair of values in a list (or two separate lists tied together somehow), where one value is the sum of all the priority values for each location and the other value is a distinct location value. That is, as the results are looped through, no new list item should be created for a location value that has already been seen; HOWEVER, the priority values from rows with identical locations do still need to be added to that location's running sum. Also, the results need to be ordered by priority in descending order (hence the problem with using two lists).
EXAMPLE:
EDIT: I forgot - the preserved value should be the value from the row with the highest priority in the sql query results.
If I had the following results:
location    priority    value
---------------------------------------
page1       1           some text!
page2       3           more text!
page2       4           even more text!
page3       3           text again
page3       1           text
page3       1           still more text!
page4       6           text
If I was able to do what I wanted I would be able to achieve something like this after iteration (and in this order):
location    priority    value
---------------------------------------
page2       7           even more text!
page4       6           text
page3       5           text again
page1       1           some text!
I have done research after research after research but absolutely nothing really even gets close to solving this dilemma.
Is what I'm asking too tough for even the powerful C# language?
THINGS I HAVE CONSIDERED:
Looping through the sql results and checking each location for repeats, adding together all priority values as I go, and storing these two values, plus the display value, in two or three separate lists.
Why I still need help
I can't use a foreach because the logic didn't pan out, and I can't use a for loop because I can't access an IEnumerable (or whatever type it is that stores what's returned from Database.Open.Query()) by index - which makes sense, of course. Also, I need to sort on priority, but I can't let one list get out of sync with the others.
Using LINQ to select and store what I need
Why I still need help
I don't know LINQ (at all!), mainly because I don't understand lambda expressions (no matter HOW MUCH I read up on them).
Using an instantiated class to store the name/value pairs
Why I still need help
Not only do I expect sorting on this sort of thing to be difficult, but while I do know how to use .cs files in my C#.net webpages with the WebMatrix environment, I have mainly only ever used static classes and would also need a little refresher course on constructors and how to set this up appropriately.
Somehow fitting this functionality into the already sizeable and complex SQL query
Why I still need help
While this is probably where I would ideally like this functionality to be, I stress again that this IS NOT AN OPTION. I have tried using aggregate functions, but I only get an error saying that not all of the other columns come from aggregate functions.
Making another query based on values from the first query's result set
Why I still need help
I can't select distinct results based on only one column (i.e., location) alone.
Assuming I could get the loop logic correct, storing the values in a 3 dimensional array
Why I still need help
I can't declare the array, because I do not know all of its dimensions before I need to use it.
Your post has amazed me in a number of ways, like saying that you 'mostly use static classes' and that you 'expect instantiating a class/object to be impossible'.. really strange things to say. I can only respond with a quote from Charles Babbage:
I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Anyways.. as you say you find lambdas hard, let's trace through the problem in the classic 'manual' way.
Let's assume you have a list of ROWS that contains LOCATIONS and PRIORITIES.
List<DataRow> rows = .... ; // datatable, sqldatareader, whatever
You say you need:
list of unique locations
a "list" of locations paired up with summed up priorites
Let's start with the first objective.
To gather a list of unique 'values', a HashSet is just perfect:
HashSet<string> locations = new HashSet<string>();
foreach (var row in rows)
    locations.Add((string)row["LOCATION"]);
Well, and that's all. After that, the locations hashset will remember all the unique locations. Add does not result in duplicate elements; the HashSet checks and "uniquifies" all values that are put inside it. One small tricky thing: the hashset does not have the [index] operator. You'll have to enumerate it to get the values:
foreach (string loc in locations)
{
    Console.WriteLine(loc);
}
or convert/rewrite it to a list:
List<string> locList = new List<string>(locations);
Console.WriteLine(locList[2]); // of course, assuming there were at least three..
Let's get to the second objective.
To gather a list of values related to something behaving like a "logical key", a Dictionary<Key,Val> may be useful. It allows you to store/associate a "value" with some "key", i.e.:
Dictionary<string, double> dict = new Dictionary<string, double>();
dict["mamma"] = 123.45;
double d = dict["mamma"]; // d == 123.45
dict["mamma"] += 101;     // possible!
double e = dict["mamma"]; // e == 224.45
However, it has a behavior of happily throwing exceptions when you try to read from an unknown key:
Dictionary<string, double> dict = new Dictionary<string, double>();
dict["mamma"] = 123.45;
double d = dict["daddy"]; // throws KeyNotFoundException
dict["daddy"] += 101;     // would throw too! += tries to read the old/current value first
So one has to be very careful with keys it does not yet know. Fortunately, you can always ask the dictionary whether it already knows a key:
Dictionary<string, double> dict = new Dictionary<string, double>();
dict["mamma"] = 123.45;
bool knowIt = dict.ContainsKey("daddy"); // == false
So you can easily check-and-initialize-when-unknown:
Dictionary<string, double> dict = new Dictionary<string, double>();
bool knowIt = dict.ContainsKey("daddy"); // == false
if (!knowIt)
    dict["daddy"] = 5;
dict["daddy"] += 101; // now 106
So.. let's try summing up the priorities location-wise:
Dictionary<string, double> prioSums = new Dictionary<string, double>();
foreach (var row in rows)
{
    string location = (string)row["LOCATION"];
    double priority = (double)row["PRIORITY"];
    if (!prioSums.ContainsKey(location))
        prioSums[location] = 0.0; // make sure the dictionary knows the location
    prioSums[location] += priority;
}
And, really, that's all. Now the prioSums will know all locations and all sums of priorities:
var sss = prioSums["NewYork"]; // 9123, assuming NewYork was some location
However, it would be quite useless to have to hardcode all the locations. Hence, you can also ask the dictionary which keys it currently knows:
foreach (string key in prioSums.Keys)
    Console.WriteLine(key);
and you can immediately use them:
foreach (string key in prioSums.Keys)
{
    Console.WriteLine(key);
    Console.WriteLine(prioSums[key]);
}
that should print all locations with all their sums.
You might have already noticed an interesting thing: the dictionary can tell you which keys it has remembered. Hence, you do not actually need the HashSet from the first objective. Simply by summing up the priorities inside the Dictionary, you get the unique list of locations for free: just ask the dict for its keys.
EDIT:
I noticed you have a few more requirements (like sort-descending and find-the-highest-priority-value), but I think I'll leave them for now. If you understand how I used a dictionary to collect the priorities, then you will easily build a similar Dictionary<string,string> to collect the highest-ranking value for a location. And the descending order is done easily if you take the values out of the dictionary and sort them, e.g. as a List. So I'll skip that for now.. this text got far too tl;dr already, I think :)
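For completeness, a hedged sketch of those remaining pieces, under the same assumptions as above (a List<DataRow> named rows with LOCATION, PRIORITY, and VALUE columns):

var prioSums = new Dictionary<string, double>();
var bestPrio = new Dictionary<string, double>();
var bestValue = new Dictionary<string, string>();

foreach (var row in rows)
{
    string location = (string)row["LOCATION"];
    double priority = (double)row["PRIORITY"];
    string value = (string)row["VALUE"];

    if (!prioSums.ContainsKey(location))
    {
        prioSums[location] = 0.0;
        bestPrio[location] = double.MinValue;
    }
    prioSums[location] += priority;

    // remember the value from the highest-priority row per location
    if (priority > bestPrio[location])
    {
        bestPrio[location] = priority;
        bestValue[location] = value;
    }
}

// descending order by summed priority: pull the keys out and sort them
List<string> ordered = new List<string>(prioSums.Keys);
ordered.Sort((a, b) => prioSums[b].CompareTo(prioSums[a]));
foreach (string loc in ordered)
    Console.WriteLine("{0}\t{1}\t{2}", loc, prioSums[loc], bestValue[loc]);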
LINQ is really the tool to use for this kind of problems.
Suppose you have a variable pages which is an IEnumerable<Page>, where Page is a class with properties location, priority, and value. Then you could do:
var query = from page in pages
            group page by page.location into grp
            select new
            {
                location = grp.Key,
                priority = grp.Sum(page => page.priority),
                value = grp.OrderByDescending(page => page.priority)
                           .First().value
            };
You say you don't understand LINQ, so let me try to begin explain this statement.
The rows are grouped by location, which results in 4 groups of pages, with page.location as the key:
location    priority    value
--------------------------------------
page1       1           some text!
page2       3           more text!
            4           even more text!
page3       1           text
            1           still more text!
            3           text again
page4       6           text
The select loops through these 4 groups and for each group it creates an anonymous type with 3 properties:
location: the key of the group
priority: the sum of priorities in one group
value: the first value in one group when its pages are sorted by priority in descending order.
The lambda expressions are a way to express which property should be used by a LINQ function like Sum. In short, page => page.priority says "transform page into page.priority".
You want these new rows in descending order of priority, so finally you can do
result = query.OrderByDescending(x => x.priority).ToList();
The x is just an arbitrary placeholder representing one item in the collection at hand, query (likewise, in the query above, page could have been any word or character).

Limit Number of Results being returned in a List from Linq

I'm using Linq/EF4.1 to pull some results from a database and would like to limit the results to the (X) most recent results. Where X is a number set by the user.
Is there a way to do this?
I'm currently passing them back as a List, if that helps with limiting the result set. While I could limit this by looping until I hit X, I'd just as soon not pass the extra data around.
Just in case it is relevant...
C# MVC3 project running from a SQL Server database.
Use the Take function:
int numberOfRecords = 10; // read from user
var mostRecent = listOfItems.OrderByDescending(x => x.CreatedDate)
                            .Take(numberOfRecords);
This assumes listOfItems is a List of your entity objects and CreatedDate is a field holding the date-created value (used here to order descending and get the most recent items).
The Take() function returns a specified number of contiguous elements from the start of a sequence.
http://msdn.microsoft.com/en-us/library/bb503062.aspx
results = results.OrderByDescending(x=>x.Date).Take(10);
The OrderByDescending(...) will sort items by your date/time property (or whatever logic you want to use to get the most recent) and Take(...) will limit the result to the first x items (the first being the most recent, thanks to the ordering).
Edit: To return some rows not starting at the first row, use Skip():
results = results.OrderByDescending(x=>x.Date).Skip(50).Take(10);
Use Take() before converting to a List. This way EF can optimize the query it creates and return only the data you need.
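A minimal sketch of that, assuming an EF DbContext with an Items set and a CreatedDate column (both are placeholders): with Take() applied before ToList(), EF emits a TOP(n) query instead of pulling every row.

int x = 10; // the user-chosen limit
List<Item> mostRecent = context.Items
    .OrderByDescending(i => i.CreatedDate)
    .Take(x)    // becomes part of the SQL query itself
    .ToList();  // only x rows are ever materialized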

Collecting metadata into table

I have tabular data that passes through a C# program, and I need to collect some metadata about it before finishing. The metadata is always counts based on fields of the data, and I need them all grouped by one field. Periodically, I will need to add new counts to this collection of metadata.
I've been researching it for a little while, and I think what makes sense is to rework my program to store the data as a DataTable and then run LINQ queries on the table. The problem I'm having is putting the different counts into one table-like structure and then writing that out.
I might run a query like this:
var query01 =
    from record in records.AsEnumerable()
    group record by record.Field<String>("Association Key") into associationsGroup
    select new { AssociationKey = associationsGroup.Key,
                 Count = associationsGroup.Count<DataRow>() };
That gets a count of all of the records, grouped by the field Association Key. I'm also going to want another count, grouped in the same way:
var query02 =
    from record in records.AsEnumerable()
    where record.Field<String>("Number 9") == "yes"
    group record by record.Field<String>("Association Key") into associationsGroup
    select new { AssociationKey = associationsGroup.Key,
                 Number9Count = associationsGroup.Count<DataRow>() };
And so on.
I thought about trying to Union-chain the queries, but I was having trouble getting them to union since I'm projecting into anonymous types, and I couldn't figure out how to restructure them to make the union work.
So, how can I collect my metadata into one table-like structure?
They're not going to union because you have different types. Add both Number9Count and Count to both anonymous types and try the union again.
I ended up solving the problem by creating a class that holds the set of records as a DataTable. A user can add queries through a method that takes a Func<DataRow, bool> argument; the method constructs the query, supplying that argument as the where clause while maintaining the same grouping and properties in the resulting anonymous-typed objects.
When retrieving the results, the class iterates over each stored query and enters the results into a new DataTable.
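A hedged sketch of such a class; the names are invented, and it assumes the grouping field is always Association Key, as in the question:

// requires: using System; using System.Collections.Generic;
//           using System.Data; using System.Linq;
public class MetadataCollector
{
    private readonly DataTable records;
    private readonly Dictionary<string, Func<DataRow, bool>> queries =
        new Dictionary<string, Func<DataRow, bool>>();

    public MetadataCollector(DataTable records)
    {
        this.records = records;
    }

    // Register a named count; the predicate plays the role of the where clause.
    public void AddQuery(string countName, Func<DataRow, bool> predicate)
    {
        queries[countName] = predicate;
    }

    // One row per Association Key, one column per registered count.
    public DataTable GetResults()
    {
        var result = new DataTable();
        result.Columns.Add("AssociationKey", typeof(string));
        foreach (string name in queries.Keys)
            result.Columns.Add(name, typeof(int));

        foreach (var grp in records.AsEnumerable()
                                   .GroupBy(r => r.Field<string>("Association Key")))
        {
            DataRow row = result.NewRow();
            row["AssociationKey"] = grp.Key;
            foreach (var q in queries)
                row[q.Key] = grp.Count(q.Value);
            result.Rows.Add(row);
        }
        return result;
    }
}

// usage:
// var collector = new MetadataCollector(records);
// collector.AddQuery("Count", r => true);
// collector.AddQuery("Number9Count", r => r.Field<string>("Number 9") == "yes");
// DataTable metadata = collector.GetResults();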

Converting IEnumerable<T> to List<T> on a LINQ result, huge performance loss

On a LINQ result like this:
var result = from x in Items select x;
List<T> list = result.ToList<T>();
However, the ToList<T> is Really Slow. Does it make the list mutable, and is the conversion therefore slow?
In most cases I can manage with just my IEnumerable or a parallel Distinct() query, but now I want to bind the items to a DataGridView, so I need something other than an IEnumerable. Any suggestions on how to gain performance on ToList, or an alternative to it?
On 10 million records in the IEnumerable, the .ToList<T> takes about 6 seconds.
.ToList() is slow in comparison to what?
If you are comparing
var result = from x in Items select x;
List<T> list = result.ToList<T>();
to
var result = from x in Items select x;
you should note that since the query is evaluated lazily, the first line doesn't do much at all. It doesn't retrieve any records. Deferred execution makes this comparison completely unfair.
It's because LINQ likes to be lazy and do as little work as possible. This line:
var result = from x in Items select x;
despite your choice of name, isn't actually a result, it's just a query object. It doesn't fetch any data.
List<T> list = result.ToList<T>();
Now you've actually requested the result, hence it must fetch the data from the source and make a copy of it. ToList guarantees that a copy is made.
With that in mind, it's hardly surprising that the second line is much slower than the first.
No, it's not creating the list that takes time, it's fetching the data that takes time.
Your first code line doesn't actually fetch the data, it only sets up an IEnumerable that is capable of fetching the data. It's when you call the ToList method that it will actually get all the data, and that is why all the execution time is in the second line.
You should also consider if having ten million lines in a grid is useful at all. No user is ever going to look through all the lines, so there isn't really any point in getting them all. Perhaps you should offer a way to filter the result before getting any data at all.
I think it's because of memory reallocations: ToList cannot know the size of the collection beforehand, so it cannot allocate enough storage up front to keep all the items. Therefore, it has to reallocate the List<T> as it grows.
If you can estimate the size of your result set, it'll be much faster to preallocate enough space using the List<T>(int) constructor overload, and then manually add the items to it.
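A minimal sketch of that preallocation idea; MyItem and the source of estimatedCount are assumptions, not the asker's actual types:

int estimatedCount = 10000000; // e.g. from a cheap Count() or a known upper bound
var list = new List<MyItem>(estimatedCount); // one allocation up front
foreach (var item in result)
    list.Add(item); // no intermediate re-allocations while growing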
