I wrote a query to find nodes in nodedata from transitiondata, but it is taking quite a long time to come out of that loop since it has 4 million records.
What we have:
1. Transition data (collection), which has a from node and a to node.
2. Node data (collection), which has a key equal to the from or to node from the transition data (collection).
What is required out of these collections:
1. A collection which should have the transition data (from, to) and the corresponding nodes from the node data (from key and to key).
The code I wrote works fine, but it takes a lot of time to execute. Below is the code.
foreach (var trans in transitions)
{
    string transFrom = trans.From;
    string transTo = trans.To;
    var fromNodeData = nodeEntitydata.Where(x => x.Key == transFrom).FirstOrDefault();
    var toNodeData = nodeEntitydata.Where(x => x.Key == transTo).FirstOrDefault();
    if (fromNodeData != null && toNodeData != null)
    {
        //string fromSwimlane = fromNodeData.Group;
        //string toSwimlane = toNodeData.Group;
        string dicKey = fromNodeData.sokey + toNodeData.sokey;
        if (!dicTrans.ContainsKey(dicKey))
        {
            soTransition.Add(new TransitionDataJsonObject
            {
                From = fromNodeData.sokey,
                To = toNodeData.sokey,
                FromPort = fromPortIds[0],
                ToPort = toPortIds[0],
                Description = "SOTransition",
                IsManual = true
            });
            dicTrans.Add(dicKey, soTransition);
        }
    }
}
That is the loop which takes time to execute. I know the problem is in those two Where clauses, because transitions will have 400k records and nodeEntitydata will have 400k records. Can someone help me with this?
Use direct access to the dictionary entry:
var fromNodeData = nodeEntitydata[transFrom];
var toNodeData = nodeEntitydata[transTo];
It looks like nodeEntitydata is just a normal collection. The problem you're facing is that a Where over an in-memory collection is a linear scan, and you've got a lot of records to process.
What you need is a Dictionary. This is much more efficient for looking items up in large collections, because it hashes the key and jumps straight to the matching entry rather than scanning every element, so lookups are O(1) on average instead of O(n).
If nodeEntitydata isn't already a Dictionary, you can create a Dictionary from it like this:
var nodeEntitydictionary = nodeEntitydata.ToDictionary(n => n.Key);
You can then consume the dictionary like this:
var fromNodeData = nodeEntitydictionary[transFrom];
var toNodeData = nodeEntitydictionary[transTo];
Creating the Dictionary will be fairly slow, so make sure you only do it once at the point where you populate nodeEntitydata. If you have to keep re-instantiating the Dictionary too frequently then you won't see much of a performance benefit, so make sure you reuse it as much as possible.
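Putting it together, here is a minimal sketch of the original loop rewritten against such a dictionary, assuming the surrounding variables (dicTrans, soTransition, fromPortIds, toPortIds) from the question; TryGetValue stands in for the FirstOrDefault null checks:
// Build the lookup once, outside the loop.
var nodesByKey = nodeEntitydata.ToDictionary(n => n.Key);
foreach (var trans in transitions)
{
    // TryGetValue does one hashed lookup per key and handles missing keys,
    // mirroring the FirstOrDefault null checks in the original code.
    if (!nodesByKey.TryGetValue(trans.From, out var fromNodeData) ||
        !nodesByKey.TryGetValue(trans.To, out var toNodeData))
    {
        continue;
    }
    string dicKey = fromNodeData.sokey + toNodeData.sokey;
    if (!dicTrans.ContainsKey(dicKey))
    {
        soTransition.Add(new TransitionDataJsonObject
        {
            From = fromNodeData.sokey,
            To = toNodeData.sokey,
            FromPort = fromPortIds[0],
            ToPort = toPortIds[0],
            Description = "SOTransition",
            IsManual = true
        });
        dicTrans.Add(dicKey, soTransition);
    }
}
With the dictionary, each iteration costs O(1) instead of O(n), so the whole pass drops from O(n²) to O(n).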
I have a large (60m+) document collection, whereby each ID has many records in time series. Each record has an IMEI identifier, and I'm looking to select the most recent record for each IMEI in a given List<Imei>.
The brute-force method is what is currently happening: I loop over each IMEI, pull out the top record, and return the complete collection after the loop finishes. As such:
List<BsonDocument> documents = new List<BsonDocument>();
foreach (var config in imeiConfigs)
{
    var filter = GetImeiFilter(config.IMEI);
    var sort = GetImeiSort();
    var data = _historyCollection.Find(filter).Sort(sort).Limit(1).FirstOrDefault();
    documents.Add(data);
}
The end result is a List<BsonDocument> which contains the most recent BsonDocument for each IMEI, but it's not massively performant. If imeiConfigs is too large, the query takes a long time to run and return as the documents are rather large.
Is there a way to select the TOP 1 for each IMEI in a single query, as opposed to brute forcing like I am above?
Have you tried using the LINQ Take function?
List<BsonDocument> documents = new List<BsonDocument>();
foreach (var config in imeiConfigs)
{
    var filter = GetImeiFilter(config.IMEI);
    var sort = GetImeiSort();
    var data = _historyCollection.Find(filter).Sort(sort).Take(1).FirstOrDefault();
    documents.Add(data);
}
https://learn.microsoft.com/es-es/dotnet/api/system.linq.enumerable.take?view=netframework-4.8
I think the bad performance comes from Sort(sort), because the sorting forces it to go through the whole collection.
But perhaps you can improve the time with parallelism.
List<BsonDocument> documents;
documents = imeiConfigs.AsParallel().Select((config) =>
{
    var filter = GetImeiFilter(config.IMEI);
    var sort = GetImeiSort();
    var data = _historyCollection.Find(filter).Sort(sort).Limit(1).FirstOrDefault();
    return data;
}).ToList();
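The approaches above still issue one query per IMEI. If the goal really is a single round-trip, one option (not taken from the answers above) is a grouped aggregation. This is only a sketch: the field names "Imei" and "Timestamp", and _historyCollection being an IMongoCollection<BsonDocument>, are assumptions.
// Sketch: $match the wanted IMEIs, sort newest-first, then keep the first document per IMEI.
var imeis = imeiConfigs.Select(c => c.IMEI).ToList();
var pipeline = new[]
{
    new BsonDocument("$match", new BsonDocument("Imei", new BsonDocument("$in", new BsonArray(imeis)))),
    new BsonDocument("$sort", new BsonDocument("Timestamp", -1)),
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", "$Imei" },
        { "latest", new BsonDocument("$first", "$$ROOT") }
    })
};
List<BsonDocument> documents = _historyCollection
    .Aggregate<BsonDocument>(pipeline)
    .ToList()
    .Select(g => g["latest"].AsBsonDocument)
    .ToList();
An index covering the IMEI and timestamp fields would help both this pipeline and the original per-IMEI queries.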
I have an IList to get all Offers from the repository using Entity Framework Core. I also have a service model OfferResponseModel which includes OfferRequestModel as a reference. I used Mapster to bind the entity model to the service model, but it only sets the first child. Now I want to bind it manually. I created "offers" with the size of "Offer". When I try to use a foreach loop, I cannot set the child elements of "offers" because it has no elements. How can I solve this?
var offer = await _unitOfWork.Offers.GetAllOffer();
if (offer == null)
    throw ServiceExceptions.OfferNotFound;
var results = new List<OfferResponseModel>(offer.Count);
results.ForEach(c => { c.Offer = new OfferRequestModel(); });
int i = 0;
foreach (var result in results)
{
    result.Offer.User = Offer[i].User.Adapt<UserResponseModel>();
    result.Offer.Responsible = Offer[i].Responsible.Adapt<EmployeeResponseModel>();
    result.CreatedDate = Offer[i].CreatedDate;
    result.ModifiedBy = Guid.Parse(Offer[i].UpdatedBy);
    result.Active = Offer[i].Status;
    result.Offer = Offer[i].Offer;
    result.Offer.User.Company = Offer[i].Company.Adapt<CompanyModel>();
    i++;
}
I created "offers" with the size of "Offer".
No, you created it with that capacity. It's still an empty list. It's not clear to me why you're trying to take this approach at all - it looks like you want one OfferResponseModel for each entry in offer, directly from that - which you can do with a single LINQ query. (I'm assuming that offer and Offer are equivalent here.)
var results = Offer.Select(o => new OfferResponseModel
{
    Offer = new OfferRequestModel
    {
        User = o.User.Adapt<UserResponseModel>(),
        Responsible = o.Responsible.Adapt<EmployeeResponseModel>()
    },
    CreatedDate = o.CreatedDate,
    ModifiedBy = Guid.Parse(o.UpdatedBy),
    Active = o.Status
}).ToList();
That doesn't set the Offer.User.Company in each entry, but your original code is odd as it sets the User and Responsible properties in the original Offer property, and then replaces the Offer with Offer[i].Offer. (Aside from anything else, I'd suggest trying to use the term "offer" less frequently - just changing the plural to "offers" would help.)
I suspect that with the approach I've outlined above, you'll be able to work out what you want and express it more clearly anyway. You definitely don't need to take the "multiple loops" approach of your original code.
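If the Company mapping from the original code is still needed, one way to fold it in (a sketch only, keeping the question's assumption that Company lives on Offer.User) is:
var results = Offer.Select(o =>
{
    var response = new OfferResponseModel
    {
        Offer = new OfferRequestModel
        {
            User = o.User.Adapt<UserResponseModel>(),
            Responsible = o.Responsible.Adapt<EmployeeResponseModel>()
        },
        CreatedDate = o.CreatedDate,
        ModifiedBy = Guid.Parse(o.UpdatedBy),
        Active = o.Status
    };
    // Carried over from the question: map the entity's Company onto the mapped user.
    response.Offer.User.Company = o.Company.Adapt<CompanyModel>();
    return response;
}).ToList();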
One thing you have left out is the type of the offer variable that is referenced in the code. But I am thinking you need to do something along these lines:
if (offer == null)
    throw ServiceExceptions.OfferNotFound;

var results = offer.Select(o => new OfferResponseModel
{
    Offer = new OfferRequestModel
    {
        User = o.User.Adapt<UserResponseModel>(),
        Responsible = o.Responsible.Adapt<EmployeeResponseModel>(),
        ...
    }
}).ToList();
Select basically loops through the items in offer and "converts" them to other objects, in this case OfferResponseModel. So inside the Select you simply new up an OfferResponseModel and directly set all the properties you need to set.
You need using System.Linq; for Select to be available.
I am hitting the database in a loop. Please suggest how I can avoid this.
foreach (var assignedPricing in reviewingPricing)
{
    var assignedUserId = _wfActivitySvc.GetPricingReviewer(assignedPricing.PricingID).UserId;
    if (assignedUserId == UserId)
    {
        reviewingAssignedPricings.Add(assignedPricing);
    }
}
Create a new query in your database service:
// Build a collection with just the unique ids
var priceIds = reviewingPricing.Select(x => x.PricingID).Distinct();

// Return a key/value collection with all priceId/UserId pairs
var reviewerMap = _wfActivitySvc.GetAllReviewersByPriceId(priceIds);

// Now you can loop without db queries
foreach (var pricing in reviewingPricing)
{
    var reviewer = reviewerMap.FirstOrDefault(x => x.PricingId == pricing.PricingID);
    if (reviewer == null)
        continue;
    if (reviewer.UserId == UserId)
    {
        reviewingAssignedPricings.Add(pricing);
    }
}
1) You may want to insert all records at once. You can create a stored procedure to do this. If you use SQL Server you can use the BulkInserter class: https://github.com/ronnieoverby/RonnieOverbyGrabBag/blob/master/BulkInserter.cs
For production I had to tweak it a little internally to reduce its initialization time, but for infrequent bulk inserts the GitHub version is just fine.
Usage example:
var bulkInserter = new BulkInserter<YourClass>(SqlConnection, "Table Name");
bulkInserter.Insert(reviewingAssignedPricings);
bulkInserter.Flush();
2) If this reads from the database every time
_wfActivitySvc.GetPricingReviewer(assignedPricing.PricingID).UserId;
then replace it with a single call outside the loop to get all reviewers from the database, add them to a dictionary (key = priceID, value = reviewer), and then get reviewers from the dictionary inside the loop. If you use a plain List and .FirstOrDefault(), it can be noticeably slow for 100+ items in the list. jgauffin's answer describes this idea.
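As a sketch of that idea (GetAllReviewersByPriceId and the shape of its results are assumptions carried over from the other answer):
// One database call, then O(1) dictionary lookups inside the loop.
var priceIds = reviewingPricing.Select(x => x.PricingID).Distinct().ToList();
var reviewerByPriceId = _wfActivitySvc.GetAllReviewersByPriceId(priceIds)
    .ToDictionary(r => r.PricingId, r => r.UserId);

foreach (var pricing in reviewingPricing)
{
    if (reviewerByPriceId.TryGetValue(pricing.PricingID, out var reviewerUserId)
        && reviewerUserId == UserId)
    {
        reviewingAssignedPricings.Add(pricing);
    }
}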
So I'm basically trying to enumerate results from AD, and for some reason I'm unable to pull down new results, meaning it keeps continuously pulling the first 1500 results even though I tell it I want an additional range.
Can someone point out where I'm making the mistake? The code never breaks out of the loop but more importantly it pulls users 1-1500 even when I say I want users 1500-3000.
uint rangeStep = 1500;
uint rangeLow = 0;
uint rangeHigh = rangeLow + (rangeStep - 1);
bool lastQuery = false;
bool quitLoop = false;
do
{
    string attributeWithRange;
    if (!lastQuery)
    {
        attributeWithRange = String.Format("member;Range={0}-{1}", rangeLow, rangeHigh);
    }
    else
    {
        attributeWithRange = String.Format("member;Range={0}-*", rangeLow);
    }
    DirectoryEntry dEntryhighlevel = new DirectoryEntry("LDAP://OU=C,OU=x,DC=h,DC=nt");
    DirectorySearcher dSeacher = new DirectorySearcher(dEntryhighlevel, "(&(objectClass=user)(memberof=CN=Users,OU=t,OU=s,OU=x,DC=h,DC=nt))", new string[] { attributeWithRange });
    dSeacher.PropertiesToLoad.Add("givenname");
    dSeacher.PropertiesToLoad.Add("sn");
    dSeacher.PropertiesToLoad.Add("samAccountName");
    dSeacher.PropertiesToLoad.Add("mail");
    dSeacher.PageSize = 1500;
    SearchResultCollection resultCollection = dSeacher.FindAll();
    dSeacher.Dispose();
    foreach (SearchResult userResults in resultCollection)
    {
        string Last_Name = userResults.Properties["sn"][0].ToString();
        string First_Name = userResults.Properties["givenname"][0].ToString();
        string userName = userResults.Properties["samAccountName"][0].ToString();
        string Email_Address = userResults.Properties["mail"][0].ToString();
        OriginalList.Add(Last_Name + "|" + First_Name + "|" + userName + "|" + Email_Address);
    }
    if (resultCollection.Count == 1500)
    {
        lastQuery = true;
        rangeLow = rangeHigh + 1;
        rangeHigh = rangeLow + (rangeStep - 1);
    }
    else
    {
        quitLoop = true;
    }
} while (!quitLoop);
You're mixing up two concepts which is what is causing you trouble. This is a FAQ on the SO forums so I probably should blog on this to try and clear things up.
Let me first just explain the concepts, then correct the code once the concepts are out there.
Concept one is fetching large collections of objects. When you fetch a lot of objects, you need to ask for them in batches. This is typically called "paging" through the results. When you do this you'll get back a paging cookie and can pass back the paged control in subsequent searches to keep getting a "page worth" of results with each pass.
The second concept is fetching large numbers of values from a single attribute. The simple example of this is reading the member attribute from a group (ex: doing a base search for that group). This is called "ranged retrieval." In this search mode you are doing a base search against that object for the large attribute (like member) and asking for "ranges" of values with each passing search.
The code above confuses these concepts. You are doing member range logic like you are doing range retrieval but you are in fact doing a search that is constructed to return a large # of objects like a paged search. This is why you are getting the same results over and over.
To fix this you need to first pick an approach. :) I recommend range retrieval against the group object and asking for the large member attribute in ranges. This will get you all of the members in the group.
If you go down this path, you'll notice you can't ask for other attributes along with these values. The only value you get is the list of members, and you can then do follow-up searches for them. If that doesn't work for you, then you end up switching to paged searches.
If you opt to stick with paged searches, then you'll need to:
Get rid of the Range logic, and all mentions of 1500
Set a page size of something like 1000
Instead of ranging, look up how to do paged searches (using the page search control) using your API
If you pick ranging, you'll switch from a memberOf search like this to a search of the form:
a) scope: base
b) filter: (objectClass=*)
c) base DN: OU=C,OU=x,DC=h,DC=nt
d) Attributes: member;Range=0-*
...then you will increment the 0 up as you fetch ranges of values (ie do this search over and over again for each subsequent range of values, changing only the 0 to subsequent integers)
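A minimal sketch of that ranged-retrieval loop with System.DirectoryServices, assuming the group DN taken from the question's filter (the attribute-name handling follows the range convention just described):
// Read all values of the "member" attribute from the group, one range at a time.
var members = new List<string>();
int rangeLow = 0;
bool moreValues = true;
using (var groupEntry = new DirectoryEntry("LDAP://CN=Users,OU=t,OU=s,OU=x,DC=h,DC=nt"))
{
    while (moreValues)
    {
        string rangedAttribute = String.Format("member;range={0}-*", rangeLow);
        using (var searcher = new DirectorySearcher(groupEntry, "(objectClass=*)", new[] { rangedAttribute }, SearchScope.Base))
        {
            SearchResult result = searcher.FindOne();
            moreValues = false;
            foreach (string propertyName in result.Properties.PropertyNames)
            {
                // The server echoes the attribute back as e.g. "member;range=0-1499",
                // or ends it with "*" when this is the final range.
                if (!propertyName.StartsWith("member;range=", StringComparison.OrdinalIgnoreCase))
                    continue;
                foreach (object value in result.Properties[propertyName])
                    members.Add(value.ToString());
                if (!propertyName.EndsWith("*"))
                {
                    rangeLow += result.Properties[propertyName].Count;
                    moreValues = true;
                }
            }
        }
    }
}
// "members" now holds distinguished names; do follow-up searches per member for givenname/sn/mail if needed.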
Other nits you'll notice in my logic:
- I don't set page size...you're not doing a paged search, so it doesn't matter.
- I don't ever hard-code the value 1500 here. It doesn't matter; there is no value in knowing or even computing it. The point is that you asked for 0-* (i.e. all), you got back 1500, so then you ask for 1500-*, then 3000-*, and so on. You don't need to know the range size, only how much you have been given so far.
I hope this fully answers it...
Here is a code snip of doing a paged search, per my comment below (this is what you would need to do using the System.DirectoryServices.Protocols namespace classes, going down the logical path you started above (paged searches, not ranged retrieval)):
string searchFilter = "(&(objectClass=user)(memberof=CN=Users,OU=t,OU=s,OU=x,DC=h,DC=nt))";
string baseDN = "OU=C,OU=x,DC=h,DC=nt";
var scope = SearchScope.Subtree;
var attributeList = new string[] { "givenname", "sn", "samAccountName", "mail" };
PageResultRequestControl pageSearchControl = new PageResultRequestControl(1000);
do
{
    SearchRequest sr = new SearchRequest(baseDN, searchFilter, scope, attributeList);
    sr.Controls.Add(pageSearchControl);
    var directoryResponse = ldapConnection.SendRequest(sr);
    if (directoryResponse.ResultCode != ResultCode.Success)
    {
        // Handle error
    }
    var searchResponse = (SearchResponse)directoryResponse;
    pageSearchControl = null; // Reset!
    foreach (var control in searchResponse.Controls)
    {
        if (control is PageResultResponseControl)
        {
            var prrc = (PageResultResponseControl)control;
            if (prrc.Cookie.Length > 0)
            {
                pageSearchControl = new PageResultRequestControl(prrc.Cookie);
            }
        }
    }
    foreach (var entry in searchResponse.Entries)
    {
        // Handle the search result entry
    }
} while (pageSearchControl != null);
Your problem is caused by creating a new DirectorySearcher object inside the loop. Each time, that new object will pull the first 1500 records again. Create the searcher instance outside the loop and use the same instance for all queries.
I'm trying to achieve a super-fast search, and decided to rely heavily on caching to achieve this. The order of events is as follows;
1) Cache what can be cached (from entire database, around 3000 items)
2) When a search is performed, pull the entire result set out of the cache
3) Filter that result set based on the search criteria. Give each search result a "relevance" score.
4) Send the filtered results down to the database via xml to get the bits that can't be cached (e.g. prices)
5) Display the final results
This is all working and going at lightning speed, but in order to achieve (3) I've given each result a "relevance" score. This is just a member integer on each search result object. I iterate through the entire result set and update this score accordingly, then order-by it at the end.
The problem I am having is that the "relevance" member is retaining this value from search to search. I assume this is because what I am updating is a reference to the search results in the cache, rather than a new object, so updating it also updates the cached version. What I'm looking for is a tidy solution to get around this. What I've come up with so far is either;
a) Clone the cache when I get it.
b) Create a separate dictionary to store relevances in and match them up at the end.
Am I missing a really obvious and clean solution, or should I go down one of these routes? I'm using C# and .NET.
Hopefully it's obvious from the description what I'm getting at; here's some code anyway. This first method is the iteration through the cached results in order to do the filtering:
private List<QuickSearchResult> performFiltering(string keywords, string regions, List<QuickSearchResult> cachedSearchResults)
{
    List<QuickSearchResult> filteredItems = new List<QuickSearchResult>();
    string upperedKeywords = keywords.ToUpper();
    string[] keywordsArray = upperedKeywords.Split(' ');
    string[] regionsArray = regions.Split(',');
    foreach (var item in cachedSearchResults)
    {
        //Check for keywords
        if (keywordsArray != null)
        {
            if (!item.ContainsKeyword(upperedKeywords, keywordsArray))
                continue;
        }
        //Check for regions
        if (regionsArray != null)
        {
            if (!item.IsInRegion(regionsArray))
                continue;
        }
        filteredItems.Add(item);
    }
    return filteredItems.OrderBy(t => t.Relevance).Take(_maxSearchResults).ToList<QuickSearchResult>();
}
And here is an example of the "IsInRegion" method of the QuickSearchResult object:
public bool IsInRegion(string[] regions)
{
    int relevanceScore = 0;
    foreach (var region in regions)
    {
        int parsedRegion = 0;
        if (int.TryParse(region, out parsedRegion))
        {
            foreach (var thisItemsRegion in this.Regions)
            {
                if (thisItemsRegion.ID == parsedRegion)
                    relevanceScore += 10;
            }
        }
    }
    Relevance += relevanceScore;
    return relevanceScore > 0;
}
And basically if I search for "london" I get a score of "10" the first time, "20" the second time...
If you use the NetDataContractSerializer to serialize your objects in the cache, you could use the [DataMember] attribute to control what gets serialized and what doesn't. For instance, you could keep your temporary calculated relevance value in a member that is not serialized.
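A sketch of that idea (the members other than Relevance and Regions are placeholders, not taken from the question):
using System.Runtime.Serialization;

[DataContract]
public class QuickSearchResult
{
    [DataMember]
    public string Title { get; set; }           // hypothetical cached data

    [DataMember]
    public List<Region> Regions { get; set; }   // used by IsInRegion

    // Deliberately not a [DataMember]: the per-search score is reset to 0 on every
    // deserialization, so cached copies never carry relevance between searches.
    public int Relevance { get; set; }
}
Pulling results out of the cache via the serializer then produces fresh copies whose Relevance always starts at zero.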