Most efficient way to search an enumerable - C#

I am writing a small program that takes in a .csv file as input with about 45k rows. I am trying to compare the contents of this file with the contents of a table on a database (SQL Server through Dynamics CRM using Xrm.Sdk, if it makes a difference).
In my current program (which takes about 25 minutes to run the comparison - the file and database are exactly the same here, both 45k rows with no differences), I have all existing records from the database in a DataCollection<Entity>, which inherits Collection<T> and IEnumerable<T>.
In my code below I am filtering using the Where method and then branching on the count of matches. The Where seems to be the bottleneck here. Is there a more efficient approach than this? I am by no means a LINQ expert.
foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                            && Convert.ToDecimal(r["field_2"]) == bps
                            && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                            && Convert.ToDecimal(r["field_4"]) == percentile
                            && Convert.ToDecimal(r["field_5"]) == age);

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}
EDIT: I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop.

Himbrombeere is right: you should execute the query first and put the result into a collection before you use Any, Count, AddRange or whatever other method will execute the query again. In your code it's possible that the query is executed five times in every loop iteration.
Watch out for the term deferred execution in the documentation. If a method is implemented that way, it means the method can be used to construct a LINQ query (so you can chain it with other methods, and at the end you have a query), but only methods that don't use deferred execution, like Count, Any or ToList (or a plain foreach), will actually execute it. If you don't want the whole query to be executed every time, and you have to access it multiple times, it's better to store the result in a collection (e.g. with ToList).
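To make this concrete, here is a minimal standalone sketch (my illustration, not code from the question) that counts how often the Where predicate actually runs:
using System;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        var evaluations = 0;
        var query = Enumerable.Range(1, 3)
            .Where(n => { evaluations++; return n > 1; }); // nothing runs yet

        var any   = query.Any();    // enumerates (stops at the first match)
        var count = query.Count();  // enumerates again, from the start
        var list  = query.ToList(); // enumerates a third time; reuse list from here on

        Console.WriteLine(evaluations); // 8 with this data (2 + 3 + 3), not 3
    }
}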
However, you could use a different approach which should be much more efficient: a Lookup<TKey, TElement>, which is similar to a dictionary and can be used with an anonymous type as key:
// Note: percentile and age are converted with Convert.ToInt32 here so that the
// key's anonymous type matches the one used to index the lookup in the loop below
// (anonymous type equality requires identical property names and types).
var lookup = existingRecords.Entities.ToLookup(r => new
{
    fund = r["field_1"].ToString(),
    bps = Convert.ToDecimal(r["field_2"]),
    withdrawalPct = Convert.ToDecimal(r["field_3"]),
    percentile = Convert.ToInt32(r["field_4"]),
    age = Convert.ToInt32(r["field_5"])
});
Now you can access this lookup in the loop very efficiently.
foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = lookup[new { fund, bps, withdrawalPct, percentile, age }].ToList();

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}
Note that this will work even if the key does not exist (an empty sequence is returned, so the ToList simply yields an empty list).
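Here is a tiny self-contained sketch (my own illustration, not from the original answer) showing that an ILookup indexer never throws on a missing key:
using System;
using System.Linq;

class LookupMissDemo
{
    static void Main()
    {
        var byLength = new[] { "a", "bb", "cc" }.ToLookup(s => s.Length);

        Console.WriteLine(byLength[2].Count()); // 2 ("bb" and "cc")
        Console.WriteLine(byLength[9].Count()); // 0 - empty sequence, no KeyNotFoundException
    }
}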

Add a ToList() after the last condition (the Convert.ToDecimal(r["field_5"]) == age line) to force immediate execution of the query:
var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                        && Convert.ToDecimal(r["field_2"]) == bps
                        && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                        && Convert.ToDecimal(r["field_4"]) == percentile
                        && Convert.ToDecimal(r["field_5"]) == age)
                    .ToList();
The Where doesn't actually execute your query, it just prepares it; the actual execution happens later, in a deferred way. In your case that happens when calling Count, which itself will iterate the entire collection of items. And if the first condition fails, the second one is checked, leading to a second iteration of the complete collection when Count is called again. You then actually execute the query a third time when calling matchingRows.First().
When forcing immediate execution you execute the query only once, and thus iterate the entire collection only once, which will decrease your overall time.
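As a small follow-up (my suggestion, not part of the original answer): once matchingRows is a List<T>, prefer the Count property and the indexer over the extension methods, since they are O(1) member accesses and never re-enumerate anything:
if (matchingRows.Count == 0)      // property access, no enumeration
{
    rowsToAdd.Add(record);
}
else if (matchingRows.Count == 1)
{
    if (Convert.ToDecimal(matchingRows[0]["field_6"]) != bombOutTerm) // indexer instead of First()
    {
        rowsToUpdate.Add(record);
        entitiesToUpdate.Add(matchingRows[0]);
    }
}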

Another option, which is basically along the same lines as the other answers, is to prepare your data first, so that you're not repeatedly calling things like r["field_2"] (which are relatively slow to look up).
This is a (1) clean your data, (2) query/join your data, (3) process your data approach.
Do this:
(1)
var inputs = inputDataLines
    .Select(record =>
    {
        var fields = record.Split(',');
        return new
        {
            fund = fields[0],
            bps = Convert.ToDecimal(fields[1]),
            withdrawalPct = Convert.ToDecimal(fields[2]),
            percentile = Convert.ToInt32(fields[3]),
            age = Convert.ToInt32(fields[4]),
            bombOutTerm = Convert.ToDecimal(fields[5]),
            record
        };
    })
    .ToArray();
var entities = existingRecords
    .Entities
    .Select(entity => new
    {
        fund = entity["field_1"].ToString(),
        bps = Convert.ToDecimal(entity["field_2"]),
        withdrawalPct = Convert.ToDecimal(entity["field_3"]),
        percentile = Convert.ToInt32(entity["field_4"]),
        age = Convert.ToInt32(entity["field_5"]),
        bombOutTerm = Convert.ToDecimal(entity["field_6"]),
        entity
    })
    .ToArray()
    .GroupBy(x => new
    {
        x.fund,
        x.bps,
        x.withdrawalPct,
        x.percentile,
        x.age
    }, x => new
    {
        x.bombOutTerm,
        x.entity,
    });
(2)
var query =
    from i in inputs
    join e in entities on new { i.fund, i.bps, i.withdrawalPct, i.percentile, i.age } equals e.Key into matches
    // "into matches" makes this a group join, so inputs with no matching entities
    // still come through (with an empty sequence); a plain inner join would drop
    // them and the Count() == 0 branch below would never run.
    select new { input = i, matchingRows = matches.SelectMany(g => g).ToList() };
(3)
foreach (var x in query)
{
    entitiesFound.AddRange(x.matchingRows.Select(y => y.entity));

    if (x.matchingRows.Count() == 0)
    {
        rowsToAdd.Add(x.input.record);
    }
    else if (x.matchingRows.Count() == 1)
    {
        if (x.matchingRows.First().bombOutTerm != x.input.bombOutTerm)
        {
            rowsToUpdate.Add(x.input.record);
            entitiesToUpdate.Add(x.matchingRows.First().entity);
        }
    }
    else
    {
        entitiesToDelete.AddRange(x.matchingRows.Select(y => y.entity));
        rowsToAdd.Add(x.input.record);
    }
}
I would suspect that this will be among the fastest approaches presented.

Related

ToList() takes so long to convert, how can I improve this?

I have the following code, in which I am using the ToList method to convert my data from the database to a list. The reason I have to convert the whole data set to a list is that I have to perform search operations after that, for which I am using Where and lambda expressions, and for which we need the list.
Is there any alternative to this?
// This takes less than 2 seconds to execute
var wdata = (from s in db.VIEW_ADDED_LOT
             select new LotModel
             {
                 CREATION_DATE = s.CREATION_DATE,
                 LOT_NO_SPL = s.LOT_NO_SPL,
                 LOT_TYPE = s.LOT_TYPE,
                 ITEM = s.ITEM,
                 BUSINESS_UNIT = s.BUSINESS_UNIT,
                 INSPECTOR = s.INSPECTOR,
                 NCRNO = s.NCRNO,
                 BUILDING_NO = s.BUILDING_NO,
                 CELL = s.CELL,
                 NCR_DT = s.NCR_DT,
                 INVENTORY_ROUTER = s.INVENTORY_ROUTER,
                 DOC_ISSUE = s.DOC_ISSUE,
                 COMMENTS = s.COMMENTS,
                 AGING = s.AGING,
                 ARCHIVAL_DATE = s.ARCHIVAL_DATE,
                 NCR_COMPLETION_STATUS = s.NCR_COMPLETION_STATUS,
                 FLAG_LINK = s.FLAG_LINK,
                 P_KEY = s.P_KEY
             });

// This takes around 1 minute to convert to a list, as there are 500,000 rows
var data = wdata.ToList();

// The reason why I am converting to a list is that I have to perform n number of
// searches on the basis of the filter chosen by the user
if (NCR != null && NCR != "")
{
    data = data.Where(a => a.NCRNO == NCR).ToList();
}
if (LOT != null && LOT != "")
{
    data = data.Where(a => a.LOT_NO_SPL == LOT).ToList();
}
In your example the wdata.ToList() call evaluates the query and loads the entirety of wdata into memory. The initial assignment of wdata only creates an IQueryable; it does not actually query the database.
To avoid the slow performance you should apply all your filters to the IQueryable and then call ToList() at the end, for example:
var data = wdata; // at this point it's still a queryable built from your initial LINQ

if (NCR != null && NCR != "")
{
    data = data.Where(a => a.NCRNO == NCR); // appends one filter condition
}
if (LOT != null && LOT != "")
{
    data = data.Where(a => a.LOT_NO_SPL == LOT); // appends another filter condition
}

var finalResult = data.ToList();
This will append your conditions to the IQueryable, which is eventually resolved once you call ToList(). That way you won't have to load all your entities into memory, as the filters will be evaluated in the database.
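If there are many optional filters, a small extension method can keep the composition tidy. This WhereIf helper is my own suggestion (it is not part of Entity Framework or the original answer):
using System;
using System.Linq;
using System.Linq.Expressions;

public static class QueryableExtensions
{
    // Applies the predicate only when the condition holds; otherwise the
    // queryable passes through unchanged, so nothing extra is translated to SQL.
    public static IQueryable<T> WhereIf<T>(
        this IQueryable<T> source, bool condition, Expression<Func<T, bool>> predicate)
        => condition ? source.Where(predicate) : source;
}

// Usage with the query from the question:
// var finalResult = wdata
//     .WhereIf(!string.IsNullOrEmpty(NCR), a => a.NCRNO == NCR)
//     .WhereIf(!string.IsNullOrEmpty(LOT), a => a.LOT_NO_SPL == LOT)
//     .ToList();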

How can I improve the performance of inserting into a list while looping in C#?

How can I improve the performance of the code block below? The distinctUniqueIDs list has more than 10,000 records. I have tried parallel execution, but it's not yielding much difference. Is there any other room for optimization?
for (int i = 0; i < distinctUniqueIDs.Count; i++)
{
    long distinctUniqueID = distinctUniqueIDs[i];
    try
    {
        if (!SmartAppConfigMapDict.ContainsKey(distinctUniqueID))
        {
            appconfigTemp = new SmartAppconfigRootElementMap();
            filteredList = validAppConfig.Where(m => m.UniqueID == distinctUniqueID);
            if (filteredList?.Count() > 0)
            {
                appconfigTemp.AppConfig = filteredList.First();
                appconfigTemp.RootElements = filteredList.ToDictionary(m =>
                    m.RootElementVersionID, m => m.RootElementName);
                appconfigTemp.RootElementPrimaryKeyMaps = filteredList.ToDictionary(m =>
                    m.RootElementVersionID, m => new RootElementPrimaryKeyMap
                    {
                        RootElementName = m.RootElementName,
                        RootElementVersionID = m.RootElementVersionID,
                        ChildElementDesc = m.ChildElementDesc,
                        ChildElementName = m.ChildElementName,
                        ChildElementPath = m.ChildElementPath,
                        ActualRootElementVersionID = m.ActualRootElementVersionID,
                        ActualRootElementName = m.ActualRootElementName
                    });
                SmartAppConfigMapDict.Add(distinctUniqueID, appconfigTemp);

                if (queryRootElementMap.ContainsKey(appconfigTemp.AppConfig.QueryID))
                {
                    queryRootElementMap[appconfigTemp.AppConfig.QueryID].Add(appconfigTemp);
                }
                else
                {
                    appConfigRootMapSubList = new List<SmartAppconfigRootElementMap>();
                    appConfigRootMapSubList.Add(appconfigTemp);
                    queryRootElementMap.Add(appconfigTemp.AppConfig.QueryID,
                        appConfigRootMapSubList);
                }
            }
        }
    }
    catch (Exception)
    {
        continue;
    }
}
The culprit is most likely this line:
filteredList = validAppConfig.Where(m => m.UniqueID == distinctUniqueID);
First, the LINQ Where method has linear time complexity, O(N); using such methods inside loops is not recommended.
Second, due to LINQ deferred execution, that linear search is executed several times - basically by every operator applied to filteredList: filteredList?.Count(), filteredList.First() and the two filteredList.ToDictionary(…) calls.
What you can do is prepare, in advance, a fast hash-based lookup data structure (for instance, a Lookup) outside the loop and use it inside.
e.g. add something like this outside the loop:
var validAppConfigsByUniqueID = validAppConfig.ToLookup(m => m.UniqueID);
and inside the loop replace
filteredList = validAppConfig.Where(m => m.UniqueID == distinctUniqueID);
if (filteredList?.Count() > 0)
with
var filteredList = validAppConfigsByUniqueID[distinctUniqueID];
if (filteredList.Any())
Note that the validAppConfigsByUniqueID[distinctUniqueID] operation has constant time complexity, O(1). Also, the returned enumerable is already buffered, so iterating it several times is not an issue.
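Putting the pieces together, the reworked loop would look roughly like this (a sketch based on the question's identifiers, with the inner body left unchanged):
// Built once, outside the loop: one O(n) pass, then O(1) per lookup.
var validAppConfigsByUniqueID = validAppConfig.ToLookup(m => m.UniqueID);

for (int i = 0; i < distinctUniqueIDs.Count; i++)
{
    long distinctUniqueID = distinctUniqueIDs[i];
    if (!SmartAppConfigMapDict.ContainsKey(distinctUniqueID))
    {
        // The grouping is already buffered, so Any(), First() and the
        // ToDictionary calls no longer re-scan validAppConfig.
        var filteredList = validAppConfigsByUniqueID[distinctUniqueID];
        if (filteredList.Any())
        {
            // ... the rest of the original loop body, unchanged ...
        }
    }
}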

How do I convert this looped code to a single LINQ implementation?

I am trying to optimise the code below which loops through objects one by one and does a database lookup. I want to make a LINQ statement that will do the same task in one transaction.
This is my inefficient looped code;
IStoreUnitOfWork uow = StoreRepository.UnitOfWorkSource.GetUnitOfWorkFactory().CreateUnitOfWork();
var localRunners = new List<Runners>();

foreach (var remoteRunner in m.Runners)
{
    var localRunner = uow.CacheMarketRunners
        .Where(x => x.SelectionId == remoteRunner.SelectionId && x.MarketId == m.MarketId)
        .FirstOrDefault();
    localRunners.Add(localRunner);
}
This is my very feeble attempt at a single query to do the same thing. Well, it's not even an attempt; I don't know where to start. The remoteRunners object has a composite key.
IStoreUnitOfWork uow = StoreRepository.UnitOfWorkSource.GetUnitOfWorkFactory().CreateUnitOfWork();
var localRunners = new List<Runners>();
var localRunners = uow.CacheMarketRunners.Where(x =>
x.SelectionId in remoteRunners.SelectionId &&
x.MarketId in remoteRunners.MarketId);
Thank you for looking
So you have an object m, which has a property MarketId. Object m also has a sequence of Runners, where every Runner has a property SelectionId.
Your database has CacheMarketRunners. Every CacheMarketRunner has a MarketId and a SelectionId.
Your query should return all CacheMarketRunners with a MarketId equal to m.MarketId and a SelectionId that is contained in the SelectionIds of the sequence m.Runners.
If your m does not have too many Runners, say fewer than 250, consider using Queryable.Contains:
var requestedSelectionIds = m.Runners.Select(runner => runner.SelectionId);

var result = CacheMarketRunners.Where(cacheMarketRunner =>
    cacheMarketRunner.MarketId == m.MarketId
    && requestedSelectionIds.Contains(cacheMarketRunner.SelectionId));
To improve performance, you need to cache the transaction results:
var marketRunners = uow.CacheMarketRunners.Where(x => x.MarketId == m.MarketId).ToList();
The transaction results for uow are stored in the list, so that there is no database transaction inside the foreach loop; hence performance should improve:
var localRunners = new List<Runners>();
foreach (var remoteRunner in m.Runners)
{
    var localRunner = marketRunners.FirstOrDefault(x => x.SelectionId == remoteRunner.SelectionId);
    localRunners.Add(localRunner);
}
You can even remove the foreach loop:
var localRunners = m.Runners
    .Select(remoteRunner => marketRunners.FirstOrDefault(x => x.SelectionId == remoteRunner.SelectionId))
    .ToList();

Issue regarding C# collection performance

My application contains two observable collections.
1
int count = collection.Count();
It takes nearly 30 milliseconds. Please tell me of any method which would take less time.
2
I am comparing collection_1 with collection_2 on a specified value, like:
var common = collection_2.FirstOrDefault(i => i.name == collection_1.name);
It takes more than 6 milliseconds, where collection_1 contains more than 350,000 records and collection_2 contains more than 100,000 records. Please tell me the best way.
my code:
foreach (var singleItem in StartWindow.omsReqRes.Where(i => i.Tag.Contains("Request")))
{
    AddFileData(singleItem);
}

public void AddFileData(LogClass singleItem)
{
    var responses = StartWindow.omsReqRes.Where(i => i.Clordid == singleItem.Clordid && i.Tag.Contains("Response"));
    foreach (var response in responses)
    {
        LogClass obj = new LogClass();
        obj.AlgoName = singleItem.AlgoName;
        obj.RealTime = singleItem.RealTime;
        obj.TimeStamp = singleItem.TimeStamp;

        var request = StartWindow.gatewayReqRes.FirstOrDefault(i => i.Tag == singleItem.Tag && i.Clordid == singleItem.Clordid);
        if (request != null)
            obj.v = request.e;
    }
}
The collection contains more than 100,000 (1 lakh) records, so searching it linearly takes a long time. I divided the collection, keyed on one value, into a dictionary of key/value pairs, and hence the searches take much less time.
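A minimal sketch of that idea, using the property names from the question's code (LogClass with Clordid and Tag):
// Build the lookup once: a single O(n) pass over the collection.
var responsesByClordid = StartWindow.omsReqRes
    .Where(i => i.Tag.Contains("Response"))
    .ToLookup(i => i.Clordid);

// Inside AddFileData, the repeated linear scan becomes a cheap keyed lookup:
var responses = responsesByClordid[singleItem.Clordid];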

Use LINQ to group a sequence by date with no gaps

I'm trying to select a subgroup of a list where items have contiguous dates, e.g.
ID StaffID Title ActivityDate
-- ------- ----------------- ------------
1 41 Meeting with John 03/06/2010
2 41 Meeting with John 08/06/2010
3 41 Meeting Continues 09/06/2010
4 41 Meeting Continues 10/06/2010
5 41 Meeting with Kay 14/06/2010
6 41 Meeting Continues 15/06/2010
I'm using a pivot point each time, so taking the example pivot item as 3, I'd like to get the following contiguous events around the pivot:
ID StaffID Title ActivityDate
-- ------- ----------------- ------------
2 41 Meeting with John 08/06/2010
3 41 Meeting Continues 09/06/2010
4 41 Meeting Continues 10/06/2010
My current implementation is a laborious "walk" into the past, then into the future, to build the list:
var activity = // item number 3: Meeting Continues (09/06/2010)
var orderedEvents = activities.OrderBy(a => a.ActivityDate).ToArray();

// Walk into the past until a gap is found
var preceedingEvents = orderedEvents.TakeWhile(a => a.ID != activity.ID);
DateTime dayBefore;
var previousEvent = activity;
while (previousEvent != null)
{
    dayBefore = previousEvent.ActivityDate.AddDays(-1).Date;
    previousEvent = preceedingEvents.TakeWhile(a => a.ID != previousEvent.ID).LastOrDefault();
    if (previousEvent != null)
    {
        if (previousEvent.ActivityDate.Date == dayBefore)
            relatedActivities.Insert(0, previousEvent);
        else
            previousEvent = null;
    }
}

// Walk into the future until a gap is found
var followingEvents = orderedEvents.SkipWhile(a => a.ID != activity.ID);
DateTime dayAfter;
var nextEvent = activity;
while (nextEvent != null)
{
    dayAfter = nextEvent.ActivityDate.AddDays(1).Date;
    nextEvent = followingEvents.SkipWhile(a => a.ID != nextEvent.ID).Skip(1).FirstOrDefault();
    if (nextEvent != null)
    {
        if (nextEvent.ActivityDate.Date == dayAfter)
            relatedActivities.Add(nextEvent);
        else
            nextEvent = null;
    }
}
The list relatedActivities should then contain the contiguous events, in order.
Is there a better way (maybe using LINQ) for this?
I had an idea of using .Aggregate() but couldn't think how to get the aggregate to break out when it finds a gap in the sequence.
Here's an implementation:
public static IEnumerable<IGrouping<int, T>> GroupByContiguous<T>(
    this IEnumerable<T> source,
    Func<T, int> keySelector
)
{
    int keyGroup = Int32.MinValue;
    int currentGroupValue = Int32.MinValue;
    return source
        .Select(t => new { obj = t, key = keySelector(t) })
        .OrderBy(x => x.key)
        .GroupBy(x =>
        {
            if (currentGroupValue + 1 < x.key)
            {
                keyGroup = x.key;
            }
            currentGroupValue = x.key;
            return keyGroup;
        }, x => x.obj);
}
You can either convert the dates to ints by means of subtraction, or easily imagine a DateTime version.
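For example, a hypothetical usage with the question's data might convert each date to a day number and then pick the group containing the pivot (my sketch, not part of the original answer):
// Day numbers relative to an arbitrary baseline turn the dates into ints.
var baseline = new DateTime(2000, 1, 1);
var relatedActivities = activities
    .GroupByContiguous(a => (int)(a.ActivityDate.Date - baseline).TotalDays)
    .First(g => g.Any(a => a.ID == activity.ID)) // the contiguous run around the pivot
    .ToList();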
In this case I think that a standard foreach loop is probably more readable than a LINQ query:
var relatedActivities = new List<TActivity>();
bool found = false;
foreach (var item in activities.OrderBy(a => a.ActivityDate))
{
    int count = relatedActivities.Count;
    if ((count > 0) && (relatedActivities[count - 1].ActivityDate.Date.AddDays(1) != item.ActivityDate.Date))
    {
        if (found)
            break;
        relatedActivities.Clear();
    }
    relatedActivities.Add(item);
    if (item.ID == activity.ID)
        found = true;
}
if (!found)
    relatedActivities.Clear();
For what it's worth, here's a roughly equivalent -- and far less readable -- LINQ query:
var relatedActivities = activities
    .OrderBy(x => x.ActivityDate)
    .Aggregate
    (
        new { List = new List<TActivity>(), Found = false, ShortCircuit = false },
        (a, x) =>
        {
            if (a.ShortCircuit)
                return a;
            int count = a.List.Count;
            if ((count > 0) && (a.List[count - 1].ActivityDate.Date.AddDays(1) != x.ActivityDate.Date))
            {
                if (a.Found)
                    return new { a.List, a.Found, ShortCircuit = true };
                a.List.Clear();
            }
            a.List.Add(x);
            return new { a.List, Found = a.Found || (x.ID == activity.ID), a.ShortCircuit };
        },
        a => a.Found ? a.List : new List<TActivity>()
    );
Somehow, I don't think LINQ was truly meant to be used for bidirectional, one-dimensional, depth-first searches, but I constructed a working LINQ query using Aggregate. For this example I'm going to use a List instead of an array. Also, I'm going to use Activity to refer to whatever class you are storing the data in; replace it with whatever is appropriate for your code.
Before we even start, we need a small helper. List<T>.Add returns void, but we want to be able to accumulate into a list and return the new list from this aggregate function. So all you need is a simple function like the following.
private List<T> ListWithAdd<T>(List<T> src, T obj)
{
    src.Add(obj);
    return src;
}
First, we get the sorted list of all activities, and then initialize the list of related activities. This initial list will contain the target activity only, to start.
List<Activity> orderedEvents = activities.OrderBy(a => a.ActivityDate).ToList();
List<Activity> relatedActivities = new List<Activity>();
relatedActivities.Add(activity);
We have to break this into two lists, the past and the future, just like you currently do it.
We'll start with the past; the construction should look mostly familiar. Then we'll aggregate all of it into relatedActivities. This uses the ListWithAdd function we wrote earlier. You could condense it into one line and skip declaring previousEvents as its own variable, but I kept it separate for this example.
var previousEvents = orderedEvents.TakeWhile(a => a.ID != activity.ID).Reverse();
relatedActivities = previousEvents
    .Aggregate<Activity, List<Activity>>(
        relatedActivities,
        (items, prevItem) => items.OrderBy(a => a.ActivityDate).First().ActivityDate
            .Subtract(prevItem.ActivityDate).Days.Equals(1)
                ? ListWithAdd(items, prevItem)
                : items)
    .ToList();
Next, we'll build the following events in a similar fashion, and likewise aggregate it.
var nextEvents = orderedEvents.SkipWhile(a => a.ID != activity.ID);
relatedActivities = nextEvents
    .Aggregate<Activity, List<Activity>>(
        relatedActivities,
        (items, nextItem) => nextItem.ActivityDate
            .Subtract(items.OrderBy(a => a.ActivityDate).Last().ActivityDate).Days.Equals(1)
                ? ListWithAdd(items, nextItem)
                : items)
    .ToList();
You can properly sort the result afterwards; at this point relatedActivities should contain all the contiguous activities with no gaps. It won't immediately break when it hits the first gap, no, but I don't think you can literally break out of a LINQ query, so instead it just ignores anything it finds past a gap.
Note that this example code only operates on the actual difference in time. Your example output seems to imply that you need some other comparison factors, but this should be enough to get you started. Just add the necessary logic to the date subtraction comparison in both entries.
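If the short-circuiting matters, one alternative (my own variation, not from the original answer) is TakeWhile with its index overload, which genuinely stops walking at the first gap in each direction:
var pivotDate = activity.ActivityDate.Date;

// The i-th preceding event must be exactly i+1 days before the pivot, else stop.
var past = orderedEvents
    .TakeWhile(a => a.ID != activity.ID)
    .Reverse()
    .TakeWhile((a, i) => a.ActivityDate.Date == pivotDate.AddDays(-(i + 1)));

// The i-th following event must be exactly i+1 days after the pivot, else stop.
var future = orderedEvents
    .SkipWhile(a => a.ID != activity.ID)
    .Skip(1)
    .TakeWhile((a, i) => a.ActivityDate.Date == pivotDate.AddDays(i + 1));

var relatedActivities = past.Reverse().Concat(new[] { activity }).Concat(future).ToList();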
