I am trying to create an application that extracts some data from an automatically generated Excel file. This would be easy with Access, but the file is in Excel and the solution needs to be a one-button sort of thing.
For some reason, simply looping through the data without performing any actions is slow. The code below is my attempt at optimizing something that was far slower. I arrived at querying the sheet with LINQ after a few attempts with the Interop classes, both directly and through different wrappers.
I have also read the answers to a few questions here and on Google. To see what is causing the slowness, I removed all instructions from the relevant section except "i++", and it is still very slow. I also tried limiting the number of records retrieved with the where clause (commented out below), but that didn't help. Your help would be appreciated.
Thank you.
Dictionary<string, double> instructors = new Dictionary<string, double>();

var t = from c in excel.Worksheet("Course_201410_M1")
        // where c["COURSE CODE"].ToString().Substring(0,4) == "COSC" || c["COURSE CODE"].ToString().Substring(0,3) == "COEN" || c["COURSE CODE"].ToString().Substring(0,3) == "GEIT" || c["COURSE CODE"].ToString().Substring(0,3) == "ITAP" || c["COURSE CODE"] == "PRPL 0012" || c["COURSE CODE"] == "ASSE 4311" || c["COURSE CODE"] == "GEEN 2312" || c["COURSE CODE"] == "ITLB 1311"
        select c;

HashSet<string> uniqueForce = new HashSet<string>();
foreach (var c in t)
{
    if (uniqueForce.Add(c["Instructor"]))
        instructors.Add(c["Instructor"], 0.0);
}
foreach (string name in instructors.Keys)
{
    var y = from d in t
            where d["Instructor"] == name
            select d;
    int i = 1;
    foreach (var z in y)
    {
        // This is the really slow part; it takes a couple of minutes to finish,
        // yet the file has fewer than 1,000 records.
        i++;
    }
}
Wrap the query that forms var t in parentheses and then call ToList() on it.
var t = (from c in excel.Worksheet("Course_201410_M1")
select c).ToList();
Due to LINQ's lazy/deferred execution model, every time you iterate over the collection it re-queries the data source unless you give it a materialized List to work with.
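Beyond caching the rows, the per-instructor inner loop in the question can be collapsed into a single in-memory pass. A minimal sketch, assuming the excel.Worksheet source from the question and that the row indexer converts cleanly with ToString():

var rows = (from c in excel.Worksheet("Course_201410_M1")
            select c).ToList();                   // read the file once

var counts = rows
    .GroupBy(c => c["Instructor"].ToString())     // one pass instead of one re-query per instructor
    .ToDictionary(g => g.Key, g => g.Count());

This replaces both the HashSet de-duplication and the per-name counting loops with a single GroupBy over the materialized list.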
Related
I am writing a small program that takes a .csv file with about 45k rows as input. I am comparing the contents of this file with the contents of a table on a database (SQL Server through Dynamics CRM using Xrm.Sdk, if it makes a difference).
My current program takes about 25 minutes to run the comparison, even when the file and the database table are exactly the same (both 45k rows with no differences). I have all existing records from the database in a DataCollection<Entity>, which inherits Collection<T> and implements IEnumerable<T>.
In my code below I filter using the Where method and then branch on the count of matches. The Where seems to be the bottleneck here. Is there a more efficient approach than this? I am by no means a LINQ expert.
foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                           && Convert.ToDecimal(r["field_2"]) == bps
                           && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                           && Convert.ToDecimal(r["field_4"]) == percentile
                           && Convert.ToDecimal(r["field_5"]) == age);

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}
EDIT: I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop.
Himbrombeere is right: you should execute the query first and put the result into a collection before you use Any, Count, AddRange, or whatever other method will execute the query again. In your code the query can be executed five times in every loop iteration.
Watch out for the term deferred execution in the documentation. If a method is implemented in that way, it means the method can be used to build up a LINQ query (you can chain it with other methods, and at the end you have a query), but only methods that don't use deferred execution, like Count, Any, and ToList (or a plain foreach), will actually execute it. If you don't want the whole query executed every time, and you have to access it multiple times, it's better to store the result in a collection (e.g. with ToList).
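To see the effect in isolation, here is a small standalone LINQ to Objects demo (the mechanics differ from a real query provider, but the multiple-enumeration pattern is the same):

using System;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        int evaluations = 0;
        var query = Enumerable.Range(1, 3)
                              .Where(n => { evaluations++; return n > 1; });

        var count = query.Count();      // enumerates the source: 3 predicate calls
        var any = query.Any();          // enumerates it again: 2 more calls
        Console.WriteLine(evaluations); // 5, not 3 -- the query ran twice

        evaluations = 0;
        var list = query.ToList();      // execute once and store the result
        var count2 = list.Count;        // no re-evaluation
        var any2 = list.Any();          // still no re-evaluation
        Console.WriteLine(evaluations); // 3
    }
}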
However, you could use a different approach that should be much more efficient: a Lookup<TKey, TElement>, which is similar to a dictionary and can be used with an anonymous type as its key:
var lookup = existingRecords.Entities.ToLookup(r => new
{
    fund = r["field_1"].ToString(),
    bps = Convert.ToDecimal(r["field_2"]),
    withdrawalPct = Convert.ToDecimal(r["field_3"]),
    percentile = Convert.ToInt32(r["field_4"]),  // int, so the key type matches the parsed CSV values below
    age = Convert.ToInt32(r["field_5"])
});
Now you can access this lookup in the loop very efficiently.
foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = lookup[new { fund, bps, withdrawalPct, percentile, age }].ToList();

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}
Note that this works even if the key does not exist: an empty sequence is returned.
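For illustration, a tiny standalone example of that behavior:

using System;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        var lookup = new[] { "a", "bb", "cc" }.ToLookup(s => s.Length);
        Console.WriteLine(lookup[2].Count()); // 2 ("bb" and "cc")
        Console.WriteLine(lookup[9].Count()); // 0 -- a missing key yields an empty sequence, not an exception
    }
}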
Add a ToList after the Convert.ToDecimal(r["field_5"]) == age) line to force immediate execution of the query:
var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
&& Convert.ToDecimal(r["field_2"]) == bps
&& Convert.ToDecimal(r["field_3"]) == withdrawalPct
&& Convert.ToDecimal(r["field_4"]) == percentile
&& Convert.ToDecimal(r["field_5"]) == age)
.ToList();
The Where doesn't actually execute your query; it only prepares it, and the actual execution happens later, in a deferred way. In your case that happens when calling Count, which itself iterates the entire collection of items. If the first condition fails, the second one is checked, leading to a second iteration of the complete collection when Count is called again. And in the single-match case you execute the query a third time when calling matchingRows.First().
By forcing immediate execution you execute the query, and thus iterate the entire collection, only once, which will decrease your overall time.
Another option, which is basically along the same lines as the other answers, is to prepare your data first, so that you're not repeatedly calling things like r["field_2"] (which are relatively slow to look up).
This is a (1) clean your data, (2) query/join your data, (3) process your data approach.
Do this:
(1)
var inputs =
    inputDataLines
        .Select(record =>
        {
            var fields = record.Split(',');
            return new
            {
                fund = fields[0],
                bps = Convert.ToDecimal(fields[1]),
                withdrawalPct = Convert.ToDecimal(fields[2]),
                percentile = Convert.ToInt32(fields[3]),
                age = Convert.ToInt32(fields[4]),
                bombOutTerm = Convert.ToDecimal(fields[5]),
                record
            };
        })
        .ToArray();
var entities =
    existingRecords
        .Entities
        .Select(entity => new
        {
            fund = entity["field_1"].ToString(),
            bps = Convert.ToDecimal(entity["field_2"]),
            withdrawalPct = Convert.ToDecimal(entity["field_3"]),
            percentile = Convert.ToInt32(entity["field_4"]),
            age = Convert.ToInt32(entity["field_5"]),
            bombOutTerm = Convert.ToDecimal(entity["field_6"]),
            entity
        })
        .ToArray()
        .GroupBy(x => new
        {
            x.fund,
            x.bps,
            x.withdrawalPct,
            x.percentile,
            x.age
        }, x => new
        {
            x.bombOutTerm,
            x.entity,
        });
(2)
var query =
    from i in inputs
    join e in entities
        on new { i.fund, i.bps, i.withdrawalPct, i.percentile, i.age } equals e.Key
        into matches
    // group join: inputs with no matching entities are kept, so the Count == 0 branch below still fires
    select new { input = i, matchingRows = matches.SelectMany(g => g).ToList() };
(3)
foreach (var x in query)
{
    entitiesFound.AddRange(x.matchingRows.Select(y => y.entity));

    if (x.matchingRows.Count == 0)
    {
        rowsToAdd.Add(x.input.record);
    }
    else if (x.matchingRows.Count == 1)
    {
        if (x.matchingRows.First().bombOutTerm != x.input.bombOutTerm)
        {
            rowsToUpdate.Add(x.input.record);
            entitiesToUpdate.Add(x.matchingRows.First().entity);
        }
    }
    else
    {
        entitiesToDelete.AddRange(x.matchingRows.Select(y => y.entity));
        rowsToAdd.Add(x.input.record);
    }
}
I would suspect that this will be among the fastest approaches presented.
I have code that loops through all Contacts and, for each one, counts the emails that have been sent to them; if they haven't opened/clicked the last X of them, the contact is returned in a list.
At the moment the code takes about 10 minutes to run. Is there anything I can do to improve this?
I know I could limit the number of contacts returned, but that's still slow.
var contactList =
    (from c in db.Contacts
     where c.Accounts_CustomerID == Account.AccountID && !c.Deleted && !c.EmailOptOut
     select c).ToList();

foreach (var person in contactList)
{
    var SentEmails =
        (from c in db.Comms_Emails_EmailsSents where c.ContactID == person.ID select c)
        .OrderBy(x => x.DateSent).Take(Last).ToList();

    if (SentEmails.Count == Last)
    {
        if (!Clicks)
        {
            if (SentEmails.Count(x => x.Opens == 0) == Last)
            {
                ReturnContacts.Add(person);
            }
        }
        else
        {
            if (SentEmails.Count(x => x.Clicks == 0) == Last)
            {
                ReturnContacts.Add(person);
            }
        }
    }
}

return ReturnContacts;
Remove the .ToList() calls and use IQueryables. With IQueryables the query executes only once, and you reduce memory use: ToList() retrieves all entities and stores them in memory, which you don't want.
Run the logic on the db: rewrite the query using joins etc. so that it returns a result set that already contains the relevant data.
What you're doing now is performing a db query for each initial query result. That can mean A LOT of queries.
If you offload that work to the RDBMS, you can always try to optimize it there (by introducing indexes, etc.).
EDIT: I rewrote the code in notepad:
foreach (var record in (from c in db.Contacts
                        join es in db.Comms_Emails_EmailsSents
                            on c.Id equals es.ContactId
                        where c.Accounts_CustomerID == Account.AccountID && !c.Deleted && !c.EmailOptOut
                        orderby c.Id, es.DateSent descending
                        select new { opens = es.Opens, clicks = es.Clicks, person = c })
                       .GroupBy(r => r.person))
{
    var mails = record.Take(Last).ToList();
    if (mails.Count == Last)
    {
        if (!Clicks)
        {
            if (mails.Count(x => x.opens == 0) == Last)
            {
                ReturnContacts.Add(record.Key);
            }
        }
        else
        {
            if (mails.Count(x => x.clicks == 0) == Last)
            {
                ReturnContacts.Add(record.Key);
            }
        }
    }
}
I don't have time at hand to mock up a db and test it. Also, this approach performs a join between contacts and emails, and if you have 100k emails per person this might be a very bad idea. You could optimize it with a ranking function, but I'd say that if performance is still bad you should start thinking about db-side optimizations, as this data structure is, at least to my non-DBA eyes, not perfectly suited to this kind of querying.
Using Entity Framework 6.0.2 and .NET 4.5.1 in Visual Studio 2013 Update 1 with a DbContext connected to SQL Server:
I have a long chain of filters I am applying to a query based on the caller's desired results. Everything was fine until I needed to add paging. Here's a glimpse:
IQueryable<ProviderWithDistance> results = (from pl in db.ProviderLocations
                                            let distance = pl.Location.Geocode.Distance(_geo)
                                            where pl.Location.Geocode.IsEmpty == false
                                            where distance <= radius * 1609.344
                                            orderby distance
                                            select new ProviderWithDistance()
                                            {
                                                Provider = pl.Provider,
                                                Distance = Math.Round((double)(distance / 1609.344), 1)
                                            }).Distinct();
if (gender != null)
{
    results = results.Where(p => p.Provider.Gender == (gender.ToUpper() == "M" ? Gender.Male : Gender.Female));
}

if (type != null)
{
    int providerType;
    if (int.TryParse(type, out providerType))
        results = results.Where(p => p.Provider.ProviderType.Id == providerType);
}

if (newpatients != null && newpatients == true)
{
    results = results.Where(p => p.Provider.ProviderLocations.Any(pl => pl.AcceptingNewPatients == null || pl.AcceptingNewPatients == AcceptingNewPatients.Yes));
}

if (string.IsNullOrEmpty(specialties) == false)
{
    List<int> _ids = specialties.Split(',').Select(int.Parse).ToList();
    results = results.Where(p => p.Provider.Specialties.Any(x => _ids.Contains(x.Id)));
}

if (string.IsNullOrEmpty(degrees) == false)
{
    List<int> _ids = degrees.Split(',').Select(int.Parse).ToList();
    results = results.Where(p => p.Provider.Degrees.Any(x => _ids.Contains(x.Id)));
}

if (string.IsNullOrEmpty(languages) == false)
{
    List<int> _ids = languages.Split(',').Select(int.Parse).ToList();
    results = results.Where(p => p.Provider.Languages.Any(x => _ids.Contains(x.Id)));
}

if (string.IsNullOrEmpty(keyword) == false)
{
    results = results.Where(p =>
        (p.Provider.FirstName + " " + p.Provider.LastName).Contains(keyword));
}
Here's the paging I added to the bottom (skip and max are just int parameters):
if (skip > 0)
results = results.Skip(skip);
results = results.Take(max);
return new ProviderWithDistanceDto { Locations = results.AsEnumerable() };
Now for my question(s):
As you can see, I am doing an orderby in the initial LINQ query, so why is it complaining that I need to do an OrderBy before doing a Skip (I thought I was?)...
I was under the assumption that the query won't be turned into SQL and executed until I enumerate the results, which is why I wait until the last line to return the results with AsEnumerable(). Is that the correct approach?
If I have to enumerate the results before doing Skip and Take, how will that affect performance? Obviously I'd like SQL Server to do the heavy lifting and return only the requested results. Or does it not matter (or have I got it wrong)?
I am doing an orderby in the initial LINQ query, so why is it complaining that I need to do an OrderBy before doing a Skip (I thought I was?)
Your result starts off correctly as an ordered queryable: the type returned from the query on the first line is IOrderedQueryable<ProviderWithDistance>, because you have an orderby clause. However, adding a Where (and likewise the Distinct) on top of it makes your query an ordinary IQueryable<ProviderWithDistance> again, causing the problem you see down the road. Logically it's the same thing, but the structure of the query definition in memory no longer implies an ordering.
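As a sketch of that point, the compile-time types make the loss of the ordering guarantee visible (hypothetical helper, any element type):

using System.Linq;

static class OrderingDemo
{
    static void Demo(IQueryable<int> source)
    {
        IOrderedQueryable<int> ordered = source.OrderBy(x => x);
        // Where returns a plain IQueryable<int>; the static type no longer
        // advertises an ordering, mirroring what happens in the query above.
        IQueryable<int> filtered = ordered.Where(x => x > 0);
    }
}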
To fix this, remove the order by in the original query, and add it right before you are ready for the paging, like this:
...
if (string.IsNullOrEmpty(languages) == false)
...
if (string.IsNullOrEmpty(keyword) == false)
...
results = results.OrderBy(r => r.Distance);
As long as ordering is the last operation before the paging, this should fix the runtime problem.
I was under the assumption that it won't be turned into a SQL query and executed until I enumerate the results, which is why I wait until the last line to return the results AsEnumerable(). Is that the correct approach?
Yes, that is the correct approach. You want your RDBMS to do as much work as possible, because doing paging in memory defeats the purpose of paging in the first place.
If I have to enumerate the results before doing Skip and Take how will that affect performance?
It would kill the performance, because your system would need to move around a lot more data than it did before you added paging.
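Putting the pieces together, the tail of the method would look something like this sketch (names taken from the question):

// Filters are applied above; order once, right before paging, so EF can
// compose a single SQL statement and the database does the skipping/taking.
results = results.OrderBy(r => r.Distance);

if (skip > 0)
    results = results.Skip(skip);

results = results.Take(max);

return new ProviderWithDistanceDto { Locations = results.AsEnumerable() };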
Can anyone help me figure this out?
The code below works fine and gets inside the if statement:
foreach (var m in msg)
{
    if (string.IsNullOrEmpty(m.PhoneNumber))
    {
        m.PhoneNumber = (from c in db.Customers
                         where c.CustomerID == m.CustomerID
                         select c.PhoneNumber).Single();
    }
}
However, in the code below PhoneNumber is never set:
foreach (var m in msg.Where(z => (z.PhoneNumber == null || z.PhoneNumber == "")))
{
    m.PhoneNumber = (from c in db.Customers
                     where c.CustomerID == m.CustomerID
                     select c.PhoneNumber).Single();
}
I'm presuming it's because the top code actually evaluates the expression whereas the bottom one doesn't. If that is the case, how can you check for null on an unevaluated LINQ query?
EDIT: Just to stop confusion, here is how msg is populated in both cases:
var msg = from m in db.Messages
          where m.StatusID == (int)MessageStatus.Submitted && m.MessageBoxTypeID == (int)MessageBoxType.Outbox
          select m;
I’m somewhat baffled by this one, but I have a wild guess. If the msg sequence is an IQueryable<T> which translates to an SQL query, then the behavior of the two snippets may vary. Suppose you have:
var msg =
from m in dataContext.MyTable
select m;
Your first snippet would cause the entire msg sequence to be enumerated, thereby issuing an unfiltered SELECT…FROM command to the database and fetching all the rows within your table.
foreach (var m in msg)
On the other hand, your second snippet applies a filter to your sequence before it is enumerated. Thus, the command issued to the database is a SELECT…FROM…WHERE.
foreach (var m in msg.Where(z => (z.PhoneNumber == null || z.PhoneNumber == "")))
There are various cases where the behavior of a filter applied in .NET would differ from its translation to Transact-SQL. For one, case-sensitivity. In your case, I assume that the mismatch is caused by entries whose PhoneNumber consists of whitespace, as these may match the empty string in SQL Server.
To test this possibility, check what happens if you change your second snippet to:
foreach (var m in msg.ToList().Where(z => (z.PhoneNumber == null || z.PhoneNumber == "")))
Edit: Your issue might be that your query is being executed again during subsequent access (when you check whether PhoneNumber was set).
If you execute:
foreach (var m in msg.Where(z => (z.PhoneNumber == null || z.PhoneNumber == "")))
{
m.PhoneNumber = …
}
bool stillHasNulls = msg.Any(z => z.PhoneNumber == null || z.PhoneNumber == "");
You will find that stillHasNulls might still evaluate to true, since your assignment to m.PhoneNumber is being lost when you re-evaluate the msg sequence (in the above case, when you execute msg.Any, which issues an EXISTS command to the database).
For your m.PhoneNumber assignments to be preserved, you need to either persist them to the database (if that’s what you want), or else make sure that you’re accessing the same sequence elements each time. One way to do this would be to pre-populate the sequence as a collection, using ToList.
msg = msg.Where(z => (z.PhoneNumber == null || z.PhoneNumber == "")).ToList();
foreach (var m in msg)
{
m.PhoneNumber = …
}
In the above code, the filter still gets issued to the database as a SELECT…FROM…WHERE, but the result is evaluated eagerly, and then stored as a list within msg. Any subsequent queries on msg would be evaluated against the pre-populated in-memory collection (which would contain any new values you assign to its elements).
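If persisting the numbers is the goal, and db is a LINQ to SQL DataContext (an assumption based on the question's code), the sketch would end with a SubmitChanges call:

var pending = msg.Where(z => z.PhoneNumber == null || z.PhoneNumber == "").ToList();
foreach (var m in pending)
{
    m.PhoneNumber = (from c in db.Customers
                     where c.CustomerID == m.CustomerID
                     select c.PhoneNumber).Single();
}
db.SubmitChanges(); // pushes the PhoneNumber assignments back to the database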
Can we use a foreach loop to build up a query on an IQueryable object?
I'd like to do something like the following:

IQueryable<Myclass> query = objDataContext.Myclass; // objDataContext is an object of a LINQ DataContext class
int[] arr1 = new int[] { 3, 4, 5 };

foreach (int i in arr1)
{
    query = query.Where(q => (q.f_id1 == i || q.f_id2 == i || q.f_id3 == i));
}

I get the wrong output, as the value of i changes on each iteration.
The problem you're facing is deferred execution. You should be able to find a lot of information on this, but basically none of the code is executed until you actually try to read data from the IQueryable (converting it to an IEnumerable, a List, or similar). This means it all happens after the foreach is finished, when i is set to its final value.
If I recall correctly, one thing you can do is initialize a new variable inside the loop, like this:
foreach (int i in arr1)
{
    int tmp = i;
    query = query.Where(q => (q.f_id1 == tmp || q.f_id2 == tmp || q.f_id3 == tmp));
}
By putting the value in a new variable that is re-created on each pass, each lambda captures its own copy, which will not change before you execute the IQueryable. (Note that since C# 5 the foreach loop variable is itself scoped per iteration, so this copy is only needed on older compilers.)
You don't need a foreach; try it like this:

query = objDataContext.Myclass.Where(q => arr1.Contains(q.f_id1) || arr1.Contains(q.f_id2) || arr1.Contains(q.f_id3));

This works because "i" is not evaluated until you actually iterate the query collection; by that time, "i" would hold its last value. Using Contains avoids the capture entirely.