new objects added during long loop

We currently have a production application that runs as a Windows service. This application often ends up in a loop that can take several hours to complete. We are using Entity Framework for .NET 4.0 for our data access.
I'm looking for confirmation that if we load new data into the system, after this loop is initialized, it will not result in items being added to the loop itself. When the loop is initialized we are looking for data "as of" that moment. Although I'm relatively certain that this will work exactly like using ADO and doing a loop on the data (the loop only cycles through data that was present at the time of initialization), I am looking for confirmation for co-workers.
Thanks in advance for your help.
Update: here's some sample code in C#. The question is the same: will the enumeration change if new items are added to the table that EF is querying?
IEnumerable<myobject> myobjects = (from o in db.theobjects where o.id == myID select o);
foreach (myobject obj in myobjects)
{
    // perform action on obj here
}

It depends on your precise implementation.
Once a query has been executed against the database, its results will not change (assuming you aren't using lazy loading). To ensure this, you can dispose of the context after retrieving the query results; this effectively "cuts the cord" between the retrieved data and the database.
Lazy loading can result in a mix of "initial" and "new" data. However, once the data has been retrieved it becomes a fixed snapshot that is not susceptible to updates.
You mention this is a long-running process, which implies that a very large amount of data may be involved. If you aren't able to fully retrieve all the data to be processed (due to memory limitations or other bottlenecks), then you likely can't ensure that you are working against the original data. The results are not fixed until the query is executed, and any updates made before query execution will appear in the results.

I think your best bet is to change the logic of your application so that, when the loop decides whether to do another iteration or exit, you take the opportunity to load the newly added items into the list. See the pseudocode below:
var repo = new Repository();
while (repo.HasMoreItemsToProcess())
{
    var entity = repo.GetNextItem();
    // process entity here
}
Let me know if this makes sense.

The easiest way to ensure that this happens, if the data itself isn't too big, is to convert the data you retrieve from the database to a List<>, e.g., something like this (pulled at random from my current project):
var sessionIds = room.Sessions.Select(s => s.SessionId).ToList();
And then iterate through the list, not through the IEnumerable<> that would otherwise be returned. Converting it to a list triggers the enumeration, and then throws all the results into memory.
If there's too much data to fit into memory, and you need to stick with an IEnumerable<>, then the answer to your question depends on various database and connection settings.
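To illustrate the difference between a deferred query and a ToList() snapshot, here is a minimal sketch. It uses an in-memory List<int> as a stand-in for the database table (an assumption for illustration only; SnapshotDemo and CountAfterInsert are made-up names). It shows only the deferred-execution part. What a real EF query sees while it is streaming still depends on the database and connection settings mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Returns (count seen by the deferred query, count held by the snapshot)
    // after one more item has been added to the source.
    public static Tuple<int, int> CountAfterInsert()
    {
        // In-memory stand-in for a database table (assumption for illustration).
        var table = new List<int> { 1, 2, 3 };

        // Deferred: nothing is evaluated until the query is enumerated.
        IEnumerable<int> deferred = table.Where(id => id > 0);

        // Snapshot: ToList() runs the query now and copies the results.
        List<int> snapshot = table.Where(id => id > 0).ToList();

        // A "new row" arrives after the loop was set up.
        table.Add(4);

        // The deferred query is evaluated here and sees the new item;
        // the snapshot still holds the original three.
        return Tuple.Create(deferred.Count(), snapshot.Count);
    }
}
```

Calling SnapshotDemo.CountAfterInsert() gives 4 for the deferred query and 3 for the snapshot.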

I'd take a snapshot of the IDs to be processed, quickly and within a transaction, then work through that list in the same way you're doing today.
In addition to accomplishing the goal of not changing the sample mid-stream, this also gives you the ability to extend your solution to track the status of each item as it's processed. For a long-running process, this can be very helpful for progress reporting, restart/retry capabilities, etc.
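A rough sketch of that idea, assuming integer IDs and a caller-supplied per-item action. ItemStatus, IdSnapshotProcessor and the process delegate are hypothetical names, and the transactional read of the IDs is left out:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ItemStatus { Pending, Done, Failed }

public static class IdSnapshotProcessor
{
    // Takes a one-time copy of the IDs and records a status per ID as each is processed.
    // 'process' is a hypothetical per-item action supplied by the caller.
    public static Dictionary<int, ItemStatus> Run(IEnumerable<int> currentIds, Action<int> process)
    {
        // ToList() fixes the set of IDs at this moment; Distinct() guards ToDictionary below.
        List<int> snapshot = currentIds.Distinct().ToList();
        Dictionary<int, ItemStatus> status =
            snapshot.ToDictionary(id => id, id => ItemStatus.Pending);

        foreach (int id in snapshot)
        {
            try
            {
                process(id);
                status[id] = ItemStatus.Done;
            }
            catch (Exception)
            {
                // Keep going; failed items can be retried later.
                status[id] = ItemStatus.Failed;
            }
        }
        return status;
    }
}
```

IDs added to the source after Run has taken its snapshot are not processed, and the returned dictionary can drive progress reporting or a retry pass over the Failed entries.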

Related

Mass Update a property on multiple records inside a dictionary (VB.NET / C#)

I have a Dictionary (of Long, Class), where Class has multiple properties (assume we have a property called Updated as Boolean).
I want to update this (Updated) property to (True) at once for let's say all Odd key records (or based on any specific rule). What is the best way to do so?
My thoughts are to use Linq to fetch those records then (for each) them, but is there any better way to do so like doing a mass update where a condition happens (like what we do in the database)?
An example of my approach is below. Appreciate it if there is a better way to do such an update...
Thanks
Dim ReturnedObjs = From Obj In Dictionary Where Obj.Key Mod 2 = 1
For Each item As KeyValuePair(Of Long, Class) In ReturnedObjs
item.Value.Updated = True
Next

First, this sounds like an obvious case for the speed rant:
https://ericlippert.com/2012/12/17/performance-rant/
Second:
The best way is to keep this in the database. You are not going to beat the speed of a DB query that uses indexes designed for quick matching by transferring the data over the network twice (once to get it, once to return it) and doubling the search load (once to get all the odd ones, once to update all the ones you just changed). My standing advice is to always keep as much work as possible on the DB side. Your client code will never be able to beat it.
Third:
If you do need to use client side processing:
Now, a lot of my answer depends on details of the implementation, how the JIT and general compiler optimisations work, etc.
foreach works on enumerators, not collections. But if you feed a collection to foreach, an enumerator is implicitly created. Enumerators have two properties:
If the collection changes, the enumerator becomes invalid. Most people learn about enumerators because they ran into this issue.
It is an extra function call and set of checks for accessing a collection, so it will be a slowdown. How much is hard to say, as the optimisations and the JIT are pretty good.
So you probably want to use a for loop instead.
If you could turn the Dictionary into a collection where the primary key is used as the index, it might be a bit faster. But that has the danger of running into a lot of "dry spells" (gaps between keys) in the data, so it depends a lot on your source data.
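For reference, here is a C# sketch of the approach from the question (filter with LINQ, then set the flag in a foreach). Record and DictionaryUpdate are made-up names, and the sketch assumes the dictionary values are class instances rather than structs:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical value type; it must be a class so the stored object itself is modified.
public class Record
{
    public bool Updated { get; set; }
}

public static class DictionaryUpdate
{
    // Sets Updated = true on every value whose key is odd and returns how many were touched.
    public static int MarkOddKeys(Dictionary<long, Record> records)
    {
        int touched = 0;
        foreach (KeyValuePair<long, Record> pair in records.Where(p => p.Key % 2 != 0))
        {
            // Changing a property of the stored object does not add, remove or
            // replace dictionary entries, so the enumerator stays valid.
            pair.Value.Updated = true;
            touched++;
        }
        return touched;
    }
}
```

Because only the referenced objects change and the dictionary itself is not modified, the "enumerator becomes invalid" problem does not arise here.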

How to iterate large query

I know how to do pagination, but it doesn't fit my requirements because the underlying query changes whenever I request the next page of results. So I am looking for a simple solution to iterate efficiently over the whole result set of a query, one item at a time. Please take a look at the example below.
var urls = db.Websites.Select(s => s.Website)
.Except(db.OldWebsites.Select(s => s.Website));
foreach (var url in urls)
{
//process items
}
I just want to know whether this solution really iterates over the whole result set efficiently. I am not sure whether it loads rows one by one or loads all the results into memory first.
Can someone verify this or suggest a better solution?

Yes, Entity Framework streams results instead of buffering them by default. Calling the AsStreaming method as below gives the warning: "Queries are now streaming by default unless a retrying ExecutionStrategy is used."
foreach (var item in db.Websites.AsStreaming()) { }
You just need to be careful that the DbContext doesn't hold references to the iterated results. Anonymous types and primitive results are not tracked anyway, so you only need to call AsNoTracking for entity results, like this:
db.Websites.AsNoTracking()

Multi Threading with LINQ to SQL

I am writing a WinForms application. I am pulling data from my database, performing some actions on that data set and then plan to save it back to the database. I am using LINQ to SQL to perform the query to the database because I am only concerned with 1 table in our database so I didn't want to implement an entire ORM for this.
I have it pulling the dataset from the DB. However, the dataset is rather large. So currently what I am trying to do is separate the dataset into 4 relatively equal sized lists (List<object>).
Then I have a separate background worker to run through each of those lists, perform the action and report its progress while doing so. I have it planned to consolidate those sections into one big list once all 4 background workers have finished processing their section.
But I keep getting an error while the background workers are processing their unique list. Do the objects maintain their tie to the DataContext for the LINQ to SQL even though they have been converted to List objects? Any ideas how to fix this? I have minimal experience with multi-threading so if I am going at this completely wrong, please tell me.
Thanks guys. If you need any code snippets or any other information just ask.
Edit: Oops. I completely forgot to give the error message. In the DataContext designer.cs it gives the error "An item with the same key has already been added." in the SendPropertyChanging function.
private List<MyObject> quarter1; // field, so the DoWork handler can see it

private void Setup(){
    quarter1 = _listFromDB.Take(5000).ToList();
    bgw1.RunWorkerAsync();
}
private void bgw1_DoWork(object sender, DoWorkEventArgs e){
    e.Result = functionToExecute(bgw1, quarter1);
}
private List<MyObject> functionToExecute(BackgroundWorker caller, List<MyObject> myList)
{
    int progress = 0;
    foreach (MyObject obj in myList)
    {
        string newString = createString();
        obj.strText = newString;
        //report progress here
        caller.ReportProgress(progress++);
    }
    return myList;
}
This same function is called by all four workers, and each is given a different list for myList depending on which worker called the function.

Because a real answer has yet to be posted, I'll give it a shot.
Given that you haven't shown any LINQ-to-SQL code (no usage of DataContext) - I'll take an educated guess that the DataContext is shared between the threads, for example:
using (MyDataContext context = new MyDataContext())
{
    // this is just some random query; ToList() has not been called on it,
    // so query execution is deferred. listFromDB = IQueryable<>
    var listFromDB = context.SomeTable.Where(st => st.Something == true);
    System.Threading.Tasks.Task.Factory.StartNew(() =>
    {
        var list1 = listFromDB.Take(5000).ToList(); // runs the SQL query
        // call some function on list1
    });
    System.Threading.Tasks.Task.Factory.StartNew(() =>
    {
        var list2 = listFromDB.Take(5000).ToList(); // runs the SQL query
        // call some function on list2
    });
}
Now the error you got - An item with the same key has already been added. - was because the DataContext object is not thread safe! A lot of stuff happens in the background - DataContext has to load objects from SQL, track their states, etc. This background work is what throws the error (because each thread is running the query, the DataContext gets accessed).
At least, this is my own personal experience, having come across the same error while sharing a DataContext between multiple threads. You only have two options in this scenario:
1) Before starting the threads, call .ToList() on the query, making listFromDB not an IQueryable<> but an actual List<>. This means that the query has already run and the threads operate on an actual List, not on the DataContext.
2) Move the DataContext definition into each thread. Because the DataContext is no longer shared, there are no more errors.
A third option would be to rewrite the scenario into something else, like you did (for example, make everything sequential on a single background thread)...
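A minimal sketch of option 1, assuming the data has already been materialized into a List<T> and that the per-item work does not touch a shared DataContext. ChunkedWork and ProcessInChunks are hypothetical names. Entities that are still tracked may still call back into the context when their properties are set, so this sketch is written against plain objects:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ChunkedWork
{
    // Splits an already-materialized list into up to 'parts' chunks and
    // processes each chunk on its own task. Result order matches input order.
    public static List<TResult> ProcessInChunks<T, TResult>(
        List<T> items, int parts, Func<T, TResult> work)
    {
        if (parts <= 0) throw new ArgumentOutOfRangeException("parts");

        int chunkSize = (items.Count + parts - 1) / parts; // ceiling division
        var tasks = new List<Task<List<TResult>>>();

        for (int i = 0; i < items.Count; i += chunkSize)
        {
            // GetRange copies the slice, so each task owns its own list.
            List<T> chunk = items.GetRange(i, Math.Min(chunkSize, items.Count - i));
            tasks.Add(Task.Factory.StartNew(() => chunk.Select(work).ToList()));
        }

        Task.WaitAll(tasks.ToArray());
        return tasks.SelectMany(t => t.Result).ToList();
    }
}
```

The important part is that the query runs exactly once, on one thread, before any task starts; the tasks then only ever see their own in-memory list.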

First of all, I don't really see why you'd need multiple worker threads at all. (Are these lists in separate databases / tables / servers? Do you really want to show 4 progress bars if you have 4 lists, or are you somehow merging these progress reports into one weird progress bar? :D)
Also, you're trying to speed up processing updates to your database, but you don't send LINQ to SQL any saves, so you're not really batching transactions; you'll just save everything at the end in one big transaction. Is that really what you're aiming for? The progress bar will just stop at 100% and then spend a lot of time on the SQL side.
Just create one background thread and process everything synchronously, but batch a save transaction every couple of rows (I'd suggest something like every 1000 rows, but you should experiment with this). It'll be fast, even with millions of rows.
If you really need this multithreaded solution:
The "another blabla with the same key has been added" error suggests that you are adding the same item to multiple "mylists", or adding the same item to the same list twice. Otherwise, how would there be any errors at all?

Using Parallel LINQ (PLINQ), you can take advantage of multiple CPU cores for processing your data. But if your application is going to run on a single-core CPU, then splitting the data into pieces won't give you a performance benefit; instead it will incur some context-switching overhead.
Hope it Helps

How to clear the DataContext cache on Linq to Sql

I'm using LINQ to SQL to query a database. I only use LINQ to read data from the DB, and I make changes to it by other means. (This cannot be changed; it is a restriction of the app that we are extending: all updates must go through its SDK.)
This is fine, but I'm hitting some cache problems. Basically, I query a row using LINQ, then I delete it through external means, and then I create a new row externally. If I query that row again using LINQ, I get the old (cached) data.
I cannot turn off object tracking because that seems to prevent the data context from auto-loading associated properties (foreign keys).
Is there any way to clear the DataContext cache?
I found a method while surfing the net, but it doesn't seem safe: http://blog.robustsoftware.co.uk/2008/11/clearing-cache-of-linq-to-sql.html
What do you think? What are my options?

If you want to refresh a specific object, then the Refresh() method may be your best bet.
Like this:
Context.Refresh(RefreshMode.OverwriteCurrentValues, objectToRefresh);
You can also pass an array of objects or an IEnumerable as the 2nd argument if you need to refresh more than one object at a time.
Update
I see what you're talking about in comments, in reflector you see this happening inside .Refresh():
object objectByKey = context.Services.GetObjectByKey(trackedObject.Type, keyValues);
if (objectByKey == null)
{
throw Error.RefreshOfDeletedObject();
}
The method you linked seems to be your best option, the DataContext class doesn't provide any other way to clear a deleted row. The disposal checks and such are inside the ClearCache() method...it's really just checking for disposal and calling ResetServices() on the CommonDataServices underneath..the only ill-effect would be clearing any pending inserts, updates or deletes that you have queued.
There is one more option, can you fire up another DataContext for whatever operation you're doing? It wouldn't have any cache to it...but that does involve some computational cost, so if the pending insert, update and deletes aren't an issue, I'd stick with the ClearCache() approach.

I wrote this code to really clear the "cached" entities by detaching them.
var entidades = Ctx.ObjectStateManager.GetObjectStateEntries(EntityState.Added | EntityState.Deleted | EntityState.Modified | EntityState.Unchanged);
foreach (var objectStateEntry in entidades)
    if (objectStateEntry.Entity != null) // relationship entries have no entity to detach
        Ctx.Detach(objectStateEntry.Entity);
where Ctx is my context (an Entity Framework ObjectContext).

You should be able to just requery the result sets that use these objects. This would not pull a cached set, but would actually return the final results. I know that this may not be easy or feasible depending on how you set up your app...
HTH.

Is it suggestable to use generics for large amount of data?

Let's say I have thousands of Customer records and I have to show them on a web form. I also have a CustomerEntity class with 10 properties. When I fetch the data using a DataReader and convert it into a List<CustomerEntity>, I am required to loop through the data two times.
So is the use of generics advisable in such a scenario? If yes, what will my application's performance be?
For example:
In the CustomerEntity class I have CustomerId and CustomerName properties, and I'm getting 100 records from the Customer table.
To prepare the list I've written the following code:
while (dr.Read())
{
    // creation of new object of customerEntity
    // code for getting properties of CustomerEntity
    for (var index = 0; index < MyProperties.Count; index++)
    {
        // assumes MyProperties holds PropertyInfo objects in column order
        MyProperties[index].SetValue(CustEntityObject, dr.GetValue(index), null);
    }
    //adding CustEntity object to List<CustomerEntity>
}
How can I avoid these two loops? Is there any other mechanism?

I'm not really sure how generics ties into data-volume; they are unrelated concepts... it also isn't clear to me why this requires you to read everything twice. But yes: generics are fine when used in volume (why wouldn't they be?). But of course, the best way to find a problem is profiling (either server performance or bandwidth - perhaps more the latter in this case).
Of course the better approach is: don't show thousands of records on a web form; what is the user going to do with that? Use paging, searching, filtering, ajax, etc - every trick imaginable - but don't send thousands of records to the client.
Re the updated question: the loop for setting properties isn't necessarily bad. This is an entirely appropriate inner loop. Before doing anything, profile to see if this is actually a problem. I suspect that sheer bandwidth (between server and client, or server and database) is the biggest issue. If you can prove that this loop is a problem, there are things you can do to optimise:
switch to using PropertyDescriptor (rather than PropertyInfo), and use HyperDescriptor to make it a lot faster
write code with DynamicMethod to do the job - requires some understanding of IL, but very fast
write a .NET 3.5 / LINQ Expression to do the same and use .Compile() - like the second point, but (IMO) a bit easier
I can add examples for the first and third bullets; I don't really want to write an example for the second, simply because I wouldn't write that code myself that way any more (I'd use the 3rd option where available, else the 1st).
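As a rough sketch of the third bullet (a compiled LINQ Expression), something along these lines should work. SetterFactory is a made-up name and the CustomerEntity shape is assumed from the question; treat it as an illustration rather than a finished mapper:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Assumed shape, based on the two properties named in the question.
public class CustomerEntity
{
    public int CustomerId { get; set; }
    public string CustomerName { get; set; }
}

public static class SetterFactory
{
    // Builds a delegate equivalent to: (target, value) => target.Prop = (PropType)value
    // Build it once per property and reuse it for every row.
    public static Action<T, object> Create<T>(string propertyName)
    {
        PropertyInfo property = typeof(T).GetProperty(propertyName);
        MethodInfo setter = property == null ? null : property.GetSetMethod();
        if (setter == null)
            throw new ArgumentException("No public setter for " + propertyName, "propertyName");

        ParameterExpression target = Expression.Parameter(typeof(T), "target");
        ParameterExpression value = Expression.Parameter(typeof(object), "value");

        // Convert unboxes or casts the incoming object to the property type.
        // DBNull.Value from a DataReader would need handling before calling the setter.
        MethodCallExpression body = Expression.Call(
            target, setter, Expression.Convert(value, property.PropertyType));

        return Expression.Lambda<Action<T, object>>(body, target, value).Compile();
    }
}
```

The cost of reflection and compilation is paid once when the delegate is created; inside the reader loop, each call is then close to a direct property assignment.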

It is very difficult to say what the performance will be, but consider these things:
Generics provide type safety.
If you're going to display 10,000 records in the page, your application will probably be unusable. If records are being paged, consider returning only those records that are actually needed for the page you are on.
You shouldn't need to loop through the data twice. What are you doing with the data?
