Building a simple multithreaded newsletter engine using Entity Framework - C#

I understand the concepts around multithreading and using Thread Pools. One concept I am trying to figure out is how to keep track of what emails have been sent to on each thread. So imagine, each thread is responsible for pulling x number of records, iterating through those emails, applying an email template, then saving the email to a pick up directory. Obviously, I need a way to tell each thread not to pull the same data as another thread.
One solution I was considering was to page the data, keep a global variable or array to track the pages already sent, and have each thread examine that variable and start from the next available page. The only issue I can think of is that if the data changes, the available pages might get out of sync.
Another solution is to set a boolean value in the database to indicate whether an account has been emailed to. EF would pull X records and update those records as claimed for emailing, so each subsequent query would only look for emails that have not yet been claimed.
I wanted to get some other suggestions, if possible, or expand on the solutions I provided.

Given that you may one day want to scale to more than one app server, in-memory synchronization might also not be sufficient to guarantee that emails are not duplicated.
One of the simplest ways to solve this is to implement the batch processing mechanism at the database level.
Under a Unit of Work:
Read the next N records, with pessimistic locking (i.e. preventing other threads from concurrently pulling the same emails)
Stamp these records with a batch id (or an IsProcessed indicator)
Return the records to your app
e.g. a batching PROC in SQL Server might look something like this (assuming a table dbo.Emails with a PK EmailId and a BIT processed indicator IsProcessed):
CREATE PROC dbo.GetNextBatchOfEmails
AS
BEGIN
-- Identify the next N emails to be batched. UPDLOCK is to prevent another thread batching same emails
SELECT top 100 EmailId
INTO #tmpBatch
FROM dbo.Emails WITH (UPDLOCK)
WHERE IsProcessed = 0
-- Stamp emails as sent. Assumed that PROC is called under a UOW. The batch IS the UOW
UPDATE e
SET e.IsProcessed = 1
FROM dbo.Emails e
INNER JOIN #tmpBatch t
on e.EmailId = t.EmailId
-- Return the batch of emails to caller
SELECT e.*
FROM dbo.Emails e
INNER JOIN #tmpBatch t
on e.EmailId = t.EmailId
END
Then expose the PROC as an EF Function Import mapped to your Email Entity. Under a TransactionScope ts, you can then call the EF Function Import, and send emails, and call ts.Complete() on success.
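A rough sketch of the calling code on each worker thread might look like this (NewsletterContext, GetNextBatchOfEmails, ApplyTemplate and SaveToPickupDirectory are placeholder names for your own context, Function Import and helpers):
using (var ts = new TransactionScope())
using (var ctx = new NewsletterContext())
{
    // Runs the batching PROC; the UPDLOCK inside it keeps other threads from
    // claiming the same rows while this transaction is open.
    var batch = ctx.GetNextBatchOfEmails().ToList();

    foreach (var email in batch)
    {
        SaveToPickupDirectory(ApplyTemplate(email));
    }

    // Committing the scope also commits the IsProcessed = 1 stamp.
    ts.Complete();
}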

In addition to nonnb's method, you can accomplish it all in one statement, if you are using SQL Server 2005+.
;WITH q AS
(
SELECT TOP 10 *
FROM dbo.your_queue_table
WHERE
IsProcessing = 0
--you can obviously include more filtering criteria to meet your needs
)
UPDATE q WITH (ROWLOCK, READPAST)
SET IsProcessing = 1
OUTPUT INSERTED.*
There is also some great information located here about using database tables as queues.
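From C#, each worker can run that dequeue statement directly through ADO.NET. A rough sketch (the connection string and row mapping are yours to fill in; note I've attached the locking hints to the base table inside the CTE, which is the placement I'm most confident about):
const string dequeueSql = @"
;WITH q AS
(
    SELECT TOP 10 *
    FROM dbo.your_queue_table WITH (ROWLOCK, READPAST)
    WHERE IsProcessing = 0
)
UPDATE q SET IsProcessing = 1
OUTPUT INSERTED.*;";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(dequeueSql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // map each claimed row to your email entity and hand it to a worker
        }
    }
}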

Related

"Wait Operation Timed Out" When Reading Records

I have some code that reads results into a List from a SqlDataReader, parsing them into domain objects as it goes using reader.GetXXX(int ordinal) methods inside a while reader.Read() loop.
This code generally works fine over very large datasets and across a wide range of queries and tables, but one query hangs midway through reading this list of results and eventually times out with a "Wait Operation Timed Out" error.
I've repeated this a lot of times, and it always hangs on the same record (roughly 336k records into a 337k record set).
If I pause execution while it's hanging, I can see that it is midway through parsing a record, and it is hanging on a reader.GetXXX call.
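The read loop is shaped roughly like this (simplified; Trade, the ordinals and the omitted parameters are stand-ins for my real mapping code):
var results = new List<Trade>();
using (var cmd = new SqlCommand("trading.Trade_SelectLegalEntityPositionChunked", connection)
                     { CommandType = CommandType.StoredProcedure })
{
    // (table-valued parameter and paging parameters omitted here)
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            results.Add(new Trade
            {
                TradeId = reader.GetInt64(0),    // it hangs inside one of these GetXXX calls
                TradeType = reader.GetString(1)
            });
        }
    }
}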
I have tried the following:
Executing the proc manually in SSMS (works fine)
Chunking the calls to the database such that it reads 250k records in a chunk and then requeries to get the rest (it still hung on the same record, but now the record was in the second batch)
Checking the ID for each record before parsing, and skipping the one that it hangs on (the record after that is parsed but it hangs on the next record).
Updating stats (parsing gets about 3000 records further before hanging after this)
I'm tempted to blame the database, but given the query runs without a hitch in SSMS, I'm not convinced. Any help much appreciated!
Other stuff that may help:
the proc takes a table valued parameter and joins onto that along with another database table to get its results
this was originally spotted on a VM running a long way from the client machine, but I have subsequently reproduced it by restoring and connecting to a machine that is under my desk.
Edit: as requested, the query is:
CREATE PROCEDURE [trading].[Trade_SelectLegalEntityPositionChunked]
    @PartyPositions trading.PartyPositionType READONLY,
    @Offset INT,
    @ChunkSize INT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @OffsetLocal INT = @Offset;
    DECLARE @ChunkSizeLocal INT = @ChunkSize;
    SELECT /* stuff */
    FROM [trading].[Trade]
    JOIN [refdata].[Book] ON [refdata].[Book].[BookId] = [trading].[Trade].[BookId]
    JOIN @PartyPositions pos
        ON pos.[PartyId] = [refdata].[Book].[LegalEntityId]
        AND [trading].[Trade].[TradeType] = pos.[TradeType]
        AND [trading].[Trade].[InstrumentId] = pos.[InstrumentId]
    ORDER BY [trading].[Trade].[TradeId]
    OFFSET @OffsetLocal ROWS FETCH NEXT @ChunkSizeLocal ROWS ONLY;
END
Edit: I've also checked:
Other threads - there are none that should be getting in the way: the only threads running are the one running the query and the usual supporting threads such as the main thread and message loop (I'm testing from a CUI).
Edit (bored yet?):
If I reduce the scope of the query a bit, I can see it blocking while parsing a record for about 4 minutes. It then carries on reading until it hangs again.
I still cannot see anything much going on in the client - I now have the GcNotification class from CLR via C# running and there isn't any GC going on. This points to the database, as @Crowcoder says, but the fact it runs fine in SSMS and that it pauses midway through reading a record means I am loath to blame it. This is probably down to my lack of knowledge about databases though!

Suggestions for queuing up updates to be emailed out in an asp.net-mvc application?

I have an asp.net-mvc application (SQL Server backend) and it's a basic CRUD web app that tracks purchase orders. One of the features is the ability to subscribe to an order so if anyone makes any changes, you get an email notification. I am triggering the email notification on the form post, after the update has been committed to the database.
The issue is that one day there were a ton of updates on a single order and a person got 24 email updates in one day. The user requested that he just get 1 email a day with a summary of all changes. This seems like a reasonable request and I have seen it in other cases (single daily or weekly bulk notifications), but I am trying to figure out the best way to architect this. I know I need to persist a user setting for how they want to receive updates, but after that I could:
Have a temporary update table, or
Run a query over all orders once a day and see if there are any changes on orders for any users who have the "once a day" option, or
Something else?
Since I have seen this in other places, I thought there might be a recommended pattern or suggestions to support this.
There are three things that need to be solved here:
storing a bunch of possible messages
aggregating the messages before sending
sending the messages
Storing the messages is simple: you create a DB table holding the individual messages along with the recipient's ID.
Then, how you aggregate the messages is entirely up to you; you might want to allow users to specify whether they want individual messages, daily digests, or weekly digests. When you've combined the messages into a single e-mail, you can send it and remove the messages used in the aggregation. (You might want to just soft delete them, or move them to a log table.)
As for the technology used for the aggregation and sending, you'd need an asynchronous job to process your message queue and send the e-mails.
If you don't want to use any additional library, you could set up an action method to do the job, and just call the action method regularly from a scheduled Windows job or any site-alive pinger service. You might want to secure this action method, or put some logic in place to process the queue only a couple of times a day.
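A rough sketch of such an action (the names here are made up; the digest logic itself would live in your own class):
public class NotificationsController : Controller
{
    // Hit by the scheduled task / pinger, e.g. GET /Notifications/SendDigests?key=...
    public ActionResult SendDigests(string key)
    {
        // crude shared-secret check so outsiders can't trigger the job
        if (key != ConfigurationManager.AppSettings["DigestJobKey"])
            return new HttpStatusCodeResult(403);

        // aggregate pending rows per user, send one e-mail each, then delete/soft-delete them
        new DigestMailer().AggregateAndSendPending();

        return new HttpStatusCodeResult(200);
    }
}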
If third party libraries are not a problem, then Scott Hanselman collected a list of possible options a while back, see here. As others mentioned, Hangfire.io is a popular option.
I would use dates to do the trick:
Any order has a "last updated" datetime.
Every user has a "last run" datetime and a frequency.
Then you can run a process once every ## minutes, take all users that need to be notified according to their preference, and find all orders with a last-updated date greater than the user's last-run value.
The key is that you will need some background job processing component in your app to schedule the work and monitor it running. I use hangfire.io for the job, with excellent results.
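A sketch of wiring that up with Hangfire (NotificationJob and the job id are placeholders for your own class):
// runs the "who needs to be notified since their last run" pass on a schedule
RecurringJob.AddOrUpdate<NotificationJob>(
    "order-update-notifications",
    job => job.Run(),
    Cron.Hourly());   // or Cron.Daily(), depending on the smallest frequency you support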
Code below was written in Notepad; I will check it tomorrow.
Update: I have checked the code at the bottom and it works fine in SQL Server.
The quickest and easiest way to do what's required is to add a last_update_date field to the orders table and then:
This piece of code returns all records updated in some period of time, along with the users subscribed to each order:
select o.*, u.* from orders o
inner join user_subscribtions us on us.order_id = o.order_id
inner join users_intervals ui on ui.user_id = us.user_id and ui.interval = @interval
inner join users u on u.user_id = us.user_id
where o.last_update_date >= DATEADD(HOUR, @interval, GETDATE());
If you have a server and can create a service:
Run 2 jobs that will execute the same procedure with different time-interval parameters. The first job will execute the procedure every 24 hours with parameter -24 (hours);
the second job will execute the procedure once a week with parameter -168 (hours = one week). The procedure gets all orders updated in that period of time
and inserts their data into the emailing table.
create procedure getAllUpdatedOrders @interval int as
begin
    insert into emails_to_send
    select o.*, u.* from orders o
    inner join user_subscribtions us on us.order_id = o.order_id
    inner join users_intervals ui on ui.user_id = us.user_id and ui.interval = @interval
    inner join users u on u.user_id = us.user_id
    where o.last_update_date >= DATEADD(HOUR, @interval, GETDATE());
end
GO
Then you need a service that will periodically check the emails_to_send table and send all emails from that table (see the sketch after the example below).
exec getAllUpdatedOrders -24;  -- all orders updated in the last 24 hours + users subscribed to those orders with daily mail
exec getAllUpdatedOrders -168; -- all orders updated within the last week + users subscribed to those orders with weekly mail
If you need to show all updates of an order for the last 24 hours or a week, you will need to store those updates in a separate table for at least one week and join it to the above query.
Example of getting subscribed users and order updated in some period:
--drop TABLE orders;
--drop TABLE user_subscribtions;
--drop TABLE users_intervals;
CREATE TABLE orders(order_id int, last_update_date datetime);
CREATE TABLE user_subscribtions(id int, order_id int, user_id int);
CREATE TABLE users_intervals(id int, interval int, user_id int);
insert into orders values(1, '2015-02-15');
insert into orders values(2, '2015-02-02');
insert into user_subscribtions values(1, 1, 10);
insert into user_subscribtions values(1, 1, 11);
insert into users_intervals values(1, -24, 10);
insert into users_intervals values(1, -168, 11);
select o.*, us.* from orders o
inner join user_subscribtions us on us.order_id = o.order_id
inner join users_intervals ui on ui.user_id = us.user_id and ui.interval = -24
where o.last_update_date >= '2015-02-15';
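On the application side, the service that drains emails_to_send could look roughly like this (a sketch only; the column names and SMTP details are assumptions, since the proc above just inserts o.* and u.*):
using (var conn = new SqlConnection(connectionString))
using (var smtp = new SmtpClient("your-smtp-host"))
{
    conn.Open();
    var pending = new List<MailMessage>();

    using (var cmd = new SqlCommand("SELECT user_email, subject, body FROM emails_to_send", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            pending.Add(new MailMessage("orders@example.com", reader.GetString(0),
                                        reader.GetString(1), reader.GetString(2)));
    }

    foreach (var message in pending)
        smtp.Send(message);

    // in production delete only the rows just read (e.g. by id), not the whole table
    using (var cmd = new SqlCommand("DELETE FROM emails_to_send", conn))
        cmd.ExecuteNonQuery();
}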
I would consider storing all email messages in a separate table in the case of a digest-type update system, and creating a trigger -- either a Windows service that runs once a day or a scheduled task that runs a console app -- to invoke a function that compiles the messages for one person into one email and handles the sending.
You could even go further and split the types of updates into priority; some types result in an immediate message, others get queued for the daily digest.
Additionally, you could let users decide whether they want a digest-type update or single messages if you keep track of preferences. Choice is always a good thing to have.
Keeping track of when the last message was sent should be easy enough, but you should keep in mind that some things the customer will want to know regardless of when the last message was sent. Let them decide with options.
You'll definitely need some form of background processing, tons of solutions out there, however without knowing where and how this site is hosted I'll wait to offer suggestions on that front.
As for the notifications themselves: it sounds like a super simple request, however what kind of time frame does the user expect from their notifications? Do they use them to go check on orders, or is it just data to them? The nice thing about instant notifications is you know what's happening in real time, which the user may enjoy. However in the high-volume situation they would have preferred one email with them all lumped together, to save spam. However, when you lump notifications like that (especially on a daily level) you're disconnecting yourself (up to 24 hours) from the thing you're being notified about. It's very difficult to have both, since the app would somehow need to know "I'm about to get a bunch of orders, so I should wait to send these" or "This is just one order and I won't have another for some time, so I can send it now", which I bet is what the user would prefer, but is very difficult to architect (precognition is hard).
I think you might want to ask the user exactly what it is they expect from their notifications. If they want these notifications to alert them of this information, and to potentially trigger something on their end, a daily summary might not be ideal for them (outside of that high volume situation).
I'm going to continue with the pretense that the user would prefer a daily summary (since that is the situation where you actually have to change something). I would architect a system where you have a core "notification" table where you store a record per event that will trigger a notification, then a second many-to-many table with UserIds and NotificationIds and a flag stating whether the notification was sent or not. Then you simply need an app off to the side somewhere that, at set intervals (daily, hourly, etc.), checks the many-to-many table for emails it needs to send out, groups them, and sends them out.
Notifications
NotificationId
... (other columns relevant to the notification)
Users
UserId
...
UserNotifications
UserNotificationId (some form of primary key)
UserId (FK)
NotificationId (FK)
Sent (bit)
You could also, with the help of some DateTime columns on those tables and user preferences, set up a system where the application waits, say, 15 minutes to send notifications, and if another notification happens in that 15-minute span it resets the cool-down, until 15 minutes pass without a new event; it then groups all the notifications and sends them. This is definitely more complicated, but may be a "best of both worlds" for the user, where high-volume situations are lumped conveniently and other notifications are only delayed by 15 minutes (or whatever the cool-down may be).
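A sketch of that cool-down pass over the tables above (db is an assumed EF context; CreatedUtc and SendGroupedEmail are placeholders):
var cutoff = DateTime.UtcNow - TimeSpan.FromMinutes(15);

// pull the unsent notification links, then group per user in memory for the sketch
var pending = db.UserNotifications
    .Where(un => !un.Sent)
    .Include(un => un.Notification)     // Include comes from System.Data.Entity
    .ToList();

var due = pending
    .GroupBy(un => un.UserId)
    .Where(g => g.Max(un => un.Notification.CreatedUtc) <= cutoff);   // no new event inside the window

foreach (var group in due)
{
    SendGroupedEmail(group.Key, group.Select(un => un.Notification).ToList());   // your mailer
    foreach (var un in group)
        un.Sent = true;
}
db.SaveChanges();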
Notifications are one of those things that are simple on the surface, but get blown out once you dig into what the user wants out of them, hence why there isn't a one size fits all shoe.
Here are two solutions that I used in one of my projects.
Use a trigger and send email via SQL Mail
Use a Windows service, run it once a day, and send all your summary information in just one email.
I don't think we are discussing actual logic here; instead, we are discussing approaches. I see two main options:
Calculate and create digest-type emails from the normalized data every time you trigger an email notification (using a last-updated value).
De-normalize the data, prepare the summarized content beforehand, and send it at the scheduled time.
If a person has opted in to all email notifications then you can send an email immediately; otherwise you can set one time in the day and use either Quartz or Hangfire, as suggested by @tede24, to send these notifications asynchronously. I personally prefer Hangfire and use it.

SQL Server - Best practice to circumvent large IN (...) clause (>40000 items)

I'm developing an ASP.NET app that analyzes Excel files uploaded by users. The files contain various data about customers (one row = one customer); the key field is CustomerCode. Basically the data comes in the form of a DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from CustomerCode column: 'Customer1','Customer2',...'CustomerN'.
Pass this string to SQL query IN (...) clause and execute it.
This was working okay until I ran into a "The query processor ran out of internal resources and could not produce a query plan" exception when trying to pass ~40,000 items inside the IN (...) clause.
The trivial way seems to be:
Replace IN (...) with = 'SomeCustomerCode' in the query template.
Execute this query 40,000 times, once per CustomerCode.
Do DataTable.Merge 40,000 times.
Is there any better way to work this problem around?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, though: you could use SqlBulkCopy to throw the rows into a staging table, and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on the row to distinguish between concurrent uses (and: don't forget to remove your session's data afterwards).
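A sketch of both routes follows. For the TVP, the type and proc names (dbo.CustomerCodeList, dbo.GetCustomersByCode) are assumptions; you would create them on the server first, e.g. CREATE TYPE dbo.CustomerCodeList AS TABLE (CustomerCode nvarchar(50) PRIMARY KEY):
var codes = new DataTable();
codes.Columns.Add("CustomerCode", typeof(string));
foreach (DataRow row in uploadedData.Rows)
    codes.Rows.Add(row["CustomerCode"]);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetCustomersByCode", conn) { CommandType = CommandType.StoredProcedure })
{
    var p = cmd.Parameters.Add("@Codes", SqlDbType.Structured);
    p.TypeName = "dbo.CustomerCodeList";
    p.Value = codes;

    conn.Open();
    var fromDb = new DataTable();
    new SqlDataAdapter(cmd).Fill(fromDb);   // customers matching the uploaded codes
}
And the SqlBulkCopy route (the staging table name is an assumption; add a session id column if you have concurrent uploads):
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.CustomerCodeStaging";
    // map columns explicitly if the DataTable and staging table column names differ
    bulk.ColumnMappings.Add("CustomerCode", "CustomerCode");
    bulk.WriteToServer(uploadedData);   // DataTable built from the Excel upload
}
// ...then JOIN dbo.CustomerCodeStaging in your query, and delete this session's rows afterwards.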
You shouldn't process too many records at once, both because of the error you mentioned and because such a big batch takes too long to run and leaves nothing to do in parallel. You shouldn't process only one record at a time either, because then the overhead of the SQL Server communication will be too big. Choose something in the middle and process, e.g., 10,000 records at a time. You can even parallelize the processing: start running the SQL for the next 10,000 in the background while you are processing the previous batch.
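A sketch of that middle ground (the batch size and LoadCustomersFromSql are placeholders for your own query code, e.g. the TVP approach above):
const int batchSize = 10000;
for (int i = 0; i < customerCodes.Count; i += batchSize)
{
    var chunk = customerCodes.Skip(i).Take(batchSize).ToList();

    DataTable fromDb = LoadCustomersFromSql(chunk);   // query SQL for just this chunk

    uploadedData.Merge(fromDb);                       // compare/merge against the upload
}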

Concurrent reading and updating in a database table

I have an Oracle database that I access using Devart and Entity Framework.
There's a table called IMPORTJOBS with a column STATUS.
I also have multiple processes running at the same time. They each read the first row in IMPORTJOBS that has status 'REGISTERED', set it to status 'EXECUTING', and when done set it to status 'EXECUTED'.
Now because these processes are running in parallel, I believe the following could happen:
process A reads row 10 which has status REGISTERED,
process B also reads row 10 which has still status REGISTERED,
process A updates row 10 to status EXECUTING.
Process B should not be able to read row 10 as process A already read it and is going to update its status.
How should I solve this? Put read and update in a transaction? Or should I use some versioning approach or something else?
Thanks!
EDIT: thanks to the accepted answer I got it working and documented it here: http://ludwigstuyck.wordpress.com/2013/02/28/concurrent-reading-and-writing-in-an-oracle-database.
You should use the built-in locking mechanisms of the database. Don't reinvent the wheel, especially since RDBMS are designed to deal with concurrency and consistency.
In Oracle 11g, I suggest you use the SKIP LOCKED feature. For example, each process could call a function like this (assuming ids are numbers):
CREATE OR REPLACE TYPE tab_number IS TABLE OF NUMBER;
CREATE OR REPLACE FUNCTION reserve_jobs RETURN tab_number IS
CURSOR c IS
SELECT id FROM IMPORTJOBS WHERE STATUS = 'REGISTERED'
FOR UPDATE SKIP LOCKED;
l_result tab_number := tab_number();
l_id number;
BEGIN
OPEN c;
FOR i IN 1..10 LOOP
FETCH c INTO l_id;
EXIT WHEN c%NOTFOUND;
l_result.extend;
l_result(l_result.size) := l_id;
END LOOP;
CLOSE c;
RETURN l_result;
END;
This will return 10 rows (if possible) that are not locked. These rows will be locked and the sessions will not block each other.
In 10g and before, since Oracle returns consistent results, use FOR UPDATE wisely and you should not have the problem that you describe. For instance, consider the following SELECT:
SELECT *
FROM IMPORTJOBS
WHERE STATUS = 'REGISTERED'
AND rownum <= 10
FOR UPDATE;
What would happen if all processes reserve their rows with this SELECT? How will that affect your scenario:
Session A gets 10 rows that are not processed.
Session B would get the same 10 rows, is blocked and waits for session A.
Session A updates the selected rows' statuses and commits its transaction.
Oracle will now (automatically) rerun Session B's select from the beginning since the data has been modified and we have specified FOR UPDATE (this clause forces Oracle to get the last version of the block).
This means that session B will get 10 new rows.
So in this scenario, you have no consistency problem. Also, assuming that the transaction to request a row and change its status is fast, the concurrency impact will be light.
Each process can issue a SELECT ... FOR UPDATE to lock the row when they read it. In this scenario, process A will read and lock the row, process B will attempt to read the row and block until process A releases the lock by committing (or rolling back) its transaction. Oracle will then determine whether the row still meets B's criteria and, in your example, won't return the row to B. This works but it means that your multi-threaded process may now be effectively single-threaded depending on how your transaction control needs to work.
Possible ways to improve scalability
A relatively common approach on the consumer to resolving this is to have a single coordinator thread that reads the data from the table, parcels out work to different threads, and updates the table appropriately (including knowing how to re-assign a job if the thread that was assigned it has died).
If you are using Oracle 11.1 or later, you can use the SKIP LOCKED clause on your FOR UPDATE so that each session gets back the first row that meets their criteria and is not locked (the clause existed in earlier versions but was not documented so it may not work correctly).
Rather than using a table for ImportJobs, you can use a queue with multiple consumers. This will allow Oracle to distribute messages to each process without you needing to build any additional locking (Oracle queues are doing it all behind the scenes).
Use versioning and optimistic concurrency.
The IMPORTJOBS table should have a timestamp column that you mark with ConcurrencyMode = Fixed in your model. Now when EF tries to do an update, the timestamp column is incorporated in the UPDATE statement's WHERE clause: WHERE timestamp = xxxxx.
For B, the timestamp changed in the meantime, so a concurrency exception is raised, which, in this case, you handle by skipping the update.
I'm from a SQL server background and I don't know the Oracle equivalent of timestamp (or rowversion), but the idea is that it's a field that auto-updates when an update is made to a record.
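A sketch of what the "skip on conflict" handling might look like on the EF side (DbContext shown; with an ObjectContext/EDMX model the exception is OptimisticConcurrencyException instead):
var job = context.ImportJobs.FirstOrDefault(j => j.Status == "REGISTERED");
if (job != null)
{
    job.Status = "EXECUTING";
    try
    {
        context.SaveChanges();   // UPDATE ... WHERE Id = ... AND timestamp = <original value>
        // we won the row: go ahead and process it
    }
    catch (DbUpdateConcurrencyException)
    {
        // another process claimed the row first: skip it and pick the next one
    }
}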

How do I lock a table from write operations until my insert is complete with Linq-to-sql

I am working on an auction system and one of the issues I am trying to make sure I don't get affected by is a situation where 2 people put in a bid at the exact same time for the same item.
To do this I need to put a lock on the table, get the highest bid for the current item, make sure the entered bid is greater than that bid, add a new bid entry into the table, then unlock the table.
I need to lock this so a second webserver does not trigger a bid insert between when I check for the highest bid and when I insert my new bid into the table, as this would cause data issues.
How do I accomplish this with Linq-to-sql?
Note: I don't know if TransactionScopes can do this, but I can't use them anyway, as they tend to trigger a distributed transaction due to our web-farm setup, and I can't use distributed transactions.
There seem to be a couple of obstacles to implementing a solution in pure Linq:
You should definitely avoid a table lock: it would make it impossible for several items to be bid on during the processing of one single bid, severely harming performance.
Linq to SQL does not seem to support pessimistic locking, as stated in other answers on SO.
If you cannot have transactions in your code, I suggest the following procedure:
generate a GUID for your operation
pseudo-lock the item's record using the guid:
UPDATE Items SET LockingGuid = @guid
WHERE ItemId = @ItemId AND LockingGuid IS NULL
SELECT @recordsaffected = @@ROWCOUNT
the lock succeeded if @@ROWCOUNT = 1
perform your bidding operation
UPDATE the record back to LockingGuid = NULL
if the lock fails, either raise the failure to the .Net client, or busy-wait using WAITFOR.
You should implement proper exception handling so that item records do not get locked indefinitely by a dying or failing process, probably by adding a datetime column storing the timestamp the lock occurred, and cleaning up orphaned locks.
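A rough sketch of that pseudo-lock from Linq to SQL (db is your DataContext; PlaceBidIfHighest stands in for the actual bid logic):
var lockId = Guid.NewGuid();

// try to take the lock; ExecuteCommand returns the number of affected rows
int locked = db.ExecuteCommand(
    "UPDATE Items SET LockingGuid = {0} WHERE ItemId = {1} AND LockingGuid IS NULL",
    lockId, itemId);

if (locked == 1)
{
    try
    {
        // read the highest bid, validate the new bid, insert it
        PlaceBidIfHighest(db, itemId, bidAmount);
    }
    finally
    {
        // always release the lock, even if the bid logic throws
        db.ExecuteCommand(
            "UPDATE Items SET LockingGuid = NULL WHERE ItemId = {0} AND LockingGuid = {1}",
            itemId, lockId);
    }
}
else
{
    // lock not acquired: report the failure to the caller or retry after a short wait
}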
If your architecture allows for separate backend operation, you might want to have a look at CQRS and Event Sourcing for processing such bidding operations.
You could use a separate table to store information when this processing occurs. For example, your second table could be something like:
Table name:
ItemProcessing
Columns:
ItemId (int)
ProcessingToken (guid)
When a process wants to check on a current bid, it writes the ID of the item and a token/guid to the ItemProcessing table. That tells other processes that this item is currently being inspected. If there is already a row in the ItemProcessing table for this item, the other process must wait or abort. When the original process is done, it removes the token (sets it to null), or removes the row from ItemProcessing altogether. Then other processes know they can process that item.
Of course, you'll need a way to make sure both processes don't write to this processing table at the same time. You could accomplish that with an insert/update that only succeeds when no ProcessingToken exists yet. If another process just beat this one to it, the second process won't be able to claim the item because the ProcessingToken will already exist.
While not a full solution in detail, that's the basic idea.
You can manually begin a transaction and pass that transaction to the DataContext.
http://geekswithblogs.net/robp/archive/2009/04/02/your-own-transactions-with-linq-to-sql.aspx
I think it is necessary as well to manually control the opening and closing of the Connection to avoid an unwanted escalation to a distributed transaction. It seems that the DataContext will actually get in its own way and try to open two connections sometimes, thus causing a promotion to a distributed transaction.
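A rough sketch of that (AuctionDataContext is a placeholder for your generated DataContext):
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction(IsolationLevel.Serializable))
    using (var db = new AuctionDataContext(conn))
    {
        // hand the manually-created transaction to Linq to SQL so it reuses this one connection
        db.Transaction = tx;
        try
        {
            // read the highest bid, validate the new bid, InsertOnSubmit the new bid...
            db.SubmitChanges();
            tx.Commit();
        }
        catch
        {
            tx.Rollback();
            throw;
        }
    }
}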
