Just worked through a quick sandbox version of what I posted earlier.
The original requirements still stand, and this worked nicely in my console sandbox version:
Read a file (of account IDs), CSV format
Download the account data file from the web for each account (by Id) (REST API)
Pass the file to a converter that will produce a report (financial predictions etc) [~20ms]
If the prediction threshold is within limits, run a parser to analyse the data [400ms]
Generate a report for the analysis above [80ms]
Upload all files generated to the web (REST API)
I'm trying to make use of NServiceBus this time, as suggested, but I'm finding it hard to work out how to make things fit.
I'm pushing a bulk of 20 account IDs into a message; that's read by an NServiceBus handler, which Posts them to the BufferBlock one at a time. The reason for bulk loading is that our systems use a very slow web service to get account info - especially archived accounts (deceased etc.) that sometimes take 3-4s! (Don't worry about that for the moment.)
How do I keep popping stuff into the BufferBlock and still have my TPL Dataflow pipeline active as I read messages? Do I configure the TPL Dataflow pipeline separately, wait forever, and just Post when a new batch of messages comes through?
There are roughly 2 million accounts to process. Ideally I'd like the Account Report Generator (TPL-DF) to be alive all the time and I just push messages to its buffer which it will work through. It's the lifetime that confuses me.
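For reference, this is roughly what my sandbox pipeline looks like (a trimmed sketch; the block options, URL and stub bodies are placeholders, not the real code):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class AccountReportPipeline
{
    // Head of the pipeline; the NServiceBus handler Posts account IDs here.
    public static BufferBlock<int> Input = new BufferBlock<int>();

    static readonly HttpClient Http = new HttpClient();

    public static void Build()
    {
        var parallel = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 };

        // Download each account's data file (the slow web service call).
        var download = new TransformBlock<int, string>(
            id => Http.GetStringAsync("https://example.test/accounts/" + id), // placeholder URL
            parallel);

        // Converter (~20ms), threshold check + parser (~400ms), then upload; all stubs here.
        var convert = new TransformBlock<string, string>(data => ConvertToPredictions(data));
        var parse = new TransformBlock<string, string>(
            report => WithinThreshold(report) ? Analyse(report) : null, parallel);
        var upload = new ActionBlock<string>(async output =>
        {
            if (output != null) await UploadAsync(output);
        });

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        Input.LinkTo(download, link);
        download.LinkTo(convert, link);
        convert.LinkTo(parse, link);
        parse.LinkTo(upload, link);
        // Input.Complete() is never called while the service runs; with an
        // empty buffer the blocks just sit idle waiting for the next Post.
    }

    static string ConvertToPredictions(string data) { return data; }      // ~20ms converter stub
    static bool WithinThreshold(string report) { return true; }
    static string Analyse(string report) { return report; }               // ~400ms parser stub
    static Task UploadAsync(string output) { return Task.FromResult(0); } // REST upload stub
}

The handler then just calls AccountReportPipeline.Input.Post(id) for each ID in the batch. What I don't know is whether "never Complete, just keep Posting" is the intended way to keep it alive.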
I have been asked to use Amazon SQS in our new system. Our business depends on tasks/requests flowing from clients to our support agents. Once a client submits a task/request, it should be queued in my SQL Server database, and each queued task should be assigned to a non-busy agent, because the flow says an agent can handle only one task/request at a time. So if 10 tasks/requests come into my system, all should be queued; the system should then forward each task to an agent who is currently free, and once that agent solves the task he should get the next one, if any. Otherwise the system should wait for an agent to finish his current task before assigning a new one. And, of course, there must not be any duplication in task/request handling.
What do I need, now?
A simple reference that clarifies what Amazon SQS is, since this is my first time using a queuing service?
How can I use it with C# and SQL Server? I have read this topic but I still feel that something is missing, as I am not able to start. I am just aiming at a way to process a task at run-time and assign it to an agent, then close it and get a new one, as I explained above.
Asking us to design a system based on a paragraph of prose is a pretty tall order.
SQS is simply a cloud queue system. Based on your description, I'm not sure it would make your system any better.
First off, you are already storing everything in your database, so why do you need to store things in the queue as well? If you want queue semantics while storing things in your database, you could consider SQL Server Service Broker (https://technet.microsoft.com/en-us/library/ms345108(v=sql.90).aspx#sqlsvcbr_topic2), which supports queues within SQL Server. Alternatively, unless your scale is pretty high (100+ tasks/second, maybe) you could just query the table for tasks that need to be picked up.
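For the "just query the table" route, the usual trick is to claim rows with locking hints so two agents never grab the same task. A minimal sketch, assuming a hypothetical dbo.Tasks table with Id, Status, AgentId and CreatedAt columns:

using System.Data.SqlClient;

public static class TaskQueue
{
    // Atomically claims the oldest queued task for an agent, or returns null.
    // UPDLOCK + READPAST make concurrent agents skip rows being claimed,
    // which gives "each task goes to exactly one agent" without SQS.
    public static int? ClaimNextTask(string connectionString, int agentId)
    {
        const string sql = @"
            WITH next AS (
                SELECT TOP (1) *
                FROM dbo.Tasks WITH (UPDLOCK, READPAST, ROWLOCK)
                WHERE Status = 'Queued'
                ORDER BY CreatedAt)
            UPDATE next
            SET Status = 'InProgress', AgentId = @agentId
            OUTPUT inserted.Id;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@agentId", agentId);
            conn.Open();
            var result = cmd.ExecuteScalar(); // null when no queued tasks exist
            return result == null ? (int?)null : (int)result;
        }
    }
}

Each agent (or a dispatcher on their behalf) calls ClaimNextTask whenever it finishes its current task.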
Secondly, it sounds like you might have a workflow around tasks that could extend to more than just a single queue for agents to pick them up. For example, do you have any follow up on the tasks (emailing clients to ask them how their service was, putting a task on hold until a client gets back to you, etc)? If so, you might want to look at Simple Workflow Service (https://aws.amazon.com/swf/) or since you are already on Microsoft's stack you can look at Windows Workflow (https://msdn.microsoft.com/en-us/library/ee342461.aspx)
BTW, SQS does not guarantee exactly-once delivery by default, so if duplication is a big problem for you, you will either have to do your own deduplication or use FIFO queues (http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html), which support deduplication but are limited to 300 transactions/second (roughly 100 messages/second once you account for the standard send -> receive -> delete API calls). With batching that number could obviously be much higher, but given your use case it doesn't sound like you could use batching without a lot of work.
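If you do go the FIFO route, the send side looks roughly like this with the AWS SDK for .NET; the queue URL, group ID and deduplication key below are made up for illustration:

using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class FifoSender
{
    // Sending to a FIFO queue (the queue URL must end in ".fifo").
    // The deduplication ID makes SQS drop retries of the same task within
    // its 5-minute dedup window; here it's derived from the task's own ID.
    public static async Task EnqueueTaskAsync(
        IAmazonSQS sqs, string queueUrl, string taskId, string body)
    {
        await sqs.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl = queueUrl,
            MessageBody = body,
            MessageGroupId = "support-tasks",  // ordering scope
            MessageDeduplicationId = taskId    // dedup key
        });
    }
}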
I have to refactor a fairly time-consuming process in one of my applications and after doing some research I think it's a perfect match for using TPL. I wanted to clarify my understanding of it and ask if there are any more issues which I should take into account.
In a few words: I have a Windows service which runs overnight and sends out emails with data updates to around 10,000 users. At present, the whole process takes around 8 hrs to complete. I would like to reduce it to 2 hrs max.
The application workflow follows the steps below:
1. Iterate through all users list
2. Check if this user has to be notified
3. If so, create an email body by calling external service
4. Send an email
Analysis of the code has shown that step 3 is the most time-consuming one and takes around 3.5 sec to complete. That means that when processing 10,000 users, my application waits well over 6 hrs in total for responses from the external service! I think this is a good enough reason to try to introduce some asynchronous and parallel processing.
So, my plan is to use the Parallel class and its ForEach method to iterate through the users in step 1. As I understand it, this should distribute the processing of each user onto a separate thread, making them run in parallel? The processes are completely independent of each other and don't return any values. In the case of an exception being thrown, it will be persisted in the logs db. As regards step 3, I would like to convert the call to the external service into an async call. As I understand it, this would release the thread's resources so the thread could be reused by the Parallel class to start processing the next user from the list?
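To make it concrete, this is roughly what I have in mind (just a sketch, not implemented yet; the stub methods stand in for the real services, and I've left the async part of step 3 out for now):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class User { public string Email; }

class NotificationJob
{
    int _processed; // shared counter, updated with Interlocked (see pitfalls question below)

    // Stubs standing in for the real steps.
    static bool ShouldNotify(User u) { return true; }                             // step 2
    static string CreateEmailBody(User u) { Thread.Sleep(3500); return "body"; }  // step 3, the slow external call
    static void SendEmail(string to, string body) { }                             // step 4
    static void LogError(Exception ex) { }                                        // persisted to logs db

    public void Run(IList<User> users)
    {
        Parallel.ForEach(
            users,
            new ParallelOptions { MaxDegreeOfParallelism = 10 }, // to be tuned against the external service
            user =>
            {
                if (!ShouldNotify(user)) return;
                try
                {
                    var body = CreateEmailBody(user);
                    SendEmail(user.Email, body);
                    Interlocked.Increment(ref _processed);
                }
                catch (Exception ex)
                {
                    LogError(ex);
                }
            });
    }
}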
I had a read through the MS documentation regarding the TPL, especially the Potential Pitfalls in Data and Task Parallelism document, and the only point I'm not sure about is "Avoid Writing to Shared Memory Locations". I am using a local integer to count the total number of emails processed. As for all the rest, I'm quite positive they're not applicable to my scenario.
My question is, without any implementation as yet: is what I'm trying to achieve possible (especially the async/await part for the external service call)? Should I be aware of any other obstacles that might affect my implementation? Is there any better way of improving the workflow?
Just to clarify, I'm using .NET v4.0
Yes, you can use the TPL for your problem. If you cannot influence your external service, then this might be the best way.
However, you can make the biggest gains if you can get your external source to accept batches, because the source could then actually optimize its performance. Right now you have the overhead of 10,000 messages to serialize, send, work on, receive and deserialize. That is work that could be done once. In addition, your external source might be able to optimize the work it does if it knows it will get multiple records.
So the bottom line is: if you need to optimize locally, the TPL is fine. If you want to optimize the whole process, try to find out whether your external source can help you, because that is where you can make some real progress.
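If the source can accept batches, the client side is simple; a sketch, where CreateEmailBodies is a hypothetical batch endpoint:

using System;
using System.Collections.Generic;
using System.Linq;

class BatchingSketch
{
    // Hypothetical batch endpoint: one round-trip returns the bodies for a
    // whole chunk of users, so the per-message overhead is paid once.
    static IList<string> CreateEmailBodies(IList<string> userIds)
    {
        return userIds.Select(id => "body for " + id).ToList();
    }

    static void Main()
    {
        var userIds = Enumerable.Range(1, 10000).Select(i => "user" + i).ToList();
        const int batchSize = 100;

        for (int i = 0; i < userIds.Count; i += batchSize)
        {
            var batch = userIds.Skip(i).Take(batchSize).ToList();
            var bodies = CreateEmailBodies(batch); // 100 calls become 1
            for (int j = 0; j < batch.Count; j++)
            {
                Console.WriteLine("send to {0}: {1}", batch[j], bodies[j]);
            }
        }
    }
}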
You didn't show any code, and I'm assuming that step 4 (send an e-mail) is not that fast either.
With the case as presented, unless the external service from step 3 (create an email body by calling external service) processes requests in parallel and supports a good load of simultaneous requests, you will not gain much from this refactor.
In other words, test the external service and the e-mail server first for:
Parallel request execution
The way to test this is to send at least 2 simultaneous requests and observe how long it takes to process them.
If they take about double the time of a single request, there is some serial processing going on: either the requests are queued, or some broad lock is being taken.
Load test
Go up to 4, 8, 12, 16, 20, etc, and see where it starts to degrade.
You should set a limit on the number of simultaneous requests, choosing one that keeps per-request execution time within, say, 80% efficiency of the single-request time, assuming you're the sole consumer,
or set it a few requests below the point where degradation starts (e.g. divide by the number of consumers) to leave the external service available for other consumers.
Only then can you decide if the refactor is worth it. If you can't change the external service or the e-mail server, you must weigh whether they offer enough parallel capacity without degrading.
Even so, be realistic. Don't let your service push the external service and the e-mail server to their limits in production.
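Once you know the sweet spot, capping concurrency is the easy part. A minimal sketch, with a made-up limit of 8:

using System;
using System.Threading;
using System.Threading.Tasks;

class ThrottleSketch
{
    // Cap in-flight calls to the external service at a limit found by load
    // testing (8 here is made up). SemaphoreSlim works whether the calls are
    // sync or async; with Parallel.ForEach you could instead set
    // ParallelOptions.MaxDegreeOfParallelism.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(8);

    static void CallExternalService(int userId)
    {
        Gate.Wait();
        try
        {
            Thread.Sleep(3500); // stand-in for the ~3.5s body-generation call
        }
        finally
        {
            Gate.Release();
        }
    }

    static void Main()
    {
        Parallel.For(0, 100, CallExternalService);
    }
}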
This is my first post here, so apologies if this isn't structured well.
We have been tasked to design a tool that will:
Read a file (of account IDs), CSV format
Download the account data file from the web for each account (by Id) (REST API)
Pass the file to a converter that will produce a report (financial predictions etc) [~20ms]
If the prediction threshold is within limits, run a parser to analyse the data [400ms]
Generate a report for the analysis above [80ms]
Upload all files generated to the web (REST API)
Now all those individual points are relatively easy to do. I'm interested in finding out how best to architect something to handle this and to do it fast & efficiently on our hardware.
We have to process roughly 2 million accounts. The square brackets give an idea of how long each step takes on average. I'd like to use the maximum resources available on the machine - 24-core Xeon processors. It's not a memory-intensive process.
Would using the TPL and creating each of these as a task be a good idea? Each step has to happen sequentially, but many accounts can be processed at once. Unfortunately the parsers are not multi-threading aware and we don't have the source (it's essentially a black box for us).
My thoughts were something like this - assumes we're using TPL:
Load account data (essentially a CSV import or SQL SELECT)
For each Account (Id):
Download the data file for each account
ContinueWith using the data file, send to the converter
ContinueWith check threshold, send to parser
ContinueWith Generate Report
ContinueWith Upload outputs
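In code, I imagine each account turning into a chain roughly like this (untested sketch; the step methods are placeholder stubs for the real components):

using System.Threading.Tasks;

class AccountChainSketch
{
    // Placeholder stubs for the steps - the real converter and parser are
    // the black-box components mentioned above.
    static Task<string> DownloadDataFileAsync(int id) { return Task.FromResult("data"); }
    static string ConvertToPredictions(string data) { return data; }     // ~20ms
    static bool WithinThreshold(string predictions) { return true; }
    static string Parse(string predictions) { return predictions; }      // ~400ms
    static string GenerateReport(string analysis) { return analysis; }   // ~80ms
    static void UploadOutputs(int id, string report) { }

    static Task ProcessAccountAsync(int accountId)
    {
        return DownloadDataFileAsync(accountId)
            .ContinueWith(t => ConvertToPredictions(t.Result))
            .ContinueWith(t => WithinThreshold(t.Result) ? Parse(t.Result) : null)
            .ContinueWith(t => t.Result == null ? null : GenerateReport(t.Result))
            .ContinueWith(t => { if (t.Result != null) UploadOutputs(accountId, t.Result); });
    }
}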
Does that sound feasible or am I not understanding it correctly? Would it be better to break down the steps a different way?
I'm a bit unsure on how to handle issues with the parser throwing exceptions (it's very picky) or when we get failures uploading.
All this is going to be in a scheduled job that will run after-hours as a console application.
I would think about using some kind of message bus. That way you can separate the steps, and if one doesn't work (for example because the REST service isn't accessible for some time) you can store the message and process it later on.
Depending on what you use as a message bus, you can introduce threads with it.
In my opinion, you can better design workflows, handle exceptional states and so on if you have a higher-level abstraction like a service bus.
Also, because the parts can run independently, they don't block each other.
One easy way could be to use ServiceStack messaging with its Redis-backed MQ server.
Some advantages quoted from there:
Message-based design allows for easier parallelization and introspection of computations
DLQ messages can be introspected, fixed and later replayed after server updates and rejoin normal message workflow
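A minimal sketch of what that looks like with ServiceStack's Redis MQ (assuming the ServiceStack.Server and ServiceStack.Redis packages; exact namespaces vary by version, and the message type is made up):

using ServiceStack.Messaging;        // IMessage<T>
using ServiceStack.Messaging.Redis;  // RedisMqServer
using ServiceStack.Redis;            // PooledRedisClientManager

public class AccountToProcess        // made-up message type
{
    public int AccountId { get; set; }
}

class Program
{
    static void Main()
    {
        var redisFactory = new PooledRedisClientManager("localhost:6379");
        var mqServer = new RedisMqServer(redisFactory, retryCount: 2);

        // Each handler runs on its own background thread(s); messages that
        // keep failing land in the DLQ, where they can be fixed and replayed.
        mqServer.RegisterHandler<AccountToProcess>(m =>
        {
            ProcessAccount(m.GetBody().AccountId); // download, convert, parse, report, upload
            return null; // no reply message
        });

        mqServer.Start();
    }

    static void ProcessAccount(int id) { }
}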
I think the easiest way to start with multiple threads in your case would be to put the entire operation for each account ID on a thread (or better, on the ThreadPool). Done the way proposed below, I don't think you will need to coordinate between threads.
Something like this to put the data on the thread pool queue:
var accountIds = new List<int>(); // populated from the CSV file of account IDs
foreach (var accountId in accountIds)
{
    // Each account runs independently on a ThreadPool thread.
    ThreadPool.QueueUserWorkItem(ProcessAccount, accountId);
}
And this is the function that will process each account:
public static void ProcessAccount(object accountId)
{
    // The steps run sequentially on this worker thread:
    // 1. Download the data file for this account
    // 2. Send the data file to the converter
    // 3. Check the threshold; if within limits, send to the parser
    // 4. Generate the report
    // 5. Upload the outputs
}
I have a pricing application. It sends pricing requests to an Azure Service Bus Queue (could be any queue) "PricingRequestQueue". There are a number of workers that pick these up, process them and return the results to a PricingResponse Queue.
I would like to create an Observable over the PricingResponse queue. I do not require any filtering, but I would like to read the messages off using the batch interface (QueueClient.BeginReceiveBatch). The queue has the expected number of messages, and has a session to read from (QueueClient.AcceptMessageSession(correlationIdentifier)).
I'm still trying to get my head around RX, and this would really clear things up.
There is the CloudFx library that adds Rx extensions to Azure.
https://www.nuget.org/packages/Microsoft.Experience.CloudFx/
However, I must warn you that we have found some thread leaks in the current CloudFx libraries (in particular with the table storage one; however, you haven't needed the Rx extensions for that since table storage 2.0).
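If you'd rather roll your own than take the CloudFx dependency, you can wrap the batch-receive loop in Observable.Create. A sketch, assuming the older Microsoft.ServiceBus.Messaging client; ReceiveBatchAsync is the Task-based sibling of BeginReceiveBatch, and stopping after the expected message count is this queue's completion rule:

using System;
using System.Reactive.Linq;
using Microsoft.ServiceBus.Messaging;

static class PricingResponses
{
    public static IObservable<BrokeredMessage> ToObservable(
        QueueClient responseQueue, string correlationIdentifier, int expectedCount)
    {
        return Observable.Create<BrokeredMessage>(async (observer, ct) =>
        {
            var session = responseQueue.AcceptMessageSession(correlationIdentifier);
            var received = 0;
            while (received < expectedCount && !ct.IsCancellationRequested)
            {
                // Pull up to 32 messages at a time, waiting at most 5s per poll.
                var batch = await session.ReceiveBatchAsync(32, TimeSpan.FromSeconds(5));
                if (batch == null) continue;
                foreach (var message in batch)
                {
                    observer.OnNext(message);
                    received++;
                }
            }
            observer.OnCompleted();
        });
    }
}

Subscribers would then Complete() each BrokeredMessage once they have processed it.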
I have a project in which I need to build a service; we will add about 500 RSS feeds for different sites to it, and we want the service to collect new RSS items from these sources and save the Title and URL in my SQL Server database.
How can I determine the best architecture design, and what code would help me with that?
These suggestions are not specific to your stack (C#, ASP.NET), but I would definitely not recommend doing any of this from the request-response cycle of your web app. It must be done in an asynchronous fashion, but the results can be served from the database that you populate with the feed entries.
It's likely that you'll have to:
1. Build an architecture where you poll each feed every X minutes. Whether it's a cron job or a daemon that runs continuously, you'll have to poll the feeds one after the other (or with some kind of concurrency, but the design is the same). Please make use of HTTP headers like ETag and If-Modified-Since to avoid polling data that hasn't been updated (see the sketch after this list).
2. Then, you will need to parse the feeds themselves. It's very likely that you'll have to support different flavors of RSS and Atom, but most parsers actually support both.
3. Finally, you'll have to store the entries and, more importantly, before you insert them, make sure you haven't already added them. You should use the id or guid of the entries, but it's likely that you'll have to use your own system too (links, hash...) because many feeds do not have these.
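For the conditional-request point in step 1, a minimal sketch (HttpClient; you'd persist the ETag and Last-Modified values per feed between polls):

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ConditionalFetchSketch
{
    static readonly HttpClient Http = new HttpClient();

    // Returns the new feed body, or null if the feed hasn't changed.
    static async Task<string> FetchIfChangedAsync(
        string feedUrl, string lastEtag, DateTimeOffset? lastModified)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, feedUrl);
        if (lastEtag != null)
            request.Headers.TryAddWithoutValidation("If-None-Match", lastEtag);
        if (lastModified.HasValue)
            request.Headers.IfModifiedSince = lastModified;

        var response = await Http.SendAsync(request);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return null; // nothing new; skip parsing entirely

        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}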
If you want to reduce the amount of polling that you'll have to do, while still keeping timely results, you'll have to implement PubSubHubbub for the feeds which support it.
If you don't want to deal with any of the numerous issues exposed earlier (polling in a timely manner, parsing content, diffing to keep entries unique...), I would recommend using Superfeedr, as it deals with all these pain points.
I am not going to go into details about implementation or detailed architecture here (mostly from lack of time at this particular moment), but I will say this:
It's not the web service that should consume the RSS feeds; it should merely be responsible for spawning the work to do so asynchronously.
You should not use threads from the ThreadPool to do this, for two reasons. One is that the work can be assumed to be more or less time-consuming (the ThreadPool is recommended primarily for short-running tasks), and, perhaps more importantly, ThreadPool threads are used to serve incoming web requests; you don't want to compete with that.
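As a sketch of the alternative: hand the polling loop to a dedicated thread via TaskCreationOptions.LongRunning, so it never borrows a ThreadPool thread that could be serving requests (PollAllFeeds is illustrative):

using System;
using System.Threading;
using System.Threading.Tasks;

class FeedPollerHost
{
    static void PollAllFeeds() { /* fetch + parse + store, as discussed above */ }

    public static Task Start()
    {
        // LongRunning hints the scheduler to give this loop its own thread
        // instead of borrowing one from the ThreadPool.
        return Task.Factory.StartNew(() =>
        {
            while (true)
            {
                PollAllFeeds();
                Thread.Sleep(TimeSpan.FromMinutes(1)); // poll interval
            }
        }, TaskCreationOptions.LongRunning);
    }
}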