Azure backend data processing architecture - C#

I am creating a website that generates randomised test data; part of this is a random name generator that I want to scale up to circa a million names (written in .NET 4.5 - C#).
My initial solution was to create the names on the web server thread, obviously a bad idea and very slow. That has slowly evolved into a solution with an offline batch processor that populates an Azure table with precompiled names, which are then downloaded by the web server (or a worker role that does the final data compile).
However, this also seems quite slow; even running with parallel processes it takes minutes to download the data.
So I am looking for the best architecture to speed this up.
I have considered having a worker role that does this processing, keeps the results in memory, and waits for the web server to request them. However, I'm not sure this is the best approach, or if it will even solve the problem! (Mostly because I don't know how to transfer the data out.)
So I'm hoping for a little architectural advice, on the best way to bring this data in. I'm not sure if it is simply the case that it is going to take a couple of minutes to process that many records.
(Additional details added later)
The code is running on an Azure web instance, pulling data out of an Azure Storage Table in the same region.
I have profiled the app, and most of the time is spent downloading data from the table.
The data that the random name generator seeds from is a few hundred thousand records in another Azure Table.
I'm now wondering if maybe I'm asking the wrong question! Maybe the simpler question is: given a source of a few hundred thousand first names / surnames, what would be the best way to compile a million of them to pull into a web query?
P.S. I am not a C# guy by any stretch (more a sysadmin); my C# generally follows the script-kiddie approach of finding something online that is vaguely close and assimilating it - I just can't find anything that is vaguely close in this case. (Which makes me think I'm missing something obvious.)
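For what it's worth, once the seed lists are in memory the final compile step is cheap; a minimal sketch of the pairing (class, method and parameter names invented here, assuming the first names and surnames have already been downloaded into two lists) could look like this:

using System;
using System.Collections.Generic;

static class NameCompiler
{
    // Compile 'count' random full names from pre-loaded seed lists.
    public static List<string> Compile(IList<string> firstNames, IList<string> surnames, int count)
    {
        var rng = new Random();
        var results = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            // Pick a random first name / surname pair from the seed data.
            results.Add(firstNames[rng.Next(firstNames.Count)] + " " + surnames[rng.Next(surnames.Count)]);
        }
        return results;
    }
}

Which suggests the real cost is getting the seed data close to the web/worker role in the first place, since the profiling shows the table download dominates.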

Related

Application architecture with data on a shared network, without a database on the server

I'm currently working on a C# project of an application we'd like to develop. We're brainstorming over the question of sharing the data between users. We'd like to be able to specify a folder where all the files of the application are going to be saved and we'd like to be able to save them on a shared folder (server, different PC or Mac, Nas, etc.).
The deployment would be like so:
Installation on the first PC: we choose a network drive, share, whatever, and create all the files for the application in this location.
On the second PC we install the application and choose the same location (on the network); the application doesn't create anything, it sees that the data already exists and uses these files as the application's data.
Same thing on the other clients
The application's files are going to be documents (most likely XML formatted documents) and when opening the application we want to show all the existing documents. The thing is, we don't only want to have the list of documents and be able to edit their content, we also would like to be able to edit the document's property, so in a way we'd like a file (Sqlite, XML, whatever) representing the list of all the documents and their attributes. Same thing for a list of addresses.
I know all that looks exactly like a client / server with database solution, but this solution is out of the question. I was first looking at SQLite for my data files, but I know concurrency can be a real problem and file lock doesn't work well. The thing is, I would have the same problem with simple XML files (refreshing the content when several users are working, accessing locked files).
So I guess my final question is: Is it feasible? Is there an alternative I didn't see which would allow us to do that more easily?
EDIT :
OK, I'm not responding to every post or comment, because I'm currently testing concurrency with SQLite. What I did, and please correct me if the way I test this is wrong, is launch X BackgroundWorkers which all insert records into a sample database (which is recreated every time I start the application). I tried launching 100 iterations of INSERT into the database via these BackgroundWorkers.
Of course concurrency works with one application running; it simply waits for the last BackgroundWorker to do its job and then writes the next record. I also tried inserting at (almost) the same time, meaning I put a loop in every BackgroundWorker waiting for a modulo-5 timestamp (every 5 seconds, every BackgroundWorker runs). Again, it waits for the previous insert query to end before doing the next, and everything works fine. I even tried it with 500 BackgroundWorkers and it worked fine.
I then tried launching my app several times and running them simultaneously. When doing this I did have some issues. With two instances of my app it was still working fine, but when trying this with 4-5 instances, it got really buggy and I got two types of error: 1. database is locked, 2. disk I/O failure. But mostly locked databases.
What I did was pretty intensive; in the scenario of my application, it will never ever come to 5 processes trying to simultaneously insert 500 rows at the same time (maybe I'll get a concurrency of two or three connections). But what really bugged me, and what makes me think my testing method is not really a good one, is that I got these errors trying to work on a database on a shared network, on a NAS AND on my own HDD. Every time it worked for maybe 30-40 queries before throwing me a "database is locked" error.
Am I testing it wrong? Maybe I shouldn't be trying so hard to make this work, but I'm still not convinced that SQLite is not a good alternative to what I'm trying to do, since the concurrency is going to be really small.
With your optimistic/pessimistic locking, you are ultimately trying to build a database. Also, you WILL have issues with consistency while trying to keep multiple files in sync with each other. Think about what happens if you update the "metadata" file and the write fails half-way through because of a network blip. File corruption will ensue, and you will be left trying to reconstruct things from backups.
I would suggest a couple of likely solutions:
1) Host the content yourselves, and let them be pure clients (cloud based deployments are ideal for this). Most network/firewall issues can be circumvented by using HTTP as your transport (web services).
2) Have one of the workstations be the "server", which keeps its data files on the NFS. This will give you transactional integrity, incremental backups, etc. There are lots of good embedded database management systems to help you manage this complexity. MS SQL Server even has some great options for this.
You're right, SQLite uses a file lock on the database file, so storing all the data files in the database would create a write-starvation problem when editing your documents.
Maybe it's a better choice to implement simple optimistic/pessimistic locking yourself at the individual-file level? For example, with a pessimistic lock you simply don't allow anyone to edit a particular file if somebody is already in the process of editing it. In this case you hold a lock on just one file, not on the entire database. If the possibility of a conflict (editing the same file at the same time) is pretty low, it is better to go with optimistic locking.
Simple optimistic locking implementation:
When a user gets a file for reading, it's OK, no problem there. If a user gets a file for editing, you could calculate a hash for this file (or get the timestamp of the file's last update), and then, when the user tries to save the edited file, compare the current (at the moment of saving) hash/timestamp to make sure the file has not been changed by somebody else. If the file has not been changed then it's OK to save it. If the file has been changed, then the current user is out of luck and you need to inform them about it. This optimistic scenario is nice when the probability of this "out of luck" case is pretty low. Otherwise it's better to stick with pessimistic locking, where you don't allow a user even to start editing a file if somebody else is doing it.
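A rough sketch of that hash-based check (class, method and parameter names here are invented for illustration, not taken from any library):

using System;
using System.IO;
using System.Security.Cryptography;

static class OptimisticFileLock
{
    // Take a "version stamp" of the file when the user opens it for editing.
    public static string GetVersionStamp(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return Convert.ToBase64String(sha.ComputeHash(stream));
    }

    // On save: re-hash the file and compare with the stamp taken at open time.
    // If the stamps differ, somebody else saved in the meantime and the current
    // user is "out of luck" and must be told about the conflict.
    public static bool TrySave(string path, string stampAtOpen, string newContent)
    {
        if (GetVersionStamp(path) != stampAtOpen)
            return false; // conflict detected
        // Note: the check and the write are not atomic; a real implementation
        // would hold a short exclusive lock on the file around these two steps.
        File.WriteAllText(path, newContent);
        return true;
    }
}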

Handling limitations in multithreaded server

In my client-server architecture I have a few API functions whose usage needs to be limited.
Server is written in .net C# and it is running on IIS.
Until now I didn't need to perform any synchronization. The code was written in such a way that even if a client sent the same request multiple times (e.g. a "create something" request), one call would end with success and all the others with an error (because of the server code + DB structure).
What is the best way to perform such limitations? For example, I want no more than 1 call of the API method foo() per user per minute.
I thought about a SynchronizationTable which would have just one column, unique_text, and before computing the foo() call I'd write something like foo{userId}{date}{HH:mm} to this table. If the write ends with success I know that there wasn't a foo call from that user in the current minute.
I think there is a much better way, probably in server code, without using the DB for that. Of course, there could be thousands of users calling foo.
To clarify what I need: I think it could be some lightweight DictionaryMutex.
For example:
private static DictionaryMutex FooLock = new DictionaryMutex();

// Hold the per-user lock for the duration of the foo() call.
FooLock.Lock(User.GUID);
try
{
    ...
}
finally
{
    FooLock.Unlock(User.GUID);
}
EDIT:
A solution in which one user cannot call foo twice at the same time would also be sufficient for me. By "at the same time" I mean that the server started to handle the second call before returning the result of the first call.
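For illustration, a per-key lock of the shape sketched above could be built on a ConcurrentDictionary of semaphores (this is a sketch matching the pseudocode, not an existing library class, and the caveats in the answer below about keeping state inside a worker process apply to it):

using System.Collections.Concurrent;
using System.Threading;

public class DictionaryMutex
{
    // One semaphore per key; only one caller may hold a given key at a time.
    private readonly ConcurrentDictionary<string, SemaphoreSlim> locks =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    public void Lock(string key)
    {
        locks.GetOrAdd(key, _ => new SemaphoreSlim(1, 1)).Wait();
    }

    public void Unlock(string key)
    {
        SemaphoreSlim sem;
        if (locks.TryGetValue(key, out sem))
            sem.Release();
    }

    // Note: entries are never removed, so memory grows with the number of
    // distinct keys, and the state lives only inside one worker process.
}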
Note that keeping this state in memory in an IIS worker process opens up the possibility of losing all of this data at any instant. Worker processes can restart for any number of reasons.
Also, you probably want to have two web servers for high availability. Keeping the state inside of worker processes makes the application no longer clustering-ready. This is often a no-go.
Web apps really should be stateless. Many reasons for that. If you can help it, don't manage your own data structures like suggested in the question and comments.
Depending on how big the call volume is, I'd consider these options:
SQL Server. Your queries are extremely simple and easy to optimize for. Expect 1000s of such queries per second per CPU core. This can bear a lot of load. You can use SQL Express for free.
A specialized store like Redis. Stack Overflow is using Redis as a persistent, clustering-enabled cache. A good idea.
A distributed cache, like Microsoft Velocity. Or others.
This storage problem is rather easy because it fits a key/value store model well. And the data is near worthless so you don't even need to backup.
I think you're overestimating how costly this rate limitation will be. Your web-service is probably doing a lot more costly things than a single UPDATE by primary key to a simple table.
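As a rough illustration of how small that work is, here is a sketch combining the question's SynchronizationTable idea with SQL Server (the FooCalls table and its primary key on (UserId, MinuteBucket) are invented for the example):

using System;
using System.Data.SqlClient;

static class RateLimiter
{
    // Returns true if the user is allowed to call foo() this minute.
    // A primary key on (UserId, MinuteBucket) means the INSERT fails
    // atomically if the user has already called within the current minute.
    public static bool TryAcquire(string connectionString, Guid userId)
    {
        string bucket = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm");
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO FooCalls (UserId, MinuteBucket) VALUES (@u, @b)", conn))
        {
            cmd.Parameters.AddWithValue("@u", userId);
            cmd.Parameters.AddWithValue("@b", bucket);
            conn.Open();
            try { cmd.ExecuteNonQuery(); return true; }
            catch (SqlException) { return false; } // PK violation: already called this minute
        }
    }
}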

Live Analytic Data

I'm planning on creating a live analytics page for my website - a bit like Google Analytics but with real live data which will change as new users load a page on my site, etc.
The main site is/will be written using ASP.NET/C# as the back end with an MS SQL database, and the front end will support things like JavaScript (jQuery), CSS3 and HTML5 (if required).
I was wondering what methods I can use for the live analytics in terms of: how to get the data onto the analytics page, what efficient graphing I can use, and how to store the data with fast input/output.
The first thing that came to my mind is to use Node.js - could I use this to achieve a live analytics page? Is it a good idea? Are there any better alternatives? Any drawbacks with this?
Would I need a C# application running on a server to use Node.js to send/receive all the data to and from the website?
Would using a MS SQL database be fast enough? Would I need to store all the data live, or could I store it in chunks every x amount of seconds/minutes? (Which would be more efficient?)
Edit:
I'm going to be using this system over multiple sites; I could be getting anywhere from 10 hits at a time to around 1,000,000 (highly unlikely, but still possible). I want to be able to scale this system and adapt it to the environment it's in.
It really depends on how "real time" the realtime data needs to be. For example, I made this recently:
http://www.reed.co.uk/labs/realtime/
Which shows job applications coming into the system. Obviously there is way too much going on during busy periods to actually be querying the main database in realtime - so, what we do is query a sliding "window" and cache it on the server - a chunk of the last 5 minutes' worth of events.
We then play this back to the user as if it's happening "now". Having a little latency as part of an SLA (where the users don't really care) can make the whole system vastly more scalable.
[EDIT- further explanation]
The data is just retrieved from a basic stored procedure call - naturally, a big system like Reed has hundreds of transactions per second - so we can't keep hitting the main cluster for every user.
All we do is make sure we have a current window, in this case the last 5 minutes of data cached on the server. When a client comes to the site, we get that last 5 minutes of data and play it back like it's happening right now - the end user is none the wiser - but what it means is that all clients are reading off the cache. Once the cache is 5 minutes old, we invalidate it and start again. This means a maximum of one DB hit every five minutes - thus making the system vastly more scalable (not that it really needs to be - as it's just for fun, really).
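A minimal sketch of that kind of window cache using System.Runtime.Caching (the Event type and the database call are placeholders, not the real reed.co.uk code):

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public class Event { public DateTime Timestamp; public string Description; }

static class EventWindowCache
{
    // Every client reads the same cached window; the database is hit roughly
    // once per five minutes, however many users are watching the page.
    public static IList<Event> GetRecentEvents()
    {
        var cache = MemoryCache.Default;
        var events = cache.Get("recent-events") as IList<Event>;
        if (events == null)
        {
            events = LoadLastFiveMinutesFromDatabase(); // placeholder for the sproc call
            cache.Set("recent-events", events, DateTimeOffset.UtcNow.AddMinutes(5));
        }
        return events;
    }

    private static IList<Event> LoadLastFiveMinutesFromDatabase()
    {
        // In the real system this is a single stored procedure call.
        return new List<Event>();
    }
}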
Just so you are aware, Google Analytics already offers live user tracking. When inside the dashboard of a site on Google Analytics, click the Home button on the top bar and then the Real-Time button on the left bar. Considering the design work and quality of this service, it may be a better option than attempting to recreate it. If you do choose to proceed and create your own, then you can at least use their service as a benchmark for the desired features.
Using APIs like Google's charting API (https://developers.google.com/chart/) would be a good approach to displaying the output of your stored data, with decreased development time. If you provide more information on the number of hits you expect, and the scale of the server this software will be hosted on, then it will be easier to answer the speed questions.

Scalability and availability

I am quite confused on which approach to take and what is best practice.
Let's say I have a C# application which does the following:
It sends emails from a queue; the emails to send and all their content are stored in the DB.
Now, I know how to make my C# application almost scalable but I need to go somewhat further.
I want some way of distributing the tasks across, say, X servers, so it is not just one server doing all the processing but the work is shared amongst the servers.
If one server goes down, then the load is shared between the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind in the DB table to indicate which server should be assigned to process that record, and each of the applications on the servers would have an ID of some kind that matches the value in the DB and they would only pull their own records - but this I consider to be cheap, bad practice and unrealistic.
Having a DB table row lock as well, is not something I would do due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be threading per item to process, or batching items up per thread for x number of threads.
How should I approach this, and what do you recommend for making a C# application which is scalable and has high availability? The aim is to have X servers, each with the same application, and for each to be able to get records and process them, but with the processing load shared amongst the servers so that if one server or service fails, the others can take on that load until another server is put back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot and have had a lack of sleep trying to think of a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back x number of records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods to ensure there is no overlap of apps processing the same records.
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of a bool field as above). Then you could run a periodic Agent job which runs a sproc to reset the status of any records which have a status of In Progress but which have gone beyond a time threshold without being set to complete. They would be ready for reprocessing by another app later on.
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as un-sophisticated and work just fine. The best things work with the least complexity.
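For illustration, the claim-and-reset pattern could look roughly like this (the EmailQueue table and its columns are invented; a single UPDATE ... OUTPUT claims a batch atomically, so several servers can pull work without taking the same rows):

using System.Collections.Generic;
using System.Data.SqlClient;

static class WorkQueue
{
    // Atomically claim up to @batchSize pending emails for this server.
    private const string ClaimSql = @"
        UPDATE TOP (@batchSize) EmailQueue
        SET Status = 'InProgress', ClaimedAt = GETUTCDATE()
        OUTPUT inserted.EmailId
        WHERE Status = 'Pending'";

    // Periodic agent job: release anything stuck in InProgress for too long
    // (e.g. the server processing it died), so it is ready for reprocessing.
    private const string ResetSql = @"
        UPDATE EmailQueue
        SET Status = 'Pending', ClaimedAt = NULL
        WHERE Status = 'InProgress'
          AND ClaimedAt < DATEADD(minute, -15, GETUTCDATE())";

    public static List<int> ClaimBatch(string connectionString, int batchSize)
    {
        var ids = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ClaimSql, conn))
        {
            cmd.Parameters.AddWithValue("@batchSize", batchSize);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
        }
        return ids;
    }
}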

Approach for caching data from data logger

Greetings,
I've been working on a C#.NET app that interacts with a data logger. The user can query and obtain logs for a specified time period, and view plots of the data. Typically a new data log is created every minute and stores a measurement for a few parameters. To get meaningful information out of the logger, a reasonable number of logs need to be acquired - data for at least a few days. The hardware interface is a UART to USB module on the device, which restricts transfers to a maximum of about 30 logs/second. This becomes quite slow when reading in the data acquired over a number of days/weeks.
What I would like to do is improve the perceived performance for the user. I realize that with the hardware speed limitation the user will have to wait for the full download cycle at least the first time they acquire a larger set of data. My goal is to cache all data seen by the app, so that it can be obtained faster if ever requested again. The approach I have been considering is to use a light database, like SqlServerCe, that can store the data logs as they are received. I am then hoping to first search the cache prior to querying a device for logs. The cache would be updated with any logs obtained by the request that were not already cached.
Finally my question - would you consider this to be a good approach? Are there any better alternatives you can think of? I've tried to search SO and Google for reinforcement of the idea, but I mostly run into discussions of web request/content caching.
Thanks for any feedback!
Seems like a very reasonable approach. Personally I'd go with SQL CE for storage, make sure you index the column holding the datetime of the record, then use TableDirect on the index for getting and inserting data so it's blazing fast. Since your data is already chronological there's no need to get any slow SQL query processor involved, just seek to the date (or the end) and roll forward with a SqlCeResultSet. You'll end up being speed limited only by I/O. I profiled doing really, really similar stuff on a project and found TableDirect with SQLCE was just as fast as a flat binary file.
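A rough sketch of that TableDirect seek (the table name, index name and column ordinal are assumptions, and the seek option used here is just one plausible choice):

using System;
using System.Data;
using System.Data.SqlServerCe;

static class LogReader
{
    // Read cached logs from 'start' onwards by seeking on the datetime index,
    // avoiding the SQL query processor entirely.
    public static void ReadLogsFrom(SqlCeConnection conn, DateTime start)
    {
        using (SqlCeCommand cmd = conn.CreateCommand())
        {
            cmd.CommandType = CommandType.TableDirect;
            cmd.CommandText = "DataLogs";         // table name (assumed)
            cmd.IndexName = "IX_DataLogs_Time";   // index on the log timestamp (assumed)

            using (SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Scrollable))
            {
                // Seek to the first record at or after 'start', then roll forward.
                if (rs.Seek(DbSeekOptions.AfterEqual, start))
                {
                    while (rs.Read())
                    {
                        DateTime logTime = rs.GetDateTime(0); // ordinal depends on the table
                        // ... read the measurement columns into the local cache / plot
                    }
                }
            }
        }
    }
}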
I think you're on the right track wanting to store it locally in some queryable form.
I'd strongly recommend SQLite. There's a .NET class here.
