Processing files off a queue, archiving and storing in a DB - c#

We will be implementing a solution to pull files off of a queue (IBM MQ). The messages will be 10-20 different XML messages that will need to be dequeued, processed, and archived (stored). However, when we store the data contained in the messages in a DB, we want to retain the source file, so the FileId that gets generated by the archive process will have to be retained and stored with the metadata.
I am trying to figure out what will provide me with the most throughput?
Requirements:
Keep an archive of the file.
Store the parsed data (not the xml blob) from the messages.
Retain the Source File ID from the Archive.
Implement a solution that will scale, as volume could grow significantly; currently it's probably 40-50,000 messages an hour.
So basically my current bottleneck is that the archive process and the data processing / DB load are serial (the archive has to run and succeed before I can start on XML parsing / loading). I didn't know if there is a better way to accomplish this.
I would assume we could add other app servers that would be listening on the same queue and could process the messages in parallel if need be. Try to eliminate the DB as the bottleneck by having it perform as little processing as possible (we could send the XML blob to the DB, but then it would have to perform the XML shredding).

Work with DBAs to tune writes to the database.
Test multiple reads and see if you get better throughput.
My experience is that you'll want to have multiple readers, but the number will depend on lots of factors. Test it and see what is best.
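As a rough sketch of the multiple-reader idea, under stated assumptions: an in-memory BlockingCollection stands in for the MQ queue (each consumer would really hold its own MQ connection), and ArchiveFile, ParseXml and SaveToDatabase are hypothetical placeholders for your archive store, XML shredding and DB load.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ParallelConsumers
{
    // Stand-in for the MQ queue; in practice each consumer would hold its own MQ connection.
    static readonly BlockingCollection<string> Incoming = new BlockingCollection<string>();

    static void Main()
    {
        int consumerCount = 4; // tune this by measuring throughput
        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                foreach (var xml in Incoming.GetConsumingEnumerable())
                {
                    // Archive first so the FileId can be attached to the parsed rows.
                    Guid fileId = ArchiveFile(xml);

                    // Parsing/loading for this message overlaps with other consumers' archiving.
                    object parsed = ParseXml(xml);
                    SaveToDatabase(fileId, parsed);
                }
            });
        }
        // Call Incoming.CompleteAdding() on shutdown so the consumers drain and exit.
        Task.WaitAll(consumers);
    }

    static Guid ArchiveFile(string xml)  { /* write to the archive store */ return Guid.NewGuid(); }
    static object ParseXml(string xml)   { /* shred the XML */ return xml; }
    static void SaveToDatabase(Guid fileId, object parsed) { /* insert parsed rows with fileId */ }
}

More consumers overlap archiving on one message with parsing/loading on another, which removes the strictly serial archive-then-parse pattern; the right consumer count is something to measure rather than guess.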

Related

FIFO log file in C#

I am evolving a log file system that I have had in place for a few builds of a service I develop. Previously I had been opening the file, appending data, and, prior to writing, checking whether the log file had grown over a predetermined size; if so, I started a new log.
So say the log size limit was 100 MB: at that size I delete the file and start a new one, but I lose the history. Functional, but not the best model.
What I want to do is a FIFO model that would chop off the top and add to the end while keeping it consistently no larger than 100mb, and at least as far back as that represents.
The data is high speed in a failure prone industrial environment, so keeping it all in memory and writing the whole file at interval has proven unreliable. (SSD, fast enough to do it reasonably most of the time, spinners fail too often to tolerate)
Likewise the records are of greatly variable length (formatted as XML nodes, so parsing them back out accounts for this easily)
So the only workable model I have come up with thus far is to keep smaller slices (say 10 MB chunks), create new ones, and then delete the oldest 10 MB slice once the count >= 10.
What I would prefer to do is be able to keep the file on disk and work with the tag ends.
Open to suggestions on how this might be best achieved in a reasonable manner, or is there no reasonable manner and the layered multi log approach will be the best option?
The biggest issue with expiring old log entries in a single file is that you have to rewrite the file's content in order to expire older entries. This isn't too bad for small files (up to a few MB in size), but once you get to the point where rewriting takes a significant period of time it becomes problematic.
One of the more common ways to retire logs is to rename the existing log file and/or start a new file. Lots of programs do it that way, with either dated log file names or by using a sequential numbering system - logfile, logfile.1, logfile.2, etc. with higher-numbered files being older. You can add compression to the process to further reduce the storage requirements for expired files, etc.
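A small sketch of that numbered-rotation scheme, under the assumption of retired logs named logfile.1 .. logfile.N (higher numbers being older); Rotate is a helper written for this sketch, not an existing API.

using System.IO;

static class LogRotator
{
    public static void Rotate(string logFile, int maxFiles)
    {
        string oldest = logFile + "." + maxFiles;
        if (File.Exists(oldest))
            File.Delete(oldest);                        // drop the oldest retired copy

        for (int i = maxFiles - 1; i >= 1; i--)         // shift logfile.1 -> .2, .2 -> .3, etc.
        {
            string from = logFile + "." + i;
            if (File.Exists(from))
                File.Move(from, logFile + "." + (i + 1));
        }

        if (File.Exists(logFile))
            File.Move(logFile, logFile + ".1");         // current log becomes logfile.1
        // The caller starts a fresh logFile after this returns.
    }
}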
Another option is to use a more database-like format, or an out-and-out database like SQLite to store your log entries. The primary downside of this of course is that your log files become more difficult to read, since they're not just in plain text form. It's simple enough to write a dump-to-text program whose output can be piped to a log parser... but even this will probably require a change in the way your consumers are interfacing with the log file.
The problem as stated is unlikely to be realistically solvable, I suspect. On the one hand you have the limitations of file manipulation, and on the other the fact that your log consumers are many and varied and therefore changes to the logging structure will be an involved process.
About all I can suggest is that you trial a log aging process similar to this:
Rename current log file
Walk renamed file and copy desired contents to new log file
Discard or archive renamed log
Beware duplication or data loss.
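A rough sketch of that aging process, assuming one record per line (your XML nodes fit this if each node is written on its own line) and a hypothetical keepRecord predicate that decides which entries survive the trim.

using System;
using System.IO;

static class LogAger
{
    public static void AgeLog(string logPath, Func<string, bool> keepRecord)
    {
        string renamed = logPath + ".aging";
        File.Move(logPath, renamed);              // 1. rename the current log file

        using (var reader = new StreamReader(renamed))
        using (var writer = new StreamWriter(logPath, append: false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (keepRecord(line))             // 2. copy only the entries you want to keep
                    writer.WriteLine(line);
            }
        }

        File.Delete(renamed);                     // 3. discard (or archive) the renamed log
    }
}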
I don't know why you need this "chop off the top and add to the end while keeping it consistently no larger than 100mb" feature.
The general design approach is archiving: simply rename the oversized file to another name, or move it somewhere else, then reuse the same filename for the new file.
It's as simple as that.

More appropriate for my task: background worker or thread pool?

I have a simple web application module which basically accepts requests to save a zip file on PageLoad from a mobile client app.
Now, what I want to do is to unzip the file, read the file inside it, and process it further, including making entries into a database.
Update: the zip file and its contents will be fairly small in size, so the server shouldn't be burdened with much load.
Update 2: I just read about when IIS queues requests (at global/app level). So does that mean that I don't need to implement complex request handling mechanism and the IIS can take care of the app by itself?
Update 3: I am looking to offload the processing of the downloaded zip not only for the sake of minimizing the overhead (in terms of performance) but also to avoid the problem of table locking while the file is processed and records are updated in the same table. In a scenario where multiple devices request the page and the background task processes and updates the database in parallel, this would cause an exception.
As of now I have zeroed in on two solutions:
To implement a concurrent/message queue
To implement the file processing code into a separate tool and schedule a job on the server to check for non-processed file(s) and process them serially.
I am inclined towards the queuing mechanism and will try to implement it, as it seems less dependent on configuration compared to manually setting up the job/schedule on the server side.
So, what do you guys recommend me for this purpose?
Moreover, after the zip file is requested and saved on the server side, the client/server connection is released. I'm not looking to burden my IIS.
Imagine a couple of hundred clients simultaneously requesting the page..
I actually haven't used either of them before, so any samples or how-tos will be much appreciated.
I'd recommend TPL and Rx Extensions: you make your unzipped file list an observable collection and for each item start a new task asynchronously.
I'd suggest a queue system.
When you receive a file, you save the path into a thread-synchronized queue. Meanwhile a background worker (or preferably another machine) will check this queue for new files and dequeue the entry to handle it.
This way you won't launch an unknown number of threads (one per zip file) and can handle the zip files in one location. This also makes it easier to move your zip-handling code to another machine when the load gets too heavy. You just need to access a common queue.
The easiest would probably be to use a static Queue with a lock-object. It is the easiest to implement and does not require external resources. But this will result in the queue being lost when your application recycles.
You mentioned that losing zip files is not an option, so this approach is not the best if you don't want to rely on external resources. Depending on your load it may be worth utilizing external resources - meaning uploading the zip file to a common storage location on another machine and adding a message to a queue on another machine.
Here's an example with a local queue:
// Requires System, System.Collections.Concurrent and System.Threading.
ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

void GotNewZip(string pathToZip)
{
    queue.Enqueue(pathToZip); // Add a new work item to the queue
}

void MethodCalledByWorker()
{
    while (true)
    {
        if (queue.IsEmpty)
        {
            // Supposedly no work to be done; wait a few seconds and check again (new iteration)
            Thread.Sleep(TimeSpan.FromSeconds(5));
            continue;
        }

        string pathToZip;
        if (queue.TryDequeue(out pathToZip)) // If TryDequeue returns false, another thread already dequeued the last element
        {
            HandleZipFile(pathToZip);
        }
    }
}
This is a very rough example. Whenever a zip arrives, you add the path to the queue. Meanwhile a background worker (or multiple; the example is thread-safe) will handle one zip after another, getting the paths from the queue. The zip files will be handled in the order they arrive.
You need to make sure that your application does not recycle meanwhile. But that's the case with all resources you have on the local machine, they'll be lost when your machine crashes.
I believe you are optimising prematurely.
You mentioned table-locking - what kind of db are you using? If you add new rows or update existing ones most modern databases in most configurations will:
use row-level locking; and
be fast enough without you needing to worry about locking.
I suggest starting with a simple method
//Unzip
//Do work
//Save results to database
and get some proof it's too slow.
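As a minimal sketch of that simple inline approach, assuming the built-in System.IO.Compression zip support (.NET 4.5+); ProcessEntry and SaveToDatabase are hypothetical placeholders for your own parsing and database code.

using System.IO;
using System.IO.Compression;

static class SimpleZipHandler
{
    public static void Handle(string zipPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(zipPath))      // Unzip
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                using (var reader = new StreamReader(entry.Open()))
                {
                    string contents = reader.ReadToEnd();           // Do work
                    object result = ProcessEntry(contents);
                    SaveToDatabase(result);                         // Save results to database
                }
            }
        }
    }

    static object ProcessEntry(string contents) { return contents; }
    static void SaveToDatabase(object result)   { /* insert into the database */ }
}

If this turns out to be too slow under real load, that measurement tells you where to add queuing or batching.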

Howto read this large text file? Memory Mapped File?

I am in the design phase of a simple tool I want to write where I need to read large log files. To give you guys some context I will first explain you something about it.
The log files I need to read consists of log entries which always consist of the following 3-line format:
statistics : <some data which is more or less of the same length, about 100 chars>
request : <some xml string which can be small (10KB) or big (25MB) and anything in between>
response : <ditto>
The log files can be about 100-600 MB in size, which means a lot of log entries. Now these log entries can have a relation with each other; for this I need to start reading the file from the end to the beginning. This relationship can be deduced from the statistics line.
I want to use the info in the statistics line to build up some datagrid which the users can use to search through the data and do some filtering operations. Now I don't want to load the request / response lines into memory until the user actually needs it. In addition I want to keep the memory load small by limiting the maximum of loaded request/response entries.
So I think I need to save the offsets of the statistics lines when I parse the file for the first time, creating an index of statistics. Then when the user clicks on some statistic, which is an element of a log entry, I read the request/response from the file using this offset. I can then hold it in some memory pool which ensures there are not too many loaded request/response entries (see earlier requirement).
The problem is that I don't know how often the user is going to need the request/response data. It could be a lot, or it could be only a few times. In addition, the log file could be loaded from a network share.
The question I have is:
Is this a scenario where you should use a memory-mapped file, given that there could be a lot of read operations? Or is it better to use a plain FileStream? BTW, I don't need write operations on the log file at this stage, but there could be in the future!
If you have other tips or see flaws in my thinking so far please let me know as well. I am open for any approach.
Update:
To clarify some more:
The tool itself has to do the parsing when the user loads a log file from a drive or network share.
The tool will be written as WinForms application.
The user can export a selection of log entries they have made. At this moment the format of this export is unknown (binary, file db, text file). This export can be imported by the application itself, which then only shows the selection made by the user.
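As a rough sketch of the offset-index approach described in the question, under these assumptions: UTF-8 text, '\n' line endings, and a ReadLineWithOffset helper written for this sketch (not a framework API). The collected offsets can be walked in reverse to relate entries from the end of the file backwards.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class LogIndex
{
    readonly string path;
    public readonly List<long> StatisticsOffsets = new List<long>();

    public LogIndex(string path) { this.path = path; }

    // First pass: remember where every statistics line starts.
    public void Build()
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            long offset;
            string line;
            while ((line = ReadLineWithOffset(fs, out offset)) != null)
            {
                if (line.StartsWith("statistics :", StringComparison.Ordinal))
                    StatisticsOffsets.Add(offset);
            }
        }
    }

    // On demand: seek back and pull only the request/response of one entry.
    public string[] ReadEntryBody(long offset)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            long ignored;
            ReadLineWithOffset(fs, out ignored);                   // skip the statistics line
            string request = ReadLineWithOffset(fs, out ignored);  // request line
            string response = ReadLineWithOffset(fs, out ignored); // response line
            return new[] { request, response };
        }
    }

    // Reads one '\n'-terminated line, reporting the byte offset where it started.
    // Byte-by-byte reading is slow but keeps the offsets exact; buffer it for real use.
    static string ReadLineWithOffset(Stream stream, out long startOffset)
    {
        startOffset = stream.Position;
        var bytes = new List<byte>();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        if (b == -1 && bytes.Count == 0) return null;
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}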
You're talking about some stored data that has some defined relationships between actual entries... Maybe it's just me, but this scenario just calls for some kind of a relational database. I'd suggest to consider some portable db, like SQL Server CE for instance. It'll make your life much easier and provide exactly the functionality you need. If you use db instead, you can query exactly the data you need, without ever needing to handle large files like this.
If you're sending the request/response chunk over the network, the network send() time is likely to be so much greater than the difference between seek()/read() and using memmap that it won't matter. To really make this scale, a simple solution is to just break up the file into many files, one for each chunk you want to serve (since the "request" can be up to 25 MB). Then your HTTP server will send that chunk as efficiently as possible (perhaps even using zerocopy, depending on your webserver). If you have many small "request" chunks, and only a few giant ones, you could break out only the ones past a certain threshold.
I don't disagree with the answer from walther. I would go db or all memory.
Why are you so concerned about saving memory, when 600 MB is not that much? Are you going to be running on machines with less than 2 GB of memory?
Load into a dictionary with statistics as a key and the value a class with two properties - request and response. Dictionary is fast. LINQ is powerful and fast.
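A tiny sketch of that all-in-memory approach, under the assumption that the statistics line is distinct enough to act as the dictionary key; LogEntry and InMemoryLog are defined just for this sketch.

using System.Collections.Generic;

class LogEntry
{
    public string Request;
    public string Response;
}

static class InMemoryLog
{
    // Each block is { statistics, request, response } as read from the file.
    public static Dictionary<string, LogEntry> Load(IEnumerable<string[]> threeLineBlocks)
    {
        var entries = new Dictionary<string, LogEntry>();
        foreach (var block in threeLineBlocks)
            entries[block[0]] = new LogEntry { Request = block[1], Response = block[2] };
        return entries;
    }
}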

Fastest way to read a file

My application currently implements a FileSystemWatcher object to monitor a directory (C:\Incoming) for the creation of a file (Input.xml).
I'm currently using a Streamreader object to read the file into my application, however I'm concerned about performance considering that the data will be used to perform operations in a SQL Server database.
What would be the FASTEST way to read the file into memory (or am I already using it)?
You are using the standard approach to the task.
If you need to load the data (large data, that is) into a database, the bottleneck will be between your code and the database, unless you use a BULK INSERT technique, as opposed to inserting rows one by one. The details of that depend on the particular database server.
This is however no concern if the files are relatively small in which case the load will be more evenly distributed. Even then I would not care much about the speed of disk access.
Make sure that the file is completely written before you start reading it. For example, try opening it for exclusive reading first. It does help a bit if the file is made accessible to you through a rename operation, as opposed to a create operation, especially if you expect many files to be arriving, because your server side then does not have to busy-wait until the file is completely written. This basically means that the file should first be written with a file name that you are NOT watching for.
.NET has the SqlBulkCopy class too
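A minimal SqlBulkCopy sketch, assuming a DataTable already filled from the parsed file; the connection string and the destination table dbo.IncomingData are hypothetical.

using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    public static void Load(DataTable rows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            connection.Open();
            bulkCopy.DestinationTableName = "dbo.IncomingData"; // hypothetical staging table
            bulkCopy.BatchSize = 5000;                          // tune against your server
            bulkCopy.WriteToServer(rows);                       // one bulk operation instead of row-by-row inserts
        }
    }
}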

Is there a fast and scalable solution to save data?

I'm developing a service that needs to be scalable in Windows platform.
Initially it will receive approximately 50 connections per second (each connection will send approximately 5 KB of data), but it needs to be scalable to receive more than 500 in the future.
It's impracticable (I guess) to save the received data to a common database like Microsoft SQL Server.
Is there another solution to save the data? Considering that it will receive more than 6 million "records" per day.
There are 5 steps:
Receive the data via http handler (c#);
Save the received data; <- HERE
Request the saved data to be processed;
Process the requested data;
Save the processed data. <- HERE
My pre-solution is:
Receive the data via http handler (c#);
Save the received data to Message Queue;
Request the saved data from the Message Queue to be processed, using a Windows service;
Process the requested data;
Save the processed data to Microsoft SQL Server (here's the bottleneck);
6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day - do you expect traffic to be "bursty"?
I wouldn't personally use message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk. Producers will just dump records to be written into the queue.
Have a separate batch task which will insert a bunch of records into the database at a time.
Benchmark the optimal (or at least a "good") number of records to batch upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog) so that you don't wait for both file access and the database at the same time.
I suggest that you do some tests nice and early, to see what the database can cope with (and letting you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
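A rough sketch of the in-memory producer/consumer queue with a single disk-writer thread, assuming a BlockingCollection; DiskSpooler is written for this sketch, and the separate batch task would later read the spool file and bulk insert into the database.

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class DiskSpooler
{
    readonly BlockingCollection<string> pending = new BlockingCollection<string>();
    readonly Task writer;

    public DiskSpooler(string spoolFile)
    {
        // A single consumer thread owns the file, so no locking is needed around writes.
        writer = Task.Run(() =>
        {
            using (var output = new StreamWriter(spoolFile, append: true))
            {
                foreach (var record in pending.GetConsumingEnumerable())
                {
                    output.WriteLine(record);
                    output.Flush(); // trade throughput for durability as needed
                }
            }
        });
    }

    // Called by the HTTP handler (producer); returns immediately after queueing.
    public void Add(string record) { pending.Add(record); }

    public void Shutdown()
    {
        pending.CompleteAdding();
        writer.Wait();
    }
}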
I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.
If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.
Why not do this:
1.) Receive data
2.) Process data
3.) Save original and processed data at once
That would save you the trouble of requesting it again if you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure that your inserts are as cheap as possible. If that isn't possible then queuing up the work makes some sense. I wouldn't use message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, assuming you're not writing a ton of data in each record.
