I need to be able to insert thousands of transactions per second into some form of storage, and be able to query it quickly.
What I would like to do is log all TCP requests (the key being the source or destination IP address).
Then, when another request comes in, check whether a prior request from that IP is in the store/database/cache and perform the appropriate action.
SQLExpress, SQLCompact, SQLite, MongoDB, Firebird? Are any of these fast enough?
Should I just use an in-memory data structure of some description?
The store needs to be accessible by multiple threads concurrently...
Any suggestions please...
UPDATE: RaptorDB? Is it any good? Worth considering along with the other databases/NoSQL options above?
http://www.codeproject.com/KB/database/RaptorDB.aspx
At these speeds, your bottleneck is likely to be your disk; a standard SATA drive can do only a few tens of transactions per second. To supplement the disk, throw lots of RAM at it so that the database is stored mostly in memory.
Alternatively, just do it all in memory by rolling your own logger that uses an associative array. Do you need to log ALL data being sent, or just the fact that PC A sent data to PC B?
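If all you need is the fact that an IP has been seen, a minimal in-memory sketch (class and member names are mine, just to illustrate the idea) could be little more than a concurrent dictionary keyed by IP, which also covers the multi-threaded access requirement:

using System;
using System.Collections.Concurrent;
using System.Net;

// Hypothetical in-memory request log keyed by IP address.
// ConcurrentDictionary makes it safe to use from multiple threads at once.
public class RequestLog
{
    private readonly ConcurrentDictionary<IPAddress, DateTime> _lastSeen =
        new ConcurrentDictionary<IPAddress, DateTime>();

    // Record a request and report whether this IP has been seen before.
    public bool RecordAndCheck(IPAddress address)
    {
        bool seenBefore = _lastSeen.ContainsKey(address);
        _lastSeen[address] = DateTime.UtcNow;   // upsert the last-seen timestamp
        return seenBefore;
    }
}

You would still need some expiry/eviction policy to keep memory bounded, but at thousands of requests per second a structure like this is nowhere near the disk bottleneck described above.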
I have a project where, at peak times, the site will get 1000 calls per second, and these calls need to be saved in the DB (MS SQL DB).
What is the best practice for managing this many connections at scale?
I am using .NET C#.
Currently I am building this as a site that receives all the calls as POST requests.
Thanks for your answers.
From my experience, the fastest way to insert a large amount of data is to use a GUID as the primary key. If you use an autoincrement column, SQL Server will write the rows one by one to get the next ID.
But if there is a GUID, the server can write all your data at the "same" time.
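As a rough illustration of the idea (table, column, and connection-string names are made up), the client can generate the key itself, so the insert never waits on the server for the next identity value:

using System;
using System.Data.SqlClient;

// Sketch: insert with a client-generated GUID key instead of an identity column.
// Table name, column names and connection string are placeholders.
static void SaveCall(string connectionString, string payload)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "INSERT INTO Calls (Id, Payload) VALUES (@id, @payload)", connection))
    {
        command.Parameters.AddWithValue("@id", Guid.NewGuid()); // key generated on the client
        command.Parameters.AddWithValue("@payload", payload);
        connection.Open();
        command.ExecuteNonQuery();
    }
}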
Depending on the resources of your production machine, there are several approaches:
1st: Store all the requests in one big list, and your application writes all the requests synchronously into your database. Pro: pretty easy to code. Con: could be a real bottleneck.
2nd: Store all the requests in one big list, and your application starts a thread for every 1000 requests you received; these threads write the data into your database asynchronously (a rough sketch follows this list). Pro: should be faster. Con: harder to code and harder to maintain.
3rd: Store the requests in the server's memory and write the data into the database when the application is not busy.
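A rough sketch of that batched second approach, assuming SQL Server and made-up table/column names, and using a single background task that writes in chunks of 1000 rather than a new thread per 1000 requests:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

// Requests are queued in memory; a background task writes them to the
// database in batches of 1000, each batch in one transaction.
public class BatchedRequestWriter
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly string _connectionString;

    public BatchedRequestWriter(string connectionString)
    {
        _connectionString = connectionString;
        Task.Run(() => WriteLoop());
    }

    public void Enqueue(string request) => _queue.Add(request);

    private void WriteLoop()
    {
        var batch = new List<string>(1000);
        foreach (var request in _queue.GetConsumingEnumerable())
        {
            batch.Add(request);
            if (batch.Count == 1000)
            {
                WriteBatch(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0)        // flush the final partial batch on shutdown
            WriteBatch(batch);
    }

    private void WriteBatch(List<string> batch)
    {
        using (var connection = new SqlConnection(_connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                foreach (var request in batch)
                {
                    using (var command = new SqlCommand(
                        "INSERT INTO Requests (Body) VALUES (@body)", connection, transaction))
                    {
                        command.Parameters.AddWithValue("@body", request);
                        command.ExecuteNonQuery();
                    }
                }
                transaction.Commit();
            }
        }
    }
}

A real implementation would also need error handling and a clean shutdown path, but this shows the shape of it.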
I am in the design phase of a simple tool I want to write where I need to read large log files. To give you some context, I will first explain something about it.
The log files I need to read consist of log entries which always follow this 3-line format:
statistics : <some data which is more or less of the same length, about 100 chars>
request : <some xml string which can be small (10KB) or big (25MB) and anything in between>
response : <ditto>
The log files can be about 100-600 MB in size, which means a lot of log entries. Now, these log entries can have a relation to each other, and for this I need to start reading the file from the end to the beginning. These relationships can be deduced from the statistics line.
I want to use the info in the statistics line to build up a datagrid which the users can use to search through the data and do some filtering operations. Now I don't want to load the request/response lines into memory until the user actually needs them. In addition I want to keep the memory load small by limiting the maximum number of loaded request/response entries.
So I think I need to save the offsets of the statistics lines when I am parsing the file for the first time, creating an index of statistics. Then, when the user clicks on some statistic that is an element of a log entry, I read the request/response from the file by using this offset. I can then hold it in some memory pool which takes care that there are not too many loaded request/response entries (see the earlier requirement).
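In rough C# (all names are mine, just to illustrate the idea), the index building and the later on-demand read would look something like this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch of the offset-index idea: scan the file once, remember the byte offset of
// every "statistics :" line, and later seek back to read request/response on demand.
// Assumes UTF-8 text and '\n' line endings.
public class LogIndex
{
    public List<long> StatisticsOffsets { get; } = new List<long>();

    public void Build(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long lineStart = 0;
            var lineBytes = new List<byte>();
            int b;
            // FileStream buffers internally, so reading byte-by-byte is acceptable for a sketch.
            while ((b = stream.ReadByte()) != -1)
            {
                if (b == '\n')
                {
                    string text = Encoding.UTF8.GetString(lineBytes.ToArray());
                    if (text.StartsWith("statistics :"))
                        StatisticsOffsets.Add(lineStart);
                    lineBytes.Clear();
                    lineStart = stream.Position;   // offset of the next line's first byte
                }
                else
                {
                    lineBytes.Add((byte)b);
                }
            }
        }
    }

    // Read raw bytes starting at a saved offset (a real version would read until
    // the end of the request/response lines rather than a fixed length).
    public string ReadAt(string path, long offset, int length)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            stream.Seek(offset, SeekOrigin.Begin);
            var buffer = new byte[length];
            int read = stream.Read(buffer, 0, length);
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }
}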
The problem is that I don't know how often the user is going to need the request/response data. It could be a lot; it could be a few times. In addition, the log file could be loaded from a network share.
The question I have is:
Is this a scenario in which you should use a memory-mapped file, given that there could be a lot of read operations? Or is it better to use a plain FileStream? BTW, I don't need write operations to the log file at this stage, but that could change in the future!
If you have other tips or see flaws in my thinking so far, please let me know as well. I am open to any approach.
Update:
To clarify some more:
The tool itself has to do the parsing when the user loads a log file from a drive or network share.
The tool will be written as WinForms application.
The user can export a selection of log entries. At this moment the format of this export is unknown (binary, file db, text file). This export can be imported by the application itself, which then only shows the selection made by the user.
You're talking about some stored data that has defined relationships between actual entries... Maybe it's just me, but this scenario just calls for some kind of relational database. I'd suggest considering a portable db, like SQL Server CE for instance. It'll make your life much easier and provide exactly the functionality you need. If you use a db instead, you can query exactly the data you need, without ever having to handle large files like this.
If you're sending the request/response chunk over the network, the network send() time is likely to be so much greater than the difference between seek()/read() and using memmap that it won't matter. To really make this scale, a simple solution is to just break up the file into many files, one for each chunk you want to serve (since the "request" can be up to 25 MB). Then your HTTP server will send that chunk as efficiently as possible (perhaps even using zero-copy, depending on your web server). If you have many small "request" chunks, and only a few giant ones, you could break out only the ones past a certain threshold.
I don't disagree with the answer from walther. I would go db or all in memory.
Why are you so concerned about saving memory? 600 MB is not that much. Are you going to be running on machines with less than 2 GB of memory?
Load it into a dictionary with the statistics line as the key and, as the value, a class with two properties: request and response. Dictionary is fast. LINQ is powerful and fast.
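Something along these lines (names are made up):

using System.Collections.Generic;
using System.Linq;

// Sketch of the suggestion above: statistics line as key, request/response as the value.
public class LogEntry
{
    public string Request { get; set; }
    public string Response { get; set; }
}

public class LogStore
{
    private readonly Dictionary<string, LogEntry> _entries = new Dictionary<string, LogEntry>();

    public void Add(string statistics, string request, string response) =>
        _entries[statistics] = new LogEntry { Request = request, Response = response };

    // Example of the kind of filtering the datagrid could drive with LINQ.
    public IEnumerable<KeyValuePair<string, LogEntry>> Filter(string keyword) =>
        _entries.Where(e => e.Key.Contains(keyword));
}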
Greetings,
I've been working on a C#.NET app that interacts with a data logger. The user can query and obtain logs for a specified time period, and view plots of the data. Typically a new data log is created every minute and stores a measurement for a few parameters. To get meaningful information out of the logger, a reasonable number of logs need to be acquired - data for at least a few days. The hardware interface is a UART to USB module on the device, which restricts transfers to a maximum of about 30 logs/second. This becomes quite slow when reading in the data acquired over a number of days/weeks.
What I would like to do is improve the perceived performance for the user. I realize that with the hardware speed limitation the user will have to wait for the full download cycle at least the first time they acquire a larger set of data. My goal is to cache all data seen by the app, so that it can be obtained faster if ever requested again. The approach I have been considering is to use a light database, like SqlServerCe, that can store the data logs as they are received. I am then hoping to first search the cache prior to querying a device for logs. The cache would be updated with any logs obtained by the request that were not already cached.
Finally my question - would you consider this to be a good approach? Are there any better alternatives you can think of? I've tried to search SO and Google for reinforcement of the idea, but I mostly run into discussions of web request/content caching.
Thanks for any feedback!
Seems like a very reasonable approach. Personally I'd go with SQL CE for storage, make sure you index the column holding the datetime of the record, then use TableDirect on the index for getting and inserting data so it's blazing fast. Since your data is already chronological there's no need to get any slow SQL query processor involved, just seek to the date (or the end) and roll forward with a SqlCeResultSet. You'll end up being speed limited only by I/O. I profiled doing really, really similar stuff on a project and found TableDirect with SQLCE was just as fast as a flat binary file.
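A rough sketch of that TableDirect pattern (table, index, and column names here are placeholders you would adapt to your own schema):

using System;
using System.Data;
using System.Data.SqlServerCe;

// Sketch: read cached logs via TableDirect, assuming a table "Logs" with a datetime
// column "LoggedAt" covered by an index "IX_Logs_LoggedAt" (all names are made up).
static void ReadLogsSince(string connectionString, DateTime start)
{
    using (var connection = new SqlCeConnection(connectionString))
    {
        connection.Open();
        using (var command = connection.CreateCommand())
        {
            command.CommandType = CommandType.TableDirect;
            command.CommandText = "Logs";
            command.IndexName = "IX_Logs_LoggedAt";   // seek on the datetime index

            using (var resultSet = command.ExecuteResultSet(ResultSetOptions.None))
            {
                // Position on the first record at or after 'start', then roll forward.
                if (resultSet.Seek(DbSeekOptions.AfterEqual, start))
                {
                    while (resultSet.Read())
                    {
                        var loggedAt = resultSet.GetDateTime(resultSet.GetOrdinal("LoggedAt"));
                        // ... hand the record to the plotting/UI layer ...
                    }
                }
            }
        }
    }
}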
I think you're on the right track wanting to store it locally in some queryable form.
I'd strongly recommend SQLite. There's a .NET class here.
Question: I currently store ASP.NET application data in XML files.
Now the problem is that I have asynchronous operations, which means I ran into the problem of simultaneous write access on an XML file...
Now, I'm considering moving to an embedded database to solve the issue.
I'm currently considering SQLite and embeddable Firebird.
I'm not sure, however, whether SQLite or Firebird can handle multiple concurrent writes.
And I certainly don't want the same problem again.
Does anybody know?
SQLite certainly is better known, but which one is better - SQLite or Firebird? I tend to say Firebird, but I don't really know.
No MS Access or MS SQL Express recommendations please, I'm a sane person.
I will choose Firebird for many reasons, and for this too:
Although it is transactional, SQLite does not support concurrent transactions, so if your embedded application needs two or more connections, they must be serialized.
An embedded Firebird database is simple to upgrade to a fully shared database - just change the shared library.
Maybe you can also check this.
SQLite can be configured to gracefully handle simultaneous writes in most situations. What happens is that when one thread or process begins a write to the db, the file is locked. When the second write is attempted, and encounters the lock, it backs off for a short period before attempting the write again, until it succeeds or times out. The timeout is configurable, but otherwise all this happens without the application code having to do anything special except enabling the option, like this:
// set SQLite to wait and retry for up to 100ms if database locked
sqlite3_busy_timeout( db, 100 );
All this works very well and without any difficulty, except in two circumstances:
If an application does a great many writes, say a thousand inserts, all in one transaction, then the database will be locked up for a significant period and can cause problems for any other application attempting to write. The solution is to break up such large writes into separate transactions, so other applications can get access to the database (see the sketch at the end of this answer).
If the database is shared by different processes running on different machines, sharing a network-mounted disk. Many operating systems have bugs in network-mounted disks that make file locking unreliable. There is no answer to this. If you need to share a db on a network-mounted disk, you need another database engine such as MySQL.
I do not have any experience with Firebird. I have used SQLite in situations like this for many applications over several years.
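As an example of breaking up a big write from C#, here is a sketch assuming the System.Data.SQLite wrapper (table/column names and the chunk size are just for illustration):

using System.Collections.Generic;
using System.Data.SQLite;

// Sketch: commit a large set of inserts in smaller transactions so the database
// file is not locked for the whole run and other writers get a turn in between.
static void InsertInChunks(string connectionString, IList<string> rows, int chunkSize = 100)
{
    using (var connection = new SQLiteConnection(connectionString))
    {
        connection.Open();
        for (int i = 0; i < rows.Count; i += chunkSize)
        {
            using (var transaction = connection.BeginTransaction())
            {
                for (int j = i; j < rows.Count && j < i + chunkSize; j++)
                {
                    using (var command = new SQLiteCommand(
                        "INSERT INTO Items (Value) VALUES (@value)", connection, transaction))
                    {
                        command.Parameters.AddWithValue("@value", rows[j]);
                        command.ExecuteNonQuery();
                    }
                }
                transaction.Commit();   // releases the write lock between chunks
            }
        }
    }
}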
Have you looked into Berkeley DB with the SQLite API for SQL support?
It sounds like SQLite will be a good fit. We use SQLite in a number of production apps; it supports, in fact it prefers, transactions, which go a long way toward handling concurrency.
transactional sqlite? in C#
I would add #3 to the list from ravenspoint above: if you have a large call-center or order-processing center, say, where dozens of people might be hitting the SAVE button at the same time, even if each is updating or inserting just one record, you can run into problems using the busy timeout approach.
For scenario #3, a true SQL engine that can serialize is ideal; less ideal but serviceable is a dbms that can do byte-range record locking of a shared-file. But be aware that even a byte-range record lock will be inadequate for a large number of concurrent writes when new records are appended to the end of the file like a caboose on the end of a freight train, so that multiple processes are trying at the same time to set a lock on the same byte-range. On the other hand, a byte-range record locking scheme coupled with a hashed-key sparse file approach (e.g. the old Revelation/OpenInsight database for LANs) will be far superior to ISAM for this scenario.
I'm developing a service that needs to be scalable on the Windows platform.
Initially it will receive approximately 50 connections per second (each connection will send approximately 5 KB of data), but it needs to scale to receive more than 500 in the future.
It's impractical (I guess) to save the received data to a common database like Microsoft SQL Server.
Is there another solution to save the data, considering that it will receive more than 6 million "records" per day?
There are 5 steps:
Receive the data via http handler (c#);
Save the received data; <- HERE
Request the saved data to be processed;
Process the requested data;
Save the processed data. <- HERE
My pre-solution is:
Receive the data via http handler (c#);
Save the received data to Message Queue;
Request the saved data from the message queue to be processed, using a Windows service;
Process the requested data;
Save the processed data to Microsoft SQL Server (here's the bottleneck);
6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day - do you expect traffic to be "bursty"?
I wouldn't personally use a message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk. Producers will just dump records to be written into the queue.
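A minimal sketch of that producer/consumer setup (class and file names are mine; a real version would need shutdown and error handling):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Many request threads enqueue records; a single background task appends them to a file.
public class DiskLogger
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

    public DiskLogger(string path)
    {
        Task.Run(() =>
        {
            using (var writer = new StreamWriter(path, append: true))
            {
                // GetConsumingEnumerable blocks until producers add records.
                foreach (var record in _queue.GetConsumingEnumerable())
                {
                    writer.WriteLine(record);
                }
            }
        });
    }

    // Called from the HTTP handler threads; just dumps the record into the queue.
    public void Enqueue(string record) => _queue.Add(record);

    // Signal the writer to finish once all producers are done.
    public void Complete() => _queue.CompleteAdding();
}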
Have a separate batch task which will insert a bunch of records into the database at a time.
Benchmark the optimal (or at least a "good") number of records to batch upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog), so that you don't wait for both file access and the database at the same time.
I suggest that you do some tests nice and early, to see what the database can cope with (and letting you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.
If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.
Why not do this:
1.) Receive data
2.) Process data
3.) Save the original and processed data at once
That would save you the trouble of requesting it again if you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure that your inserts are as cheap as possible. If that isn't possible, then queuing up the work makes some sense. I wouldn't use a message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, assuming you're not writing a ton of data in each record.
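As one way to keep those inserts cheap (just a sketch; table and column names are placeholders, and SqlBulkCopy is only one option), you could batch the original and processed data together and push each batch in a single call:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Sketch: save the original and processed data side by side, one bulk copy per batch.
public class ProcessedRecord
{
    public string Original { get; set; }
    public string Processed { get; set; }
}

public static class RecordSaver
{
    public static void SaveBatch(string connectionString, IEnumerable<ProcessedRecord> records)
    {
        var table = new DataTable();
        table.Columns.Add("Original", typeof(string));
        table.Columns.Add("Processed", typeof(string));
        foreach (var record in records)
            table.Rows.Add(record.Original, record.Processed);

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "Records";
            bulkCopy.ColumnMappings.Add("Original", "Original");
            bulkCopy.ColumnMappings.Add("Processed", "Processed");
            bulkCopy.WriteToServer(table);   // one round trip for the whole batch
        }
    }
}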