Trying to optimize I/O for MongoDB - c#

I have an updater script that runs every few hours for various regions on a gaming server. I am looking to run this script more frequently and add more regions. Ideally I would love to spread the load of the CPU and I/O as evenly as possible. I used to run this script using mysql, but now the website uses mongodb for everything, so it kinda made sense to move the updater scripts to mongodb too. I am having really high I/O spikes when mongodb flushes all of the updates to the database.
The script is written in C#, although I don't think that's too relevant. More important is that we are doing about 500k to 1.2 million updates each time one of these scripts runs. We have done some small optimizations in the code and with indexes, but at this point we are stuck on how to optimize the actual MongoDB settings.
Some other important information is that we do something like this
update({'someIdentifier':1}, $newDocument)
instead of this:
$set : { internalName : 'newName' }
Not sure if this is a lot worse in performance than doing $set or not.
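To make the distinction concrete, here is roughly what the two styles look like with a current version of the official MongoDB C# driver (collection and field names are placeholders, and this only illustrates the two styles, not their relative cost - that's what testing is for):

using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var players = client.GetDatabase("game").GetCollection<BsonDocument>("players");
var filter = Builders<BsonDocument>.Filter.Eq("someIdentifier", 1);

// Style 1: replace the whole document.
var newDocument = new BsonDocument { { "someIdentifier", 1 }, { "internalName", "newName" } };
players.ReplaceOne(filter, newDocument);

// Style 2: targeted $set, touching only the named field.
var update = Builders<BsonDocument>.Update.Set("internalName", "newName");
players.UpdateOne(filter, update);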
What can we do to try and spread the load out? I can assign more memory to the VM if that will help as well.
I am happy to provide more information.

Here are my thoughts:
1) Properly explain your performance concerns.
So far I can't really figure out what the issue is or if you have one at all. As far as I can tell you're doing around a GB of updates and are writing about a GB of data to the disk... not much of a shock.
Oh, and do some damn testing - "Not sure if this is a lot worse in performance than doing $set or not" - why don't you know? What do your tests say?
2) Check to see if there is any hardware mismatch.
Is your disk just slow? Is your working set bigger than RAM?
3) Ask on mongo-user and other MongoDB specific communities...
...simply because you might get a better answer there than the lack of answers here.
4) Consider trying TokuMX.
Wait what? Didn't I just accuse the last guy of suggesting that basically spamming his own product?
Sure, it's a new product that's only very recently been introduced into Mongo (it appears to have had a MySQL version for a bit longer), but the fundamentals seem sound. In particular it's very good at being fast at not only insertions, but updates/deletions too. It does this by not needing to actually go and make the changes to the document in question - instead it stores the insertion/update/deletion message in a buffered queue as part of the index structure. As the buffer fills up it applies these changes in bulk, which is massively more efficient in terms of I/O. On top of that, it uses compression when storing data, which should additionally reduce I/O - there's physically less to write.
The biggest disadvantage I can see so far is that its best performance is seen with big data - if your data fits into RAM then it loses to B-trees in a bunch of tests. Still fast, but not as fast.
So yeah, it's very new and I would not trust it for anything without testing, and even then only for non-mission-critical stuff, but it might be what you're looking for. And TBH, as it's just a new index/storage sub-system... it fits the bill of being an optimisation for MongoDB rather than a separate product. Especially since index/storage systems in MongoDB have always been a bit simple - 'let's use memory-mapped files for caching' etc.


Thread Affinity

I have a multithreaded program which consist of a C# interop layer over C++ code.
I am setting threads affinity (like in this post) and it works on part of my code, however on the second part it doesn't work. Can the Intel Compiler / IPP / MKL libs / inline assembly interfere with external affinity settings?
UPDATE:
I can't post code as it is a whole environment with many, many DLLs. I set environment values: OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, IPP_NUM_THREADS=1. When it runs in a single thread, it runs OK, but when I use a number of C# threads and set affinity per thread (on a quad-core machine), the initialization goes fine on separate cores, but during processing all threads start using the same core. Hope I am clear enough.
Thanks.
We've had this exact problem; we'd set our thread affinity to what we wanted, and the IPP/MKL functions would blow that away! The answer to your question is 'yes'.
Auto Parallelism
The issue is that, by default, the Intel libraries like to automatically execute multi-threaded versions of the routines. So a single FFT gets computed by a number of threads set up by the library specifically for this purpose.
Intel's intent is that the programmer could get on with the job of writing a single threaded application, and the library would allow that single thread to benefit from a multicore processor by creating a number of threads for the maths work. A noble intent (your source code then need know nothing about the runtime hardware to extract the best achievable performance - handy sometimes), but a right bloody nuisance when one is doing one's own threading for one's own reasons.
Controlling the Library's Behaviour
Take a look at these Intel docs, section Support Functions / Threading Support Functions. You can either programmatically control the library's threading tendencies, or there are some environment variables you can set (like MKL_NUM_THREADS) before your program runs. Setting the number of threads was (as far as I recall) enough to stop the library doing its own thing.
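For example, something along these lines (a sketch only - the exact DLL names depend on the MKL/IPP versions you link against, so treat the P/Invoke declarations as assumptions to verify against the Intel docs):

using System;
using System.Runtime.InteropServices;

static class IntelThreadingControl
{
    // Entry point names follow Intel's documented C functions; the DLL names are assumptions.
    [DllImport("mkl_rt.dll", EntryPoint = "MKL_Set_Num_Threads")]
    static extern void MklSetNumThreads(int numThreads);

    [DllImport("ippcore.dll", EntryPoint = "ippSetNumThreads")]
    static extern int IppSetNumThreads(int numThreads);

    public static void ForceSingleThreaded()
    {
        MklSetNumThreads(1);  // keep MKL maths on the calling thread
        IppSetNumThreads(1);  // keep IPP routines on the calling thread
    }
}

// Alternatively, set the environment variables before the libraries initialise:
//   OMP_NUM_THREADS=1  MKL_NUM_THREADS=1  IPP_NUM_THREADS=1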
Philosophical Essay Inspired By Answering Your Question (best ignored)
More or less everything Intel is doing in CPU design and software (e.g. IPP/MKL) is aimed at making it unnecessary for the programmer to Worry About Threads. You want good math performance? Use MKL. You want that for loop to go fast? Turn on Auto Parallelisation in ICC. You want to make the best use of cache? That's what Hyperthreading is for.
It's not a bad approach, and personally speaking I think that they've done a pretty good job. AMD too. Their architectures are quite good at delivering good real world performance improvements to the "Average Programmer" for the minimal investment in learning, re-writing and code development.
Irritation
However, the thing that irritates me a little bit (though I don't want to appear ungrateful!) is that whilst this approach works for the majority of programmers out there (which is where the profitable market is), it just throws more obstacles in the way of those programmers who want to spin their own parallelism. I can't blame Intel for that of course, they've done exactly the right thing; they're a market led company, they need to make things that will sell.
By offering these easy features the situation of there being too many under skilled and under trained programmers becomes more entrenched. If all programmers can get good performance without having to learn what auto parallelism is actually doing, then we'll never move on. The pool of really good programmers who actually know that stuff will remain really small.
Problem
I see this as a problem (though only a small one, I'll explain later). Computing needs to become more efficient for both economic and environmental reasons. Intel's approach allows for increased performance, and better silicon manufacturing techniques produces lower power consumption, but I always feel like it's not quite as efficient as it could be.
Example
Take the Cell processor at the heart of the PS3. It's something that I like to witter on about endlessly! However, IBM developed that with a completely different philosophy to Intel. They gave you no cache (just some fast static RAM instead to use as you saw fit), the architecture was pretty much pure NUMA, you had to do all your own parallelisation, etc. etc. The result was that if you really knew what you were doing you could get about 250 GFLOPS out of the thing (I think the non-PS3 variants went to 320 GFLOPS), for 80 watts, all the way back in 2005.
It's taken Intel chips about another 6 or 7 years for a single device to get to that level of performance. That's a lot of Moore's law growth. If the Cell got manufactured on Intel's latest silicon fab and was given as many transistors as Intel put into their big Xeons, it would still blow everything else away.
No Market
However, apart from the PS3, Cell was a non-starter as a market proposition. IBM decided that it would never be a big enough seller to be worth their while. There just weren't enough programmers out there who could really use it, and indulging the few of us who could makes no commercial sense, which wouldn't please the shareholders.
Small Problem, Bigger Problem
I said earlier that this was only a small problem. Well, most of the world's computing isn't about high maths performance; it's Facebook, Twitter, etc. That sort is all about I/O performance, and for that you don't need high maths performance. So in that sense the dependence on Intel Doing Everything For You so that the average programmer can get good math performance matters very little. There's just not enough maths being done to warrant a change in design philosophy.
In fact, I strongly suspect that the world will ultimately decide that you don't need a large chip at all; an ARM should do just fine. If that does come to pass then the market for Intel's very large chips with very good general purpose maths compute performance will vanish. Effectively those of us who want good maths performance are being heavily subsidised by those who want to fill enormous data centres with Intel based hardware and put Intel PCs on every desktop.
We're simply lucky that Intel apparently has a desire to make sure that every big CPU they build is good at maths regardless of whether most of their users actually use that maths performance. I'm sure that desire has its foundations in marketing prowess and wanting the bragging rights, but those are not hard, commercially tangible artifacts that bring shareholder value.
So if those data centre guys decide that, actually, they'd rather save electricity and fill their data centres with ARMs, where does that leave Intel? ARMs are fine devices for the purpose for which they're intended, but they're not at the top of my list of Supercomputer chips. So where does that leave us?
Trend
My take on the current market trend is that 'Workstations' (PCs as we call them now) are going to start costing lots and lots of money, just like they did in the 1980s / early 90s.
I think that better supercomputers will become unaffordable because no one can spare the $10 billion it would take to do the next big chip. If people stop having PCs there won't be a mass market for large all-out GPUs, so we won't even be able to use those instead. They're an exclusive thing, but supercomputers do play a vital role in our world and we do need them to get better. So who is going to pay for that? Not me, that's for sure.
Oops, that went on for quite a while...

MsgSetRequest - Can I throw too much data at it?

I'm working on a C# library project that will process transactions between SQL and QuickBooks Enterprise, keeping both data stores in sync. This is great and all, but the initial sync is going to be a fairly large set of transactions. Once the initial sync is complete, transactions will sync as needed for the remainder of the life of the product.
At this point, I'm fairly familiar with the SDK using QBFC, as well as all of the various resources and sample code available via the OSR, the ZOMBIE project by Paul Keister (thanks, Paul!) and others. All of these resources have been a huge help. But one thing I haven't come across yet is whether there is a limit on, or a substantial or deadly performance cost associated with, sending large amounts of data via a single Message Set Request. As I understand it, the database on QuickBooks' end is just a SQL database as well, but I don't want to make any assumptions.
Again, I just need to hit this hard once, so I don't want to engineer a separate solution to do the import. This also affords me an opportunity to test a copy of live data against my library, logs and all.
For what it's worth, this is my first ever post on Stack, so feel free to educate me on posting here if I've steered off course in any way. Thanks.
For what it's worth, I found that in a network environment (as opposed to everything happening on 1 box) it's better to have a larger MsgSetRequest as opposed to a smaller one. Of course everything has its limits, and maybe I just never hit it. I don't remember exactly how big the request set was, but it was big. The performance improvement was easily 10 to 1 or better.
If I was you, I'd build some kind of iteration into my design from the beginning (to iterate through your SQL data set). Start with a big number that will do it all at once, and if that breaks just scale it back until you find something that works.
I know this answer doesn't have the detail you're looking for, but hopefully it will help.
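As a rough sketch of that iteration idea (nothing QBFC-specific here - sendBatch stands in for "build one MsgSetRequest from these rows and call DoRequests", and the starting batch size is an arbitrary guess):

using System;
using System.Collections.Generic;

static class BatchedSync
{
    // Split the SQL rows into batches; each batch would become one message set request.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Start big; if a batch fails, halve the batch size and retry.
    // (A real sync would also track which batches already succeeded.)
    public static void SyncAll<T>(IList<T> rows, Action<List<T>> sendBatch, int initialBatchSize = 50000)
    {
        int batchSize = initialBatchSize;
        while (true)
        {
            try
            {
                foreach (var batch in Batch(rows, batchSize))
                    sendBatch(batch);
                return;
            }
            catch (Exception)
            {
                if (batchSize <= 1) throw;
                batchSize /= 2;   // scale back until something works
            }
        }
    }
}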

SQL Performance, .Net Optimizations vs Best Practices

I need confirmation/explanation from you pros/gurus with the following because my team is telling me "it doesn't matter" and it's frustrating me :)
Background: We have a SQL Server 2008 that is being used by our main MVC3 / .Net4 web app. We have about 200+ concurrent users at any given point. The server is being hit EXTREMELY hard (locks, timeouts, overall slowness) and I'm trying to apply things I learned throughout my career and at my last MS certification class. They are things we've all been drilled on ("close SQL connections STAT") and I'm trying to explain to my team that these 'little things', though no single one alone makes a difference, add up in the end.
I need to know if the following do have a performance impact or if it's just 'best practice'
1. Using "USING" keyword.
Most of their code is like this:
public string SomeMethod(string x, string y) {
    SomethingDataContext dc = new SomethingDataContext();
    var result = dc.StoredProcedure(x, y);
    // ... dc is never disposed
}
While I'm trying to tell them that USING closes/frees up resources faster:
using (SomethingDataContext dc = new SomethingDataContext()) {
    var result = dc.StoredProcedure(x, y);
}
Their argument is that the GC does a good enough job cleaning up after the code is done executing, so USING doesn't have a huge impact. True or false and why?
2. Connection Pools
I always heard setting up connection pools can significantly speed up any website (at least .Net w/ MSSQL).
I recommended we add the following to our connectionstrings in the web.config:
..."Pooling=True;Min Pool Size=3;Max Pool Size=100;Connection
Timeout=10;"...
Their argument is that .Net/MSSQL already sets up the connection pools behind the scenes, so it's not necessary to put this in our web.config. True or false? Why does every other site say pooling should be added for optimal performance if it's already set up?
3. Minimize # of calls to DB
The Role/Membership provider that comes with the default .Net MVC project is nice - it's handy and does most of the legwork for you. But these guys are making serious use of UsersInRoles() and use it freely like a global variable (it hits the DB every time this method is called).
I created a "user object" that loads all the roles upfront on every pageload (along with some other user stuff, such as GUIDs, etc) and then query this object to check whether the user has a given role.
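Something along these lines - a simplified sketch using the built-in Roles provider (the cache key and class name are invented):

using System;
using System.Linq;
using System.Web;
using System.Web.Security;

// Caches the current user's roles for the lifetime of the request,
// so repeated role checks on one page don't each hit the database.
public static class CurrentUserRoles
{
    public static bool IsInRole(string role)
    {
        var ctx = HttpContext.Current;
        var roles = ctx.Items["__userRoles"] as string[];
        if (roles == null)
        {
            roles = Roles.GetRolesForUser();   // one DB hit per request
            ctx.Items["__userRoles"] = roles;
        }
        return roles.Contains(role, StringComparer.OrdinalIgnoreCase);
    }
}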
Other parts of the website have FOR statements that loop over 200 times and do 20-30 sql queries on every pass = over 4,000 database calls. It somehow does this in a matter of seconds, but what I want to do is consolidate the 20-30 DB calls into one, so that it makes ONE call 200 times (each loop).
But because SQL profiler says the query took "0 seconds", their argument is that it's so fast and small that the server can handle this high number of DB queries.
My thinking is "yeah, these queries are running fast, but they're killing the overall SQL server's performance."
Could this be a contributing factor? Am I worrying about nothing, or is this a (significant) contributing factor to the server's overall performance issues?
4. Other code optimizations
The first one that comes to mind is using StringBuilder vs a simple string variable. I understand why I should use StringBuilder (especially in loops), but they say it doesn't matter - even if they need to write 10k+ lines, their argument is that the performance gain doesn't matter.
So all-in-all, are all the things we learn and have drilled into us ("minimize scope!") just 'best practice' with no real performance gain or do they all contribute to a REAL/measurable performance loss?
EDIT***
Thanks guys for all your answers! I have a new (5th) question based on your answers:
They in fact do not use "USING", so what is actually happening? If connection pooling is happening automatically, are they tying up connections from the pool until the GC comes around? Is it possible that each open connection to the SQL server is adding a little more burden to the server and slowing it down?
Based on your suggestions, I plan on doing some serious benchmarking/logging of connection times, because I suspect that a) the server is slow, b) they aren't closing connections, and c) since Profiler says the queries run in 0 seconds, the slowness might be coming from the connections.
I really appreciate your help guys. Thanks again.
Branch the code, make your changes & benchmark+profile it against the current codebase. Then you'll have some proof to back up your claims.
As for your questions, here goes:
You should always manually dispose of classes which implement IDisposable; the GC won't actually call Dispose. If the class also implements a finalizer, the GC will run the finalizer, but in most implementations finalizers only clean up unmanaged resources.
It's true that the .NET framework already does connection pooling. I'm not sure what the defaults are, but the connection string values are just there to allow you to alter them.
The execution time of the SQL statement is only part of the story. In SQL Profiler all you will see is how long the database engine took to execute the query; what you're missing is the time it takes the web server to connect to the database server and receive the results. So while each query may be quick, you can save a lot of I/O and network latency by batching queries.
This one is a good one to do some profiling on, to demonstrate the extra memory used by concatenation compared with StringBuilder.
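A quick-and-dirty timing harness along these lines (not a rigorous benchmark, just enough to make the point) would be:

using System;
using System.Diagnostics;
using System.Text;

// Rough comparison: 10,000 appends via string concatenation vs. StringBuilder.
var sw = Stopwatch.StartNew();
string s = "";
for (int i = 0; i < 10000; i++)
    s += "line " + i + Environment.NewLine;
sw.Stop();
Console.WriteLine("string concat: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
var sb = new StringBuilder();
for (int i = 0; i < 10000; i++)
    sb.Append("line ").Append(i).AppendLine();
string s2 = sb.ToString();
sw.Stop();
Console.WriteLine("StringBuilder: " + sw.ElapsedMilliseconds + " ms");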
Oye. For sure, you can't let GC close your database connections for you. GC might not happen for a LONG time...sometimes hours later. It doesn't happen right away as soon as a variable goes out of scope. Most people use the IDisposable using() { } syntax, which is great, but at the very least something, somewhere needs to be calling connection.Close()
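For example (the connection string and procedure name are placeholders):

using System.Data;
using System.Data.SqlClient;

// The connection goes back to the pool as soon as the using block exits,
// instead of whenever the GC happens to finalize an orphaned connection.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.SomeStoredProcedure", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    conn.Open();
    cmd.ExecuteNonQuery();
}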
Objects that implement IDisposable and hold on to unmanaged resources also implement a finalizer that will ensure Dispose is called during GC. The problem is when that happens: the GC can take a long time to get around to it, and you might need those resources back before then. using makes the call to Dispose as soon as you are done with the object.
You can modify the pooling parameters in the web.config, but pooling is on by default now, so if you leave the default parameters you are not gaining anything.
You not only have to think about how long the query takes to execute, but also about the connection time between the application server and the database; even if they are on the same computer it adds overhead.
StringBuilder won't affect performance in most web applications; it only matters if you are concatenating to the same string a great many times. But I think it's a good idea to use it anyway, since it's easier to read.
I think that you have two separate issues here.
Performance of your code
Performance of the SQL Server database
SQL Server
Do you have any monitoring in place for SQL Server? Do you know specifically what queries are being run that cause the deadlocks?
I would read this article on deadlocks and consider installing the brilliant sp_whoisactive to find out what is really going on in your SQL Server. You might also consider installing sp_Blitz by Brent Ozar. This should give you an excellent idea of what is going on in your database and give you the tools to fix that problem first.
Other code issues
I can't really comment on the other code issues off the top of my head. So I would look at SQL server first.
Remember:
1. Monitor
2. Identify problems
3. Profile
4. Fix
5. Go to 1
Well, I'm not a guru, but I do have a suggestion: if they say you're wrong, tell them, "Prove it! Write me a test! Show me that 4000 calls are just as fast as 200 calls and have the same impact on the server!"
Ditto the other things. If you're not in a position to make them prove you right, prove them wrong, with clear, well-documented tests that show that what you're saying is right.
If they're not open even to hard evidence, gathered from their own server, with code they can look at and inspect, then you may be wasting your time on that team.
At the risk of just repeating what others here have said, here's my 2c on the matter
Firstly, you should pick your battles carefully...I wouldn't go to war with your colleagues on all 4 points because as soon as you fail to prove one of them, it's over, and from their perspective they're right and you're wrong.
Also bear in mind that no-one likes to be told their beautiful code is an ugly baby, so I assume you'll be diplomatic - don't say "this is slow", say "I found a way to make this even faster"... (Of course your team could be perfectly reasonable, so I'm basing that on my own experience as well :) So you need to pick one of the 4 areas above to tackle first.
My money is on #3.
1, 2 and 4 can make a difference, but in my own experience, not that much - but what you described in #3 sounds like death by a thousand papercuts for the poor old server! The queries probably execute fast because they're parameterised so they're cached, but you need to bear in mind that "0 seconds" in the profiler could be 900 milliseconds, if you see what I mean. Add that up over many queries and things start getting slow. This could also be a primary source of the locks, because if each of these nested queries is hitting the same table over and over, no matter how fast it runs, with the number of users you mentioned it's certain you will have contention.
Grab the SQL and run it in SSMS, but include Client Statistics so you can see not only the execution time but also the amount of data being sent back to the client; that will give you a clearer picture of what sort of overhead is involved.
Really the only way you can prove any of this is to setup a test and measure as others have mentioned, but also be certain to also run some profiling on the server as well - locks, IO queues, etc, so that you can show that not only is your way faster, but that it places less load on the server.
To touch on your 5th question - I'm not sure, but I would guess that any SqlConnection that's not auto-disposed (via using) is counted as still "active" and is not available from the pool any more. That being said - the connection overhead is pretty low on the server unless the connection is actually doing anything - but you can again prove this by using the SQL Performance counters.
Best of luck with it, can't wait to find out how you get on.
I recently was dealing with a bug in the interaction between our web application and our email provider. When an email was sent, a protocol error occurred. But not right away.
I was able to determine that the error only occurred when the SmtpClient instance was closed, which was occurring when the SmtpClient was disposed, which was only happening during garbage collection.
And I noticed that this often took two minutes after the "Send" button was clicked...
Needless to say, the code now properly implements using blocks for both the SmtpClient and MailMessage instances.
Just a word to the wise...
1 has been addressed well above (I agree that disposing properly is a good practice, and have found it worthwhile).
2 is a bit of a hold-over from previous versions of ODBC wherein SQL Server connections were configured independently with regards to pooling. It used to be non-default; now it's default.
As to 3 and 4, 4 isn't going to affect your SQL Server's performance - StringBuilder might help speed the process within the UI, certainly, which may have the effect of closing off your SQL resources faster, but they won't reduce the load on the SQL Server.
3 sounds like the most logical place to concentrate, to me. I try to close off my database connections as quickly as possible, and to make the fewest calls possible. If you're using LINQ, pull everything into an IQueryable or something (list, array, whatever) so that you can manipulate it & build whatever UI structures you need, while releasing the connection prior to any of that hokum.
All of that said, it sounds like you need to spend some more quality time with the profiler. Rather than looking at the amount of time each execution took, look at the processor & memory usage. Just because they're fast doesn't mean they're not "hungry" executions.
The using clause is just syntactic sugar; you are essentially doing:
try
{
    resource.DoStuff();
}
finally
{
    resource.Dispose();
}
Dispose is probably going to get called anyway when the object is garbage collected, but only if the framework programmers did a good job of implementing the disposable pattern. So the arguments against your colleagues here are:
i) if we get into the habit of utilizing using we make sure to free unmanaged resources, because not all framework programmers are smart enough to implement the disposable pattern.
ii) yes, the GC will eventually clean that object up, but it may take a while, depending on how old the object is; gen 2 collections in particular happen relatively infrequently.
So in short:
1. See above.
2. Yes, pooling is set to true by default, with a max pool size of 100.
3. You are correct; definitely the best area to push on for improvements.
4. Premature optimization is the root of all evil. Get #1 and #3 in first. Use SQL Profiler and db-specific methods (add indexes, defragment them, monitor deadlocks etc.).
5. Yes, it could be. The best way is to measure it - look at the perf counter SQLServer: General Statistics - User Connections; here is an article describing how to do it.
Always measure your improvements, don't change the code without evidence!

Database recommendations needed -> Columnar, Embedded (if possible)

EDIT: As a result of the answers so far, I would like to add more focus to what I want to zero in on: a database that allows writing in-memory (could be simple C# code) with persistence-to-storage options, so that the data can be accessed from within R. Redis so far looks the most promising. I am also considering using something like Lockfree++ or ZeroMQ in order to avoid writing data to the database concurrently, instead sending all data to be persisted over a message bus/other implementation and having one "actor" handle all write operations to an in-memory db or other solution. Any more ideas aside from Redis (some mentioned SQLite, and I still need to test its performance)? Any other suggestions?
I am searching for the ideal database structure/solution that meets most of my below requirements but so far I utterly failed. Can you please help?
My tasks: I run a process in .Net 4.5 (C#) and generate (generally) value types that I want to use for further analysis in other applications, and therefore would like to either preserve in memory or persist on disk. More below. The data is generated within different tasks/threads, and thus a row-based data format does not lend itself well to this situation (because the data generated in different threads is generated at different times and is thus not aligned). Thus I thought a columnar data structure may be suitable, but please correct me if I am wrong.
Example:
Tasks/Thread #1 generates the following data at given time stamps
datetime.ticks / value of output data
1000000001 233.23
1000000002 233.34
1000000006 234.23
...
Tasks/Thread #2 generates the following data at given time stamps
datetime.ticks / value of output data
1000000002 33.32
1000000005 34.34
1000000015 54.32
...
I do not need to align the time stamps at .Net run-time; I am first and foremost after preserving the data so that I can process it within R or Python at a later point.
My requirements:
Fast writes, fast writes, fast writes: it can happen that I generate 100,000-1,000,000 data points per second and need to persist (worst case) or retain the data in memory. It's OK to run the writes on their own thread so this process can lag the data generation process, but the limitation is 16 GB RAM (64-bit code); more below.
Preference is for columnar db format as it lends itself well to how I want to query the data later but I am open to any other structure if it makes sense in regards to the examples above (document/key-value also ok if all other requirements are met, especially in terms of write speed).
API that can be referenced from within .Net. Example: HDF5 may be considered capable by some, but I find its .Net port horrible. Something that supports .Net a little better would be a plus, but if all other requirements are met then I can deal with something similar to the HDF5 .Net port.
Concurrent writes if possible: as described earlier, I would like to write data concurrently from different tasks/threads.
I am constrained by 16 GB of memory (the .Net process runs in 64-bit) and thus I am probably looking for something that is not purely in-memory, as I may sometimes generate more data than that. Something in-memory which persists at times, or a pure persistence model, is probably preferable.
Preference for embedded but if a server in a client/server solution can run as a windows service then no issue.
In terms of data access I have a strong preference for a db solution for which interfaces from R and Python already exist, because I would like to use the pandas library within Python for time series alignments and other analysis, and run analyses within R.
If the API/library supports in addition SQL/SQL-like/Linq/ like queries that would be terrific but generally I just need the absolute bare bones such as load columnar data in between start and end date (given the "key"/index is in such format) because I analyze and run queries within R/Python.
If it comes with a management console or data visualizer that would be a plus but not a must.
Should be open source or priced within "reach" (no, KDB does not qualify in that regard ;-)
OK, here is what I have so far, and again it's all I've got, because most db solutions already fail on the write performance requirement:
Infobright and Db4o: I like what I have read so far, but I admit I have not checked into any performance stats.
Something done myself: I can easily store value types in binary format and index the data by datetime.ticks; I would just need to somehow write scripts to load/deserialize the data in Python/R. But it would be a massive task if I wanted to add concurrency, a query engine, and other goodies. Thus I am looking for something already out there.
I can't comment -- low rep (I'm new here) -- so you get a full answer instead...
First, are you sure you need a database at all? If fast write speed and portability to R are your biggest concerns, have you considered a flat-file mechanism? According to your comments you're willing to batch writes out, but you need persistence; if those were my requirements I'd write a straight-to-disk buffering system that was lightning fast, then build a separate task that periodically took the disk files and moved them into a data store for R - and that's only if R reading the flat files wasn't sufficient in the first place.
If you can do alignment after-the-fact, then you could write the threads to separate files in your main parallel loop, cutting each file off every so often, and leave the alignment and database loading to the subprocess.
So (in crappy pseudocode), build a thread process that you'd call with a BackgroundWorker or some such, and include a threadname string uniquely identifying each worker and thus each filestream (task/thread):
file_name = threadname + '0001.csv' // or something
open(file_name for writing)
while(generating_data) {
    generate_data()
    while (buffer_not_full and very_busy) {
        write_data_to_buffer
        generate_data()
    }
    flush_buffer_to_disk(file_name)
    if(file is big enough or enough time has passed or we're not too busy) {
        close(file_name)
        move(file_name to bob's folder)
        increment file_name
        open(file_name for writing)
    }
}
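A rough C# translation of that idea might look like this (file naming, buffer size and roll-over threshold are all made-up numbers):

using System;
using System.IO;
using System.Text;

// Each worker thread owns one of these; no locking is needed because nothing is shared.
class ThreadLocalCsvWriter : IDisposable
{
    readonly string threadName;
    int fileIndex;
    StreamWriter writer;

    public ThreadLocalCsvWriter(string threadName)
    {
        this.threadName = threadName;
        OpenNextFile();
    }

    void OpenNextFile()
    {
        fileIndex++;
        // 64 KB buffer: StreamWriter batches the small writes before touching the disk.
        writer = new StreamWriter(string.Format("{0}_{1:D4}.csv", threadName, fileIndex),
                                  false, Encoding.ASCII, 64 * 1024);
    }

    public void Write(long ticks, double value)
    {
        writer.WriteLine("{0},{1}", ticks, value);
        if (writer.BaseStream.Position > 50 * 1024 * 1024)   // roll over at ~50 MB written
        {
            writer.Dispose();
            OpenNextFile();   // the closed file can now be moved and bulk-loaded elsewhere
        }
    }

    public void Dispose()
    {
        writer.Dispose();
    }
}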
Efficient and speedy file I/O and buffering is a straightforward and common problem. Nothing is going to be faster than this. Then you can just write another process to do the database loads and not sweat the performance there:
while(file_name in list of files in bob's folder sorted by date for good measure)
{
read bob's file
load bob's file to database
align dates, make pretty
}
And I wouldn't write that part in C#, I'd batch script it and use the database's native loader which is going to be as fast as anything you can build from scratch.
You'll have to make sure the two loops don't interfere much if you're running on the same hardware. That is, run the task threads at a higher priority, or build in some mutex or performance limiters so that the database load doesn't hog resources while the threads are running. I'd definitely segregate the database server and hardware so that file I/O to the flat files isn't compromised.
FIFO queues would work if you're on Unix, but you're not. :-)
Also, hardware is going to have more of a performance impact for you than the database engine, I'd imagine. If you're on a budget I'm guessing you're on COTS hardware, so springing for a solid state drive may up performance fairly cheaply. As I said, separating the DB storage from the flat file storage would help, and the CPU/RAM for R, the Database, and your Threads should all be segregated ideally.
What I'm saying is that choice of DB vendor probably isn't your biggest issue, unless you have a lot of money to spend. You'll be hardware bound most of the time otherwise. Database tuning is an art, and while you can eke out minor performance gains at the top end, having a good database administrator will keep most databases in the same ballpark for performance. I'd look at what R and Python support well and that you're comfortable with. If you think in columnar fashion then look at R and C#'s support for Cassandra (my vote), Hana, Lucid, HBase, Infobright, Vertica and others, and pick one based on price and support. For traditional databases on a single commodity machine, I haven't seen anything that MySQL can't handle.
This is not to answer my own question but to keep track of all data bases which I tested so far and why they have not met my requirements (yet): each time I attempted to write 1 million single objects (1 long, 2 floats) to the database. For ooDBs, I stuck the objects into a collection and wrote the collection itself, similar story for key/value such as Redis but also attempted to write simple ints (1mil) to columnar dbs such as InfoBright.
Db4o: awfully slow writes. 1 million objects within a collection took about 45 seconds. I later optimized the collection structure and also wrote each object individually; not much love here.
InfoBright: Same thing, very slow in terms of write speed, which surprised me quite a bit as it organizes data in columnar format but I think the "knowledge tree" only kicks in when querying data rather than when saving flat data structures/tables-like structures.
Redis (through BookSleeve): Great API for .Net, full Redis functionality (though there are a couple of drawbacks to running the server on Windows machines vs. a Linux or Unix box). Performance was very fast... north of 1 million items per second. I serialized all objects using Protocol Buffers (protobuf-net; both BookSleeve and protobuf-net are written by Marc Gravell). I still need to play a lot more with the library, but R and Python both have full access to the Redis DB, which is a big plus. Love it so far. The async framework that Marc wrote around the Redis base functions is awesome, really neat, and it works so far. I want to spend a little more time experimenting with the Redis lists/collection types as well, as so far I have only serialized to byte arrays.
SqLite: I ran purely in-memory and managed to write 1 million value type elements in around 3 seconds. Not bad for a pure RDBMS, obviously the in-memory option really speeds things up. I only created one connection, one transaction, created one command, one parameter, and simply adjusted the value of the parameter within a loop and ran the ExecuteNonQuery on each iteration. The transaction commit was then run outside the loop.
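Roughly, the pattern was the following (a sketch assuming the System.Data.SQLite provider; the table and column names are placeholders):

using System.Data;
using System.Data.SQLite;

// One connection, one transaction, one command; only the parameter values change per iteration.
using (var conn = new SQLiteConnection("Data Source=:memory:"))
{
    conn.Open();
    using (var create = new SQLiteCommand("CREATE TABLE ticks (t INTEGER, v REAL)", conn))
        create.ExecuteNonQuery();

    using (var tx = conn.BeginTransaction())
    using (var cmd = new SQLiteCommand("INSERT INTO ticks (t, v) VALUES (@t, @v)", conn, tx))
    {
        var pT = cmd.Parameters.Add("@t", DbType.Int64);
        var pV = cmd.Parameters.Add("@v", DbType.Double);
        for (int i = 0; i < 1000000; i++)
        {
            pT.Value = 1000000000L + i;
            pV.Value = 233.23;
            cmd.ExecuteNonQuery();
        }
        tx.Commit();   // a single commit outside the loop
    }
}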
HDF5: Though there is a .Net port and there also exists a library to somehow work with HDF5 files out of R, I strongly discourage anyone from doing so. It's a pure nightmare. The .Net port is very badly written; heck, the whole HDF5 concept is more than questionable. It's a very old and, in my opinion, outgrown solution to store vectorized/columnar data. This is 2012, not 1995. If one cannot completely delete datasets and vectors out of the file in which they were stored, then I do not call that an annoyance but a major design flaw. The API in general (not just .Net) is very badly designed and written imho; there are tons of class objects that nobody, without having spent hours and hours studying the file structure, understands how to use. I think that is somewhat evidenced by the very sparse amount of documentation and example code that is out there. Furthermore, the h5r R library is a drama, an absolute nightmare. It is badly written as well (often the file is not correctly closed upon writing due to a faulty flush, and it corrupts files), and the library has issues even being properly installed on 32-bit OSs... and it goes on and on. I write the most about HDF5 because I spent the most of my time on this piece of .... and ended up with the most frustration. The idea of a fast columnar file storage system, accessible from R and .Net, was enticing, but it just does not deliver what it promised in terms of API integration and usability, or lack thereof.
Update: I ditched testing velocityDB simply because there does not seem to be any adapter available to access the db from within R. I currently contemplate writing my own GUI with a charting library, which would access the generated data either from a written binary file or have it sent over a broker-less message bus (ZeroMQ) or sent through LockFree++ to an "actor" (my GUI). I could then call R from within C# and have results returned to my GUI. That would possibly allow me the most flexibility and freedom, but would obviously also be the most tedious to code. I am running into more and more limitations during my tests, and with each db test I warm to this idea more and more.
RESULT: Thanks for the participation. In the end I awarded the bounty points to Chipmonkey because he suggested partly what I considered important points to the solution to my problem (though I chose my own, different solution in the end).
I ended up with a hybrid between Redis in-memory storage and direct calls out of .Net to the R.dll. Redis allows access to its data, stored in memory, by different processes. This makes it a convenient solution to quickly store the data as key/value in Redis and then access the same data from R. Additionally I directly send data and invoke functions in R through its .dll and the excellent R.Net library. Passing a collection of 1 million value types to R takes about 2.3 seconds on my machine, which is fast enough given that I get the convenience of just passing in the data, invoking computational functions within R from the .Net environment, and getting the results back sync or async.
Just a note: I once had a similar problem posted by a fellow in a Delphi forum. I was able to help him with a simple ID-key-value database backend I wrote at that time (kind of a NoSQL engine). Basically, it uses a B-Tree to store triplets (32-bit ObjectID, 32-bit PropertyKey, 64-bit Value). I managed to save about 500k values/sec in real time (about 5 years ago). Of course, the data was indexed on all three values (ID, property-ID and value). You could optimize this by ignoring the value index.
The source I still have is in Delphi, but I would think about implementing something like that using C#. I cannot tell you whether it will meet your needs for performance, but if all else fails, give it a try. Using a buffered write should also drastically improve performance.
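A toy sketch of that layout in C# might start like this (SortedDictionary is a red-black tree rather than a disk-backed B-Tree, so this only illustrates the ordered triplet layout, not the persistence or the buffered writes):

using System;
using System.Collections.Generic;

struct Triplet
{
    public uint ObjectId;     // 32-bit object ID
    public uint PropertyKey;  // 32-bit property key
    public long Value;        // 64-bit value
}

class TripletStore
{
    // Ordered on (ObjectId, PropertyKey); Tuple comparison is structural, so the
    // default comparer gives the same ordering a B-Tree index over the pair would.
    readonly SortedDictionary<Tuple<uint, uint>, long> index =
        new SortedDictionary<Tuple<uint, uint>, long>();

    public void Put(Triplet t)
    {
        index[Tuple.Create(t.ObjectId, t.PropertyKey)] = t.Value;
    }

    public bool TryGet(uint objectId, uint propertyKey, out long value)
    {
        return index.TryGetValue(Tuple.Create(objectId, propertyKey), out value);
    }
}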
I would go with a combination of persistent storage (I personally prefer db4o, but you can use files as well, as mentioned above) and storing objects in memory this way:
use BlockingCollection<T> to store objects in memory (I believe you will achieve better performance than 1,000,000/s storing objects in memory), and then have one or more processing threads which will consume the objects and store them into the persistent database
// Producing thread
for (int i = 0; i < 1000000; i++)
    blockingCollection.Add(myObject);

// Consuming threads
while (true)
{
    var myObject = blockingCollection.Take();
    db4oSession.Store(myObject); // or write it to the files or whatever
}
BlockingCollection pretty much solves the producer-consumer workflow, and if you use multiple instances of them with AddToAny/TakeFromAny you can reach any kind of multithreaded performance.
Each consuming thread could have a different db4o session (file) to reach the desired performance (db4o is single-threaded).
Since you want to use ZeroMQ why not use memcache over Redis?
ZeroMQ offers no persistence as far as I know. Memcache also offers no persistence and is a bit faster than Redis.
Or perhaps the other way, if you use Redis why not use beanstalk MQ?
If you want to use Redis (for the persistence) you might want to switch from ZeroMQ to beanstalk MQ (also a fast in memory queue, but also has persistence via logging). Beanstalk also has C# libs.

What to make parallel? What will make me better? (.net Web Business Application, MVC+SL)

I'm working on a web application framework, which uses MSSQL for data storage, mostly just does CRUD operations (but on arbitrarly complex structures), provides a WCF interface for rich Silverlight admin and has an MVC3 display (and some basic forms like user settings, etc).
It's getting quite good at being able to load, display, edit and save any (reasonably) complex data structure, in a user-friendly way.
But, I'm looking towards the future, and want to expand my capabilities (and it would be fun to learn new things along the way as well...) - so I've decided (in light of what's coming for C#5...) to try to get some parallel/async optimization in... Now, I haven't even learned TPL and PLINQ yet, so I'm happy for any advice there as well.
So my question is, what are possible areas where parallel processing maybe of help, and where does TPL and PLinq help me on that?
My gut tells me I could try saving branches of a data structure to the database in parallel (this is where I'd expect the biggest performance gain), and I could perform some complex operations (file upload, mail sending maybe?) in a multithreaded environment, etc. Can I build complex SL UI views in parallel on the client? (Creating 60 data-bound fields on a view can cause "blinking"...) Can I create partial views (menus, category trees, search forms, etc) in MVC at once?
ps: If this turns into "Tell me everything about parallel stuffs" thread, I'm happy to make it community-wiki...
Remember that an asp.net web application is intrinsically a parallel application in any case. Requests can be serviced in parallel and this will all be managed by the asp.net framework. So there are two cases:
You have lots of users all hitting the site at once. In which case the parallel processing capability of the server is probably being used to capacity in any case.
You don't have lots of users all hitting the site at once. In which case the server is probably quite capable of dealing with the responses, without parallel processing, in a suitably fast response time.
Any time you start thinking about optimising something just because it might be fun, or because you just think you should make stuff faster, then you are almost certainly guilty of premature optimization. Your efforts could almost certainly be better spent enriching the functionality of the framework, rather than making what is probably a plenty fast enough solution a little bit faster (at the cost of significantly increased complexity).
In answer to the question of where can TPL and PLINQ really help. In my opinion the main advantage of these technologies is in places in the application where you really do have a lot of long running blocking processes. For example if you have a situation where you call out several times to an external web service - it can be a significant advantage to make these calls in parallel. I would strongly question whether writing to a local database - or even a database on a different box on a local network would count as being a long running blocking process to the extent that this kind of parallelisation is of any significant value.
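For instance, something like this on .NET 4 (the URLs are placeholders; once C#5 arrives the same idea is expressed more naturally with async/await):

using System;
using System.Net;
using System.Threading.Tasks;

// Three slow, I/O-bound external calls issued concurrently instead of one after another.
var urls = new[] { "http://service-a/api", "http://service-b/api", "http://service-c/api" };

var tasks = Array.ConvertAll(urls, url => Task.Factory.StartNew(() =>
{
    using (var client = new WebClient())
        return client.DownloadString(url);
}));

Task.WaitAll(tasks);   // total time is roughly the slowest call, not the sum of all three
foreach (var t in tasks)
    Console.WriteLine(t.Result.Length);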
Pretty much all the examples you list fall into the category of getting the PC to do something in parallel that it was previously doing in sequence. How many CPUs are on your server - how many are really free when the website is under load? Making something parallel does not necessarily equate to making it faster, unless the process involved has some measure of time when your PC is sitting around doing nothing, waiting for an external event.
The first question is to ask the users / testers which bits seem slow. The only way to know for sure what's slowing you down is to use a profiler like dotTrace. The results are sometimes surprising.
If you do find something, parallel processing may not be the answer. You need to remember that there is an overhead in splitting tasks up, so if the task is fairly quick in the first place, it could end up being slower. You also have to consider the added complexity, e.g. what happens if half a task succeeds and half fails? (Although TPL and PLINQ shield you from this to an extent.)
Have fun, but I'm wondering whether this is a case of 1) a solution chasing a problem, and 2) premature optimization.
