Memcached + batch data loading + replication + load balancing, any existing solutions? - c#

Here is my scenario:
A dozen clients reading from a memcached-like store.
Read-only access
50K gets/sec
99.999% availability
300 million records, 100 bytes each
If one of the stores goes down, the system should automatically switch to another replica. When it is time for an update, the system should be able to quickly reload data without affecting clients.
Is there an existing solution that satisfies these requirements? I have already evaluated memcached and Velocity, and reviewed a bunch of other projects (anti-rdbms-a-list-of-distributed-key-value-stores). I would prefer something that runs on Windows x64, but wouldn't shy away from *nix if something supports my requirements out of the box. Paid products are OK. Quality is very important; I can't rely on half-baked betas.
Thanks!

Take a look at Velocity. Right now it is CTP 3, so it may or may not violate your half-baked betas requirement. Check out their blog for perf numbers. Looks promising.

Related

Database recommendations needed -> Columnar, Embedded (if possible)

EDIT: As a result of the answers so far, I would like to add more focus on what I want to zero in on: a database that allows writing in-memory (could be simple C# code) with persistence-to-storage options, so the data can be accessed from within R. Redis looks the most promising so far. I am also considering using something similar to Lockfree++ or ZeroMQ, to avoid writing data to the database concurrently; instead, all data to be persisted would be sent over a message bus/other implementation, and one "actor" would handle all write operations to an in-memory db or other solution. Any more ideas aside from Redis? (Some mentioned SQLite, and I still need to test its performance.) Any other suggestions?
I am searching for the ideal database structure/solution that meets most of my requirements below, but so far I have utterly failed. Can you please help?
My task: I run a process in .Net 4.5 (C#) and generate (generally) value types that I want to use for further analysis in other applications, and therefore want to either preserve in memory or persist to disk. More below. The data is generated in different tasks/threads, so a row-based data format does not match this situation well (because the data generated in different threads is produced at different times and is thus not aligned). I therefore thought a columnar data structure may be suitable, but please correct me if I am wrong.
Example:
Tasks/Thread #1 generates the following data at given time stamps
datetime.ticks / value of output data
1000000001 233.23
1000000002 233.34
1000000006 234.23
...
Tasks/Thread #2 generates the following data at given time stamps
datetime.ticks / value of output data
1000000002 33.32
1000000005 34.34
1000000015 54.32
...
I do not need to align the time stamps at .Net run-time; I am first and foremost after preserving the data so I can process it within R or Python at a later point.
My requirements:
Fast writes, fast writes, fast writes: it can happen that I generate 100,000-1,000,000 data points per second and need to persist (worst case) or retain the data in memory. It's OK to run the writes on their own thread so this process can lag the data-generation process, but the limitation is 16 GB RAM (64-bit code); more below.
Preference is for columnar db format as it lends itself well to how I want to query the data later but I am open to any other structure if it makes sense in regards to the examples above (document/key-value also ok if all other requirements are met, especially in terms of write speed).
An API that can be referenced from within .Net. Example: HDF5 may be considered capable by some, but I find its .Net port horrible. Something that supports .Net a little better would be a plus, but if all other requirements are met then I can deal with something similar to the HDF5 .Net port.
Concurrent writes if possible: as described earlier, I would like to write data concurrently from different tasks/threads.
I am constrained to 16 GB memory (the .Net process runs in 64-bit), so I am probably looking for something that is not purely in-memory, as I may sometimes generate more data than that. Something in-memory that persists at times, or a pure persistence model, is probably preferable.
Preference for embedded but if a server in a client/server solution can run as a windows service then no issue.
In terms of data access, I have a strong preference for a db solution for which interfaces from R and Python already exist, because I would like to use the pandas library within Python for time-series alignments and other analysis, and run analyses within R.
If the API/library additionally supports SQL/SQL-like/Linq-like queries, that would be terrific, but generally I just need the absolute bare bones, such as loading columnar data between a start and end date (given the "key"/index is in such format), because I analyze and run queries within R/Python.
If it comes with a management console or data visualizer that would be a plus but not a must.
Should be open source or priced within "reach" (no, KDB does not qualify in that regard ;-)
OK, here is what I have so far, and again it's all I've got, because most db solutions simply fail already on the write-performance requirement:
Infobright and db4o. I like what I have read so far, but I admit I have not checked any performance stats.
Something done myself. I can easily store value types in binary format and index the data by datetime.ticks; I would just need to somehow write scripts to load/deserialize the data in Python/R. But it would be a massive task if I wanted to add concurrency, a query engine, and other goodies. Thus I am looking for something already out there.
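The roll-your-own route in the last option is actually only a few lines for the storage part. A minimal sketch (the file naming and the fixed 16-byte record layout of 8-byte ticks plus 8-byte value are my own assumptions), which R/Python could read back with `readBin` or `numpy.fromfile`:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TickLog
{
    // One file per producing thread; each record is 8-byte ticks + 8-byte value.
    public static void Append(string path, IEnumerable<KeyValuePair<long, double>> records)
    {
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var bw = new BinaryWriter(fs))
        {
            foreach (var r in records)
            {
                bw.Write(r.Key);    // datetime.ticks
                bw.Write(r.Value);  // observed value
            }
        }
    }

    // Read back only records whose ticks fall in [fromTicks, toTicks].
    public static List<KeyValuePair<long, double>> ReadRange(
        string path, long fromTicks, long toTicks)
    {
        var result = new List<KeyValuePair<long, double>>();
        using (var br = new BinaryReader(File.OpenRead(path)))
        {
            while (br.BaseStream.Position < br.BaseStream.Length)
            {
                long t = br.ReadInt64();
                double v = br.ReadDouble();
                if (t >= fromTicks && t <= toTicks)
                    result.Add(new KeyValuePair<long, double>(t, v));
            }
        }
        return result;
    }
}
```

As the question notes, this covers persistence and range loads but none of the concurrency or query-engine goodies.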
I can't comment -- low rep (I'm new here) -- so you get a full answer instead...
First, are you sure you need a database at all? If fast write speed and portability to R are your biggest concerns, have you considered a flat-file mechanism? According to your comments you're willing to batch writes out, but you need persistence; if those were my requirements I'd write a straight-to-disk buffering system that was lightning fast, then build a separate task that periodically took the disk files and moved them into a data store for R, and that's only if R reading the flat files wasn't sufficient in the first place.
If you can do alignment after-the-fact, then you could write the threads to separate files in your main parallel loop, cutting each file off every so often, and leave the alignment and database loading to the subprocess.
So (in crappy pseudo_code), build a thread process that you'd call with backgroundworker or some such and include a threadname string uniquely identifying each worker and thus each filestream (task/thread):
file_name = threadname + '0001.csv'   // or something
open(file_name for writing)

while (generating_data) {
    generate_data()
    while (buffer_not_full and very_busy) {
        write_data_to_buffer
        generate_data()
    }
    flush_buffer_to_disk(file_name)
    if (file is big enough or enough time has passed or we're not too busy) {
        close(file_name)
        move(file_name to bob's folder)
        increment file_name
        open(file_name for writing)
    }
}
Efficient and speedy file I/O and buffering is a straightforward and common problem. Nothing is going to be faster than this. Then you can just write another process to do the database loads and not sweat the performance there:
while (file_name in list of files in bob's folder, sorted by date for good measure)
{
    read bob's file
    load bob's file to database
    align dates, make pretty
}
And I wouldn't write that part in C#, I'd batch script it and use the database's native loader which is going to be as fast as anything you can build from scratch.
You'll have to make sure the two loops don't interfere much if you're running on the same hardware. That is, run the task threads at a higher priority, or build in some mutex or performance limiters so that the database load doesn't hog resources while the threads are running. I'd definitely segregate the database server and hardware so that file I/O to the flat files isn't compromised.
FIFO queues would work if you're on Unix, but you're not. :-)
Also, hardware is going to have more of a performance impact for you than the database engine, I'd imagine. If you're on a budget I'm guessing you're on COTS hardware, so springing for a solid state drive may up performance fairly cheaply. As I said, separating the DB storage from the flat file storage would help, and the CPU/RAM for R, the Database, and your Threads should all be segregated ideally.
What I'm saying is that the choice of DB vendor probably isn't your biggest issue, unless you have a lot of money to spend. You'll be hardware-bound most of the time otherwise. Database tuning is an art, and while you can eke out minor performance gains at the top end, having a good database administrator will keep most databases in the same ballpark for performance. I'd look at what R and Python support well and what you're comfortable with. If you think in columnar fashion, then look at R's and C#'s support for Cassandra (my vote), Hana, Lucid, HBase, Infobright, Vertica and others, and pick one based on price and support. For traditional databases on a single commodity machine, I haven't seen anything that MySQL can't handle.
This is not to answer my own question but to keep track of all the databases I have tested so far and why they have not met my requirements (yet). Each time, I attempted to write 1 million single objects (1 long, 2 floats) to the database. For ooDBs, I stuck the objects into a collection and wrote the collection itself; similar story for key/value stores such as Redis, but I also attempted to write simple ints (1 mil) to columnar dbs such as InfoBright.
db4o: awfully slow writes: 1 mil objects within a collection took about 45 seconds. I later optimized the collection structure and also wrote each object individually; not much love here.
InfoBright: Same thing, very slow write speed, which surprised me quite a bit as it organizes data in columnar format, but I think the "knowledge tree" only kicks in when querying data rather than when saving flat data structures/table-like structures.
Redis (through BookSleeve): Great API for .Net, full Redis functionality (though there are a couple of drawbacks to running the server on Windows machines vs. a Linux or Unix box). Performance was very fast: north of 1 million items per second. I serialized all objects using Protocol Buffers (protobuf-net; both libraries are written by Marc Gravell). I still need to play a lot more with the library, but R and Python both have full access to the Redis db, which is a big plus. Love it so far. The async framework that Marc wrote around the Redis base functions is awesome, really neat, and it works so far. I want to spend a little more time experimenting with the Redis list/collection types as well, as so far I have only serialized to byte arrays.
SQLite: I ran purely in-memory and managed to write 1 million value-type elements in around 3 seconds. Not bad for a pure RDBMS; obviously the in-memory option really speeds things up. I created only one connection, one transaction, one command and one parameter, and simply adjusted the value of the parameter within a loop, running ExecuteNonQuery on each iteration. The transaction commit was then run outside the loop.
HDF5: Though there is a .Net port, and there also exists a library to somehow work with HDF5 files out of R, I strongly discourage anyone from doing so. It's a pure nightmare. The .Net port is very badly written; heck, the whole HDF5 concept is more than questionable. It's a very old and, in my opinion, outgrown solution for storing vectorized/columnar data. This is 2012, not 1995. If one cannot completely delete datasets and vectors from the file in which they were stored, then I do not call that an annoyance but a major design flaw. The API in general (not just .Net) is very badly designed and written imho; there are tons of class objects that nobody, without having spent hours and hours studying the file structure, understands how to use. I think that is somewhat evidenced by the very sparse documentation and example code out there. Furthermore, the h5r R library is a drama, an absolute nightmare. It's badly written as well (often the file is not correctly closed upon writing due to a faulty flush, and it corrupts files), and the library has issues even being properly installed on 32-bit OSs... and it goes on and on. I write the most about HDF5 because I spent the most time on this piece of .... and ended up with the most frustration. The idea of a fast columnar file-storage system accessible from R and .Net was enticing, but it just does not deliver what it promised in terms of API integration and usability, or lack thereof.
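For anyone wanting to reproduce the SQLite timing above, here is a sketch of the described pattern: one connection, one transaction, one reused parameterized command, and a single commit outside the loop. It assumes the System.Data.SQLite ADO.NET provider, and the table/column names are illustrative:

```csharp
using System;
using System.Data;
using System.Data.SQLite; // assumes the System.Data.SQLite ADO.NET provider

static class SqliteBulk
{
    // Inserts `count` rows with one prepared command whose parameter
    // values change inside the loop; returns the resulting row count.
    public static long InsertTicks(string connectionString, int count)
    {
        using (var conn = new SQLiteConnection(connectionString))
        {
            conn.Open();
            using (var ddl = new SQLiteCommand(
                "CREATE TABLE ticks (t INTEGER, v REAL)", conn))
                ddl.ExecuteNonQuery();

            using (var tx = conn.BeginTransaction())
            using (var cmd = new SQLiteCommand(
                "INSERT INTO ticks VALUES (@t, @v)", conn, tx))
            {
                var pT = cmd.Parameters.Add("@t", DbType.Int64);
                var pV = cmd.Parameters.Add("@v", DbType.Double);
                for (int i = 0; i < count; i++)
                {
                    pT.Value = (long)i;
                    pV.Value = i * 0.1;
                    cmd.ExecuteNonQuery();
                }
                tx.Commit(); // single commit, outside the loop
            }

            using (var check = new SQLiteCommand(
                "SELECT COUNT(*) FROM ticks", conn))
                return (long)check.ExecuteScalar();
        }
    }
}
```

`SqliteBulk.InsertTicks("Data Source=:memory:", 1000000)` reproduces the in-memory run; point the connection string at a file to measure the persisted case.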
Update: I ditched testing VelocityDB simply because there does not seem to be any adapter available to access the db from within R. I am currently contemplating writing my own GUI with a charting library, which would access the generated data either from a written binary file or have it sent over a broker-less message bus (ZeroMQ) or through Lockfree++ to an "actor" (my GUI). I could then call R from within C# and have results returned to my GUI. That would possibly allow me the most flexibility and freedom, but would obviously also be the most tedious to code. I am running into more and more limitations during my tests, and with each db test I befriend this idea more and more.
RESULT: Thanks for the participation. In the end I awarded the bounty points to Chipmonkey because he suggested partly what I considered important points to the solution to my problem (though I chose my own, different solution in the end).
I ended up with a hybrid between Redis in-memory storage and direct calls out of .Net to the R.dll. Redis allows access to its data, stored in memory, by different processes. This makes it a convenient solution for quickly storing the data as key/value in Redis and then accessing the same data from R. Additionally, I directly send data and invoke functions in R through its .dll and the excellent R.Net library. Passing a collection of 1 million value types to R takes about 2.3 seconds on my machine, which is fast enough given that I get the convenience of just passing in the data, invoking computational functions within R from the .Net environment, and getting the results back sync or async.
Just a note: I once saw a similar problem posted by a fellow in a Delphi forum. I was able to help him with a simple ID-key-value database backend I wrote at the time (kind of a NoSQL engine). Basically, it uses a B-tree to store triplets (32-bit ObjectID, 32-bit PropertyKey, 64-bit Value). I managed to save about 500k values/sec in real time (about 5 years ago). Of course, the data was indexed on all three values (ID, property-ID and value). You could optimize this by ignoring the value index.
The source I still have is in Delphi, but I would think about implementing something like that in C#. I cannot tell you whether it will meet your performance needs, but if all else fails, give it a try. Using buffered writes should also drastically improve performance.
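A minimal C# analogue of that triplet design, with the 32-bit ObjectID and PropertyKey packed into one 64-bit key and the value index skipped as suggested (SortedDictionary stands in here for the on-disk B-tree):

```csharp
using System;
using System.Collections.Generic;

class TripletStore
{
    // (32-bit ObjectID, 32-bit PropertyKey) packed into one 64-bit key.
    // A real engine would page this structure to disk with buffered writes.
    private readonly SortedDictionary<ulong, long> index =
        new SortedDictionary<ulong, long>();

    private static ulong Pack(uint objectId, uint propertyKey) =>
        ((ulong)objectId << 32) | propertyKey;

    public void Put(uint objectId, uint propertyKey, long value) =>
        index[Pack(objectId, propertyKey)] = value;

    public bool TryGet(uint objectId, uint propertyKey, out long value) =>
        index.TryGetValue(Pack(objectId, propertyKey), out value);
}
```

Because the dictionary is sorted on the packed key, all properties of one ObjectID are adjacent, which makes per-object range scans cheap.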
I would go with a combination of persistent storage (I personally prefer db4o, but you can use files as well, as mentioned above) and storing objects in memory this way:
use BlockingCollection<T> to store objects in memory (I believe you will achieve better than 1,000,000/s performance storing objects in memory), and then have one or more processing threads that consume the objects and store them in the persistent database
// Producing thread
for (int i = 0; i < 1000000; i++)
    blockingCollection.Add(myObject);

// Consuming threads
while (true)
{
    var myObject = blockingCollection.Take();
    db4oSession.Store(myObject); // or write it to files, or whatever
}
BlockingCollection pretty much solves the producer-consumer workflow, and if you use multiple instances of it with AddToAny/TakeFromAny you can reach any kind of multithreaded performance.
Each consuming thread could have a different db4o session (file) to reach the desired performance (db4o is single-threaded).
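A sketch of that multi-instance variant, using the static BlockingCollection<T>.AddToAny so the producer fans items out across bounded buffers, with one consumer per buffer (each consumer would own its own db4o session/file; the persistence call is left as a comment):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class MultiQueue
{
    // Produces n items fanned out across `consumers` bounded buffers;
    // returns how many items the consumer tasks drained.
    public static long Run(long n, int consumers)
    {
        var buffers = new BlockingCollection<long>[consumers];
        for (int i = 0; i < consumers; i++)
            buffers[i] = new BlockingCollection<long>(boundedCapacity: 10000);

        long drained = 0;
        var tasks = new Task[consumers];
        for (int i = 0; i < consumers; i++)
        {
            var own = buffers[i]; // each consumer drains its own buffer
            tasks[i] = Task.Run(() =>
            {
                foreach (var item in own.GetConsumingEnumerable())
                {
                    // db4oSession.Store(item) for this consumer's own file
                    Interlocked.Increment(ref drained);
                }
            });
        }

        for (long i = 0; i < n; i++)
            BlockingCollection<long>.AddToAny(buffers, i); // whichever has room

        foreach (var b in buffers) b.CompleteAdding();
        Task.WaitAll(tasks);
        return Interlocked.Read(ref drained);
    }
}
```

The bounded capacity applies back-pressure on the producer instead of letting memory grow without limit when the consumers fall behind.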
Since you want to use ZeroMQ, why not use memcached over Redis?
ZeroMQ offers no persistence as far as I know. Memcached also offers no persistence and is a bit faster than Redis.
Or perhaps the other way around: if you use Redis, why not use beanstalkd?
If you want to use Redis (for the persistence), you might want to switch from ZeroMQ to beanstalkd (also a fast in-memory queue, but one that also has persistence via logging). Beanstalkd also has C# client libraries.

Max Amount / LIMIT of ASP.NET sites on one server

My question is simple. About 2 years ago we began migrating to ASP.NET from ASP Classic.
Our issue is that we currently have about 350 sites on a server, and the server seems to be getting bogged down. We have been trying various things to improve performance (query optimizations, disabling ViewState, Session State, etc.) and they have all worked, but as we add more sites we use more of the server's resources, so the improvements we made in code are virtually erased.
Basically we're now at a tipping point; our CPUs currently average near 100%. Our IS department would like us to find new ways to rework the code on the sites to improve performance.
I have a theory, that we are simply at the limit on the amount of sites one server can handle.
Any ideas? Please only respond if you have a good idea of what you are talking about. I've heard a lot of people theorize about the situation. I need someone who has actual knowledge of what might be going on.
Here are the details.
250 ASP.NET Sites
250 Admin Sites (Written in ASP.NET, basically they are backend admin sites)
100 Classic ASP Sites
Running on a virtualized Windows Server 2003.
3 CPUs, 4 GB Memory.
Memory stays around 3 - 3.5 GB
CPUs spike very badly, sometimes remaining near 100% for short periods of time (30-180 seconds)
The database is on a separate server and is SQL SERVER 2005.
It looks like you've reached that point. You've optimised your apps, you've looked at server performance, you can see you are hitting peak memory usage, maxing out the CPU, and, let's face it, administering so many websites can't be easy.
Also, the spec of your VM isn't fantastic. Its memory, in particular, potentially isn't great for the number of sites you have.
You have plenty of reasons to move.
However, some things to look at:
1) How many of those 250 sites are actually used? Which ones are the peak performance offenders? Those ones are prime candidates for being moved off onto their own box.
2) How many are not used at all? Can you retire any?
3) You are running on a virtual machine. What kind of virtual machine platform are you using? What other servers are running on that hardware?
4) What kind of redundancy do you currently have? 250 sites on one box with no backup? If you have a backup server, you could use that to round robin requests, or as a web farm, sharing the load.
Let's say you decide to move. The first thing you should probably think about is how.
Are you going to simply halve the number of sites? 125 + admins on one box, 125 + admins on the other? Or are you going to move the most used?
Or you could have several virtual machines, all active, as part of a web farm or load balanced system.
By the sounds of things, though, there's a real resistance to buy more hardware.
At some point you are going to have to, though, as sometimes things just get old or get left behind. New servers have much more processing power and memory in the same space, and can be cheaper to run.
Oh, and one more thing. The cost of all those repeated optimizations and testing probably could easily be offset by buying more hardware. That's no excuse for not doing any optimization at all, of course, and I am impressed by the number of sites you are running, especially if you have a good number of users, but there is a balance, and I hope you can tilt towards the "more hardware" side of it some more.
I think you've answered your own question really. You've optimised the sites, you've got the database server on a different server. And you have 600 sites (250 + 250 + 100).
The answer is pretty clear to me. Buy a box with more memory and CPU power.
There is no real limit on the number of sites your server can handle, if all 600 sites had no users, you wouldn't have very much load on the server.
I think you might find a better answer at serverfault, but here are my 2 cents.
You can scale up or scale out.
Scale up -- upgrade the machine with more memory / more cores in the CPU.
Scale out -- distribute the load by splitting the sites across 2 or more servers. 300 on server A, 300 on server B, or 200 each across 3 servers.
As #uadrive mentions, this is an issue of load, not of # of sites.
Just thinking this through, it seems like you would be better off measuring the # of users hitting the server instead of # of sites. You could have 300 sites and only half are used. Knowing the usage would be better in my mind.
There's no simple formula answer, like "you can have a maximum of 47.3 sites per gig of RAM". You could surely maintain performance with many more sites if each site had only one user per day. There are likely servers that have only two sites but performance is terrible because each hit requires a massive database query.
In practice, the only way to approach this is empirically: When performance starts to degrade, you have a problem. The fact that somebody wrote in a book somewhere that a server with such-and-such resources should be able to support more sites is of little value if, in practice, YOUR server can't support YOUR sites and YOUR users.
Realistic options are:
(a) Optimize your code and database queries. You say you've already done that. Maybe you can do more. It's unlikely that your code is now the absolute best that it can possibly be, but it may well be that the effort to find further improvements will be hugely expensive.
(b) Buy a bigger server.
(c) Break your sites across multiple servers, and either update DNS or install a front-end to map requests to the correct server.
Maxing out on CPU use can be a good sign, in the sense that moving to a larger server, or dividing the sites between multiple servers, is likely to help.
There are many things you can do to help improve performance and scalability (in fact, I've written a book on this subject -- see my profile).
It's difficult to make meaningful suggestions without knowing much more about your apps, but here are a few quick tips that might help to get you started:
Multiple AppPools are expensive. How many sites do you have per AppPool? Combine multiple sites per AppPool if you can
Minimize client round-trips: improve client and proxy-level caching, offload static files to a CDN, use image sprites, merge multiple CSS and JS files
Enable output caching on pages and/or controls where possible
Enable compression for static files (more CPU use on first access, but less after that)
Avoid Session state altogether if you can (prefer cookies for state management). If you can't, then at least configure EnableSessionState="ReadOnly" for pages that don't need to write it, or "false" for pages that don't need it at all
Many things on the SQL Server side: caching, SqlCacheDependency, command batching, grouping multiple insert/update/deletes into a single transaction, using stored procedures instead of dynamic SQL, using async ADO.NET instead of LINQ or EF, make sure your DB logs are on separate spindles from data, etc
Look for algorithmic issues with your code; for example, hash tables are often better than linear searches, etc
Minimize cookie sizes, and only set cookies on pages, not on static content.
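As a reference for the session-state tip above, the read-only/off settings can be applied per page via the `@ Page` directive (`EnableSessionState="ReadOnly"`) or site-wide; a hypothetical web.config fragment:

```xml
<configuration>
  <system.web>
    <!-- Site-wide default. Pages that never touch Session can override
         with "false"; pages that only read it can use "ReadOnly". -->
    <pages enableSessionState="ReadOnly" />
  </system.web>
</configuration>
```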
In addition, using a VM is likely to cost you up to about 10% in performance -- make sure it's really worth that for what it buys you in terms of improved manageability.

Can Access 2007 work well with 30 parallel users?

Can Access 2007 work well with 30 parallel users through my C# program?
Thanks in advance.
Access is not very good for concurrent use. I have seen recommendations of a maximum of 10 people at one time.
To be honest, it depends on how these users are working and the load, but it is not designed for such use (it is designed to be a desktop database, not an enterprise database), so it may fail under such usage. Use a database designed for your scenario, something like MySQL or SQL Server Express if you want to avoid extra costs.
See this article on 15seconds for a discussion of the suitability (or lack thereof) of Access for concurrent usage.
The 255 limit in the Jet and ACE database engines is on concurrent connections, not users. The standard for interaction with a Jet/ACE data store is a single connection per user, opened and then re-used throughout the session; however, under normal usage Jet/ACE may open more than one connection per user, so 255 is not even a reliable theoretical limit on users.
Jet/ACE interacts with a data file, and maintains locking via its locking file (*.LDB). Contention for the data file and the LDB file can easily overwhelm the file system's ability to keep up, so in general, the practical limit on number of users is much lower than the 255 theoretical limit (you'll note that 255 is one less than a power of 2, hint, hint).
In real-world scenarios, a properly designed Access application with a Jet/ACE data store running on a reliable network and stored on a server with a native Windows file system can be quite stable into the 20-30 users range. But it depends on what those users are doing. The more that are read-only, the higher the number of simultaneous users that can be supported.
Experienced Access developers report engineering apps to work with as many as 100 simultaneous users, but at that point, you basically have to rewrite as an unbound app, and then you're giving up most of the advantages of Access as front end in order to nurse along a back end that is better used with a smaller user population.
My basic rule is that any time a user population reaches 15 simultaneous users, I start talking to the client about upsizing to SQL Server, not because it's required, but because they need to get used to the idea that as usage grows, they're going to need to upsize. Whether that happens at 15 users or 20 or 30 depends on the nature of the particular app. As I said above, if many of the users are read-only for most of their session, you have more headroom than if everybody is adding/updating records most of the time.
Given that a C# app is going to be an unbound app, I wouldn't think that 30 users should be terribly problematic, but I'm not a C# programmer. If it's new development and there's any possibility that the user population will grow beyond 30 users, it just seems like a no-brainer to me to build with a server back end instead of with Jet/ACE.
I never did it with 2007, but I had problems in the past with the XP version and only 3 users working 8 hours a day.
So, based on my previous experience, try to avoid it. Making your customer change their requirement will be easier than dealing with the problems that come from using Access in a parallel environment. After all, also based on my experience... your customer will be changing their requirements almost every week! :D
May the Force be with you.

Determining system requirements (hardware, processor & memory) for a batch based software application

I am tasked with building an application wherein the business users will be defining a number of rules for data manipulation & processing (e.g. taking one numerical value and splitting it equally amongst a number of records selected on the basis of the condition specified in the rule).
On a monthly basis, a batch application has to be run in order to process around half a million records as per the rules defined. Each record has around 100 fields. The environment is .NET, C# and SQL server with a third party rule engine
Could you please suggest how to go about defining and/or ascertaining what kind of hardware will be best suited if the requirement is to process the records within a timeframe of, say, around 8 to 10 hours? How will the specs vary if the user wants to increase or decrease the timeframe depending on hardware costs?
Thanks in advance
Abby
Create the application and profile it?
Step 0. Create the application. It is impossible to tell the real-world performance of a multi-computer system like you're describing from "paper" specifications. You need to try it and see what causes the biggest slowdowns. This is traditionally physical I/O, but not always.
Step 1. Profile with sample sets of data in an isolated environment. This is a gross metric. You're not trying to isolate what takes the time, just measuring the overall time it takes to run the rules.
What does isolated environment mean? You want to use the same sorts of network hardware between the machines, but do not allow any other traffic on that network segment. That introduces too many variables at this point.
What does profile mean? With current hardware, measure how long it takes to complete under the following circumstances. Write a program to automate the data generation.
Scenario 1. 1,000 of the simplest rules possible.
Scenario 2. 1,000 of the most complex rules you can reasonably expect users to enter.
Scenarios 3 & 4. 10,000 Simplest and most complex.
Scenarios 5 & 6. 25,000 Simplest and Most complex
Scenarios 7 & 8. 50,000 Simplest and Most complex
Scenarios 9 & 10. 100,000 Simplest and Most complex
Step 2. Analyze the data.
See if there are trends in completion time. Figure out whether they appear tied strictly to the volume of rules or whether complexity also factors in... I assume it will.
Develop a trend line that shows how long you can expect it to take if there are 200,000 or 500,000 rules. Perform another run at 200,000. See if the trend line is correct; if not, revise your method of developing the trend line.
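The trend line itself can be as simple as an ordinary least-squares fit of completion time against rule count (a sketch; if growth looks super-linear, fit against the log of the rule count instead):

```csharp
using System;

static class Trend
{
    // Ordinary least squares: time ≈ a + b * ruleCount.
    // Returns (intercept a, slope b).
    public static Tuple<double, double> Fit(double[] x, double[] y)
    {
        double mx = 0, my = 0;
        for (int i = 0; i < x.Length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.Length; my /= y.Length;

        double num = 0, den = 0;
        for (int i = 0; i < x.Length; i++)
        {
            num += (x[i] - mx) * (y[i] - my);
            den += (x[i] - mx) * (x[i] - mx);
        }
        double b = num / den;
        return Tuple.Create(my - b * mx, b);
    }

    // Predicted completion time at a given rule count.
    public static double Predict(Tuple<double, double> fit, double ruleCount) =>
        fit.Item1 + fit.Item2 * ruleCount;
}
```

Fit it to the measured runs (1,000 through 100,000 rules), call Predict at 200,000, and compare against the verification run described above.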
Step 3. Measure the database and network activity as the system processes the 20,000-rule sets. See if there is more activity with more rules. If so, the more you speed up the throughput to and from the SQL server, the faster it will run.
If these are "relatively low," then CPU and RAM speed are likely where you'll want to beef up the requested machines specification...
Of course, if all this testing is going to cost your employer more than buying the beefiest server hardware possible, just quantify the cost of the time spent testing vs. the cost of buying the best server and being done with it, and then only tweak your app and the SQL that you control to improve performance.
If this system is not the first of its kind, you can consider the following:
Re-use (after additional evaluation) hardware requirements from previous projects
Evaluate hardware requirements based on workload and hardware configuration of existing application
If that is not the case and performance requirements are very important, then the best way would be to create a prototype with, say, 10 rules implemented. Process the dataset using the prototype and extrapolate to a full rule set. Based on this information you should be able to derive initial performance and hardware requirements. Then you can fine tune these specifications taking into account planned growth in processed data volume, scalability requirements and redundancy.

Any real-world, enterprise-grade experience with Transactional NTFS (TxF)?

Background:
I am aware of this SO question about Transactional NTFS (TxF) and this article describing how to use it, but I am looking for real-world experience with a reasonably high-volume enterprise system where lots of blob data (say documents and/or photos) need to be persisted once transactionally and read many times.
We are expecting a few tens of thousands of documents written per day and reads of several tens of thousands per hour.
We could either store indexes within the file system or in SQL Server but must be able to scale this out over several boxes.
We must retain the ability to back up and restore the data easily for disaster recovery.
The Question:
Any real-world, enterprise-grade experience with Transactional NTFS (TxF)?
Related questions:
Anyone tried distributed transactions using TxF where the same file is committed to two mirror servers at once?
Anyone tried a distributed transaction with the file system and a database?
Any performance concerns/reliability concerns/performance data you can share?
Has anyone even done something on this scale before where transactions are a concern?
Edits: To be more clear, I have researched other technologies, including SQL Server 2008's new FILESTREAM data type, but this question is specifically targeted at the transactional file system only.
More Resources:
An MSDN Magazine article on TxF called "Enhance Your Apps With File System Transactions".
A webcast called "Transactional Vista: Kernel Transaction Manager and friends (TxF, TxR)". This video quotes an overhead from using TxF of 2-5%, with the performance discussion starting about 25 minutes in. This is the first set of hard numbers I've found. And the video is a very good overview of how this works under the hood. At about 34:30, the speaker describes a very similar scenario to this question.
A Channel 9 screencast called "Surendra Verma: Vista Transactional File System". He talks about performance starting around 35 minutes in. No hard numbers.
A list of TxF articles on the B# .NET Blog.
A Channel 9 screencast called "Transactional NTFS".
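For readers who have not used TxF, the write-once-transactionally pattern the question describes looks roughly like this from C# via P/Invoke. The KTM/TxF entry points (CreateTransaction, CreateFileTransacted, CommitTransaction) are the real Windows APIs; the path and payload are hypothetical, error handling is pared down, and this is a Windows-only sketch, not production code.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Minimal TxF sketch: write a blob so that it becomes visible
// atomically when the kernel transaction commits.
class TxfDemo
{
    [DllImport("KtmW32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr CreateTransaction(IntPtr attrs, IntPtr uow,
        uint options, uint isoLevel, uint isoFlags, uint timeout, string desc);

    [DllImport("KtmW32.dll", SetLastError = true)]
    static extern bool CommitTransaction(IntPtr hTransaction);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFileTransacted(string path,
        uint access, uint share, IntPtr security, uint disposition,
        uint flags, IntPtr template, IntPtr hTransaction,
        IntPtr miniVersion, IntPtr extended);

    const uint GENERIC_WRITE = 0x40000000;
    const uint CREATE_ALWAYS = 2;

    static void Main()
    {
        // One kernel transaction per document write.
        IntPtr tx = CreateTransaction(IntPtr.Zero, IntPtr.Zero, 0, 0, 0, 0, "doc write");
        try
        {
            using (var handle = CreateFileTransacted(@"C:\docs\photo.bin",
                       GENERIC_WRITE, 0, IntPtr.Zero, CREATE_ALWAYS, 0,
                       IntPtr.Zero, tx, IntPtr.Zero, IntPtr.Zero))
            using (var stream = new FileStream(handle, FileAccess.Write))
            {
                stream.Write(new byte[] { 1, 2, 3 }, 0, 3); // document payload
            }
            // The file appears to other readers only after this succeeds;
            // RollbackTransaction would discard it entirely.
            CommitTransaction(tx);
        }
        finally { CloseHandle(tx); }
    }
}
```

The distributed scenarios in the question (two mirror servers, or file system plus database) would additionally require enlisting this kernel transaction with MS DTC, which is where the answers below report the performance cost appearing.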
I suppose "real-world, enterprise-grade" experience is more subjective than it sounds.
Windows Update uses TXF. So it is being used quite heavily in terms of frequency. Now, it isn't doing any multi-node work and it isn't going through DTC or anything fancy like that, but it is using TXF to manipulate file state. It coordinates these changes with changes to the registry (TXR). Does that count?
A colleague of mine presented this talk to SNIA, which is pretty frank about a lot of the work around TXF and might shed a little more light. If you're thinking of using TXF, it's worth a read.
Unfortunately, it appears that the answer is "No."
In nearly two weeks (one week with a 100 point bounty) and 156 views, no one has answered that they have used TxF for any high-volume applications as I described. I can't say this was unexpected, and of course I cannot prove a negative, but it appears this feature of Windows is not well known or frequently used, at least by active members of the SO community at the time of writing.
If I ever get around to writing some kind of proof of concept, I'll post here what I learn.
Have you considered filestream support in SQL Server 2008 (if you're using SQL Server 2008 of course)? I'm not sure about performance, but it offers transactionality and supports backup/restore.
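As a sketch of what the FILESTREAM alternative looks like: SqlFileStream (in System.Data.SqlTypes) streams the blob through the Win32 API inside a SQL transaction, so the blob write and the index row commit or roll back together. The Docs table, column names, and connection string below are hypothetical.

```csharp
using System;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;

// Hypothetical schema: Docs(Id uniqueidentifier, Blob varbinary(max) FILESTREAM).
class FileStreamDemo
{
    static void Main()
    {
        var docId = Guid.NewGuid(); // hypothetical key of an existing row
        using (var conn = new SqlConnection("Server=.;Database=DocStore;Integrated Security=true"))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // PathName() and the transaction context are what SqlFileStream
                // needs to open the blob for Win32-style streaming.
                var cmd = new SqlCommand(
                    "SELECT Blob.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                    "FROM Docs WHERE Id = @id", conn, tx);
                cmd.Parameters.AddWithValue("@id", docId);

                string path;
                byte[] txContext;
                using (var reader = cmd.ExecuteReader())
                {
                    reader.Read();
                    path = reader.GetString(0);
                    txContext = (byte[])reader[1];
                }

                using (var fs = new SqlFileStream(path, txContext, FileAccess.Write))
                    fs.Write(new byte[] { 1, 2, 3 }, 0, 3); // blob payload

                tx.Commit(); // blob and row commit atomically; BACKUP covers both
            }
        }
    }
}
```

This covers the "persisted once transactionally" and backup/restore requirements without raw TxF, at the cost of routing every read through SQL Server's path resolution.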
While I don't have extensive experience with TxF, I do have experience with MS DTC. TxF itself is fairly performant, but when you throw in MS DTC to handle multiple resource managers across multiple machines, performance takes a considerable hit.
From your description, it sounds like you are storing and indexing very large volumes of unstructured data. I assume that you also need the ability to search for this data. As such, I would highly recommend looking into something like Microsoft's Dryad or Google's MapReduce and a high performance distributed file system to handle your unstructured data storage and indexing. The best examples of high-volume enterprise systems that store and index massive volumes of blob data are Internet search engines like Bing and Google.
There are quite a few resources available for managing high-throughput unstructured data, and they would probably solve your problem more effectively than SQL Server and NTFS.
I know it's a bit farther outside the box than you were probably looking for... but you did mention that you had already exhausted all other search avenues around the NTFS/TxF/SQL box. ;)
Ronald: FileStream is layered on top of TxF.
JR: While Windows Update uses TxF/KTM and demonstrates its utility, it is not a high-throughput application.
