From the name FindOneAndUpdate() I understand this is an atomic operation.
What if I want to find 10 items (Limit(10)) and Update them all alike?
For example set a state field to "in progress"?
Is that atomically achievable with MongoDB? Is there some built-in functionality in the C# driver maybe? I don't want to implement 2PC myself if it is avoidable :-)
I have other consumers asking for documents as well; I therefore want to avoid double processing, although it is not critical to my business case.
The motivation NOT to use FindOneAndUpdate() 10 times is purely network-related (less chatter, better performance). I do not have a requirement for transaction-like behavior.
Neither the database nor the business case is under my control, but I was told to expect many documents going in and out rather quickly.
In MongoDB, operations are only atomic on a per-document basis. That is, if you update multiple fields of a document with a single update statement, a concurrent reader will see either all of those field updates or none of them. An update that affects more than one document is therefore not an atomic operation across all the documents, but it is atomic within each individual document.
Since it sounds like you are mainly concerned with being more efficient in sending commands to the server, rather than with whether the operation is atomic server-side, you can use BulkWriteAsync(), which takes an IEnumerable<WriteModel<TDocument>> of write operations to perform on the server.
This allows you to build a list of updates and execute them in one round trip to the server. Care must be taken during this process to properly handle failed writes. Take a look at the MongoDB Docs on this over here.
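To make that concrete, here is a sketch of the "claim up to 10 documents" case with the C# driver. The collection variable, the State field, and the pending/in progress values are assumptions for the example; the batch is one network round trip, but it is still not atomic as a whole.

using MongoDB.Bson;
using MongoDB.Driver;
using System.Linq;

// collection is an IMongoCollection<BsonDocument> (assumed)
var pending = Builders<BsonDocument>.Filter.Eq("State", "pending");
var docs = await collection.Find(pending).Limit(10).ToListAsync();

var setInProgress = Builders<BsonDocument>.Update.Set("State", "in progress");
var requests = docs
    .Select(d => new UpdateOneModel<BsonDocument>(
        // match on _id and re-check State so a document already claimed
        // by another consumer is simply skipped rather than double-processed
        Builders<BsonDocument>.Filter.Eq("_id", d["_id"]) & pending,
        setInProgress))
    .ToList();

if (requests.Count > 0)
{
    // one command to the server; each contained update is atomic per document
    var result = await collection.BulkWriteAsync(requests);
    // result.ModifiedCount tells you how many of the 10 you actually claimed
}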
// One session so the three writes commit or abort together
session.StartTransaction();

await mongo.Collection1.UpdateOneAsync(session, filter1, update1);
await mongo.Collection2.BulkWriteAsync(session, updatesToDifferentDocs);
await mongo.Collection3.UpdateOneAsync(session, filter2, update2);

// A write conflict on any of the documents aborts the whole transaction
await session.CommitTransactionAsync();
The above code runs concurrently on multiple threads. The final update on Collection3 has a high chance of writing to the same document from multiple threads. I wanted the writes across the 3 collections to be atomic, which is why I put them in one session; that is what I thought a session is essentially for, although I'm not familiar with the details of its inner workings.
Even without knowing much about the built-in features of Mongo, it's pretty obvious why this is giving me a write conflict: I simply can't write to the same document in Collection3 at the same time from multiple threads.
However, I tried Googling a bit, and it seems like Mongo >= 3.2 uses the WiredTiger storage engine by default, which has document-level locks that the developer doesn't need to manage. I've read that it automatically retries the operation if the document was initially locked.
I don't really know if I'm using the session incorrectly here, or if I just have to manually implement some kind of lock/semaphore/queue system. Another option would be to manually check for the write conflict and re-attempt the entire session. But it feels like I'm just reinventing the wheel here if Mongo is already supposed to have concurrency support.
I should have updated this thread earlier, but nonetheless, here is what I ended up doing to solve my problem. While MongoDB does have automatic retries for transactions along with some locking mechanisms, I couldn't find a clean way to leverage this for my specific problematic session. I kept getting write conflicts even though I thought I'd acquired locks on all the colliding documents at the start of each session.
Since I had to maintain atomicity for a session that reads and writes across multiple collections, not just multiple documents, I thought it was cleaner to simply wrap it in custom retry logic. I followed the example at the bottom of the page here and used a timeout that I thought was reasonable for my use case.
I decided on timeout-based retry logic because I knew most of the collisions would be tightly packed together in time. For less predictable collisions, some queue-based mechanism might work better.
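For reference, a minimal sketch of that kind of timeout-based retry wrapper, reusing the names from the snippet above. This is my own illustration rather than the linked example; the 5-second budget, the 50 ms back-off, and the mongoClient variable are arbitrary assumptions.

using System;
using System.Threading.Tasks;
using MongoDB.Driver;

var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(5);       // arbitrary retry budget

while (true)
{
    using var session = await mongoClient.StartSessionAsync();
    session.StartTransaction();
    try
    {
        await mongo.Collection1.UpdateOneAsync(session, filter1, update1);
        await mongo.Collection2.BulkWriteAsync(session, updatesToDifferentDocs);
        await mongo.Collection3.UpdateOneAsync(session, filter2, update2);
        await session.CommitTransactionAsync();
        break;                                                  // success
    }
    catch (MongoException ex) when (ex.HasErrorLabel("TransientTransactionError")
                                    && DateTime.UtcNow < deadline)
    {
        // Write conflict with a concurrent transaction: the session is disposed at the
        // end of this iteration (aborting the open transaction) and everything is retried.
        await Task.Delay(TimeSpan.FromMilliseconds(50));
    }
}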
The automatic retry behavior is different for transactions in MongoDB. By default, a transactional write operation waits up to 5 ms to acquire the locks it needs and aborts the transaction if they cannot be acquired in that time.
You could try increasing the 5 ms timeout by running the following admin command on your MongoDB instance:
db.adminCommand( { setParameter: 1, maxTransactionLockRequestTimeoutMillis: 100 } )
More details about that can be found here.
Alternatively, you could manually implement some retry logic for failed transactions.
I'm facing the following situation:
A system I'm working on has a few different parts (services and ASP.NET) with separate responsibilities. These parts are tied together by two shared resources: an MSSQL database and files on a Windows filesystem.
Currently all these parts access these resources individually. I think this is causing unpredictability and inconsistency.
I'm thinking of introducing a service that regulates access to these resources. I'm not sure if this is an accepted design principle.
The general question is:
What kind of solution should I be looking at and what should I keep in mind when designing this?
Specific questions:
Is this just a Data Access Layer?
Is it bad to introduce a SPOF like this?
Can you recommend any reading material aimed at this kind of solution? (especially if there's specific material for C#)
Edit, prompted by a great question from allen-smithee:
The database is currently accessed through embedded queries. They are separated into a class, but these classes are different for every service, so it is not a shared library.
1/ A Data Access Layer simply encapsulates the data logic; what you need is concurrency control to ensure consistency of your data model across the independent services.
2/ Depending on how you implement concurrency, it can be a single point of failure, but I don't think there is anything wrong with that - "plan for failure" is a great design mantra. You can build in redundancy and fail-over mechanisms, or you can distribute your concurrency control across your services.
3/ The way you choose to implement concurrency will depend on how your application functions and what your users expect. To give some specific scenarios:
Scenario A
When a service begins an update start a transaction and take out one or more row-level locks for the records involved. If any other service tries to edit the record at the same time either block or return an error such as 'this record is currently locked'. Note that all locks have to be taken before reading and kept for the duration of the update to ensure consistency with other writes.
Pros - Fairly straightforward to implement for small data models. MSSQL supports plenty of locking scenarios and even custom application locks that you can use to group resources.
Cons - If your transaction needs to access multiple tables/rows and different services or functions access overlapping tables you can easily get into all sorts of deadlock problems.
MSSQL generally prefers pessimistic locking and can escalate locks from row to page and table level, which means read and write locks may behave in ways you wouldn't initially expect. You may need to spend a considerable amount of time debugging these interactions in SQL Server Profiler and be prepared to make changes to your data model to work around these issues.
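As a rough illustration of Scenario A with ADO.NET (the Account table, connectionString, and the local variables are invented for the example), the UPDLOCK/HOLDLOCK hints take a row lock at read time and keep it until the transaction ends:

using Microsoft.Data.SqlClient;   // or System.Data.SqlClient

using var conn = new SqlConnection(connectionString);
await conn.OpenAsync();
using var tx = conn.BeginTransaction();

// Lock the row when reading so no other service can update it until we commit.
var read = new SqlCommand(
    "SELECT Amount FROM Account WITH (UPDLOCK, HOLDLOCK) WHERE Id = @id", conn, tx);
read.Parameters.AddWithValue("@id", accountId);
var amount = (decimal)await read.ExecuteScalarAsync();

var write = new SqlCommand(
    "UPDATE Account SET Amount = @amount WHERE Id = @id", conn, tx);
write.Parameters.AddWithValue("@id", accountId);
write.Parameters.AddWithValue("@amount", amount + deposit);
await write.ExecuteNonQueryAsync();

tx.Commit();   // the row lock is released here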
Scenario B
Each table row has an incremental version number. A service reads the data it needs, performs a series of updates, and then within a transaction lock checks the current row version against the one it used for the update. If the version numbers do not match it rolls back the transaction, cancelling the update. The service may then attempt to perform the operation again starting with reading the data.
Pros - Readers are not blocked and the lock is held only very briefly while the service tries to commit the update. MSSQL has built-in support for this concurrency method in the form of 'Row Versioning' with the 'Snapshot Isolation' level. If conflicts are rare this method can be extremely responsive - perfect for real-time applications.
Cons - This method may require significant changes to your data model and the service behaviour.
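A sketch of Scenario B using a plain integer Version column (the schema and names are invented; SQL Server's rowversion type or snapshot isolation are the built-in alternatives mentioned above):

// conn is an open SqlConnection; accountId, newAmount, and versionReadEarlier
// come from the earlier read of the row.
var update = new SqlCommand(
    @"UPDATE Account
         SET Amount = @amount, Version = Version + 1
       WHERE Id = @id AND Version = @expectedVersion", conn);
update.Parameters.AddWithValue("@id", accountId);
update.Parameters.AddWithValue("@amount", newAmount);
update.Parameters.AddWithValue("@expectedVersion", versionReadEarlier);

if (await update.ExecuteNonQueryAsync() == 0)
{
    // Zero rows affected: someone else changed the row after we read it.
    // Re-read the data and retry the whole operation, or report a conflict.
}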
Scenario C
A single data service is responsible for all data access. Other services request data from and submit updates to this service. The service is responsible for reading and writing to the database and filesystem, and performs some level of data integrity checking and resolves data conflicts.
Pros - Encapsulates data integrity and control in one module, simplifying other services. Allows you to implement caching, locking etc at the application level providing finer-grained control.
Cons - Significant changes to existing architecture required. Resolving data conflicts can require a significant amount of code if you choose to resolve at the field level. Services will need to be able to handle a rejected update when resolution is not possible.
Those are the major scenarios I can think of off the top of my head, but there are plenty more. Generally, all concurrency control for data revolves around locking while performing an action (pessimistic locking); performing an action and then checking for a conflict (optimistic locking via versioning); or performing an action and then merging conflicts (conflict resolution).
Thinking about your specific data model and how the model is updated will guide which mix of these techniques you use. Searching for any of the terms above will give you plenty to read, and there are a lot of Technet articles that specifically address these issues in an MSSQL context. Take heart - I've seen good programmers get this stuff wrong; it really is a challenging problem, but it is solvable if you work through it methodically.
I have code that carries out data retrieval - basically it executes anything from 3 to 12 SQL (Oracle) read statements to retrieve data about an object.
Unfortunately it's running slowly (no SQL statement in particular; it's just the fact that I have so many of them, and they take around 0.2 seconds per statement, which can mean over 2 seconds for the code to complete).
I am looking into ways of improving the performance. One way is to merge some of the tables into a single query (which can cut about 0.5 seconds off the combined time). However it doesn't make sense to merge the rest, since there will only be data there under certain circumstances, and trying to determine when there is data there to marshal could get tricky.
I am considering introducing threading into my program, so that after the initial query I would spawn a thread for each of the other queries and they are executed at the same time. However I have never used threading and am wary of introducing deadlocks or other pitfalls.
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads, through different sections/fields within the object)? Would it be better to return the results and marshal them into the object after all the threads have finished?
I know these types of questions are hard to answer since it's more a request for general advice, but I would appreciate it if anyone thinks this is a good idea or has other suggestions.
If you are only doing reads (SELECTs), don't worry about deadlocks: Oracle reads are (mostly) non-blocking. The biggest problem with threading queries to Oracle is how you deal with connections. Creating a connection, running a query, and closing the connection is very, very, very bad. Connections are expensive. They are also limited, so you don't want to create a million connections to execute your logic.
As a result, you would use some sort of connection pool and put your queries in a queue.
Also, I hope you are using bind variables and not string concatenation to pass queries to Oracle.
In general, I would collect all the data (ideally in one query) and only then update the object. You could also consider breaking your object into its sections.
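As a sketch of the "collect first, then populate" approach with pooled connections and bind variables (this assumes the Oracle.ManagedDataAccess driver; the queries, the MapXxx helpers, and myObject are placeholders):

using System.Data;
using System.Threading.Tasks;
using Oracle.ManagedDataAccess.Client;

async Task<DataTable> QueryAsync(string sql, int objectId)
{
    using var conn = new OracleConnection(connectionString);   // served from the pool
    await conn.OpenAsync();
    using var cmd = new OracleCommand(sql, conn);
    cmd.Parameters.Add(new OracleParameter("id", objectId));   // bind variable, not string concatenation
    using var reader = await cmd.ExecuteReaderAsync();
    var table = new DataTable();
    table.Load(reader);
    return table;
}

// Run the independent reads concurrently; each task borrows its own pooled connection.
var results = await Task.WhenAll(
    QueryAsync("SELECT * FROM orders    WHERE customer_id = :id", id),
    QueryAsync("SELECT * FROM invoices  WHERE customer_id = :id", id),
    QueryAsync("SELECT * FROM addresses WHERE customer_id = :id", id));

// Marshal into the object on one thread, only after all queries have finished.
myObject.Orders    = MapOrders(results[0]);
myObject.Invoices  = MapInvoices(results[1]);
myObject.Addresses = MapAddresses(results[2]);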
Threading works perfectly. Two years ago I did a project that used a multi-stage / multi-threading approach to push data into an Oracle database (and pull some data out of it for updates).
I basically used a staged approach (a request would go through multiple stages, get consumed there, and new data would be pushed to the next stage), and every stage used a configurable thread pool, which would take a message, process it, and post the new messages.
We used, I think, close to 200 threads at the time to process about a million SQL statements per minute (hitting an Oracle Exadata that was really getting some work out of it).
So, multithreading "just works" - obviously if you know how to do it, and you have to get your architecture and the SQL statements nice and non-blocking. Databases in general are perfectly capable of handling multiple threads.
Now, for details: THAT DEPENDS.
Example:
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads, through different sections/fields within the object)?
Absolutely no problem as long as:
You make sure all updates are finished before moving the object to the next phase, and
The updates do not overlap or have an ordering dependency (1 must finish before 2 has the data it requires).
These are implementation details, and it is really hard to give a generic answer for them (practically impossible), especially as this is multithreading 101 and has nothing to do with database access.
In general you will also have to tune the number of threads yourself; .NET cannot do that for you, as it will see that the CPU is not busy and spawn more threads even if the database server is the bottleneck. This is why we went with multiple stages - so we could tune the number of threads depending on what they do (and the last stage used bulk inserts of the aggregated data into temporary staging tables with a small number of threads, moving a lot of data in every statement; that part needs tuning options so you don't totally overload the database side).
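A loose sketch of such a staged pipeline, built on System.Threading.Channels (my choice of primitive, not necessarily what was used back then; Request, Row, Transform, and BulkInsertAsync are placeholders, and error handling is omitted):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

var toTransform = Channel.CreateBounded<Request>(10_000);   // fed elsewhere, then Writer.Complete()
var toWrite     = Channel.CreateBounded<Row>(10_000);

// Stage 1: a tunable number of workers pull requests, process them, and post rows downstream.
var transformers = Enumerable.Range(0, transformThreads).Select(_ => Task.Run(async () =>
{
    await foreach (var request in toTransform.Reader.ReadAllAsync())
        await toWrite.Writer.WriteAsync(Transform(request));
})).ToArray();

// Stage 2: fewer workers, each bulk-inserting batches so every statement moves a lot of data.
var writers = Enumerable.Range(0, writerThreads).Select(_ => Task.Run(async () =>
{
    var batch = new List<Row>();
    await foreach (var row in toWrite.Reader.ReadAllAsync())
    {
        batch.Add(row);
        if (batch.Count >= 500) { await BulkInsertAsync(batch); batch.Clear(); }
    }
    if (batch.Count > 0) await BulkInsertAsync(batch);      // flush the remainder
})).ToArray();

await Task.WhenAll(transformers);
toWrite.Writer.Complete();          // lets the write stage drain and finish
await Task.WhenAll(writers);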
I have a data entry ASP.NET application. During one complete data entry, many transactions occur. I would like to keep track of all those transactions so that, if the user wants to abandon the data entry, all the transactions I have kept a record of can be rolled back.
SQL Server 2008, .NET Framework 4.0, and I am using C#.
This is always a tough lesson to learn for people that are new to web development. But here it is:
Each round trip web request is a separate, stand-alone thread of execution
That means, simply put, that each time you submit a page request (click a button, navigate to a new page, even refresh a page) it can run on a different thread than the previous one. What's more, even if you do get the same thread twice, several other web requests may have been processed by that thread in the time between your two requests.
This makes it effectively impossible to span simple transactions across more than one web request.
Here's another concept that you should keep in mind:
Transactions are intended for batch operations, not interactive operations.
What this means is that transactions are meant to be short-lived, and to encompass several operations executing sequentially (or simultaneously) in which all operations are atomic, and intended to either all complete, or all fail. Transactions are not typically designed to be long-lived (meaning waiting for a user to decide on various actions interactively).
Web apps are not desktop apps. They don't function like them. You have to change your thinking when you do web apps. And the biggest lesson to learn, each request is a stand-alone unit of execution.
Now, above, I said "simple transactions", also known as lightweight or local transactions. There's also what's known as a distributed transaction, and using those requires a distributed transaction coordinator. MSDTC is pretty commonly used. However, distributed transactions perform much more slowly than lightweight ones. Also, they require the infrastructure to be set up to use a DTC.
It's possible to span a transaction over web requests using a DTC. This is done by "enlisting" in a distributed transaction and then somehow sharing the transaction identifier between requests. But this is a lot of work to set up and deal with, and there are a lot of error-prone situations. It's not something you want to do if you have other options.
In general, you're better off adding the data to a temporary table or tables, and then when the final save is done, transfer that data to the permanent tables. Another option is to maintain some state (such as using ViewState or Session) to keep track of the changes.
One popular way of doing this is to perform operations client-side using JavaScript and then submitting all the changes to the server when you are done. This is difficult to implement if you need to navigate to different pages, however.
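To make the temporary-table suggestion a bit more concrete, here is a rough sketch; the staging and permanent table names, the EntrySessionId column, and connectionString are all invented for the example:

using System.Transactions;
using System.Data.SqlClient;

using (var scope = new TransactionScope())
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Final save: copy everything the user entered in this session into the permanent table...
    var copy = new SqlCommand(
        @"INSERT INTO OrderLines (OrderId, ProductId, Quantity)
          SELECT OrderId, ProductId, Quantity
            FROM OrderLinesStaging
           WHERE EntrySessionId = @session", conn);
    copy.Parameters.AddWithValue("@session", entrySessionId);
    copy.ExecuteNonQuery();

    // ...and clear the staging rows in the same short transaction.
    var cleanup = new SqlCommand(
        "DELETE FROM OrderLinesStaging WHERE EntrySessionId = @session", conn);
    cleanup.Parameters.AddWithValue("@session", entrySessionId);
    cleanup.ExecuteNonQuery();

    scope.Complete();   // abandoning the data entry is just the DELETE, with no INSERT
}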
From your question, it appears that the transactions are complete when the user exercises the option to roll them back. In such cases, I doubt if the DBMS's transaction rollback semantics would be available. So, I would provide such semantics at the application layer as follows:
Any atomic operation that can be performed on the database should be encapsulated in a Command object. Each command will implement the undo method that would revert the action performed by its execute method.
Each transaction would contain a list of commands that were run as part of it. The transaction is persisted as is for further operations in future.
The user would be provided with a way to view these transactions that can be potentially rolled back. Upon selection of a transaction by user to roll it back, the list of commands corresponding to such a transaction are retrieved and the undo method is called on all those command objects.
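A bare-bones sketch of that command/undo idea; the interface, the log class, and the Dal helpers are made-up names, not an established API:

using System.Collections.Generic;

public interface IUndoableCommand
{
    void Execute();   // performs one atomic database operation
    void Undo();      // reverts exactly what Execute did
}

public class TransactionLog
{
    private readonly List<IUndoableCommand> _executed = new List<IUndoableCommand>();

    public void Run(IUndoableCommand command)
    {
        command.Execute();
        _executed.Add(command);   // persist this list if rollback must survive a new request
    }

    public void RollBack()
    {
        // Undo in reverse order so dependent changes are reverted last-in, first-out.
        for (int i = _executed.Count - 1; i >= 0; i--)
            _executed[i].Undo();
        _executed.Clear();
    }
}

// Example command: an insert whose Undo is the matching delete.
public class InsertOrderLineCommand : IUndoableCommand
{
    private readonly int _orderId;
    private int _newLineId;                    // captured so Undo can target the exact row

    public InsertOrderLineCommand(int orderId) { _orderId = orderId; }

    public void Execute() { _newLineId = Dal.InsertOrderLine(_orderId); }
    public void Undo()    { Dal.DeleteOrderLine(_newLineId); }
}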
HTH.
You can also store them in a temporary table and move those records to your original table at a later stage.
If you are just managing transactions during a single save operation, use TransactionScope. But it doesn't sound like that is the case.
If the user may wish to abandon any number of previous save operations, it suggests that an item may exist in draft form. There might be one working draft or many. Subsequently, there must be a way to promote a draft to a final version, either implicitly or explicitly. Think of how an email program saves a draft: it doesn't actually send your message, you may abandon it at any time, and you may recall it at a later time. When you send the message, you have "committed the transaction".
You might also add a user interface to rollback to a specific version.
This will be a fair amount of work, but if you are willing to save and manage multiple copies of the same item it can be accomplished.
You may save a copy of the same data in the same schema using a status flag to indicate that it is a draft, or you might store the data in an intermediate format in separate table(s). I would prefer the first approach in that it allows the same structures to be used.
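A fragment showing what promoting drafts could look like with the status-flag approach (the table, column, and variable names are assumed; conn is an open SqlConnection):

var promote = new SqlCommand(
    "UPDATE DataEntry SET Status = 'Final' WHERE DraftId = @draft AND Status = 'Draft'",
    conn);
promote.Parameters.AddWithValue("@draft", draftId);
promote.ExecuteNonQuery();

// Abandoning the draft would instead be:
// DELETE FROM DataEntry WHERE DraftId = @draft AND Status = 'Draft'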
Here I am dealing with a database containing tens of millions of records. I have an application which connects to the database, gets all the data from a single column in a table, does some operation on it, and updates it (for SQL Server, using cursors).
For millions of records it is taking a very, very long time to update. So I want to make it faster by either:
using multiple threads with an independent connection for each thread,
or
using a single connection shared by all the threads to fire the update queries.
Which one is faster, or if you have any other ideas, please explain.
I need a solution which is independent of the database type, but even if you only know specific solutions for particular databases, please reply.
The speedup you're trying to achieve won't work. On the contrary, it will slow down the overall processing, because the database now also has to keep multiple connections/sessions/transactions in sync.
Stick with as few connections/transactions as possible for repetitive and comparable operations.
If it still takes too long for your taste, try to analyze whether the queries can be optimized somehow. Also have a look at database-specific extensions (e.g. bulk operations) suitable for your problem.
It all depends on the database and the hardware it is running on.
It can help if the database can make use of concurrent processing and avoids contention on shared resources (e.g. page-based locks span multiple records, record-based locks do not). Shared resources in this case include hardware: a single-core box will not be able to execute multiple CPU-intensive activities (e.g. parsing SQL) truly in parallel.
Network latency is something you might alleviate with concurrent statements, even if the database itself is not able to exploit concurrency.
As with any question of performance, there is no substitute for testing in your specific scenario.
If possible, try to use a stored procedure to do all the processing and update the records.
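Either way, the biggest win usually comes from replacing the row-by-row cursor with one set-based statement executed on the server, whether issued from the application or wrapped in a stored procedure. A sketch, where the table and the NormalizePhone function are invented for illustration:

using Microsoft.Data.SqlClient;   // or System.Data.SqlClient

using var conn = new SqlConnection(connectionString);
await conn.OpenAsync();

var cmd = new SqlCommand(
    "UPDATE dbo.Customers SET Phone = dbo.NormalizePhone(Phone) WHERE Phone IS NOT NULL",
    conn);
cmd.CommandTimeout = 0;                         // allow the long-running batch to finish
var rows = await cmd.ExecuteNonQueryAsync();    // one statement touches all the rows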