I have a .NET application written in C#, and use Mongo for my database backend. One of my collections, UserSearchTerms, repeatedly (and unintentionally) has duplicate documents created.
I've teased out the problem to an update function that gets called asynchronously, and can be called multiple times simultaneously. In order to avoid problems with concurrent runs, I've implemented this code using an update which I trigger on any documents that match a specific query (unique on user and program), upserting if no documents are found.
Initially, I can guarantee that no duplicates exist and so expect that only the following two cases can occur:
No matching documents exist, triggering an upsert to add a new document
One matching document exists, and so an update is triggered only on that one document
Given these two cases, I expect that there would be no way for duplicate documents to be inserted through this function - the only time a new document should be inserted is if there are none to begin with. Yet over an hour or so, I've found that even though documents for a particular user/program pair exist, new documents for them are created.
Am I implementing this update correctly to guarantee that duplicate documents will not be created? If not, what is the proper way to implement an update in order to assure this?
This is the function in question:
public int UpdateSearchTerm(UserSearchTerm item)
{
    _userSearches = _uow.Db.GetCollection<UserSearchTerm>("UserSearchTerms");

    var query = Query.And(
        Query<UserSearchTerm>.EQ(ust => ust.UserId, item.UserId),
        Query<UserSearchTerm>.EQ(ust => ust.ProgramId, item.ProgramId));

    _userSearches.Update(query, Update<UserSearchTerm>.Replace(item),
        new MongoUpdateOptions { Flags = UpdateFlags.Upsert });

    return (int)_userSearches.Count(query);
}
Additional Information:
I'm using mongod version 2.6.5
The mongocsharpdriver version I'm using is 1.9.2
I'm running .NET 4.5
UserSearchTerms is the collection I store these documents in.
The query is intended to match users on both userId AND programId - my definition of a 'unique' document.
I return a count after the fact for debugging purposes.
You could add a unique compound index on userId and programId to ensure that no duplicates can be inserted.
Docs: https://docs.mongodb.org/v2.4/tutorial/create-a-unique-index/
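For the legacy 1.9.x driver used in the question, creating that index could look roughly like this sketch (it reuses the question's collection and property names; the exact call site is an assumption):
// Sketch: a unique compound index on UserId + ProgramId means two concurrent
// upserts for the same pair cannot both insert a document.
var userSearches = _uow.Db.GetCollection<UserSearchTerm>("UserSearchTerms");
userSearches.CreateIndex(
    IndexKeys<UserSearchTerm>.Ascending(ust => ust.UserId, ust => ust.ProgramId),
    IndexOptions.SetUnique(true));
With the index in place, one of two racing upserts fails with a duplicate key error instead of creating a second document, so the caller should be prepared to retry that case as a plain update.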
Related
We recently had a migration project that went badly wrong, and we now have thousands of duplicate records. The business has been working with them, which has made the issue worse: we now have records with the same name and address but possibly different contact information. A small number are exact duplicates. We have started the painful process of manually merging the records, but this is very slow. Can anyone suggest another way of tackling the problem, please?
You can quickly write a console app to merge them; the MSDN sample code below shows the relevant request.
Sample: Merge two records
// Create the target for the request.
EntityReference target = new EntityReference();
// Id is the GUID of the account that is being merged into.
// LogicalName is the type of the entity being merged to, as a string
target.Id = _account1Id;
target.LogicalName = Account.EntityLogicalName;
// Create the request.
MergeRequest merge = new MergeRequest();
// SubordinateId is the GUID of the account merging.
merge.SubordinateId = _account2Id;
merge.Target = target;
merge.PerformParentingChecks = false;
// Execute the request.
MergeResponse merged = (MergeResponse)_serviceProxy.Execute(merge);
When merging two records, you specify one record as the master record, and Microsoft Dynamics CRM treats the other record as the child (subordinate) record. It deactivates the child record and copies all of the related records (such as activities, contacts, addresses, cases, notes, and opportunities) to the master record.
Read more
Building on Arun Vinoth's answer, you might want to see what you can leverage with the out-of-box duplicate detection to get sets of duplicates to apply the merge automation to.
Alternatively you can build your own dupe detection to match records on the various fields where you know dupes exist. I've done similar things to compare records across systems, including creating match codes to mimic how Microsoft does their dupe detection in CRM.
For example, a contact's match codes might be
1. the email address
2. the first name, last name, and company concatenated together without spaces.
If you need to match companies, you can implement an algorithm like Scribe's stripcompany to generate match codes based on company names.
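As a rough illustration of the contact match codes described above (the normalization rules here are my own, not Microsoft's or Scribe's; requires using System.Linq):
// Sketch: simple match codes for contact duplicate detection. The field names
// and normalization rules are illustrative only.
static string EmailMatchCode(string email) =>
    (email ?? string.Empty).Trim().ToLowerInvariant();

static string NameCompanyMatchCode(string firstName, string lastName, string company)
{
    string Normalize(string s) =>
        new string((s ?? string.Empty).ToLowerInvariant().Where(char.IsLetterOrDigit).ToArray());

    // First name + last name + company, concatenated without spaces or punctuation.
    return Normalize(firstName) + Normalize(lastName) + Normalize(company);
}
Records whose match codes collide then become candidates for the merge automation above.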
Since this seems like a huge problem, you may want to consider drastic solutions: deactivate the entire polluted data set, redo the data import cleanly, find any of the deactivated records that were touched in the interim and merge them, and then delete the entire polluted (deactivated) data set.
Bottom line, all paths seem to lead to major headaches and the only consolation is that you get to choose which path to follow.
I have a document I want to upsert. It has a unique index on one of the properties, so I have something like this to ensure I get no collisions
var barVal = 1;
collection.UpdateOne(
    x => x.Bar == barVal,
    new UpdateDefinitionBuilder<Foo>().Set(x => x.Bar, barVal),
    new UpdateOptions { IsUpsert = true });
But I seem to sometimes get collisions from this on the unique index on bar.
Are upserts atomic in Mongo, so that if the filter matches, the document can't be changed before the update completes?
If it is, I probably have a problem somewhere else; if it's not, I need to handle that.
The docs don't seem to suggest one way or the other.
https://docs.mongodb.com/v3.2/reference/method/Bulk.find.upsert/
https://docs.mongodb.com/v3.2/reference/method/db.collection.update/
Actually, the docs do say something about this. Here is what I found in db.collection.update#use-unique-indexes:
To avoid inserting the same document more than once, only use upsert: true if the query field is uniquely indexed.
...
With a unique index, if multiple applications issue the same update with upsert: true, exactly one update() would successfully insert a new document.
The remaining operations would either:
update the newly inserted document, or
fail when they attempted to insert a duplicate.
If the operation fails because of a duplicate index key error, applications may retry the operation which will succeed as an update operation.
So, if you have created a unique index on the field you are querying, only one of the concurrent upserts can insert a new document; the others either update it or fail with a duplicate key error, which can simply be retried as an update.
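Following that guidance, a retry around the upsert might look like this sketch (it uses the Builders<Foo>.Update form of the question's update; Foo, Bar, and barVal are the question's placeholders):
try
{
    collection.UpdateOne(
        x => x.Bar == barVal,
        Builders<Foo>.Update.Set(x => x.Bar, barVal),
        new UpdateOptions { IsUpsert = true });
}
catch (MongoWriteException ex) when (ex.WriteError?.Category == ServerErrorCategory.DuplicateKey)
{
    // A concurrent upsert won the race and inserted the document first;
    // retrying now matches the existing document and runs as a plain update.
    collection.UpdateOne(
        x => x.Bar == barVal,
        Builders<Foo>.Update.Set(x => x.Bar, barVal),
        new UpdateOptions { IsUpsert = true });
}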
I am trying to insert documents into MongoDB, but I only want unique documents: whenever I encounter a duplicate, I want to ignore it if it already exists and move on to the next one. I am using the following code, but apparently it does not work.
var keys = IndexKeys.Ascending("TrackingNumber");
var options = IndexOptions.SetUnique(true).SetDropDups(true);
_collection.CreateIndex(keys, options);
If you really want to ignore these, it's probably best to do it in code, though that might not be that easy in a multi-client environment.
The dropDups flag is a parameter of the index creation only, so it will drop duplicates it finds while creating the index. The flag will be ignored for inserts afterwards, because it's not even a parameter of the index.
A better way, though not exactly the behavior you're looking for, is to use upserts, i.e. operations that insert the document if it is not yet present and update it if it already exists. That has the advantage of being idempotent (which the ignore strategy is not).
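For example, with the legacy 1.x driver API shown in the question, an upsert keyed on TrackingNumber could look like this sketch (doc is assumed to be the BsonDocument being written):
// Sketch: insert the document if no document with this TrackingNumber exists,
// otherwise replace the existing one. Running this twice for the same
// TrackingNumber leaves a single document instead of failing.
var query = Query.EQ("TrackingNumber", doc["TrackingNumber"]);
_collection.Update(query, Update.Replace(doc), UpdateFlags.Upsert);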
I'm updating a document in MongoDB with the C# driver. I've verified the update completes successfully, but if I query the collection that contains the updated document immediately after the update, I don't see the new values. If I put a breakpoint in my code after the update but before the select, I do see the new values in the results. If I let the code run straight through, I get the old values in my names collection. I tried changing the write concern, but I don't think that's it. Is there some way to chain these two operations together so the select won't happen until the update has completed?
var qry = Query.EQ("_id", new ObjectId(id));
var upd = Update.Set("age", BsonValue.Create(newAge));
db.GetCollection<MongoTest>("mongotest").Update(qry, upd);

// ... would like to pause here until update is complete ...

var names = db.GetCollection<MongoTest>("mongotest")
    .FindAll().SetSortOrder(SortBy.Ascending("name"))
    .ToList<MongoTest>();

if (names.Count() > 0)
{
    return View(names);
}
One clarification: the official MongoDB .NET driver defaults to acknowledged writes (write concern 1) when you connect using MongoClient. If you use the old style of MongoServer.Create (now obsolete), the default is unacknowledged.
Also ensure that you are not using a read preference that could route your reads to a secondary.
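If you are on the older connection path, you can also request an acknowledged write explicitly; a minimal sketch reusing the question's own calls (names and variables are from the question):
// Sketch: passing WriteConcern.Acknowledged makes Update block until the
// server has acknowledged the write, so the FindAll below sees the new value.
var qry = Query.EQ("_id", new ObjectId(id));
var upd = Update.Set("age", BsonValue.Create(newAge));
db.GetCollection<MongoTest>("mongotest").Update(qry, upd, WriteConcern.Acknowledged);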
On the RavenDB site it says "Use Load over Query when you know the document's Id". In my tests on a simple collection of approximately 1,500 objects, Load is always slower. Why?
Load:
var doc = session.Load<Document>("Documents/123");
Query
var doc = session.Query<Document>().Where(x => x.Id == "123").SingleOrDefault();
In a test retrieving every document, the average Query time was 66 milliseconds vs 137 milliseconds for Load. The RavenDB instance is located in another office, hence the high times. Regardless, shouldn't Load always be faster?
Edit
This is the statement I'm referring to: http://ravendb.net/kb/31/my-10-tips-and-tricks-with-ravendb, tip #4. Is it wrong?
From what I understand, Load is guaranteed to return a result (provided that the id exists in the database), whereas Query might not return a result if the indexes haven't yet been updated.
You could have a scenario where you insert a record, then on the next line try to retrieve that same record using Query and get nothing back. Load would return the record in this scenario.
So I guess the performance degradation you are seeing might be related to the fact that you are querying by index when using Query, whereas Load is hitting the actual data store.
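A small sketch of that insert-then-read scenario (the Document class and the document store setup are assumed from the question):
using (var session = store.OpenSession())
{
    session.Store(new Document { Id = "Documents/123" });
    session.SaveChanges();

    // Load goes straight to the document store, so the new document comes back.
    var byLoad = session.Load<Document>("Documents/123");

    // Query goes through an index; if the index hasn't caught up yet,
    // this can return null even though the document exists.
    var byQuery = session.Query<Document>()
        .Where(x => x.Id == "Documents/123")
        .SingleOrDefault();
}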
When retrieving an item by its Id, you are required to use the .Load(id) method.
Load is an ACID compliant operation. It retrieves documents directly from the document store.
Query is a BASE operation that is "eventually consistent". It goes first against an index, finds the documents in the document store, and then returns them. Querying by an Id could potentially return null if the document was just added and has not been indexed yet.
RavenDB 2.0 added a feature to prevent you from querying by Id. It will throw an exception if you try to do so. So using Load is not just a best practice, it's a requirement.