Bucketing with MongoDB C# Driver

I'm trying to implement a technique known as bucketing in MongoDB (or so it was referred to in a MongoDB workshop), which uses $push and $slice to build a user feed system similar to Twitter's or Facebook's.
Essentially I have a document with an array of feed items, and I want to create a new document when the number of items for a user reaches a certain limit.
So, if the latest userFeed document's items array has 50 items, I want a new document to be created and the new item inserted into the items array of that newly created document.
This is the code I have thus far:
var update = Builders<UserFeed>
    .Update
    .CurrentDate(x => x.DateLastUpdated)
    .PushEach(
        x => x.Items,
        new List<FeedItemBase> { feedItem },
        50);

var result = await Collection.UpdateOneAsync(
    x => x.User.Id == userFeedToWriteTo,
    update,
    new UpdateOptions { IsUpsert = true }
).ConfigureAwait(false);
...
But it does not appear to create a new document, or even insert the item into the existing document's array. I thought the creation of the new document would be handled by this
new UpdateOptions { IsUpsert = true }
but apparently not. Any help would be greatly appreciated.

So after having written out the problem and said it out loud a few times, I realised what the problem was.
I needed a counter on the main userFeed document that is incremented every time an item is added (a feed item is posted). In the query for the update/upsert I then just needed to check for Count < 50. After that, everything works as expected. Here is the corrected code:
var update = Builders<UserFeed>
    .Update
    .CurrentDate(x => x.DateLastUpdated)
    .PushEach(
        x => x.Items,
        new List<FeedItemBase> { feedItem },
        50)
    .Inc(x => x.Count, 1);

var result = await Collection.UpdateOneAsync(
    x => x.User.Id == userFeedToWriteTo && x.Count < 50,
    update,
    new UpdateOptions { IsUpsert = true }
).ConfigureAwait(false);
As long as the count is NOT adjusted when a feed item is deleted from the items array, everything works as expected. Problems will arise if you decrement the count on removal, because new items will then be pushed into earlier documents and you will need to sort after you unwind the data, which at present I do not need to do. It does mean some documents will end up with fewer than 50 items in the array, but to me that doesn't really matter.
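For reference, here is a minimal sketch of the document class these updates assume; the property names mirror the expressions used above, and the nested user type is purely illustrative:

// Sketch of the assumed document shape; only members referenced by the updates are shown.
public class UserFeed
{
    public ObjectId Id { get; set; }
    public FeedUser User { get; set; }                // hypothetical nested type exposing the Id used in the filter
    public int Count { get; set; }                    // incremented on every push, checked against < 50
    public DateTime DateLastUpdated { get; set; }
    public List<FeedItemBase> Items { get; set; } = new List<FeedItemBase>();
}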
I hope this helps someone trying to implement a similar solution in C#.

Related

Prevent possible duplicates on MongoDB document array

I have a bunch of information saved inside an array of a document in MongoDB; the structure is something like this:
{
    "Ticker": "TSLA34",
    "History": [
        {
            "Price": 26.36,
            "UpdatedAt": "10/22/2015 10:12:00 AM"
        },
        {
            "Price": 26.37,
            "UpdatedAt": "10/22/2015 10:13:00 AM"
        }
    ]
}
I'm saving the information inside the "History" array, and this information is time based; I need to insert it minute by minute.
Sometimes the same minute comes to me two or three times in a row, so I need to check whether the minute I received is already in the array before pushing it and saving to MongoDB. And if the whole document does not exist yet, I need to create it and insert the information.
I'm using MongoDB.Driver in .NET 6, and I built this method to insert the information only if the "UpdatedAt" value is not already saved in the "History" array:
// var item = new Chart("TSLA34", new HistoryEntry(26.36, DateTime.UtcNow))

// Match the document for this ticker...
var filterByTicker = Builders<Chart>.Filter.Eq(g => g.Ticker, item.Ticker);
// ...but only when no History entry already has this UpdatedAt value
var filterByDate = Builders<Chart>.Filter.ElemMatch(
    g => g.History,
    Builders<HistoryEntry>.Filter.Eq(x => x.UpdatedAt, item.History.First().UpdatedAt));
var filter = filterByTicker & Builders<Chart>.Filter.Not(filterByDate);
// Push the new entry onto the History array
var update = Builders<Chart>.Update.Push(asset => asset.History, item.History.First());
var updateOneModel = new UpdateOneModel<Chart>(filter, update) { IsUpsert = true };
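For completeness, the model above would then be submitted through a bulk write; a minimal sketch, assuming the collection variable is an IMongoCollection<Chart>:

// "chartCollection" is an assumed IMongoCollection<Chart>; BulkWriteAsync accepts any sequence of WriteModel<Chart>.
await chartCollection.BulkWriteAsync(new[] { updateOneModel });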
I need to create the document if it doesn't exist, but if it already exists I need to verify whether that minute is already in the history.
I could do this with a Find and check it with LINQ before doing the update, but I was wondering if I can do this more efficiently within MongoDB.
Does anyone have an idea?

Reactive - how to combine / join / look up items with two sequences

I am connecting to a web service that gives me all prices for a day (without time info). Each of those price results has the id for a corresponding "batch run".
The "batch run" has a date+time stamp, but I have to make a separate call to get all the batch info for the day.
Hence, to get the actual time of each result, I need to combine the two API calls.
I'm using Reactive for this, but I can't reliably combine the two sets of data. I thought that CombineLatest would do it, but it doesn't seem to work as I thought (based on http://reactivex.io/documentation/operators/combinelatest.html, http://introtorx.com/Content/v1.0.10621.0/12_CombiningSequences.html#CombineLatest).
[TestMethod]
public async Task EvenMoreBasicCombineLatestTest()
{
    int batchStart = 100, batchCount = 10;

    // create 10 results with batch ids [100, 109]
    // the test uses lists just to make debugging easier
    var resultsWithBatchIdList = Enumerable.Range(batchStart, batchCount)
        .Select(id => new { BatchRunId = id, ResultValue = id * 10 })
        .ToList();
    var resultsWithBatchId = Observable.ToObservable(resultsWithBatchIdList);
    Assert.AreEqual(batchCount, await resultsWithBatchId.Count());

    // create 10 batches with ids [100, 109]
    var batchesList = Enumerable.Range(batchStart, batchCount)
        .Select(id => new
        {
            ThisId = id,
            BatchName = String.Concat("abcd", id)
        })
        .ToList();
    var batchesObservable = Observable.ToObservable(batchesList);
    Assert.AreEqual(batchCount, await batchesObservable.Count());

    // turn the batch set into a dictionary so we can look up each batch by its id
    var batchRunsByIdObservable = batchesObservable.ToDictionary(batch => batch.ThisId);

    // for each result, look up the corresponding batch id in the dictionary to join them together
    var resultsWithCorrespondingBatch =
        batchRunsByIdObservable
            .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
            {
                Assert.AreEqual(batchCount, batchRunsById.Count);
                var correspondingBatch = batchRunsById[result.BatchRunId];
                var priceResultAndSourceBatch = new
                {
                    Result = result,
                    SourceBatchRun = correspondingBatch
                };
                return priceResultAndSourceBatch;
            });

    Assert.AreEqual(batchCount, await resultsWithCorrespondingBatch.Count());
}
I would expect that, as each element of the 'results' observable comes through, it would get combined with the latest element of the batch-id dictionary observable (which only ever emits one element). But instead, it looks like only the last element of the result list gets joined.
I have a more complex problem deriving from this but while trying to create a minimum repro, even this is giving me unexpected results. This happens with version 3.1.1, 4.0.0, 4.2.0, etc.
(Note that the sequences don't generally match up as in this artificial example, so I can't just Zip them.)
So how can I do this join? A stream of results that I want to look up more info via a Dictionary (which also is coming from an Observable)?
Also note that the goal is to return the IObservable (resultsWithCorrespondingBatch), so I can't just await the batchRunsByIdObservable.
Ok I think I figured it out. I wish either of the two marble diagrams in the documentation had been just slightly different -- it would have made a subtlety of CombineLatest much more obvious:
N------1---2---3---
L--z--a------bc----
R------1---2-223---
       a   a bcc
It's combine latest -- so depending on when items get emitted, it's possible to miss some tuples. What I should have done is SelectMany:
NO: .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
YES: .SelectMany(batchRunsById => resultsWithBatchId.Select(result =>
Note that the "join" order is important: A.SelectMany(B) vs B.SelectMany(A) -- if A has 1 item and B has 100 items, the latter would result in 100 calls to subscribe to A.
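Putting that together, the corrected join from the test above would look roughly like this (a sketch using the variables defined in the test):

// One subscription to the dictionary observable; every result is then projected against that single dictionary.
var resultsWithCorrespondingBatch =
    batchRunsByIdObservable
        .SelectMany(batchRunsById =>
            resultsWithBatchId.Select(result => new
            {
                Result = result,
                SourceBatchRun = batchRunsById[result.BatchRunId]
            }));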

Retrieve the previous indexed record in a list, if it exists

I am using .NET 3.5. My requirement is to traverse a list of objects ordered descending by date, find the match for a particular record and capture that object, and then, IF a record exists on a date prior to that (the captured object's index minus one, if it exists), capture that object too. So my output list can have either one or two records depending on whether there was a previous dated record or not. Is there a clean way of achieving this?
I tried capturing the index of the matched record and going to the previous index by adding -1 to it, but that risks an index out of bounds exception if the previous element does not exist.
How can I avoid the index out of bounds exception, yet still check for the existence of the previous element, if it exists? I am sure there is a much cleaner way of doing this than the way I am trying, so I am reaching out to you for a nicer way of doing it.
Any advice is highly appreciated. Thank you.
Take a look at the below. The object is just a DateTimeOffset, but it should illustrate the LINQ query. You are looking for .Take(2) (this could be more complicated if you need this grouped by something).
LINQPad sample below, but it should be easily pasted into a console app.
void Main()
{
    var threeItems = new List<DateTimeOffset>(new[] { DateTimeOffset.Now, DateTimeOffset.Now.AddDays(-1), DateTimeOffset.Now.AddDays(-2) });
    var twoItems = new List<DateTimeOffset>(new[] { DateTimeOffset.Now, DateTimeOffset.Now.AddDays(-1) });
    var oneItem = new List<DateTimeOffset>(new[] { DateTimeOffset.Now });

    ShowItems(GetItems(threeItems));
    ShowItems(GetItems(twoItems));
    ShowItems(GetItems(oneItem));
}

IEnumerable<DateTimeOffset> GetItems(List<DateTimeOffset> items)
{
    return items
        .OrderByDescending(i => i)
        .Take(2);
}

void ShowItems(IEnumerable<DateTimeOffset> items)
{
    Console.WriteLine("List of Items:");
    foreach (var item in items)
    {
        Console.WriteLine(item);
    }
}
I think what you are looking for will require using List.IndexOf to find the index of the matching item, then retrieving the previous item if there is a date before the date you searched for. My example here uses an object called listObject which contains a DateTime as well as other properties:
DateTime searchDate = DateTime.Parse("26/01/2019");
var orderedList = listObjects.OrderBy(x => x.DateProperty).ToList();
listObject matchingItem = orderedList.First(x => x.DateProperty.Date == searchDate.Date); //gets the first matching date
listObject previousMatching = orderedList.Any(x => x.DateProperty.Date < searchDate.Date) ? orderedList[orderedList.IndexOf(matchingItem) - 1] : null; //returns previous if existing, else returns null
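If you would rather keep the descending order from the question and avoid indexing altogether, here is a small sketch of the same idea; the element type and property names are assumed:

// Assumed shape: each element has a DateProperty. Order newest-first, skip to the match,
// then take the match plus the next older record if one exists (yields 0, 1 or 2 items, never throws).
var result = listObjects
    .OrderByDescending(x => x.DateProperty)
    .SkipWhile(x => x.DateProperty.Date != searchDate.Date)
    .Take(2)
    .ToList();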

Is a MongoDB bulk upsert possible? C# Driver

I'd like to do a bulk upsert in Mongo. Basically I'm getting a list of objects from a vendor, but I don't know which ones I've gotten before (and need to be updated) vs which ones are new. One by one I could do an upsert, but UpdateMany doesn't work with upsert options.
So I've resorted to selecting the documents, updating in C#, and doing a bulk insert.
public async Task BulkUpsertData(List<MyObject> newUpsertDatas)
{
    var usernames = newUpsertDatas.Select(p => p.Username);
    var filter = Builders<MyObject>.Filter.In(p => p.Username, usernames);

    //Find all records that are in the list of newUpsertDatas (these need to be updated)
    var collection = Db.GetCollection<MyObject>("MyCollection");
    var existingDatas = await collection.Find(filter).ToListAsync();

    //loop through all of the new data,
    foreach (var newUpsertData in newUpsertDatas)
    {
        //and find the matching existing data
        var existingData = existingDatas.FirstOrDefault(p => p.Id == newUpsertData.Id);

        //If there is existing data, preserve the date created (there are other fields I preserve)
        if (existingData == null)
        {
            newUpsertData.DateCreated = DateTime.Now;
        }
        else
        {
            newUpsertData.Id = existingData.Id;
            newUpsertData.DateCreated = existingData.DateCreated;
        }
    }

    await collection.DeleteManyAsync(filter);
    await collection.InsertManyAsync(newUpsertDatas);
}
Is there a more efficient way to do this?
EDIT:
I did some speed tests.
In preparation I inserted 100,000 records of a pretty simple object. Then I upserted 200,000 records into the collection.
Method 1 is as outlined in the question: find the existing documents, update in code, DeleteMany, InsertMany. This took approximately 5 seconds.
Method 2 was making a list of UpdateOneModel with IsUpsert = true and then doing one BulkWriteAsync. This was super slow. I could see the count in the Mongo collection increasing, so I know it was working, but after about 5 minutes it had only climbed to 107,000, so I canceled it.
I'm still interested if anyone else has a potential solution.
Given that you've said you could do a one-by-one upsert, you can achieve what you want with BulkWriteAsync. This allows you to create one or more instances of the abstract WriteModel, which in your case would be instances of UpdateOneModel.
In order to achieve this, you could do something like the following:
var listOfUpdateModels = new List<UpdateOneModel<T>>();
// ...
var updateOneModel = new UpdateOneModel<T>(
    Builders<T>.Filter. /* etc. */,
    Builders<T>.Update. /* etc. */)
{
    IsUpsert = true
};
listOfUpdateModels.Add(updateOneModel);
// ...
await mongoCollection.BulkWriteAsync(listOfUpdateModels);
The key to all of this is the IsUpsert property on UpdateOneModel.
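Applied to the MyObject from the question, the model construction might look roughly like this; treat it as a sketch, since the choice of Username as the match key and the field being refreshed are assumptions:

// Sketch only: Username as the upsert key and SomeUpdatedField are assumptions from the question's description.
var models = newUpsertDatas
    .Select(d => new UpdateOneModel<MyObject>(
        Builders<MyObject>.Filter.Eq(p => p.Username, d.Username),
        Builders<MyObject>.Update
            .Set(p => p.SomeUpdatedField, d.SomeUpdatedField)   // hypothetical field to refresh on existing documents
            .SetOnInsert(p => p.DateCreated, DateTime.Now))     // only applied when the upsert inserts
    {
        IsUpsert = true
    })
    .ToList();

await collection.BulkWriteAsync(models);
Note that if the field used in each filter (Username here) isn't indexed, every one of those upserts has to scan the collection, which may be part of why the bulk-upsert timing in the edit above was so slow.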

Inserting many rows with Entity Framework is extremely slow

I'm using Entity Framework to build a database. There are two models, Workers and Skills, and each Worker has zero or more Skills. I initially read this data into memory from a CSV file and store it in a dictionary called allWorkers. Next, I write the data to the database as such:
// Populate database
using (var db = new SolverDbContext())
{
    // Add all distinct skills to database
    db.Skills.AddRange(allSkills
        .Distinct(StringComparer.InvariantCultureIgnoreCase)
        .Select(s => new Skill
        {
            Reference = s
        }));
    db.SaveChanges(); // Very quick

    var dbSkills = db.Skills.ToDictionary(k => k.Reference, v => v);

    // Add all workers to database
    var workforce = allWorkers.Values
        .Select(i => new Worker
        {
            Reference = i.EMPLOYEE_REF,
            Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
            DefaultRegion = "wa",
            DefaultEfficiency = i.TECH_EFFICIENCY
        });
    db.Workers.AddRange(workforce);
    db.SaveChanges(); // This call takes 00:05:00.0482197
}
The last db.SaveChanges(); takes over five minutes to execute, which I feel is far too long. I ran SQL Server Profiler while the call was executing, and basically what I found was thousands of calls to:
INSERT [dbo].[SkillWorkers]([Skill_SkillId], [Worker_WorkerId])
VALUES (@0, @1)
There are 16,027 rows being added to SkillWorkers, which is a fair amount of data but not huge by any means. Is there any way to optimize this code so it doesn't take 5 minutes to run?
Update: I've looked at other possible duplicates, such as this one, but I don't think they apply. First, I'm not bulk adding anything in a loop. I'm doing a single call to db.SaveChanges(); after every row has been added to db.Workers, which should be the fastest way to bulk insert. Second, I've set db.Configuration.AutoDetectChangesEnabled to false; the SaveChanges() call now takes 00:05:11.2273888 (in other words, about the same). I don't think this really matters, since every row is new and thus there are no changes to detect.
I think what I'm looking for is a way to issue a single INSERT statement containing all 16,000 SkillWorkers rows.
One easy method is by using the EntityFramework.BulkInsert extension.
You can then do:
// Add all workers to database
var workforce = allWorkers.Values
    .Select(i => new Worker
    {
        Reference = i.EMPLOYEE_REF,
        Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
        DefaultRegion = "wa",
        DefaultEfficiency = i.TECH_EFFICIENCY
    });

db.BulkInsert(workforce);
