How to aggregate millions of rows using EF Core - c#
I'm trying to aggregate approximately two million rows by user.
One user has several Transactions; each Transaction has a Platform and a TransactionType. I aggregate the Platform and TransactionType columns as json and save the result as a single row per user.
But my code is slow.
How can I improve the performance?
public static void AggregateTransactions()
{
    using (var db = new ApplicationDbContext())
    {
        db.ChangeTracker.AutoDetectChangesEnabled = false;

        //Get a list of users who have transactions
        var users = db.Transactions
            .Select(x => x.User)
            .Distinct();

        foreach (var user in users.ToList())
        {
            //Get all transactions for a particular user
            var _transactions = db.Transactions
                .Include(x => x.Platform)
                .Include(x => x.TransactionType)
                .Where(x => x.User == user)
                .ToList();

            //Aggregate Platforms from all transactions for user
            Dictionary<string, int> platforms = new Dictionary<string, int>();
            foreach (var item in _transactions.Select(x => x.Platform).GroupBy(x => x.Name).ToList())
            {
                platforms.Add(item.Key, item.Count());
            }

            //Aggregate TransactionTypes from all transactions for user
            Dictionary<string, int> transactionTypes = new Dictionary<string, int>();
            foreach (var item in _transactions.Select(x => x.TransactionType).GroupBy(x => x.Name).ToList())
            {
                transactionTypes.Add(item.Key, item.Count());
            }

            db.Add<TransactionByDay>(new TransactionByDay
            {
                User = user,
                Platforms = platforms,                 //The dictionary list is represented as json in table
                TransactionTypes = transactionTypes    //The dictionary list is represented as json in table
            });
            db.SaveChanges();
        }
    }
}
Update
So a basic view of the data would look like the following:
Transactions Data:

Id: b11c6b67-6c74-4bbe-f712-08d609af20cf,
UserId: 1,
PlatformId: 3,
TransactionTypeId: 1

Id: 4782803f-2f6b-4d99-f717-08d609af20cf,
UserId: 1,
PlatformId: 3,
TransactionTypeId: 4
Aggregated data as TransactionByDay:

Id: 9df41ef2-2fc8-441b-4a2f-08d609e21559,
UserId: 1,
Platforms: {"p3":2},
TransactionTypes: {"t1":1,"t4":1}

So in this case, two transactions are aggregated into one row. You can see that the platforms and transaction types are aggregated as json.
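For context, a Dictionary<string, int> property like Platforms can be stored as a json column in EF Core (2.1+) with a value converter. A minimal sketch of that mapping; the use of Newtonsoft.Json here is an assumption, not necessarily what the original project does:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    //Sketch only: serialize the dictionary to a json string on the way in,
    //and deserialize it on the way out (assumes Newtonsoft.Json).
    modelBuilder.Entity<TransactionByDay>()
        .Property(t => t.Platforms)
        .HasConversion(
            v => JsonConvert.SerializeObject(v),
            v => JsonConvert.DeserializeObject<Dictionary<string, int>>(v));

    modelBuilder.Entity<TransactionByDay>()
        .Property(t => t.TransactionTypes)
        .HasConversion(
            v => JsonConvert.SerializeObject(v),
            v => JsonConvert.DeserializeObject<Dictionary<string, int>>(v));
}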
You probably should not be calling db.SaveChanges() within the loop. Moving it outside the loop so the changes are persisted once may help.
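For illustration, a minimal sketch of that change, with the rest of the loop body unchanged:

foreach (var user in users.ToList())
{
    //...build the platforms and transactionTypes dictionaries as before...
    db.Add<TransactionByDay>(new TransactionByDay
    {
        User = user,
        Platforms = platforms,
        TransactionTypes = transactionTypes
    });
}
db.SaveChanges(); //persist all of the new rows once, after the loop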
But having said this, when dealing with large volumes of data where performance is key, I've found that ADO.NET is probably a better choice. This does not mean you have to stop using Entity Framework, but perhaps for this method you could use ADO.NET. If you go down this path you could either:
Create a stored procedure to return the data you need to work on, populate a DataTable, manipulate the data and then persist everything in bulk using SqlBulkCopy (a sketch of this follows after the two options).
Use a stored procedure to completely perform this operation. This avoids the need to shuttle the data to your application and the entire processing can happen within the database itself.
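For the first option, a minimal sketch of the bulk-persist step might look like the following. It assumes SQL Server with System.Data.SqlClient, and a destination table named dbo.TransactionByDay with UserId, Platforms and TransactionTypes columns (these names are assumptions); the DataTable would be built from whatever the stored procedure returns, with the dictionaries already serialized to json strings:

public static void BulkInsertAggregates(string connectionString, DataTable aggregatedRows)
{
    //aggregatedRows: one row per user, with Platforms/TransactionTypes as json strings
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.TransactionByDay"; //assumed table name
            bulkCopy.BatchSize = 5000;                              //tune for your workload

            //Map DataTable columns to destination columns (assumed names)
            bulkCopy.ColumnMappings.Add("UserId", "UserId");
            bulkCopy.ColumnMappings.Add("Platforms", "Platforms");
            bulkCopy.ColumnMappings.Add("TransactionTypes", "TransactionTypes");

            bulkCopy.WriteToServer(aggregatedRows);
        }
    }
}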
LINQ to EF is not built for speed (LINQ to SQL is easier and faster IMHO, or you could run direct SQL commands from EF). Anyway, I don't know how this would do speed-wise:
using (var db = new MyContext(connectionstring))
{
    var tbd = (from t in db.Transactions
               group t by t.User into g
               let platforms = g.GroupBy(tt => tt.Platform.Name)
               let trantypes = g.GroupBy(tt => tt.TransactionType.Name)
               select new
               {
                   User = g.Key,
                   Platforms = platforms,
                   TransactionTypes = trantypes
               }).ToList()
               .Select(u => new TransactionByDay
               {
                   User = u.User,
                   Platforms = u.Platforms.ToDictionary(tt => tt.Key, tt => tt.Count()),
                   TransactionTypes = u.TransactionTypes.ToDictionary(tt => tt.Key, tt => tt.Count())
               });
    //...
}
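If you go the direct-SQL route mentioned above, one option is to let the server do the grouping and only pull per-user counts back through the context's underlying connection (EF Core's relational GetDbConnection/OpenConnection extensions). A sketch only; the table and column names and types are assumptions:

using (var db = new ApplicationDbContext())
using (var command = db.Database.GetDbConnection().CreateCommand())
{
    command.CommandText = @"
        SELECT t.UserId, p.Name AS Platform, COUNT(*) AS Cnt
        FROM Transactions t
        JOIN Platforms p ON p.PlatformId = t.PlatformId
        GROUP BY t.UserId, p.Name";

    db.Database.OpenConnection();
    using (var reader = command.ExecuteReader())
    {
        //Build a Dictionary<string, int> per user from the (UserId, Platform, Cnt) rows
        var platformCountsByUser = new Dictionary<int, Dictionary<string, int>>();
        while (reader.Read())
        {
            var userId = reader.GetInt32(0);
            var platform = reader.GetString(1);
            var count = reader.GetInt32(2);

            if (!platformCountsByUser.TryGetValue(userId, out var counts))
            {
                counts = new Dictionary<string, int>();
                platformCountsByUser[userId] = counts;
            }
            counts[platform] = count;
        }
        //...repeat the same pattern for TransactionTypes, then save the TransactionByDay rows.
    }
}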
The idea is to do fewer queries and Includes by getting as much data as needed up front. There is no need to Include the Platform and TransactionType with every transaction when you can query them once into a Dictionary and look the data up. Furthermore, we can do our processing in parallel and then save all the data at once.
public static void AggregateTransactions()
{
    using (var db = new ApplicationDbContext())
    {
        db.ChangeTracker.AutoDetectChangesEnabled = false;

        //Get the transactions grouped by user
        var transactionsByUser = db.Transactions
            .GroupBy(x => x.User) //Not sure if EF Core supports this kind of grouping
            .ToList();

        //Query Platforms and TransactionTypes once and look them up by id
        var platforms = db.Platforms.ToDictionary(ks => ks.PlatformId);
        var transactionTypes = db.TransactionTypes.ToDictionary(ks => ks.TransactionTypeId);

        var bag = new ConcurrentBag<TransactionByDay>();

        Parallel.ForEach(transactionsByUser, userTransactions =>
        {
            //Aggregate Platforms from all transactions for user
            var userPlatforms = new Dictionary<string, int>(); //This can be converted to a ConcurrentDictionary
            //This can be converted to Parallel.ForEach
            foreach (var item in userTransactions.Select(x => platforms[x.PlatformId]).GroupBy(x => x.Name).ToList())
            {
                userPlatforms.Add(item.Key, item.Count());
            }

            //Aggregate TransactionTypes from all transactions for user
            var userTransactionTypes = new Dictionary<string, int>(); //This can be converted to a ConcurrentDictionary
            //This can be converted to Parallel.ForEach
            foreach (var item in userTransactions.Select(x => transactionTypes[x.TransactionTypeId]).GroupBy(x => x.Name).ToList())
            {
                userTransactionTypes.Add(item.Key, item.Count());
            }

            bag.Add(new TransactionByDay
            {
                User = userTransactions.Key,
                Platforms = userPlatforms,               //The dictionary list is represented as json in table
                TransactionTypes = userTransactionTypes  //The dictionary list is represented as json in table
            });
        });

        //Before calling this we may need to check the status of the Parallel.ForEach,
        //or just convert it back to a regular foreach loop if you see no benefit.
        db.AddRange(bag);
        db.SaveChanges();
    }
}
Variation #2
public static void AggregateTransactions()
{
    using (var db = new ApplicationDbContext())
    {
        db.ChangeTracker.AutoDetectChangesEnabled = false;

        //Get a list of users who have transactions
        var users = db.Transactions
            .Select(x => x.User)
            .Distinct().ToList();

        var platforms = db.Platforms.ToDictionary(ks => ks.PlatformId);
        var transactionTypes = db.TransactionTypes.ToDictionary(ks => ks.TransactionTypeId);

        var bag = new ConcurrentBag<TransactionByDay>();

        Parallel.ForEach(users, user =>
        {
            //Note: a DbContext instance is not thread-safe, so querying through the shared
            //context inside Parallel.ForEach would need a context per iteration (or load
            //all transactions up front) in real code.
            var _transactions = db.Transactions
                .Where(x => x.User == user)
                .ToList();

            //Aggregate Platforms and TransactionTypes from all transactions for user
            Dictionary<string, int> userPlatforms = new Dictionary<string, int>();
            Dictionary<string, int> userTransactions = new Dictionary<string, int>();

            foreach (var transaction in _transactions)
            {
                if (platforms.TryGetValue(transaction.PlatformId, out var platform))
                {
                    if (userPlatforms.TryGetValue(platform.Name, out var tmp))
                    {
                        userPlatforms[platform.Name] = tmp + 1;
                    }
                    else
                    {
                        userPlatforms.Add(platform.Name, 1);
                    }
                }
                if (transactionTypes.TryGetValue(transaction.TransactionTypeId, out var type))
                {
                    if (userTransactions.TryGetValue(type.Name, out var tmp))
                    {
                        userTransactions[type.Name] = tmp + 1;
                    }
                    else
                    {
                        userTransactions.Add(type.Name, 1);
                    }
                }
            }

            bag.Add(new TransactionByDay
            {
                User = user,
                Platforms = userPlatforms,           //The dictionary list is represented as json in table
                TransactionTypes = userTransactions  //The dictionary list is represented as json in table
            });
        });

        db.AddRange(bag);
        db.SaveChanges();
    }
}
Related
Build Dictionary with LINQ
Let's say we have a variable 'data' which is a list of Id's and Child Id's:

var data = new List<Data>
{
    new() { Id = 1, ChildIds = new List<int> {123, 234, 345} },
    new() { Id = 1, ChildIds = new List<int> {123, 234, 345} },
    new() { Id = 2, ChildIds = new List<int> {678, 789} },
};

I would like to have a dictionary with ChildId's and the related Id's. If the ChildId is already in the dictionary, it should overwrite with the new Id. Currently I have this code:

var dict = new Dictionary<int, int>();
foreach (var dataItem in data)
{
    foreach (var child in dataItem.ChildIds)
    {
        dict[child] = dataItem.Id;
    }
}

This works fine, but I don't like the fact that I am using two loops. I prefer to use Linq ToDictionary to build up the dictionary in a Functional way. What is the best way to build up the dictionary by using Linq? Why? I prefer functional code over mutating a state. Besides that, I was just curious how to build up the dictionary by using Linq ;-)
In this case your foreach approach is both readable and efficient, so even if I'm a fan of LINQ, I would use that. The loop has the bonus that you can debug it easily or add logging if necessary (for example for invalid ids). However, if you want to use LINQ, I would probably use SelectMany and ToLookup. The former is used to flatten child collections like this ChildIds and the latter is used to create a collection which is very similar to your dictionary. But one difference is that it allows duplicate keys; you get multiple values in that case:

ILookup<int, int> idLookup = data
    .SelectMany(d => d.ChildIds.Select(c => (Id: d.Id, ChildId: c)))
    .ToLookup(x => x.ChildId, x => x.Id);

Now you already have everything you need, since it can be used like a dictionary with the same lookup performance. If you wanted to create that dictionary anyway, you can use:

Dictionary<int, int> dict = idLookup.ToDictionary(x => x.Key, x => x.First());

If you want to override duplicates with the new Id, as mentioned, simply use Last(). .NET Fiddle: https://dotnetfiddle.net/mUBZPi
The SelectMany LINQ operator actually has a few less known overloads. One of these takes a result selector, which is a perfect fit for your scenario. Following is an example code snippet to turn that into a dictionary. Note that I had to use Distinct, since you had two entries with Id 1 that contain duplicated child ids, which would pose problems for a dictionary.

void Main()
{
    // Get the data
    var list = GetData();

    // Turn it into a dictionary
    var dict = list
        .SelectMany(d => d.ChildIds, (data, childId) => new { data.Id, childId })
        .Distinct()
        .ToDictionary(x => x.childId, x => x.Id);

    // Show the content of the dictionary
    dict.Keys
        .ToList()
        .ForEach(k => Console.WriteLine($"{k} {dict[k]}"));
}

public List<Data> GetData()
{
    return new List<Data>
    {
        new Data { Id = 1, ChildIds = new List<int> {123, 234, 345} },
        new Data { Id = 1, ChildIds = new List<int> {123, 234, 345} },
        new Data { Id = 2, ChildIds = new List<int> {678, 789} },
    };
}

public class Data
{
    public int Id { get; set; }
    public List<int> ChildIds { get; set; }
}
The approach is to create pairs of each combination of Id and ChildId, and build a dictionary of these:

var list = new List<(int Id, int[] ChildIds)>()
{
    (1, new []{10, 11}),
    (2, new []{11, 12})
};

var result = list
    .SelectMany(pair => pair.ChildIds.Select(childId => (childId, pair.Id)))
    .ToDictionary(p => p.childId, p => p.Id);

ToDictionary will throw if there are duplicate keys, to avoid this you can look at this answer and create your own ToDictionary:

public static Dictionary<K, V> ToDictionaryOverWriting<TSource, K, V>(
    this IEnumerable<TSource> source,
    Func<TSource, K> keySelector,
    Func<TSource, V> valueSelector)
{
    Dictionary<K, V> output = new Dictionary<K, V>();
    foreach (TSource item in source)
    {
        output[keySelector(item)] = valueSelector(item);
    }
    return output;
}
With LINQ you can achieve the result like this:

Dictionary<int, int> dict = (from item in data
                             from childId in item.ChildIds
                             select new { item.Id, childId }
                            ).Distinct()
                             .ToDictionary(kv => kv.childId, kv => kv.Id);

Update: a version fully compatible with the foreach loop would use group by with Last() instead of Distinct():

Dictionary<int, int> dict2 = (from item in data
                              from childId in item.ChildIds
                              group new { item.Id, childId } by childId into g
                              select g.Last()
                             ).ToDictionary(kv => kv.childId, kv => kv.Id);

As some already pointed out, depending on the order of input elements does not feel "functional", and the LINQ expression becomes more convoluted than the original foreach loop.
There is an overload of SelectMany which not only flattens the collection but also allows you to have any form of result:

var all = data.SelectMany(
    d => d.ChildIds,                      //collectionSelector
    (d, ChildId) => new { d.Id, ChildId } //resultSelector
);

Now if you want to transform all into a Dictionary, you have to remove the duplicate ChildIds first. You can use GroupBy as below, and then pick the last item from each group (as you stated in your question you want to overwrite Ids as you go). The key of your dictionary should also be unique, i.e. the ChildId:

var dict = all.GroupBy(x => x.ChildId)
    .Select(x => x.Last())
    .ToDictionary(x => x.ChildId, x => x.Id);

Or you can write a new class with IEquatable<> implemented and use it as the return type of the resultSelector (instead of new { d.Id, ChildId }). Then write all.Reverse().Distinct().ToDictionary(x => x.ChildId); so it would detect duplicates based on your own implementation of the Equals method. Reverse, because you said you want the last occurrence of the duplicates.
Get items from input list that do not exist in the database using EF Core 2.1
I want to get items from an input list that do not exist in the database table. I pass a list of IDs, then I want to return those IDs that do not exist in my table. This is what I have for now:

var input = new List<string>(); // list of Ids, for example count of 10

var itemsThatExistInDb = await DbContext.Set<Data>() // the table has 100k+ records,
    .AsQueryable()                                   // so I can't use a simple !Contains()
    .Where(x => input.Contains(x.Id))
    .Select(x => x.Id)
    .ToListAsync();

var itemsThatNotExistInDb = input.Except(itemsThatExistInDb).ToList();

How do I write a query in EF Core 2.1 to get the items from my input list that do not exist in my database, without using LINQ extensions like Except()? If it's possible, I want to get those Ids straight from my database query through the DbContext.
If you don't want to use !Contains you can use a logical method as below, where dataTable is the content of your table:

List<string> fullList = new List<string>() { "3", "1", "2" }; //or any dynamic list content or class list based on requirement
List<string> itemsThatNotExistInDb = new List<string>();      //new container list, same as fullList

foreach (var item in fullList)
{
    var isRemoved = dataTable.RemoveAll(a => a.Id == item) == 1 ? true : false;
    if (isRemoved)
    {
        itemsThatNotExistInDb.Add(item);
    }
}
I think you can query it like this:

var input = new List<string>(); // list of Ids
List<string> itemsThatExist = new List<string>();
List<string> itemsThatNotExist = new List<string>(input);

var allItems = await DbContext.Set<Data>()
    .AsQueryable()
    .ToDictionaryAsync(_ => _.ID, _ => _);

foreach (var item in input)
{
    if (allItems.ContainsKey(item))
    {
        itemsThatExist.Add(item);
        itemsThatNotExist.Remove(item);
    }
}
Dictionary change of order
I have a dictionary:

Dictionary<int, string> inboxMessages = new Dictionary<int, string>();

This dictionary contains messages with their own unique ID (the newer the message, the higher the ID). I put the messages in a picker (Xamarin) but it shows the oldest messages first. How can I change this?

The Picker:

inboxPicker = new Picker
{
    WidthRequest = 320,
};

foreach (string inboxMessage in inboxMessages.Values)
{
    inboxPicker.Items.Add(inboxMessage);
}

How I get my messages:

private async Task getMessages()
{
    await Task.Run(async () =>
    {
        MailModel[] mails = await api.GetMails(App.userInfo.user_id);
        foreach (MailModel mail in mails)
        {
            inboxMessages.Add(mail.message_id, mail.sender_user_id + " " + mail.subject + " " + mail.time_send);
        }
    });
}
The Values property of a dictionary is not ordered. Quote from the documentation:

The order of the values in the Dictionary<TKey, TValue>.ValueCollection is unspecified [...]

If you want to retrieve the values in some specific order, you need to sort it yourself. For example:

var sorted = inboxMessages.OrderByDescending(kv => kv.Key).Select(kv => kv.Value);

foreach (string inboxMessage in sorted)
{
    inboxPicker.Items.Add(inboxMessage);
}

This retrieves the KeyValuePairs from the dictionary, sorts them descending on their int key and then returns an enumeration of the values.
You should sort the dictionary entries while you still have access to their keys:

foreach (string inboxMessage in inboxMessages
    .OrderByDescending(m => m.Key)
    .Select(m => m.Value))
{
    inboxPicker.Items.Add(inboxMessage);
}
Improving conversion from List to List<Dictionary<string,string>> with Linq
I have a Key/Value table in my DB and I would like to return a List of Dictionary. The following code works fine for me, but with a lot of data it does not perform well. Note: r.name doesn't contain unique values.

List<Dictionary<string, string>> listOutput = null;

using (ExampleDB db = new ExampleDB())
{
    var result = (from r in db.FormField
                  where r.Form_id == 1
                  select new { r.ResponseId, r.name, r.value }).ToList();

    listOutput = new List<Dictionary<string, string>>();
    foreach (var element in result)
    {
        listOutput.Add((from x in result
                        where x.ResponseId == element.ResponseId
                        select x).ToDictionary(x => x.name, x => x.value));
    }
}
return listOutput;

Do you have suggestions on how to improve this code?
I suspect you want something like:

List<Dictionary<string, string>> result;
using (var db = new ExampleDB())
{
    result = db.FormField
        .Where(r => r.Form_id == 1)
        .GroupBy(r => r.ResponseId, r => new { r.name, r.value })
        .AsEnumerable()
        .Select(g => g.ToDictionary(p => p.name, p => p.value))
        .ToList();
}

In other words, we're filtering so that r.Form_id == 1, then grouping by ResponseId... taking all the name/value pairs associated with each ID and creating a dictionary from those name/value pairs. Note that you're losing the ResponseId in the list of dictionaries - you can't tell which dictionary corresponds to which response ID. The AsEnumerable part is to make sure that the last Select is performed using LINQ to Objects, rather than trying to convert it into SQL. It's possible that it would work without the AsEnumerable, but it will depend on your provider at the very least.
From what I gather you're trying to create a list of Key/Value pairs based on each ResponseId. Try GroupBy:

var output = result.GroupBy(r => r.ResponseId)
    .Select(r => r.ToDictionary(s => s.Name, s => s.Value));

This will return an IEnumerable<Dictionary<string,string>>, which you can ToList if you actually need a list.
Linq query with SortedList<int, List<int>>
I am currently using LINQ to retrieve a list of distinct values from my data table. I am then looping through that list and calling a second LINQ query to retrieve a list of values for each value in the first list.

_keyList = new SortedList<int, List<int>>();

var AUGroupList = ProcessSummaryData.AsEnumerable()
    .Select(x => x.Field<int>("AUGroupID"))
    .Distinct()
    .ToList<int>();

foreach (var au in AUGroupList)
{
    var AUList = ProcessSummaryData.AsEnumerable()
        .Where(x => x.Field<int>("AUGroupID") == au)
        .Select(x => x.Field<int>("ActivityUnitID"))
        .ToList<int>();

    _keyList.Add(au, AUList);
}

I am adding each value to a sorted list along with the corresponding second list. How can I combine the above two queries into one LINQ query so that I don't have to call them separately?
You should be able to do something like:

var groupQuery = from d in ProcessSummaryData.AsEnumerable()
                 group d by new { Key = d.Field<int>("AUGroupID") } into g
                 select new
                 {
                     GroupID = g.Key,
                     Values = g.Distinct().ToList()
                 };

Then you can loop through groupQuery and populate the sorted list. The GroupID property will contain the group id, and the Values property will have a distinct list of values.
Have you tried this?

var _keyList = new SortedList<int, List<int>>();

var AUGroupList = ProcessSummaryData.AsEnumerable()
    .GroupBy(x => x.Field<int>("AUGroupID"))
    .ToList();

foreach (var group in AUGroupList)
{
    _keyList.Add(group.Key, group.Select(x => x.Field<int>("ActivityUnitID")).ToList());
}

Your provider should cope with that; if not, there are a few other ways.