Data not being inserted into Cosmos DB in C#

I have a list of items (32007 of them) that I am adding in bulk. However, it seems as though some are not being inserted. The even stranger thing is that if I run my process several times from scratch (i.e. recreating my collection), the number of items created varies; I have seen 32005 and 32003.
I have scaled my collection up to a lot of RUs (autoscale, 25000 max), and I split the data into tranches of 100.
My logic is below:
public async Task<List<Account>> ProcessAccountsAsync(List<Account> accounts)
{
    var cosmosConnection = await ConnectToDatabaseAsync().ConfigureAwait(false);
    var failedAccounts = new List<Account>();
    var accountsToInsert = new Dictionary<PartitionKey, Stream>(accounts.Count);

    Parallel.ForEach(accounts, (account) =>
    {
        var stream = new MemoryStream();
        var json = JsonConvert.SerializeObject(account, JsonHelper.DefaultSettings());
        stream.Write(Encoding.Default.GetBytes(json));
        stream.Position = 0;
        accountsToInsert.Add(new PartitionKey(account.Id), stream);
    });

    var tasks = new List<Task>(accounts.Count);
    foreach (var account in accountsToInsert)
    {
        tasks.Add(cosmosConnection.Container.CreateItemStreamAsync(account.Value, account.Key)
            .ContinueWith((Task<ResponseMessage> task) =>
            {
                using (var response = task.Result)
                {
                    if (!response.IsSuccessStatusCode)
                    {
                        var actualAccount = accounts.FirstOrDefault(x => account.Key.ToString().Contains(x.Id));
                        Debug.WriteLine($"Processing Account : {actualAccount?.ArcContactId} Received {response.StatusCode} ({response.ErrorMessage}).");
                        failedAccounts.Add(actualAccount);
                    }
                }
            }));
    }
    await Task.WhenAll(tasks);
    return failedAccounts;
}
In my calling logic I retry all accounts that have failed:
var tidiedJson = JsonConvert.SerializeObject(list, Formatting.Indented);
accounts = JsonConvert.DeserializeObject<List<DomainModels.Versions.v2.Account>>(tidiedJson);
var failedAccounts = await cosmosAccountRepository.ProcessAccountsAsync(accounts);
while (failedAccounts.Count > 0)
{
    failedAccounts = await cosmosAccountRepository.ProcessAccountsAsync(failedAccounts);
}
I have no idea why accounts are not being inserted and why the behaviour is so random!
I have tried this with a variety of throughput settings and tranche sizes; it makes no difference.
Can anyone see anything obvious?
Failing this: the accounts have an ID field, so is there a way of finding out which of the accounts are not in the database without having to go through all 32007 of them one by one, which is obviously not a good plan?
Paul

Related

Deduplicate stream records

I'm using Redis to stream data. I have multiple producer instances producing the same data, aiming for eventual consistency.
Right now the producers generate trades with random trade ids between 1 and 2. I want a deduplication service, or something that filters out the duplicates based on the trade id. How do I do that?
Consumer
using System.Text.Json;
using Shared;
using StackExchange.Redis;
var tokenSource = new CancellationTokenSource();
var token = tokenSource.Token;
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
var db = muxer.GetDatabase();
const string streamName = "positions";
const string groupName = "avg";
if (!await db.KeyExistsAsync(streamName) ||
    (await db.StreamGroupInfoAsync(streamName)).All(x => x.Name != groupName))
{
    await db.StreamCreateConsumerGroupAsync(streamName, groupName, "0-0");
}

var consumerGroupReadTask = Task.Run(async () =>
{
    var id = string.Empty;
    while (!token.IsCancellationRequested)
    {
        if (!string.IsNullOrEmpty(id))
        {
            await db.StreamAcknowledgeAsync(streamName, groupName, id);
            id = string.Empty;
        }
        var result = await db.StreamReadGroupAsync(streamName, groupName, "avg-1", ">", 1);
        if (result.Any())
        {
            id = result.First().Id;
            var dict = ParseResult(result.First());
            var trade = JsonSerializer.Deserialize<Trade>(dict["trade"]);
            Console.WriteLine($"Group read result: trade: {dict["trade"]}, time: {dict["time"]}");
        }
        await Task.Delay(1000);
    }
});

Console.ReadLine();

static Dictionary<string, string> ParseResult(StreamEntry entry)
{
    return entry.Values.ToDictionary(x => x.Name.ToString(), x => x.Value.ToString());
}
Producer
using System.Text.Json;
using Shared;
using StackExchange.Redis;
var tokenSource = new CancellationTokenSource();
var token = tokenSource.Token;
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
var db = muxer.GetDatabase();
const string streamName = "positions";
var producerTask = Task.Run(async () =>
{
    var random = new Random();
    while (!token.IsCancellationRequested)
    {
        var trade = new Trade(random.Next(1, 3), "btcusdt", 25000, 2);
        var entry = new List<NameValueEntry>
        {
            new("trade", JsonSerializer.Serialize(trade)),
            new("time", DateTimeOffset.Now.ToUnixTimeSeconds())
        };
        await db.StreamAddAsync(streamName, entry.ToArray());
        await Task.Delay(2000);
    }
});
Console.ReadLine();
You can use a couple of tactics here, depending on the level of distribution required and the degree to which you can handle missing messages coming in from your stream. Here are a couple of workable solutions using Redis:
Use Bloom filters when you can tolerate a ~1% miss in events
You can use a Bloom filter in Redis, which is a very compact, very fast way to determine whether a particular record has already been recorded. If you run:
var hasBeenAdded = ((int)await db.ExecuteAsync("BF.ADD", "bf:trades", dict["trade"])) == 1;
If hasBeenAdded is true, you can definitively say that the record is not a duplicate; if it is false, it is almost certainly a duplicate, with a small false-positive probability that depends on how you set up the Bloom filter with BF.RESERVE.
If you want to use a Bloom filter, you'll need to either side-load RedisBloom into your instance of Redis, or just use Redis Stack.
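For illustration, a minimal sketch of how this could slot into the consumer above (the key name, error rate and capacity are arbitrary examples; note that BF.RESERVE fails if the key already exists, so only run it once, e.g. at startup):

var bfKey = "bf:trades";
// reserve the filter once: 1% error rate, ~100k expected items
await db.ExecuteAsync("BF.RESERVE", bfKey, "0.01", "100000");

// inside the consumer loop, before including the trade in the average
var isNew = (int)await db.ExecuteAsync("BF.ADD", bfKey, dict["trade"]) == 1;
if (isNew)
{
    // first time this trade has been seen, safe to process it
}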
Use a Sorted Set when misses aren't acceptable
If your app cannot tolerate a miss, you are probably wiser to use a Set or a Sorted Set; in general I'd advise the Sorted Set here, because it is much easier to clean up.
Basically, with a sorted set you check whether a record has already been counted in your average by running ZSCORE zset:trades trade-id; if a score comes back, you know the record has been used already, otherwise you add it to the sorted set. Importantly, because the sorted set grows linearly, you will want to clean it up periodically: if you use the timestamp from the message id as the score, you can likely determine some workable interval to go back and run ZREMRANGEBYSCORE to clear out older records.
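A minimal sketch of that approach with StackExchange.Redis, assuming the consumer code above and that the Trade type exposes the trade id as trade.Id (the key name and the one-hour retention window are arbitrary examples):

// ZSCORE: has this trade id been counted already?
var score = await db.SortedSetScoreAsync("zset:trades", trade.Id);
if (score is null)
{
    // not seen before: record it with the current timestamp as the score, then process it
    await db.SortedSetAddAsync("zset:trades", trade.Id, DateTimeOffset.Now.ToUnixTimeSeconds());
}

// periodically: ZREMRANGEBYSCORE to drop entries older than the retention window
var cutoff = DateTimeOffset.Now.AddHours(-1).ToUnixTimeSeconds();
await db.SortedSetRemoveRangeByScoreAsync("zset:trades", double.NegativeInfinity, cutoff);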

Report when input to first dataflow block finishes all linked blocks

I am using TPL Dataflow to download data from a ticketing system.
The system takes the ticket number as the input, calls an API and receives a nested JSON response with various information. Once received, a set of blocks handles each level of the nested structure and writes it to a relational database. e.g. Conversation, Conversation Attachments, Users, User Photos, User Tags, etc
Json
{
    "conversations": [
        {
            "id": 12345,
            "author_id": 23456,
            "body": "First Conversation"
        },
        {
            "id": 98765,
            "authorid": 34567,
            "body": "Second Conversation",
            "attachments": [
                {
                    "attachmentid": 12345,
                    "attachment_name": "Test.jpg"
                }
            ]
        }
    ],
    "users": [
        {
            "userid": 12345,
            "user_name": "John Smith"
        },
        {
            "userid": 34567,
            "user_name": "Joe Bloggs",
            "user_photo": {
                "photoid": 44556,
                "photo_name": "headshot.jpg"
            },
            "tags": [
                "development",
                "vip"
            ]
        }
    ]
}
Code
Some blocks need to broadcast so that deeper nesting can still have access to the data. e.g. UserModelJson is broadcast so that 1 block can handle writing the user, 1 block can handle writing the User Tags and 1 block can handle writing the User Photos.
var loadTicketsBlock = new TransformBlock<int, ConversationsModelJson>(async ticketNumber => await p.GetConversationObjectFromTicket(ticketNumber));
var broadcastConversationsObjectBlock = new BroadcastBlock<ConversationsModelJson>(conversations => conversations);
//Conversation
var getConversationsFromConversationObjectBlock = new TransformManyBlock<ConversationsModelJson, ConversationModelJson>(conversation => ModelConverter.ConvertConversationsObjectJsonToConversationJson(conversation));
var convertConversationsBlock = new TransformBlock<ConversationModelJson, ConversationModel>(conversation => ModelConverter.ConvertConversationJsonToConversation(conversation));
var batchConversionBlock = new BatchBlock<ConversationModel>(batchBlockSize);
var convertConversationsToDTBlock = new TransformBlock<IEnumerable<ConversationModel>, DataTable>(conversations => ModelConverter.ConvertConversationModelToConversationDT(conversations));
var writeConversationsBlock = new ActionBlock<DataTable>(async conversations => await p.ProcessConversationsAsync(conversations));
var getUsersFromConversationsBlock = new TransformManyBlock<ConversationsModelJson, UserModelJson>(conversations => ModelConverter.ConvertConversationsJsonToUsersJson(conversations));
var broadcastUserBlock = new BroadcastBlock<UserModelJson>(userModelJson => userModelJson);
//User
var convertUsersBlock = new TransformBlock<UserModelJson, UserModel>(user => ModelConverter.ConvertUserJsonToUser(user));
var batchUsersBlock = new BatchBlock<UserModel>(batchBlockSize);
var convertUsersToDTBlock = new TransformBlock<IEnumerable<UserModel>, DataTable>(users => ModelConverter.ConvertUserModelToUserDT(users));
var writeUsersBlock = new ActionBlock<DataTable>(async users => await p.ProcessUsersAsync(users));
//UserTag
var getUserTagsFromUserBlock = new TransformBlock<UserModelJson, UserTagModel>(user => ModelConverter.ConvertUserJsonToUserTag(user));
var batchTagsBlock = new BatchBlock<UserTagModel>(batchBlockSize);
var convertTagsToDTBlock = new TransformBlock<IEnumerable<UserTagModel>, DataTable>(tags => ModelConverter.ConvertUserTagModelToUserTagDT(tags));
var writeTagsBlock = new ActionBlock<DataTable>(async tags => await p.ProcessUserTagsAsync(tags));
DataflowLinkOptions linkOptions = new DataflowLinkOptions()
{
PropagateCompletion = true
};
loadTicketsBlock.LinkTo(broadcastConversationsObjectBlock, linkOptions);
//Conversation
broadcastConversationsObjectBlock.LinkTo(getConversationsFromConversationObjectBlock, linkOptions);
getConversationsFromConversationObjectBlock.LinkTo(convertConversationsBlock, linkOptions);
convertConversationsBlock.LinkTo(batchConversionBlock, linkOptions);
batchConversionBlock.LinkTo(convertConversationsToDTBlock, linkOptions);
convertConversationsToDTBlock.LinkTo(writeConversationsBlock, linkOptions);
var tickets = await provider.GetAllTicketsAsync();
foreach (var ticket in tickets)
{
    cts.Token.ThrowIfCancellationRequested();
    await loadTicketsBlock.SendAsync(ticket.TicketID);
}
loadTicketsBlock.Complete();
The LinkTo blocks are repeated for each type of data to be written.
I know when the whole pipeline is complete by using
await Task.WhenAll(<Last block of each branch>.Completion);
but if I pass ticket number 1 into the loadTicketsBlock block, how do I know when that specific ticket has been through all the blocks in the pipeline and is therefore complete?
The reason that I want to know this is so that I can report to the UI that ticket 1 of 100 is complete.
You could consider using the TaskCompletionSource as the base class for all your sub-entities. For example:
class Attachment : TaskCompletionSource
{
}
class Conversation : TaskCompletionSource
{
}
Then every time you insert a sub-entity in the database, you mark it as completed:
attachment.SetResult();
...or if the insert fails, mark it as faulted:
attachment.SetException(ex);
Finally you can combine all the asynchronous completions in one, with the method Task.WhenAll:
Task ticketCompletion = Task.WhenAll(Enumerable.Empty<Task>()
    .Append(ticket.Task)
    .Concat(attachments.Select(e => e.Task))
    .Concat(conversations.Select(e => e.Task)));
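As a rough sketch of how the completion could be set from inside the pipeline itself, assuming the Attachment class above, the parameterless SetResult of the non-generic TaskCompletionSource (.NET 5+), and a hypothetical WriteAttachmentsAsync helper that performs the actual insert:

var writeAttachmentsBlock = new ActionBlock<Attachment[]>(async attachments =>
{
    try
    {
        await WriteAttachmentsAsync(attachments);          // your database insert
        foreach (var a in attachments) a.SetResult();      // signal success per entity
    }
    catch (Exception ex)
    {
        foreach (var a in attachments) a.SetException(ex); // fault the entities on failure
    }
});

// elsewhere: await ticketCompletion to know when ticket 1 has fully landed in the database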
If I am tracking progress in Dataflow, I will usually set up the last block as a "notify the UI of progress" block. To be able to track the progress of your inputs, you need to keep the context of the original input in all the objects you pass around; in this case you need to be able to tell that you are working on ticket 1 all the way through your pipeline. If one of your transforms drops the context that it is working on ticket 1, you will need to rethink the object types you pass through the pipeline so that the context is preserved.
A simple example of what I'm talking about is laid out below with a broadcast block going to three transform blocks, and then all three transform blocks going back to an action block that notifies about the progress of the pipelines.
When combining back into the single action block, make sure not to propagate completion at that point, because as soon as one block propagates completion to the action block, that action block stops accepting input. Instead, wait for the last block of each pipeline to complete, and only then manually complete your final notify-the-UI action block.
using System;
using System.Threading.Tasks.Dataflow;
using System.Threading.Tasks;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var broadcastBlock = new BroadcastBlock<string>(x => x);
        var transformBlockA = new TransformBlock<string, string>(x =>
        {
            return x + "A";
        });
        var transformBlockB = new TransformBlock<string, string>(x =>
        {
            return x + "B";
        });
        var transformBlockC = new TransformBlock<string, string>(x =>
        {
            return x + "C";
        });

        var ticketTracking = new Dictionary<int, List<string>>();
        var notifyUiBlock = new ActionBlock<string>(x =>
        {
            var ticketNumber = int.Parse(x.Substring(5, 1));
            var taskLetter = x.Substring(7, 1);
            var success = ticketTracking.TryGetValue(ticketNumber, out var tasksComplete);
            if (!success)
            {
                tasksComplete = new List<string>();
                ticketTracking[ticketNumber] = tasksComplete;
            }
            tasksComplete.Add(taskLetter);
            if (tasksComplete.Count == 3)
            {
                Console.WriteLine($"Ticket {ticketNumber} is complete");
            }
        });

        DataflowLinkOptions linkOptions = new DataflowLinkOptions() { PropagateCompletion = true };
        broadcastBlock.LinkTo(transformBlockA, linkOptions);
        broadcastBlock.LinkTo(transformBlockB, linkOptions);
        broadcastBlock.LinkTo(transformBlockC, linkOptions);
        transformBlockA.LinkTo(notifyUiBlock);
        transformBlockB.LinkTo(notifyUiBlock);
        transformBlockC.LinkTo(notifyUiBlock);

        for (var i = 0; i < 5; i++)
        {
            broadcastBlock.Post($"Task {i} ");
        }
        broadcastBlock.Complete();

        Task.WhenAll(transformBlockA.Completion, transformBlockB.Completion, transformBlockC.Completion).Wait();
        notifyUiBlock.Complete();
        notifyUiBlock.Completion.Wait();
        Console.WriteLine("Done");
    }
}
This will give output similar to this:
Ticket 0 is complete
Ticket 1 is complete
Ticket 2 is complete
Ticket 3 is complete
Ticket 4 is complete
Done

Using Parallel.ForEach to create multiple requests in parallel and put them in a list

So I had to create dozens of API requests, get the JSON back, turn it into objects and put them in a list.
I also wanted the requests to be parallel because I do not care about the order in which the objects enter the list.
public ConcurrentBag<myModel> GetlistOfDstAsync()
{
    var req = new RequestGenerator();
    var InitializedObjects = req.GetInitializedObjects();
    var myList = new ConcurrentBag<myModel>();

    Parallel.ForEach(InitializedObjects, async item =>
    {
        RestRequest request = new RestRequest("resource", Method.GET);
        request.AddQueryParameter("key", item.valueOne);
        request.AddQueryParameter("key", item.valueTwo);
        var results = await GetAsync<myModel>(request);
        myList.Add(results);
    });

    return myList;
}
This creates a new problem: I do not understand how to get the results into the list, and it seems I am not using ConcurrentBag in the way it is meant to be used.
Is my assumption correct and I am just implementing it wrong, or should I use another solution?
I also wanted the requests to be parallel
What you actually want is concurrent requests. Parallel.ForEach does not work as expected with async: the async lambdas become async void, so the loop does not wait for them to finish.
To do asynchronous concurrency, you start each request but do not await the tasks yet. Then, you can await all the tasks together and get the responses using Task.WhenAll:
public async Task<myModel[]> GetlistOfDstAsync()
{
    var req = new RequestGenerator();
    var InitializedObjects = req.GetInitializedObjects();

    var tasks = InitializedObjects.Select(async item =>
    {
        RestRequest request = new RestRequest("resource", Method.GET);
        request.AddQueryParameter("key", item.valueOne);
        request.AddQueryParameter("key", item.valueTwo);
        return await GetAsync<myModel>(request);
    }).ToList();

    var results = await Task.WhenAll(tasks);
    return results;
}
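If firing all the requests at once is too aggressive for the API, one way to cap the concurrency is a SemaphoreSlim. A minimal sketch based on the same assumed types as above (the limit of 10 is an arbitrary example):

var throttler = new SemaphoreSlim(10);   // at most 10 requests in flight
var tasks = InitializedObjects.Select(async item =>
{
    await throttler.WaitAsync();
    try
    {
        RestRequest request = new RestRequest("resource", Method.GET);
        request.AddQueryParameter("key", item.valueOne);
        request.AddQueryParameter("key", item.valueTwo);
        return await GetAsync<myModel>(request);
    }
    finally
    {
        throttler.Release();
    }
}).ToList();
var results = await Task.WhenAll(tasks);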

How to read large number of records from SQL table in batches using Task Parallel library

I have a SQL table called Accounts in SQL Server where the data is sorted by date. The table has 20 columns, including AccountId.
I want to read the records for each day (around 200K records per day), and I need to do this for 6 months of data.
So I planned my code to fetch the data from the SQL Server Accounts table inside a do-while loop, one day at a time.
Each day means fetching around 200K records from the database, so I break one day's records into batches (say 10000 or 20000 records per read, which makes roughly 10 to 20 batches for one day). Once I have a batch of 10K or 20K records, I want to convert those values into a csv file and export the csv file to a location.
Now, my problem is that this procedure is taking too much time (around 50 minutes to fetch the records for one day, and I need to fetch the records for 6 months, so you can imagine how much time it will take).
I am thinking of using the TPL to break the code and processing into tasks, but I am not sure how to go about it.
Please suggest how I can use the Task Parallel Library to improve the performance so that I can get through the 6 months of data.
My C# code looks like below:
public void Main()
{
    do
    {
        done = true;
        var accountsTableRecords = ReadsDatabaseForADay(lastId);
        foreach (var accountsHistory in accountsTableRecords)
        {
            if (accountsHistory.accountsId != null) lastId = (long)accountsHistory.accountsId;
            done = false;
            recordCount++;
        }
        var flatFileDataList = ProcessRecords();
    } while (!done);
}
The ProcessRecords method called in Main() above parses some xml and converts the fetched data into csv.
private IEnumerable<AccountsTable> ReadsDatabaseForADay(long lastId)
{
    var transactionDataRecords = DatabaseHelper.GetTransactions(lastId, 10000);
    var accountsTableData = transactionDataRecords as IList<AccountsTable> ?? transactionDataRecords.ToList();
    ListAccountsTable.AddRange(accountsTableData);
    return accountsTableData;
}
DatabaseHelperClass:
internal static IEnumerable<AccountsTable> GetTransactions(long lastTransactionId, int count)
{
    const string sql = "SELECT TOP(@count) [accounts_id],[some_columns],[some_other_columns]. .....[all_other_columns] "
        + "FROM AccountsTable WHERE [accounts_id] > @LastTransactionId AND [creation_dt] > DATEADD(DAY,-1, GETDATE())" +
        " ORDER BY [accounts_id]";
    return GetData(connection =>
    {
        var command = new SqlCommand(sql, connection);
        command.Parameters.AddWithValue("@count", count);
        command.Parameters.AddWithValue("@LastTransactionId", lastTransactionId);
        return command;
    }, DataRecordToTransactionHistory);
}
private static IEnumerable<T> GetData<T>(Func<SqlConnection, SqlCommand> commandBuilder, Func<IDataRecord, T> dataFunc)
{
    using (var connection = GetOpenConnection())
    {
        using (var command = commandBuilder(connection))
        {
            command.CommandTimeout = 0;
            using (IDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    var record = dataFunc(reader);
                    yield return record;
                }
            }
        }
    }
}
Your code has some issues in it which may lead to poor performance. First of all, you're creating a list from the IEnumerable each time you get another batch of account rows. List<T> uses an array internally, and large backing arrays (roughly 85 KB and up) end up on the large object heap, which isn't collected frequently, so over time you'll run into memory issues and a performance drawback.
The second issue is that your code is highly I/O oriented, but you do not use the async methods, so your thread is blocked waiting for the database to respond. You should switch to the async versions and use blocking calls only at the Main entry point. In that case, however, you won't be able to yield your results, because an async Task<IEnumerable<T>> method can't be an iterator; you either need to change your DataRecordToTransactionHistory function into a callback, or build a list locally inside your GetData function:
private static async Task GetData(Func<SqlConnection, SqlCommand> commandBuilder,
    Action<IDataRecord> dataFunc)
{
    using (var connection = new SqlConnection())
    {
        await connection.OpenAsync();
        using (var command = commandBuilder(connection))
        {
            command.CommandTimeout = 0;
            using (var reader = await command.ExecuteReaderAsync())
            {
                while (await reader.ReadAsync())
                {
                    dataFunc(reader);
                }
            }
        }
    }
}

// OR

private static async Task<IEnumerable<T>> GetData<T>(Func<SqlConnection, SqlCommand> commandBuilder,
    Func<IDataRecord, T> dataFunc)
{
    using (var connection = new SqlConnection())
    {
        await connection.OpenAsync();
        using (var command = commandBuilder(connection))
        {
            command.CommandTimeout = 0;
            using (var reader = await command.ExecuteReaderAsync())
            {
                // linked list to avoid allocation of an array
                var result = new LinkedList<T>();
                while (await reader.ReadAsync())
                {
                    result.AddLast(dataFunc(reader));
                }
                return result;
            }
        }
    }
}
However, both of these options have their disadvantages. Personally, I would give the TPL Dataflow library a try, with a processing pipeline like this:
// default linking options
var options = new DataflowLinkOptions { PropagateCompletion = true };

// store all the ids you need to process
var idBlock = new BufferBlock<long>();

// transform an id into a list of AccountsTable
var transformBlock = new TransformBlock<long, Task<IEnumerable<AccountsTable>>>(async id =>
{
    return await ReadsDatabaseForADay(id);
});

// connect the blocks to each other
idBlock.LinkTo(transformBlock, options);

// flatten the already completed task to an enumerable
var flattenBlock = new TransformManyBlock<Task<IEnumerable<AccountsTable>>, AccountsTable>(async tables =>
{
    return (await tables).Select(t => t);
});
transformBlock.LinkTo(flattenBlock, options);

// gather results in batches of 10000
var batchBlock = new BatchBlock<AccountsTable>(10000);
flattenBlock.LinkTo(batchBlock, options);

// export each resulting array to a csv file
var processor = new ActionBlock<AccountsTable[]>(a =>
{
    ExportToCSV(a);
});
batchBlock.LinkTo(processor, options);

// populate the pipeline with the needed ids
foreach (var id in GetAllIds())
{
    // await the adding of each id
    await idBlock.SendAsync(id);
}

// notify the pipeline that all the data has been sent
idBlock.Complete();

// await all the ids being processed
await processor.Completion;
TPL Dataflow executes on the default thread pool and takes full advantage of it. You can also configure the blocks with MaxDegreeOfParallelism so that a block handles more than one id at a time.
So: use the async methods, and don't create too many buffer lists/arrays just for storing the data. Assemble your pipeline to take full advantage of iterators and async/await.
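As an illustration, a minimal sketch of configuring such a block (the values are arbitrary examples, and the block shape matches the hypothetical pipeline above):

var blockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4,   // read up to 4 days concurrently
    BoundedCapacity = 8           // back-pressure so ids don't pile up in memory
};
var transformBlock = new TransformBlock<long, Task<IEnumerable<AccountsTable>>>(async id =>
{
    return await ReadsDatabaseForADay(id);
}, blockOptions);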

C# MongoDB FindAsync never returns on Await

There are other questions like this one, but none relating to the actual FindAsync from what I can tell.
My ClientsController calls ClientService.GetClients, which uses the Mongo driver to query a MongoDB on Azure.
Stepping through the debugger, it gets up to the point where I call clientCollection.FindAsync. If I step over this, the following line is never hit and no errors are given. It's like the awaited task never returns.
public async Task<List<Client>> GetClients(SearchRequestDTO searchRequest)
{
    var response = new List<Client>();
    var db = _databaseUtilityService.GetCoreDatabase();
    var clientCollection = db.GetCollection<Client>(Properties.Settings.Default.ClientCollectionName);

    var cursor = await clientCollection.FindAsync(new BsonDocument());
    while (await cursor.MoveNextAsync())
    {
        response.Concat(cursor.Current.ToList());
    }
    return response;
}
Why does the debugger never step over the var cursor = ... line?
Edit:
I can instead get the Result:
var cursor = clientCollection.FindAsync(new BsonDocument()).Result;
But I'm not sure that's what I want to do.
public async Task<List<Client>> GetClients(SearchRequestDTO searchRequest)
{
    var db = _databaseUtilityService.GetCoreDatabase();
    var clientCollection = db.GetCollection<Client>(Properties.Settings.Default.ClientCollectionName);
    var results = clientCollection.FindAsync(new BsonDocument()).Result;
    return results.ToList();
}
Since there is not much information about the context, I came up with mock classes to reproduce the question.
Please see below an overloaded method which, when called, will always return a list of three records. Now, what's wrong with your code? I believe it's in your while loop: you are calling response.Concat, which is causing the issue, because LINQ's Concat returns a new sequence rather than modifying response, and that result is discarded.
I'm calling response.AddRange instead and it works.
public async Task<List<Client>> GetClients()
{
    var mongoUri = "mongodb://localhost:27017";
    var mongoClient = new MongoClient(mongoUri);
    var mongoDatabase = mongoClient.GetDatabase("ClientDB");
    var clientCollection = mongoDatabase.GetCollection<Client>("Clients");

    // Empty the collection to always get an accurate result.
    clientCollection.DeleteMany(new BsonDocument());

    // Insert some dummy data.
    clientCollection.InsertOne(new Client() { Address = "One street, some state", ZipCode = 11111 });
    clientCollection.InsertOne(new Client() { Address = "2nd street, some state", ZipCode = 22222 });
    clientCollection.InsertOne(new Client() { Address = "Third street, some state", ZipCode = 33333 });

    var response = new List<Client>();
    var cursor = await clientCollection.FindAsync(new BsonDocument());
    while (await cursor.MoveNextAsync())
    {
        response.AddRange(cursor.Current);
    }
    return response;
}
