Service Fabric Reliable Dictionary parallel reads - c#

I have a Reliable Dictionary partitioned across a cluster of 7 nodes [60 partitions]. I've set up the remoting listener like this:
var settings = new FabricTransportRemotingListenerSettings
{
    MaxMessageSize = Common.ServiceFabricGlobalConstants.MaxMessageSize,
    MaxConcurrentCalls = 200
};
return new[]
{
    new ServiceReplicaListener((c) => new FabricTransportServiceRemotingListener(c, this, settings))
};
I am trying to do a load test to prove Reliable Dictionary "read" performance will not decrease under load. I have a "read" from dictionary method like this:
using (ITransaction tx = this.StateManager.CreateTransaction())
{
    IAsyncEnumerable<KeyValuePair<PriceKey, Price>> items;
    IAsyncEnumerator<KeyValuePair<PriceKey, Price>> e;
    items = await priceDictionary.CreateEnumerableAsync(tx,
        (item) => item.Id == id, EnumerationMode.Unordered);
    e = items.GetAsyncEnumerator();
    while (await e.MoveNextAsync(CancellationToken.None))
    {
        var p = new Price(
            e.Current.Key.Id,
            e.Current.Key.Version, e.Current.Key.Id, e.Current.Key.Date,
            e.Current.Value.Source, e.Current.Value.Price, e.Current.Value.Type,
            e.Current.Value.Status);
        intermediatePrice.TryAdd(new PriceKey(e.Current.Key.Id, e.Current.Key.Version, id, e.Current.Key.Date), p);
    }
}
return intermediatePrice;
Each partition has around 500,000 records. Each "key" in dictionary is around 200 bytes and "Value" is around 600 bytes. When I call this "read" directly from a browser [calling the REST API which in turn calls the stateful service], it takes 200 milliseconds.
If I run this via a load test with, let's say, 16 parallel threads hitting the same partition and same record, it takes around 600 milliseconds on average per call. If I increase the load test parallel thread count to 24 or 30, it takes around 1 second for each call.
My question is, can a Service Fabric Reliable Dictionary handle parallel "read" operations, just like SQL Server can handle parallel concurrent reads, without affecting throughput?

If you check the Remarks section for the Reliable Dictionary CreateEnumerableAsync method, you can see that it was designed to be used concurrently, so concurrency is not the issue:
The returned enumerator is safe to use concurrently with reads and writes to the Reliable Dictionary. It represents a snapshot consistent view.
The problem is that concurrently does not mean fast.
When you make your query this way, it will:
take a snapshot of the collection before it starts processing it, otherwise you wouldn't be able to write to it while it is being processed;
navigate through all the values in the collection to find the items you are looking for, and take note of these values before returning anything;
load the data from disk if it is not in memory yet; only the keys are kept in memory, the values are kept on disk when not required and might get paged out to release memory.
Subsequent queries will probably (I am not sure, but I assume) not reuse the previous one, because your collection might have changed since the last query.
When you have a huge number of queries running this way, several factors come into play:
Disk: loading the data into memory
CPU: comparing the values and scheduling threads
Memory: storing the snapshot to be processed
The best way to work with a Reliable Dictionary is to retrieve values by key, because it knows exactly where the data for a specific key is stored and does not add this extra overhead to find it.
If you really want to query by a non-key value, I would recommend designing it like an index table: store the data indexed by id in one dictionary, and keep another dictionary whose key is the searched value and whose value is the key into the main dictionary. This would be much faster.
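A rough sketch of that index-table idea, reusing the question's PriceKey/Price types; the dictionary names, the Guid id type, and the one-to-one mapping from id to key are assumptions made here for brevity (a real implementation would likely need a list of keys per id):
var prices = await StateManager.GetOrAddAsync<IReliableDictionary<PriceKey, Price>>("prices");
var pricesById = await StateManager.GetOrAddAsync<IReliableDictionary<Guid, PriceKey>>("prices-by-id");

// Write path: keep the entity and its index entry consistent in one transaction.
using (ITransaction tx = StateManager.CreateTransaction())
{
    await prices.AddOrUpdateAsync(tx, priceKey, price, (k, old) => price);
    await pricesById.AddOrUpdateAsync(tx, priceKey.Id, priceKey, (k, old) => priceKey);
    await tx.CommitAsync();
}

// Read path: two point lookups instead of a full scan of the partition.
using (ITransaction tx = StateManager.CreateTransaction())
{
    var keyLookup = await pricesById.TryGetValueAsync(tx, id);
    if (keyLookup.HasValue)
    {
        var priceLookup = await prices.TryGetValueAsync(tx, keyLookup.Value);
        // use priceLookup.Value when priceLookup.HasValue
    }
}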

Based on the code, all your reads are executed on primary replicas - you have 7 nodes and 60 partitions, so, if I understand everything correctly, there are 60 primary replicas that process requests.
With 7 nodes and 60 replicas, if we imagine they are distributed more or less equally between nodes, we have roughly 8 replicas per node.
I am not sure about the physical configuration of each node, but if we assume for a moment that each node has 4 vCPUs, then when you make 8 concurrent requests on the same node, all of these requests have to be executed using 4 vCPUs. This causes worker threads to fight for resources - keeping it simple, it significantly slows down the processing.
The reason this effect is so visible here is that you are scanning the IReliableDictionary instead of getting items by key using TryGetValueAsync, as it is supposed to be used.
You can try changing your code to use TryGetValueAsync and the difference will be very noticeable.
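For reference, a minimal point-read sketch with TryGetValueAsync, assuming the caller already knows the full PriceKey (which the question's filter-by-id scan does not guarantee):
using (ITransaction tx = this.StateManager.CreateTransaction())
{
    ConditionalValue<Price> lookup = await priceDictionary.TryGetValueAsync(tx, priceKey);
    if (lookup.HasValue)
    {
        intermediatePrice.TryAdd(priceKey, lookup.Value);
    }
}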

Related

C# Reading SQLite table concurrently

The goal here is to use SQL to read a SQLite database, uncompress a BLOB field, and parse the data. The parsed data is written to a different SQLite DB using EF6. Because the size of the incoming database could be 200,000 records or more, I want to do this all in parallel with 4 C# Tasks.
SQLite is in its default SERIALIZED mode. I am converting a working single background task into multiple tasks. The SQLite docs say to use a single connection and so I am using a single connection for all the tasks to read the database:
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read() && !Token.IsCancellationRequested)
{
    ....
}
However, each task reads each record of the database. Not what I want. I need each task to take the next record from the table.
Any ideas?
From SQLite's standpoint, it's likely the limiting factor is the raw disk or network I/O. Naively splitting the basic query into separate tasks or parts would mean more seeks, which makes things slower. We see, then, that the fastest way to get the raw data from the DB is a simple query over a single connection, just like the sqlite documentation says.
But now we want to do some meaningful processing on this data, and this part might benefit from parallel work. What you need to do to get good parallelization, therefore, is create a queuing system as you receive each record.
For this, you want a single thread to send the one SQL statement to the SQLite database and retrieve the results from the data reader. This thread then queues an additional task for each record as quickly as possible, such that each task acts only on the received data for that one record... that is, the additional tasks neither know nor care whether the data came from a database or any other specific source.
The result is you'll end up with as many tasks as you have records. However, you don't have to run that many tasks all at once. You can tune it to 4 or whatever other number you want (2 × the number of CPU cores is a good rule of thumb to start with). And the easiest way to do this is to turn to ThreadPool.QueueUserWorkItem().
As we do this, one thing to remember is that the DataReader mutates itself with each read. So the main thread creating the queue must also be smart enough to copy this data to a new object on each read, so the individual threads don't end up looking at data that has already been changed out for a later record.
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    var temp = CopyDataFromReader(sqlite_datareader);   // copy the row before the reader moves on
    ThreadPool.QueueUserWorkItem(a => ProcessRecord(temp));
}
Additionally, each task itself has some overhead. If you have enough records, you may also gain some benefit from batching up a bunch of records before sending them to the queue:
int count = 0;
object[] temp = new object[50];
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    temp[count] = CopyDataFromReader(sqlite_datareader);
    if (++count >= 50)
    {
        var batch = temp;                 // hand the filled buffer to the worker
        ThreadPool.QueueUserWorkItem(a => ProcessRecords(batch, 50));
        temp = new object[50];            // start a fresh buffer so the worker's data isn't overwritten
        count = 0;
    }
}
if (count != 0) ThreadPool.QueueUserWorkItem(a => ProcessRecords(temp, count));
Finally, you probably want to do something with this data once it is no longer compressed. One option is to wait for all the items to finish, so you can stitch them back into a single IEnumerable of some variety (List, Array, DataTable, iterator, etc). Another is to make sure all of the work is included within the ProcessRecord() method. Another is to use an event delegate to signal when each item is ready for further work.
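For the first option, a hedged sketch of waiting for everything: it assumes ProcessRecord is changed to return its parsed result (ParsedRecord is a hypothetical type here) and swaps ThreadPool.QueueUserWorkItem for Task.Run so the results can be awaited:
var tasks = new List<Task<ParsedRecord>>();
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    var temp = CopyDataFromReader(sqlite_datareader);   // copy before the reader advances
    tasks.Add(Task.Run(() => ProcessRecord(temp)));     // Task.Run still uses the thread pool under the hood
}
ParsedRecord[] results = await Task.WhenAll(tasks);     // everything stitched back into one collection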

ASP.Net Core API response takes too much time

I have a SQL database table with 9000 rows and 97 columns. Its primary key consists of 2 columns: Color and Name.
I have an ASP.NET Core API listening at the URL api/color/{colorName}, which reads the table to get color information. Currently I have 3 colors, with about 3000 rows each.
It takes too much time: it reads the table in 2383 ms and maps to DTOs in 14 ms. After that I immediately return the DTOs to the consumer, but somehow the API takes 4135.422 ms. I don't understand why; I would expect it to take about 2407.863 ms, but it takes almost twice as long.
You can see my code and logs below. Do you have an idea how can I improve the response time?
I use Entity Framework Core 3.1, AutoMapper and ASP.NET Core 3.1.
Service:
public async Task<IEnumerable<ColorDTO>> GetColors(string requestedColor)
{
    var watch = System.Diagnostics.Stopwatch.StartNew();
    var colors = await _dbContext.Colors.Where(color => color.color == requestedColor).ToListAsync();
    watch.Stop();
    _logger.LogError("Color of:{requestedColor} Reading takes:{elapsedMs}", requestedColor, watch.ElapsedMilliseconds);
    var watch2 = System.Diagnostics.Stopwatch.StartNew();
    var colorDtos = _mapper.Map<IEnumerable<ColorDTO>>(colors);
    watch2.Stop();
    _logger.LogError("Color of:{requestedColor} Mapping takes:{elapsedMs}", requestedColor, watch2.ElapsedMilliseconds);
    return colorDtos;
}
Controller:
public async Task<ActionResult<IEnumerable<ColorDTO>>> GetBlocksOfPanel(string requestedColor)
{
    return Ok(await _colorService.GetColors(requestedColor));
}
And the logs:
2020-04-27 15:21:54.8793||0HLVAKLTJO59T:00000003|MyProject.Api.Services.IColorService|INF|Color of Purple Reading takes:2383ms
2020-04-27 15:21:54.8994||0HLVAKLTJO59T:00000003|MyProject.Api.Services.IColorService|INF|Color of Purple Mapping takes:14ms
2020-04-27 15:21:54.9032||0HLVAKLTJO59T:00000003|Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker|INF|Executed action method MyProject.Api.Web.Controllers.ColorsController.GetColors (MyProject.Api.Web), returned result Microsoft.AspNetCore.Mvc.OkObjectResult in 2407.863ms.
2020-04-27 15:21:54.9081||0HLVAKLTJO59T:00000003|Microsoft.AspNetCore.Mvc.Infrastructure.ObjectResultExecutor|INF|Executing ObjectResult, writing value of type 'System.Collections.Generic.List`1[[MyProject.Api.Contracts.Dtos.ColorDTO, MyProject.Api.Contracts, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]'.
2020-04-27 15:21:56.4895||0HLVAKLTJO59T:00000003|Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker|INF|Executed action MyProject.Api.Web.Controllers.ColorsController.GetColors (MyProject.Api.Web) in 4003.8022ms
2020-04-27 15:21:56.4927||0HLVAKLTJO59T:00000003|Microsoft.AspNetCore.Routing.EndpointMiddleware|INF|Executed endpoint 'MyProject.Api.Web.Controllers.ColorsController.GetColors (MyProject.Api.Web)'
2020-04-27 15:21:56.4972||0HLVAKLTJO59T:00000003|Microsoft.AspNetCore.Hosting.Diagnostics|INF|Request finished in 4135.422ms 200 application/json; charset=utf-8
As @ejwill mentioned in his comment, you need to consider latency in the entire operation. Fetching from the database and mapping to DTOs is only part of what is happening during the round trip of the request and response to your API.
You can probably reduce the query time against your database table through some optimizations there. You don't indicate what database you're using, but a composite key based on two string/varchar values may not necessarily be the most performant, and the use of indexes on the values you're filtering on may also help - there are tradeoffs there depending on whether you're optimizing for writes or for reads. That being said, 97 columns is not trivial either way. Do you need to query and return all 97 columns over the API? Is pagination an option?
If you must return all the data for all 97 columns at once and you're querying the API frequently, you can also consider the use of an in-memory cache, especially if the table is not changing often; instead of making the roundtrip to the database every time, you keep a copy of the data in memory so it can be returned much more quickly. You can look at an implementation of an in-memory cache that supports a generational model to keep serving up data while new versions are fetched.
https://github.com/jfbosch/recache
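As a minimal sketch of that caching idea, using ASP.NET Core's built-in IMemoryCache (registered with services.AddMemoryCache() and injected here as _cache); the method name, cache key format, and 5-minute lifetime are assumptions, not part of the original code:
public async Task<IEnumerable<ColorDTO>> GetColorsCached(string requestedColor)
{
    return await _cache.GetOrCreateAsync($"colors:{requestedColor}", async entry =>
    {
        entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);   // refresh the copy every 5 minutes
        var colors = await _dbContext.Colors
            .Where(color => color.color == requestedColor)
            .ToListAsync();
        return _mapper.Map<List<ColorDTO>>(colors);   // materialize so the cached object is a plain list
    });
}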
Serialization of the result could take a huge amount of time.
The first thing is serialization itself: if you return 3k records, it will take significant time to serialize them to JSON or XML. Consider moving to more compact binary formats.
The second thing is memory and GC. If the amount of serialized data exceeds 85,000 bytes, memory for that data will be allocated on the LOH in one chunk. This could take time. You might consider inspecting your LOH and looking for response data stored there. A possible workaround could be responding with chunks of data, using a kind of paging with offset and position (a paged variant is sketched below).
You can easily check whether serialization is causing the performance trouble: leave the call to the DB as it is, but return only 100-200 rows to the client instead of the whole result, or return fewer object fields (for example, only 3). The time should be reduced.
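A hypothetical paged variant of the service method, along the lines suggested above; the offset/limit parameters and the OrderBy column are assumptions:
public async Task<IEnumerable<ColorDTO>> GetColorsPage(string requestedColor, int offset = 0, int limit = 200)
{
    var colors = await _dbContext.Colors
        .Where(color => color.color == requestedColor)
        .OrderBy(color => color.Name)   // paging needs a stable ordering
        .Skip(offset)
        .Take(limit)
        .ToListAsync();
    return _mapper.Map<IEnumerable<ColorDTO>>(colors);
}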
Your problem relates to the SQL side. You should check the indexing of your columns and run your query with the execution plan to find your bottleneck. Also, to increase performance, I suggest rewriting the code to run asynchronously.

How to insert into documentDB from Excel file containing 5000 records?

I have an Excel file that originally had about 200 rows, and I was able to convert the Excel file to a data table and everything got inserted into DocumentDB correctly.
The Excel file now has 5000 rows, and inserting stops after 30-40 records; the rest of the rows are not inserted into DocumentDB.
I get the exception below:
Microsoft.Azure.Documents.DocumentClientException: Exception:
Microsoft.Azure.Documents.RequestRateTooLargeException, message:
{"Errors":["Request rate is large"]}
My code is:
Service service = new Service();
foreach (var data in exceldata) // exceldata contains the set of rows
{
    var student = new Student();
    student.id = "";
    student.name = data.name;
    student.age = data.age;
    student.class = data.class;
    student.id = service.savetoDocumentDB(collectionLink, student); // collectionLink is a string stored in web.config
    students.Add(student);
}

class Service
{
    public async Task<string> AddDocument(string collectionLink, Student data)
    {
        this.DeserializePayload(data);
        var result = await Client.CreateDocumentAsync(collectionLink, data);
        return result.Resource.Id;
    }
}
Am I doing anything wrong?
Any help would be greatly appreciated.
Update:
As of 4/8/15, DocumentDB has released a data import tool, which supports JSON files, MongoDB, SQL Server, and CSV files. You can find it here: http://www.microsoft.com/en-us/download/details.aspx?id=46436
In this case, you can save your Excel file as a CSV and then bulk-import records using the data import tool.
Original Answer:
DocumentDB collections are provisioned with 2,000 request units per second. It's important to note that the limits are expressed in terms of request units and not requests, so writing larger documents costs more than smaller documents, and scanning is more expensive than index seeks.
You can measure the overhead of any operations (CRUD) by inspecting the x-ms-request-charge HTTP response header or the RequestCharge property in the ResourceResponse/FeedResponse objects returned by the SDK.
A RequestRateTooLargeException is thrown when you exhaust the provisioned throughput. Some solutions include:
Back off with a short delay and retry whenever you encounter the exception. A recommended retry delay is included in the x-ms-retry-after-ms HTTP response header. Alternatively, you could simply batch requests with a short delay between batches (a retry sketch follows after this list).
Use lazy indexing for faster ingestion rate. DocumentDB allows you to specify indexing policies at the collection level. By default, the index is updated synchronously on each write to the collection. This enables the queries to honor the same consistency level as that of the document reads without any delay for the index to “catch up”. Lazy indexing can be used to amortize the work required to index content over a longer period of time. It is important to note, however, that when lazy indexing is enabled, query results will be eventually consistent regardless of the consistency level configured for the DocumentDB account.
As mentioned, each collection has a limit of 2,000 RUs - you can increase throughput by sharding / partitioning your data across multiple collections and capacity units.
Delete empty collections to utilize all provisioned throughput - every document collection created in a DocumentDB account is allocated reserved throughput capacity based on the number of Capacity Units (CUs) provisioned, and the number of collections created. A single CU makes available 2,000 request units (RUs) and supports up to 3 collections. If only one collection is created for the CU, the entire CU throughput will be available for the collection. Once a second collection is created, the throughput of the first collection will be halved and given to the second collection, and so on. To maximize throughput available per collection, I'd recommend the number of capacity units to collections is 1:1.
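A hedged sketch of the back-off-and-retry approach from the first bullet, assuming the classic Microsoft.Azure.Documents client used in the question; the method name is invented here, and the exception's RetryAfter property is used as the server-suggested delay:
public async Task<string> AddDocumentWithRetryAsync(string collectionLink, Student data)
{
    while (true)
    {
        try
        {
            var result = await Client.CreateDocumentAsync(collectionLink, data);
            return result.Resource.Id;
        }
        catch (DocumentClientException ex) when ((int?)ex.StatusCode == 429)
        {
            // Request rate too large: wait the interval the service suggests, then retry.
            await Task.Delay(ex.RetryAfter);
        }
    }
}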
References:
DocumentDB Performance Tips:
http://azure.microsoft.com/blog/2015/01/27/performance-tips-for-azure-documentdb-part-2/
DocumentDB Limits:
http://azure.microsoft.com/en-us/documentation/articles/documentdb-limits/

MongoDB - Inserting the result of a query in one round-trip

Consider this hypothetical snippet:
using (mongo.RequestStart(db))
{
    var collection = db.GetCollection<BsonDocument>("test");
    var insertDoc = new BsonDocument { { "currentCount", collection.Count() } };
    WriteConcernResult wcr = collection.Insert(insertDoc);
}
It inserts a new document with "currentCount" set to the value returned by collection.Count().
This implies two round-trips to the server. One to calculate collection.Count() and one to perform the insert. Is there a way to do this in one round-trip?
In other words, can the value assigned to "currentCount" be calculated on the server at the time of the insert?
Thanks!
There is no way to do this currently (Mongo 2.4).
The upcoming 2.6 version should have batch operations support but I don't know if it will support batching operations of different types and using the results of one operation from another operation.
What you can do, however, is execute this logic on the server by expressing it in JavaScript and using eval:
collection.Database.Eval(new BsonJavaScript(@"
    var count = db.test.count();
    db.test.insert({ currentCount: count });
"));
But this is not recommended, for several reasons: you lose the write concern, it is very unsafe in terms of security, it requires admin permissions, it holds a global write lock, and it won't work on sharded clusters :)
I think your best route at the moment would be to do this in two queries.
If you're looking for atomic updates or counters (which don't exactly match your example but seem somewhat related), take a look at findAndModify and the $inc operator of update.
If you've got a large collection and you're looking to save CPU, it's recommended that you create another collection called counters that has one document per collection that you want to count, and increment the document pertaining to your collection each time you insert a document.
See the guidance here.
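If it helps, here is a rough sketch of that counters pattern using findAndModify and $inc; it is written from memory against the legacy 1.x C# driver (FindAndModifyArgs, Query.EQ, Update.Inc), so treat the exact names and overloads as assumptions:
var counters = db.GetCollection<BsonDocument>("counters");
var result = counters.FindAndModify(new FindAndModifyArgs
{
    Query = Query.EQ("_id", "test"),                         // one counter document per counted collection
    Update = Update.Inc("currentCount", 1),                  // atomic, server-side increment
    Upsert = true,                                           // create the counter document on first use
    VersionReturned = FindAndModifyDocumentVersion.Modified  // return the incremented value
});
var currentCount = result.ModifiedDocument["currentCount"].AsInt32;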
It appears that you can place a JavaScript function inside your query, so perhaps it can be done in one trip, but I haven't implemented this in my own app, so I can't confirm that.

C# Multithread database access for Firebird (.NET)

I'd like to take advantage of multithreading when we write data from the database into our own objects. We are currently using Firebird and retrieving data using the "forward-only" reader FbDataReader.
We cycle through the records held in the FbDataReader and populate an object, adding the object to a List which is then used within the application. All this occurs in the Data Access Layer of our application.
Ideally, we would like to retrieve data from the database (in a FbDataReader) and then split the work of writing to objects (one per row) between threads. The problem I see is that the FbDataReader is forward-only, and different threads may cause the reader to step to the next record before another thread is finished.
A solution might be to dump the FbDataReader into an indexed List, Array or Dictionary but this would come at a cost.
Does anyone have any ideas or are we just wasting our time looking to refactor this part of our code?
If you can obtain large blocks of contiguous records that don't overlap and assign a data reader object to each, then you can use a thread per reader and get gains, provided the data source doesn't become a bottleneck. You will basically be using multiple reader objects in place of intermediate storage.
e.g.
where ID >= 0 && ID < 10000 << block 1 for data reader instance 1
where ID >= 10000 && ID < 20000 << block 2 for data reader instance 2
where ID >= 20000 && ID < 30000 << block 3 for data reader instance 3
With three readers, this example lets you instantiate objects on three threads "simultaneously".
If you additionally wrap a C# iterator around this entire process, you might have a way to return all the objects as if they were a single collection, without using any intermediate storage for the instances.
foreach ( object o in MyIterator() ) { ....
This would bring the objects back under the banner of being used from a single thread even though they were created on different threads. I'm just blue-skying this last part.
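To make that concrete, here is a blue-sky sketch of the ranged readers feeding a single iterator through a BlockingCollection; it assumes the FirebirdSql.Data.FirebirdClient ADO.NET provider, a hypothetical MY_TABLE/MyObject/MapRow, and needs System.Collections.Concurrent, System.Linq and System.Threading.Tasks:
public IEnumerable<MyObject> ReadInParallel(string connectionString, int blockSize, int blockCount)
{
    var results = new BlockingCollection<MyObject>(boundedCapacity: 1000);

    // One connection and reader per thread, each scanning its own non-overlapping ID range.
    var workers = Enumerable.Range(0, blockCount).Select(i => Task.Run(() =>
    {
        using var connection = new FbConnection(connectionString);
        connection.Open();
        using var command = new FbCommand(
            "SELECT * FROM MY_TABLE WHERE ID >= @lo AND ID < @hi", connection);
        command.Parameters.AddWithValue("@lo", i * blockSize);
        command.Parameters.AddWithValue("@hi", (i + 1) * blockSize);
        using var reader = command.ExecuteReader();
        while (reader.Read())
            results.Add(MapRow(reader));          // objects are built on the worker thread
    })).ToArray();

    // Error handling omitted for brevity; when all readers finish, stop the consumer.
    Task.WhenAll(workers).ContinueWith(_ => results.CompleteAdding());

    // The caller sees one ordinary sequence even though rows arrive from several threads.
    foreach (var item in results.GetConsumingEnumerable())
        yield return item;
}
Note that, being an iterator method, nothing actually starts until the caller begins enumerating, which fits the "single collection" usage above.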
You can try to create a new thread and, within it, create a new database connection based on the existing connection string.
In one thread you can process the even records (2, 4, 6, etc.) and in the other thread the odd records (1, 3, 5, etc.).
However, this will increase the complexity of your code.
For lengthy database operations I prefer to create a new database connection object to do the work in a separate thread and display progress to the user, so that the UI thread does not freeze and the application remains responsive.
