C# Multithreaded database access for Firebird (.NET)

I'd like to take advantage of multithreading when we write data from the database into our own objects. We are currently using Firebird and retrieving data using the "forward-only" reader FbDataReader.
We cycle through the records held in the FbDataReader and populate an object, adding the object to a List which is then used within the application. All this occurs in the Data Access Layer of our application.
Ideally, we would like to retrieve data from the database (in a FbDataReader) and then split the work of writing to objects (one per row) between threads. The problem I see is that the FbDataReader is forward-only, and different threads may cause the reader to step to the next record before another thread has finished.
A solution might be to dump the FbDataReader into an indexed List, Array or Dictionary but this would come at a cost.
Does anyone have any ideas or are we just wasting our time looking to refactor this part of our code?

If you can obtain large blocks of contiguous records that don't overlap, and assign a data reader object to each, then you can use a thread for each reader and get gains, provided the data source doesn't become a bottleneck. You will basically be using multiple reader objects in place of intermediate storage.
e.g.
where ID >= 0 AND ID < 10000         << block 1 for data reader instance 1
where ID >= 10000 AND ID < 20000     << block 2 for data reader instance 2
where ID >= 20000 AND ID < 30000     << block 3 for data reader instance 3
This example of three readers will allow you to instantiate three objects "simultaneously".
If you additionally use a C# iterator wrapped around this entire process you might have a way to return all the objects as if they were a single collection without using any intermediate storage for the instances.
foreach ( object o in MyIterator() ) { ....
This would bring the objects back under the banner of being used from a single thread even though they were created on different threads. I'm just blue-skying this last part.
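As a rough sketch of that idea (not tested against Firebird; MyObject, the table and the column list are placeholders), each ID block gets its own connection and FbDataReader on a worker task, the rows are funneled through a BlockingCollection, and a plain C# iterator hands them back to the caller as a single sequence:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using FirebirdSql.Data.FirebirdClient;

public static IEnumerable<MyObject> ReadInBlocks(string connectionString, (int From, int To)[] blocks)
{
    var results = new BlockingCollection<MyObject>(boundedCapacity: 1000);

    // one reader per ID block, each on its own connection and thread
    var producers = Task.WhenAll(blocks.Select(b => Task.Run(() =>
    {
        using var cn = new FbConnection(connectionString);
        cn.Open();
        using var cmd = new FbCommand(
            $"select ID, NAME from MYTABLE where ID >= {b.From} and ID < {b.To}", cn);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            results.Add(new MyObject(reader.GetInt32(0), reader.GetString(1)));
    })));

    // close the collection once every block has been fully read
    producers.ContinueWith(_ => results.CompleteAdding());

    // the iterator returns the objects to the caller on a single thread
    foreach (var item in results.GetConsumingEnumerable())
        yield return item;
}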

You can try creating a new thread and, within it, a new database connection based on the existing connection string.
One thread can process the even records (2, 4, 6, etc.) and the other thread the odd records (1, 3, 5, etc.).
However, this will increase the complexity of your code.
For lengthy database operations I prefer to create a new database connection object to do the work in a separate thread and display progress to the user, so that the UI thread does not freeze and the application remains responsive.
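A minimal sketch of that even/odd split, assuming an integer ID column, Firebird's MOD() function, and hypothetical MyObject/LoadRows/ConnectionString names; each task builds its own connection from the same connection string:
var even = Task.Run(() => LoadRows("select ID, NAME from MYTABLE where mod(ID, 2) = 0"));
var odd  = Task.Run(() => LoadRows("select ID, NAME from MYTABLE where mod(ID, 2) = 1"));
var all  = (await Task.WhenAll(even, odd)).SelectMany(rows => rows).ToList();

static List<MyObject> LoadRows(string sql)
{
    // a separate connection per task; reader and connection objects are not thread-safe
    using var cn = new FbConnection(ConnectionString);
    cn.Open();
    using var cmd = new FbCommand(sql, cn);
    using var reader = cmd.ExecuteReader();
    var rows = new List<MyObject>();
    while (reader.Read())
        rows.Add(new MyObject(reader.GetInt32(0), reader.GetString(1)));
    return rows;
}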

Related

C# Reading SQLite table concurrently

The goal here is to use SQL to read a SQLite database, uncompress a BLOB field, and parse the data. The parsed data is written to a different SQLite DB using EF6. Because the size of the incoming database could be 200,000 records or more, I want to do this all in parallel with 4 C# Tasks.
SQLite is in its default SERIALIZED mode. I am converting a working single background task into multiple tasks. The SQLite docs say to use a single connection and so I am using a single connection for all the tasks to read the database:
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read() && !Token.IsCancellationRequested)
{
    ....
}
However, each task reads each record of the database. Not what I want. I need each task to take the next record from the table.
Any ideas?
From SQLite's standpoint, it's likely the limiting factor is the raw disk or network I/O. Naively splitting the basic query into separate tasks or parts would mean more seeks, which makes things slower. We see, then, that the fastest way to get the raw data from the DB is a simple query over a single connection, just like the sqlite documentation says.
But now we want to do some meaningful processing on this data, and this part might benefit from parallel work. What you need to do to get good parallelization, therefore, is create a queuing system as you receive each record.
For this, you want a single thread to send the one SQL statement to the SQLite database and retrieve the results from the data reader. That thread then queues an additional task for each record as quickly as possible, such that each task acts only on the received data for its one record... that is, the additional tasks neither know nor care that the data came from a database or any other specific source.
The result is you'll end up with as many tasks as you have records. However, you don't have to run that many tasks all at once. You can tune it to 4 or whatever other number you want (2 × the number of CPU cores is a good rule of thumb to start with). And the easiest way to do this is to turn to ThreadPool.QueueUserWorkItem().
As we do this, one thing to remember is that the DataReader mutates itself with each read. So the main thread building the queue must also be smart enough to copy the data into a new object on each read, so the individual threads don't end up looking at data that has already been swapped out for a later record.
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    var temp = CopyDataFromReader(sqlite_datareader);
    ThreadPool.QueueUserWorkItem(a => ProcessRecord(temp));
}
Additionally, each task itself has some overhead. If you have enough records, you may also gain some benefit from batching up a bunch of records before sending them to the queue:
const int BatchSize = 50;
int count = 0;
var temp = new object[BatchSize];
using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    temp[count] = CopyDataFromReader(sqlite_datareader);
    if (++count >= BatchSize)
    {
        var batch = temp;                         // hand the filled buffer to the worker...
        ThreadPool.QueueUserWorkItem(a => ProcessRecords(batch, BatchSize));
        temp = new object[BatchSize];             // ...and start a fresh one so threads never share a buffer
        count = 0;
    }
}
if (count != 0) ThreadPool.QueueUserWorkItem(a => ProcessRecords(temp, count));
Finally, you probably want to do something with this data once it is no longer compressed. One option is to wait for all the items to finish so you can stitch them back into a single IEnumerable of some variety (List, Array, DataTable, iterator, etc). Another is to make sure all of the follow-up work happens inside the ProcessRecord() method. Another is to use an event delegate to signal when each item is ready for further work.
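A minimal sketch of the "wait for all items" option, using Task.Run instead of ThreadPool.QueueUserWorkItem because Tasks make waiting and result collection straightforward; ParsedRecord is a hypothetical result type and ProcessRecord is assumed to return one here:
var tasks = new List<Task<ParsedRecord>>();

using var sqlite_datareader = sqlite_cmd.ExecuteReader();
while (sqlite_datareader.Read())
{
    var temp = CopyDataFromReader(sqlite_datareader);   // copy before the reader moves on
    tasks.Add(Task.Run(() => ProcessRecord(temp)));     // ProcessRecord returns a ParsedRecord here
}

// stitch everything back into a single collection once all the work is done
ParsedRecord[] allResults = await Task.WhenAll(tasks);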

Service Fabric Reliable Dictionary parallel reads

I have a Reliable Dictionary partitioned across a cluster of 7 nodes [60 partitions]. I've set up the remoting listener like this:
var settings = new FabricTransportRemotingListenerSettings
{
    MaxMessageSize = Common.ServiceFabricGlobalConstants.MaxMessageSize,
    MaxConcurrentCalls = 200
};
return new[]
{
    new ServiceReplicaListener((c) => new FabricTransportServiceRemotingListener(c, this, settings))
};
I am trying to do a load test to prove Reliable Dictionary "read" performance will not decrease under load. I have a "read" from dictionary method like this:
using (ITransaction tx = this.StateManager.CreateTransaction())
{
    IAsyncEnumerable<KeyValuePair<PriceKey, Price>> items;
    IAsyncEnumerator<KeyValuePair<PriceKey, Price>> e;
    items = await priceDictionary.CreateEnumerableAsync(tx,
        (item) => item.Id == id, EnumerationMode.Unordered);
    e = items.GetAsyncEnumerator();
    while (await e.MoveNextAsync(CancellationToken.None))
    {
        var p = new Price(
            e.Current.Key.Id,
            e.Current.Key.Version, e.Current.Key.Id, e.Current.Key.Date,
            e.Current.Value.Source, e.Current.Value.Price, e.Current.Value.Type,
            e.Current.Value.Status);
        intermediatePrice.TryAdd(new PriceKey(e.Current.Key.Id, e.Current.Key.Version, id, e.Current.Key.Date), p);
    }
}
return intermediatePrice;
Each partition has around 500,000 records. Each "key" in dictionary is around 200 bytes and "Value" is around 600 bytes. When I call this "read" directly from a browser [calling the REST API which in turn calls the stateful service], it takes 200 milliseconds.
If I run this via a load test with, let's say, 16 parallel threads hitting the same partition and same record, it takes around 600 milliseconds on average per call. If I increase the load test parallel thread count to 24 or 30, it takes around 1 second for each call.
My question is, can a Service Fabric Reliable Dictionary handle parallel "read" operations, just like SQL Server can handle parallel concurrent reads, without affecting throughput?
If you check the Remarks section of the Reliable Dictionary CreateEnumerableAsync method, you can see that it was designed to be used concurrently, so concurrency is not the issue:
The returned enumerator is safe to use concurrently with reads and writes to the Reliable Dictionary. It represents a snapshot consistent view.
The problem is that concurrent does not mean fast.
When you make a query this way, it will:
have to take a snapshot of the collection before it starts processing, otherwise you wouldn't be able to write to it while it's being processed;
have to walk through all the values in the collection to find the item you are looking for, taking note of these values before anything is returned;
load the data from disk if it is not in memory yet; only the keys are kept in memory, while the values are kept on disk when not required and may get paged out to release memory.
Subsequent queries will probably (I am not sure, but I assume) not reuse the previous one, because your collection might have changed since the last query.
When you have a huge number of queries running this way, many factors come into play:
Disk: loading the data into memory
CPU: comparing the values and scheduling threads
Memory: storing the snapshot to be processed
The best way to work with a Reliable Dictionary is to retrieve values by key, because it knows exactly where the data for a specific key is stored and does not add this extra overhead of finding it.
If you really want to query it this way, I would recommend designing it like an index table: store the data indexed by id in one dictionary, and keep another dictionary whose key is the searched value and whose value is the key into the main dictionary. This would be much faster.
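A rough sketch of that index-table lookup, under the assumption that one id maps to a single PriceKey (names like priceIndexById and pricesByKey are illustrative, not from the original code):
using (ITransaction tx = this.StateManager.CreateTransaction())
{
    // index dictionary: searched value (id) -> key of the main dictionary
    var indexEntry = await priceIndexById.TryGetValueAsync(tx, id);
    if (indexEntry.HasValue)
    {
        // main dictionary: PriceKey -> Price, fetched without any scan
        var price = await pricesByKey.TryGetValueAsync(tx, indexEntry.Value);
        if (price.HasValue)
            intermediatePrice.TryAdd(indexEntry.Value, price.Value);
    }
}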
Based on the code, all your reads are executed on primary replicas: you have 7 nodes and 60 service instances that process requests. If I've got everything right, there are 60 replicas processing requests.
You have 7 nodes and 60 replicas, so if we imagine they are distributed more or less equally between nodes, we have roughly 8 replicas per node.
I am not sure about the physical configuration of each node, but if we assume for a moment that each node has 4 vCPUs, then when you make 8 concurrent requests on the same node all of these requests must be executed on 4 vCPUs. This causes worker threads to fight for resources; put simply, it significantly slows down the processing.
The reason this effect is so visible here is that you are scanning the IReliableDictionary instead of getting items by key using TryGetValueAsync, as it is meant to be used.
Try changing your code to use TryGetValueAsync and the difference will be very noticeable.
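As an illustration of that keyed read (the key construction here is hypothetical; the real PriceKey fields come from your model), a point lookup replaces the predicate scan:
using (ITransaction tx = this.StateManager.CreateTransaction())
{
    var key = new PriceKey(id, version, id, date);   // placeholder values for the full key
    ConditionalValue<Price> result = await priceDictionary.TryGetValueAsync(tx, key);
    if (result.HasValue)
        intermediatePrice.TryAdd(key, result.Value);
}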

What is the best way to load a huge result set into memory?

I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory.
Below are the queries used to pull data from source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
Following are the approaches I have tried, with some statistics:
1) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on Id Column.
Major time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I had to set CommandTimeout to 0 (infinite), which is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that having an index on LinkedColumn does improve performance, but the problem is that we are dealing with a 3rd-party RDBMS table which might not have an index.
We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there would be lots of jobs running in parallel, putting pressure on the database server, which we don't want.
Hence we want to fetch the records into our application's memory, from both source and target, and do a one-to-one record comparison there so the database server stays free.
Note: I want to do this in my c# application and don't want to use SSIS or Linked Server.
Update:
Source SQL query execution time in SQL Server Management Studio: 00:01:41
Target SQL query execution time in SQL Server Management Studio: 00:01:40
What would be the best way to read a huge result set into memory?
Code:
static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);
    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);
    var drA = GetReader(command1);
    var drB = GetReader(command2);
    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit
}
There is nothing (I know of) faster than a DataReader for fetching db records.
Working with large databases comes with its challenges; reading 10 million records in under 2 minutes is pretty good.
If you want faster you can:
jdwend's suggestion:
Use sqlcmd.exe and the Process class to run the query and put the results into a CSV file, then read the CSV back into C# (see the sketch at the end of this answer). sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SQL client classes.
Parallelize your queries and fetch concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest option (and IMO the best for a "select everything" load) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware, in Release mode, as that could otherwise skew your benchmarks.
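A hedged sketch of jdwend's sqlcmd route referenced above: export the result set to a file with sqlcmd.exe via the Process class, then stream the file back in. Server, database, query and file names are placeholders, and the exact sqlcmd switches may need adjusting for your environment:
// using System.Diagnostics;  using System.IO;
var psi = new ProcessStartInfo
{
    FileName = "sqlcmd.exe",
    Arguments = "-S myServer -d myDb -W -s \",\" -h -1 " +
                "-Q \"set nocount on; select Id, CompareColumn from Source\" -o dump.csv",
    UseShellExecute = false
};
using (var p = Process.Start(psi))
    p.WaitForExit();

foreach (var line in File.ReadLines("dump.csv"))   // stream the file; don't load it all at once
{
    var parts = line.Split(',');
    // ... build your object from parts ...
}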
This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc";
        // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set.
}
I think I would deal with this problem in a different way.
But before that, let's make some assumptions:
According to your question description, you will get data from SQL Server and Oracle.
Each query will return a bunch of data.
You do not specify the point of getting all that data in memory, nor what it will be used for.
I assume the data you will process is going to be used multiple times, and that you will not repeat both queries multiple times.
And whatever you do with the data, it is probably not going to be displayed to the user all at the same time.
Having these foundation points, I would proceed as follows:
Think of this problem as data processing.
Have a third database, or some other place with auxiliary database tables, where you can store the results of the 2 queries.
To avoid timeouts, obtain the data using paging (get thousands at a time) and save them into these aux DB tables, NOT in RAM (see the sketch after this list).
As soon as your logic completes the data loading (import/migration), you can start processing it.
Data processing is a key strength of database engines; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to crunch/process/merge the 2 auxiliary tables into just 1.
Now that you have all the merged data in a third aux table, you can use it for display or whatever else you need.
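A hedged sketch of the paged copy into an auxiliary table, shown for the SQL Server side only; the Aux_Source table, connection strings and column names are illustrative, and OFFSET paging assumes SQL Server 2012+:
const int PageSize = 100000;

using var src = new SqlConnection(sourceConnectionString);
src.Open();

for (int offset = 0; ; offset += PageSize)
{
    using var cmd = new SqlCommand(
        "select Id as LinkedColumn, CompareColumn from Source " +
        "order by Id offset @offset rows fetch next @pageSize rows only", src);
    cmd.Parameters.AddWithValue("@offset", offset);
    cmd.Parameters.AddWithValue("@pageSize", PageSize);

    using var reader = cmd.ExecuteReader();
    if (!reader.HasRows)
        break;                                        // no more pages

    // bulk-insert the page into the auxiliary table instead of holding it in RAM
    using var bulk = new SqlBulkCopy(auxConnectionString)
    {
        DestinationTableName = "Aux_Source"
    };
    bulk.WriteToServer(reader);
}
Note that OFFSET paging gets slower as the offset grows; for 12+ million rows, keyset paging on Id (where Id > @lastId) would scale better if an index exists.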
If you want to read the data faster, use the raw provider API. Avoid frameworks like LINQ and rely on the DataReader. Also check whether you can tolerate something like a dirty read (WITH (NOLOCK) in SQL Server).
If your data is very large, try to implement partial reads, for example by indexing your data. You could add a condition such as a date from/to range until everything has been selected.
After that, consider using threading in your system to parallelize the flow: one thread to fetch for job 1, another thread to fetch for job 2. This will cut a lot of time.
Technicalities aside, I think there is a more fundamental problem here.
select [...] order by LinkedColumn
I did observe that having an index on LinkedColumn does improve performance, but the problem is we are dealing with 3rd party RDBMS tables which might or might not have an index.
We would like to keep database server as free as possible
If you cannot ensure that the DB has a tree-based index on that column, the DB will be kept quite busy sorting your millions of rows. It's slow and resource hungry. Either get rid of the ORDER BY in the SQL statement and sort on the application side, to get results faster and reduce load on the DB, or ensure the DB has such an index!!!
...depending on whether this fetch is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client side, as in the sketch below.
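A tiny sketch of the "fetch unsorted, sort client side" option; Row is a hypothetical record type holding the two columns:
var rows = new List<Row>();
using (var cmd = new SqlCommand("select Id as LinkedColumn, CompareColumn from Source", srcConnection))
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
        rows.Add(new Row(reader.GetInt32(0), reader.GetString(1)));
}
rows.Sort((a, b) => a.LinkedColumn.CompareTo(b.LinkedColumn));   // sort in the application, not the DB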
I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.
If you need to process large database result sets from Java, you can opt for JDBC to give you the low level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example: let's assume we have a table (mapped to the class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumeric data of about ~2KB. The JVM is run with -Xmx250m; let's assume 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not further specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified.

`BatchStatement` occasionally gets data out of sync

"Cassandra: The Definitive Guide, 2nd Edition" says:
Cassandra’s batches are a good fit for use cases such as making
multiple updates to a single partition, or keeping multiple tables in
sync. A good example is making modifications to denormalized tables
that store the same data for different access patterns.
The last statement above applies to the following attempt, where all the Save... statements are inserts into different tables:
var bLogged = new BatchStatement();
var now = DateTimeOffset.UtcNow;
var uuidNow = TimeUuid.NewId(now);
bLogged.Add(SaveMods.Bind(id, uuidNow, data1)); // 1
bLogged.Add(SaveMoreMods.Bind(id, uuidNow, data2)); // 2
bLogged.Add(SaveActivity.Bind(now.ToString("yyyy-MM-dd"), id, now)); // 3
await GetSession().ExecuteAsync(bLogged);
We'll focus on statements 1 and 2 (the 3rd one is just to signify there's one more statement in the batch).
Statement 1 writes to table1, partitioned by id, with uuidNow as a descending clustering key.
Statement 2 writes to table2, partitioned by id only, so it holds the tip (latest entry) of table1 for the same id.
More often than I'd like, the two tables get out of sync, in the sense that table2 does not hold the tip of table1; it ends up one or two mods behind, within a few milliseconds.
While looking for a resolution, most advice on the web was against batches altogether, which prompted my solution that eliminated all the mismatches:
await Task.WhenAll(
    GetSession().ExecuteAsync(SaveMods.Bind(id, uuidNow, data1)),
    GetSession().ExecuteAsync(SaveMoreMods.Bind(id, uuidNow, data2)),
    GetSession().ExecuteAsync(SaveActivity.Bind(now.ToString("yyyy-MM-dd"), id, now))
);
The question is: what are batches good for, just the first statement in the quote? In that case how do I ensure modifications to different tables are in sync?
Using higher consistency (i.e. QUORUM) on reads/writes may help, but there is always a possibility of inconsistencies between tables/partitions.
Batch statements try to ensure that all the mutations in the batch either all happen or none do. They do not guarantee that all the mutations occur in an instant (there is no isolation; you can do a read where the first mutation has been applied but the others haven't). Also, batch statements do not provide a consistent view of all the data across all the nodes. For linearizable consistency you should consider using Paxos (lightweight transactions) for conditional updates, and try to limit the things that require linearizability to a single partition.
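For the conditional-update route mentioned above, a hedged sketch with the DataStax C# driver; the table and column names are illustrative, and the condition only applies within a single partition:
// Lightweight transaction (Paxos) update: only applies if the expected current tip matches.
var statement = new SimpleStatement(
    "UPDATE table2 SET tip_uuid = ? WHERE id = ? IF tip_uuid = ?",
    uuidNow, id, previousTip);

var rs = await GetSession().ExecuteAsync(statement);

// Cassandra returns an [applied] column for conditional writes; if it is false,
// the row also carries the current values so you can decide how to retry.
bool applied = rs.First().GetValue<bool>("[applied]");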

Parallelizing database access

I create a table with objects to process in SQL Server.
The database is on dbserver.
Then, with my app (C#), I use a SqlDataReader to iterate over all the objects, and it finishes in time T. I use multithreading and a mutex in my app, and all the threads share the same SqlDataReader. I run it on serverp1.
Then, to make it faster, I separate the objects into 2 ranks (groups) by a column.
Then I run the app on serverp01 for the objects in rank1 (SqlDataReader with a select where rank = 1) and run the app on serverp02 for the objects in rank2 (SqlDataReader with a select where rank = 2).
My issue is that it takes the same time T for both configurations. Maybe I'm wrong, but it should take about T/2, or close to it.
Does anybody have an idea what is happening?
Sounds like you're bound by I/O speed. When you run this on serverp1, are the CPUs maxed out? If not, then the network or the DB disks are probably the bottleneck. You can check the disk and network throughput on the DB server to see whether they hit a limit.
If the disk is the bottleneck, then try to make your table rows narrower; each row in your table should be as few bytes as possible. Make sure that the table you're querying only holds the few columns you actually need and that they're as compact as possible (i.e. highly normalized, with integer keys instead of varchar values, non-nullable, etc).
Remember that even when you only ask for a few columns, the whole page needs to be read from disk into memory. The more rows you can fit onto a page, the fewer pages the server needs to read.
If the network is the bottleneck, then selecting only the columns you need and making them as narrow as possible (an int key instead of a varchar value) should be enough.
Regards GJ
