I insert data into an Access database from terminals through a web service, like this:
using (Conn = new OleDbConnection(Work_Connect))
{
    Conn.Open();

    foreach (DataRow R in ds.Tables["MyCount"].Rows)
    {
        U_TermNum = TermNum;
        U_Id = Id;
        U_Bar = R["Bar"].ToString().Trim();
        U_Qty = R["Qty"].ToString().Trim();
        U_Des = R["Des"].ToString().Trim();
        U_UserName = UserName;
        U_UserID = UserID;

        SQL = "INSERT INTO MyTbl (ID, Bar, Qty, TermNum, Des, UserName, UserID) VALUES (@A, @B, @C, @D, @E, @F, @G)";

        using (OleDbCommand Cmd4 = new OleDbCommand(SQL, Conn))
        {
            Cmd4.Parameters.AddWithValue("@A", Convert.ToInt32(U_Id));
            Cmd4.Parameters.AddWithValue("@B", U_Bar);
            Cmd4.Parameters.AddWithValue("@C", Convert.ToDouble(U_Qty));
            Cmd4.Parameters.AddWithValue("@D", U_TermNum);
            Cmd4.Parameters.AddWithValue("@E", U_Des);
            Cmd4.Parameters.AddWithValue("@F", U_UserName);
            Cmd4.Parameters.AddWithValue("@G", U_UserID);
            Cmd4.ExecuteNonQuery();
        }
    }
}
I try to send from 20 terminals.
If I send from terminal 1, wait 10 seconds, send from terminal 2, wait 10 seconds, and so on, it works very fast and all terminals finish sending after about 1 minute.
But if I send from all terminals in parallel at once, it works very slowly and the terminals only finish after about 6 minutes.
Why? And how can I change my code so that I can send in parallel and everything finishes fast?
I have now also noticed that not all rows were inserted into the database (when I send them all at once).
How do I deal with this problem?
If you find that your application is bogging down under load then using an Access back-end database might not be the right choice for your situation. Specifically:
ACE/Jet (Access) databases are generally not recommended for use with web applications, where the number of concurrent connections can vary greatly and web traffic can "spike" the level of activity well above ACE/Jet's "comfort zone".
Informal discussions among Access developers tend to consider ~10 concurrent users as the point where an Access application will start to slow down, and ~25 concurrent users is often cited as the practical limit. These are very general guidelines, of course, and some Access applications can handle many more concurrent users depending on their usage patterns (e.g., mostly lookups with occasional inserts and updates).
So, if your application will regularly have ~20 concurrent connections hammering INSERTs into the database as fast as they can, then you should consider switching your database back end to a server-based product that is better suited to that type of activity.
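For illustration only, here is roughly what the same per-row insert could look like against a SQL Server back end, reusing one prepared command instead of rebuilding it for every row. This is just a sketch: the connection string (sqlServerConnect) and the column types/lengths are assumptions, while ds, Id, TermNum, UserName and UserID come from the code above.

// Sketch only - assumes using System.Data; using System.Data.SqlClient;
using (var conn = new SqlConnection(sqlServerConnect))
{
    conn.Open();

    string sql = "INSERT INTO MyTbl (ID, Bar, Qty, TermNum, Des, UserName, UserID) " +
                 "VALUES (@A, @B, @C, @D, @E, @F, @G)";

    using (var cmd = new SqlCommand(sql, conn))
    {
        // declare the parameters once; the types and lengths here are guesses
        cmd.Parameters.Add("@A", SqlDbType.Int);
        cmd.Parameters.Add("@B", SqlDbType.NVarChar, 50);
        cmd.Parameters.Add("@C", SqlDbType.Float);
        cmd.Parameters.Add("@D", SqlDbType.NVarChar, 20);
        cmd.Parameters.Add("@E", SqlDbType.NVarChar, 100);
        cmd.Parameters.Add("@F", SqlDbType.NVarChar, 50);
        cmd.Parameters.Add("@G", SqlDbType.NVarChar, 20);

        foreach (DataRow r in ds.Tables["MyCount"].Rows)
        {
            cmd.Parameters["@A"].Value = Convert.ToInt32(Id);
            cmd.Parameters["@B"].Value = r["Bar"].ToString().Trim();
            cmd.Parameters["@C"].Value = Convert.ToDouble(r["Qty"].ToString().Trim());
            cmd.Parameters["@D"].Value = TermNum;
            cmd.Parameters["@E"].Value = r["Des"].ToString().Trim();
            cmd.Parameters["@F"].Value = UserName;
            cmd.Parameters["@G"].Value = UserID;
            cmd.ExecuteNonQuery();
        }
    }
}

A server back end like this (including the free Express edition) handles 20 concurrent writers comfortably.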
Related
Is there a problem if I execute queries from multiple threads using the same ConnectionString? What happens if two or more threads try to send data at the same time?
string globalConnectionString = @"some_stringHere!";

// create a new BackgroundWorker whenever a new log file (txt) is created
// ....
private void backgroundWorker_DoWork(object sender, DoWorkEventArgs e)
{
    // get some data from the created log file
    string serialNumber = getSerialNumber(logFile);
    string testResult = getTestResult(logFile);

    // if the server is online, send the data
    if (serverIsOnline)
    {
        using (SqlConnection connection = new SqlConnection(globalConnectionString))
        {
            SqlCommand someCommand = new SqlCommand("some insert/update command here!", connection);
            connection.Open();
            someCommand.ExecuteNonQuery();
            connection.Close();
        }
    }
}
Concurrent connections are OK, if used correctly
There's no problem with using multiple connections concurrently, assuming it's done for the right reason. Databases can handle thousands of concurrent client connections.
Executing the same slow query in parallel to make it finish faster will probably make it even slower, as each connection may block the others. Many databases already parallelize query processing, producing far better results than crude client-side parallelism.
If you want to make a slow query go faster, you'd get better results by investigating why it's slow and fixing the performance issues. For example, if you want to insert 10K rows, it's faster to use e.g. SqlBulkCopy or BULK INSERT to load the rows than to execute 10K INSERTs that will end up blocking each other for access to the same table and even the same data pages.
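For example, a minimal SqlBulkCopy sketch; the destination table, the column names and the resultsDataTable variable are placeholders, not anything from the question:

// Sketch only: load a whole DataTable in batches instead of issuing
// thousands of single-row INSERTs.
using (var connection = new SqlConnection(globalConnectionString))
{
    connection.Open();

    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.TestResults";
        bulkCopy.ColumnMappings.Add("SerialNumber", "SerialNumber");
        bulkCopy.ColumnMappings.Add("TestResult", "TestResult");
        bulkCopy.BatchSize = 5000;                 // commit in chunks
        bulkCopy.WriteToServer(resultsDataTable);  // DataTable built by the caller
    }
}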
You can use the same connection to execute asynchronous queries (e.g. with ExecuteNonQueryAsync(), ExecuteReaderAsync(), etc.), provided they execute one after the other. You can't execute multiple concurrent queries on the same connection, at least not without going through some hoops.
The real problem
The real problem is using a BackgroundWorker in the first place. That class has been obsolete since 2012, when async/await was introduced. With BGW it's extremely hard to combine multiple asynchronous operations. Progress reporting is available through the Progress<T> class and cooperative cancellation through CancellationTokenSource. Check Async in 4.5: Enabling Progress and Cancellation in Async APIs for a detailed explanation.
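As a rough sketch of what that looks like (statusLabel, the button handler and the pendingLogFiles collection are illustrative names; getSerialNumber, getTestResult and InsertTestData come from the surrounding code):

// Sketch only: progress reporting and cancellation without a BackgroundWorker.
private CancellationTokenSource _cts;

private async void startButton_Click(object sender, EventArgs e)
{
    _cts = new CancellationTokenSource();

    // Progress<T> captures the UI SynchronizationContext, so the callback
    // runs on the UI thread that created it.
    var progress = new Progress<int>(done => statusLabel.Text = done + " rows sent");

    try
    {
        await SendAllAsync(progress, _cts.Token);
    }
    catch (OperationCanceledException)
    {
        statusLabel.Text = "Cancelled";
    }
}

private async Task SendAllAsync(IProgress<int> progress, CancellationToken token)
{
    int done = 0;
    foreach (var logFile in pendingLogFiles)
    {
        token.ThrowIfCancellationRequested();
        await InsertTestData(getSerialNumber(logFile), getTestResult(logFile));
        progress.Report(++done);
    }
}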
You can replace the BGW calls in your code with just await command.ExecuteNonQueryAsync(). You could create an asynchronous method that inserts the data into the database:
private async Task InsertTestData(string serialNumber, string testResult)
{
    // if the server is online, send the data
    if (serverIsOnline)
    {
        using (SqlConnection connection = new SqlConnection(globalConnectionString))
        {
            var someCommand = new SqlCommand("some insert/update command here!", connection);
            someCommand.Parameters.Add("@serial", SqlDbType.NVarChar, 30).Value = serialNumber;
            // ... add the remaining parameters
            connection.Open();
            await someCommand.ExecuteNonQueryAsync();
        }
    }
}
If retrieving the serial number and test data is time-consuming, you can use Task.Run to run each of them in the background:
string serialNumber = await Task.Run(()=>getSerialNumber(logFile));
string testResult = await Task.Run(()=>getTestResult(logFile));
await InsertTestData(serialNumber,testResult);
You could also use a library like Dapper to simplify the database code:
private async Task InsertTestData(string serialNumber, string testResult)
{
    // if the server is online, send the data
    if (serverIsOnline)
    {
        using (SqlConnection connection = new SqlConnection(globalConnectionString))
        {
            await connection.ExecuteAsync("INSERT .... VALUES(@serial, @test)",
                new { serial = serialNumber, test = testResult });
        }
    }
}
Dapper will generate a parameterized query and match the parameters in the query with properties in the anonymous object by name.
Reading the connection string isn't an issue here. You would have a problem if you shared the same SqlConnection object across multiple threads, but that's not the case in your code.
I believe this is a question about Isolation, one of the ACID properties. Please have a look at them.
Based on the SQL standard, a single SQL query operates on a steady (consistent) state of the table(s) it works on, so by that definition it cannot see any changes while it is being executed. However, as far as I know, not all DBMS products follow this rule perfectly. For example, there are products and/or isolation levels that allow dirty reads.
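As an illustration only (the connection string, table name and chosen level are placeholders), this is how you pick the isolation level explicitly in ADO.NET against SQL Server:

// Sketch: choosing the isolation level explicitly on a SQL Server transaction.
// ReadCommitted (the default) will not see uncommitted changes from other sessions;
// ReadUncommitted allows dirty reads.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    using (SqlTransaction tx = connection.BeginTransaction(IsolationLevel.ReadCommitted))
    using (var cmd = new SqlCommand("SELECT COUNT(*) FROM MyTbl", connection, tx))
    {
        int rows = (int)cmd.ExecuteScalar();
        Console.WriteLine(rows);
        tx.Commit();
    }
}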
Here is a very detailed explanation from another user.
We use SQLite as a shared DB in our application. (I know this is not the best solution, but a server/client architecture was not possible.)
There are only a few users, a very small DB, and just a few writes.
The application is written in C# and we use System.Data.SQLite.dll, but the problem also occurs, for example, with SQLiteDatabaseBrowser.
As long as only one user connects to the DB and queries some results, it is very fast, just a few milliseconds. One user can establish multiple connections and execute SELECT statements in parallel; this has no impact on performance either.
But as soon as another user from a different machine connects to the DB, the performance becomes very poor for every connected user, and it stays poor until all connections/apps are closed.
After that, the first user to connect gets the good performance back, until the next user connects.
I tried many things:
PRAGMA synchronous = OFF
updated to the latest SQLite version (and created a new DB file with that version)
DB file read-only
network share read-only for everyone
connection string with different options (nearly all of them)
different SQLite programs (our application and SQLiteDatabaseBrowser)
different file systems hosting the DB file (NTFS and FAT32)
After that, I wrote a little app that opens a connection, queries some results, and displays the elapsed time, all in an endless loop.
Here is the code of this simple app:
static void Main(string[] args)
{
    SQLiteConnectionStringBuilder conBuilder = new SQLiteConnectionStringBuilder();
    conBuilder.DataSource = args[0];
    conBuilder.Pooling = false;
    conBuilder.ReadOnly = true;

    string connectionString = conBuilder.ConnectionString;

    while (true)
    {
        RunQueryInNewConnection(connectionString);
        System.Threading.Thread.Sleep(500);
    }
}

static void RunQuery(SQLiteConnection con)
{
    using (SQLiteCommand cmd = con.CreateCommand())
    {
        cmd.CommandText = "select * from TabKatalog where ReferenzName like '%0%'";
        Console.WriteLine("Execute Query: " + cmd.CommandText);

        Stopwatch watch = new Stopwatch();
        watch.Start();

        int lines = 0;
        SQLiteDataReader reader = cmd.ExecuteReader();
        while (reader.Read())
            lines++;

        watch.Stop();
        Console.WriteLine("Query result: " + lines + " in " + watch.ElapsedMilliseconds + " ms");
    }
}

static void RunQueryInNewConnection(string pConnectionString)
{
    using (SQLiteConnection con = new SQLiteConnection(pConnectionString, true))
    {
        con.Open();
        RunQuery(con);
    }

    System.Data.SQLite.SQLiteConnection.ClearAllPools();
    GC.Collect();
    GC.WaitForPendingFinalizers();
}
While testing with this little app, I realised that it is enough for another system to take a file handle on the SQLite DB to decrease the performance. So it seems that this has nothing to do with the connection to the DB. The performance stays low until ALL file handles are released; I tracked it with procexp.exe. In addition, only the remote systems encounter the performance issue; on the host of the DB file itself, the queries run fast every time.
Has anybody encountered the same issue, or does anybody have some hints?
Windows does not cache files that are concurrently accessed on another computer.
If you need high concurrency, consider using a client/server database.
I wrote an application whose purpose is to read logs from a large table (90 million rows) and process them into easily understandable stats: how many, how long, etc.
The first run took 7.5 hours and only had to process 27 million of the 90 million rows. I would like to speed this up, so I am trying to run the queries in parallel. But when I run the code below, within a couple of minutes I crash with an OutOfMemory exception.
Environments:

Sync
Test: 26 applications, 15 million logs, 5 million retrieved, < 20 MB, takes 20 seconds
Production: 56 applications, 90 million logs, 27 million retrieved, < 30 MB, takes 7.5 hours

Async
Test: 26 applications, 15 million logs, 5 million retrieved, < 20 MB, takes 3 seconds
Production: 56 applications, 90 million logs, 27 million retrieved, OutOfMemory exception
public void Run()
{
    List<Application> apps;

    //Query for apps
    using (var ctx = new MyContext())
    {
        apps = ctx.Applications.Where(x => x.Type == "TypeIWant").ToList();
    }

    var tasks = new Task[apps.Count];
    for (int i = 0; i < apps.Count; i++)
    {
        var app = apps[i];
        tasks[i] = Task.Run(() => Process(app));
    }

    //try catch
    Task.WaitAll(tasks);
}

public void Process(Application app)
{
    //Query for logs for time period
    using (var ctx = new MyContext())
    {
        var logs = ctx.Logs.Where(l => l.Id == app.Id).AsNoTracking();
        foreach (var log in logs)
        {
            Interlocked.Increment(ref _totalLogsRead);
            var l = log;
            Task.Run(() => ProcessLog(l, app.Id));
        }
    }
}
Is it ill-advised to create 56 contexts?
Do I need to dispose of and re-create contexts after a certain number of logs has been retrieved?
Perhaps I'm misunderstanding how the IQueryable works? <-- My guess
My understanding is that it retrieves logs as needed; I guess that means the foreach loop behaves like a yield? Or is my issue that 56 'threads' call the database and I am storing 27 million logs in memory?
Side question
The results don't really scale together. Based on the test environment results, I would expect production to only take a few minutes. I assume the increase is directly related to the number of records in the table.
With 27 million rows the problem is one of stream processing, not parallel execution. You need to approach the problem as you would with SQL Server's SSIS or any other ETL tool: each processing step is a transformation that processes its input and sends its output to the next step.
Parallel processing is achieved by using a separate thread to run each step. Some steps could also use multiple threads to process multiple inputs, up to a limit. Setting limits on each step's thread count and input buffer ensures you can achieve maximum throughput without flooding your machine with waiting tasks.
.NET's TPL Dataflow addresses exactly this scenario. It provides blocks to transform inputs to outputs (TransformBlock), split collections into individual messages (TransformManyBlock), execute actions without transformations (ActionBlock), combine data in batches (BatchBlock), etc.
You can also specify the maximum degree of parallelism for each step so that, e.g., you have only one log query executing at a time but use 10 tasks for log processing.
In your case, you could (a sketch follows this list):
Start with a TransformManyBlock that receives an application type and returns a list of app IDs.
A TransformBlock reads the logs for a specific ID and sends them downstream.
An ActionBlock processes the batch.
Step #3 could be broken into several further steps. E.g., if you don't need to process all of an app's log entries together, you can add a step that processes individual entries, or first group them by date.
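A minimal sketch of such a pipeline, assuming the MyContext, Applications, Logs and ProcessLog members from the question and a Log entity type (name assumed); the capacities are illustrative, and step 2 uses a TransformManyBlock here because each ID yields many log entries:

public async Task RunPipelineAsync()
{
    var readOptions = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 1,   // only one log query at a time
        BoundedCapacity = 1000        // throttle how many entries wait in memory
    };
    var processOptions = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 10,  // process up to 10 entries concurrently
        BoundedCapacity = 1000
    };
    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

    // Step 1: application type -> application IDs
    var getAppIds = new TransformManyBlock<string, int>(type =>
    {
        using (var ctx = new MyContext())
            return ctx.Applications.Where(x => x.Type == type).Select(x => x.Id).ToList();
    });

    // Step 2: application ID -> its log entries, streamed downstream
    var getLogs = new TransformManyBlock<int, Log>(appId => ReadLogs(appId), readOptions);

    // Step 3: process each entry (the question filters logs by l.Id == app.Id,
    // so log.Id doubles as the app ID here)
    var processLog = new ActionBlock<Log>(log => ProcessLog(log, log.Id), processOptions);

    getAppIds.LinkTo(getLogs, linkOptions);
    getLogs.LinkTo(processLog, linkOptions);

    getAppIds.Post("TypeIWant");
    getAppIds.Complete();
    await processLog.Completion;
}

// Lazily yields entries so they flow downstream instead of piling up in one list.
private static IEnumerable<Log> ReadLogs(int appId)
{
    using (var ctx = new MyContext())
    {
        foreach (var log in ctx.Logs.Where(l => l.Id == appId).AsNoTracking())
            yield return log;
    }
}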
Another option is to create a custom block that reads data from the database using a DbDataReader and posts each entry to the next step immediately, instead of waiting for all rows to return. This lets you process each entry as it arrives instead of waiting to receive all of them.
If each app log contains many entries, this could be a huge memory and time saver.
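A hedged sketch of such a reader step; the connection string, SQL text, column layout and the LogEntry DTO are all assumptions, not taken from the question:

public class LogEntry
{
    public int Id { get; set; }
    public string Message { get; set; }
    public DateTime LoggedAt { get; set; }
}

// Reads rows one at a time and pushes each one to the next Dataflow block.
private static async Task StreamLogsAsync(int appId, ITargetBlock<LogEntry> target, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT Id, Message, LoggedAt FROM Logs WHERE AppId = @appId", connection))
    {
        command.Parameters.AddWithValue("@appId", appId);
        await connection.OpenAsync();

        using (var reader = await command.ExecuteReaderAsync())
        {
            while (await reader.ReadAsync())
            {
                var entry = new LogEntry
                {
                    Id = reader.GetInt32(0),
                    Message = reader.GetString(1),
                    LoggedAt = reader.GetDateTime(2)
                };

                // SendAsync honours the target block's BoundedCapacity,
                // giving back-pressure instead of unbounded buffering.
                await target.SendAsync(entry);
            }
        }
    }
}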
I have some data (approximately 5 million items in 1,500 tables, 10 GB) in Azure table storage. The entities can be large and contain some serialized binary data in the protobuf format.
I have to process all of them and transform them into another structure. This processing is not thread-safe. I also process some data from a MongoDB replica set using the same code (the MongoDB is hosted in another data center).
For debugging purposes I log the throughput and realized that it is very low. With MongoDB I get a throughput of 5,000 items/sec; with Azure table storage only 30 items per second.
To improve the performance I tried to use TPL Dataflow, but it doesn't help:
public async Task QueryAllAsync(Action<StoredConnectionSetModel> handler)
{
    List<CloudTable> tables = await QueryAllTablesAsync(companies, minDate);

    ActionBlock<StoredConnectionSetModel> handlerBlock = new ActionBlock<StoredConnectionSetModel>(handler,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

    ActionBlock<CloudTable> downloaderBlock = new ActionBlock<CloudTable>(x => QueryTableAsync(x, s => handlerBlock.Post(s)),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });

    foreach (CloudTable table in tables)
    {
        downloaderBlock.Post(table);
    }
}

private static async Task QueryTableAsync(CloudTable table, Action<StoredConnectionSetModel> handler)
{
    TableQuery<AzureTableEntity<StoredConnectionSetModel>> query = new TableQuery<AzureTableEntity<StoredConnectionSetModel>>();
    TableContinuationToken token = null;

    do
    {
        TableQuerySegment<AzureTableEntity<StoredConnectionSetModel>> segment =
            await table.ExecuteQuerySegmentedAsync<AzureTableEntity<StoredConnectionSetModel>>(query, token);

        foreach (var entity in segment.Results)
        {
            handler(entity.Entity);
        }

        token = segment.ContinuationToken;
    }
    while (token != null);
}
I run the batch process on my local machine (with a 100 Mbit connection) and in Azure (as a worker role), and it is very strange that the throughput on my machine is higher (100 items/sec) than in Azure. I reach the max capacity of my internet connection locally, but the worker role should not have this 100 Mbit limitation, I hope.
How can I increase the throughput? I have no idea what is going wrong here.
EDIT: I realized that I was wrong about the 30 items per second. It is often higher (100/sec), depending on the size of the items I guess. According to the documentation (http://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/#subheading10) there is a limit:
The scalability limit for accessing tables is up to 20,000 entities (1 KB each) per second for an account. That is only about 19 MB/sec, which is not so impressive if you keep in mind that there are also normal requests from the production system. I will probably test using multiple accounts.
EDIT #2: I made two separate tests, starting with a list of 500 keys [1...500] (pseudo-code):
Test #1: Old approach (TABLE 1)

foreach (key1 in keys)
    foreach (key2 in keys)
        insert new Entity { partitionKey = key1, rowKey = key2 }

Test #2: New approach (TABLE 2)

numPartitions = 100
foreach (key1 in keys)
    foreach (key2 in keys)
        insert new Entity { partitionKey = (key1 + key2).GetHashCode() % numPartitions, rowKey = key1 + key2 }
Each entity gets another property with 10KB of random text data.
Then I ran the query tests. In the first case I simply query all entities from Table 1 in one thread (sequentially).
In the next test I create one task for each partition key and query all entities from Table 2 (in parallel). I know the test is not that good, because in my production environment I have a lot more partitions than just 500 per table, but it doesn't matter; at least the second attempt should perform well.
It makes no difference. My max throughput is 600 entities/sec, varying between 200 and 400 most of the time. The documentation says that I can query 20,000 entities/sec (1 KB each), so I should get at least 1,500 or so on average, I think. I tested it on a machine with a 500 Mbit internet connection and I only reached about 30 Mbit, so that should not be the problem.
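For reference, a sketch of the per-partition parallel query using the same client API as the code above; the partition key list is illustrative, and raising ServicePointManager.DefaultConnectionLimit (mentioned in the performance checklist linked above) is usually needed as well, since the default caps concurrent HTTP requests:

// Sketch only: one query per partition key, run concurrently with Task.WhenAll.
// Consider also: ServicePointManager.DefaultConnectionLimit = 100;
private static async Task<int> QueryPartitionsInParallelAsync(CloudTable table, IEnumerable<string> partitionKeys)
{
    var tasks = partitionKeys.Select(async pk =>
    {
        var query = new TableQuery<AzureTableEntity<StoredConnectionSetModel>>()
            .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, pk));

        int count = 0;
        TableContinuationToken token = null;
        do
        {
            var segment = await table.ExecuteQuerySegmentedAsync(query, token);
            count += segment.Results.Count;
            token = segment.ContinuationToken;
        }
        while (token != null);

        return count;
    }).ToList();

    int[] counts = await Task.WhenAll(tasks);
    return counts.Sum();
}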
You should also check out the Table Storage Design Guide. Hope this helps.
I am using Parallel.ForEach to do work on multiple threads, using a new EF5 DbContext for each iteration, all wrapped within a TransactionScope, as follows:
using (var transaction = new TransactionScope())
{
    int[] supplierIds;

    using (var appContext = new AppContext())
    {
        supplierIds = appContext.Suppliers.Select(s => s.Id).ToArray();
    }

    Parallel.ForEach(
        supplierIds,
        supplierId =>
        {
            using (var appContext = new AppContext())
            {
                // Do some work...
                appContext.SaveChanges();
            }
        });

    transaction.Complete();
}
After running for a few minutes it is throwing an EntityException "The underlying provider failed on Open" with the following inner detail:
"The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions."
Does anyone know what's causing this or how it can be prevented? Thanks.
You could also try limiting the number of concurrent tasks in the Parallel.ForEach() call with new ParallelOptions { MaxDegreeOfParallelism = 8 } (replace 8 with whatever you want to limit it to).
See MSDN for more details.
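For example, a sketch based on the question's own loop (8 is an arbitrary cap):

// Sketch: limit the number of concurrent iterations, and therefore the number
// of concurrent contexts/connections, to 8.
var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

Parallel.ForEach(
    supplierIds,
    options,
    supplierId =>
    {
        using (var appContext = new AppContext())
        {
            // Do some work...
            appContext.SaveChanges();
        }
    });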
You should also find out why your app is taking such a huge number of locks. You have wrapped a TransactionScope around multiple DB connections. This probably causes a distributed transaction, which might have something to do with it. It certainly causes locks to never be released until the very end. Change that.
You can only turn up the locking limits so far; it does not scale to arbitrary numbers of supplier IDs. You need to find the cause of the locks, not mitigate the symptoms.
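If you do not actually need every supplier to commit or roll back as one unit, one way to change that is to drop the outer scope and give each iteration its own short transaction, so locks are released as each supplier's work commits. A sketch based on the question's code:

// Sketch: one short transaction per supplier instead of a single scope around all of them.
Parallel.ForEach(supplierIds, supplierId =>
{
    using (var transaction = new TransactionScope())
    using (var appContext = new AppContext())
    {
        // Do some work...
        appContext.SaveChanges();

        transaction.Complete();
    }
});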
You are running into the maximum number of locks allowed by SQL Server, which by default is set automatically and governed by available memory.
You can:
Set it manually - I forget exactly how, but Google is your friend.
Add more memory to your sql server
Commit your transactions more frequently.