I have about 6500 files for a sum of about 17 GB of data, and this is the first time that I've had to move what I would call a large amount of data. The data is on a network drive, but the individual files are relatively small (max 7 MB).
I'm writing a program in C#, and I was wondering if I would notice a significant difference in performance if I used BULK INSERT instead of SQLBulkCopy. The table on the server also has an extra column, so if I use BULK INSERT I'll have to use a format file and then run an UPDATE for each row.
I'm new to forums, so if there is a better way to ask this question, feel free to mention that as well.
In my testing, BULK INSERT is much faster. After an hour using SQLBulkCopy, I was maybe a quarter of the way through my data, and I had finished writing the alternative method (and having lunch). By the time I finished writing this post (~3 minutes), BULK INSERT was about a third of the way through.
For anyone who is looking at this as a reference, it is also worth mentioning that the upload is faster without a primary key.
It should be noted that one major cause of this could be that the server was a significantly more powerful machine, so this is not an analysis of algorithmic efficiency. However, I would still recommend BULK INSERT, as the average server is probably significantly faster than the average desktop computer.
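For context, a hedged sketch of how the BULK INSERT route can be kicked off from C# via SqlCommand; the table name, data file, and format file paths are purely illustrative, and the files must be readable by the SQL Server service account:

using System.Data.SqlClient;

static class BulkInsertRunner
{
    public static void Load(string connectionString)
    {
        const string sql = @"
            BULK INSERT dbo.TargetTable
            FROM '\\fileserver\share\data\file0001.dat'
            WITH (FORMATFILE = '\\fileserver\share\data\file.fmt',
                  TABLOCK,
                  BATCHSIZE = 10000);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection) { CommandTimeout = 0 }) // large loads exceed the 30 s default
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}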
Related
I am working to increase the performance of bulk loads: hundreds of millions of records and up, daily.
I moved this over to use the IDataReader interface instead of DataTables and got a noticeable performance boost (500,000 more records a minute). The current setup is:
A custom cached reader to parse the delimited files.
Wrapping the stream reader in a buffered stream.
A custom object reader class that enumerates over the objects and implements the IDataReader interface.
SqlBulkCopy then writes to the server (a minimal sketch of this hand-off follows).
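For reference, a minimal sketch of that hand-off, with the custom reader passed in as an IDataReader; the destination table name is illustrative:

using System.Data;
using System.Data.SqlClient;

static class Loader
{
    // parsedRows is the custom IDataReader built over the buffered, parsed file.
    public static void Load(string connectionString, IDataReader parsedRows)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.TargetTable"; // illustrative name
            bulk.EnableStreaming = true;                   // pull rows from the reader instead of buffering them
            bulk.BulkCopyTimeout = 0;                      // no timeout for long loads
            bulk.WriteToServer(parsedRows);                // this is where the 15+ minutes are spent
        }
    }
}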
The bulk of the performance bottleneck is directly in SqlBulkCopy.WriteToServer. If I unit test the process up to, but excluding, WriteToServer, it returns in roughly 1 minute; WriteToServer adds 15+ minutes on top of that. The unit test runs on my local machine, on the same drive the database lives on, so it isn't copying the data across the network.
I am using a heap table (no indexes, clustered or nonclustered), and I have played around with various batch sizes without major differences in performance.
There is a need to decrease the load times, so I am hoping someone might know a way to squeeze a little more blood out of this turnip.
Why not use SSIS directly?
Anyway, if you are streaming from the parser into an IDataReader, you're already on the right path. To optimize SqlBulkCopy itself you need to turn your focus to SQL Server. The key is minimally logged operations. You must read these MSDN articles:
Prerequisites for Minimal Logging in Bulk Import.
Optimizing Bulk Import Performance.
If your target is a B-tree (i.e. a clustered indexed table), then unfortunately one of the most important tenets of performant bulk insert, namely the sorted-input rowset, cannot be declared. As simple as this sounds, ADO.NET SqlClient does not have an equivalent of SSPROP_FASTLOADOPTIONS -> ORDER(Column) (OleDb). Since the engine does not know that the data is already sorted, it will add a Sort operator to the plan, which is not that bad except when it spills. To avoid spills, use a small batch size (~10k). See my original point: all of these are just options and clicks to set in SSIS rather than digging through the OleDB MSDN spec...
If your data stream is unsorted to start with, or the destination is a heap, then my point above is moot.
However, achieving minimal logging is still a must for decent performance. A sketch of the SqlBulkCopy settings involved follows.
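To make that concrete, here is a hedged sketch of the two knobs discussed above: SqlBulkCopyOptions.TableLock (TABLOCK is one of the prerequisites for minimal logging when the target is a heap) and a small batch size (~10k) when the target has a clustered index, to keep the implicit Sort from spilling. Table and reader names are illustrative.

using System.Data;
using System.Data.SqlClient;

static class MinimalLoggingLoad
{
    public static void Load(string connectionString, IDataReader rows, bool targetIsHeap)
    {
        // TABLOCK is required (among other conditions) to qualify for minimal logging on a heap.
        var options = targetIsHeap ? SqlBulkCopyOptions.TableLock : SqlBulkCopyOptions.Default;

        using (var bulk = new SqlBulkCopy(connectionString, options))
        {
            bulk.DestinationTableName = "dbo.TargetTable"; // illustrative name
            bulk.BatchSize = targetIsHeap ? 0 : 10000;     // 0 = one batch; ~10k when the plan adds a Sort
            bulk.BulkCopyTimeout = 0;
            bulk.WriteToServer(rows);
        }
    }
}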
I have to transfer data from a DataTable to the database. The total is around 15k records each time.
I want to reduce the time the insert takes. Should I use SqlBulkCopy, or something else?
If you use C#, take a look at the SqlBulkCopy class. There is a good article on CodeProject about how to use it with a DataTable.
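A minimal sketch of that usage, assuming the 15k rows are already in a DataTable whose column names match the destination table (names here are illustrative):

using System.Data;
using System.Data.SqlClient;

static class BulkSave
{
    public static void Save(DataTable rows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "dbo.TargetTable"; // illustrative name
            connection.Open();
            bulk.WriteToServer(rows);                      // one streamed operation instead of 15k INSERTs
        }
    }
}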
If you just aim for speed, then bulk copy might suit you well. Depending on the requirements, you can also reduce the logging level.
http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/edf280b9-2b27-48bb-a2ff-cab4b563dcb8/
How to copy a huge table data into another table in SQL Server
SQL bulk copy is definitely fastest with this kind of sample size, as it streams the data rather than using regular SQL commands. This gives it a slightly longer startup time, but it pulls ahead once you're inserting more than ~120 rows at a time (according to my tests).
Check out my post here if you want some stats on how SQL bulk copy stacks up against a few other methods:
http://blog.staticvoid.co.nz/2012/8/17/mssql_and_large_insert_statements
15k records is not that much, really. I have a laptop with an i5 CPU and 4 GB of RAM, and it inserts 15k records in several seconds. I wouldn't really bother spending too much time on finding an ideal solution.
I have a database that contains more than 100 million records. I am running a query that returns more than 10 million records. This process takes too much time, so I need to shorten it. I want to save the resulting record list as a CSV file. How can I do this as quickly and optimally as possible? Looking forward to your suggestions. Thanks.
I'm assuming that your query is already constrained to the rows/columns you need, and makes good use of indexing.
At that scale, the only critical thing is that you don't try to load it all into memory at once; so forget about things like DataTable, and most full-fat ORMs (which typically try to associate rows with an identity-manager and/or change-manager). You would have to use either the raw IDataReader (from DbCommand.ExecuteReader), or any API that builds a non-buffered iterator on top of that (there are several; I'm biased towards dapper). For the purposes of writing CSV, the raw data-reader is probably fine.
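A minimal sketch of that shape, assuming a SQL Server source (the query, connection string, and output path are illustrative): one forward-only reader feeding one streamed writer, with nothing buffered in memory. Field quoting/escaping is deliberately omitted here to keep the shape clear.

using System.Data;
using System.Data.SqlClient;
using System.IO;

static class CsvExport
{
    public static void Export(string connectionString, string sql, string path)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection) { CommandTimeout = 0 })
        using (var writer = new StreamWriter(path))
        {
            connection.Open();
            using (IDataReader reader = command.ExecuteReader())
            {
                var values = new object[reader.FieldCount];
                while (reader.Read())                        // one row at a time, never the whole set
                {
                    reader.GetValues(values);
                    writer.WriteLine(string.Join(",", values));
                }
            }
        }
    }
}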
Beyond that: you can't make it go much faster, since you are bandwidth constrained. The only way you can get it faster is to create the CSV file at the database server, so that there is no network overhead.
Chances are pretty slim you need to do this in C#. This is the domain of bulk data loading/exporting (commonly used in Data Warehousing scenarios).
Many (free) tools (I imagine even Toad by Quest Software) will do this more robustly and more efficiently than you can write it in any platform.
I have a hunch that you don't actually need this for an end-user (the simple observation is that the department secretary doesn't actually need to mail out copies of that; it is too large to be useful in that way).
I suggest using the right tool for the job. And whatever you do,
do not roll your own data type conversions
use CSV with quoted literals, and remember to escape the double quotes inside them
think of regional options (IOW: always use InvariantCulture for export/import!); a small formatting/escaping helper is sketched after this list
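As a hedged illustration of those last two bullets, a small helper that formats values with CultureInfo.InvariantCulture, quotes fields that need it, and doubles any embedded quotes:

using System;
using System.Globalization;

static class Csv
{
    public static string Field(object value)
    {
        if (value == null || value is DBNull) return "";

        // Invariant formatting keeps decimal separators and date formats stable across machines.
        var formattable = value as IFormattable;
        string text = formattable != null
            ? formattable.ToString(null, CultureInfo.InvariantCulture)
            : value.ToString();

        // Quote if the field contains a delimiter, a quote, or a newline; double embedded quotes.
        bool needsQuoting = text.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuoting ? "\"" + text.Replace("\"", "\"\"") + "\"" : text;
    }
}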
"This process takes too much time so i need to shorten this time. "
This process consists of three sub-processes:
Retrieving > 10m records
Writing records to file
Transferring records across the network (my presumption is you are working with a local client against a remote database)
Any or all of those issues could be a bottleneck. So, if you want to reduce the total elapsed time you need to figure out where the time is spent. You will probably need to instrument your C# code to get the metrics.
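A rough sketch of that instrumentation, assuming the streaming export shape discussed above: time to the first result approximates query cost, the Read() loop is dominated by network transfer, and the writer time is the local file cost. All names are illustrative.

using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

static class MeasuredExport
{
    public static void Run(string connectionString, string sql, string path)
    {
        var query = Stopwatch.StartNew();
        var fetch = new Stopwatch();   // pulling rows across the network
        var write = new Stopwatch();   // writing the local file

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection) { CommandTimeout = 0 })
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                query.Stop();          // rough proxy for query/execution cost
                using (var writer = new StreamWriter(path))
                {
                    var values = new object[reader.FieldCount];
                    while (true)
                    {
                        fetch.Start();
                        bool more = reader.Read();
                        fetch.Stop();
                        if (!more) break;

                        reader.GetValues(values);
                        write.Start();
                        writer.WriteLine(string.Join(",", values));
                        write.Stop();
                    }
                }
            }
        }

        Console.WriteLine("query: {0}  transfer: {1}  file write: {2}",
                          query.Elapsed, fetch.Elapsed, write.Elapsed);
    }
}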
If it turns out the query is the problem then you will need to tune it. Indexes won't help here, as you're retrieving a large chunk of the table (> 10%), so increasing the performance of a full table scan is what will help: for instance, increasing memory to avoid disk sorts. Parallel query could be useful (if you have Enterprise Edition and sufficient CPUs). Also check that the problem isn't a hardware issue (spindle contention, dodgy interconnects, etc.).
Can writing to a file be the problem? Perhaps your disk is slow for some reason (e.g. fragmentation) or perhaps you're contending with other processes writing to the same directory.
Transferring large amounts of data across a network is obviously a potential bottleneck. Are you certain you're only sending relevant data to the client?
An alternative architecture: use PL/SQL to write the records to a file on the database server, using bulk collect to retrieve manageable batches of records, and then transfer the file to where you need it at the end, via FTP, perhaps compressing it first.
The real question is why you need to read so many rows from the database (and such a large proportion of the underlying dataset). There are lots of approaches which should make this scenario avoidable, obvious ones being asynchronous processing, message queueing and pre-consolidation.
Leaving that aside for now...if you're consolidating the data or sifting it, then implementing the bulk of the logic in PL/SQL saves having to haul the data across the network (even if it's just to localhost, there's still a big overhead). Again if you just want to dump it out into a flat file, implementing this in C# isn't doing you any favours.
I am trying to use sqlite in my application as a sort of cache. I say sort of because items never expire from my cache and I am not storing anything. I simply need to use the cache to store all ids I processed before. I don't want to process anything twice.
I am entering items into the cache at 10,000 messages/sec, for a total of 150 million messages. My table is pretty simple: it only has one text column, which stores the IDs. I was doing this all in memory using a dictionary; however, I am processing millions of messages and, although it is fast that way, I ran out of memory after some time.
I have researched sqlite and performance and I understand that configuration is key, however, I am still getting horrible performance on inserts (I haven't tried selects yet). I am not able to keep up with even 5000 inserts/sec. Maybe this is as good as it gets.
My connection string is as below:
Data Source=filename;Version=3;Count Changes=off;Journal Mode=off;
Pooling=true;Cache Size=10000;Page Size=4096;Synchronous=off
Thanks for any help you can provide!
If you are doing lots of inserts or updates at once, put them in a transaction.
Also, if you are executing essentially the same SQL each time, use a parameterized statement.
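A minimal sketch of both suggestions using the System.Data.SQLite provider (which the connection string above suggests); the table and column names are illustrative:

using System.Collections.Generic;
using System.Data;
using System.Data.SQLite;

static class SeenIdCache
{
    public static void InsertBatch(SQLiteConnection connection, IEnumerable<string> ids)
    {
        using (var transaction = connection.BeginTransaction())   // one commit per batch, not per row
        using (var command = new SQLiteCommand(
            "INSERT INTO seen_ids (id) VALUES (@id)", connection, transaction))
        {
            var parameter = command.Parameters.Add("@id", DbType.String);
            foreach (var id in ids)
            {
                parameter.Value = id;        // re-bind the parameter instead of re-parsing the SQL
                command.ExecuteNonQuery();
            }
            transaction.Commit();
        }
    }
}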
Have you looked at the SQLite Optimization FAQ (a bit old)?
SQLite performance tuning and optimization on embedded systems
If you have many threads writing to the same database, then you're going to run into concurrency problems with that many transactions per second. SQLite always locks the whole database for writes so only one write transaction can be processed at a time.
An alternative is Oracle Berkeley DB with SQLite. The latest version of Berkeley DB includes a SQLite front end that has a page-level locking mechanism instead of database-level locking. This provides a much higher number of transactions per second when there is a high concurrency requirement.
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It includes the same SQLite.NET provider and is supposed to be a drop-in replacement.
Since your requirements are so specific, you may be better off with something more dedicated, like memcached. This will provide a very high-throughput caching implementation that will be a lot more memory efficient than a simple hashtable.
Is there a port of memcache to .Net?
I'm dealing with chunks of data that are 50k rows each.
I'm inserting them into an SQL database using LINQ:
for (int i = 0; i < 50000; i++)
{
    DB.TableName.InsertOnSubmit(new TableName
    {
        Value1 = Array[i, 0],
        Value2 = Array[i, 1]
    });
}
DB.SubmitChanges();
This takes about 6 minutes, and I want it to take much less if possible. Any suggestions?
If you are reading in a file, you'd be better off using BULK INSERT (Transact-SQL). If you are writing that much (50K rows) at one time from memory, you might be better off writing to a flat file first and then using BULK INSERT on that file.
As you are doing a simple insert and not gaining much from the use of LINQ to SQL, have a look at SqlBulkCopy; it will remove most of the round trips and reduce the overhead on the SQL Server side as well. You will have to make very few coding changes to use it.
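A hedged sketch of that change for the 50k-row case above, assuming the source is a string[,] like the array in the question (connection string, table, and column names are illustrative):

using System.Data;
using System.Data.SqlClient;

static class ArrayBulkLoad
{
    public static void Load(string[,] array, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("Value1", typeof(string));
        table.Columns.Add("Value2", typeof(string));

        for (int i = 0; i < array.GetLength(0); i++)
            table.Rows.Add(array[i, 0], array[i, 1]);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.TableName";
            bulk.ColumnMappings.Add("Value1", "Value1");
            bulk.ColumnMappings.Add("Value2", "Value2");
            bulk.WriteToServer(table);                     // one streamed load instead of 50k InsertOnSubmit calls
        }
    }
}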
Also look at pre-sorting your data by the column the table is indexed on, as this will lead to better cache hits when SQL Server is updating the table.
Also consider whether you should upload the data to an unindexed temp staging table, then use a stored proc to insert into the main table with a single SQL statement. This may let SQL Server spread the indexing work over all your CPUs.
There are a lot of things you need to check/do.
How much disk space is allocated to the database? Is there enough free space to do all of the inserts without the file auto-growing? If not, increase the database file size up front, otherwise it has to stop every so many inserts to auto-grow the file.
do NOT do individual inserts. They take way too long. Instead use table-valued parameters (SQL 2008), SqlBulkCopy, or a single INSERT statement, in that order of preference. A table-valued parameter sketch follows this list.
drop any indexes on that table before the load and recreate them after. With that many inserts they are probably going to be fragmented to hell anyway.
If you have any triggers, consider dropping them until the load is complete.
Do you have enough RAM available on the database server? Check on the server itself to see whether it's consuming ALL the available RAM. If so, you might consider doing a reboot prior to the load... SQL Server has a tendency to consume and hold on to everything it can get its hands on.
Along the RAM lines, we like to keep enough RAM in the server to hold the entire database in memory. I'm not sure if this is feasible for you or not.
How is its disk speed? Is the queue depth pretty long? Other than hardware replacement there's not much to be done here.
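As promised above, a hedged sketch of the table-valued parameter option (SQL Server 2008+). It assumes a user-defined table type already exists on the server, e.g. CREATE TYPE dbo.IdValueList AS TABLE (Id INT, Value NVARCHAR(100)); the type, table, and column names are illustrative.

using System.Data;
using System.Data.SqlClient;

static class TvpInsert
{
    public static void Insert(DataTable rows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO dbo.TargetTable (Id, Value) SELECT Id, Value FROM @rows", connection))
        {
            var parameter = command.Parameters.AddWithValue("@rows", rows);
            parameter.SqlDbType = SqlDbType.Structured;    // marks the parameter as a TVP
            parameter.TypeName = "dbo.IdValueList";        // the server-side table type
            connection.Open();
            command.ExecuteNonQuery();                     // one round trip for the whole batch
        }
    }
}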