I frequently work with large DataSets in a C# console application (around 1 million rows with roughly 30 columns) that I need to process sequentially. These DataSets are first extracted from a remote database; I can't extract them in smaller chunks because the round trips over the wire would be too expensive.
What options do I have for breaking these into smaller chunks locally and reading, say, 10,000 records at a time?
I don't have a lot of RAM, just around 2 GB or so. Is there an efficient way for me to page these DataSets locally?
Edit:
Would it make sense to serialize the DataTable or List, store it in a local NoSQL repository, and then keep fetching 10,000 records at a time?
If you are using a web application, you can enable the "EnablePaging" attribute of your data source control.
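For a console application, one local approximation is simply to walk the already-loaded DataTable in fixed-size pages, so each pass only touches 10,000 rows (the full table still has to fit in memory once). A minimal sketch, with the per-record processing left as a placeholder:

    using System;
    using System.Data;

    class LocalPager
    {
        const int PageSize = 10000;

        static void ProcessInPages(DataTable table)
        {
            // Walk the already-loaded rows one fixed-size page at a time,
            // so the per-page work (and any per-page allocations) stays bounded.
            for (int offset = 0; offset < table.Rows.Count; offset += PageSize)
            {
                int count = Math.Min(PageSize, table.Rows.Count - offset);
                for (int i = 0; i < count; i++)
                {
                    DataRow row = table.Rows[offset + i];
                    // ... sequential processing of one record goes here ...
                }
                // Optionally release any per-page scratch data before the next page.
            }
        }
    }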
I have a C# program that has to process about 100,000 items using ADO.NET access to a SQL Server database, and in one part of the program I have to make a performance decision:
During processing, for each item I have to read data from the database:
should I query the database once for every item, or should I query once at the beginning for all items and keep those 100,000 rows of data (about 10 columns, int and string) in a C# object in memory, retrieving the required data from it?
If you have a reasonably static data set and enough memory to read everything upfront and keep the results cached without starving the rest of your system of memory, the answer is very easy: you should do it.
There are two major components to the cost of any DB operation: the cost of the data transfer and the cost of a round-trip. In your case, the cost of the data transfer is fixed, because the total number of bytes does not change based on whether you retrieve them all at once or get them one chunk at a time.
The cost of a round-trip includes the time it takes the RDBMS to figure out what data you need from the SQL statement, locate that data, and do all the locking required to ensure that the data it serves you is consistent. A single round-trip is not expensive, but when you do it 100,000 times, the cost may very well become prohibitive. That is why it is best to read the data all at once if your memory configuration allows it.
Another concern is how dynamic your data is. If the chances of your data changing in the time it takes you to process the entire set are high, you may need to take additional precautions and check whether anything has to be re-processed once your computations are done.
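A minimal sketch of the read-once-and-cache approach described above, assuming a hypothetical Items table with an int Id and a single string Payload column:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    class ItemCache
    {
        // One round-trip: load all ~100,000 rows up front and index them by Id,
        // so per-item lookups during processing never touch the database again.
        public static Dictionary<int, string> LoadAll(string connectionString)
        {
            var cache = new Dictionary<int, string>();
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("SELECT Id, Payload FROM Items", connection)) // hypothetical table/columns
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        cache[reader.GetInt32(0)] = reader.GetString(1);
                    }
                }
            }
            return cache;
        }
    }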
I am not sure what you mean by processing, but oftentimes, if the processing can be done on the DB server, triggering a stored procedure and passing arguments to it is the preferred option. You then do not need round trips, etc. You have to decide whether you want to bring the data to the processing (from DB to application) or bring the processing to the data (processing code into a stored procedure).
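Bringing the processing to the data could look something like the sketch below; the procedure name and parameter are purely illustrative:

    using System.Data;
    using System.Data.SqlClient;

    class ServerSideProcessing
    {
        // Trigger a (hypothetical) stored procedure that does the heavy lifting
        // on the database server, so no per-item rows travel over the wire.
        public static void RunBatch(string connectionString, int batchId)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("dbo.ProcessBatch", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                command.CommandTimeout = 0; // allow long-running server-side work
                command.Parameters.AddWithValue("@BatchId", batchId);
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }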
I am building up an SSIS solution for an ETL process and I'm currently working on the extract section.
Our ETL sources come in different formats, including DB, Excel, and CSV, and have a different number of columns in each case. So instead of creating a data flow task for each source I am extracting, I am using a script task that can handle all file types and column counts.
The script reads the data into a DataTable and then bulk inserts it into the database.
This works great on the data I have been testing (approx. 30k rows), but I was wondering what happens with more rows and how that would affect performance.
I know the size limit of a DataTable is about 16 million rows, and I don't think our data sources will have that many, but I am conscious of the effect a large DataTable could have on the performance of the entire ETL process.
Would I be better off processing the source in batches of, say, 10,000 rows to improve performance?
Thanks for your help in advance
I have to transfer data from a DataTable to the database. The total is around 15k records each time.
I want to reduce the time the inserts take. Should I use SqlBulkCopy, or something else?
If you use C#, take a look at the SqlBulkCopy class. There is a good article on CodeProject about how to use it with a DataTable.
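A minimal sketch of loading a DataTable with SqlBulkCopy; the destination table name is a placeholder and the batch size is just a reasonable starting point:

    using System.Data;
    using System.Data.SqlClient;

    class BulkLoader
    {
        // Stream a DataTable into SQL Server with SqlBulkCopy; for ~15k rows
        // this is typically far faster than row-by-row INSERT statements.
        public static void Load(DataTable table, string connectionString)
        {
            using (var bulkCopy = new SqlBulkCopy(connectionString))
            {
                bulkCopy.DestinationTableName = "dbo.TargetTable"; // placeholder name
                bulkCopy.BatchSize = 5000;    // commit in batches rather than all at once
                bulkCopy.BulkCopyTimeout = 0; // no timeout for larger loads

                // Map by column name in case the DataTable and destination column order differ.
                foreach (DataColumn column in table.Columns)
                    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);

                bulkCopy.WriteToServer(table);
            }
        }
    }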
If you just aim for speed, then bulk copy might suit you well. Depending on the requirements, you can also reduce the logging level.
http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/edf280b9-2b27-48bb-a2ff-cab4b563dcb8/
How to copy a huge table data into another table in SQL Server
SQL bulk copy is definitely the fastest with this kind of sample size, as it streams the data rather than using regular SQL commands. This gives it a slightly longer startup time, but it pays off once you insert more than ~120 rows at a time (according to my tests).
Check out my post here if you want some stats on how SQL bulk copy stacks up against a few other methods.
http://blog.staticvoid.co.nz/2012/8/17/mssql_and_large_insert_statements
15k records is not that much, really. I have a laptop with an i5 CPU and 4 GB of RAM, and it inserts 15k records in a few seconds. I wouldn't spend too much time looking for an ideal solution.
We are all aware of the popular trend of MMO games, where players face each other live. However, during gameplay there is a tremendous flow of SQL inserts and queries, as outlined below:
There are, on average (minimum), 100 tournaments online per 12 minutes, or 500 players per hour
In the Game Progress table, we store each player move
A 12-round tournament of 4 players can produce 48 records
plus around the same number for spells or special items
a total of 96 per tournament, or 48,000 record inserts per hour (at 500 players/hour)
In response to my previous question (Improve MMO game performance), I changed the schema and we are no longer writing directly to the database.
Instead, we accumulate all values in a DataTable. Whenever the DataTable has more than 100k rows (which can sometimes happen within the hour), the process writes it to a text file in CSV format. A background application frequently scans the folder for CSV files, reads any available file, and stores the information in the server database.
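A minimal sketch of that flush step, with the file naming simplified and no CSV quoting or escaping of values:

    using System;
    using System.Data;
    using System.IO;
    using System.Linq;

    class MoveBuffer
    {
        const int FlushThreshold = 100000;

        // Once the in-memory buffer passes the threshold, dump it to a CSV file
        // for the background importer to pick up, then clear the buffer.
        public static void FlushIfNeeded(DataTable buffer, string dropFolder)
        {
            if (buffer.Rows.Count < FlushThreshold) return;

            string path = Path.Combine(dropFolder,
                "moves_" + DateTime.UtcNow.ToString("yyyyMMddHHmmssfff") + ".csv");

            using (var writer = new StreamWriter(path))
            {
                writer.WriteLine(string.Join(",",
                    buffer.Columns.Cast<DataColumn>().Select(c => c.ColumnName)));
                foreach (DataRow row in buffer.Rows)
                    writer.WriteLine(string.Join(",", row.ItemArray)); // assumes values contain no commas
            }

            buffer.Clear();
        }
    }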
Questions
Can we access the DataTable in the game application directly from another application (which would read the DataTable and clear the records it has read)? That way, instead of writing to and reading from disk, we would read and write directly from memory.
Is there anything quicker than a DataTable that can hold a large amount of data and still be fast for sorting and updating? We have to frequently scan for user IDs and update game status (almost at every insert). It could be a cache utility, a fast scan/search algorithm, or even a collection type. Right now we use a foreach loop to go through all records in the DataTable and update a row if the user is present; if not, we create a new row. I tried using SortedList and custom classes, but that not only doubles the effort, it also increases memory usage tremendously and slows down overall game performance.
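One indexed alternative to the foreach scan described above is to give the DataTable a primary key and use Rows.Find, which uses an internal index rather than scanning every row; the column names here are assumptions:

    using System.Data;

    class GameStatusTable
    {
        // Set the primary key once; Rows.Find then does an indexed lookup
        // instead of the full foreach scan over every row.
        public static void UpdateOrInsert(DataTable table, int userId, string status)
        {
            if (table.PrimaryKey.Length == 0)
                table.PrimaryKey = new[] { table.Columns["UserId"] }; // hypothetical column

            DataRow row = table.Rows.Find(userId);
            if (row == null)
            {
                row = table.NewRow();
                row["UserId"] = userId;
                table.Rows.Add(row);
            }
            row["GameStatus"] = status; // hypothetical column
        }
    }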
thanks
arvind
Well, let's answer:
You can share objects between applications using Remoting, but it's much slower and makes the code less readable. However, there is another solution that keeps you working in memory: you can use MemoryMappedFiles, so all the work actually uses memory rather than the disk: http://msdn.microsoft.com/en-us/library/dd997372.aspx
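A minimal sketch of sharing a buffer between the two processes through a named memory-mapped file; the mapping name, size, and simple length-prefix layout are all assumptions. Real code would also need cross-process synchronization (for example a named Mutex) around reads and writes.

    using System;
    using System.IO.MemoryMappedFiles;
    using System.Text;

    class SharedMoveBuffer : IDisposable
    {
        // Keep this handle open for the lifetime of the game process: a named,
        // non-persisted mapping disappears once the last handle to it is closed.
        private readonly MemoryMappedFile _mmf =
            MemoryMappedFile.CreateOrOpen("GameMoves", 1024 * 1024);

        public void Write(string payload)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(payload);
            using (var accessor = _mmf.CreateViewAccessor())
            {
                accessor.Write(0, bytes.Length);                // 4-byte length prefix
                accessor.WriteArray(4, bytes, 0, bytes.Length); // payload
            }
        }

        // The reader application opens the same mapping by name:
        public static string ReadFromOtherProcess()
        {
            using (var mmf = MemoryMappedFile.OpenExisting("GameMoves"))
            using (var accessor = mmf.CreateViewAccessor())
            {
                int length = accessor.ReadInt32(0);
                var bytes = new byte[length];
                accessor.ReadArray(4, bytes, 0, length);
                return Encoding.UTF8.GetString(bytes);
            }
        }

        public void Dispose()
        {
            _mmf.Dispose();
        }
    }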
You can use a NoSQL DB of some kind (there are many out there: Redis, MongoDB, RavenDB); all of them are based on key-value access, and you should test their performance. Even better, some of these databases are persistent and can be used with multiple servers.
Hope this helps.
Using memcache would increase your performance
I have a database that contains more than 100 million records. I am running a query that returns more than 10 million records. This process takes too much time, so I need to shorten it. I want to save the resulting record list as a CSV file. How can I do this as quickly and optimally as possible? Looking forward to your suggestions. Thanks.
I'm assuming that your query is already constrained to the rows/columns you need, and makes good use of indexing.
At that scale, the only critical thing is that you don't try to load it all into memory at once; so forget about things like DataTable, and most full-fat ORMs (which typically try to associate rows with an identity-manager and/or change-manager). You would have to use either the raw IDataReader (from DbCommand.ExecuteReader), or any API that builds a non-buffered iterator on top of that (there are several; I'm biased towards dapper). For the purposes of writing CSV, the raw data-reader is probably fine.
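A minimal sketch of streaming a result set straight to a CSV file through a data reader, with quoting/escaping omitted for brevity:

    using System;
    using System.Data.SqlClient;
    using System.Globalization;
    using System.IO;

    class CsvExporter
    {
        // Stream the result set row by row through the reader and straight into
        // the file, so memory use stays flat no matter how many rows come back.
        public static void Export(string connectionString, string query, string csvPath)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(query, connection) { CommandTimeout = 0 })
            using (var writer = new StreamWriter(csvPath))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        var fields = new string[reader.FieldCount];
                        for (int i = 0; i < reader.FieldCount; i++)
                        {
                            fields[i] = Convert.ToString(reader.GetValue(i), CultureInfo.InvariantCulture);
                        }
                        writer.WriteLine(string.Join(",", fields)); // quoting/escaping omitted here
                    }
                }
            }
        }
    }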
Beyond that: you can't make it go much faster, since you are bandwidth constrained. The only way you can get it faster is to create the CSV file at the database server, so that there is no network overhead.
Chances are pretty slim you need to do this in C#. This is the domain of bulk data loading/exporting (commonly used in Data Warehousing scenarios).
Many (free) tools (I imagine even Toad by Quest Software) will do this more robustly and more efficiently than anything you could write yourself, on any platform.
I have a hunch that you don't actually need this for an end-user (the simple observation is that the department secretary doesn't actually need to mail out copies of that; it is too large to be useful in that way).
I suggest using the right tool for the job. And whatever you do,
do not roll your own data type conversions
use CSV with quoted literals and think about escaping the double quotes inside them (see the small helper sketched after this list)
think of regional options (IOW: always use InvariantCulture for export/import!)
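A small escaping helper along those lines might look like this:

    using System;
    using System.Globalization;

    static class CsvField
    {
        // Format one value for CSV: culture-invariant conversion, quoted literal,
        // embedded double quotes doubled per the usual CSV convention.
        public static string Escape(object value)
        {
            string text = Convert.ToString(value, CultureInfo.InvariantCulture);
            return "\"" + text.Replace("\"", "\"\"") + "\"";
        }
    }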
"This process takes too much time so i need to shorten this time. "
This process consists of three sub-processes:
Retrieving > 10m records
Writing records to file
Transferring records across the network (my presumption is you are working with a local client against a remote database)
Any or all of those issues could be a bottleneck. So, if you want to reduce the total elapsed time you need to figure out where the time is spent. You will probably need to instrument your C# code to get the metrics.
If it turns out the query is the problem, then you will need to tune it. Indexes won't help here as you're retrieving a large chunk of the table (> 10%), so improving the performance of a full table scan is what will help, for instance increasing memory to avoid disk sorts. Parallel query could be useful (if you have Enterprise Edition and sufficient CPUs). Also check that the problem isn't a hardware issue (spindle contention, dodgy interconnects, etc.).
Can writing to a file be the problem? Perhaps your disk is slow for some reason (e.g. fragmentation) or perhaps you're contending with other processes writing to the same directory.
Transferring large amounts of data across a network is obviously a potential bottleneck. Are you certain you're only sending relevant data to the client?
An alternative architecture: use PL/SQL to write the records to a file on the dataserver, using bulk collect to retrieve manageable batches of records, and then transfer the file to where you need it at the end, via FTP, perhaps compressing it first.
The real question is why you need to read so many rows from the database (and such a large proportion of the underlying dataset). There are lots of approaches which should make this scenario avoidable, obvious ones being synchronous processing, message queueing and pre-consolidation.
Leaving that aside for now...if you're consolidating the data or sifting it, then implementing the bulk of the logic in PL/SQL saves having to haul the data across the network (even if it's just to localhost, there's still a big overhead). Again if you just want to dump it out into a flat file, implementing this in C# isn't doing you any favours.