SQL performance degrading while loading 60,000 XML files - SSIS - XML Source - C#

I have an SSIS data flow task which loads XML data into a SQL Server database - more than 60,000 XML files. The first few thousand XML files load into the table quickly, but as time progresses the loading speed drops drastically.
The first 10k files load in roughly 10 minutes, the next 10k take 25 minutes, and the performance keeps degrading from there. By the time all 60k+ files are loaded, it has taken around 4 hours.
Is there any way to keep the load rate steady so the later files load at the same speed as the initial ones?
I have also tried bulk copy in C#, but the issue exists there as well. Is there any workaround to improve the performance?

Parts of your code would make it easier for us to give you tips and ideas!
I believe this issue is memory related. Are you reading all of the files into memory before putting them into the SQL database?
Check Task Manager! If the memory usage keeps growing and growing, you have a potential memory problem.
I don't know how the files are stored or named, but if you can, why not work with around 1-5,000 files at a time, load them, and then take the next batch?

Try doing it with multiple DFTs instead of a single DFT, limiting each one to around 5k-10k files. That should hopefully bring the time frame down.
Also, the difference in time might be due to the indexing on the table. Remove the indexes, load the records, and reapply the indexes once the loading is done. Querying record sets on an indexed table is fast, but performing inserts on an indexed table - 60k+ of them - is a time-consuming process.

1. Execute SQL Task (drop the indexes before loading)
2. For Loop (multiple data flows for the XML file load)
3. Execute SQL Task (recreate the indexes)
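Since you mentioned trying bulk copy in C# as well, here is a minimal sketch of that same drop/load/recreate pattern on the SqlBulkCopy side. The table dbo.XmlData, the index IX_XmlData_Key, and the LoadXmlFilesInBatches helper are placeholders, not anything from your actual project:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class BatchLoader
{
    // Hypothetical helper: parse your XML files into DataTables of ~5,000 rows each.
    static IEnumerable<DataTable> LoadXmlFilesInBatches(int batchSize)
    {
        yield break; // replace with your own XML parsing
    }

    static void Main()
    {
        const string connStr = "Server=.;Database=Staging;Integrated Security=true";
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            // 1. Drop the index before loading (index/table names are placeholders).
            new SqlCommand("DROP INDEX IX_XmlData_Key ON dbo.XmlData", conn).ExecuteNonQuery();

            // 2. Load in batches so each SqlBulkCopy call commits a bounded chunk.
            foreach (DataTable batch in LoadXmlFilesInBatches(5000))
            {
                using (var bulk = new SqlBulkCopy(conn))
                {
                    bulk.DestinationTableName = "dbo.XmlData";
                    bulk.BatchSize = 5000;    // commit in chunks, not one huge transaction
                    bulk.BulkCopyTimeout = 0; // no timeout for long loads
                    bulk.WriteToServer(batch);
                }
            }

            // 3. Recreate the index once the load is finished.
            new SqlCommand("CREATE INDEX IX_XmlData_Key ON dbo.XmlData (FileKey)", conn).ExecuteNonQuery();
        }
    }
}
```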

Related

SQL filtering with SELECT vs. lambda filtering in memory

I have a C# program for a church service where you can select a song from a list and put it on a projector or screen. The list is currently saved in .txt files (I made it several years ago) and I currently have about 270 different songs.
In the program you can filter and search by the title of the song and by its content. This is very expensive: every time I search for some text, the program checks each of the 270 .txt files, opening, reading, and closing every one of them on every search.
Now I want to switch to a database (SQLite) to store and search the songs and their verses.
My question is this: if I run a SELECT every time I want to look up, filter, or search for a song, which is better from a performance point of view? Is it better to load everything into memory and use lambda functions for the searches and filters, or to do every operation through SQL?
Thanks!
PS: Some numbers:
270 songs: about 1 MB of disk space in total.
5-6 strophes per song: ~1,620 strophes.
4-5 verses per strophe: ~8,100 verses.
Good news: you do not have to care. Any database written in the last 60 years or so will keep whatever it can in memory anyway, as long as there is enough memory. So unless you unload SQLite, or set unreasonably tight memory limits, the database ends up in memory anyway. Your (implied) assumption that every query hits storage would be a colossal failure to use the database properly. Particularly given how tiny the data is (1 MB in total) - we are not talking about a multi-gigabyte in-memory database that might strain RAM; this data set is small enough to fit in the CPU cache.
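For illustration, a minimal sketch of what the SQL route could look like with a parameterized LIKE search, assuming Microsoft.Data.Sqlite and a hypothetical songs(title, body) table:

```csharp
using System;
using Microsoft.Data.Sqlite;

class SongSearch
{
    static void Main(string[] args)
    {
        string term = args.Length > 0 ? args[0] : "grace";

        using (var conn = new SqliteConnection("Data Source=songs.db"))
        {
            conn.Open();

            var cmd = conn.CreateCommand();
            // Parameterized LIKE search over title and body; the schema is hypothetical.
            cmd.CommandText = "SELECT title FROM songs WHERE title LIKE $q OR body LIKE $q";
            cmd.Parameters.AddWithValue("$q", "%" + term + "%");

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}
```

At roughly 1 MB of data, both this and an in-memory LINQ filter over a loaded list will return effectively instantly, which is the point above.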

OutOfMemoryException paging 150K rows using EF

I'm paging 150k rows in pages of 10,000: I fetch 10k rows from the database, iterate over them, then fetch the next 10k, and so on until there are no more rows. The problem is that as I fetch more rows, I can see the memory graph climbing in the Performance tab of Task Manager, and each iteration that fetches the next 10k rows takes longer and longer, until it throws an OutOfMemoryException.
The query is a join of 6 tables. I load the results into a list using EF 4.
At the end of each iteration I clear the list, set it to null, and call GC.Collect(), but this has no effect.
What can I do to free the memory of the rows I have already processed?
I faced a very similar issue when attempting to return large datasets from a database.
Executing the query from a BackgroundWorker removed the load from the UI thread, which reduced the complexity of the task and eliminated this issue.
Before using this technique I watched the application go from a few hundred MB to top out at about 1.2 GB before throwing the OutOfMemoryException. After implementing the BackgroundWorker, the increase was only about 10-15 MB, and memory dropped back down once the query had completed. The query also seemed to execute slightly faster.
Yes, I realise this doesn't sound like it should work, but it actually did, and I would recommend it.
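For reference, a bare-bones sketch of that pattern; LoadPage is a hypothetical stand-in for the real EF query, and in a WinForms app the RunWorkerCompleted handler is where you would bind the result to the UI:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Threading;

class Program
{
    // Hypothetical stand-in for the expensive 6-table EF query.
    static List<string> LoadPage(int page)
    {
        Thread.Sleep(500); // simulate a slow query
        return new List<string> { $"row {page * 10000}", $"row {page * 10000 + 1}" };
    }

    static void Main()
    {
        var worker = new BackgroundWorker();

        worker.DoWork += (s, e) =>
        {
            // Runs on a thread-pool thread, so the UI (or main) thread stays responsive.
            e.Result = LoadPage((int)e.Argument);
        };

        worker.RunWorkerCompleted += (s, e) =>
        {
            // In a UI app this fires back on the calling context; bind e.Result here.
            Console.WriteLine(((List<string>)e.Result).Count + " rows loaded");
        };

        worker.RunWorkerAsync(0);   // fetch page 0 without blocking the caller
        Console.ReadLine();         // keep the process alive for the demo
    }
}
```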
Additionally, if you are running on a 64-bit OS (and targeting the 64-bit architecture) you can raise the memory limit from about 1.2 GB to 4 GB. This option can be found in the Project Properties under the Build tab...
Depending on the DB, you may be able to shift some of the work off to it by creating a view and querying that to gather your result set. (Update)
I'm pretty sure the problem is caused by holding on to the same context instance while fetching all the pages. That way, even though you no longer need the previously fetched entities, the context still keeps track of all of them.
You should create a new DbContext instance before fetching each new batch of records.
You could also try using the AsNoTracking() method.
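Roughly, the idea looks like this; MyContext, Order, and the page size are placeholders, and AsNoTracking() assumes the EF 4.1+ DbContext API:

```csharp
using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

// Placeholders standing in for the real model; only the paging pattern matters here.
public class Order { public int Id { get; set; } }
public class MyContext : DbContext { public DbSet<Order> Orders { get; set; } }

class Pager
{
    const int PageSize = 10000;

    static void Main()
    {
        for (int page = 0; ; page++)
        {
            List<Order> batch;

            // A fresh context per batch: nothing from earlier pages stays tracked in memory.
            using (var ctx = new MyContext())
            {
                batch = ctx.Orders
                           .AsNoTracking()      // don't track these entities at all
                           .OrderBy(o => o.Id)  // paging needs a stable order
                           .Skip(page * PageSize)
                           .Take(PageSize)
                           .ToList();
            }

            if (batch.Count == 0)
                break;

            Console.WriteLine("processed {0} rows of page {1}", batch.Count, page);
        }
    }
}
```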

Loading multiple large ADO.NET DataTables/DataReaders - Performance improvements

I need to load the results of multiple SQL statements from SQL Server into DataTables. Most of the statements return some 10,000 to 100,000 records, and each takes up to a few seconds to load.
My guess is that this is simply due to the amount of data that needs to be shoved around. The statements themselves don't take much time to process.
So I tried using Parallel.For() to load the data in parallel, hoping that the overall processing time would decrease. I do get a 10% performance increase, but that is not enough. A reason might be that my machine is only a dual core, which limits the benefit here. The server the program will be deployed on has 16 cores, though.
My question is: how could I improve the performance further? Would Asynchronous Data Service Queries (BeginExecute, etc.) be a better solution than PLINQ? Or maybe some other approach?
The SQL Server is running on the same machine. This is also the case on the deployment server.
EDIT:
I've run some tests using a DataReader instead of a DataTable. This already decreased the load times by about 50%. Great! Still, I am wondering whether parallel processing with BeginExecute would improve the overall load time on a multiprocessor machine. Does anybody have experience with this? Thanks for any help!
UPDATE:
I found that about half of the loading time was consumed by processing the SQL statement. In SQL Server Management Studio the statements took only a fraction of the time, but somehow they take much longer through ADO.NET. So by using DataReaders instead of loading DataTables, and by adapting the SQL statements, I've come down to about 25% of the initial loading time. Loading the DataReaders in parallel threads with Parallel.For() does not bring an improvement here. So for now I am happy with the result and will leave it at that. Maybe when we update to .NET 4.5 I'll give asynchronous DataReader loading a try.
My guess is that this is simply due to the amount of data that needs to be shoved around.
No, it is due to using a SLOW framework. I am pulling nearly a million rows into a dictionary in less than 5 seconds in one of my apps. DataTables are SLOW.
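For comparison, that kind of load is just a plain SqlDataReader loop straight into a Dictionary; the connection string, table, and columns below are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class ReaderDemo
{
    static void Main()
    {
        // Placeholders: point this at your own server, table, and key/value columns.
        const string connStr = "Server=localhost;Database=MyDb;Integrated Security=true";
        var rows = new Dictionary<int, string>();

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand("SELECT Id, Name FROM dbo.Customers", conn))
        {
            conn.Open();

            // Stream the rows straight into the dictionary; no DataTable in between.
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    rows[reader.GetInt32(0)] = reader.GetString(1);
            }
        }

        Console.WriteLine("{0} rows loaded", rows.Count);
    }
}
```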
You have to change the nature of the problem. Let's be honest, who needs to view 10,000 to 100,000 records per request? I think no one.
You should consider paging, and in your case paging should be done on the SQL Server side. To make this clear, let's say you have a stored procedure named "GetRecords". Modify it to accept a page parameter and return only the data relevant for that specific page (say, 100 records) along with the total page count. Inside the app, just show those 100 records (they will fly) and keep track of the selected page index.
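A sketch of calling such a paged procedure from C#; the parameter names @PageIndex, @PageSize, and @TotalPages are assumptions, not an existing contract:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class PagedFetch
{
    // Sketch of calling a paged stored procedure like the "GetRecords" described above.
    static DataTable GetPage(string connStr, int pageIndex, int pageSize, out int totalPages)
    {
        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand("dbo.GetRecords", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@PageIndex", pageIndex);
            cmd.Parameters.AddWithValue("@PageSize", pageSize);

            var total = cmd.Parameters.Add("@TotalPages", SqlDbType.Int);
            total.Direction = ParameterDirection.Output;

            var page = new DataTable();
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                page.Load(reader);               // only ~100 rows come across the wire

            totalPages = (int)total.Value;       // output parameter is set once the reader closes
            return page;
        }
    }

    static void Main()
    {
        int totalPages;
        var page = GetPage("Server=.;Database=MyDb;Integrated Security=true", 0, 100, out totalPages);
        Console.WriteLine("{0} rows, {1} pages total", page.Rows.Count, totalPages);
    }
}
```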
Hope this helps, best regards!
Do you often have to load these requests? If so, why not use a distributed cache?

Improving Game Performance C#

We are all aware of the popular trend of MMO games, where players face each other live. During gameplay, however, there is a tremendous flow of SQL inserts and queries, as outlined below:
There are on average/at minimum 100 tournaments online per 12 minutes, or 500 players/hour
In the Game Progress table, we store each player move
A 12-round tournament of 4 players produces 48 records
plus around the same number again for spells or special items
for a total of 96 per tournament, or 48,000 record inserts per hour (at 500 players/hour)
In response to my previous question (Improve MMO game performance), I changed the schema and we are no longer writing directly to the database.
Instead, we accumulate all values in a DataTable. Whenever the DataTable has more than 100k rows (which can sometimes happen within the hour), the process writes it out to a text file in CSV format. Another background application frequently scans the folder for CSV files, reads any available CSV file, and stores the information into the server database.
Questions
Can we access the DataTable in the game application directly from another application (which would read the DataTable and clear the records it has read)? That way, instead of writing to and reading from disk, we would read and write directly from memory.
Is there any structure quicker than a DataTable that can hold large amounts of data and still be fairly quick for sorting and update operations? We have to frequently scan for user IDs and update the game status (almost on every insert). It could be a cache utility, a fast scan/search algorithm, or even a collection type. Right now we use a foreach loop to go through all the records in the DataTable and update the row if the user is present; if not, we create a new row. I tried using SortedList and classes, but it not only doubles the effort, the memory usage also increases tremendously, slowing down overall game performance.
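As a rough illustration of what an indexed lookup buys over the foreach scan, here is a plain Dictionary keyed on the user ID; GameStatus and its fields are made up for the example:

```csharp
using System;
using System.Collections.Generic;

// Rough illustration only: GameStatus and its fields are made-up placeholders.
class GameStatus
{
    public string UserId;
    public int Round;
    public int Score;
}

class StatusStore
{
    // Keyed on user ID, so "is this user present?" is a hash lookup,
    // not a foreach over every DataTable row.
    private readonly Dictionary<string, GameStatus> _byUser =
        new Dictionary<string, GameStatus>();

    public void Upsert(string userId, int round, int score)
    {
        GameStatus status;
        if (!_byUser.TryGetValue(userId, out status))
        {
            status = new GameStatus { UserId = userId };
            _byUser.Add(userId, status);          // "create a new row"
        }
        status.Round = round;                      // "update the row"
        status.Score = score;
    }

    public static void Main()
    {
        var store = new StatusStore();
        store.Upsert("player42", 3, 1200);
        store.Upsert("player42", 4, 1350);         // second call just updates in place
        Console.WriteLine("tracked users: " + store._byUser.Count);
    }
}
```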
thanks
arvind
Well, let's answer:
You can share objects between applications using Remoting, but it's much slower and makes the code less readable. You have another option that keeps everything in memory, though: MemoryMappedFiles, so all the work actually goes through memory rather than the disk: http://msdn.microsoft.com/en-us/library/dd997372.aspx
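A minimal sketch of the named memory-mapped file idea; the map name, the 1 MB capacity, and the length-prefixed payload are arbitrary choices for the example, with the game process writing and the importer opening the same map by name:

```csharp
using System;
using System.IO.MemoryMappedFiles;
using System.Text;

class MmfDemo
{
    const string MapName = "GameMovesMap";   // arbitrary name shared by both processes

    // Game process: publish a chunk of pending moves into the shared map.
    static void WriteMoves(MemoryMappedFile mmf, string csvChunk)
    {
        byte[] payload = Encoding.UTF8.GetBytes(csvChunk);
        using (var view = mmf.CreateViewAccessor())
        {
            view.Write(0, payload.Length);                    // length prefix at offset 0
            view.WriteArray(4, payload, 0, payload.Length);   // data right after it
        }
    }

    // Importer process: open the same map by name and read the chunk back.
    static string ReadMoves()
    {
        using (var mmf = MemoryMappedFile.OpenExisting(MapName))
        using (var view = mmf.CreateViewAccessor())
        {
            int length = view.ReadInt32(0);
            var payload = new byte[length];
            view.ReadArray(4, payload, 0, length);
            return Encoding.UTF8.GetString(payload);
        }
    }

    static void Main()
    {
        // Both roles shown in one process only to keep the sketch self-contained.
        using (var mmf = MemoryMappedFile.CreateNew(MapName, 1024 * 1024))
        {
            WriteMoves(mmf, "player42,round3,spell:fireball");
            Console.WriteLine(ReadMoves());
        }
    }
}
```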
You can use a NoSQL DB of some kind (there are many out there: Redis, MongoDB, RavenDB). All of them are based on key-value access, and you should test their performance. Even better, some of these databases are persistent and can be used across multiple servers.
Hope this helps.
Using memcached would improve your performance.

SQLBulkCopy or Bulk Insert

I have about 6,500 files totalling roughly 17 GB of data, and this is the first time I've had to move what I would call a large amount of data. The data is on a network drive, but the individual files are relatively small (7 MB max).
I'm writing a program in C#, and I was wondering whether I would notice a significant difference in performance if I used BULK INSERT instead of SqlBulkCopy. The table on the server also has an extra column, so if I use BULK INSERT I'll have to use a format file and then run an UPDATE for each row.
I'm new to these forums, so if there is a better way to ask this question, feel free to mention that as well.
From testing, BULK INSERT is much faster. After an hour of running SqlBulkCopy I was maybe a quarter of the way through my data, and in that time I had finished writing the alternative method (and had lunch). By the time I finished writing this post (~3 minutes), BULK INSERT was about a third of the way through.
For anyone looking at this as a reference, it is also worth mentioning that the upload is faster without a primary key.
It should be noted that one major cause of the difference could be that the server was a significantly more powerful machine, so this is not an analysis of the efficiency of the algorithms. I would still recommend BULK INSERT, though, as the average server is probably significantly faster than the average desktop computer.
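For reference, a minimal sketch of kicking off BULK INSERT from C#; the table name, file path, and format file path are placeholders (the format file handles the extra column mentioned in the question):

```csharp
using System;
using System.Data.SqlClient;

class BulkInsertRunner
{
    static void Main()
    {
        // Placeholders: point these at your own server, table, CSV, and format file.
        const string connStr = "Server=myServer;Database=MyDb;Integrated Security=true";
        const string sql = @"
            BULK INSERT dbo.TargetTable
            FROM '\\networkshare\data\file0001.csv'
            WITH (FORMATFILE = '\\networkshare\data\target.fmt', TABLOCK)";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.CommandTimeout = 0;   // large files can exceed the 30-second default
            conn.Open();
            cmd.ExecuteNonQuery();    // the file path is resolved by the SQL Server machine
        }
    }
}
```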
