SSIS Transform Component: Large Scale Data Storage

I'm developing an SSIS Transform component that will need to store the contents of the incoming data stream and then output the data at a later point in time. This could be a large number of records with many fields (of any data type).
For example, this type of storage would be needed if you were developing a 'Sort' component, where you cannot output a single record until all records have been input.
My question is - what is the recommended practice for storing this temporary data? The Microsoft and Codeplex examples I've seen are somewhat trivial in that they use in-memory structures. I would like to avoid this, as this would seem to be a very bad idea when working with large data sets.
Is there a mechanism in the SSIS library to do this? [okay, it looks like there is not]
I'm considering a few options:
1. Store the data on disk in a stream, keeping the record offsets into this stream in memory. During the output phase, I'll use these offsets to locate the desired record.
2. Store the data in an ADO or OLEDB data source of the user's choosing.
Other suggestions?

No - there is no 3rd-party accessible "buffering" mechanism exposed in the API. You're responsible for it yourself, including paging to disk or whatever mechanism you choose to avoid storing all of the rows in memory.
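
The first option from the question (a disk stream plus in-memory record offsets) can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, and it is not an SSIS API: the `SpillFile` class and its method names are hypothetical, and `pickle` stands in for whatever row serialization the component would actually use.

```python
import pickle
import tempfile


class SpillFile:
    """Append rows to a temporary file, keeping only byte offsets in memory."""

    def __init__(self):
        # Binary read/write temp file; removed automatically when closed.
        self._file = tempfile.TemporaryFile()
        self._offsets = []

    def append(self, row):
        self._file.seek(0, 2)  # always write at the end of the file
        self._offsets.append(self._file.tell())
        pickle.dump(row, self._file)

    def read(self, index):
        # Jump straight to the stored offset and decode one record.
        self._file.seek(self._offsets[index])
        return pickle.load(self._file)

    def __len__(self):
        return len(self._offsets)

    def close(self):
        self._file.close()


spill = SpillFile()
for row in [(3, "c"), (1, "a"), (2, "b")]:
    spill.append(row)

# Output phase: records can be visited in any order via their offsets.
reversed_rows = [spill.read(i) for i in reversed(range(len(spill)))]
```

Only one integer per record stays in memory, so a sort-like component could sort keys together with offsets and then read the rows back in output order. In .NET the same idea would map onto a `FileStream` with explicit seeks.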

Related

Loading large amounts of data into a List<MyObject> in .net

I have a C# tool that parses a collection of csv files to construct a List<MyObject>. This collection can be small, limited to 20 files, or can be as large as 10000+ files. MyObject itself has about 20 properties, most of them strings. Each file sometimes creates as few as 4 items in the list and sometimes as many as 300.
After the parsing is done I first save the list to a csv file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and then there are multiple pivots to the dataset the user can choose. The data is presented in WPF, and the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally the user can save all of this information to another csv file.
I ran into OOM when the files got large and have optimized some of my code. First I realized I was storing one parameter, i.e. the path to the csv file, which was sometimes close to 255 characters. I changed it to only save the filename and things improved slightly. I then discovered a suggestion to compile to x64 that would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously still hit OOMs as more and more files are added to this data set.
Some of the options I've considered are:
1. When parsing the files, save to the intermediate.csv file after each file is parsed instead of keeping the list in memory. This avoids hitting an OOM before I even get to save the intermediate.csv file. The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is all done.
2. Some of the properties on MyObject are similar for a collection of files. So I've considered refactoring the single object into multiple objects, which should reduce the number of items in the List object. Essentially refactoring to List<MyTopLevelDetailsObject>, with MyTopLevelDetailsObject containing a List. The memory footprint should, in theory, go down. I can then output this to csv by doing some translation to make it appear like a single object.
3. Move the data into a database like MongoDB and push the summarization logic into the database.
4. Use DataTables instead.
Options 2 and 3 will be a significant redesign, with 3 also needing me to learn MongoDB. :)
I'm looking for some guidance and helpful tips on how large data sets are typically handled.
Regards,
LW

If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well-vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which, despite the name, can handle large amounts of data and doesn't require a local server.
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
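
As a rough illustration of the SQLite suggestion, the sketch below stores parsed items in a table and lets the database compute a summary instead of holding every object in a list. It is written in Python with the standard `sqlite3` module; the table name, column names, and sample rows are made up for the example and are not the asker's actual schema.

```python
import sqlite3

# ":memory:" keeps the demo self-contained; pass a file path to store on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (source_file TEXT, category TEXT, value REAL)"
)

# Hypothetical parsed rows; in practice these would be inserted file by file.
rows = [
    ("a.csv", "x", 1.0),
    ("a.csv", "y", 2.5),
    ("b.csv", "x", 4.0),
]
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
conn.commit()

# The pivot/summary is computed by the database, not in application memory.
summary = conn.execute(
    "SELECT category, COUNT(*), SUM(value) FROM items "
    "GROUP BY category ORDER BY category"
).fetchall()
```

Because rows are inserted as each file is parsed, the application never needs the full data set in memory at once.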

Options two and four probably won't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider loading data dynamically. The user probably can't see all of the data at once, so you can load part of the .csv into memory and show it to the user; if the user makes annotations or edits, save that chunk of data to a separate file. As the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine the original file with your small saved chunks.
This is common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create WinForms software to operate on a huge database (tables with more than 10 million rows, which can't fit in the memory of an average office PC).
And yes, it is too much work to do this with .csv manually. It's easier to use a database to handle loading, saving of edited parts, and composition of the final output.
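
A minimal sketch of chunked loading, assuming a csv with a header row: the hypothetical `read_chunks` helper below yields a fixed number of rows at a time, so only one chunk is ever held in memory. It is shown in Python for brevity, and the sample data is invented.

```python
import csv
import io
from itertools import islice


def read_chunks(text_stream, chunk_size):
    """Yield lists of at most chunk_size rows from a csv text stream."""
    reader = csv.DictReader(text_stream)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large file opened with open(path, newline="").
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
chunk_sizes = [len(chunk) for chunk in read_chunks(sample, 2)]
```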

How do I use a GTFS feed?

I want to use a GTFS feed in Google Maps, but I don't know how to. I want to display the buses available from a route. Just so you know, I'm planning on implementing the Google Map I make in a Visual C# application.

This is a very general question, so my answer will necessarily be general as well. If you can provide more detail about what you're trying to accomplish I'll try to offer more specific help.
At a high level, the steps for working with a GTFS feed are:
Parse the data. From the GTFS feed's URL you'll obtain a ZIP file containing a set of CSV files. The format of these files is specified in Google's GTFS reference, and most languages already have a CSV-parsing library available that can be used to read in the data. Additionally, for some languages there are GTFS-parsing libraries available that will return data from these files as objects; it looks like there's one available for C#, gtfsengine, that you might want to check out.
Load the data. You'll need to store the data somewhere, at least temporarily, to be able to work with it. This could simply be a data structure in memory (particularly if you've written your own parsing code) but since larger feeds can take some time to read you'll probably want to look at using a relational database or some other kind of storage you can serialize to disk. In the application I'm developing, a separate process parses and loads GTFS data into a relational database in one pass.
Query the data. Obviously how you do this will depend on the method you use for storing the data and the purpose of your application. If you're using a relational database, you will typically have one table per GTFS entity (or CSV file) on which you can construct indices and against which you can execute SQL queries. If you're working with objects in memory, you might construct a hash-table index in memory as well and query that to find the data you need.
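
The three steps above can be sketched end to end as follows, in Python for brevity. The field names `route_id`, `route_short_name`, and `route_type` (where 3 means bus) come from the GTFS reference for `routes.txt`; the `load_routes` helper, the table layout, and the tiny sample feed are assumptions made for this illustration.

```python
import csv
import io
import sqlite3
import zipfile


def load_routes(feed_bytes, conn):
    """Parse routes.txt from a GTFS ZIP and load it into a SQLite table."""
    with zipfile.ZipFile(io.BytesIO(feed_bytes)) as feed:
        with feed.open("routes.txt") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            rows = [
                (r["route_id"], r["route_short_name"], r["route_type"])
                for r in csv.DictReader(text)
            ]
    conn.execute(
        "CREATE TABLE routes "
        "(route_id TEXT PRIMARY KEY, short_name TEXT, route_type INTEGER)"
    )
    conn.executemany("INSERT INTO routes VALUES (?, ?, ?)", rows)


# Build a tiny sample feed in memory instead of downloading a real one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "routes.txt",
        "route_id,route_short_name,route_type\nR1,10,3\nR2,Blue,1\n",
    )

conn = sqlite3.connect(":memory:")
load_routes(buffer.getvalue(), conn)

# Query step: find the short names of all bus routes (route_type 3).
bus_routes = conn.execute(
    "SELECT short_name FROM routes WHERE route_type = 3"
).fetchall()
```

A real application would load the other GTFS files (trips, stops, stop times) into their own tables the same way and join across them.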

Caching huge data into the disk

I have a requirement where a huge amount of data needs to be cached on the disk.
Whenever there is a change in the database, the data is retrieved from the database and cached on the disk. I will have a background process that keeps checking my cached data against the database and updates it as and when required.
I would like to know what would be the best way to organize the cached data on my disk, so that writing and reading from the cache can be faster.
Another thread would be used to fetch some new data from the db and cache it on the disk. I also need to take care of synchronization between the two threads (one will be updating the existing cache data, and the other will be writing newly fetched data into the cache).
Please suggest a strategy for organizing the data on the cache and also synchronization between the threads.

SQL Server has something called XML tables. Those tables are based on physical XML files located on disk. You can map/link XML data on disk to a table in SQL Server. For users it is seamless; in other words, they see those tables as regular tables.
Besides technical/philosophical discussion about caching huge data on the disk, this is just an idea...

Do you care about the consistency of the data, for example on power failure?
Memory-mapped files along with occasional flushes will probably get you what you want.
Do you need to have indexed access to the data?
If so, you probably need something like a B-tree or B+tree implementation, which gives efficient retrieval of the indexed data and better block-level locking.
http://code.google.com/p/high-concurrency-btree/
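
A minimal sketch of the memory-mapped-file idea, assuming fixed-size records so that a record's position is just its index times the record size. It uses Python's standard `mmap` module for brevity (.NET 4.0 and later provide `System.IO.MemoryMappedFiles` for the same purpose); the record size, file name, and payload are made up.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 16   # assumed fixed record length in bytes
RECORD_COUNT = 4

path = os.path.join(tempfile.mkdtemp(), "cache.bin")

# Pre-size the file; a mapping cannot extend past the end of the file.
with open(path, "wb") as f:
    f.write(b"\x00" * RECORD_SIZE * RECORD_COUNT)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:
        start = 2 * RECORD_SIZE
        # Writing to the mapping looks like writing to a byte array.
        view[start:start + RECORD_SIZE] = b"hello".ljust(RECORD_SIZE, b"\x00")
        view.flush()  # occasional flushes push dirty pages to disk
        record = bytes(view[start:start + RECORD_SIZE]).rstrip(b"\x00")

# Reading the file normally confirms the record reached the disk.
with open(path, "rb") as f:
    f.seek(2 * RECORD_SIZE)
    on_disk = f.read(RECORD_SIZE).rstrip(b"\x00")
```

How often to flush is the trade-off raised above: fewer flushes are faster, but more data is at risk on power failure.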

As an alternative answer, my own B+Tree implementation will neatly address this as a completely managed code (C#) implementation of an IDictionary<TKey, TValue>. It is a single-file key/value store that is thread-safe and optimized for concurrency. It was built from the ground up expressly for this purpose and for providing write-through caches.
Discussion - http://csharptest.net/projects/bplustree/
Online Help – http://help.csharptest.net/
Source Code – http://code.google.com/p/csharptest-net/
Downloads – http://code.google.com/p/csharptest-net/downloads
NuGet Package – http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree

Best way to save/load large amout of data in a .Net application?

What is the best way to save large amount of data for a .Net 4.0 application?
Right now I am using Lists and serializing to a file in the "User Data" folder, and it's working OK, but I want to know if there is a better/faster way of saving/loading a large amount of data.
The data that I am saving contains only lots of words, like documents.
The size of the data is almost 1 MB.

That really depends on the type of your application. I wouldn't use a SQL database of any sort just to load and save data that I do not need to query or transform. The time it takes to map your object graph to a relational model is just not worth it.
Also, I don't believe it will ever be faster than simple serialization, due to the overhead associated with databases (connection management and mapping).
My recent experience was with BinaryFormatter, which had excellent results (files ~ 15 MB). If worst comes to worst, you can always write your own formatter.

Kinda depends on your data and how you have it stored in your app.
But any of the NoSQL storage systems is a possibility, or you could just write plain binary data to a file.

When you say "large amout [sic] of data", what exactly do you mean by that? A megabyte? a terabyte?
And what exactly is the data?
If it's a set of account records, it might well belong in a database of some sort; if it's a set of images or word processing documents, perhaps not.

If you want fast access, one approach would be to serialize to a hashtable and cache it in between reads and writes.
The problems here are, of course, versioning, changes of namespaces (after which you won't be able to deserialize easily), deadlocks, concurrency, etc.
It is better to save the file as XML/JSON and, when you read it into memory, load it into a hashtable for fast access.
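
A rough sketch of that last suggestion, in Python for brevity: the data is saved as JSON and loaded back into a dictionary (the hashtable) for fast lookup by key. The `version` field is an assumption added here to show one simple way of coping with format changes; the keys and file name are invented.

```python
import json
import os
import tempfile

# Hypothetical payload: a format version plus documents keyed by name.
payload = {
    "version": 1,
    "documents": {"doc1": "some words", "doc2": "more words"},
}

path = os.path.join(tempfile.mkdtemp(), "userdata.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(payload, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Check the version before trusting the layout of the rest of the file.
if loaded["version"] != 1:
    raise ValueError("unsupported file version")

cache = loaded["documents"]  # plain dict: constant-time lookup on average
text = cache["doc2"]
```

Unlike a binary object dump, the text format does not depend on class or namespace names, which sidesteps the deserialization problem mentioned above.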

Architecture: set-based data pipeline challenge

I'm working on a database driven web-application (ASP.NET, SQL 2008), which receives structured XML data from various sources. The data resembles a set, and often needs 'cleanup', so it is passed through the database as XML, and turned into a resultset for display.
I'd like to capture the produced 'clean' results, and send them to an archive database to persist them to disk.
The options I've considered so far are:
Serialize the entire 'clean' result set into an object (XML/.NET serialized), and send this back to the archive database
PRO: Easily repeatable - can profile/capture the database calls on the archive machine, and re-run them to identify any problems
CON: Versioning could be tricky
Store the cleaned results in a table, and periodically copy fresh records in this table to the archive machine
PRO: Easy build - quick scheduled job
CON: Harder to repro calls on the archive machine; would need to keep input table contents around
Are there any other options, and has anyone had any experience with similar situations?

I have used both approaches successfully, and what I do depends on the system.
Saving Raw Xml:
I tend to save raw XML when I am either dealing with unstructured data or dealing with a messaging system where we want to track the messages. For example, an application I worked on collected messages from deployed Windows clients; we would dump the messages into a relational structure and then roll them into a warehouse. When I took over the project, we started storing the raw XML that was coming in because it gave us replayability and the ability to see exactly what was coming into the system.
Relational Data
If I am going to need to do any reporting or aggregation of the data, I would break the data out and store it in regular tables. I know you can query XML data in a database, but I try to avoid that. I might still save the original raw messages for replayability and troubleshooting.
Saving a Binary Object
The last thing I have done is save an entire serialized binary object. I find this handy when the object graph is quite complex and the relationships between the objects are important. It does have a huge downside, which is versioning; however, I have managed the versioning quite successfully, even with namespace changes, object hierarchy changes, etc. If you need to access the data in SQL, this is not the way to go.
