I have multiple 1.5 GB CSV files which contain billing information on multiple accounts for clients from a service provider. I am trying to split each large CSV file into smaller chunks for processing and formatting the data inside it.
I do not want to roll my own CSV parser, but this is something I haven't seen yet, so please correct me if I am wrong. The 1.5 GB files contain information in the following order: account information, account number, Bill Date, transactions, Ex gst, Inc gst, type, and other lines.
Note that Bill Date here means the date when the invoice was made, so occasionally we have more than two bill dates in the same CSV.
Bills are grouped by: Account Number > Bill Date > Transactions.
Some accounts have 10 lines of transaction details, some have over 300,000 lines. A large 1.5 GB CSV file contains around 8 million lines of data. I used to cut and paste it into smaller chunks with UltraEdit, but that has become a very inefficient and time-consuming process.
I just want to load the large CSV file in my WinForm, click a button, and have it split the file into chunks of, say, no more than 250,000 lines each. Some bills are actually bigger than 250,000 lines; in that case, keep the bill in one piece and do not split the account across multiple files, since they are ordered anyway. Also, I do not want multiple bill dates for an account mixed into one chunk; in that case the splitter can create an additional split.
I already have a WinForms application (VS 2010, C#) that does the formatting of the CSV in the smaller files automatically.
Is it actually possible to process these very large CSV files? I have been trying to load them, but the OutOfMemoryException is an annoyance since it crashes the application every time and I don't know how to fix it. I am open to suggestions.
Here is what I think I should be doing:
Load the large CSV file (this currently fails with an OutOfMemoryException). How do I solve this?
Group data by account name, bill date, and count the number of lines for each group.
Then create an array of integers holding those line counts.
Pass this array of integers to a file-splitter process, which will use it to write out the blocks of data.
Any suggestions will be greatly appreciated.
Thanks.
You can use CsvReader to stream through and parse the data, without needing to load it all into memory in one go.
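As a rough illustration of the streaming approach (shown here with a plain StreamReader rather than any particular CsvReader library), the sketch below reads the big file one line at a time and starts a new chunk file whenever the account / bill date changes and the current chunk has already passed the line limit. The column positions and the simple comma split are assumptions; swap in a real CSV parser if fields can contain quoted commas.

    using System;
    using System.IO;

    class CsvSplitter
    {
        const int MaxLines = 250000;   // soft limit; a bill/account is never split across files
        const int AccountColumn = 1;   // assumed position of the account number
        const int BillDateColumn = 2;  // assumed position of the bill date

        public static void Split(string inputPath, string outputFolder)
        {
            int chunkIndex = 0;
            int linesInChunk = 0;
            string currentKey = null;
            StreamWriter writer = null;

            try
            {
                using (StreamReader reader = new StreamReader(inputPath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Naive split; replace with a proper CSV parser for quoted fields.
                        string[] fields = line.Split(',');
                        string key = fields[AccountColumn] + "|" + fields[BillDateColumn];

                        bool boundary = key != currentKey;   // new account or new bill date
                        if (writer == null || (boundary && linesInChunk >= MaxLines))
                        {
                            if (writer != null) writer.Dispose();
                            string name = string.Format("chunk_{0:D3}.csv", ++chunkIndex);
                            writer = new StreamWriter(Path.Combine(outputFolder, name));
                            linesInChunk = 0;
                        }

                        writer.WriteLine(line);
                        linesInChunk++;
                        currentKey = key;
                    }
                }
            }
            finally
            {
                if (writer != null) writer.Dispose();
            }
        }
    }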
Yeah, about that... running out of memory is going to happen with files that are HUGE. You need to take that constraint seriously.
As with most problems, break everything into steps.
I have had a similar type of situation before (large data file in CSV format, need to process, etc).
What I did:
Make step 1 of your program suite, or whatever you call it, something that merely cuts your huge file into many smaller files. I have broken 5 GB zipped-up, PGP-encrypted files (after decryption... that's another headache) into many smaller pieces. You can do something simple like numbering them sequentially (i.e. 001, 002, 003...).
Then make an app to do the INPUT processing. No real business logic here. I hate FILE I/O with a passion when it comes to business logic, and I love the warm fuzzy feeling of data being in a nice SQL Server DB. That's just me. I created a thread pool and had N threads (like 5; you decide how much your machine can handle) read those .csv part files you created.
Each thread reads one file. One-to-one relationship. Because it is file I/O, make sure you don't have too many running at the same time. Each thread does the same basic operation: reads in data, puts it in a basic structure for the db (table format), does lots of inserts, then ends the thread. I used LINQ to SQL because everything is strongly typed and whatnot, but to each their own. The better the db design, the better for you later when doing the logic.
After all threads have finished executing, you have all the data from the original CSV in the database. Now you can do all your business logic and do whatever from there. Not the prettiest solution, but I was forced into developing that given my situation/data flow/size/requirements. You might go with something completely different. Just sharing I guess.
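A rough sketch of that "N workers load the .csv parts into SQL Server" step, using Parallel.ForEach (available from .NET 4) and SqlBulkCopy rather than LINQ to SQL; the staging table and column layout are placeholders:

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Threading.Tasks;

    class PartLoader
    {
        public static void LoadParts(string partsFolder, string connectionString)
        {
            string[] files = Directory.GetFiles(partsFolder, "*.csv");
            var options = new ParallelOptions { MaxDegreeOfParallelism = 5 }; // limit concurrent file I/O

            Parallel.ForEach(files, options, file =>
            {
                // Build an in-memory table for this part file (placeholder columns).
                var table = new DataTable();
                table.Columns.Add("AccountNumber", typeof(string));
                table.Columns.Add("BillDate", typeof(string));
                table.Columns.Add("RawLine", typeof(string));

                foreach (string line in File.ReadLines(file))
                {
                    string[] f = line.Split(',');      // naive; use a real CSV parser for quoted fields
                    table.Rows.Add(f[1], f[2], line);
                }

                // Bulk insert the whole part in one go.
                using (var bulk = new SqlBulkCopy(connectionString))
                {
                    bulk.DestinationTableName = "dbo.StagingBilling";  // placeholder table name
                    bulk.WriteToServer(table);
                }
            });
        }
    }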
You can use an external sort. I suppose you'd have to do an initial pass through the file to identify proper line boundaries, as CSV records are probably not of a fixed length.
Hopefully, there might be some ready-made external sort implementations for .NET that you could use.
There's a very useful class in the Microsoft.VisualBasic.FileIO namespace that I've used for dealing with CSV files - the TextFieldParser Class.
It might not help with the large file size, but it's built-in and handles quoted and non-quoted fields (even if mixed in the same line). I've used it a couple of times in projects at work.
Despite the assembly name, it can be used with C#, in case you're wondering.
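A minimal usage example (add a reference to the Microsoft.VisualBasic assembly):

    using Microsoft.VisualBasic.FileIO;

    class TextFieldParserExample
    {
        public static void ReadCsv(string path)
        {
            using (var parser = new TextFieldParser(path))
            {
                parser.TextFieldType = FieldType.Delimited;
                parser.SetDelimiters(",");
                parser.HasFieldsEnclosedInQuotes = true;   // handles quoted and non-quoted fields

                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields(); // reads one record at a time
                    // process fields here...
                }
            }
        }
    }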
Related
I have a C# tool that parses a collection of CSV files to construct a List<MyObject>. This collection can be small, limited to 20 files, or can be as large as 10,000+ files. MyObject itself has about 20 properties, most of them strings. Each file can create sometimes up to 4 items in the list and sometimes as many as 300.
After the parsing is done I first save the list to a CSV file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and then there are multiple pivots of the dataset the user can choose. The data is presented in WPF, and the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally the user can save all of this information to another CSV file.
I ran into OOMs when the files got large and have optimized some of my code. First I realized I was storing one parameter, i.e. the path to the CSV file, which was sometimes close to 255 characters. I changed it to only save the filename and things improved slightly. I then discovered a suggestion to compile to x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously still hit OOMs as more and more files are added to this data set.
Some of the options I've considered are:
When parsing the files, save to the intermediate .csv file after each file is parsed and do not keep the list in memory. This would work for me to avoid hitting an OOM before I even get to save the intermediate .csv file.
The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is all done.
Some of the properties on MyObject are similar for a collection of files, so I've considered refactoring the single object into multiple objects, which would possibly reduce the number of items in the List object. Essentially, refactoring List<MyObject> to List<MyTopLevelDetailsObject>, with MyTopLevelDetailsObject containing a List<MyObject>. The memory footprint should, theoretically, be reduced. I can then output this to CSV by doing some translation to make it appear like a single object.
Move the data internally to a database like MongoDB and push the summarization logic down into the database.
Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also requiring me to learn MongoDB. :)
I'm looking for some guidance and helpful tips on how large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which despite the name, can handle large amounts of data and doesn't require a local server.
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
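To give a feel for the SQLite route, here is a minimal sketch using the System.Data.SQLite provider (one option among several). The table layout and the FileName/Value properties standing in for MyObject's real properties are placeholders; wrapping the inserts in a single transaction is what keeps them fast.

    using System.Collections.Generic;
    using System.Data.SQLite;

    // Stand-in for the asker's MyObject; only two placeholder properties shown.
    class MyObject
    {
        public string FileName { get; set; }
        public string Value { get; set; }
    }

    class SqliteStore
    {
        public static void Save(string dbPath, IEnumerable<MyObject> items)
        {
            using (var conn = new SQLiteConnection("Data Source=" + dbPath))
            {
                conn.Open();

                using (var create = new SQLiteCommand(
                    "CREATE TABLE IF NOT EXISTS Items (FileName TEXT, Value TEXT)", conn))
                {
                    create.ExecuteNonQuery();
                }

                using (var tx = conn.BeginTransaction())   // one transaction = fast bulk inserts
                using (var cmd = new SQLiteCommand(
                    "INSERT INTO Items (FileName, Value) VALUES (@file, @value)", conn))
                {
                    cmd.Parameters.Add(new SQLiteParameter("@file"));
                    cmd.Parameters.Add(new SQLiteParameter("@value"));

                    foreach (var item in items)
                    {
                        cmd.Parameters["@file"].Value = item.FileName;   // placeholder properties
                        cmd.Parameters["@value"].Value = item.Value;
                        cmd.ExecuteNonQuery();
                    }

                    tx.Commit();
                }
            }
        }
    }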
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider the option of loading data dynamically. I mean, the user probably can't see all the data at any one moment, so you may load a part of the .csv into memory and show it to the user; then, if the user makes some annotations/edits, you save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original one and your little saved chunks.
This is often the practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create WinForms software that operates on a huge database (tables with more than 10M rows; they can't fit into a mediocre office PC's memory).
And yes, it is too much work to do this with .csv files manually. It's easier to use a database to handle the loading/saving of edited parts and the composition of the final output.
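As a very small sketch of the chunked idea applied to the intermediate .csv from the question (the page size is arbitrary; Skip still has to read past earlier lines, but nothing outside the requested page is kept in memory):

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class CsvPager
    {
        const int PageSize = 5000;   // arbitrary page size

        // File.ReadLines streams the file; only the requested page ends up in the list.
        public static List<string> LoadPage(string csvPath, int pageIndex)
        {
            return File.ReadLines(csvPath)
                       .Skip(pageIndex * PageSize)
                       .Take(PageSize)
                       .ToList();
        }
    }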
This is more of an optimization problem than a coding problem.
CASE:
I have a CSV file on the FTP server, that is updated every 30 minutes and contains product data. To dumb things down, let's assume only ID and quantity.
On the other end, there is API that manages the warehouse.
It's worth noting that the CSV file contains ~7,000 products, while the warehouse will take only a select number of them (max. 800-900, mostly the same ones; some may be included and some discarded in the future).
GOAL:
I would like to efficiently parse the data from the CSV into the API (basically update warehouse stocks).
IDEAS I CAME UP WITH TO SOLVE THE PROBLEM:
Most basic
Download the CSV file from the FTP server (using FileStream and FtpWebRequest)
Copy the file content into lists / arrays
Request the product IDs currently in the warehouse from the API
Find those IDs in the lists / arrays and update their quantities accordingly through the API again.
Everything on the fly
Read the CSV file line by line
In the same loop iteration, do all the quantity updating (idea above); a rough sketch follows below.
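Sketch of this "on the fly" idea. The warehouse client (GetProductIds / UpdateQuantity) is a hypothetical stand-in for the real API, and the "id,quantity" column layout is an assumption:

    using System.Collections.Generic;
    using System.IO;
    using System.Net;

    // Hypothetical interface standing in for the real warehouse API.
    interface IWarehouseApi
    {
        IEnumerable<string> GetProductIds();
        void UpdateQuantity(string productId, int quantity);
    }

    class StockSync
    {
        public static void Sync(string ftpUrl, string user, string pass, IWarehouseApi warehouse)
        {
            // The 800-900 IDs the warehouse actually cares about, fetched once.
            var wanted = new HashSet<string>(warehouse.GetProductIds());

            var request = (FtpWebRequest)WebRequest.Create(ftpUrl);
            request.Method = WebRequestMethods.Ftp.DownloadFile;
            request.Credentials = new NetworkCredential(user, pass);

            using (var response = (FtpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string line;
                while ((line = reader.ReadLine()) != null)   // stream; never hold all ~7,000 rows
                {
                    string[] fields = line.Split(',');       // assumed "id,quantity" layout
                    string id = fields[0];
                    if (!wanted.Contains(id))
                        continue;                            // skip products the warehouse ignores

                    warehouse.UpdateQuantity(id, int.Parse(fields[1]));  // hypothetical API call
                }
            }
        }
    }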
Anything else?
WHAT I'M WORRIED ABOUT:
The file being modified during reading. Even though I could easily fit in between those 30-minute gaps, is there a way to make the solution robust against that, just in case?
Is there a more effective way than this for implementing a user-saved-file in my application?
(I'm using C# / .Net with Sql Server)
MY GOAL:
I wish to allow users to save their created datapoints (along with some other structured data) to a "project file" with an arbitrary extension.
PROPOSED METHOD:
Store each datapoint in the database along with a FileID column
When user saves the file, grab all the datapoints: "SELECT * ... WHERE FileID = #CurrentFileID".
Export all of those datapoints to an XML file.
Delete all of those datapoints from the database.
Save the XML file as (or as part of) the project file.
Every time the user loads their project file, import the data from the XML back into the database.
Display the datapoints from the database that have FileID = current file ID.
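For illustration, a sketch of steps 2-3 using LINQ to XML (the table, column and element names are placeholders):

    using System.Data.SqlClient;
    using System.Xml.Linq;

    class ProjectExporter
    {
        public static void ExportToXml(string connectionString, int fileId, string xmlPath)
        {
            var root = new XElement("DataPoints");

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT X, Y, Label FROM DataPoints WHERE FileID = @fileId", conn))   // placeholder columns
            {
                cmd.Parameters.AddWithValue("@fileId", fileId);
                conn.Open();

                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        root.Add(new XElement("Point",
                            new XAttribute("X", reader["X"]),
                            new XAttribute("Y", reader["Y"]),
                            new XAttribute("Label", reader["Label"])));
                    }
                }
            }

            new XDocument(root).Save(xmlPath);
        }
    }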
ALTERNATIVE:
Use SQLite and create a separate SQLite database file for each of the user's projects?
The "best" answer depends on several factors. You can help yourself arrive at the best implementation for you by asking some probing questions about the use of the data.
The first question I would ask is: is there any reason that you can think of, now or in the future, for storing the data points as discrete fields in the database?
Think about this question in the context of what needs to consume those data points.
If, for example, you need to be able to pull them into a report or export only a portion of them at a time based on some criteria, then you almost certainly need to store them as discrete values in the database.
However, if the points will only ever be used in your application as a complete set and you have to disassemble and reassemble the XML every time, you may want to just store the complete XML file in the database as a blob. This will make storage, backup, and update very simple operations, but will also limit future flexibility.
I tend to lean toward the first approach (discrete fields), but you could easily start with the second approach if it is a stretch to conceive of any other use for the data and, if the need arises to have discrete fields, it should be a relatively easy exercise to convert the data in the blobs into tabled data if needed.
You can refine the answer by also asking additional questions:
What is the magnitude of the data (hundreds, thousands, millions, billions)?
What is the frequency of inserts and updates (x per second, minute, day, year)?
What is the desired response time (milliseconds, seconds, doesn't matter)?
If you need to insert hundreds of thousands of points per second, you might be better off storing the data as a blob.
However, if you only need to insert a few hundred points a couple of times a day, you might be better off starting with the tabled approach.
It can get complex, but by probing how the data is used in your scenario, you should be able to arrive at a pretty good answer for your system.
I use CSV files as a database in separate processes. I only store all the data or read all the data into my DataGrid in a singular (one-table) relationship. Every field in every txt file has one and only one number, starting from zero.
//While retrieving countries, I read allcountries.txt,
//while retrieving cities, I read allcities.txt,
//while retrieving places, I read allplaces.txt.
But one country has many cities and one city has many places. Yet I don't use any relationships. I want to, and I know there is a need for this. How can I read and write related data by adding one extra column to all the text files?
And is it possible to access the data without SQL queries?
Text files don't have any mechanism for SELECTs or JOINs. You'll be at a pretty steep disadvantage.
However, LINQ gives you the ability to search through object collections in SQL-like ways, and you can certainly create entities that have relationships. The application (whatever you're building) will have to load everything from the text files at application start. Since your code will be doing the searching, it has to have the entities loaded (from text files) and in-memory.
That may or may not be a pleasant or effective experience, but if you're set on replacing SQL with text files, it's not impossible.
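As a small sketch of what that could look like: give the cities file an extra country-ID column, load both files at startup into simple classes, and relate them with a LINQ join. The "id,name" and "id,countryId,name" layouts here are assumptions.

    using System;
    using System.IO;
    using System.Linq;

    class Country { public int Id; public string Name; }
    class City { public int Id; public int CountryId; public string Name; }

    class Program
    {
        static void Main()
        {
            // Assumed layout: allcountries.txt = "id,name"; allcities.txt = "id,countryId,name".
            var countries = File.ReadAllLines("allcountries.txt")
                .Select(l => l.Split(','))
                .Select(f => new Country { Id = int.Parse(f[0]), Name = f[1] })
                .ToList();

            var cities = File.ReadAllLines("allcities.txt")
                .Select(l => l.Split(','))
                .Select(f => new City { Id = int.Parse(f[0]), CountryId = int.Parse(f[1]), Name = f[2] })
                .ToList();

            // SQL-like join, entirely in memory.
            var citiesWithCountry =
                from city in cities
                join country in countries on city.CountryId equals country.Id
                select new { city.Name, Country = country.Name };

            foreach (var row in citiesWithCountry)
                Console.WriteLine(row.Name + " - " + row.Country);
        }
    }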
CSV files are good for mass input and mass output. They are not good for point manipulations or maintaining relationships. Consider using a database. SQLite might be something useful in your application.
Based on your comments, it would make more sense to use XML instead of CSV. This meets your requirements for being human and machine readable, and XML has nice built in libraries for searching, manipulating, serializing etc.
You can use SQL queries against CSV files: see "How to use SQL against a CSV file". I have done it for reading but never for writing, so I don't know if that part will work for you.
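One common way to do this (which may or may not be exactly what the linked article uses) is the OLE DB text driver. The provider name depends on what is installed (Jet 4.0 here, ACE on newer systems), and the "table" is simply the file name inside the data-source folder:

    using System.Data.OleDb;

    class CsvSql
    {
        public static void Query(string folder)
        {
            string connStr =
                "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + folder +
                ";Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand("SELECT * FROM [allcities.txt]", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // reader[0], reader["ColumnName"], ...
                    }
                }
            }
        }
    }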
I have a lot of Forex tick data to be saved. My question is: what is the best way?
Here is an example: I collect only 1 month of data for the EURUSD pair. It is originally in a CSV file which is 136 MB large and has 2,465,671 rows. I use the library from http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader and it took around 30 seconds to read all the ticks and save them into 2,465,671 objects. First of all, is that fast enough?
Secondly, is there any way better than CSV? For example, a binary file, which might be faster; and do you have any recommendation about which database is best? I tried db4o but it was not very impressive. I think there is some overhead in saving the data as object properties when we have to save 2,465,671 objects into db4o's .yap file.
I've thought about this before, and if I was collecting this data, I would break up the process:
collect data from the feed, form a line (I'd use fixed width), and append it to a text file.
I would create a new text file every minute and name it something like rawdata.yymmddhhmm.txt
Then I would have another process working in the background, reading these files and pushing them into a database via a parameterized insert query.
I would probably use text over a binary file because I know that would append without any problems, but I'd also look into opening a binary file for append as well. This might actually be a bit better.
Also, you want to open the file in append mode since that's the fastest way to write to a file. This will obviously need to be super fast.
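A minimal sketch of that collector step (the field widths and the tick fields themselves are placeholders); the second StreamWriter argument is what puts the file into append mode:

    using System;
    using System.IO;

    class TickCollector
    {
        public static void Append(string symbol, DateTime time, double bid, double ask)
        {
            // One file per minute, named rawdata.yymmddhhmm.txt as described above.
            string fileName = "rawdata." + DateTime.UtcNow.ToString("yyMMddHHmm") + ".txt";

            // Fixed-width record (widths are placeholders).
            string line = string.Format("{0,-8}{1:yyyyMMddHHmmssfff}{2,12:F5}{3,12:F5}",
                                        symbol, time, bid, ask);

            // true = append; the file is created if missing, otherwise writes go to the end.
            using (var writer = new StreamWriter(fileName, true))
            {
                writer.WriteLine(line);
            }
        }
    }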
Perhaps look at this product:
http://kx.com/kdb+.php
It seems to be made for that purpose.
One way to save data space (and hopefully time) is to save numbers as numbers and not as text, which is what CSV does.
You can perhaps make an object out of each row, and then make reading and writing each object a serialization problem, for which there is good support in C#.
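For instance, a sketch of writing each tick as raw numbers with BinaryWriter instead of CSV text (the Time/Bid/Ask fields are assumptions about what a tick holds):

    using System;
    using System.IO;

    struct Tick
    {
        public DateTime Time;
        public double Bid;
        public double Ask;
    }

    class TickFile
    {
        public static void Write(string path, Tick[] ticks)
        {
            using (var writer = new BinaryWriter(File.Open(path, FileMode.Create)))
            {
                foreach (Tick t in ticks)
                {
                    writer.Write(t.Time.Ticks);   // 8 bytes instead of a ~17-char timestamp string
                    writer.Write(t.Bid);          // 8 bytes instead of text like "1.34567"
                    writer.Write(t.Ask);
                }
            }
        }

        public static Tick Read(BinaryReader reader)
        {
            return new Tick
            {
                Time = new DateTime(reader.ReadInt64()),
                Bid = reader.ReadDouble(),
                Ask = reader.ReadDouble()
            };
        }
    }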
Kx's kdb database would be a great off-the-shelf package if you had a few million to spare. However, you could easily write your own column-oriented database to store and analyse high-frequency data for optimal performance.
I save terabytes as compressed binary files (GZIP) that I dynamically uncompress using C#/.NETs built-in gzip compression/decompression readers.
HDF5 is widely used for big data, including by some financial firms. Unlike kdb it's free to use, and there are plenty of libraries that go on top of it, such as the .NET wrapper.
This SO question might help you get started.
HDF5 homepage