This is more of an optimization problem than a coding problem.
CASE:
I have a CSV file on an FTP server that is updated every 30 minutes and contains product data. To dumb things down, let's assume it has only an ID and a quantity.
On the other end, there is API that manages the warehouse.
It's worth noting that the CSV file contains ~7,000 products, while the warehouse will take only a select number of them (max 800-900, mostly the same ones; some may be included and some discarded in the future).
GOAL:
I would like to efficiently push the data from the CSV to the API (basically update warehouse stocks).
IDEAS I CAME UP WITH TO SOLVE THE PROBLEM:
Most basic
Download the CSV file from the FTP server (using FileStream and FtpWebRequest)
Copy the file content into lists/arrays
Request the product IDs currently in the warehouse from the API
Find those IDs among the others in the list/array and update their quantities accordingly, again through the API.
Everything on the fly
Read the CSV file line by line
In the same loop iteration, do all the quantity updating (as in the idea above)
Anything else?
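For reference, the "most basic" idea above could be sketched roughly like this. The FTP URI, the credentials, the semicolon delimiter and the `WarehouseApi` class are all placeholders, not a known API; substitute your real endpoints:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;

// Hypothetical stand-ins for the real warehouse API client.
static class WarehouseApi
{
    public static IEnumerable<string> GetProductIds() => new[] { "A1", "B2" };
    public static void UpdateQuantity(string id, int quantity) { /* HTTP call here */ }
}

static class StockSync
{
    // Parse "id;quantity" lines into a lookup; adjust the separator to the real file.
    public static Dictionary<string, int> ParseCsv(TextReader reader, char separator = ';')
    {
        var stocks = new Dictionary<string, int>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var parts = line.Split(separator);
            if (parts.Length >= 2 && int.TryParse(parts[1], out var quantity))
                stocks[parts[0].Trim()] = quantity;
        }
        return stocks;
    }

    public static void Run()
    {
        var request = (FtpWebRequest)WebRequest.Create("ftp://example.com/products.csv");
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        request.Credentials = new NetworkCredential("user", "password");

        Dictionary<string, int> stocks;
        using (var response = (FtpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            stocks = ParseCsv(reader);

        // Iterate over the ~800-900 warehouse products, not all ~7,000 CSV rows.
        foreach (var id in WarehouseApi.GetProductIds())
            if (stocks.TryGetValue(id, out var quantity))
                WarehouseApi.UpdateQuantity(id, quantity);
    }
}
```

Since only 800-900 of the ~7,000 products matter, querying the warehouse IDs first and probing the dictionary keeps the number of API calls to the minimum.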
WHAT I'M WORRIED ABOUT:
The file being modified while I'm reading it. Even though I could easily fit in between those 30-minute gaps, is there a way to make the solution robust, just in case?
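One possible hedge against mid-write reads: take the file's FTP timestamp before and after the download and retry if it changed during the transfer. A rough sketch; the retry logic takes delegates so it is separable from the (placeholder) FTP wiring:

```csharp
using System;
using System.IO;
using System.Net;

static class SafeDownload
{
    // Fetches content and accepts it only if the timestamp is the same
    // before and after the transfer, i.e. the snapshot is consistent.
    public static string Fetch(Func<DateTime> getTimestamp, Func<string> download,
                               int maxAttempts = 3)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            DateTime before = getTimestamp();
            string content = download();
            if (getTimestamp() == before)
                return content;
        }
        throw new IOException("File kept changing during download.");
    }

    // FTP wiring (the URI is a placeholder).
    public static DateTime GetFtpTimestamp(string uri)
    {
        var request = (FtpWebRequest)WebRequest.Create(uri);
        request.Method = WebRequestMethods.Ftp.GetDateTimestamp;
        using (var response = (FtpWebResponse)request.GetResponse())
            return response.LastModified;
    }
}
```

Note this only helps if the server reports timestamps reliably. The most robust fix is on the producer's side: upload to a temporary name and rename once complete, since a rename is atomic on most servers.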
Related
I have a C# tool that parses a collection of CSV files to construct a List<MyObject>. This collection can be small (limited to 20 files) or as large as 10,000+ files. MyObject itself has about 20 properties, most of them strings. Each file can create sometimes up to 4 items in the list, and sometimes as many as 300.
After the parsing is done, I first save the list to a CSV file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and there are multiple pivots of the dataset the user can choose. The data is presented in WPF; the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally, the user can save all of this information to another CSV file.
I ran into OOMs when the files got large and have optimized some of my code. First I realized I was storing one property, the path to the CSV file, which was sometimes close to 255 characters; I changed it to save only the filename, and things improved slightly. I then discovered a suggestion to compile for x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously hit OOMs as more and more files are added to the data set.
Some of the options I've considered are:
When parsing the files, save to the intermediate .csv file after each file is parsed and don't keep the list in memory. This will work for me to avoid hitting an OOM before I even get to save the intermediate .csv file.
The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is all done.
Some of the properties on MyObject are shared by a collection of files. So I've considered refactoring the single object into multiple objects, which would reduce the number of items in the list. Essentially refactoring to a List<MyTopLevelDetailsObject>, with MyTopLevelDetailsObject containing a List<MyObject>. The memory footprint should theoretically be reduced. I can then output this to CSV by doing some translation to make it appear like a single object.
Move the data internally to a database like MongoDB and push the summarization logic down to the database.
Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also needing me to learn MongoDB. :)
I'm looking for some guidance and helpful tips of how Large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which despite the name, can handle large amounts of data and doesn't require a local server.
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider an option to load data dynamically. The user probably can't see all the data at one moment in time anyway. So you may load a part of the .csv into memory and show it to the user; if the user makes some annotations/edits, you save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original one and your little saved chunks.
This is common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create WinForms software to operate on a huge database (tables with more than 10M rows; they can't fit into a mediocre office PC's memory).
And yes, it's too much work to do that with .csv files manually. It's easier to use a database to handle the saving/loading of edited parts and the composition of the final output.
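As a rough illustration of the chunked approach with a database (here SQLite via the Microsoft.Data.Sqlite package; the `items` table and its columns are made-up names), the UI asks for one page at a time as the user scrolls instead of holding every parsed row in a list:

```csharp
using Microsoft.Data.Sqlite; // NuGet: Microsoft.Data.Sqlite
using System.Collections.Generic;

record Row(string FileName, string Detail);

static class PagedStore
{
    // Load one page of rows; only pageSize items are ever in memory at once.
    public static List<Row> LoadPage(string dbPath, int pageIndex, int pageSize)
    {
        var rows = new List<Row>();
        using var connection = new SqliteConnection($"Data Source={dbPath}");
        connection.Open();

        var command = connection.CreateCommand();
        command.CommandText =
            "SELECT file_name, detail FROM items ORDER BY id LIMIT @take OFFSET @skip";
        command.Parameters.AddWithValue("@take", pageSize);
        command.Parameters.AddWithValue("@skip", pageIndex * pageSize);

        using var reader = command.ExecuteReader();
        while (reader.Read())
            rows.Add(new Row(reader.GetString(0), reader.GetString(1)));
        return rows;
    }
}
```

Annotations can be written back the same way, one row at a time, so memory use stays flat no matter how many files were parsed.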
Is there a more effective way than this for implementing a user-saved-file in my application?
(I'm using C# / .Net with Sql Server)
MY GOAL:
I wish to allow users to save their created datapoints (along with some other structured data) to a "project file" with arbitrary extension.
PROPOSED METHOD:
Store each datapoint in the database along with a FileID column
When user saves the file, grab all the datapoints: "SELECT * ... WHERE FileID = #CurrentFileID".
Export all of those datapoints to an XML file.
Delete all of those datapoints from the database.
Save the XML file as (or as part of) the project file.
Every time user loads their project file, import the data from the XML back into the database.
Display the datapoints from the database that have FileID = current file ID.
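Steps 3 and 6 of the proposed method (exporting the datapoints to XML and importing them back) might look something like this sketch; the `DataPoint` shape is just an illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

record DataPoint(int Id, double X, double Y);

static class ProjectFile
{
    // Serialize a project's datapoints to an XML document (step 3).
    public static XDocument Export(IEnumerable<DataPoint> points) =>
        new XDocument(new XElement("datapoints",
            points.Select(p => new XElement("point",
                new XAttribute("id", p.Id),
                new XAttribute("x", p.X),
                new XAttribute("y", p.Y)))));

    // Read them back for re-import into the database (step 6).
    public static List<DataPoint> Import(XDocument doc) =>
        doc.Root.Elements("point")
           .Select(e => new DataPoint(
               (int)e.Attribute("id"),
               (double)e.Attribute("x"),
               (double)e.Attribute("y")))
           .ToList();
}
```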
ALTERNATIVE:
Use Sqlite and create a separate Sqlite database file for each of user's projects?
The "best" answer depends on several factors. You can help yourself arrive at the best implementation for you by asking some probing questions about the use of the data.
The first question I would ask is: is there any reason that you can think of, now or in the future, for storing the data points as discrete fields in the database?
Think about this question in the context of what needs to consume those data points.
If, for example, you need to be able to pull them into a report or export only a portion of them at a time based on some criteria, then you almost certainly need to store them as discrete values in the database.
However, if the points will only ever be used in your application as a complete set and you have to disassemble and reassemble the XML every time, you may want to just store the complete XML file in the database as a blob. This will make storage, backup, and update very simple operations, but will also limit future flexibility.
I tend to lean toward the first approach (discrete fields), but you could easily start with the second approach if it is a stretch to conceive of any other use for the data and, if the need arises to have discrete fields, it should be a relatively easy exercise to convert the data in the blobs into tabled data if needed.
You can refine the answer by also asking additional questions:
What is the magnitude of the data (hundreds, thousands, millions, billions)?
What is the frequency of inserts and updates (x per second, minute, day, year)?
What is the desired response time (milliseconds, seconds, doesn't matter)?
If you need to insert hundreds of thousands of points per second, you might be better off storing the data as a blob.
However, if you only need to insert a few hundred points a couple of times a day, you might be better off starting with the tabled approach.
It can get complex, but by probing how the data is used in your scenario, you should be able to arrive at a pretty good answer for your system.
On my website I have one CSV file containing millions of records.
Based on a search key I need to select one record.
That part is completed.
My doubt is: if multiple users (1,000 users) access my website and only one CSV file is available, can, say, 100 users read the same file at the same time?
1M records is not a lot. Frankly, I'd just load it all into structured data and reference that. Any number of users can access it once it is in memory (especially for read-only access).
But ultimately the ideal answer here is: use a database. SQL Server Express is free and will cope with that effortlessly.
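A minimal sketch of the load-once-and-share idea; the file name and the "key,value" column layout are assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CsvCache
{
    // Parsed exactly once, on first access, then shared by all requests.
    static readonly Lazy<Dictionary<string, string>> _data =
        new Lazy<Dictionary<string, string>>(() => Load("records.csv"));

    public static Dictionary<string, string> Load(string path)
    {
        var map = new Dictionary<string, string>();
        foreach (var line in File.ReadLines(path))
        {
            int comma = line.IndexOf(',');
            if (comma > 0)
                map[line.Substring(0, comma)] = line.Substring(comma + 1);
        }
        return map;
    }

    // Concurrent reads of a Dictionary that is never mutated are safe,
    // so any number of users can call this at once.
    public static string Find(string key) =>
        _data.Value.TryGetValue(key, out var value) ? value : null;
}
```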
As long as the application only has to read, you will not have a problem. However, it would be more efficient to use a database for this task. You can create indexes and use SQL for easy access. There's no need to parse the file on each request, and you can even add or change data while your site is running.
I have multiple 1.5 GB CSV files which contain billing information on multiple accounts for clients of a service provider. I am trying to split each large CSV file into smaller chunks for processing and formatting the data inside it.
I do not want to roll my own CSV parser, but this is something I haven't seen yet, so please correct me if I am wrong. The 1.5 GB files contain information in the following order: account information, account number, bill date, transactions, ex-GST, inc-GST, type, and other lines.
Note that BillDate here means the date when the invoice was made, so occasionally we have more than two bill dates in the same CSV.
Bills are grouped by : Account Number > Bill Date > Transactions.
Some accounts have 10 lines of transaction details, some have over 300,000 lines. A large 1.5 GB CSV file contains around 8 million lines of data. I previously used UltraEdit to cut and paste it into smaller chunks, but that has become a very inefficient and time-consuming process.
I just want to load the large CSV file in my WinForms app, click a button, and have it split the large file into chunks of, say, no greater than 250,000 lines. Some bills are actually bigger than 250,000 lines; in that case, keep them in one piece and do not split accounts across multiple files, since they are ordered anyway. Also, I don't want accounts with multiple bill dates combined in one chunk; in that case the splitter can create another additional split.
I already have a WinForms application (VS C# 2010) that does the formatting of the CSV into smaller files automatically.
Is it actually possible to process these very large CSV files? I have been trying to load the large files, but an OutOfMemoryException crashes the app every time and I don't know how to fix it. I am open to suggestions.
Here is what I think I should be doing:
Load the large CSV file (but this fails with an OutOfMemoryException). How do I solve this?
Group data by account name, bill date, and count the number of lines for each group.
Then create an array of integers (the line counts for each group).
Pass this array of integers to a file splitter process which will take these counts and write the blocks of data.
Any suggestions will be greatly appreciated.
Thanks.
You can use CsvReader to stream through and parse the data, without needing to load it all into memory in one go.
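A sketch of the streaming split, using plain `File.ReadLines` here (the same pattern applies with a streaming CsvReader). It never holds more than one line in memory and only starts a new file when the chunk is full AND the (account, bill date) group key changes, so a group is never split across files. Column positions 0 (account) and 1 (bill date) are assumptions; adjust to the real layout:

```csharp
using System.IO;

static class CsvSplitter
{
    public static int Split(string inputPath, string outputDir, int maxLines = 250_000)
    {
        Directory.CreateDirectory(outputDir);
        int part = 0, linesInPart = 0;
        string currentKey = null;
        StreamWriter writer = null;
        try
        {
            foreach (var line in File.ReadLines(inputPath)) // streams, never loads it all
            {
                var cols = line.Split(',');
                string key = cols.Length > 1 ? cols[0] + "|" + cols[1] : line;

                // Roll over only at a group boundary, so bills stay whole.
                bool startNewPart = writer == null ||
                    (linesInPart >= maxLines && key != currentKey);
                if (startNewPart)
                {
                    writer?.Dispose();
                    writer = new StreamWriter(
                        Path.Combine(outputDir, $"part{++part:D3}.csv"));
                    linesInPart = 0;
                }
                writer.WriteLine(line);
                linesInPart++;
                currentKey = key;
            }
        }
        finally { writer?.Dispose(); }
        return part; // number of chunk files written
    }
}
```

Because the approach is line-by-line, an 8-million-line file is no harder than a small one; memory use stays constant regardless of input size.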
Yeah, about that... running out of memory is going to happen with files that are HUGE. You need to take your situation seriously.
As with most problems, break everything into steps.
I have had a similar type of situation before (large data file in CSV format, need to process, etc).
What I did:
Make step 1 of your program suite something that merely cuts your huge file into many smaller files. I have broken 5 GB zipped, PGP-encrypted files (after decryption... that's another headache) into many smaller pieces. You can do something simple like numbering them sequentially (i.e. 001, 002, 003...).
Then make an app to do the INPUT processing. No real business logic here. I hate file I/O with a passion when it comes to business logic, and I love the warm fuzzy feeling of data being in a nice SQL Server DB. That's just me. I created a thread pool with N threads (like 5; you decide how much your machine can handle) to read those .csv part files you created.
Each thread reads one file: a one-to-one relationship. Because it is file I/O, make sure you don't have too many running at the same time. Each thread does the same basic operation: reads in data, puts it in a basic structure for the db (table format), does lots of inserts, then ends. I used LINQ to SQL because everything is strongly typed and whatnot, but to each their own. The better the db design, the easier it is for you to do the logic later.
After all threads have finished executing, you have all the data from the original CSV in the database. Now you can do all your business logic and whatever else from there. Not the prettiest solution, but I was forced into developing it given my situation/data flow/size/requirements. You might go with something completely different. Just sharing, I guess.
You can use an external sort. I suppose you'd have to do an initial pass through the file to identify proper line boundaries, as CSV records are probably not of a fixed length.
Hopefully, there might be some ready-made external sort implementations for .NET that you could use.
There's a very useful class in the Microsoft.VisualBasic.FileIO namespace that I've used for dealing with CSV files - the TextFieldParser Class.
It might not help with the large file size, but it's built-in and handles quoted and non-quoted fields (even if mixed in the same line). I've used it a couple of times in projects at work.
Despite the assembly name, it can be used with C#, in case you're wondering.
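A minimal usage sketch: a quoted field containing the delimiter comes back as a single field, which a naive `string.Split(',')` gets wrong:

```csharp
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic
using System.Collections.Generic;

static class QuotedCsv
{
    public static List<string[]> Read(string path)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(path))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields()); // one record's fields at a time
        }
        return rows;
    }
}
```

Because it reads record by record, it also pairs nicely with the streaming approaches above instead of loading whole files.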
Sorry for the bad title.
I'm saving web pages. I currently use one XML file as an index. One element contains the file's created date (UTC) and the full URL (with query string and so on). The headers go in a separate file with a similar name but a special extension appended.
However, at 40k files (including headers), the XML is now 3.5 MB. Until recently I would read the XML, add the new entry, and save the file each time; now I keep it in memory and save it every once in a while.
When I request a page, the URL is looked up using XPath on the XML file, if there is an entry, the file path is returned.
The directory structure is
.\www.host.com/randomFilename.randext
So I am looking for a better way.
Im thinking:
One XML file per domain (including subdomains). But I feel this might be a hassle.
Using SVN. I just tested it, but I have no experience with large repositories. I'd execute svn add "path to file" for every download and commit when I'm done.
Creating a custom file system, where I can then include everything I want, e.g. POST data.
Generating a filename from the URL and somehow flattening the query string, but large query strings might be rejected by the OS. And if I keep the query string with the headers, I still need to keep track of multiple files mapped to each different query string. A hassle. And I don't want it to run too slowly either.
Multiple program instances will perform read/write operations, on different computers.
If I follow the directory/file method, I could in theory add a layer between so it uses DotNetZip on the fly. But then again, the query string.
I'm just looking for direction or experience here.
What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
I would recommend either a relational database or a version control system.
You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.
I would use 2 data stores, one for the raw files and another for indexes.
To store the flat files, I think Berkeley DB is a good choice. The key can be generated by MD5 or another hash function, and you can also compress the content of each file to save some disk space.
For the indexes, you can use a relational database or a more sophisticated text search engine like Lucene.
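The key-generation idea might look like this: hash the full URL (query string included) down to a fixed-length hex key, which sidesteps any OS limits on file-name or path length:

```csharp
using System.Security.Cryptography;
using System.Text;

static class UrlKey
{
    // Map an arbitrary URL to a fixed 32-character hex key.
    public static string FromUrl(string url)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }
}
```

The same URL always yields the same key, so lookups are trivial; different query strings yield different keys, so each variant gets its own stored file.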