I use CSV files as a database in separate processes. I only store all data or read all data into my datagrid as flat, standalone tables. Every record in every txt file has one and only one ID number, starting from zero.
//When accessing countries, I read allcountries.txt,
//when accessing cities, I read allcities.txt,
//when accessing places, I read allplaces.txt.
But one country has many cities and one city has many places, yet I don't use any relationships. I want to, and I know something more is needed for this. How can I read and write related data by adding one extra column to each text file?
And is it possible to access the data without SQL queries?
Text files don't have any mechanism for SELECTs or JOINs. You'll be at a pretty steep disadvantage.
However, LINQ gives you the ability to search through object collections in SQL-like ways, and you can certainly create entities that have relationships. The application (whatever you're building) will have to load everything from the text files at application start. Since your code will be doing the searching, it has to have the entities loaded (from text files) and in-memory.
That may or may not be a pleasant or effective experience, but if you're set on replacing SQL with text files, it's not impossible.
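For illustration only, here's a minimal sketch of that approach, assuming each line of allcountries.txt is "Id,Name" and each line of allcities.txt is "Id,CountryId,Name" (the extra CountryId column being the relationship); the file layouts and class names are assumptions, not your actual format.
using System.IO;
using System.Linq;

class Country { public int Id; public string Name; }
class City    { public int Id; public int CountryId; public string Name; }

// ... inside your load method, at application start:
var countries = File.ReadAllLines("allcountries.txt")
    .Select(l => l.Split(','))
    .Select(p => new Country { Id = int.Parse(p[0]), Name = p[1] })
    .ToList();

var cities = File.ReadAllLines("allcities.txt")
    .Select(l => l.Split(','))
    .Select(p => new City { Id = int.Parse(p[0]), CountryId = int.Parse(p[1]), Name = p[2] })
    .ToList();

// SQL-like join, done entirely in memory with LINQ.
var citiesWithCountry = from city in cities
                        join country in countries on city.CountryId equals country.Id
                        select new { Country = country.Name, City = city.Name };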
CSV files are good for mass input and mass output. They are not good for point manipulations or maintaining relationships. Consider using a database. SQLite might be something useful in your application.
Based on your comments, it would make more sense to use XML instead of CSV. This meets your requirements of being human- and machine-readable, and XML has nice built-in libraries for searching, manipulating, serializing, etc.
You can use SQL queries against CSV files: see "How to use SQL against a CSV file". I have done it for reading but never for writing, so I don't know whether that part will work for you.
Related
I have a C# tool that parses a collection of CSV files to construct a List<MyObject>. This collection can be small, limited to 20 files, or as large as 10,000+ files. MyObject itself has about 20 properties, most of them strings. Each file can sometimes create up to 4 items in the list and sometimes as many as 300.
After the parsing is done, I first save the list to a CSV file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and there are multiple pivots of the dataset the user can choose. The data is presented in WPF, and the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally, the user can save all of this information to another CSV file.
I ran into OOM errors when the files got large and have optimized some of my code. First I realized I was storing one parameter, i.e. the path to the CSV file, which was sometimes close to 255 characters. I changed it to save only the filename, and things improved slightly. I then discovered a suggestion to compile to x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously still hit OOMs as more and more files are added to the data set.
Some of the options I've considered are:
1) When parsing the files, save to the intermediate.csv file after each file parse and do not keep the list in memory. This will help me avoid hitting an OOM before I even get to save the intermediate.csv file. The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is all done.
2) Some of the properties on MyObject are similar for a collection of files, so I've considered refactoring the single object into multiple objects, which would reduce the number of items in the List object. Essentially, refactoring to a List<MyTopLevelDetailsObject>, with each MyTopLevelDetailsObject containing a List<MyObject>. The memory footprint should theoretically shrink. I can then output this to CSV by doing some translation to make it appear like a single object.
3) Move the data to a database like MongoDB internally and push the summarization logic into the database.
4) Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also needing me to learn MongoDB. :)
I'm looking for some guidance and helpful tips on how large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which despite the name, can handle large amounts of data and doesn't require a local server.
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider loading the data dynamically. The user probably can't see all the data at one moment in time, so you can load a part of the .csv into memory and show it to the user; if the user makes some annotations/edits, you save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original one and your little saved chunks.
This is a common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create a WinForms application to operate on a huge database (tables with more than 10 million rows; they can't fit into a mediocre office PC's memory).
And yes, it's too much work to do this with .csv manually. It's easier to use a database to handle the saving/loading of edited parts and the composition of the final output.
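If you do stay with plain .csv, a rough sketch of the chunk loading described above might look like this (the file name, chunk size, and LoadChunk method are placeholders; a real grid would trigger it from its virtualization/scroll events):
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Reads only rows [startRow, startRow + count), so the whole file
// never has to sit in memory at once.
static List<string[]> LoadChunk(string path, int startRow, int count)
{
    return File.ReadLines(path)          // streams lines lazily, unlike ReadAllLines
               .Skip(startRow)
               .Take(count)
               .Select(line => line.Split(','))
               .ToList();
}

// e.g. show rows 500,000..500,999 when the user scrolls there:
// var page = LoadChunk("intermediate.csv", 500000, 1000);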
I'm making a small game using XNA, but it is cumbersome to type in the stats for all the entities in the game by hand.
I was thinking that it would be much simpler to save the required information in a separate file with a table-format and use that instead.
I have looked into reading Excel tables with C# but it seems to be overly complex.
Are there any other table-format file types that let me easily edit the contents of the file and also read the contents using C# without too much hassle?
Basically, is there any other simple alternative to Excel? I just need the simplest table files to save some text in.
CSV is probably the simplest format for storing table data, and Excel can save data in it. There are no built-in classes to read data from CSV as far as I know.
You may also consider XML or JSON to store the data if you want something more structured. Both have built-in classes to serialize objects to/from.
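For example, a minimal sketch of the XML route with the built-in XmlSerializer (the EnemyStats class and "stats.xml" file name are placeholders, not anything XNA provides):
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class EnemyStats
{
    public string Name;
    public int Health;
    public int Damage;
}

// Load the whole stats table from an XML file at startup.
var serializer = new XmlSerializer(typeof(List<EnemyStats>));
List<EnemyStats> stats;
using (var stream = File.OpenRead("stats.xml"))
{
    stats = (List<EnemyStats>)serializer.Deserialize(stream);
}
Writing the table back out is the mirror image: serializer.Serialize(stream, stats).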
If you are comfortable using Excel, try exporting to a .CSV (Comma Separated Values) file. The literal string will look like the example below.
row1col1,row1col2\nrow2col1,row2col2\nrow3col1,row3col2
The format is incredibly simple. Each row is on a separate line (separated by "\n") and the columns within a line are separated by commas. It's very easy to parse: just iterate through the lines and split on the commas.
string row;
using (var tr = new StreamReader("stats.csv"))   // your exported file
{
    while ((row = tr.ReadLine()) != null)
    {
        var cols = row.Split(',');
        // cols[0] is the first column, cols[1] the second, cols[2] etc...
    }
}
This may be overkill, but SQLite might be worth looking into if you want expandability and maintainability. It is an easy setup, and learning SQL will be useful in many applications.
This is a good tutorial to get you started:
http://www.dreamincode.net/forums/topic/157830-using-sqlite-with-c%23/
I understand if this isn't exactly what you were looking for, but I wanted to give you a broader range of options. If you are going for absolute simplicity, go with CSV or XML like Alexei said.
Edit: If necessary, there is a C# SQLite version for managed environments (Xbox, WP7): http://forums.create.msdn.com/forums/p/47127/282261.aspx
I have multiple 1.5 GB CSV files which contain billing information on multiple accounts for clients from a service provider. I am trying to split each large CSV file into smaller chunks for processing and formatting the data inside it.
I do not want to roll my own CSV parser, but this is something I haven't seen yet, so please correct me if I am wrong. The 1.5 GB files contain information in the following order: account information, account number, bill date, transactions, ex GST, inc GST, type, and other lines.
Note that BillDate here means the date when the invoice was made, so occasionally we have more than two bill dates in the same CSV.
Bills are grouped by: Account Number > Bill Date > Transactions.
Some accounts have 10 lines of transaction details; some have over 300,000 lines. A large 1.5 GB CSV file contains around 8 million lines of data. I previously used UltraEdit to cut and paste it into smaller chunks, but this has become a very inefficient and time-consuming process.
I just want to load the large CSV file in my WinForms app, click a button, and have it split this large file into chunks of, say, no more than 250,000 lines each. Some bills are actually bigger than 250,000 lines, in which case they should be kept in one piece and an account should not be split across multiple files, since they are ordered anyway. I also don't want accounts with multiple bill dates in one CSV, in which case the splitter can create an additional split.
I already have a WinForms application that does the formatting of the CSV into smaller files automatically, in VS C# 2010.
Is it actually possible to process these very large CSV files? I have been trying to load the large files, but the OutOfMemoryException is an annoyance since it crashes every time, and I don't know how to fix it. I am open to suggestions.
Here is what I think I should be doing:
1) Load the large CSV file (this currently fails with an OutOfMemoryException). How do I solve this?
2) Group the data by account name and bill date, and count the number of lines for each group.
3) Then create an array of integers.
4) Pass this array of integers to a file-splitter process which will take these arrays and write out the blocks of data.
Any suggestions will be greatly appreciated.
Thanks.
You can use CsvReader to stream through and parse the data, without needing to load it all into memory in one go.
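As a rough sketch of that streaming idea, here is the same thing done with a plain StreamReader (the account-number column index, the file names, and the 250,000-line threshold are assumptions to adapt to your layout):
using System.IO;

const int maxLines = 250000;
int chunkNumber = 0, linesInChunk = 0;
string previousAccount = null;
StreamWriter writer = null;

using (var reader = new StreamReader("big.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string account = line.Split(',')[1];   // assumed position of the account number

        // Only start a new output file on an account boundary, so an
        // account (and its bill dates) never spans two chunk files.
        if (writer == null || (linesInChunk >= maxLines && account != previousAccount))
        {
            if (writer != null) writer.Dispose();
            writer = new StreamWriter("chunk_" + (++chunkNumber) + ".csv");
            linesInChunk = 0;
        }

        writer.WriteLine(line);
        linesInChunk++;
        previousAccount = account;
    }
}
if (writer != null) writer.Dispose();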
Yeah, about that... running out of memory is going to happen with files that are HUGE. You need to take your situation seriously.
As with most problems, break everything into steps.
I have had a similar type of situation before (large data file in CSV format, need to process, etc).
What I did:
Make step 1 of your program suite (or whatever) something that merely cuts your huge file into many smaller files. I have broken 5 GB zipped-up, PGP-encrypted files (after decryption... that's another headache) into many smaller pieces. You can do something simple like numbering them sequentially (i.e. 001, 002, 003...).
Then make an app to do the INPUT processing. No real business logic here. I hate file I/O with a passion when it comes to business logic, and I love the warm fuzzy feeling of data being in a nice SQL Server DB. That's just me. I created a thread pool and had N threads (like 5; you decide how many your machine can handle) read those .csv part files you created.
Each thread reads one file, a one-to-one relationship. Because it is file I/O, make sure you don't have too many running at the same time. Each thread does the same basic operation: reads in the data, puts it in a basic structure for the db (table format), does lots of inserts, then ends the thread. I used LINQ to SQL because everything is strongly typed and whatnot, but to each their own. The better the db design, the better it is for you later when doing the logic.
After all threads have finished executing, you have all the data from the original CSV in the database. Now you can do all your business logic and do whatever you want from there. Not the prettiest solution, but I was forced into developing it given my situation/data flow/size/requirements. You might go with something completely different. Just sharing, I guess.
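A bare-bones sketch of that fan-out, using Parallel.ForEach in place of a hand-rolled thread pool (the part-file folder and InsertIntoDatabase are placeholders for your own paths and insert logic):
using System.IO;
using System.Threading.Tasks;

var partFiles = Directory.GetFiles(@"C:\work\parts", "*.csv");

// Cap the parallelism so file I/O and the database don't get swamped.
var options = new ParallelOptions { MaxDegreeOfParallelism = 5 };

Parallel.ForEach(partFiles, options, partFile =>
{
    foreach (var line in File.ReadLines(partFile))   // streams the part file
    {
        var fields = line.Split(',');
        InsertIntoDatabase(fields);   // placeholder for your LINQ to SQL / ADO.NET insert
    }
});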
You can use an external sort. I suppose you'd have to do an initial pass through the file to identify proper line boundaries, as CSV records are probably not of a fixed length.
Hopefully, there might be some ready-made external sort implementations for .NET that you could use.
There's a very useful class in the Microsoft.VisualBasic.FileIO namespace that I've used for dealing with CSV files - the TextFieldParser Class.
It might not help with the large file size, but it's built-in and handles quoted and non-quoted fields (even if mixed in the same line). I've used it a couple of times in projects at work.
Despite the assembly name, it can be used with C#, in case you're wondering.
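A short example of the class in use (the file name is made up; the parser calls are the documented Microsoft.VisualBasic.FileIO API):
using Microsoft.VisualBasic.FileIO;   // add a reference to Microsoft.VisualBasic.dll

using (var parser = new TextFieldParser("bills.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;   // copes with quoted fields containing commas

    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();   // one record per call, streamed from disk
        // process fields here
    }
}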
I'm supposed to do the following:
1) read a huge (700MB ~ 10 million elements) XML file;
2) parse it preserving order;
3) create one or more text files with SQL insert statements to bulk load it into the DB;
4) take the relational tuples and write them back out as XML.
I'm here to exchange some ideas about the best (== fast fast fast...) way to do this. I will use C# 4.0 and SQL Server 2008.
I believe that XmlTextReader is a good start, but I do not know if it can handle such a huge file. Does it load the whole file when it is instantiated, or does it hold just the current reading position in memory? I suppose I can do a while(reader.Read()) loop and that should be fine.
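Something like this sketch is what I have in mind (element handling is just a placeholder):
using System.Xml;

using (var reader = XmlReader.Create("huge.xml"))
{
    while (reader.Read())   // pulls one node at a time; the file is never fully loaded
    {
        if (reader.NodeType == XmlNodeType.Element)
        {
            // inspect reader.Name, reader.GetAttribute(...), etc.
        }
    }
}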
What is the best way to write the text files? Since I must preserve the ordering of the XML (adopting some numbering scheme), I will have to hold some parts of the tree in memory to do the calculations, etc. Should I build the output up with a StringBuilder?
I will have two scenarios: one where every node (element, attribute, or text) will be in the same table (i.e., will be the same object), and another where for each type of node (just these three types, no comments, etc.) I will have a table in the DB and a class to represent the entity.
My last specific question is: how good is DataSet's ds.WriteXml? Will it handle 10M tuples? Maybe it's best to bring chunks from the database and use an XmlWriter... I really don't know.
I'm testing all this stuff, but I decided to post this question to hear from you guys, hoping your expertise can help me do these things more correctly and faster.
Thanks in advance,
Pedro Dusso
I'd use the SQLXML Bulk Load Component for this. You provide a specially annotated XSD schema for your XML with embedded mappings to your relational model. It can then bulk load the XML data blazingly fast.
If your XML has no schema, you can create one from Visual Studio by loading the file and selecting Create Schema from the XML menu. You will need to add the mappings to your relational model yourself, however. This blog has some posts on how to do that.
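If it helps, the C# invocation looks roughly like the sketch below, after adding a COM reference to the SQLXML Bulk Load library; treat the class and property names as approximate and check them against the version you install (the connection string, schema, and data file names are placeholders):
// Requires a COM reference to the SQLXML Bulk Load library (SQLXMLBULKLOADLib).
var bulkLoad = new SQLXMLBULKLOADLib.SQLXMLBulkLoad4Class();
bulkLoad.ConnectionString = "Provider=SQLOLEDB;Data Source=.;Initial Catalog=MyDb;Integrated Security=SSPI;";
bulkLoad.ErrorLogFile = "bulkload_errors.log";

// The annotated XSD maps XML elements/attributes to tables and columns.
bulkLoad.Execute("mapping-schema.xsd", "huge.xml");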
Guess what? You don't have a SQL Server problem. You have an XML problem!
Faced with your situation, I wouldn't hesitate. I'd use Perl and one of its many XML modules to parse the data, create simple tab- or other-delimited files to bulk load, and bcp the resulting files.
Using the server to parse your XML has many disadvantages:
Not fast, more than likely
Positively useless error messages, in my experience
No debugger
Nowhere to turn when one of the above turns out to be true
If you use Perl on the other hand, you have line-by-line processing and debugging, error messages intended to guide a programmer, and many alternatives should your first choice of package turn out not to do the job.
If you do this kind of work often and don't know Perl, learn it. It will repay you many times over.
I need to analyze tens of thousands of lines of data. The data is imported from a text file, and each line has eight variables. Currently, I use a class to define the data structure, and as I read through the text file I store each line object in a generic list (List<T>).
I am wondering if I should switch to using a relational database (SQL), as I will need to analyze the data in each line of text, trying to relate it to definition terms which I also currently store in generic lists (List<T>).
The goal is to translate a large amount of data using definitions. I want the defined data to be filterable, searchable, etc. Using a database makes more sense the more I think about it, but I would like to confirm with more experienced developers before I make the changes yet again (I was using structs and ArrayLists at first).
The only drawback I can think of is that the data does not need to be retained after it has been translated and viewed by the user. There is no need for permanent storage of the data, so using a database might be overkill.
It is not absolutely necessary to go to a database. It depends on the actual size of the data and the processing you need to do. If you are loading the data into a List<T> with a custom class, why not use LINQ to do your querying and filtering? Something like:
var query = from foo in fooList          // fooList is the List<Foo> you loaded
            where foo.Prop == criteriaVar
            select foo;
The real question is whether the data is so large that it cannot be loaded into memory comfortably. If that is the case, then yes, a database would be much simpler.
This is not a large amount of data. I don't see any reason to involve a database in your analysis.
There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
It sounds like what you want is a database. Sqlite supports in-memory databases (use ":memory:" as the filename). I suspect others may have an in-memory mode as well.
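A minimal sketch using the System.Data.SQLite provider (table and column names are invented; the ":memory:" data source is the part that matters):
using System.Data.SQLite;

using (var conn = new SQLiteConnection("Data Source=:memory:"))
{
    conn.Open();   // the database lives only in RAM and vanishes when the connection closes

    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "CREATE TABLE lines (field1 TEXT, field2 TEXT)";
        cmd.ExecuteNonQuery();
    }

    // insert the parsed rows here, then filter/search them with ordinary SQL
}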
I faced the same problem you are facing now while working at my previous company. I was looking for a solid solution to handle a lot of barcode-generated files. The barcode system generated a text file with thousands of records in a single file, and manipulating and presenting the data was difficult for me at first. Based on those records, I wrote a class that reads the file, loads the data into a DataTable, and can save it to a database. The database I used was SQL Server 2005. I was then able to manage the saved data easily and present it whichever way I liked. The main point is to read the data from the file and save it to the database. If you do that, you will have a lot of options to manipulate and present it the way you like.
If you do not mind using Access, here is what you can do:
Attach a blank Access db as a resource
When needed, write the db out to file.
Run a CREATE TABLE statement that handles the columns of your data
Import the data into the new table
Use sql to run your calculations
On close, delete that Access db.
You can use a program like Resourcer to load the db into a resx file
ResourceManager res = new ResourceManager("MyProject.blank_db", this.GetType().Assembly);
byte[] b = (byte[])res.GetObject("access.blank");
Then use the code above to pull the resource out of the project: take the byte array and save it to the temp location with a temp filename.
"MyProject.blank_db" is the location and name of the resource file.
"access.blank" is the name given to the resource to save.
If the only thing you need to do is search and replace, you may consider using sed and awk, and you can do searches using grep. Of course, that is on a Unix platform.
From your description, I think Linux command-line tools can handle your data very well. Using a database may unnecessarily complicate your work. If you are using Windows, these tools are also available in various ways; I would recommend Cygwin. The following tools may cover your task: sort, grep, cut, awk, sed, join, paste.
These Unix/Linux command-line tools may look scary to a Windows person, but there are reasons people love them. The following are mine:
They allow your skill to accumulate: your knowledge of a particular tool will be helpful in different future tasks.
They allow your efforts to accumulate: the command line (or scripts) you used to finish the task can be repeated as many times as needed with different data, without human interaction.
They usually outperform the same tool you could write yourself. If you don't believe it, try to beat sort with your own version on terabyte-sized files.