I want to develop an open source library for fast, efficient file storage (one large data file plus an index file), similar to NFileStorage. Why do I want to do this?
A. Something like this is needed in my line of work.
B. Our DBA said it's not efficient to store files in the DB.
C. It's good practice for me.
I am looking for a good article on file indexes.
Can you recommend one?
What is your general take on the idea?
It may not be efficient to store files inside a database; however, databases like SQL Server have the FILESTREAM concept, where the data is actually stored on the local file system instead of inside the database file itself.
In my opinion this is a bad idea for a project.
You are going to run into exactly the same problem that databases have with storing all of the uploaded files inside the same single file... which is why some of them have moved away from this for binary / large objects and instead support alternative methods.
Some of the problems you will have to deal with include:
1. Allocating additional disk space for your backing file to store newly uploaded documents.
2. Permanently removing "files" from your storage and resizing/compressing the backing file.
3. Multi-user access/locks.
4. Failure recovery, such as when you encounter a bad block on the drive and it hoses your backing file.
5. Transactional support.
Items 1 and 2 increase the amount of time it takes to write a "file" to your data store. Items 3, 4 and 5 are already supported by network file systems, so you're just reinventing the wheel.
In short, you're going to have to write either your own file system or your own DBMS, neither of which I would consider "good practice" for 99% of real-world applications. It might be worthwhile if your goal is to work for Seagate... but even then they'd probably look at you funny.
If you are truly interested in the most efficient method of file storage, the simplest answer is to purchase a SAN array and push your files to it while keeping a pointer to the file/location in your database. It is easy to back up, fast to store files, much cheaper than spending developer time trying to figure out how to write your own file system, and certainly 100% supported and understandable by future devs.
This kind of product already exists. You should read about MongoDB (http://www.mongodb.org/display/DOCS/Home), whose GridFS feature stores files in exactly this chunked, indexed fashion.
Size: ~5 MB
Size on Disk: ~3 GB
We're using C# and saving data constantly as it changes, and all of the file data has to be accessible at any given time. Basically, if something changes, the file for that data must be saved; this is why there are so many files for so much data. The data is also heavily processed, so clumping all of it together is not an option, since a minor change would cause a large amount to be saved for no reason. These files already contain enough that saving one is mostly redundant for only a small change.
Surely there is a way to get around this absurd expansion of the on-disk size while retaining the accessibility and saving efficiency we have achieved. We need a way to package these files into what Windows will consider a single file, but in such a way that we do not have to rewrite the entire file when something changes.
I understand that having thousands of small files is quite strange, but for our purposes it has improved performance greatly. We just don't want to sacrifice one resource for another if it is at all possible to avoid.
Note: The files contain RLE-encoded binary data; they are not text files.
Clarity update: 5 MB of data occupies 3 GB on disk; at 50x that many clusters, 250 MB of data would occupy 150 GB = PROBLEM!
A database does exactly what you need: You can store arbitrary amounts of tiny rows/blobs and they will be stored efficiently. File systems typically require at least one disk cluster per file which is probably why your size expands so much. Databases don't do that. You can also ask the database to compact itself.
There are embedded and standalone databases available.
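As a concrete illustration, here is a minimal sketch of the embedded route, assuming the Microsoft.Data.Sqlite NuGet package; the chunks table and its column names are invented for this example:

```csharp
// Hedged sketch: storing many small RLE chunks in an embedded SQLite
// database instead of thousands of tiny files. Assumes the
// Microsoft.Data.Sqlite NuGet package; the "chunks" table is invented.
using Microsoft.Data.Sqlite;

class BlobStore
{
    static void Main()
    {
        using (var conn = new SqliteConnection("Data Source=data.db"))
        {
            conn.Open();

            using (var create = conn.CreateCommand())
            {
                create.CommandText =
                    "CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, payload BLOB)";
                create.ExecuteNonQuery();
            }

            // Upsert one small chunk; only this row is rewritten,
            // not the whole backing file.
            using (var upsert = conn.CreateCommand())
            {
                upsert.CommandText =
                    "INSERT INTO chunks (id, payload) VALUES ($id, $payload) " +
                    "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload";
                upsert.Parameters.AddWithValue("$id", "chunk-0001");
                upsert.Parameters.AddWithValue("$payload", new byte[] { 0x01, 0x02, 0x03 });
                upsert.ExecuteNonQuery();
            }

            // Reclaim space after deletions ("compacting" the store).
            using (var vacuum = conn.CreateCommand())
            {
                vacuum.CommandText = "VACUUM";
                vacuum.ExecuteNonQuery();
            }
        }
    }
}
```

Because rows share database pages, thousands of 5 KB chunks no longer each burn a full disk cluster, which is where the 50x blow-up was coming from.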
We are designing an update to a current system (C++/CLI and C#).
The system will gather small (~1 MB) amounts of data from ~10K devices (in the near future). Currently, device data is saved as CSV files (one table per file) stored in a wide folder structure.
Data is only inserted (a file created or appended to, a folder created), never updated or removed.
Data processing is done by reading many CSVs into an external program (like Matlab), mainly for statistical analysis.
There is an option to start saving this data to an MS-SQL database.
Processing time (reading the CSVs into the external program) can be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers; pros and cons are welcome.
Thank you for your time.
Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; check whether the free Express edition would suffice.
Good luck!
I'd like to try hitting those questions a bit out of order.
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. Everything with databases ultimately comes back to scalability. If you had only a single CSV file to read, you'd be good to go; no DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.
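As a rough illustration of the migration path, here is a hedged sketch of loading one device's CSV into SQL Server with SqlBulkCopy; the connection string, table name, column layout and file path are all assumptions for this example, not details from the original system:

```csharp
// Hedged sketch: bulk-loading one device's CSV into SQL Server.
// The connection string, table name and three-column schema are
// invented for illustration; adapt them to the real device data.
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class CsvLoader
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("DeviceId", typeof(string));
        table.Columns.Add("Timestamp", typeof(DateTime));
        table.Columns.Add("Value", typeof(double));

        // Assume a simple "deviceId,timestamp,value" layout with no header.
        foreach (var line in File.ReadLines(@"C:\data\device42.csv"))
        {
            var parts = line.Split(',');
            table.Rows.Add(parts[0], DateTime.Parse(parts[1]), double.Parse(parts[2]));
        }

        using (var conn = new SqlConnection(
            "Server=.;Database=Telemetry;Integrated Security=true"))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.DeviceReadings";
                bulk.WriteToServer(table);   // one round trip for the whole file
            }
        }
    }
}
```

Once the rows are in one indexed table, the statistical tool queries the server directly instead of walking the folder tree.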
This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.
If you have the option to use an MS-SQL database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files that could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your ~10K devices, you should consider using a standard database.
I'm developing a web app in ASP.NET that is mainly about storing, sharing and processing MS Word docs and PDF files, but I'm not sure how to manage these documents. I was thinking of keeping the documents in folders and only keeping their metadata in the DB, or keeping the whole documents in the DB. I'm using SQL Server 2008. What's your suggestion?
SQL Server 2008 is reasonably good at storing and serving up large documents (unlike some of the earlier versions), so it is definitely an option. That said, serving large blobs from the DB is generally not a great idea. You need to weigh the advantages and disadvantages of both approaches. Some general things to think about:
How large are the files going to be, and how many of them will there be? It's a lot easier to scale a file system past many TB than it is to do the same for a DB.
How do you want to manage backups? Obviously with a file system approach you'd need to back the files up separately from the DB.
I believe it's probably quicker to implement a solution that stores to the DB, but storing to the file system is generally the superior solution. In the latter case, however, you will have to deal with some issues, such as generating unique file names and avoiding storing too many documents in a single folder (most solutions create new folders after every few thousand documents); see the sketch below. Use the quicker approach if the files will not be numerous and large; otherwise, invest some time in storing on the file system.
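Here is a minimal sketch of handling both issues, assuming GUID-based names and a 256-way folder fan-out; the root path and bucketing scheme are arbitrary choices for illustration:

```csharp
// Hedged sketch: unique names plus folder fan-out for file-system storage.
// The root path and the bucketing scheme are illustrative assumptions.
using System;
using System.IO;

static class DocumentStore
{
    const string Root = @"D:\Documents";

    // Store uploaded bytes under a collision-free name, bucketing files
    // into subfolders derived from the GUID so that no single folder
    // accumulates more than a bounded share of the documents.
    public static string Save(byte[] content, string extension)
    {
        string id = Guid.NewGuid().ToString("N");
        string folder = Path.Combine(Root, id.Substring(0, 2)); // 256 buckets
        Directory.CreateDirectory(folder);                      // no-op if present

        string path = Path.Combine(folder, id + extension);
        File.WriteAllBytes(path, content);
        return path; // persist this path (or just the GUID) in the database
    }

    static void Main()
    {
        string stored = Save(new byte[] { 1, 2, 3 }, ".pdf");
        Console.WriteLine("Stored at " + stored);
    }
}
```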
In the database, unless you don't care about data integrity.
If you store docs outside of the database, you will have missing documents and broken links sooner rather than later. Your backup/restore scenario also becomes a lot more complex: you have no way to ensure that all data is from the same point in time.
FILESTREAM in SQL Server 2008 makes this efficient nowadays (and other RDBMSs have similar features).
If you're storing these files in one folder, then maintain the file names in the DB, since no directory can hold two files with the same name and extension. If you wish to store the files in the DB, you may have to use a BLOB column or a byte array.
I do see overhead in opening a connection to the DB, though I don't know how fast a DB round trip is compared to File.Open, performance-wise.
If the files are relatively small, I would store them as BLOB fields in the database. This way you can use standard procedures for backup/restore as well as transactions. If the files are large, there are some advantages in keeping them on the hard drive and storing the filenames in the database, as was suggested earlier.
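For the small-file case, a hedged sketch of writing a file into a varbinary(max) column follows; the Documents table, its columns and the connection string are invented for this example:

```csharp
// Hedged sketch: storing a small file as a BLOB in SQL Server.
// Table name, columns and connection string are illustrative only.
using System.Data;
using System.Data.SqlClient;
using System.IO;

class BlobUpload
{
    static void Main()
    {
        byte[] bytes = File.ReadAllBytes(@"C:\docs\report.pdf");

        using (var conn = new SqlConnection(
            "Server=.;Database=Docs;Integrated Security=true"))
        using (var cmd = new SqlCommand(
            "INSERT INTO Documents (Name, Content) VALUES (@name, @content)", conn))
        {
            cmd.Parameters.Add("@name", SqlDbType.NVarChar, 260).Value = "report.pdf";
            cmd.Parameters.Add("@content", SqlDbType.VarBinary, -1).Value = bytes; // -1 = varbinary(max)
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```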
How many documents are you planning to store?
The main advantage of the database approach is the usual set of ACID properties: the metadata will always be consistent with the document, which will not be the case if you use the file system to store the documents. With the file system, it is relatively easy for metadata and documents to get out of sync: documents on the file system for which there is no metadata, or metadata whose document has gone missing or is corrupted. If you need any sort of reliability in your document storage, the database is a much better approach than the file system.
If you are only going to operate on those files from the application, I would store them in the DB as BLOB data. If you keep the files in folders and only their names in the DB, you should consider that, for example, one day you may need to:
1) rename a file
2) change its location
3) change its extension
or whatever.
With the DB approach, you can instead save the BLOB data in one table and the name and extension of the file in another table, along with the ID of the row in the BLOB table. In the scenario above, you would then only need to execute a simple SQL UPDATE query.
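A short sketch of that rename-by-UPDATE idea under the two-table layout; the FileInfo table, its columns and the connection string are invented names:

```csharp
// Hedged sketch: renaming a stored file is a metadata-only UPDATE
// when the BLOB lives in its own table. Table/column names invented.
using System.Data.SqlClient;

class RenameFile
{
    static void Main()
    {
        using (var conn = new SqlConnection(
            "Server=.;Database=Docs;Integrated Security=true"))
        using (var cmd = new SqlCommand(
            "UPDATE FileInfo SET Name = @name, Extension = @ext WHERE BlobId = @id",
            conn))
        {
            cmd.Parameters.AddWithValue("@name", "renamed-report");
            cmd.Parameters.AddWithValue("@ext", ".pdf");
            cmd.Parameters.AddWithValue("@id", 42);
            conn.Open();
            cmd.ExecuteNonQuery(); // the multi-MB BLOB row is never touched
        }
    }
}
```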
What is the best way to save large amount of data for a .Net 4.0 application?
Right now I am using Lists and serializing to a file in the "User Data" folder, and it's working OK, but I want to know if there is a better/faster way of saving/loading large amounts of data.
The data that I am saving contains only lots of words, like documents.
The size of the data is almost 1 MB.
That really depends on the type of your application. I wouldn't use a SQL database of any sort just for load and save operations on data that I do not need to query or transform. The time it takes to map your object graph to a relational model is simply not worth it.
Also, I don't believe it will ever be faster than simple serialization, due to the overhead associated with databases (connection management and mapping).
My recent experience was with BinaryFormatter, which gave excellent results (files of ~15 MB). Worst comes to worst, you can always write your own formatter.
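A minimal sketch of that approach; the WordData type is a stand-in for whatever object graph is being persisted, and note that BinaryFormatter should only be used on data your own application wrote:

```csharp
// Hedged sketch: round-tripping an object graph with BinaryFormatter,
// as the answer describes. WordData is an invented example type.
// BinaryFormatter is only safe for data your own app produced.
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class WordData
{
    public List<string> Words = new List<string>();
}

class Program
{
    static void Main()
    {
        string path = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData),
            "words.bin");

        var data = new WordData { Words = { "alpha", "beta" } };
        var formatter = new BinaryFormatter();

        using (var fs = File.Create(path))
            formatter.Serialize(fs, data);      // save the whole graph

        using (var fs = File.OpenRead(path))
            data = (WordData)formatter.Deserialize(fs); // load it back
    }
}
```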
It kinda depends on your data and how you have it stored in your app.
But all these NoSQL storage systems are a possibility, or just plain binary data written to a file.
When you say "large amount of data", what exactly do you mean by that? A megabyte? A terabyte?
And what exactly is the data?
If it's a set of account records, it might well belong in a database of some sort; if it's a set of images or word processing documents, perhaps not.
If you want fast access, one approach would be to deserialize into a hashtable and cache it in between reads and writes.
The problems here are, of course, versioning and changes of namespaces (after which you won't be able to deserialize easily), deadlocks, concurrency, etc.
It is better to save the file as XML/JSON, and when you read it into memory, load it into a hashtable for fast access.
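A small sketch of that suggestion, assuming a flat key/value payload; the Entry type and file layout are invented for illustration:

```csharp
// Hedged sketch: persist as XML, cache in a dictionary for fast lookups.
// The Entry type and the file layout are invented for this example.
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Entry
{
    public string Key;
    public string Value;
}

public static class XmlCache
{
    public static Dictionary<string, string> Load(string path)
    {
        var serializer = new XmlSerializer(typeof(List<Entry>));
        using (var fs = File.OpenRead(path))
        {
            var entries = (List<Entry>)serializer.Deserialize(fs);
            var cache = new Dictionary<string, string>();
            foreach (var e in entries)
                cache[e.Key] = e.Value;   // hashtable lookups from here on
            return cache;
        }
    }

    public static void Save(string path, Dictionary<string, string> cache)
    {
        var entries = new List<Entry>();
        foreach (var pair in cache)
            entries.Add(new Entry { Key = pair.Key, Value = pair.Value });

        var serializer = new XmlSerializer(typeof(List<Entry>));
        using (var fs = File.Create(path))
            serializer.Serialize(fs, entries);
    }
}
```

Because the on-disk format is plain XML rather than a binary graph, renaming types or namespaces in the app no longer breaks deserialization.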
Currently we have thousands of Microsoft Word files, Excel files, PDFs, images etc. stored in folders/subfolders. These are generated by an application on a regular basis and can be accessed at any time within that application. As we look to upgrade, we are now considering storing all these documents in SQL Server 2005 instead. Our reasons are the ability to compress the documents, to add extra fields that store more information about them, and to apply indexes where necessary.
I suppose what I’m after is the pros and cons of using SQL Server as a document repository instead of keeping them on the file server, as well as any experience you might have in doing this.
We would be using C# and Windows Workflow to do this task.
Thanks for your comments.
Edit
How big are the files?
between 100 KB and 200 KB in size (avg. 70 KB)
How many will there be?
At the moment it's around 3.1 million files (ranging from Word/Excel to PDFs), which can grow by 2,600 a day. (The growth rate will also increase over time.)
How many reads?
This one is difficult to quantify as our old system/application makes it hard to work this out.
Also another useful link pointed out on a similar post covers the pros and cons of both methods.
Files Stored on DB vs FileSystem - Pros and Cons
rule of thumb for doc size is:
size < 256 KB: store in DB
256 KB < size < 1 MB: test for your load
size > 1 MB: store on file system
EDIT: this rule of thumb also applies for FILESTREAM storage in SQL Server 2008
If you upgrade all the way, to SQL Server 2008, then you can use the new FILESTREAM feature, that allows the document to appear as a column in a table, yet to reside as a file on a share, where it can be directly accessed by a program (like Word).
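A hedged sketch of reading such a column from C# with SqlFileStream; the Documents table and Content column are invented names, while the transaction and PathName() steps are required by the API:

```csharp
// Hedged sketch: streaming a FILESTREAM column via SqlFileStream.
// Table and column names are invented for illustration.
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;

class FileStreamRead
{
    static void Main()
    {
        using (var conn = new SqlConnection(
            "Server=.;Database=Docs;Integrated Security=true"))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                var cmd = new SqlCommand(
                    "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                    "FROM Documents WHERE Id = @id", conn, tx);
                cmd.Parameters.AddWithValue("@id", 42);

                string path;
                byte[] txContext;
                using (var reader = cmd.ExecuteReader())
                {
                    reader.Read();
                    path = reader.GetString(0);
                    txContext = (byte[])reader.GetValue(1);
                }

                // The document is read as an ordinary stream, straight
                // from the NTFS file that backs the column.
                using (var stream = new SqlFileStream(path, txContext, FileAccess.Read))
                using (var ms = new MemoryStream())
                {
                    stream.CopyTo(ms);
                }

                tx.Commit();
            }
        }
    }
}
```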
I would have both, as sketched below.
I would keep the files renamed with a unique name, which makes them easier to manage, and I would keep all the metadata in the database (file name, content type, location on the file system, size, description, etc.), so the files are accessed through the database (indirectly).
Advantages:
files are easy to handle; you can bring several drives into the mix
the database can keep any amount of meta information, including a file description that you can search against
keep track of file accesses and other statistical information
rearrange the files using various paradigms: tree (directory structure), tags, search or context
You can also have compression on a drive, and RAID for backup and speed.
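A hedged sketch of that indirect-access idea; the FileMeta table, its columns and the paths are invented for this example:

```csharp
// Hedged sketch: write the file under a unique name, then record all
// metadata in the database so access always goes through the DB row.
// Table/column names and paths are invented for illustration.
using System;
using System.Data.SqlClient;
using System.IO;

class HybridStore
{
    static void Main()
    {
        byte[] content = File.ReadAllBytes(@"C:\incoming\contract.docx");
        string storedName = Guid.NewGuid().ToString("N") + ".docx";
        string storedPath = Path.Combine(@"E:\store", storedName);
        File.WriteAllBytes(storedPath, content);

        using (var conn = new SqlConnection(
            "Server=.;Database=Docs;Integrated Security=true"))
        using (var cmd = new SqlCommand(
            "INSERT INTO FileMeta (OriginalName, ContentType, StoredPath, Size, Description) " +
            "VALUES (@name, @type, @path, @size, @desc)", conn))
        {
            cmd.Parameters.AddWithValue("@name", "contract.docx");
            cmd.Parameters.AddWithValue("@type",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
            cmd.Parameters.AddWithValue("@path", storedPath);
            cmd.Parameters.AddWithValue("@size", content.Length);
            cmd.Parameters.AddWithValue("@desc", "Signed supplier contract");
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```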
What kind of documents are we talking about?
Storing documents in your SQL Server might be useful because you can relate the documents to other tables and use techniques like full-text indexing to do things like fuzzy searches.
A downside is that it might be a bit harder to create backups of the documents. And compression is also possible outside the database, with NTFS compression or other techniques.
Are these documents text-based, and are you planning on using SQL Server's full-text search against them? If not, I don't see any benefit in storing these documents in the database. Of course, you can always store the metadata related to the documents, including path information, in the database.
A big benefit of storing docs in the DB is that it becomes much easier to control access to them, since you can do it all via access control in your app. Storing them on a file server requires dealing with access privileges at the file and folder level to prevent any direct access. Also, having them in a DB makes for a single point of backup, so you can more easily make a full copy and/or move it around if needed.
Rather than writing a custom DMS (document management system), you should probably consider buying one, or using WSS/SharePoint, which will handle all the mundane details (storage, indexing, metadata) and let you build your custom functionality on top.