I'm developing a web app in ASP.NET that is mainly about storing, sharing and processing MS Word and PDF files, but I'm not sure how to manage these documents. I was thinking of either keeping the documents in folders and storing only their metadata in the DB, or keeping the whole documents in the DB. I'm using SQL Server 2008. What's your suggestion?
SQL Server 2008 is reasonably good at storing and serving up large documents (unlike some of the earlier versions), so it is definitely an option. That said, having large blobs being served up from the DB is generally not a great idea. I think you need to think about the advantages and disadvantages of both approaches. Some general things to think about:
How large are the files going to be, and how many of them will there be? It's a lot easier to scale a file system past many TB than it is to do the same for a DB.
How do you want to manage backups? Obviously with a file system approach you'd need to back the files up separately from the DB.
I believe it's probably quicker to implement a solution that stores to the DB, but that storing to the file system is generally the superior solution. In the latter case, however, you will have to worry about some issues, such as having unique file names, and in general not wanting to store too many documents in a single folder (most solutions create new folders after every few thousand documents). Use the quicker approach if the files are not going to be numerous and large, otherwise invest some time in storing on the file system.
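For illustration, here's a rough sketch of one way to handle the unique file names and folder bucketing in C#; the root path and the files-per-folder threshold are just placeholder assumptions, not a definitive layout:

```csharp
// Sketch: store an uploaded document under a unique name, bucketing files
// into numbered subfolders so no single directory grows too large.
// The root path and bucket size below are illustrative assumptions.
using System;
using System.IO;

public static class DocumentStore
{
    private const string RootPath = @"D:\Documents";   // assumed storage root
    private const int FilesPerFolder = 1000;           // assumed bucket size

    public static string Save(Stream upload, string originalExtension, int documentId)
    {
        // Bucket by document id: ids 0-999 go to folder "0", 1000-1999 to "1", etc.
        string bucket = (documentId / FilesPerFolder).ToString();
        string folder = Path.Combine(RootPath, bucket);
        Directory.CreateDirectory(folder);              // no-op if it already exists

        // A GUID guarantees a collision-free file name.
        string fileName = Guid.NewGuid().ToString("N") + originalExtension;
        string fullPath = Path.Combine(folder, fileName);

        using (var target = File.Create(fullPath))
        {
            upload.CopyTo(target);
        }

        // The caller would record fullPath (or bucket + fileName) in the metadata table.
        return fullPath;
    }
}
```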
In the database unless you don't care about data integrity.
If you store docs outside of the database you will have missing documents and broken links sooner rather than later. Your backup/restore scenario is also a lot more complex: you have no way to ensure that all data is from the same point in time.
FILESTREAM in SQL Server 2008 makes it efficient nowadays (and other RDBMS have features like this too)
If you're storing these files in one folder, then keep the file names in the DB, since no directory can hold two files with the same name and extension. If you wish to store the files in the DB, then you will have to use a BLOB column and a byte array.
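For example, a minimal sketch of pushing a file into a varbinary(max) column with ADO.NET might look like this; the Documents table and its columns are assumed names for illustration only:

```csharp
// Sketch: read a file into a byte array and insert it as varbinary(max).
// Table name "Documents" and its columns are assumptions for illustration.
using System.Data.SqlClient;
using System.IO;

public static class BlobUploader
{
    public static void Insert(string connectionString, string filePath)
    {
        byte[] content = File.ReadAllBytes(filePath);   // fine for small files

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO Documents (FileName, Content) VALUES (@name, @content)", connection))
        {
            command.Parameters.AddWithValue("@name", Path.GetFileName(filePath));
            command.Parameters.AddWithValue("@content", content);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```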
I see overhead in opening a connection to the DB, though I don't know how fast a DB read is compared to File.Open (even performance-wise).
If files are relatively small, I would store them as BLOB fields in the database. This way you can use standard procedures for backup/restore as well as transactions. If files are large, there are some advantages in keeping them on the hard drive and storing the file names in the database, as was suggested earlier.
How many documents are you planning to store?
The main advantage of the database approach is the normal ACID properties--the meta-data will always be consistent with the document, which will not be the case if you use the file system to store the documents. Using the file system it would be relatively easy for your meta-data and documents to get out of sync: documents on the file system for which there is no meta-data, meta-data where the document has gone missing or is corrupted. If you need any sort of reliability in your document storage, then the database is a much better approach than using the file system.
If you are only going to operate on those files from your application, I would store them in the DB as BLOB data. If you keep the files in folders and only their names in the DB, you need to consider that, for example:
1) one day you may need rename file
2) change its location
3) change its extension
or whatever.
With the DB approach instead, you can store the BLOB data in a separate table, and in another table keep the file's name and extension along with the ID of the row in the BLOB table. In any of the scenarios above, you then just need to execute a simple SQL UPDATE query.
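As a rough sketch, assuming a DocumentInfo metadata table paired with a DocumentContent BLOB table (both names are placeholders), a rename then reduces to a single UPDATE:

```csharp
// Sketch: renaming a document when the content lives in a separate BLOB table.
// Schema assumed for illustration:
//   DocumentContent(ContentId INT PRIMARY KEY, Content VARBINARY(MAX))
//   DocumentInfo(DocumentId INT PRIMARY KEY, ContentId INT, Name NVARCHAR(260), Extension NVARCHAR(10))
using System.Data.SqlClient;

public static class DocumentMetadata
{
    public static void Rename(string connectionString, int documentId, string newName)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "UPDATE DocumentInfo SET Name = @name WHERE DocumentId = @id", connection))
        {
            command.Parameters.AddWithValue("@name", newName);
            command.Parameters.AddWithValue("@id", documentId);

            connection.Open();
            command.ExecuteNonQuery();   // the BLOB row is untouched; only metadata changes
        }
    }
}
```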
Related
We are designing an update to a current system (C++\CLI and C#).
The system will gather small (~1 MB) amounts of data from ~10K devices (in the near future). Currently, each device's data is saved as a CSV (a table), and all of these are stored in a wide folder structure.
Data is only inserted (create / append to a file, create folder) never updated / removed.
Data processing is done by reading many CSVs into an external program (like Matlab), mainly for statistical analysis.
There is an option to start saving this data to an MS-SQL database.
Processing time (reading the CSVs into the external program) could be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers, Pros and Cons are welcome.
Thank you for your time.
Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
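If you did move the existing CSVs into SQL Server, loading them is not much work. Here's a hedged sketch using SqlBulkCopy, assuming a DeviceReadings table with three columns (an assumption on my part) and ignoring CSV quoting/escaping for brevity:

```csharp
// Sketch: bulk-load one device CSV into a SQL Server table with SqlBulkCopy.
// The target table "DeviceReadings" and its three-column layout are assumptions,
// and the parser below ignores quoted fields for brevity.
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static class CsvImporter
{
    public static void Import(string connectionString, string csvPath)
    {
        var table = new DataTable();
        table.Columns.Add("DeviceId", typeof(string));
        table.Columns.Add("Timestamp", typeof(DateTime));
        table.Columns.Add("Value", typeof(double));

        foreach (string line in File.ReadLines(csvPath))
        {
            string[] fields = line.Split(',');
            table.Rows.Add(fields[0], DateTime.Parse(fields[1]), double.Parse(fields[2]));
        }

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "DeviceReadings";
            bulkCopy.WriteToServer(table);
        }
    }
}
```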
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; you would have to check whether you can use the Express edition
Good luck!
I'd like to try hitting those questions a bit out of order.
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. With databases, everything always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.
This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or if you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.
If you have the option to use a ms-sql database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files, which could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given that you have about 10K devices, you should consider using a standard database.
I want to develop an open source library for fast, efficient file storage (everything under one large data file plus an index file), like NFileStorage. Why do I want to do this?
A. In my line of work something like that was needed.
B. Our DBA said it's not efficient to store files in the DB.
C. It's good practice for me.
I am looking for a good article on file indexes.
Can you recommend one?
What is your general idea?
It may not be efficient to store files inside a database; however, databases like SQL Server have the FILESTREAM feature, which actually stores the data on the local file system instead of placing it in the database file itself.
In my opinion this is a bad idea for a project.
You are going to run into exactly the same problem that databases have with storing all of the uploaded files inside the same single file... which is why some of them have moved away from this for binary / large objects and instead support alternative methods.
Some of the problems you will have to deal with include:
Allocating additional disk space for your backing file to store newly uploaded documents.
Permanently removing "files" from your storage and resizing / compressing the backing file.
Multi-user access / locks.
Failure recovery. Such as when you encounter a bad block on the drive and it hoses your backing file.
Transactional support.
Items 1 and 2 cause an increase in the amount of time it takes to write a "file" to your data store. Items 3, 4 and 5 are already supported by network file systems, so you're just reinventing the wheel.
In short you're going to have to either write your own file system or write your own DBMS. Neither of which I would consider "good practice" for 99% of real world applications. It might be worthwhile if your goal is to work for Seagate.. But even then they'd probably look at you funny.
If you are truly interested in the most efficient method of file storage, it is quite simply to purchase a SAN array and push your files to it while keeping a pointer to the file/location in your database. Easy to back up, fast to store files, much cheaper than spending developer time trying to figure out how to write your own file system and certainly 100% supported and understandable by future devs.
This kind of product already exists. You should read about MongoDB (http://www.mongodb.org/display/DOCS/Home).
I want to upload a music file to a database, but I don't know how. Do I need to upload the file to the server and then upload it to the database, or can I do everything at the same time? And how can I do it?
There is no shortage of tutorials on how to do this. Here's a decent one. Basically, what you're asking is how to store a file as a binary stream in a database field.
However, @Saif al Harthi makes a good point in his comment. It's generally considered bad practice to store a binary file in a relational database. Are you sure this is what you want to do? Your server already has a fairly efficient means of storing/retrieving files... the file system. Unless there's a compelling reason to store the file in the database, it's usually better practice to store it on the file system and just write a database record that references the file (path, maybe type, other application-specific data about it, etc.). The file's name can be changed to, say, the record's primary key so the two can easily be cross-referenced.
It's a little more work, but it's a little better for the server and makes use of the right tools for the right jobs. That is, of course, unless you have a compelling reason for keeping a binary file in a relational database. If there's a reason, please share it.
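To illustrate the pattern (a sketch, not a definitive implementation): insert the metadata row first, then name the file on disk after the new primary key. The MusicFiles table, its columns, the content type and the upload root below are all assumed names:

```csharp
// Sketch: save the uploaded file on disk named after its database key,
// keeping only metadata in the table. Names and paths are assumptions.
using System;
using System.Data.SqlClient;
using System.IO;

public static class MusicUploads
{
    public static int Save(string connectionString, string uploadRoot, Stream upload, string originalName)
    {
        int id;
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO MusicFiles (OriginalName, ContentType) OUTPUT INSERTED.Id VALUES (@name, @type)",
            connection))
        {
            command.Parameters.AddWithValue("@name", originalName);
            command.Parameters.AddWithValue("@type", "audio/mpeg");   // assumed content type

            connection.Open();
            id = (int)command.ExecuteScalar();   // identity of the new row
        }

        // The file on disk is named by the primary key, so row and file reference each other.
        string path = Path.Combine(uploadRoot, id + Path.GetExtension(originalName));
        using (var target = File.Create(path))
        {
            upload.CopyTo(target);
        }
        return id;
    }
}
```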
Sorry for the bad title.
I'm saving web pages. I currently use one XML file as an index. One element contains the file's created date (UTC) and the full URL (with query string and so on). The headers go in a separate file with a similar name but a special extension appended.
However, at 40k files (incl. headers), the XML is now 3.5 MB. Until recently I would read the file, add the new entry, and save the XML file each time, but now I keep it in memory and save it every once in a while.
When I request a page, the URL is looked up using XPath on the XML file, and if there is an entry, the file path is returned.
The directory structure is
.\www.host.com/randomFilename.randext
So I am looking for a better way.
I'm thinking:
One XML file per domain (incl. subdomains). But I feel this might be a hassle.
Using SVN. I just tested it, but I have no experience with large repositories. Executing svn add "path to file" for every download, and committing when I'm done.
Creating a custom file system, where I could then include everything I want, e.g. POST data.
Generating a filename from the URL and somehow flattening the query string, but large query strings might be rejected by the OS. And if I keep it with the headers, I still need to keep track of multiple files mapped to each different query string. Hassle. And I don't want it to execute too slowly either.
Multiple program instances will perform read/write operations, on different computers.
If I follow the directory/file method, I could in theory add a layer between so it uses DotNetZip on the fly. But then again, the query string.
I'm just looking for direction or experience here.
What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
I would recommend either a relational database or a version control system.
You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.
I would use 2 data stores, one for the raw files and another for indexes.
To store the flat files, I think Berkeley DB is a good choice; the key can be generated by MD5 or another hash function, and you can also compress the content of the file to save some disk space.
For indexes, you can use a relational database or a more sophisticated text search engine like Lucene.
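As a small illustration of the content-based key idea, here is one way to derive an MD5 key for a file in C#; whether MD5 is the right hash for your store is a design choice of yours:

```csharp
// Sketch: derive a stable key for a file from an MD5 hash of its content.
// This only shows the mechanics of producing the key string.
using System;
using System.IO;
using System.Security.Cryptography;

public static class ContentKeys
{
    public static string ForFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```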
Currently we have thousands of Microsoft Word files, Excel files, PDFs, images etc. stored in folders/subfolders. These are generated by an application on a regular basis and can be accessed at any time within that application. As we look to upgrade, we are now considering storing all these documents in SQL Server 2005 instead. The reasons for this are being able to compress the documents, add additional fields to store more information about them, and apply indexes where necessary.
I suppose what I’m after is the pros and cons of using SQL Server as a document repository instead of keeping them on the file server, as well as any experience you might have in doing this.
We would be using C# and Windows Workflow to do this task.
Thanks for your comments.
Edit
How big are the files?
between 100 KB and 200 KB in size (avg. 70 KB)
How many will be?
At the moment it's around 3.1 million files (Word, Excel and PDFs), which can grow by 2,600 a day. (The growth rate will also increase over time.)
How many reads?
This one is difficult to quantify as our old system/application makes it hard to work this out.
Also another useful link pointed out on a similar post covers the pros and cons of both methods.
Files Stored on DB vs FileSystem - Pros and Cons
A rule of thumb for document size is:
size < 256 KB: store in the DB
256 KB < size < 1 MB: test for your load
size > 1 MB: store on the file system
EDIT: this rule of thumb also applies to FILESTREAM storage in SQL Server 2008
If you upgrade all the way to SQL Server 2008, then you can use the new FILESTREAM feature, which allows the document to appear as a column in a table yet reside as a file on a share, where it can be directly accessed by a program (like Word).
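For illustration, here's a sketch of reading a FILESTREAM column through SqlFileStream; the Documents table and its Content column are assumed names, and FILESTREAM has to be enabled on the instance and database for this to work:

```csharp
// Sketch: reading a FILESTREAM column with SqlFileStream (SQL Server 2008+).
// Assumes a table "Documents" with a FILESTREAM column "Content".
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;

public static class FileStreamReaderDemo
{
    public static byte[] Read(string connectionString, int documentId)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                var command = new SqlCommand(
                    "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                    "FROM Documents WHERE Id = @id", connection, transaction);
                command.Parameters.AddWithValue("@id", documentId);

                string path;
                byte[] context;
                using (var reader = command.ExecuteReader())
                {
                    reader.Read();
                    path = reader.GetString(0);
                    context = (byte[])reader[1];
                }

                // SqlFileStream streams the document straight from the NTFS store.
                using (var stream = new SqlFileStream(path, context, FileAccess.Read))
                using (var memory = new MemoryStream())
                {
                    stream.CopyTo(memory);
                    transaction.Commit();
                    return memory.ToArray();
                }
            }
        }
    }
}
```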
I would have both.
I would keep the files renamed with unique names, which makes them easier to manage, and I would keep all the metadata in the database (file name, content type, location on the file system, size, description, etcetera), so the files are accessed through the database (indirectly).
Advantages:
files are easy to handle; you can bring several drives into the mix
the database can keep any amount of meta information, including file descriptions which you can search against
keep track of file accesses and other statistical information
rearrange the files using various paradigms: tree (directory structure), tags, search or context
You can have compression on a drive also. You can have RAID for backup and speed.
What kind of documents are we talking about?
Storing documents in your SQL server might be useful because you can relate the documents to other tables and use techniques like Full-text indexing and do things like fuzzy searches.
A downside is that it might be a bit harder to create a backup of the documents. And compression is also possible with NTFS compression or other techniques.
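If you do go the full-text route, a query might look roughly like this; the Documents table and the full-text index on its Content column are assumptions for the sake of example:

```csharp
// Sketch: a full-text query over documents stored in the database.
// Assumes a table "Documents" with a full-text index on its "Content" column.
using System.Collections.Generic;
using System.Data.SqlClient;

public static class DocumentSearch
{
    public static IList<int> FindIds(string connectionString, string searchPhrase)
    {
        var ids = new List<int>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Id FROM Documents WHERE FREETEXT(Content, @phrase)", connection))
        {
            command.Parameters.AddWithValue("@phrase", searchPhrase);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    ids.Add(reader.GetInt32(0));
                }
            }
        }
        return ids;
    }
}
```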
Are these documents text-based, and are you planning on using SQL Server's full-text search to search them? If not, I don't see any benefit in storing these documents in the database. Of course, you can always store the metadata related to the documents, including the path information, in the database.
A big benefit of storing docs in the DB is that it becomes much easier to control security and access to them, as you can do it all via access control in your app. Storing them on a file server requires dealing with access privileges at the file and folder level to prevent any direct access. Also, having them in a DB makes for a single point of backup, so you can more easily make a full copy and/or move it around if needed.
Rather than writing a custom DMS (document management system), you should probably consider buying one or using WSS / SharePoint as this will handle all the mundane details (storage, indexing, meta-data) and let you build your custom functionality on top.