I'm working on a project that will have millions of small MP3 files, which I was planning to save on the hard disk.
I have the following questions:
What structure should I use to save the files: one folder or many folders?
What is the best way to search them?
I had to do a similar thing on a project that involved storing a large number of images. Using some metadata for each file, I generated an MD5 hash, which I then used as the file name. The first character of the file name becomes the grandparent directory for the file, and the second character the parent, resulting in a file structure like this:
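(illustrative example with a made-up hash value)

    b/
      4/
        b41cd2a7f39e6a2f0d1c3e5b7a9f0c2d.mp3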
This has the advantage of keeping the files evenly distributed over the directories. And if you pick the metadata used to generate the hash well, then it also has the advantage of being able to find a file without using a database to store references to it.
I've found this method to work pretty well with 100k or so files, but without more information about what exactly you're trying to do, it's hard to know whether it's appropriate for your problem...
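A minimal C# sketch of that scheme, assuming the metadata has already been concatenated into a single string (the .mp3 extension and the root folder are just placeholders):

    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;

    static class HashedStore
    {
        // Builds a path of the form <root>/<first hash char>/<second hash char>/<hash>.mp3
        // from an MD5 hash of the file's metadata.
        public static string GetPath(string root, string metadata)
        {
            using (var md5 = MD5.Create())
            {
                byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(metadata));
                string hash = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();

                string dir = Path.Combine(root, hash.Substring(0, 1), hash.Substring(1, 1));
                Directory.CreateDirectory(dir);  // no-op if it already exists
                return Path.Combine(dir, hash + ".mp3");
            }
        }
    }

Saving a file is then just File.WriteAllBytes(HashedStore.GetPath(root, metadata), bytes), and finding it again only requires recomputing the hash from the same metadata.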
The best approach might be to store the information you are going to search on in a database, and then use something like Lucene or Solr to do the actual searching.
The database would store a reference to the file on disk and use that directly when the search returns its results. This means you can organise the files on disk in any order you like.
However, without a lot more information this is effectively just a guess.
Size: ~5 MB
Size on disk: ~3 GB
We're using C# and saving data constantly as it changes, and all of the file data has to be accessible at any given time. Basically, if something changes, the file holding that data must be saved. This is why there are so many files for such a small amount of data. The data is also heavily processed, so lumping all of it together is not an option: a minor change would result in a large amount being saved for no reason. The files already contain enough that saving one of them is mostly redundant for only a small change.
Surely there is a way to get around this absurd expansion of the on-disk size while still retaining the accessibility and saving efficiency we have achieved. We need a way to package these files into what Windows will consider a single file, but in such a way that we do not have to rewrite the entire file when something changes.
I understand that having thousands of small files is quite strange, but for our purposes it has improved performance greatly. We just don't want to sacrifice one resource for another if it is at all possible to avoid.
Note: the files contain RLE binary data; they are not text files.
Clarity update: if 5 MB becomes 3 GB on disk, then 250 MB (50x as much data) becomes 150 GB. That is the problem!
A database does exactly what you need: you can store arbitrary numbers of tiny rows/blobs and they will be stored efficiently. File systems typically require at least one disk cluster per file, which is probably why your size expands so much. Databases don't do that, and you can also ask the database to compact itself.
There are embedded and standalone databases available.
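For illustration, here is a minimal sketch of the embedded route using SQLite via the Microsoft.Data.Sqlite package (the table and column names are invented). Each small RLE blob becomes one row, so an update rewrites only that row instead of a whole file and cluster, and SQLite's VACUUM command covers the "compact itself" part:

    using Microsoft.Data.Sqlite;

    class ChunkStore
    {
        private readonly string _connectionString;

        public ChunkStore(string dbPath)
        {
            _connectionString = $"Data Source={dbPath}";
            using var conn = new SqliteConnection(_connectionString);
            conn.Open();
            using var create = conn.CreateCommand();
            create.CommandText = "CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, data BLOB NOT NULL)";
            create.ExecuteNonQuery();
        }

        // Insert or overwrite one small blob; only this row is rewritten, not the whole store.
        public void Save(string id, byte[] data)
        {
            using var conn = new SqliteConnection(_connectionString);
            conn.Open();
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "INSERT OR REPLACE INTO chunks (id, data) VALUES (@id, @data)";
            cmd.Parameters.AddWithValue("@id", id);
            cmd.Parameters.AddWithValue("@data", data);
            cmd.ExecuteNonQuery();
        }

        // "Compact itself" in SQLite terms: reclaims free pages in the single database file.
        public void Compact()
        {
            using var conn = new SqliteConnection(_connectionString);
            conn.Open();
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "VACUUM";
            cmd.ExecuteNonQuery();
        }
    }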
I have around 50,000 XML files, each about 50 KB in size. I want to search for data in these files, but my solution so far is very slow. Is there any way to improve the search performance?
You could use Lucene.NET, a lightweight, fast, flat file search indexing engine.
See http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx for a getting started tutorial.
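For what it's worth, here is a bare-bones sketch of indexing and then searching with Lucene.NET. It assumes the current 4.8 packages (the tutorial above targets an older release), and the index folder, source folder and field names are all made up:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers.Classic;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    // Indexing: one Lucene document per XML file, with the path stored for retrieval.
    using var dir = FSDirectory.Open("search-index");
    var analyzer = new StandardAnalyzer(AppLuceneVersion);
    using var writer = new IndexWriter(dir, new IndexWriterConfig(AppLuceneVersion, analyzer));
    foreach (var path in System.IO.Directory.EnumerateFiles("xml-files", "*.xml"))
    {
        writer.AddDocument(new Document
        {
            new StringField("path", path, Field.Store.YES),
            new TextField("content", System.IO.File.ReadAllText(path), Field.Store.NO)
        });
    }
    writer.Commit();

    // Searching: print the paths of files whose content matches the query.
    using var reader = DirectoryReader.Open(dir);
    var searcher = new IndexSearcher(reader);
    var query = new QueryParser(AppLuceneVersion, "content", analyzer).Parse("foo");
    foreach (var hit in searcher.Search(query, 20).ScoreDocs)
        System.Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));

This treats each XML file as plain text; if you only care about certain elements, extract those before adding them to the "content" field.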
You can always index the content of the files into a database and perform the search there. Databases are pretty performant when it comes to searching.
I am assuming you are using Windows; you can use Windows Desktop Search to search the files quickly. You would be using the Windows index, which updates whenever a file changes. The SDK is available here and can be used from .NET.
A lot depends on the nature of these XML files. Are they just 50,000 XML files that won't be re-generated? Or are they constantly changing? Are there only certain elements within the XML files you want to index for searching?
Certainly opening 50k file handles, reading their contents, and searching for text is going to be very slow. I agree with Pavel: putting the data in a database will yield a big performance gain, but if your XML files change often, you will need some way to keep them synchronized with the database.
If you want to roll your own solution, I recommend scanning all the files and creating a word index. If your files change frequently, you will also want to keep track of when you last indexed, and if a file has been modified more recently than that, update your index for it. In this way, you'll have one ginormous word index, and if the search is for "foo", the index will reveal that the word can be found in the files file39209.xml, file57209.xml and file01009.xml. Depending on the nature of the XML, you could even store the elements in the index file (which would, in essence, be like flattening all of your XML files into one).
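A rough sketch of that kind of home-grown index (in-memory only; persisting the index and the last-scan timestamp, and removing stale entries when a file changes, are left out):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    class WordIndex
    {
        // word -> set of files containing that word
        private readonly Dictionary<string, HashSet<string>> _index =
            new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        private DateTime _lastScanUtc = DateTime.MinValue;

        // Re-index only the files modified since the previous scan.
        // (Stale entries for files that changed are not removed in this sketch.)
        public void Update(string folder)
        {
            var scanStart = DateTime.UtcNow;
            foreach (var file in Directory.EnumerateFiles(folder, "*.xml"))
            {
                if (File.GetLastWriteTimeUtc(file) < _lastScanUtc) continue;
                foreach (Match m in Regex.Matches(File.ReadAllText(file), @"\w+"))
                {
                    if (!_index.TryGetValue(m.Value, out var files))
                        _index[m.Value] = files = new HashSet<string>();
                    files.Add(file);
                }
            }
            _lastScanUtc = scanStart;
        }

        // Returns the files that contain the given word, e.g. Search("foo").
        public IEnumerable<string> Search(string word) =>
            _index.TryGetValue(word, out var files) ? (IEnumerable<string>)files : Array.Empty<string>();
    }

Calling Update before each search keeps the index roughly current without rescanning files that haven't changed.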
You could spin up a Splunk instance and have it index your files. It's billed mostly as a log parser but would still serve your needs. It tokenizes files into words, indexes those words, and provides both a web-based and a CLI-based search tool that supports complex search criteria.
Use an XML database. The usual recommendations are eXist if you want open source, MarkLogic if you want something commercial, but you could use SQL Server if being Microsoft matters to you and you don't want the ultimate in XML capability. And there are plenty of others if you want to evaluate them. All database products have a steep learning curve, but for these data volumes, it's the right solution.
I want to develop an open-source library for fast, efficient file storage (one large data file plus an index file), like NFileStorage. Why do I want to do this?
A. In my line of work, something like that was needed.
B. Our DBA said it's not efficient to store files in the DB.
C. It's good practice for me.
I am looking for a good article on file indexes.
Can you recommend one?
What is your general opinion?
It may not be efficient to store files inside a database; however, databases like SQL Server have the FILESTREAM feature, which actually stores the data on the local file system instead of placing it in the database file itself.
In my opinion this is a bad idea for a project.
You are going to run into exactly the same problem that databases have with storing all of the uploaded files inside the same single file... which is why some of them have moved away from this for binary / large objects and instead support alternative methods.
Some of the problems you will have to deal with include:
1. Allocating additional disk space for your backing file to store newly uploaded documents.
2. Permanently removing "files" from your storage and resizing/compressing the backing file.
3. Multi-user access/locks.
4. Failure recovery, such as when you encounter a bad block on the drive and it hoses your backing file.
5. Transactional support.
Items 1 and 2 increase the amount of time it takes to write a "file" to your data store. Items 3, 4 and 5 are already supported by network file systems, so you're just reinventing the wheel.
In short, you're going to have to write either your own file system or your own DBMS. Neither of which I would consider "good practice" for 99% of real-world applications. It might be worthwhile if your goal is to work for Seagate... but even then they'd probably look at you funny.
If you are truly interested in the most efficient method of file storage, it is quite simply to purchase a SAN array and push your files to it while keeping a pointer to the file/location in your database. It's easy to back up, fast to store files, much cheaper than spending developer time trying to figure out how to write your own file system, and certainly 100% supported and understandable by future devs.
This kind of product already exists. You should read about MongoDB (http://www.mongodb.org/display/DOCS/Home).
I'm developing a web app in ASP.NET that is mainly about storing, sharing and processing MS Word and PDF documents, but I'm not sure how to manage these documents. I was thinking of either keeping the documents in folders and only their metadata in the DB, or keeping the whole documents in the DB. I'm using SQL Server 2008. What's your suggestion?
SQL Server 2008 is reasonably good at storing and serving up large documents (unlike some of the earlier versions), so it is definitely an option. That said, having large blobs being served up from the DB is generally not a great idea. I think you need to think about the advantages and disadvantages of both approaches. Some general things to think about:
How large are the files going to be, and how many of them will there be? It's a lot easier to scale a file system past many TB than it is to do the same for a DB.
How do you want to manage backups? Obviously with a file system approach you'd need to back the files up separately from the DB.
I believe it's probably quicker to implement a solution that stores to the DB, but that storing to the file system is generally the superior approach. In the latter case, however, you will have to deal with some issues, such as ensuring unique file names and, in general, not storing too many documents in a single folder (most solutions create new folders after every few thousand documents). Use the quicker approach if the files are not going to be numerous and large; otherwise, invest some time in storing on the file system.
In the database unless you don't care about data integrity.
If you store docs outside of the database, you will have missing documents and broken links sooner rather than later. Your backup/restore scenario is also a lot more complex: you have no way to ensure that all data is from the same point in time.
FILESTREAM in SQL Server 2008 makes this efficient nowadays (and other RDBMSs have similar features).
If you're storing these files in one folder, then maintain the file names in the DB, since no directory can contain two files with the same name and extension. If you wish to store the files in the DB, then you will have to use a BLOB column or byte array to store them.
I see overhead in opening a connection to the DB, though I don't know how fast the DB connection is compared to File.Open, performance-wise.
If the files are relatively small, I would store them as BLOB fields in the database. This way you can use the standard procedures for backup/restore as well as transactions. If the files are large, there are some advantages to keeping them on the hard drive and storing only the filenames in the database, as was suggested earlier.
How many documents are you planning to store?
The main advantage of the database approach is the normal ACID properties--the meta-data will always be consistent with the document, which will not be the case if you use the file system to store the documents. Using the file system it would be relatively easy for your meta-data and documents to get out of sync: documents on the file system for which there is no meta-data, meta-data where the document has gone missing or is corrupted. If you need any sort of reliability in your document storage, then the database is a much better approach than using the file system.
If you are only going to operate on those files through your application, I would consider storing them in the DB as BLOB data. If you keep the files in folders and only their names in the DB, you have to account for the fact that, for example:
1) one day you may need to rename a file,
2) change its location,
3) change its extension,
or whatever.
With the DB approach, you can instead save the BLOB data in one table, and in another table the name and extension of the file along with the ID of its row in the BLOB table. In any of the scenarios above, you then just need to execute a simple SQL UPDATE query.
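A rough sketch of that two-table layout using ADO.NET (all table, column and method names here are invented): the content lives in one table, the name/extension metadata in another, and a rename is a single UPDATE that never touches the BLOB:

    using Microsoft.Data.SqlClient;  // System.Data.SqlClient works the same way

    static class DocStore
    {
        // Hypothetical schema: content in one table, name/extension metadata in another.
        public const string Ddl = @"
            CREATE TABLE FileContent (Id INT IDENTITY PRIMARY KEY, Data VARBINARY(MAX) NOT NULL);
            CREATE TABLE FileInfo    (Id INT IDENTITY PRIMARY KEY,
                                      Name NVARCHAR(255) NOT NULL,
                                      Extension NVARCHAR(16) NOT NULL,
                                      ContentId INT NOT NULL REFERENCES FileContent(Id));";

        // Renaming a file is a metadata-only update; the BLOB row is never rewritten.
        public static void Rename(SqlConnection conn, int fileId, string newName)
        {
            using var cmd = new SqlCommand("UPDATE FileInfo SET Name = @name WHERE Id = @id", conn);
            cmd.Parameters.AddWithValue("@name", newName);
            cmd.Parameters.AddWithValue("@id", fileId);
            cmd.ExecuteNonQuery();
        }
    }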
Sorry for the bad title.
I'm saving web pages. I currently use one XML file as an index. Each element contains the file's creation date (UTC) and the full URL (with query string and so on). The headers go in a separate file with a similar name but a special appended extension.
However, at around 40k files (including the header files), the XML is now 3.5 MB. Until recently I would read the file, add a new entry, and save the XML each time; now I keep it in memory and save it every once in a while.
When I request a page, the URL is looked up in the XML file using XPath, and if there is an entry, the file path is returned.
The directory structure is
.\www.host.com/randomFilename.randext
So I am looking for a better way.
I'm thinking:
One XML file per domain (including subdomains), but I feel this might be a hassle.
Using SVN. I just tested it, but I have no experience with large repositories. I would execute svn add "path to file" for every download and commit when I'm done.
Creating a custom file system, where I can then include everything I want, e.g. POST data.
Generating a filename from the URL and somehow flattening the query string, but long query strings might be rejected by the OS. And if I keep it with the headers, I still need to keep track of multiple files mapped to each different query string. A hassle. And I don't want it to run too slowly either.
Multiple program instances will perform read/write operations on different computers.
If I follow the directory/file method, I could in theory add a layer in between so it uses DotNetZip on the fly. But then again, there is still the query string problem.
I'm just looking for direction or experience here.
What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
I would recommend either a relational database or a version control system.
You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.
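If you go the FILESTREAM route, the setup looks roughly like this. This is only a sketch: it assumes FILESTREAM is enabled on the instance and the database already has a FILESTREAM filegroup, and the table and column names are placeholders. A plain parameterized INSERT is enough; SQL Server keeps the blob in its NTFS store behind the scenes:

    using System;
    using Microsoft.Data.SqlClient;  // System.Data.SqlClient works the same way

    static class PageArchive
    {
        // FILESTREAM requires a UNIQUE ROWGUIDCOL column on the table.
        public const string Ddl = @"
            CREATE TABLE SavedPage (
                RowGuid    UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
                Url        NVARCHAR(2048)  NOT NULL,
                FetchedUtc DATETIME2       NOT NULL,
                Content    VARBINARY(MAX)  FILESTREAM NULL
            );";

        // A plain parameterized INSERT is enough; SQL Server writes the blob to the
        // NTFS FILESTREAM store instead of the main data file.
        public static void Save(SqlConnection conn, string url, byte[] body)
        {
            using var cmd = new SqlCommand(
                "INSERT INTO SavedPage (Url, FetchedUtc, Content) VALUES (@url, @utc, @content)", conn);
            cmd.Parameters.AddWithValue("@url", url);
            cmd.Parameters.AddWithValue("@utc", DateTime.UtcNow);
            cmd.Parameters.AddWithValue("@content", body);
            cmd.ExecuteNonQuery();
        }
    }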
I would use two data stores: one for the raw files and another for the indexes.
To store the flat files, I think Berkeley DB is a good choice. The key can be generated with MD5 or another hash function, and you can also compress the content of each file to save some disk space.
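A small sketch of the key and compression parts using only the base class library; the actual put into Berkeley DB (or whatever store you pick) is left as a commented placeholder rather than real API calls:

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Security.Cryptography;
    using System.Text;

    static class PageStore
    {
        // Key: MD5 of the full URL (query string included), so the key length
        // never runs into file-name or key-size limits, no matter how long the URL is.
        public static string KeyFor(string url)
        {
            using (var md5 = MD5.Create())
            {
                byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
                return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
            }
        }

        // Value: gzip-compressed page body.
        public static byte[] Compress(byte[] body)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
                {
                    gzip.Write(body, 0, body.Length);
                }
                return output.ToArray();
            }
        }
    }

    // store.Put(PageStore.KeyFor(url), PageStore.Compress(pageBytes));  // 'store' stands in for your key/value store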
For the indexes, you can use a relational database or a more sophisticated text search engine like Lucene.