Search in thousands of xml files - c#

I have around 50,000 XML files of about 50 KB each. I want to search for data in these files, but my solution so far is very slow. Is there any way to improve the search performance?

You could use Lucene.NET, a lightweight, fast, flat file search indexing engine.
See http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx for a getting started tutorial.
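To give a feel for the amount of code involved, here is a rough sketch of indexing and searching with the Lucene.Net 3.x-style API (the directory paths and the field names "path" and "content" are only illustrative, and exact signatures vary between versions):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    // Build the index once (or incrementally as files change).
    var indexDir = FSDirectory.Open(new DirectoryInfo(@"C:\xml-index"));
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    using (var writer = new IndexWriter(indexDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        // System.IO.Directory is qualified to avoid clashing with Lucene.Net.Store.Directory.
        foreach (var file in System.IO.Directory.EnumerateFiles(@"C:\xml-files", "*.xml"))
        {
            var doc = new Document();
            doc.Add(new Field("path", file, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // For a sketch the raw XML text is indexed; in practice you would extract
            // just the element text you care about before indexing.
            doc.Add(new Field("content", File.ReadAllText(file), Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }

    // Search the index instead of re-reading 50,000 files.
    using (var searcher = new IndexSearcher(indexDir, true))
    {
        var query = new QueryParser(Version.LUCENE_30, "content", analyzer).Parse("foo");
        foreach (var hit in searcher.Search(query, 20).ScoreDocs)
            Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
    }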

You can always index the content of the files into a database and perform the search there. Databases are very good at this kind of lookup.

I am assuming you are using Windows, so you can use Windows Desktop Search for quickly searching the files. You would be using the Windows index, which is updated whenever a file changes. There is an SDK for it that can be used from .NET.
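If you go this route, one common way to query the Windows Search index from .NET is over OLE DB, using the SQL dialect the SDK documents; a rough sketch (the folder scope and the column selected are just examples):

    using System;
    using System.Data.OleDb;

    // Connect to the local Windows Search index.
    var connection = new OleDbConnection(
        "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
    connection.Open();

    // Ask the existing index for files under a folder that contain the word "foo".
    var command = new OleDbCommand(
        "SELECT System.ItemPathDisplay FROM SystemIndex " +
        "WHERE SCOPE='file:C:/xml-files' AND CONTAINS('foo')",
        connection);

    using (var reader = command.ExecuteReader())
        while (reader.Read())
            Console.WriteLine(reader.GetString(0));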

A lot depends on the nature of these XML files. Are they just 50,000 XML files that won't be re-generated? Or are they constantly changing? Are there only certain elements within the XML files you want to index for searching?
Certainly opening 50k file handles, reading their contents, and searching for text is going to be very slow. I agree with Pavel: putting the data in a database will yield a big performance gain, but if your XML files change often, you will need some way to keep them synchronized with the database.
If you want to roll your own solution, I recommend scanning all the files and creating a word index. If your files change frequently, you will also want to keep track of each file's "last modified" date, and if a file has changed more recently than the index, update the index. In this way, you'll have one ginormous word index, and if the search is for "foo", the index will reveal that the word can be found in, say, file39209.xml, file57209.xml, and file01009.xml. Depending on the nature of the XML, you could even store the elements in the index file (which would, in essence, be like flattening all of your XML files into one).
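A minimal sketch of that kind of home-grown word index (tokenization and persistence are simplified, and the paths are made up):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    // word -> set of files that contain it
    var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    foreach (var file in Directory.EnumerateFiles(@"C:\xml-files", "*.xml"))
    {
        // In a real version you would persist the index plus each file's
        // LastWriteTimeUtc, and re-scan only files modified since the last run.
        foreach (Match m in Regex.Matches(File.ReadAllText(file), @"\w+"))
        {
            if (!index.TryGetValue(m.Value, out var files))
                index[m.Value] = files = new HashSet<string>();
            files.Add(file);
        }
    }

    // A search for "foo" is now a dictionary lookup instead of 50,000 file reads.
    if (index.TryGetValue("foo", out var matches))
        foreach (var path in matches)
            Console.WriteLine(path);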

You could spin up a Splunk instance and have it index your files. It's billed mostly as a log parser but would still serve your needs. It tokenizes files into words, indexes those words, and provides both a web-based and a CLI-based search tool that supports complex search criteria.

Use an XML database. The usual recommendations are eXist if you want open source, MarkLogic if you want something commercial, but you could use SQL Server if being Microsoft matters to you and you don't want the ultimate in XML capability. And there are plenty of others if you want to evaluate them. All database products have a steep learning curve, but for these data volumes, it's the right solution.
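If you settle on SQL Server, the usual pattern is an xml column queried with XQuery; a rough sketch from C# (the table, column, and element names are made up for illustration):

    using System;
    using System.Data.SqlClient;

    // Assumes a table like: Documents(Id int, Path nvarchar(400), Doc xml)
    static void FindByTitle(string connectionString, string term)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // exist() runs an XQuery against the xml column; sql:variable() pulls in the parameter.
            // For real volumes you would also add an XML index or a full-text index.
            var cmd = new SqlCommand(
                "SELECT Path FROM Documents " +
                "WHERE Doc.exist('//Title[contains(., sql:variable(\"@term\"))]') = 1", conn);
            cmd.Parameters.AddWithValue("@term", term);

            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
        }
    }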

Related

How to store mp3 files in harddisk and maintain performance

I'm working on a project which will have millions of small mp3 files that I was thinking of saving on the hard disk.
I have the following questions:
What structure should I use to save the files: one folder or many folders?
What is the best way to search them?
I had to do a similar thing on a project that involved storing a large number of images. Using some metadata for the file, I generated an MD5 hash which I then used as the file name. The first character of the file name becomes the grandparent directory for the file, and the second character the parent, so a file whose hash starts with "ab" ends up under a\b\.
This has the advantage of keeping the files evenly distributed over the directories. And if you pick the metadata used to generate the hash well, then it also has the advantage of being able to find a file without using a database to store references to it.
I've found this method to work pretty well with 100k or so files, but without more information about what exactly you're trying to do, it's hard to know whether it's appropriate for your problem...
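A rough C# sketch of that naming scheme (what goes into the hash is whatever metadata uniquely identifies the file, which is an assumption here):

    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;

    static string GetStoragePath(string rootDir, string identifyingMetadata)
    {
        using (var md5 = MD5.Create())
        {
            // Hash the identifying metadata and hex-encode it.
            var hash = BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(identifyingMetadata)))
                                   .Replace("-", "").ToLowerInvariant();

            // First hash character = grandparent dir, second = parent dir.
            var dir = Path.Combine(rootDir, hash[0].ToString(), hash[1].ToString());
            Directory.CreateDirectory(dir);
            return Path.Combine(dir, hash + ".mp3");
        }
    }

    // e.g. GetStoragePath(@"D:\mp3", "artist|album|title") -> D:\mp3\a\b\ab3f29...mp3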
The best approach might be to store the information you are going to search on in a database, and then use something like Lucene or Solr to do the actual searching.
The database would store a reference to the file on disk and use that directly when the search returns its results. This means you can organise the files on disk in any order you like.
However, without a lot more information this is effectively just a guess.

Storing Files on File Server or in Data Base?

I'm developing a web app in ASP.NET which is mainly about storing, sharing, and processing MS Word and PDF files, but I'm not sure how to manage these documents. I was thinking of either keeping the documents in folders and only keeping their metadata in the DB, or keeping the whole documents in the DB. I'm using SQL Server 2008. What's your suggestion?
SQL Server 2008 is reasonably good at storing and serving up large documents (unlike some of the earlier versions), so it is definitely an option. That said, having large blobs being served up from the DB is generally not a great idea. I think you need to think about the advantages and disadvantages of both approaches. Some general things to think about:
How large are the files going to be, and how many of them will there be? It's a lot easier to scale a file system past many TB than it is to do the same for a DB.
How do you want to manage backups? Obviously with a file system approach you'd need to back the files up separately from the DB.
I believe it's probably quicker to implement a solution that stores documents in the DB, but storing them on the file system is generally the better solution. In the latter case, however, you have to deal with a few issues, such as generating unique file names and not storing too many documents in a single folder (most solutions create a new folder after every few thousand documents). Use the quicker approach if the files will not be numerous or large; otherwise invest some time in storing them on the file system.
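A small sketch of how one might handle both concerns at once, assuming the database hands out a sequential document id:

    using System.IO;

    // Bucket documents into folders of at most ~2,000 files, and use the
    // database id (plus the original extension) as a collision-free file name.
    static string BuildDocumentPath(string rootDir, int documentId, string originalExtension)
    {
        var bucket = (documentId / 2000).ToString();               // e.g. id 4321 -> folder "2"
        var dir = Path.Combine(rootDir, bucket);
        Directory.CreateDirectory(dir);
        return Path.Combine(dir, documentId + originalExtension);  // e.g. ...\2\4321.pdf
    }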
In the database unless you don't care about data integrity.
If you store docs outside of the database you will have missing documents and broken links sooner rather than later, and your backup/restore scenario is a lot more complex: you have no way to ensure that all data is from the same point in time.
FILESTREAM in SQL Server 2008 makes storing them in the database efficient nowadays (and other RDBMSs have similar features).
If you're storing these files in one folder, keep the file names in the DB, since no directory can contain two files with the same name and extension. If you want to store the files themselves in the DB, you will have to use a BLOB (byte array) column.
There is some overhead in opening a connection to the DB, though I don't know how that compares, performance-wise, to File.Open.
If the files are relatively small I would store them as BLOB fields in the database. This way you can use standard procedures for backup/restore as well as transactions. If the files are large, there are some advantages to keeping them on the hard drive and storing the file names in the database, as was suggested earlier.
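For small files, the BLOB route is only a few lines with ADO.NET; a minimal sketch, assuming a table such as Documents(Id int identity, FileName nvarchar(260), Content varbinary(max)):

    using System.Data.SqlClient;
    using System.IO;

    static void SaveDocument(string connectionString, string path)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            var cmd = new SqlCommand(
                "INSERT INTO Documents (FileName, Content) VALUES (@name, @content)", conn);
            cmd.Parameters.AddWithValue("@name", Path.GetFileName(path));
            cmd.Parameters.AddWithValue("@content", File.ReadAllBytes(path));
            cmd.ExecuteNonQuery();
        }
    }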
How many documents are you planning to store?
The main advantage of the database approach is the normal ACID properties: the meta-data will always be consistent with the document, which will not be the case if you use the file system to store the documents. Using the file system, it would be relatively easy for your meta-data and documents to get out of sync: documents on the file system for which there is no meta-data, or meta-data where the document has gone missing or is corrupted. If you need any sort of reliability in your document storage, then the database is a much better approach than the file system.
If you are only going to operate on those files from the application, I would store them in the DB as BLOB data. If you keep the files in folders and only their names in the DB, you have to account for the fact that, for example:
1) one day you may need to rename a file
2) change its location
3) change its extension
or whatever.
With the DB approach you can instead keep the BLOB data in one table, and the file's name and extension in another table along with the ID of the row in the BLOB table. In the scenarios above, you then only need to execute a simple SQL UPDATE query.

What is the best way to store downloaded files?

Sorry for the bad title.
I'm saving web pages. I currently use one XML file as an index; each element contains the file's created date (UTC) and the full URL (with query string and so on). The headers are kept in a separate file with a similar name but a special extension appended.
However, at around 40k files (including the header files), the XML index is now 3.5 MB. Until recently I would read it, add the new entry, and save the file each time; now I keep it in memory and save it every once in a while.
When I request a page, the URL is looked up using XPath on the XML file; if there is an entry, the file path is returned.
The directory structure is
.\www.host.com/randomFilename.randext
So I am looking for a better way.
I'm thinking:
One XML file per domain (incl. subdomains). But I feel this might be a hassle.
Using SVN. I just tested it, but I have no experience in large repositories. Executing svn add "path to file" for every download, and commit when I'm done.
Create a custom file system, where I then can include everything I want, for ex. POST-data.
Generating a filename from the URL and somehow flattening the querystring, but large querystrings might be rejected by the OS. And if I keep it with the headers, I still need to keep track of multiple files mapped to each different query string. Hassle. And I don't want it to execute too slow either.
Multiple program instances will perform read/write operations on different computers.
If I follow the directory/file method, I could in theory add a layer in between so it uses DotNetZip on the fly. But then again, there's the query string.
I'm just looking for direction or experience here.
What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
I would recommend either a relational database or a version control system.
You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.
I would use 2 data stores, one for the raw files and another for indexes.
To store the flat files, I think Berkeley DB is a good choice; the key can be generated by MD5 or another hash function, and you can also compress the content of the files to save some disk space.
For the indexes, you can use a relational database or a more sophisticated text search engine like Lucene.
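Whichever key-value store you pick, the key generation and compression parts are the same; a sketch in C# (the store itself is left out because its API depends on the product you choose):

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Security.Cryptography;
    using System.Text;

    // Derive a stable key from the URL and gzip the page body before handing both to the store.
    static (string Key, byte[] CompressedBody) PreparePage(string url, byte[] body)
    {
        string key;
        using (var md5 = MD5.Create())
            key = BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(url))).Replace("-", "");

        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
                gzip.Write(body, 0, body.Length);
            return (key, buffer.ToArray());
        }
    }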

Store Files in SQL Server or keep them on the File Server?

Currently we have thousands of Microsoft Word files, Excel files, PDFs, images etc stored in folders/sub-folders. These are generated by an application on a regular basis and can be accessed at any time within that application. As we look to upgrade, we are now considering storing all these documents within SQL Server 2005 instead. The reasons are being able to compress the documents, add fields to store more information about them, and apply indexes where necessary.
I suppose what I’m after is the pros and cons of using SQL Server as a document repository instead of keeping them on the file server, as well as any experience you might have in doing this.
We would be using C# and Windows Workflow to do this task.
Thanks for your comments.
Edit
How big are the files?
between 100 KB and 200 KB in size (avg. 70 KB)
How many will be?
At the moment it's around 3.1 million files (a mix of Word, Excel, and PDF files), growing by about 2,600 a day. (The growth rate will also increase over time.)
How many reads?
This one is difficult to quantify as our old system/application makes it hard to work this out.
Also another useful link pointed out on a similar post covers the pros and cons of both methods.
Files Stored on DB vs FileSystem - Pros and Cons
A rule of thumb for document size is:
size < 256 KB: store in the DB
256 KB < size < 1 MB: test for your load
size > 1 MB: store on the file system
EDIT: this rule of thumb also applies for FILESTREAM storage in SQL Server 2008
If you upgrade all the way, to SQL Server 2008, then you can use the new FILESTREAM feature, that allows the document to appear as a column in a table, yet to reside as a file on a share, where it can be directly accessed by a program (like Word).
I would have both.
I would keep the files renamed with a unique name, thus easier to manage, and I would keep all the meta data inside the database (file name, content-type, location on the file system, size, description, etcetera), so the files are accessed through the database (indirectly).
Advantages:
files are easy to handle; you can bring several drives into the mix
the database can keep any amount of meta information, including a file description you can search against
keep track of file accesses and other statistical information
rearrange the files using various paradigms: tree (directory structure), tags, search, or context
You can have compression on a drive also. You can have RAID for backup and speed.
What kind of documents are we talking about?
Storing documents in your SQL server might be useful because you can relate the documents to other tables and use techniques like Full-text indexing and do things like fuzzy searches.
A downside is that it might be a bit harder to create a backup of the documents. And compression is also possible with NTFS compression or other techniques.
Are these documents text-based, and are you planning on using SQL Server's full-text search to search them? If not, I don't see any benefit in storing these documents in the database. Of course, you can always store the metadata related to the documents, including the path information, in the database.
A big benefit of storing docs in the DB is that it becomes much easier to control security access to them, as you can do it all via access control in your app. Storing them on a file server requires dealing with access privileges at the file and folder level to prevent any direct access. Also, having them in a DB makes for a single point of backup, so you can more easily make a full copy and/or move it around if needed.
Rather than writing a custom DMS (document management system), you should probably consider buying one or using WSS / SharePoint as this will handle all the mundane details (storage, indexing, meta-data) and let you build your custom functionality on top.

Help with Search Engine Architecture .NET C#

I'm trying to create a search engine for all literature (books, articles, etc), music, and videos relating to a particular spiritual group. When a keyword is entered, I want to display a link to all the PDF articles where the keyword appears, and also all the music files and video files which are tagged with the keyword in question. The user should be able to filter it with information such as author/artist, place, date/time, etc. When the user clicks on one of the results links (book names, for instance), they are taken to another page where snippets from that book everywhere the keyword is found are displayed.
I thought of using the Lucene library (or Searcharoo) to implement my PDF search, but I also need a database to tag all the other information so that results can be filtered by author/artist information, etc. So I was thinking of having tables for Text, Music, and Videos, and a field containing the path to the file for each. When a keyword is entered, I need to search the DB for music and video files, and also need to search the PDF's, and when a filter is applied, the music and video search is easy, but limiting the text search based on the filters is getting confusing.
Is my approach correct? Are there better ways to do this? Since the search content is limited only to the spiritual group, there is not an infinite number of items to search. I'd say about 100-500 books and 1000-5000 songs.
Lucene is a great way to get up and running quickly without too much effort, along with several areas for extending the indexing and searching functionality to better suit your needs. It also has several built-in analyzers for common file types, such as HTML/XML, PDF, MS Word Documents, etc.
It provides the ability to use a variety of Fields, and they don't necessarily have to be uniform across all Documents (in other words, music files might have different attributes than text-based content, such as artist, title, length, etc.), which is great for storing different types of content.
Not knowing the exact implementation of what you're working on, this may or may not be feasible, but for tagging and other related features, you might also consider using a database, such as MySQL or SQL Server side-by-side with the Lucene index. Use the Lucene index for full-text search, then once you have a result set, go to the database to extract all the relational content. Our company has done this before, and it's actually not as big of a headache as it sounds.
NOTE: If you decide to go this route, BE CAREFUL, as the "unique id" provided by Lucene is highly volatile (it changes every time the index is optimized), so you will want to store the actual id (the primary key in the database) as a separate field on the Document.
Another added benefit, if you are set on using C#.NET, there is a port called Lucene.Net, which is written entirely in C#. The down-side here is that you're a few months behind on all the latest features, but if you really need them, you can always check out the Java source and implement the required updates manually.
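In Lucene.Net terms, the stored-ID advice above might look roughly like this (field names are illustrative and the 3.x-style API is assumed):

    using Lucene.Net.Documents;

    static Document BuildDocument(int databaseId, string extractedText)
    {
        var doc = new Document();
        // Store the database primary key verbatim so a hit can be joined back to its DB row.
        doc.Add(new Field("dbId", databaseId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        // The searchable text only needs to be analyzed, not stored.
        doc.Add(new Field("content", extractedText, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    // After a search: int id = int.Parse(searcher.Doc(scoreDoc.Doc).Get("dbId"));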
Yes, there is a better approach. Try Solr and in particular check out facets. It will save you a lot of trouble.
If you definitely want to go the database route then you should use SQL Server with Full Text Search enabled. You can use this with Express versions, too. You can then store and search the contents of PDFs very easily (so long as you install the free Adobe PDF iFilter).
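Once the full-text index and the PDF iFilter are in place, the query side from C# is just a CONTAINS predicate; a minimal sketch, assuming a Documents table with a full-text-indexed Content column:

    using System;
    using System.Data.SqlClient;

    static void SearchDocuments(string connectionString, string keyword)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            var cmd = new SqlCommand(
                "SELECT Id, Title FROM Documents WHERE CONTAINS(Content, @keyword)", conn);
            cmd.Parameters.AddWithValue("@keyword", keyword);

            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetInt32(0), reader.GetString(1));
        }
    }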
You could try using MS Search Server Express Edition, one of the major benefits is that it is free.
http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none
