Fastest PDF->text library for .NET project - c#

I'm trying to create an application which will be basically a catalogue of my PDF collection. We are talking about 15-20GBs containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and a library for PDF->text conversion. Which would be the best choice? I was considering these:
PDFBox
pdftotext (from xpdf) via c# wrapper
iTextSharp
Edit: Other good option seems to be using iFilters. How well (speed/quality) would they perform (Foxit/Adobe) in comparison to these libraries?
Commercial libraries are probably out of the question, as it is my private project and I don't really have a budget for commercial solutions - although PDFTextStream looks really nice.
From what I've read, pdftotext is a lot faster than PDFBox. How does iTextSharp perform in comparison to pdftotext? Or can anyone recommend other good solutions?
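For reference, whichever library wins on speed, the extraction loop itself is small. Here is a minimal sketch of page-by-page text extraction with iTextSharp (assuming the 5.x API; the helper name `ExtractText` is mine):

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static string ExtractText(string path)
{
    var sb = new StringBuilder();
    using (var reader = new PdfReader(path))
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // SimpleTextExtractionStrategy is the faster option;
            // LocationTextExtractionStrategy handles multi-column
            // layouts better at some cost in speed.
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(
                reader, page, new SimpleTextExtractionStrategy()));
        }
    }
    return sb.ToString();
}
```

The returned string can be fed straight into a Lucene.NET `Document` field for indexing.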

If it is for a private project, is this going to be an ongoing conversion process? E.g. after you've converted the 15-20GB, will you still be converting?
The reason I ask is because I'm trying to work out whether speed is your primary issue. If it were me, for example, converting a library of books, my primary concern would be the quality of the conversion, not the speed. I could always leave the conversion over-night/-weekend if necessary!

The desktop version of Foxit's PDF IFilter is free
http://www.foxitsoftware.com/pdf/ifilter/
It will automatically do the indexing and searching, but perhaps their index is available for you to use as well. If you are planning to use it in an application you sell or distribute, then I guess it won't be a good choice, but if it's just for yourself, then it might work.
The Foxit code is at the core of my company's PDF Reader/Text Extraction library, which wouldn't be appropriate for your project, but I can vouch for the speed and quality of the results of the underlying Foxit engine.

I guess using any library is fine, but do you want to scan all 20GB of files at search time?
For full-text search, it's best to create a database (something like SQLite or any local database on the client machine), read each PDF, convert it to plain text, and store that text in the database when the file is first added.
Your database could simply be as follows:
Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText....
You can then search this table whenever you need to. This way your search will be extremely fast regardless of the type of PDF, and the PDF-to-database conversion is needed only when a PDF is added to your collection or modified.
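A minimal sketch of that schema in SQLite, using the System.Data.SQLite provider (the file name `catalogue.db` and the separate FTS table for the full-text column are my assumptions):

```csharp
using System.Data.SQLite;  // System.Data.SQLite NuGet package

using (var conn = new SQLiteConnection("Data Source=catalogue.db"))
{
    conn.Open();
    var ddl = @"
        CREATE TABLE IF NOT EXISTS PDFFiles (
            PDFFileID   INTEGER PRIMARY KEY,
            PDFFilePath TEXT NOT NULL,
            PDFTitle    TEXT,
            PDFAuthor   TEXT,
            PDFKeywords TEXT
        );
        -- FTS virtual table so the extracted text is actually
        -- searchable at speed, rather than LIKE-scanned.
        CREATE VIRTUAL TABLE IF NOT EXISTS PDFFullText
            USING fts4(content);";
    using (var cmd = new SQLiteCommand(ddl, conn))
        cmd.ExecuteNonQuery();
}
```

Queries against the FTS table use `WHERE content MATCH 'term'`, which SQLite resolves through its full-text index.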

Related

What schema, database, searching libraries are good for storing thousands of book pages in c# app

I want to write a C# program to store some books with the total of 5000 pages. But there are a few important issues here that I need your help and advice:
The ability to search all of the books' content is one of the most important and challenging features of the app. The time needed to search for a word should be about the time required to search for a word in Microsoft Word or a PDF document of the same size, or less.
What method should I employ for storing the books so that more suitable approaches to searching the content would be in hand? Relational DB, MongoDB, couchDB, etc. which one is preferred?
For the case of using Database, what kind of Schema and indexing is required and important?
Which method, algorithm, or library is best for searching the whole content of the books? Is it possible to use Lucene or Solr in a standalone Windows app, or would a traditional searching method be better?
The program should be customized in such a way that the publisher would be able to add their own book contents. How can I handle this feature (can I use XML)?
The users should be able to add one or more lines from the contents to their favorite list. What is the best way to deal with this?
I think Solr will be able to meet most of these requirements. For #1, you can easily develop a schema in Solr to hold various information in different formats. Solr's Admin UI has an Analysis tab that will help you greatly in developing your schema, because it allows you to test your changes on the fly with different types of data. It is a huge time saver because you don't have to create a bunch of test content and index it just to verify your changes. Additionally, if the contents of the books are in binary format, you can use Apache Tika to perform text extraction. Solr also has a number of other bells and whistles that you may find helpful, such as highlighting and spelling suggestions for user queries.
For #2, Solr will support updates to content via JSON files that can be sent to the update handler for your collection. It also supports atomic updates which you may find useful. It seems that in your case, you may need some kind of a security solution to sit on top of Solr to prevent publishers from modifying each other's content, however you will most likely run into this issue regardless of the type of solution you will use.
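A sketch of what such an atomic update looks like from C#, posting JSON to the collection's update handler (the collection name `books`, the field name `content_txt`, and the document id are placeholders of mine):

```csharp
using System.Net.Http;
using System.Text;

// Solr atomic update: "set" replaces just this field on the
// existing document instead of reindexing the whole document.
var json = @"[{ ""id"": ""book-42"",
                ""content_txt"": { ""set"": ""revised chapter text"" } }]";
using (var client = new HttpClient())
{
    var resp = client.PostAsync(
        "http://localhost:8983/solr/books/update?commit=true",
        new StringContent(json, Encoding.UTF8, "application/json")).Result;
    resp.EnsureSuccessStatusCode();
}
```

In production you would batch updates and use soft commits rather than `commit=true` on every request.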
For #3, I am not sure what you are really looking for here. I think that for content search and retrieval you will find Solr a good fit. For general user information storage and so on, you may need a different tool, since that is somewhat outside the scope of what Solr is meant to do.
Hope it helps.

Suggestions on ways to compare two files and show differences, maybe by rehosting a version comparison tool?

I'm trying to find a good way to have version comparison between two files (.docx files), where the files are compared and the differences are highlighted. Eventually with the ability to output a report.
I was thinking maybe it's possible to rehost a comparison tool that is used by Team Foundation Server or something similar. The documents will be hundreds of pages long.
Can you use a 3rd-party tool? I have used Beyond Compare 3, which has a really good file compare for XML.
The API was good in the sense that you could run batch scripts and it would dump the output in the format you want.

Attaching arbitrary data to DirectoryInfo/FileInfo?

I have a site which is akin to SVN, but without the version control. Users can upload and download to Projects, where each Project has a directory (with subdirs and files) on the server. What I'd like to do is attach further information to files, like who uploaded it, how many times it's been downloaded, and so on. Is there a way to do this with FileInfo, or should I store this in a table where it associates itself with an absolute path or something? That way sounds dodgy and error-prone :\
It is possible to append data to arbitrary files with NTFS (the default Windows filesystem, which I'm assuming you're using). You'd use alternate data streams. Microsoft uses this for extended metadata like author and summary information in Office documents.
Really, though, the database approach is reasonable, widely used, and much less error-prone, in my opinion. It's not really a good idea to be modifying the original file unless you're actually changing its content.
As Michael Petrotta points out, alternate data streams are a nifty idea. Here's a C# tutorial with code. Really though, a database is the way to go. SQL Compact and SQLite are fairly low-impact and straightforward to use.
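For the curious, a minimal sketch of writing to an alternate data stream from C#. The .NET file APIs reject the `file:stream` syntax, so you have to P/Invoke `CreateFile` directly (the stream name `meta` and its contents are my placeholders):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class AdsDemo
{
    const uint GENERIC_WRITE = 0x40000000;
    const uint CREATE_ALWAYS = 2;

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string name, uint access, uint share, IntPtr security,
        uint mode, uint flags, IntPtr template);

    static void Main()
    {
        // "report.docx:meta" names an alternate stream on the same
        // file; the main content of report.docx is untouched.
        using (var h = CreateFile(@"report.docx:meta", GENERIC_WRITE, 0,
                   IntPtr.Zero, CREATE_ALWAYS, 0, IntPtr.Zero))
        using (var s = new FileStream(h, FileAccess.Write))
        using (var w = new StreamWriter(s))
            w.Write("uploader=alice;downloads=17");
    }
}
```

Note that alternate streams are silently dropped when the file is copied to a non-NTFS volume (FAT drives, many network shares, zip archives), which is another argument for the database approach.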

How do you suggest I approach this unique problem?

I have a website where I allow businesses to register what products they sell individually. Then a consumer can go online and search for a product and receive a list of all the shops where it's currently selling.
Although they can upload one product at a time, I want to allow businesses to mass upload things they offer.
I was thinking of using an Excel spreadsheet. Have them download the template, and then have them upload the filled-in Excel sheet.
Others have suggested telling them to create a CSV file, but that is counter-intuitive in my honest opinion. Most likely a secretary will be creating the product sheets and she won't have a clue about what a CSV is.
What is the best way to approach this?
Well, it partly depends on the businesses. If they are medium or large businesses, they'd probably rather submit the data via a webservice anyway - then they don't have to get a human involved at all, after the initial development. They can write an application to periodically suck information from their (inevitable) database of products, and post to your web service.
If you're talking about very small companies without their own IT departments, that's less feasible, and either Excel or CSV would be a better approach. (As Caladain says, it's pretty simple to export to CSV... but you should try from a number of different spreadsheet programs as they may well have different subtleties in their export format. Things like text encoding will be important as well.)
But here's a novel idea... how about you ask some sample companies what they would like you to do? Presumably you have some companies in mind already - if you don't, it's potentially going to be pretty hard to make sure you're really building the right thing.
Find out how they already store their product list, and how they'd want to upload it to you. Then consider how difficult that would be, and possibly go back to them with something which is almost as easy for them, but a lot easier for you to implement, etc.
While I personally don't like Excel very much, it seems to be the best accepted format to do such things (involving a manual process).
My experience is that CSV breaks easily, for instance it uses the regional settings to determine the separator which can cause incompatibilities on either the client or the server side. Also, many people just save the file in any Excel format because they just don't know the difference.
Creating the files can be pretty easily done with some XSLT (e.g. create XMLSS format files, which are "XML Spreadsheet 2003" format).
You may also want to have a look at the Excel Data Reader on Codeplex for parsing the files.
Reading in an Excel file is actually pretty easy with ODBC. Tutorial on it.
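A minimal sketch of the OLE DB approach (the ACE provider string, the file path, and the sheet name `Sheet1$` are assumptions that depend on your Office/driver install):

```csharp
using System.Data;
using System.Data.OleDb;

// The ACE OLE DB provider exposes a workbook as a database;
// each worksheet is queried as a table named "SheetName$".
var connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;" +
              @"Data Source=C:\uploads\products.xlsx;" +
              @"Extended Properties=""Excel 12.0 Xml;HDR=YES""";

var table = new DataTable();
using (var conn = new OleDbConnection(connStr))
using (var cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", conn))
{
    // HDR=YES above makes row 1 the column names.
    new OleDbDataAdapter(cmd).Fill(table);
}
// Each DataRow in table now holds one product row from the sheet.
```

One caveat: the provider guesses column types from the first few rows, so a column of mostly numbers with an occasional text value can come back with nulls unless you set `IMEX=1` in the extended properties.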

How do I store a rating in a song?

I want to be able to store information about a song that has been opened using my application. I would like for the user to be able to give the song a rating and this rating be loaded every time the users opens that file using my application.
I also need to know whether I should store the ratings in a database or an xml file.
C# ID3 Library is a .Net class library for editing id3 tags (v1-2.4). I would store the ratings directly into the comments section of the mp3 since id3v1 does not have many of the storage features that id3v2 does. If you want to store additional information for each mp3, what about placing a unique identifier on the mp3 and then having that do a database lookup?
I would be cautious about adding custom tags to mp3s as it is an easy way to ruin a large library. Also, I have gone down this road before and while I enjoyed the programming knowledge that came out of it, trying something like the iTunes SDK or Last FM might be a better route.
I would use a single-file, zero-config database. SQL Server Compact in your case.
I don't think XML is a good idea. XML shines in data interchange and storing very small amounts of information. In this case a user may rate thousands of tracks ( I have personally in online radios that allow ratings), and you may have lots of other information to store about the track.
Export and import using XML export procedures if you have to. Don't use it as your main datastore.
I would store it in a file as it is easier to keep with the mp3 file itself. If all you're doing is storing ratings, would you consider setting the ID3 rating field instead?
For this type of very simple storage I don't think it really matters all that much. The pros of XML are that it's very easy to deploy and it's editable outside of your app. The cons are that it's editable outside your application (could be good, could be bad, depends on your situation).
Maybe another option (just because you can ;-) is an OODBMS. Check out DB4Objects; it's seriously addictive and very, very cool.
As mentioned earlier, it is better to store such information in the media file itself. My suggestion is to use the TagLib# library for this (the best media metadata library I have found). It is very powerful and easy to use.
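With TagLib#, writing and reading a rating is only a few lines. A minimal sketch, storing the rating in the standard comment field (the `rating=4` convention and the file path are my assumptions, not a TagLib# feature):

```csharp
// Write a rating into the tag's comment field and save it
// back into the media file itself.
var file = TagLib.File.Create(@"C:\music\track.mp3");
file.Tag.Comment = "rating=4";
file.Save();

// Read it back later when the user reopens the song.
var rating = TagLib.File.Create(@"C:\music\track.mp3").Tag.Comment;
```

`TagLib.File.Create` picks the right tag implementation from the file type, so the same code works for more than just MP3s.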
I would store the ratings in an XML file; that way it's easy to edit from the outside, easy to read in .NET, and you don't have to worry about shipping a database for something simple with your application.
Something like this might work for you:
<Songs>
  <Song Title="{SongTitle}">
    <Path>{Song path}</Path>
    <Rating>3</Rating>
  </Song>
</Songs>
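Reading that layout back in .NET is a one-liner with LINQ to XML (file name `ratings.xml` and the song title are placeholders of mine):

```csharp
using System.Linq;
using System.Xml.Linq;

// Load the ratings file and look up one song's rating by title.
var doc = XDocument.Load("ratings.xml");
var rating = doc.Root.Elements("Song")
    .Where(s => (string)s.Attribute("Title") == "Some Song")
    .Select(s => (int)s.Element("Rating"))
    .FirstOrDefault();
```

Writing is just as simple: mutate the `XElement` and call `doc.Save("ratings.xml")`.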
If the song format supports suitable meta data (eg. MP3), then follow Kevin's advice of using the meta data. This is by far the best way of doing it, and it is what the meta data is intended for.
If not, then it really depends on your application. If you want to share the rating information - especially over a web service, then I would go for XML: it would be trivial to supply your XML listings as one big feed, for example.
XML (or most other text formats) also have the advantage that they can be easily edited by a human in a text editor.
The database would have its advantages if you had a more closed system, you wanted speed and fast indexing, and/or have other tables you might want to store as well (eg. data about albums and bands).
