File system searching (C#) against specified file list - c#

My client wants to add a file system searching feature in a B/S application based on C#. It is a little special that the search shall be in a scope of specified file list but not a whole directory with just certain file extension.
I did some research on Microsoft Office Sharepoint Server Search Service, but couldn't get a clue whether it supports searching against specific files. I'm now using it to search PDF files, but not the same case of what I'm asking for.
Can anyone give me some suggestions what 3rd party search service/engine I should take for the requirement?
Thanks.
Elaine

I assume you are wanting full text indexing of a certain set of files?
Java has the best selection of libraries for this, but there are C# ports as well.
I highly recommend Lucene for indexing and retrieval.:
http://incubator.apache.org/lucene.net/
If this is on a server, it might be easiest to run a Solr instance and use C# as the client:
http://crazorsharp.blogspot.com/2010/01/full-text-search-using-solr-lucene-and.html
Lucene has many examples on indexing different document types, but if you use Solr, it will handle that for you.

Related

How to convert office file to image

I am searching from last two days but did not find any thing.
My requirement is to create a document viewer in my web application (C#.Net) and I don't want to use any third party tool for this. Can I convert the files in image or PDF or in any common formate which can be easly render on web page. I also can not use Introp object.
Any help will be highly appreciated
You mention in one of your comments that you'd like to write all the code yourself but don't know where to start. Here's how I would go about it...
First, you'll need to familiarize yourself with the Microsoft Office Format specification. You can find that here (there's a link to the technical specification). Office documents are actually a .zip file with an XML file inside along with any binary data representing attachments. Just renamed a .docx file as .zip and you'll be able to open it up and see the XML and any other supporting documents inside (same is true for xlsx, etc...).
Then you'll need to become intimately familiar with either PDF or HTML, as your job now will be to convert the various Office document structure into PDF or HTML structure, being sure to respect page layout, margins, order, etc...
As others have said, this is a large task which is why third party tools exist today. Also, each third party toolset has it's limitation as this is really hard to "get right" in all situations and there will be edge cases that work for one document and not another (because maybe they didn't use Microsoft Word to save the .docx, maybe they used OpenOffice and OpenOffice interpreted the standard slightly differently...)
If you cannot use COM/Interop technologies in your solution, you can take a look at the specialized 3rd party options. I see that you prefer not to use them, however, there are no existing built-in solutions in the .NET Framework. Check out my answer in a similar thread that describes how to accomplish exactly the same task using 3rd party libraries (for example, DevExpress, since I have experience with it). In addition, take a look at the Documents demo, where you can see how to create images/thumbnails from different types of MS Office documents.
I believe what you need is an intermediate representation of the documents which can be converted into an image for the viewer to display.
Lets me try to explain with the below diagram:
You can use tools like smallpdf or OfficeToPDF to do that. Just integrate them into your application.
Small PDF(https://smallpdf.com/library-detail)
officetopdf (https://officetopdf.codeplex.com/)

Search for keywords in Word documents and index them

I'm looking for a way to search in Word documents and show a result of documents that matched the search criteria. I'll try to describe the scenario in more detail here.
On a Windows system i have a bunch of folders. Each folder has alot of Word documents. Now i need an application that can search inside a specific folder for keywords that might occure in those word documents. Something like the FULLTEXT search that MySQL has.
So if i search for the following keywords: microsoft, windows XP then i want it to list every Word document that contains one or more of those keywords.
Ofcourse, the more those keywords appear a document, the higher its rank should be in the resulting list.
Now my question is, is there such a tool out there that does exactly this? Or am i better of writing such a tool myself in C#.NET? If so, to what API's do i have to look?
PS. They are .doc and .docx files.
Looks like you need a full-blown search engine to me, including parsing, indexing, ranking, search, etc. Probably not very pleasant to implement it yourself... You could have a look at Apache Lucene.
There is a tool right under your nose. It's Windows Search and it has an API which should meet your needs perfectly.
You might have to install the filter packs to provide Office-specific indexing if you don't have Office installed.
Indexing is available within Windows and can deal with Word documents :
http://windows.microsoft.com/en-US/windows7/Improve-Windows-searches-using-the-index-frequently-asked-questions
make a windows highlight search in c#
If you want to build your own index, you can use IFilters to extract text from documents : How to extract text from MS office documents in C#
You could try SmartFinder APP available on Microsoft Store.
It's developed with Java and Apache Lucene library.
You can search the text and immediately have an extract of the document with the searched words highlighted in the results.
You can refine your search with metadata (authors, keyword, publisher, ...) and you can search also with wildcard (for example with * or ? special chars).
This is the Microsoft Store link to download the APP: https://www.microsoft.com/store/apps/9PD0BCV3WKD1

How to produce documents (docx or pdf) from SQL Server?

I know this is a little subjective, but I'm looking into the following situation:
I need to produce a number of documents automatically from data in a SQL Server database. There will be an MVC3 app sat on the database to allow data entry etc. and (probably) a "Go" button to produce the documents.
There needs to be some business logic about how these documents are created, named and stored (e.g. "Parent" documents get one name and go in one folder, "Child" documents get a computed name and go in a sub-folder.
The documents can either be PDF or Doc(x) (or even both), as long as the output can be in EN-US and AR-QA (RTL text)
I know there are a number of options from SSRS, Crystal Reports, VSTO, "manual" PDF in code, word mail merge, etc... and we already have an HTML to PDF tool if thats any use?
Does anyone have any real world advice on how to go about this and what the "best" (most pragmatic) approach would be? The less "extras" I need to install and configure on a server the better - the faster the development the better (as always!!)
Findings so far:
Word Mail Merge (or VSTO)
Simply doesn't offer the simplicity, control and flexibility I require - shame really. Would be nice to define a dotx and be able to pass in the data to it on an individual basis to generate the docx. Only way I could acheive this (and I may be wrong here) was to loop through controls/bookmarks by name and replace the values...messy.
OpenXML
Creating documents based on dotx templates, even using OpenXML is not as simple as (IMHO) it should be. You have to replace each Content control by name, so maintenance isn't the simplest task.
SSRS
On the face of it this is a good solution (although it needs SQL Enterprise), however it gets more complicated if you want to dynamically produce the folders and documents. Data driven subscription gets very close to what I want though.
Winnovative HTML to PDF Convertor*
This is the tool we already have (albeit a .Net 2.0 version). This allows me to generate the HTML pages and convert those to PDF. A good option for me since I can run this on an MVC3 website adn pass the parameters into the controllers to generate the PDF's. This gives me much finer-grained control over the folder and naming structures - the issue with this method is simply generating the pages in the correct way. A bonus is that it automatically gives me a "preview"...basiclly just the HTML page!
Office OpenXML is a nice and simple way of generating office files. XSLT's can be strong tool to format your content. This technology will not let you create pdf's.
Fast development without using any third party components will be difficult. But if you do consider using a report server, make sure to check out BIRT or Jasper.
To generate pdf's I have been using the deprecated Report.net. It has many ports to different languages and is still sufficient to make simple pdf's. Report.net on sourceforge
I dont think SQL Server itself can produce pdf files. What you can do is, as you mentioned, install an instance of SSRS and create a report that produces the information you need. Then you can create a subscription to deliver your report to where you want, when you want.
Here is an example of a simple subscription:
Go for SSRS only if you are OK with setting it up on a server and there is a definite need for schedule reporting and complex reports.
If you have code for manual PDF/docx generation, I would suggest to go ahead with it. Hopefully the complexity of its code is not a matter to you.
I have used both in separate scenarios. We used excel classes and objects from .NET for a minimal reporting from a web application.
But went for an elaborate reporting scheme for a system which required 1000s of reports to be generated in a scheduled manner and delivered to selected set of people.

.net API for managing pst files

I have an old PC on which i have a large pst file , and i have the idea to write a small C# program to spilt it into smaller files so that i can better manage them if needed ( i know that sounds weird and that there are also available tools in google but i thought it will be fun to play with it ).The problem is that i can't find good article or API info which functions are best suited ( if there are any at all ) for managing those files , Ideas ?
Thanks in advance
Take a look at Redemption API. This API does not require Outlook to be installed (only stand-alone MAPI) and does not require outlook to run.
First just a clarification by PST file you mean you outlook information?
Running on that I know of no API to manipulate it but you can get the documentation for it at http://msdn.microsoft.com/en-us/library/ff385210(v=office.12).aspx this is a large and complex specification for a binary format. Always play with a copy of it not the real thing.
One approach that might be better is using the ActiveX/COM interface provided by outlook to interact with this file abstractly, so instead of dealing with the physical layout of the file work with contacts, folder and email messages.
It may be worth your while finding out how open source mail clients (Like thunderbird) import from outlook. You may be able to pull there code out into an API, as long as you follow the licence conditions.
Not the easier answer, but it is the one I have.

File System indexing

I am creating an application were I need to scan a directory hive to find a certain file.
I also want to better understand how indexing works.
Can anyone point me to any resource preferably in C# that shows how I can create a basic index for file system searching?
So, it sounds like you need a library for doing searches.
Lucene is a java search library, which has been ported to C#.

Categories