Convert documents in different formats into 1-bit PNG files in .NET (C#)

Hi, I have an initiative to convert a bunch of different formats, such as Word, PDF, PNG, JPG and Excel files, into 1-bit PNG files and store them in the database.
I need to use the .NET Framework for this. Do you know of any tools or APIs that can do this? I am willing to buy them, and I am fairly sure I will need more than one, which is fine.
What would be the best way to do this? Is it possible to convert them all to TIFF first and then convert those to PNG?
Thanks

This is close to a duplicate of "Is there a programming toolkit for converting 'any file type' to a TIFF image?". I'll just repeat what I wrote at that time:
This is an integration project - there is no one tool that will read all of the file types you're interested in. In our case, we developed a generic transcoding service that accepts numerous input types (by file extension) and executes external applications based on that type:
Ghostscript for PDF and PS files,
ImageMagick for image files,
Aspose.NET for Office types, and
some homegrown stuff for simpler types like text files.
We haven't found an application that will interpret Visio files of all versions other than Visio itself. And as you may already know, Office Interop should not be used on a server.
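The extension-based dispatch described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the service we actually built: the class name, the tool table and the argument lists are illustrative. `pngmono` is Ghostscript's 1-bit PNG device and `-monochrome` is an ImageMagick option, but the executable names depend on your installation, and you should verify the bit depth of the output yourself. Office types are left as "unsupported" here because that branch depends on the library you license.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: picks a tool family and command-line arguments by file extension.
public static class TranscodePlanner
{
    // Extension -> tool family. The table contents are illustrative only.
    private static readonly Dictionary<string, string> ToolByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".pdf", "ghostscript" },
            { ".ps", "ghostscript" },
            { ".png", "imagemagick" },
            { ".jpg", "imagemagick" },
            { ".jpeg", "imagemagick" },
            { ".tif", "imagemagick" },
        };

    public static string ToolFor(string inputPath)
    {
        string tool;
        string extension = Path.GetExtension(inputPath);
        return ToolByExtension.TryGetValue(extension, out tool) ? tool : "unsupported";
    }

    public static string[] ArgumentsFor(string inputPath, string outputPath)
    {
        switch (ToolFor(inputPath))
        {
            case "ghostscript":
                // For multi-page input, put a page pattern such as %03d in outputPath.
                return new[]
                {
                    "-dBATCH", "-dNOPAUSE", "-sDEVICE=pngmono", "-r300",
                    "-sOutputFile=" + outputPath, inputPath
                };
            case "imagemagick":
                // -monochrome reduces the image to black and white.
                return new[] { inputPath, "-monochrome", outputPath };
            default:
                throw new NotSupportedException("No converter registered for " + inputPath);
        }
    }
}
```

The returned arguments would then be handed to `System.Diagnostics.Process` together with the path of the Ghostscript or ImageMagick executable on your server.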

Related

Open or convert a webarchive file in C#

I am trying to find a way to open a webarchive file, or convert it to any other format, in C#. The goal is an automated import system with as few restrictions on file type as possible. I cannot find any way of converting the file other than opening it in Safari.
Unfortunately, what you are looking for cannot really be done directly. A webarchive is a proprietary file type made by Apple for displaying offline web pages in Safari. It is a combination of XML, HTML, and binary data. There are, however, Objective-C examples that convert a webarchive into a zip archive containing the HTML and the embedded images/media that the saved page originally displayed.
Here is an Objective-C example from GitHub: WebArchiveExtractor
As for converting to PDF, I am not sure that can be done. You would be better off printing the web page to PDF in the first place and then uploading that to your document management system.
Apparently, though, the webarchive file type contains XML with binary-encoded images/media, similar to an MHTML file, so you may be able to work out the format by viewing files in a text editor and then write a conversion utility. There is very little information on the web about the internal schema of the webarchive format, so this may be a daunting task. However, since WebKit is open source, you can read its code for creating an archive and try to reverse it to build your converter. Here's the source code (in C++) for the archiving features in Safari, which actually looks like it uses MHTML, but I haven't explored deeply enough to tell whether it's exactly the same format: http://trac.webkit.org/browser/trunk/Source/WebCore/loader/archive
Good Luck!

Is there any way to write one piece of code that works with all possible office documents?

I'm writing a program that modifies Word documents. Currently I use Microsoft.Office.Interop.Word to work with Word documents, which requires Microsoft Office to be installed on the user's computer. Some of my clients don't have MS Office, but they do have OpenOffice.
So, which library should I use instead of Interop?
Also, how can I make my code work with different document files, not only .doc and .docx but also files from other office programs?
Currently I'm writing different code for every type of document.
My program translates documents from their original language into another, so it is very important for me to keep the original formatting of the document. That's why I used Interop, but I also want my program to be useful to as many people as possible.
You don't mention it, but are you assuming that all your clients use the same version of Office? To deal with differing Office versions, you may want to look at this open source project, NetOffice (http://netoffice.codeplex.com/), and do all your .doc and .docx development with that library.
For OpenOffice or LibreOffice, I believe the best you can do is go to the project's website and download the SDK. For example, go here: http://api.libreoffice.org/examples/examples.html and you will find some examples in Java, Python and C++ for editing text documents, including .odt files.
The LibreOffice SDK can be downloaded here: http://www.libreoffice.org/download/
And finally, there is also the Open XML format (mentioned in another answer), which is:
ECMA Office Open XML ("Open XML") is an international, open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms.
You can also download its SDK here: http://msdn.microsoft.com/en-us/office/bb265236.aspx
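Because an Open XML file is a ZIP package of XML parts, you can get a feel for the format without Office installed. The following is a minimal sketch that does not use the SDK linked above; it relies only on `System.IO.Compression` and LINQ to XML (on the .NET Framework this needs a reference to the System.IO.Compression assembly). The class name `DocxText` is hypothetical, and the sketch assumes the main part lives at the conventional path `word/document.xml`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: returns the plain text of each paragraph in a .docx stream.
public static class DocxText
{
    // WordprocessingML main namespace.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is at its conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in package.");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // A paragraph (w:p) holds runs whose text sits in w:t elements.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }
}
```

This only reads text and ignores formatting. A translation tool that must preserve formatting would have to replace the text inside each run rather than flatten paragraphs like this.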
Hope that helps.
You will likely end up writing separate code to work with each file type. There may be some similarities within, say, Office products, but for the most part you're going to need an adapter for each type.
However, you could (and should) minimize the amount of duplicate code by placing the translation logic and other non-type-specific functions in a shared library that each adapter would then reference.
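One way to lay that out is sketched below. All names here (`IDocumentAdapter`, `DocumentTranslator`, `InMemoryAdapter`) are hypothetical, and the translation step is passed in as a plain function. The in-memory adapter only exists so that the sketch runs without Office; a real adapter would wrap Interop, the Open XML SDK, or the LibreOffice API and would keep each text run attached to its formatting.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical contract that each format-specific adapter implements.
public interface IDocumentAdapter
{
    bool CanHandle(string extension);
    IList<string> ReadTextRuns(string path);
    void WriteTextRuns(string path, IList<string> runs);
}

// Shared, format-independent logic: choose an adapter, translate, write back.
public sealed class DocumentTranslator
{
    private readonly List<IDocumentAdapter> _adapters;
    private readonly Func<string, string> _translate;

    public DocumentTranslator(IEnumerable<IDocumentAdapter> adapters,
                              Func<string, string> translate)
    {
        _adapters = adapters.ToList();
        _translate = translate;
    }

    public void Translate(string path)
    {
        string extension = Path.GetExtension(path);
        IDocumentAdapter adapter = _adapters.FirstOrDefault(a => a.CanHandle(extension));
        if (adapter == null)
            throw new NotSupportedException("No adapter registered for '" + extension + "'.");

        // Only the text changes; where it sits in the document is the adapter's concern.
        List<string> translated = adapter.ReadTextRuns(path).Select(_translate).ToList();
        adapter.WriteTextRuns(path, translated);
    }
}

// Stand-in adapter that keeps "files" in a dictionary.
public sealed class InMemoryAdapter : IDocumentAdapter
{
    public readonly Dictionary<string, IList<string>> Store =
        new Dictionary<string, IList<string>>();

    public bool CanHandle(string extension)
    {
        return string.Equals(extension, ".txt", StringComparison.OrdinalIgnoreCase);
    }

    public IList<string> ReadTextRuns(string path) { return Store[path]; }

    public void WriteTextRuns(string path, IList<string> runs) { Store[path] = runs; }
}
```

Adding support for another format then means writing one more adapter and registering it; the translator itself does not change.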
We are using Aspose.Words. It supports DOC, DOCX, RTF and OOXML.
But it's not free.

.NET graphics libraries to display files (PDF, .docx and any image format) in the browser

I am developing an ASP.NET MVC application where users are able to upload files to a repository. Those files could be PDF, DOC, any type of image and so on.
When the user selects a file to be imported, I would like to display this file in the browser so they can review its contents before the upload.
I know I could use some sort of IFrame to display a PDF, but I am looking for specific classes or .NET libraries to implement this feature.
I just need a pointer in the right direction.
This is an extremely difficult problem. There are some libraries that can help. For instance, PDF files can be rendered to images with Ghostscript. Word and Excel files can be converted to PDF or to an image with a number of libraries. None of them, AFAIK, are very good at it, so I cannot recommend one.
You could automate MS Office to perform the conversion to PDF, but that is decidedly not safe for server code. Another possibility is to convert the source documents to SWF files (as FlexPaper does) and display them in Flash. There are some great libraries out there, but it will limit your supported clients. SharePoint provides some of this capability as well. Others have used OpenOffice to convert MS Office documents, but at a loss of quality.
I can't really advise any specific direction as it is highly dependent on what you/your company is willing to spend and the desired results. Good luck.
You could try to rely on Windows and the Explorer thumbnails for it, like here, but then you'd have to make sure that:
You can abuse the server in the most elaborate ways (install software, talk to the shell from ASP.NET).
You have a thumbnail provider installed on the server for every type that you want to preview. I guess that once you can see the thumbnail in Explorer, you're set. So for PDF, you might need to install Adobe's PDF reader.
Docx files are saved with the thumbnail option checked (see link). There seems to be no other easy, free way to convert a docx to a thumbnail. The "best" solution I came across was to re-save the file automatically somehow, making sure the thumbnail option is checked.
I don't want to say that's impossible, but it can't be done with finite effort.
What you are asking for is a browser-based solution, because you want the user to be able to "review" the document before uploading.
Therefore you cannot use a server side solution, which is essentially what is being asked by referring to a ".Net library".
.Net libraries are dependent on an installed version of .Net, which does not exist in all versions for all operating systems for which graphical browsers exist.
Next, recent changes in browser security do not allow a script to read the full client-side file name of the file selected in the input field.
You'd have to rely on HTML5 and its FileReader to access the file's byte stream, but even then you can only get an image out of image files. (see sample)
Excluding browser-based solutions in Flash, ActiveX, Java, due to browser and platform support, this leaves JavaScript as the only "reasonable" solution: you'd need a library for each supported format to either convert a file into an image in an image format supported by browsers, or extract the text(+image) representation of a file.
Great answers... I just want to share the result of my research: I found a nice client-side solution supported by Mozilla Labs. It is a framework based on HTML5 and JavaScript, with no native code needed.
Here is the project website:
https://github.com/mozilla/pdf.js
This is what it is capable of:
http://mozilla.github.com/pdf.js/web/viewer.html
And lastly, a great video explaining how everything works:
http://www.youtube.com/watch?v=Iv15UY-4Fg8&noredirect=1
Regarding my question, we are going to convert every possible file to PDF on the server and then render that PDF using this framework.

How do I generate a preview image from the first page of a PDF, PPT/PPTX or DOC/DOCX in ASP.NET C#?

I have read all the references I could find and haven't found a solution.
Is there any third-party .NET component available for this job, that is, for converting documents (doc, pdf, ppt, xls, ...) into images?
Thank you very much. I'm waiting online...
You may want to take a look at ImageGlue.
As for the MS Office formats, I think the files need to have the thumbnail option enabled for ImageGlue to get the preview image.
My company implements a Digital Asset Management system that uses ImageGlue to generate preview images for a lot of different file formats, so it should definitely do the trick.

Is there a generic API/library for reading metadata of all known file types in C#?

I need to read and modify the metadata of files uploaded to our server. Is there a generic API/library for reading metadata as key-value pairs?
The files may be proprietary formats such as .doc/.docx and .xls/.xlsx, as well as free formats like .rtf, .txt and .jpg.
Thanks for all the help
There's no library for reading the metadata of "all known filetypes" in any language, because it's pretty much impossible.
You may be able to find libraries capable of reading a particular format or family of closely-related formats, which is the most common solution and works in most situations.
For the formats you've listed, libraries do exist. JPG has some support built into .NET, I think, through some of the System libraries. TXT is simple text, which is supported in most languages. RTF has some support, mainly through the RichTextBox control, I think. For the Office formats, I would look into Office's SDK or perhaps the Office development tools for Visual Studio; those might have more information.
There is a program, TrID, that can identify file formats based on their data, which may be of some interest. It doesn't do proper metadata reading, but it is the closest thing to a universal file reader that exists (that I'm aware of).
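To illustrate what "identifying a format from its data" means, here is a minimal sketch that checks the leading bytes of a file against a few well-known signatures. It is not how TrID works internally (I don't know its method) and it covers only a handful of formats. The class name and the labels it returns are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: guesses a container format from the first bytes of a file.
public static class SignatureSniffer
{
    private static readonly Dictionary<string, byte[]> Signatures =
        new Dictionary<string, byte[]>
        {
            { "pdf", new byte[] { 0x25, 0x50, 0x44, 0x46 } },                               // "%PDF"
            { "png", new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A } },
            { "jpeg", new byte[] { 0xFF, 0xD8, 0xFF } },
            { "rtf", new byte[] { 0x7B, 0x5C, 0x72, 0x74, 0x66 } },                         // "{\rtf"
            { "zip-container", new byte[] { 0x50, 0x4B, 0x03, 0x04 } },                     // docx, xlsx, odt, ...
            { "ole-compound", new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 } } // legacy doc, xls
        };

    public static string Identify(byte[] header)
    {
        foreach (KeyValuePair<string, byte[]> signature in Signatures)
        {
            byte[] magic = signature.Value;
            if (header.Length >= magic.Length &&
                header.Take(magic.Length).SequenceEqual(magic))
            {
                return signature.Key;
            }
        }
        return "unknown";
    }
}
```

In practice you would read the first 8 or so bytes of the upload and pass them to `Identify`, then hand the file to a format-specific metadata library. Note that a ZIP signature only tells you the container; telling .docx from .xlsx requires looking inside the package.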
