I am trying to find a way to open or convert a webarchive file to any other format in C#. The goal is an automated import system with as few restrictions on file type as possible. I cannot seem to find any way of converting the file other than using Safari to open it.
Unfortunately, what you are looking for cannot really be done directly. A webarchive is a proprietary file type made by Apple to display offline webpages in Safari. It is a combination of XML, HTML, and binary data. There are, however, examples in Objective-C that convert a webarchive to a zip archive containing the HTML and the embedded images/media of the page that was originally saved into the webarchive file.
Here is an Objective-C example from GitHub - WebArchiveExtractor
As for converting to PDF, I am not sure that can be done. You would be better off printing the webpage to PDF in the first place and then uploading that to your document management system.
Apparently, though, the webarchive file type contains XML with binary-encoded images/media, similar to an MHTML file, so you may be able to figure out the format by viewing files in a text editor and then write a conversion utility. There is very limited information on the web about the internal schema of the webarchive format, so this may be a daunting task. However, since WebKit is open source, you can read its code for creating an archive and try to reverse it to build your converter. Here's the source code (in C++) for the archiving features in Safari, which looks like it uses MHTML, but I haven't explored deeply enough to tell whether it's exactly the same format: http://trac.webkit.org/browser/trunk/Source/WebCore/loader/archive
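For what it's worth, the .webarchive files that Safari writes appear to be property lists (usually the binary flavour), so a plist parser may get you most of the way. Below is a minimal sketch in Python, chosen only because its standard library ships a plist reader; a C# port would need a plist library of its own. The key names (WebMainResource, WebSubresources, WebResourceURL, WebResourceMIMEType, WebResourceData) are the ones commonly seen in these files, but treat them as assumptions and check them against your own archives. `extract_resources` is a hypothetical helper name.

```python
import plistlib


def extract_resources(archive_bytes):
    """Return (url, mime_type, data) tuples for every resource in a webarchive.

    Assumes the archive is a property list whose top-level dictionary has a
    "WebMainResource" entry and an optional "WebSubresources" list.
    """
    archive = plistlib.loads(archive_bytes)  # detects XML or binary plists
    entries = [archive["WebMainResource"]]
    entries += list(archive.get("WebSubresources", []))
    return [
        (
            entry.get("WebResourceURL", ""),
            entry.get("WebResourceMIMEType", ""),
            entry["WebResourceData"],
        )
        for entry in entries
    ]


if __name__ == "__main__":
    # Build a tiny stand-in archive so the sketch runs without a real file.
    sample = plistlib.dumps(
        {
            "WebMainResource": {
                "WebResourceURL": "http://example.com/",
                "WebResourceMIMEType": "text/html",
                "WebResourceData": b"<html><body>Hello</body></html>",
            },
            "WebSubresources": [
                {
                    "WebResourceURL": "http://example.com/logo.png",
                    "WebResourceMIMEType": "image/png",
                    "WebResourceData": b"\x89PNG",
                }
            ],
        },
        fmt=plistlib.FMT_BINARY,
    )
    for url, mime, data in extract_resources(sample):
        print(url, mime, len(data))
```

Once the resources are pulled out like this, writing them into a zip or a folder of HTML plus media is straightforward. Nested frames, if any, are not handled by this sketch.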
Good Luck!
Related
I am developing an ASP.NET MVC application where users are able to upload files to a repository. Those files could be PDF, DOC, any type of image, and so on.
When the user selects a file to be imported, I would like to display this file in the browser so they can review its contents before the upload.
I know I could use some sort of IFrame to display PDFs, but I am looking for specific classes or .NET libraries to implement this feature.
I just need to be pointed in the right direction.
This is an extremely difficult problem. There are some libraries that can help. For instance, PDF files can be rendered to images with Ghostscript. Word and Excel files can be converted to PDF or to images with a number of libraries. None of them, AFAIK, are very good at it, so I cannot recommend one.
You could automate Microsoft Office (MSO) to perform the conversion to PDF, but that is decidedly not safe for server code. Another possibility is to convert source documents to SWF files (as FlexPaper does) and display them in Flash. There are some great libraries out there, but this will limit your supported clients. SharePoint has support for providing some of this capability as well. Others have used OpenOffice to convert MSO documents, but also at a loss of quality.
I can't really advise any specific direction as it is highly dependent on what you/your company is willing to spend and the desired results. Good luck.
You could try to rely on Windows and the explorer thumbnails for it, like here, but then you'd have to make sure that:
You can abuse the server in the most elaborate way (install stuff, talk to the shell from ASP.NET)
You have a thumbnail provider installed on the server for every type that you want to preview. I guess from the moment you can see the thumbnail in explorer, you're set. So for pdf, you might need to install PDF Reader from Adobe.
Docx files should be saved with the thumbnail option checked (see link). There seems to be no other easy, free way to convert a docx to a thumbnail. The "best" solution I came across was to re-save the file automatically somehow, making sure the thumbnail option is checked.
I don't want to say that's impossible, but it can't be done with finite effort.
What you are asking for is a browser-based solution, because you want the user to be able to "review" the document before uploading.
Therefore you cannot use a server side solution, which is essentially what is being asked by referring to a ".Net library".
.NET libraries depend on an installed version of .NET, which does not exist in all versions for all operating systems for which graphical browsers exist.
Next, recent changes in browser security do not allow reading the full client-side file name of the file selected in the input field.
You'd have to rely on HTML5 and its FileReader to access the file's byte stream, but even then you can only retrieve an image from image files. (see sample)
Excluding browser-based solutions in Flash, ActiveX, Java, due to browser and platform support, this leaves JavaScript as the only "reasonable" solution: you'd need a library for each supported format to either convert a file into an image in an image format supported by browsers, or extract the text(+image) representation of a file.
Great answers... I just want to share the result of my research: I found a nice client-side solution supported by Mozilla Labs. It is a framework based on HTML5 and JavaScript, with no native code needed.
Here is the project website:
https://github.com/mozilla/pdf.js
This is what it is capable of:
http://mozilla.github.com/pdf.js/web/viewer.html
And lastly, a great video explaining how everything works:
http://www.youtube.com/watch?v=Iv15UY-4Fg8&noredirect=1
Regarding my question, we are going to convert every possible file to PDF on the server and then render that PDF using this framework.
Hi, I'm new to programming and have written a few applications that access PDF content using some DLL files. My question now is: how can we write our own DLL to access PDF files? I know it's a big process, but I'm very much interested in learning about this. Can anyone please help me?
You can start by reading the PDF specification (warning 32MB behind this link) in order to understand how the PDF file format is implemented. This is necessary if you want to be able to parse it and extract the information you are interested in.
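To give a first taste of what parsing involves: a PDF file starts with a header comment of the form %PDF-M.m and normally ends with an %%EOF marker. Here is a minimal sketch (in Python for brevity; the same few lines port easily to C#). The names `read_pdf_version` and `has_eof_marker` are hypothetical, and the 1024-byte search window is an assumption modelled on what lenient readers tend to tolerate, not a rule to rely on.

```python
import re


def read_pdf_version(data):
    """Return the version string from the %PDF-M.m header, or None if absent."""
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def has_eof_marker(data):
    """Check for the %%EOF marker near the end of the file."""
    return b"%%EOF" in data[-1024:]


if __name__ == "__main__":
    # A stand-in byte string; a real file would be read with open(path, "rb").
    sample = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
    print(read_pdf_version(sample), has_eof_marker(sample))
```

Everything between those two markers (objects, cross-reference table, trailer, compressed streams) is where the real work lies, which is why reading the specification comes first.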
In the meantime (as this reading might keep you busy for some time), if you have pressing project deadlines you probably want to use an existing library such as iTextSharp.
I know it's a big process but i'm very much interested to learn about this.
That's true. I'd suggest studying some open source APIs (such as iTextSharp) and PDF SDKs.
Hi, I have an initiative to convert a bunch of different formats, such as Word, PDF, PNG, JPG, and Excel files, into '1 bit png' files and store them in the database.
I need to use the .NET Framework for this. Do you know of any tools that can do this? I want tools or an API that I can buy. I am pretty sure I will need more than one tool for this, which is fine.
What would be the best way to do this? Is it possible to convert them all to TIFF format and then convert them to PNG?
Thanks
This is close to a duplicate of Is there a programming toolkit for converting “any file type” to a TIFF image?. I'll just repeat what I wrote at that time:
This is an integration project - there is no one tool that will read all of the file types you're interested in. In our case, we developed a generic transcoding service that accepts numerous input types (by file extension) and executes external applications based on that type:
Ghostscript for PDF and PS files,
ImageMagick for image files,
Aspose's .NET libraries for Office types, and
some homegrown stuff for simpler types like text files.
We haven't found an application that will interpret Visio files of all versions other than Visio itself. And as you may already know, Office Interop should not be used on a server.
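The dispatch-by-extension idea described above can be sketched roughly as follows. This is an illustration in Python rather than our actual service: the executable names (`pdf-tool`, `image-tool`, `office-tool`, `text-tool`) and their argument order are placeholders, not real command lines, and `build_command` is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical command templates; the executable names and arguments are
# placeholders and must be replaced with the real invocations of your tools.
CONVERTERS = {
    ".pdf": ["pdf-tool", "{src}", "{dst}"],
    ".ps": ["pdf-tool", "{src}", "{dst}"],
    ".png": ["image-tool", "{src}", "{dst}"],
    ".jpg": ["image-tool", "{src}", "{dst}"],
    ".docx": ["office-tool", "{src}", "{dst}"],
    ".txt": ["text-tool", "{src}", "{dst}"],
}


def build_command(src, dst):
    """Pick a converter by file extension and return its argument list."""
    ext = Path(src).suffix.lower()
    try:
        template = CONVERTERS[ext]
    except KeyError:
        raise ValueError(f"no converter registered for {ext!r}") from None
    return [part.format(src=src, dst=dst) for part in template]


if __name__ == "__main__":
    print(build_command("incoming/Report.PDF", "out/report.tif"))
    # In a real service the list would be handed to subprocess.run(..., check=True).
```

Keeping the mapping in a table means adding a new file type is a one-line change, and unsupported types fail loudly instead of being silently skipped.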
I need to convert PDF files into .doc files using C#. The computer has no file system though it doesn't have Office installed. Any good ideas on how I can approach this? I did some research and most people use the interop services.
You need to understand that PDF is not really implemented as a single document format.
If your PDF docs are created by rendering text to a PDF file, then direct PDF conversion is not only possible, but can be very good (reliable).
If the source of your PDF is either a scanner or a fax (essentially a scanner...), then what you have is a document with a "picture" of text. This scenario is more difficult to deal with. If you open up the markup, there is no 'text' to be converted. In this situation you have to deal with some manner of OCR (optical character recognition), which is less reliable due to a variety of issues.
If you have the option of intercepting the data before it is rendered to PDF (say like in SSRS or Crystal) then it would be better for you to bypass the PDF stage and move your data to a Word document.
If you are constrained to receiving faxes and then needing to interpret their content, prepare for OCR hell. It has been a while since I was there, so I hope that it has gotten better.
Even without Office installed on your machine, you have access (with Visual Studio) to the Office developer toolkit, which will allow you to build documents to be distributed in the Word formats (.doc/.docx).
An option may be to convert the PDF to HTML, which can then be opened in Word.
Use the Aspose PDF kit to convert the PDF to text, and then write the text to a .doc file using a FileStream or the Aspose doc library.
I want to read tables which are in a PDF document and I want to store these values in a Database.
What I have found so far through searching the web:
Read text from the PDF using ABCpdf .NET, which is available as freeware. But it's not the right solution, because I want to read the tables.
Convert the PDF document into Excel or Word. The tables carry over into the target document as they are. Word conversion is possible with EasyPDF Converter, a third-party tool that is much cheaper than the tools I found for converting PDF into Excel.
But I am looking for any other solution/API classes which can convert PDF into Excel.
There are two possible solutions:
a) Cometdocs offers a free online conversion from PDF to XLS that is surprisingly good, and it sends the resulting file to your email.
b) Cognview is commercial shareware that converts PDF to XLS. There is an OCR version and a text version. I haven't used it personally, but it has good recommendations.
If you are looking to upload your data into a database, converting your PDFs to CSV is probably the safest option. The PDFTables API will allow you to do this with C#, converting as many PDFs at once as necessary. https://pdftables.com/pdf-to-excel-api#csharp
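Once you have CSV, the database side is the easy part. Here is a rough sketch in Python with SQLite standing in for whatever database you actually use; it does not call the PDFTables API and assumes the CSV text has already been produced by some converter. `load_csv` is a hypothetical helper, every column is stored as text, and the table and column names are interpolated into the SQL, so they must come from a trusted source.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Create `table` from the CSV header row and insert the remaining rows.

    Assumes every data row has as many fields as the header.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    conn.commit()
    return len(body)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    count = load_csv(conn, "invoice_lines", "item,qty\nwidget,3\ngadget,5\n")
    print(count, conn.execute('SELECT * FROM "invoice_lines"').fetchall())
```

Tables extracted from PDFs often come with merged or ragged rows, so in practice you would validate each row before inserting it.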
You can try Quablo, a PDF table extractor available at this web page.