I am trying to create a windows phone app for reading e-pubs. I extracted the content and now I want to read the ncx file. But when I try to use System.Xml.Serialization.XmlSerializer it is telling me unknown field in the second line itself. Please help
Here is how the basic approach to read an epub file
Treat the EPUB file as a ZIP archive and read it using the Windows
built-in ZIP archive reader, ZipArchive
In the archive, find the file META-INF/container.xml and look in it
to find the full-path attribute of the root-file element. That gives
you the path to the OPF file (probably something like
OPS/content.opf) The 'manifest' element of the OPF file will tell
you the names of all the files that make up the book. The 'spine'
element will tell you the order in which they appear in the book (and
will include a reference, via the 'toc' attribute of the spine
element, to a table-of-contents file that will usually be in NCX
format)
Normally, the EPUB book will consist of series of XHTML files, each
file containing one 'chapter' of the book. The basic procedure to
display a book for reading would be:
figure out which chapter the user wants to look at
load the XHTML for that chapter into a WebView (or some other solution for rendering XHTML on screen)
Problems you are likely to encounter:
Many EPUB books are created using ZIP-generators that, although
compatible with the ZIP standard, are incompatible with the
ZIP-reader APIs built into the OS. You will probably need to use a
third party library like DotNetZip or SharpZipLib (but be careful of
the licence conditions for the latter).
You will need to do some work to display images in the WebView,
especially if you try to cover all the image types that are part
of the EPUB standard.
It will be fiddly to find and apply all the CSS styles that the EPUB
book defines.
You will probably want to display a 'paged' view of the chapter,
rather than displaying it as a long vertically scrollable column.
That will involve some funky javascript work.
You may find that an individual EPUB chapter is too large for
displaying in a WebView. In the end, you may decide that all the
limitations of WebView mean you will be better off writing your own
custom XHTML-parsing rendering solution, and displaying using
TextBlocks, or something more exotic (you can use C++ interop
code and the D2D Font APIs)
To parse .epub file you may want to use a library :
Epubreader
Epubsharp
DotNetEpub
Aspose.Words for .NET
source on SO : 1 2 3 4
Related
I am searching from last two days but did not find any thing.
My requirement is to create a document viewer in my web application (C#.Net) and I don't want to use any third party tool for this. Can I convert the files in image or PDF or in any common formate which can be easly render on web page. I also can not use Introp object.
Any help will be highly appreciated
You mention in one of your comments that you'd like to write all the code yourself but don't know where to start. Here's how I would go about it...
First, you'll need to familiarize yourself with the Microsoft Office Format specification. You can find that here (there's a link to the technical specification). Office documents are actually a .zip file with an XML file inside along with any binary data representing attachments. Just renamed a .docx file as .zip and you'll be able to open it up and see the XML and any other supporting documents inside (same is true for xlsx, etc...).
Then you'll need to become intimately familiar with either PDF or HTML, as your job now will be to convert the various Office document structure into PDF or HTML structure, being sure to respect page layout, margins, order, etc...
As others have said, this is a large task which is why third party tools exist today. Also, each third party toolset has it's limitation as this is really hard to "get right" in all situations and there will be edge cases that work for one document and not another (because maybe they didn't use Microsoft Word to save the .docx, maybe they used OpenOffice and OpenOffice interpreted the standard slightly differently...)
If you cannot use COM/Interop technologies in your solution, you can take a look at the specialized 3rd party options. I see that you prefer not to use them, however, there are no existing built-in solutions in the .NET Framework. Check out my answer in a similar thread that describes how to accomplish exactly the same task using 3rd party libraries (for example, DevExpress, since I have experience with it). In addition, take a look at the Documents demo, where you can see how to create images/thumbnails from different types of MS Office documents.
I believe what you need is an intermediate representation of the documents which can be converted into an image for the viewer to display.
Lets me try to explain with the below diagram:
You can use tools like smallpdf or OfficeToPDF to do that. Just integrate them into your application.
Small PDF(https://smallpdf.com/library-detail)
officetopdf (https://officetopdf.codeplex.com/)
I am trying to find a way to open or convert a webarchive file to any other format in C#. The goal is an automated import system with as few restrictions on file type as possible. I cannot seem to find any way of converting the file other than using safari to open it.
Unfortunately what you are looking for cannot really be done. A webarchive is a proprietary file type made by Apple to display offline webpages in a Safari. This is a combination of xml, html, and binary data, but there are examples in Objective-C to convert the webarchive to a zip archive that contains the html and embedded images/media that was originally displayed on the website that was saved into the webarchive file.
Here is an Objective-C example from GitHub - WebArchiveExtractor
As for converting to PDF...not sure that can be done, you would be better off printing the webpage to PDF in the first place and then uploading that to your document management system.
Apparently though the webarchive filetype contains XML with binary encoded images/media similar to an MHTML file, so you may be able to figure out the format by viewing them in text editors and then writing a conversion utility, but there is very limited information on the web regarding the internal schema of the webarchive file format, so this may be a daunting task. However, since WebKit is open source you can see their code for created an archive and try to reverse it to build your converter. Here's the source code (in C++) for the archiving features in Safari, which actually looks like they are using mhtml, but I haven't explored deep enough to tell if it's exactly the same format: http://trac.webkit.org/browser/trunk/Source/WebCore/loader/archive
Good Luck!
I am developing a ASP .NET MVC application where users are able to upload files to a repository. Those files could be pdf, doc, any type of image and so on.
When the user select a file to be imported I would like to display this file in the browser so they can review its contents before the upload.
I know I could use some sort of IFrame to display pdf but I am looking for some specific class or .net libraries to implement this feature.
I just need a north.
This is an extremely difficult problem. There are some libraries that can help. For instance PDF files might be rendered to images with ghostscript. Word and Excel files might be converted to PDF or image with a number of libraries. None of them, AFAIK, are very good at it so I can not recommend one.
You could automate MSO to perform the conversion to PDF, but that is decidedly not safe for server code. Another possibility is convert source documents to SWF files (like flexpaper) and display in flash. There are some great libraries out there, but it will limit your supported clients. Sharepoint has support for providing some of this capability as well. Others have used OpenOffice to convert MSO documents but also at a loss of quality.
I can't really advise any specific direction as it is highly dependent on what you/your company is willing to spend and the desired results. Good luck.
You could try to rely on Windows and the explorer thumbnails for it, like here, but then you'd have to make sure that:
You can abuse the server in the most elaborate way (install stuff, talk to the shell from ASP.NET)
You have a thumbnail provider installed on the server for every type that you want to preview. I guess from the moment you can see the thumbnail in explorer, you're set. So for pdf, you might need to install PDF Reader from Adobe.
Docx files should be saved with thumbnail checked (see link). There seems to be no other easy, free way to convert a docx to a thumbnail. The "best" solution I came across, was saving it automatically again somehow, and making sure the thumbnail option is checked.
I don't want to say that's impossible, but it can't be done with finite effort.
What you are asking for is a browser-based solution, because you want the user to be able to "review" the document before uploading.
Therefore you cannot use a server side solution, which is essentially what is being asked by referring to a ".Net library".
.Net libraries are dependent on an installed version of .Net, which does not exist in all versions for all operating systems for which graphical browsers exist.
Next, recent changes in browser security do not allow to read the full client-side file name of the selected file in the input field.
You'd have to rely on HTML5 and its FileReader to access the file's byte stream, but even then you can only retrieve image from image files. (see sample)
Excluding browser-based solutions in Flash, ActiveX, Java, due to browser and platform support, this leaves JavaScript as the only "reasonable" solution: you'd need a library for each supported format to either convert a file into an image in an image format supported by browsers, or extract the text(+image) representation of a file.
Great awnsers... Just want to share the result of my research and I found a nice client-based solution supported by Mozilla Labs. This is a framework based on HTML5 and Javascript with no native code needed.
Here the project website:
https://github.com/mozilla/pdf.js
This is what you are capable of:
http://mozilla.github.com/pdf.js/web/viewer.html
And for the last a great video explaning how everthing works
http://www.youtube.com/watch?v=Iv15UY-4Fg8&noredirect=1
Reguarding my question we are going to converter every possible file to PDF on the server and then render this PDF using this framework.
I need to create a C# or C++ (MFC) application that converts pdf files to txt. I need not only to convert, but remove headers, footers, some garbage characters on the left margin etc. Thus the application shold allow the user to set page margins to cut off what is not needed. I actually have already created such an application using xpdf, but it gives me some problems when I am trying to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.
There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code # SourceForge
Please look at Podofo. It's a LGPL-licensed library that has many powerful editing features. One of it's examples, txt2pdf IIRC, is a good start: it shows basic text-extraction; From there you can check if pre (in pdf engine) or post (in text) filtering suffices to your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.
I want to parse a PDF file from my C# app and create an audio file off it.
How would I do that ?
I'm particularly looking for a good pdf to text library or a way to strip a pdf file off its text.
You preferably have a tagged PDF document as your input document. This means that the document contains tags to mark up the logical structure of the document (typically a PDF document will only contain visual information).
This PDF could then be converted into DAISY format, which is a standard for digital talking books, i.e. an intermediate XML format storing the text of books along with the logical structure and navigation features.
This Daisy XML format can be either converted to an audio format, or you could be using a Daisy reader, a physical device like an MP3 player to listen to the book.
There is a presentation available at the Daisy web site explaining the principles of this toolchain:
Accessible PDF to DAISY/NIMAS Conversion
Use Festival for the text to speech. Various pdf to text api's exist...
You need the Speech SDK from Microsoft. Read an instruction here
As the other posters outlined, first you have to extract the text from the .pdf file. pdf files are an open format now, so you can probably find a parser through Google.
Then you have to extract the text you want to convert to speech from the file, ignoring things like figure titles, page headers, table of contents etc.
Once you've got the text, you need to convert it to speech. This is probably the hardest part.
A while ago I was fiddling around with generating voice files for a gaming mod, since I'm a rotten voice actor.
Cepstral had the best TTS converters I could find. (The free ones had an annoying tendency to insert Cepstral advertisements in the speech, but I could manually edit this out for what I was doing.)
It turns out that there's a speech synthesis markup language which can be used to provide clues to the TTS converter about which syllable to place accents, etc. Here's a linky:
http://www.w3.org/TR/speech-synthesis/
How you go about automatically adding the SSML to the text is a bit beyond me.
Anyway, the TTS converter will produce an audio file, and the final step would be to compress the audio at the desired bit rate in mp3 format.
If your sole task is to listen to speech synthesized text from a PDF, how about the Acrobat "Read out loud" function at the bottom of the "View" menu?
I guess it's a hard thing to do. Firstly you need to read the text in that pdf, and then use some mechanism of synthetic voice generation to create the audio content. Then you have to store it as an mp3.
On Mac OS X, you can extract the text of the pdf and then pipe it in "say". You should find equivalent synthetisers on other OS.
It's not all that complicated to do, provided that you don't re-invent the wheel, but instead simply reuse existing technology (i.e. text to speech engines like festival), as well as OCR engines to process the PDF files.
The most complicated thing probably is to work with different PDF layouts (columns, rows, embedded graphics,foot notes, URLs etc), which may obfuscate the text recognition process.
However, in general (if this is not supposed to be a learning experience), it is certainly easier to just resort to using existing software solutions:
Visual Text to Speech
Text to speech tools
zero 2000