Split a large html document by page - c#

I have a very long article, over 50 pages, in a single HTML document. I would like to know if there's an algorithm that can split the HTML document by page (A4-sized pages), much like a print preview function does, into smaller files for each page while maintaining proper formatting.
I use .NET 4.0 C#

This doesn't always work, but sometimes you can print the document to PDF, split the PDF into one PDF per page, and convert those PDFs to HTML files. The result isn't always pretty, though.

You can use HTMLDOC to split HTML into pages:
Here's a blogpost explaining the process

Have you tried using a virtual printer such as CutePDF to print the document to a PDF? Also, according to the website I linked, CutePDF has its own SDK.

It's not clear why you want to do this, but try simply opening your page in Microsoft Word. Word has a "Print Layout" view that shows your document page by page. If needed, you can then edit it to make it fit.

Related

Convert asp.net page to Word document

I have built several PDF documents dynamically from ASP.NET pages (HTML/CSS) using plugins like Winnovative HtmlToPDFconverter. It has always been a successful outcome using the built-in functionality for those plugins, like merging existing PDF documents with dynamic content and adding pre-defined headers and footers, adding page margins, page numbers and so forth. The HTML content has overall been rendered as expected in the final PDF document(s).
Is there any way/any advice for a similar .NET plugin that can render HTML/CSHTML to a Microsoft Word document (.docx) in the same way – or is it too difficult to render native HTML5 and CSS into a desirable layout for a Microsoft Word Document?
I have googled around and found some suggestions, but I'm looking for recommendations for a specific plugin, or a warning if it is a no-go and too difficult to get the desired layout 1:1 from HTML to a Word document because of incompatibility between the markups.
Devexpress HTML editor can export its HTML content to different formats including docx and rtf. Not sure about its limitations (e.g. script and canvas export, etc.), but in the common case it works well.

C# PDFSharp: Examples of how to strip text from PDF?

I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".
Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.
Are there examples of how to do it?
Thanks!
Extracting text from a PDF file with PDFsharp is not a simple task.
It was discussed recently in this thread:
https://stackoverflow.com/a/9161732/162529
Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:
var doc = PdfReader.Open(docPath);
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();
doc.Pages.Count gives you the total number of pages, and you access each one by index through doc.Pages. I don't recommend using foreach and LINQ here, as the interfaces aren't implemented well. The index passed to GetDictionary selects which element of the page's contents to read; this may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.
The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
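As a rough illustration of the "find the strings with Regex" step, here is a minimal sketch. It assumes the content stream text is uncompressed and that the text is written as literal strings followed by the Tj operator, as in "(Hello) Tj". It does not handle TJ arrays, hex strings, or font-specific encodings, so treat it as a starting point only. The class and method names are made up for this example and are not part of PDFsharp.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamText
{
    // Matches a PDF literal string "(...)" followed by the Tj (show text) operator.
    // Backslash escapes such as \( and \) are allowed inside the string.
    private static readonly Regex ShowText =
        new Regex(@"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Compiled);

    // Returns the literal strings found in the given content stream text.
    public static List<string> ExtractStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            // Undo the simple escapes \( \) and \\ ; other escapes are left as-is.
            result.Add(Regex.Replace(m.Groups["text"].Value, @"\\([()\\])", "$1"));
        }
        return result;
    }
}
```

You would pass in the string obtained from the page's content stream, for example ContentStreamText.ExtractStrings(pageText), and then join or filter the results as needed.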
Example of PDFSharp libraries extracting images from .pdf file:
link
library
EDIT:
If you then want to extract text from the images, you have to use an OCR library.
There are two good OCR options: tessnet and MODI.
Link to thread on stack
I can fully recommend MODI, which I am using now. There are some samples on CodeProject.
EDIT 2 :
If you don't want to read text from the extracted images, you should write a new PDF document and put all of the images into it. For writing PDFs I use MigraDoc. That library is not difficult to use.

How to create a word document using html written in C#

I'm creating a C# application that has to create a Word document.
I'm using Microsoft.Office.Interop.Word to do this and I've successfully managed to output some Word documents, but creating the content through code is very time-consuming work.
I noticed that Word is able to open HTML pages and show them as normal content, so I created a simple test table in HTML and inserted it into the Word document. But when I output the document the obvious happened: the tags were still there! Word did not interpret the tags as HTML. It just output exactly what I put in there.
How can I tell Word to reformat the text as HTML?
edit: (through the C# code of course)
edit 2: Please note that I'm parsing through some data to make this, so I will end up with about 4 pages of the same table/HTML, and I will need to be able to tell Word to start on the next page each time I've finished a loop. So an HTML-only method will probably not work.
If you're only wanting to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.
Word will open that just fine.
If you need to add a page break, you can use a CSS page-break-before, like so:
<br style="page-break-before: always;"/>
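A minimal sketch of that approach in C#, assuming each "page" is already an HTML fragment (for example one table per loop iteration). The class and method names are invented for this example; only the page-break markup comes from the answer above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class HtmlDocWriter
{
    // Joins HTML fragments into one document, forcing a page break before
    // every fragment except the first.
    public static string BuildHtml(IEnumerable<string> pageFragments)
    {
        var sb = new StringBuilder();
        sb.Append("<html><body>");
        bool first = true;
        foreach (string fragment in pageFragments)
        {
            if (!first)
            {
                sb.Append("<br style=\"page-break-before: always;\"/>");
            }
            sb.Append(fragment); // inserted as-is: the caller supplies valid HTML
            first = false;
        }
        sb.Append("</body></html>");
        return sb.ToString();
    }

    // Writes the HTML to disk; use a path ending in .doc so Word opens it.
    public static void Save(string path, IEnumerable<string> pageFragments)
    {
        File.WriteAllText(path, BuildHtml(pageFragments), Encoding.UTF8);
    }
}
```

For example, HtmlDocWriter.Save("report.doc", tables) would write one table per page, where tables is whatever collection of HTML strings your loop produced.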
If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when:
you paste HTML from the Clipboard
open/insert HTML from a file
So, this answer looks like it provides a clipboard-based solution: Adding html text to Word using Interop
However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.
As requested by the OP, and to make it easier for others to find this solution, here is the answer I posted as a comment (plus extra results from testing):
When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:
In "Web design" view, page breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens HTML files in Web design view by default (which makes sense). You need to print the document or switch to some other view (typically "Print design") to see your breaks in all their glory.
So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite the extension).
Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>
Don't build the document in code. Create it in Word as a template or mail merge template and then use code to merge or replace the field data.
See this answer here
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge
And See this from the mothership:
http://msdn.microsoft.com/en-us/library/ff433638.aspx
If you don't want to use an external lib, Interop is too slow for you and neither pure HTML nor mail merge template are flexible enough, you could write your content as text or HTML into one or more files (using C#), create a VBA macro in a Word document which by itself creates a second Word document, reads the content files and does any formatting you want afterwards.
You can run this macro programmatically by starting Word using the command line switch /m.
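A minimal sketch of launching Word with the /m switch from C#. The document path and macro name are hypothetical, and the code assumes that winword.exe can be resolved by the shell on the machine where it runs.

```csharp
using System.Diagnostics;

public static class WordMacroLauncher
{
    // Builds the start info for: winword.exe "<document>" /m<MacroName>
    public static ProcessStartInfo BuildStartInfo(string documentPath, string macroName)
    {
        return new ProcessStartInfo
        {
            FileName = "winword.exe", // assumption: resolvable via the shell
            Arguments = "\"" + documentPath + "\" /m" + macroName,
            UseShellExecute = true
        };
    }

    // Starts Word, which opens the document and runs the named macro.
    public static Process Run(string documentPath, string macroName)
    {
        return Process.Start(BuildStartInfo(documentPath, macroName));
    }
}
```

A call such as WordMacroLauncher.Run(@"C:\docs\builder.docm", "BuildReport") would open the macro-enabled document and run BuildReport; both names are placeholders for your own file and macro.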
Another possible approach, if your html is xhtml (i.e. XML compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.
If you don't have to use HTML as the starting point you could simply build the Word XML document yourself rather than using XSLT, which would be easier. Time consuming but possible - it's something I do quite a lot in my work.
If a third party component is an option I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy but everything works as one would expect.

Finding hyperlinks inside a PDF document?

I'm currently using Aspose PDF Kit to split a 'master PDF' up into individual documents + thumbnails. This works well at the moment, but the device I'll be rendering the PDF on won't know about the annotations/links within the PDF.
I understand there is a way to parse the PDF document to detect the X/Y position of a hyperlink etc. Is there a simple way to extract/iterate across the document data so I can write it to an external XML file?
You may want to try Docotic.Pdf library for this (disclaimer: I work for Bit Miracle).
The library can be used to retrieve all hyperlinks in a document. You may retrieve bounding box, text and other properties of a link, too.
Please take a look at "Extract text from link target" sample. It may help you to get started.

For eBook which control i should use and how to start

I want to develop an application for reading books. Currently I am using a RichTextBox in a flow document. I don't want to use scrolling; I prefer page-style navigation, i.e. previous page, next page, start, and end. A book may contain images, tables, and so on. How do I import books into my application?
How can I achieve this?
Regards,
Aamir
You want to look at the FlowDocument, which is meant for documents with pages.
You'll need to write code to extract the text from the Word or PDF document.
You can get the text out of a Word document using Word automation.
For PDF files you can possibly use the iTextSharp library:
http://itextsharp.sourceforge.net/
For other formats you might be able to use the source of FBReader as a sample:
http://www.fbreader.org/downloads.php
