I have the below as part of a web application(asp.net) is there any way to convert it to excel? The problem is printing it. Landscape is the desired format and persons in the organization are very novice so to improve usability i want to allow it to be in landscape. I tried activex and sendkeys commands. This works but not what i desire..
Please help....
Two ideas:
(1) Copy + paste. If there is too much noise then try (2).
(2) Copy the HTML into a text file, read the text file into a program, have the program parse the HTML in the way that you need it. From there you should be able to format it as needed or just flat out copy/paste it into Excel.
Are your pages table-driven (<table><tr>...) and not laid out with CSS? If so, this has been asked and answered before. Excel can consume tabular HTML like a champ.
Related
I use ITextSharp to create a PDF with form data based on another PDF.
The problem is the file generated is not editable (the form on it).
If I use ITextSharp in append mode, I get the form editable but most of the form data is not preserved. I want the user to see the resulted PDF with the PDF Form data preserved.
I understand there is NOTHING I can do. The only way for the user to edit the resulted PDF is to use a paid Acrobat version on it. This is because I CHANGE the PDF file by entering form data and setting fonts on it.
Is there something I can do?
Paul
Your question isn't very clear, but here are some answers to similar questions that have been asked before:
End users can't edit a form locally unless the form is "reader-enabled". Making a form reader-enabled is only possible when you use Adobe software: "Adding Enable for commenting Adobe Reader" using Acrobat
You need to fill out reader-enabled forms in append mode if you don't want to break the reader-enabling: Pdf with Acroform editing using iText
This doesn't mean you can't ask people to fill out a PDF form to gather data. See
Edit pdf embedded in the browser and save the pdf directly to server
You can capture that data, and fill out the form without flattening if you want to serve this form (including the data) to the end user: How to fill out a pdf file programmatically?
I'm pretty sure one of these question is a duplicate of what you're asking, but since your question isn't clear, it's hard to mark your question as an exact duplicate of one of them.
Short answer: No
Pdf file are likely to be secure (read only) and this is why everyone is using it. Most of the time, we convert a file into a pdf so maybe if you can get the 'file' and not the pdf will be a good move there.
From my experience in the past, I can confirm with you that iTextSharp may not convert all your data properly and this can make your generated file unusable. If not, you might have some weird line or some changes in the document behavior (ex. fields are not editable anymore).
If you really want to work with pdf file as input and do your stuff with it, you will need to understand the inner structure of it:
[PDF file format]
http://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This can be a hell of a ride. You might need to re-consider the use of a pdf as input. If you can't change that, you might need to use some sort of adobe pluging to do so. Alot of third party pdf library is doing that.
Good luck
I have PDF document data with table structure format and I would like to convert that PDF file into a text file with the same structure with margin and spaces between text in pdf
You need to write your own PDF tool then. Which is not exactly an easy task. Honestly, 3rd party tools make your job much easier, why don't you want to use one?
If you change your mind, I can suggest iTextSharp. I've used it in the past with great success. Here are some example to get you going:
http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
ps. there are 3 tools used in there.
I'm currently using Aspose PDF Kit to split a 'master PDF' up into individual documents + thumbnails. This works well at the moment, but the device I'll be rendering the PDF on won't know about the annotations/links within the PDF.
I understand there is a way to parse the PDF document to detect the X/Y position of a hyperlink etc, is there an simple way to extract/iterate across the document data so I can write it to an external XML file?
You may want to try Docotic.Pdf library for this (disclaimer: I work for Bit Miracle).
The library can be used to retrieve all hyperlinks in a document. You may retrieve bounding box, text and other properties of a link, too.
Please take a look at "Extract text from link target" sample. It may help you to get started.
I have a few pdf files that were created from word or excel files.
I need to get the information thats in the tables.
The text in the document is not an image so I'm able to extract the text using tools such as pdfbox.
When I have the text I have no way of knowing what cells in the table it belongs to because I don't know where the table borders are.
Iv'e tried a few desktop tools such as abby or solid pdf converter and they are able to convert the files into nice word documents but this doesn't suit my needs as I want to be able to do this programatticly in C#.
Some of the tables have nested tables wich I think makes this a little bit more diffucult.
I appreciate your help
The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.
So there are a couple of options that I can think of. But none of them are going to be quite as satisfying as you'd probably like.
There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
It's a difficult task, but hopefully this will give you a starting point.
I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change if there's any AcroField, but my PDF doen's have any of it. It just has some pure texts and I need to change some of them.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is you setup a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.