I have been searching for help with this for a long time. If it's been answered here already, I cannot find it. I'm using C# on a Windows Form.
I'm trying to create a simple program that allows me to open a PDF, flatten any layers within it, and then, for each click of the mouse, draw a circle.
Centered within each circle I need a number, beginning with "1" and increasing sequentially with no fixed upper limit (could be 1, could be 15000).
Finally, I need to be able to save and print the final result.
There are other things I need to add, but if someone can get me started with this, I should be able to figure out the rest on my own.
I've been able to import the .pdf. However, every tutorial I've found for creating a transparent layer to draw on never lets me see the PDF behind it. Do I even need this transparent layer, or can I draw directly on the PDF? My second biggest problem is working out how to draw the circle, with its sequentially increasing number, wherever I click the mouse.
Thanks in advance for any help.
Please see the image below for what it should look like.
You can do that with Atalasoft's DotImage/DotPdf/DotAnnotate packages (disclaimer: I used to work on these products up until 3 years ago). There are a number of ways to do this. If you don't mind the markup being an annotation, you can make an annotation with a custom appearance and add each one to the document.
If you care that the numbers get added to the document, you can use DotPdf directly to append content to the content stream of any page.
As for how to do this on your own: good luck (seriously - this is not an easy problem to solve).
Here's what you need to be able to do first (at a minimum) for putting new content into an existing page:
Append changes (generations) to a PDF
Create content streams
Append a replacement page object for an existing page with the Contents changed from a stream to an array (or, if it's an array already, append to it) and with new resources added to the Resources dictionary
First, let's talk about the PDF rendering model:
PDF uses a small, non-Turing-complete RPN language to place content on the page. A given page has one or more streams of code which get executed in order to render the page. If you want to add content on top, you need to render it last (there are other, more complicated ways to do this, but this is good enough). That means you either append to the existing content stream (I wouldn't recommend it), or you take advantage of the fact that a PDF page can have any number of content streams on it: you make a new stream and add the code to render the content that you want.
I'll warn you ahead of time that rendering text is non-trivial, especially if you have to embed fonts or use Unicode encoding. I'll also warn you that there is no "circle" primitive in PDF. You have to approximate it with Bezier curves.
I'm studious in my laziness, so I created abstractions to make it easier to create content streams correctly. For example, I made a class called a drawing surface, and I could tell it "set the drawing style to this", "place a 'shape' here", "draw text here", and so on. When it was told to render, it would generate the matching PDF rendering program. On top of this, I had another abstraction that consisted of a drawing list of higher-level objects; when rendered, the drawing list would write to the drawing surface, which would in turn write PDF.
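To make the warnings above concrete (no circle primitive, text needs care), here is a minimal sketch of the operators a new content stream could contain for one numbered circle. It is written in Python only to keep it short; the arithmetic carries over to C# unchanged. Everything named here is hypothetical: the function names, the font resource name `F1` (assumed to be mapped to the standard Helvetica font in the page's Resources dictionary), and the rough `0.35` baseline offset used for vertical centering. It only produces the operator text; wrapping that in a stream object and appending it to the file is the hard part described above and is not shown.

```python
KAPPA = 0.5522847498  # 4/3 * (sqrt(2) - 1): control-point offset for a quarter circle
HELVETICA_DIGIT_WIDTH = 0.556  # advance width of each digit in standard Helvetica, in ems


def circle_ops(cx, cy, r):
    """Approximate a circle with four cubic Bezier segments, then stroke it."""
    k = KAPPA * r
    segments = [
        (cx + r, cy + k, cx + k, cy + r, cx, cy + r),  # right -> top
        (cx - k, cy + r, cx - r, cy + k, cx - r, cy),  # top -> left
        (cx - r, cy - k, cx - k, cy - r, cx, cy - r),  # left -> bottom
        (cx + k, cy - r, cx + r, cy - k, cx + r, cy),  # bottom -> right
    ]
    ops = [f"{cx + r:.3f} {cy:.3f} m"]  # move to the rightmost point
    for seg in segments:
        ops.append(" ".join(f"{v:.3f}" for v in seg) + " c")
    ops.append("h")  # close the path
    ops.append("S")  # stroke it
    return ops


def label_ops(cx, cy, number, font_size, font_name="F1"):
    """Place a number roughly centered on (cx, cy); assumes Helvetica digit widths."""
    text = str(number)
    width = len(text) * HELVETICA_DIGIT_WIDTH * font_size
    x = cx - width / 2
    y = cy - 0.35 * font_size  # rough guess at half the digit height
    return [
        "BT",
        f"/{font_name} {font_size} Tf",
        f"{x:.3f} {y:.3f} Td",
        f"({text}) Tj",
        "ET",
    ]


def numbered_circle(cx, cy, r, number, font_size=10):
    """Return content-stream text for one mark, isolated by q ... Q."""
    ops = ["q", "1 w"] + circle_ops(cx, cy, r) + label_ops(cx, cy, number, font_size) + ["Q"]
    return "\n".join(ops)


if __name__ == "__main__":
    print(numbered_circle(100, 100, 12, 7))
```

Wrapping the operators in `q` ... `Q` saves and restores the graphics state, so the mark neither inherits nor leaks line width or transform settings from the page's existing content. A real implementation would take the glyph widths from the actual font rather than a constant.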
Here's what you need to be able to do first (at a minimum) for putting annotations on an existing page:
Append changes (generations) to a PDF
Append a replacement page object for an existing page with an Annots entry added that holds the new annotations (or with the existing Annots modified by inserting/appending the new annotations)
Create appearance streams
Create annotations using the custom appearance streams
As far as how to do the UI, that's oddly straightforward as long as you have a PDF renderer. Render a page into an image and make a control that gives you a mouse click on the page. Then build a transformation matrix that goes from image coordinates to PDF page coordinates and push the mouse coordinates through that matrix. The result will be the origin of your markup on the page (be aware that some pages are rotated, and you will need to adjust your transformation matrix to match).
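A minimal sketch of that matrix, in Python for brevity (the same six numbers work in any language). It makes several assumptions that are mine, not the answer's: the page's MediaBox starts at (0, 0) and no CropBox offset applies, the page was rendered at a uniform `scale` in pixels per point, the image origin is the top-left corner with y pointing down, and `rotate` is the page's Rotate value (a clockwise multiple of 90). The function names are hypothetical.

```python
def image_to_pdf_matrix(page_w, page_h, rotate, scale):
    """Return (a, b, c, d, e, f) mapping image pixels to PDF user space.

    page_w, page_h: unrotated page size in points.
    rotate: page rotation in degrees clockwise (0, 90, 180 or 270).
    scale: pixels per point used when the page was rendered.
    """
    inv = 1.0 / scale
    table = {
        # x = px/s,       y = H - py/s
        0: (inv, 0.0, 0.0, -inv, 0.0, page_h),
        # x = py/s,       y = px/s
        90: (0.0, inv, inv, 0.0, 0.0, 0.0),
        # x = W - px/s,   y = py/s
        180: (-inv, 0.0, 0.0, inv, page_w, 0.0),
        # x = W - py/s,   y = H - px/s
        270: (0.0, -inv, -inv, 0.0, page_w, page_h),
    }
    return table[rotate % 360]


def apply_matrix(m, px, py):
    """Push one point through the matrix (PDF convention: x' = a*x + c*y + e)."""
    a, b, c, d, e, f = m
    return (a * px + c * py + e, b * px + d * py + f)


if __name__ == "__main__":
    # US Letter page rendered at 2 pixels per point, not rotated.
    m = image_to_pdf_matrix(612, 792, 0, 2.0)
    print(apply_matrix(m, 0, 0))  # top-left pixel of the image
```

For an unrotated page the only real work is the y flip, since image y grows downward and PDF y grows upward. The rotated cases swap or negate axes so that a click still lands on the same spot of the unrotated page description.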
Now, to be clear, when I wrote this library for Atalasoft, I already had several years of PDF experience (I worked on Acrobat v 1 - 4). While I wasn't working on the library full time, it was written over the span of 10 years. The code to append to an existing PDF took several months of time to get right. Eventually, I shed that code because of complications in appending anything but simple changes (like annotations) and wrote code that could rewrite an entire PDF with updates to existing content (page reordering, annotations, bookmarks, new content on a page, edited images, etc), while simultaneously shedding anything that is no longer needed. This is akin to adding in and clipping out sections of a directed graph with cycles and being able to ensure that you have a correct graph on the other end.
The hard part wasn't working within the specification - that's fairly straightforward for me. The problem was dealing with cockamamie PDF generated by other tools that had all kinds of bizarro spec violations and handling that correctly.
Now, I'm not saying don't do this. I'm all for people learning new things and learning about PDF. There's a lot there to learn and a lot of interesting ideas, but you need to be aware that simple sounding problems in PDF space aren't trivial unless you have a great deal of infrastructure in place. For example, "how many pages are in a PDF?" requires a PDF scanner and code to execute or parse PDF content so you can read in the cross reference table (which may be a compressed cross reference stream), the document dictionary (which may include encryption), and finally the page tree: all of which can be easily derailed by non-compliant PDF.
If you're trying to balance time and cost, remember that your time is far from free and maybe a library to do the heavy lifting is not a bad thing. iTextPdf is open source and can do all of the things you need to do. It will cost you time to learn the library, but that's a huge savings over having to write PDF tools on your own. Atalasoft's code is not free, but was written to have a much shallower learning curve than most libraries.
Related
I was wondering whether it would at all be possible to have our creative department design a nice-looking PDF template for our client, e.g. a fancy letterhead, then supply it to me so I could inject various types of content into the body using PDFSharp or MigraDoc.
Currently we generate the header and footer content as part of the rendering process, and it works very well, but as you can imagine, any non-trivial layout and styling is pretty complicated to pull off in what is essentially a 2D graphics environment.
So the thought arose as to whether one of these tools would be able to take a pre-existing PDF, give me access to various objects, and allow me to e.g. replace certain text placeholders or manipulate the PDF "DOM" in a more intelligent fashion.
Something similar to working with Spreadsheets (binary and XML versions) or OpenXML, etc.
What we do: take an existing PDF page, draw it at the bottom (Z axis) of a new PDF page, and then use MigraDoc to add other contents to the page.
PDFsharp can also be used to draw on top.
The template PDF pages are used like letter heads with the corporate design of a customer and the final document will have as many pages as needed.
Hello I have a chart that I need to have the system review and give results...
Chart image located here....
example chart .pdf http://imageshack.us/photo/my-images/651/scorecardchartexample.gif/
--Assume the chart is in .PDF and the text is renderable, i.e. "highlight-able".
--Assume the chart is placed on the page exactly the same way and same position every time
--Assume the chart can change - that is to say, I need to be able to upload 1000 of these charts, all following the exact same format but with some alternate info from chart to chart.
--Assume VAST expertise in .NET - and little expertise in actual text interpretation.
--Assume expertise in interpreting .PDF that have editable fields...I am already doing this, this is limited to .PDF's I created and was able to place values on each field etc.
--Assume this chart is only deliverable in a single text renderable .PDF - that is to say - we interact with a website that creates this chart - this website has no API to interact with, we must print to PDF this chart from the webpage and that is all we can do...(government website)
Using a .NET system, I need to create a program...or incorporate an existing application into my .NET system, that will review this chart and will be able to tell what each "X" represents...that is to say an "X" one inch to the left or in the next row is an indicator of a different result (refer to chart)
I need the program to perform its search and return results based on the trigger of the .PDF document hitting a folder or whatever. This part we can handle, assuming we are creating the program from scratch...otherwise we will be limited to interacting with an existing app as needed.
We are open to a variety of strategies. Assuming such a class or object exists, we were thinking of reading text based on its location in the document, an X,Y sort of thing. Another desirable route would be some sort of stringBuffer (assume C#), but it would need to navigate the chart gridlines and count white space to accurately interpret the position of each "X" and what that "X" means based on its placement. A third option would be something we are unaware of.
If something exists and is tried and true, that of course would be best. In that case, any tips on interfacing with it using .NET and C# would be welcome.
Thank you all very much in advance Code Gawds!
Reel
OK, we found some software called ClearImage. It wasn't cheap, but it is pretty neat. It will analyze any image in the same fashion Adobe PDF analyzes a document to find form fields. After ClearImage does that, it gives you a list of "blobs"; you then get to dictate what each blob means and give it a unique identifier. This allows for automatic value declaration based on "blob" placement in the image.
It also lets you sort of "fingerprint" an image, so that if the same image shows up again it can be recognized. In my case we have 3 different templates for the chart, and each chart will differ because of different charting, but ultimately every chart from a given template has the same layout. This has helped our system identify which chart has been entered and then, after that first check, move on to analyzing each blob.
Anyway, it's worth a look if anyone else should come across this question and is in need of this type of function. I didn't want to leave it unanswered. I may update this as we learn more about it. I know this isn't exactly a coding question, but this type of task is coding intensive, and anyone looking to perform the same task may find their way here. I will endeavor to update in the spirit of stackoverflow with comments relating to integration and objects etc.
Should anyone have more questions about this software in relation to coding, you can ask here or post a new question; we will be happy to post the code (methods, classes, objects etc.) we used (in C#) to integrate it into our programs.
I have a few PDF files that were created from Word or Excel files.
I need to get the information that's in the tables.
The text in the document is not an image so I'm able to extract the text using tools such as pdfbox.
When I have the text I have no way of knowing what cells in the table it belongs to because I don't know where the table borders are.
I've tried a few desktop tools such as ABBYY or Solid PDF Converter, and they are able to convert the files into nice Word documents, but this doesn't suit my needs as I want to be able to do this programmatically in C#.
Some of the tables have nested tables, which I think makes this a little bit more difficult.
I appreciate your help
The difficulty here is caused by the fact that the text in the PDF is not contained within any table. It might look like it is, but underneath the surface, it is not.
So there are a couple of options that I can think of. But none of them are going to be quite as satisfying as you'd probably like.
There are some companies that offer SDKs for PDF to Excel/Word conversion. Investintech and Iceni are a couple of examples. But these solutions are not free.
If you know the exact layout of the PDF files that you need to extract the table data from, then you can use any SDK that lets you extract text from a PDF and also tells you the exact co-ordinates of the extracted text. Using this method you need to know in advance where the text is going to be, so that you can extract text from a specific area on the page. It obviously won't work if you need to process any random document.
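As a rough illustration of that coordinate-based approach, the sketch below assumes some SDK has already handed back text fragments as `(x, y, text)` tuples in PDF page coordinates (y increasing upward), and that the column and row boundaries of the table are known in advance. The tuple shape, the function names, and the boundary values are all hypothetical; it is in Python for brevity, and the bucketing logic ports directly to C#.

```python
from bisect import bisect_right


def locate(edges, value):
    """Return the index of the band [edges[i], edges[i+1]) holding value, or None."""
    i = bisect_right(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None


def fragments_to_grid(fragments, col_edges, row_edges):
    """Bucket (x, y, text) fragments into a grid using known cell boundaries.

    col_edges and row_edges are ascending lists of page coordinates.
    Row 0 of the result is the top row of the table (the highest y band).
    """
    n_rows = len(row_edges) - 1
    n_cols = len(col_edges) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, text in fragments:
        col = locate(col_edges, x)
        band = locate(row_edges, y)
        if col is None or band is None:
            continue  # fragment lies outside the table area
        row = n_rows - 1 - band  # flip so the top of the page comes first
        grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


if __name__ == "__main__":
    fragments = [
        (60, 625, "Name"), (160, 625, "Score"),
        (60, 605, "Alice"), (170, 605, "X"),
        (400, 605, "stray"),  # outside the table, ignored
    ]
    for row in fragments_to_grid(fragments, [50, 150, 250], [600, 620, 640]):
        print(row)
```

The same bucketing would tell you which cell an "X" mark falls in on a fixed-layout chart. It breaks down as soon as the layout shifts between documents, which is the limitation described above.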
It's a difficult task, but hopefully this will give you a starting point.
I am trying to implement a feature where I open a multi-page PDF file (say, in an iframe), highlight a section of the document, and get the page number (the one that is displayed in the PDF toolbar).
E.g., if the toolbar displays 2/7, which means I am currently on page 2, I need to capture that page number. It sounds simple, but I have not been able to find a .dll/function that exposes this property.
Any help would be appreciated. Thanks.
I don't think this is possible: there's no way to control PDFs with JavaScript in the browser, which is what you'd need to do.
This article suggests the same: http://codingforums.com/showthread.php?t=43436.
Content of link:
in short, no, you can't do that.
really don't think JS can read properties of PDFs, since PDFs are viewed in the browser thru a plugin, ie a viewport for another application (for want of a better explanation).
You may be better trying a different route, such as generating the pages as images and implementing your own paging. Depends on your content and requirements, of course. ABCPDF from http://www.websupergoo.com/ is free (with a link-back), not sure if that's any help for you.
I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change text if there are AcroFields, but my PDF doesn't have any. It just has plain text, and I need to change some of it.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is that you set up a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form.)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.