I am implementing a web service that can modify the content of PDF files, and I am having trouble locating specific PDF elements (e.g. a piece of text, an image, etc.). Right now I am able to get the location of any element (as coordinates, such as the left and top values). So if I want to change the content of a piece of text, I can put a white box on top of the original text and add the new text on top of the box, which seems like a stupid way to do it.
I checked this post (https://developers.itextpdf.com/examples/stamping-content-existing-pdfs-itext5/replacing-pdf-objects), which stores texts as keys in a dictionary. The problem with this is that if there are several texts with the same content, all of them will be changed.
Also there is another post (https://developers.itextpdf.com/question/how-use-text-extraction-strategy-after-applying-location-extraction-strategy) that can extract content based on locations. That is something close to what I want. My question is, given a location of a text, am I able to locate it and change the content of that text object?
Related
I'm looking for a way to get the result of an overlay between two pdf documents.
We have a document with a single page and only a header, and another document with multiple pages and full content (header and body). We're looking for a way to generate an overlay PDF from those documents, so that in the resulting document the header on each page is overwritten with the header from the single-page document. Basically like this:
Is there an open-source C# library that can handle this without converting the text to a picture?
I looked at PdfSharp and docnet, but couldn't figure it out with either of them.
So far we are using PDFBox, but we'd like to get rid of the Java dependency.
Simple solution with PDFsharp: draw a white rectangle that hides the original header, then draw the new header on top of the rectangle.
Drawback: The old header is still contained in the document.
I'm trying to split an SVG into many different SVG files, each one containing one element of the original file.
My main problem is not the split itself: for each element I have to split out, I get the SVG element, remove everything else from the original SVG, add the element back into the main layer, and then save the file with the name I want. This works fine.
The real problem is that each file I create has the view centered and dimensioned as in the original svg file. So usually it's misplaced and wide (since in the original file it would contain all elements that now are split into different files).
So I need to resize the canvas to the element that remains in the file.
This very function is done by inkscape with the command
inkscape --verb=FitCanvasToDrawing --verb=FileSave --verb=FileClose
But unfortunately this verb doesn't work in --without-gui mode, so if I call it from code I see thousands of Inkscape instances opening, fitting the canvas, then saving and closing the window. It works, but it's not good for a batch application (they have had this "bug" for quite some years).
So I resorted to using the SVG engine (https://github.com/vvvv/SVG), but it has a bug that prevents the correct calculation of the element bounds (https://github.com/vvvv/SVG/issues/331), so I cannot set the viewBox of the svg element to the right values.
Any suggestion on how to calculate the right position and size? Or any other library (any language that can run in a batch) that works for that?
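One way to compute the position and size without a GUI is to work out the bounding box from the element attributes directly. The following is only a minimal sketch in Python using the standard library, under strong assumptions: it handles only rect, circle, ellipse and line elements, expects unitless numeric attributes, and ignores transforms, stroke widths, paths and text. The function names (element_bbox, fit_viewbox) and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep plain <svg> tags on output


def _num(el, name):
    # Assumption: attributes are unitless numbers ("12.5", not "12.5px").
    return float(el.get(name, "0"))


def element_bbox(el):
    """Return (min_x, min_y, max_x, max_y) for simple shapes, else None."""
    tag = el.tag.split("}")[-1]
    if tag == "rect":
        x, y = _num(el, "x"), _num(el, "y")
        return (x, y, x + _num(el, "width"), y + _num(el, "height"))
    if tag == "circle":
        cx, cy, r = _num(el, "cx"), _num(el, "cy"), _num(el, "r")
        return (cx - r, cy - r, cx + r, cy + r)
    if tag == "ellipse":
        cx, cy = _num(el, "cx"), _num(el, "cy")
        rx, ry = _num(el, "rx"), _num(el, "ry")
        return (cx - rx, cy - ry, cx + rx, cy + ry)
    if tag == "line":
        x1, y1 = _num(el, "x1"), _num(el, "y1")
        x2, y2 = _num(el, "x2"), _num(el, "y2")
        return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
    return None  # unsupported element (path, text, group, ...)


def fit_viewbox(svg_text):
    """Shrink viewBox/width/height to the union of the supported shapes."""
    root = ET.fromstring(svg_text)
    boxes = [b for b in (element_bbox(el) for el in root.iter()) if b]
    if not boxes:
        return svg_text  # nothing measurable, leave the file alone
    min_x = min(b[0] for b in boxes)
    min_y = min(b[1] for b in boxes)
    width = max(b[2] for b in boxes) - min_x
    height = max(b[3] for b in boxes) - min_y
    root.set("viewBox", f"{min_x:g} {min_y:g} {width:g} {height:g}")
    root.set("width", f"{width:g}")
    root.set("height", f"{height:g}")
    return ET.tostring(root, encoding="unicode")


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="1000" height="800" '
    'viewBox="0 0 1000 800"><g><rect x="120" y="340" width="60" '
    'height="25"/></g></svg>'
)
fitted = fit_viewbox(sample)
```

For real drawings with paths, nested transforms or strokes, a renderer that computes true bounds is still needed; this sketch only shows the viewBox arithmetic once the bounds are known.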
I have created a Word document in which I have inserted some images, added hyperlinks to these images, and converted the document to PDF. Is there any way to find the position of the image that has a specific hyperlink using the iTextSharp library? I have found solutions that can return the image or the hyperlink text, but that's not exactly what I need.
My end goal is to find the image with a specific URL and delete it (along with the associated URL) while saving its location (have to save x, y, height, and width before deletion).
Thank you.
You have found solutions that can return:
the image and its position,
the hyperlink and its position.
And that's exactly what you need. Now compare the positions of the image with the positions of the hyperlinks and you'll know which image corresponds with which link.
You are asking to find images with a specific URL, but there is no such thing in PDF. In a PDF, each page is described using a page dictionary. In this page dictionary, there is:
an entry named /Contents (required): this refers to the content stream(s) of the page and the content stream(s) contain references to images (stored as /XObject in the /Resources entry of the page dictionary).
an entry named /Annots (optional): this refers to all the annotations that are added on top of the content. Hyperlinks are stored in link annotations.
Links are not aware of the content they cover. Content is not aware of the annotations that cover it. That's why you didn't find an answer to your question: you've been making the wrong assumptions about clickable images.
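The position comparison suggested above can be sketched as a plain rectangle-overlap test. This is a minimal sketch in Python, not iTextSharp code: it assumes you have already extracted the image bounding boxes and the link annotation rectangles (the /Rect entry) as (llx, lly, urx, ury) values in the same page coordinate space. The names Rect, overlap_area and match_links_to_images, and the sample data, are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    llx: float  # lower-left x
    lly: float  # lower-left y
    urx: float  # upper-right x
    ury: float  # upper-right y


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0.0 if disjoint)."""
    w = min(a.urx, b.urx) - max(a.llx, b.llx)
    h = min(a.ury, b.ury) - max(a.lly, b.lly)
    return w * h if w > 0 and h > 0 else 0.0


def match_links_to_images(links, images):
    """Map each link URL to the image it overlaps most, or None.

    links:  dict of URL -> Rect (from the link annotations)
    images: dict of image name -> Rect (from the page content)
    """
    result = {}
    for url, link_rect in links.items():
        best_name, best_area = None, 0.0
        for name, image_rect in images.items():
            area = overlap_area(link_rect, image_rect)
            if area > best_area:
                best_name, best_area = name, area
        result[url] = best_name
    return result


# Hypothetical positions, as extracted from one page.
images = {
    "Im1": Rect(72, 600, 172, 700),
    "Im2": Rect(72, 400, 272, 500),
}
links = {
    "https://example.com/a": Rect(72, 600, 172, 700),
    "https://example.com/b": Rect(70, 398, 274, 502),
    "https://example.com/c": Rect(300, 100, 350, 120),
}
matches = match_links_to_images(links, images)
```

Picking the largest overlap rather than requiring identical rectangles is a design choice: the link rectangle written by the converter may be slightly larger or smaller than the image it covers.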
I have a client who is asking me to add a fixed-width (510-character) header record to a PDF file. They have asked that I create a new page (not a problem) on which I write this fixed-width header record.
I can do this, and see the header record as page 1, followed by the original PDF. The problem is white space. The 510-character fixed-width header is about 60% white space, and all the ways I've tried generating the PDF cause this to be truncated. There are also line breaks where the text wraps. The client wants to be able to use some OCR software they have purchased in order to read this header file from page 1.
I know very little about PDF file format. I've tried using ABCpdf, PDFsharp, and also created an RDLC and bound it to this header string and then generated a PDF from that. All 3 resulted in the same outcome.
Let me say I know how crazy this sounds, but it's what the client is requesting. I proposed several other ways in which we could solve their problem, but this (right now) is the only one they are comfortable with. They are not comfortable with me just appending the 510 characters onto the byte array and having them separate it out programmatically.
Are you looking to have a page displaying the long header? You can create a PDF page of any size (for example, print to PDF with a custom page size of 20" wide by 6" tall; weird but possible).
Once that page is created, it can be inserted into another document of regular letter size pages.
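To judge how wide such a page has to be for the record to fit on one line without wrapping, the arithmetic is simple with a monospaced font. A rough sketch in Python, assuming the standard Courier font (whose glyphs are all 600/1000 em wide) and a hypothetical half-inch margin on each side; the function names are made up for illustration.

```python
POINTS_PER_INCH = 72
COURIER_ADVANCE_UNITS = 600  # every Courier glyph is 600/1000 em wide


def line_width_pt(num_chars, font_size_pt):
    """Width in points of num_chars Courier characters on one line."""
    return num_chars * COURIER_ADVANCE_UNITS * font_size_pt / 1000


def max_font_size_pt(num_chars, page_width_in, margin_in=0.5):
    """Largest Courier size at which num_chars fit on one line."""
    usable_pt = (page_width_in - 2 * margin_in) * POINTS_PER_INCH
    return usable_pt * 1000 / (num_chars * COURIER_ADVANCE_UNITS)


width_at_10pt = line_width_pt(510, 10)             # 3060.0 points
width_in_inches = width_at_10pt / POINTS_PER_INCH  # 42.5 inches
size_for_20in = max_font_size_pt(510, 20)          # roughly 4.47 points
```

So a 20" wide page holds the whole 510-character record on one line only at a font size of about 4.5 pt, which may be small for OCR; a wider page or several fixed-length lines are the alternatives.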
Are you looking for consecutive pages displaying chunks of the header?
Using OCR to read content that you put in yourself is overkill. Instead of rendering the 510-character header as text, render it as single-character form fields. That way it will be easy to access those form fields by name and retrieve the values using the same PDF library with which you created the PDFs.
I'm trying to replace a section of a PDF with different text. From research on all major PDF libraries for .NET, it seems this is complicated and not a trivial task. I think it may be easier to convert the PDF to an image, replace the text (always in the same place), then convert it back to a PDF (or leave it as an image if converting back isn't possible). Is it possible to extract an image from a PDF page with .NET?
If your text is in a known location, you can simply cover it with a rectangle filled with the background color, and then draw your text over top.
Note that the text will still be there, it simply won't be visible. Someone selecting text will still pick up the old stuff. If that's acceptable, it's quite trivial.
If the PDF was created from an image, you can import it into Photoshop and edit it as a graphic. Or you can use a screenshot program like "Snagit" to capture the PDF page as an image and use Snagit's editor to erase the old text and add the new text.
The problem with this method is that the newly added text may not use the same font as the text around it. Personally, I use a PDF editor to replace text in a PDF, since the added text is automatically matched to the original font and size.