c#, check area of pdf


I need to add text to an existing PDF (at the top or bottom of the page) in C#.
I need to make sure that I don't overwrite any visible text or image.
Is there any way to check whether an area of a PDF page contains text, images, controls, etc.? I understand it will not be 100% accurate.

You're going to need a full PDF consumer at the very least, because the only way to find out where the marks are on the page is to parse (and possibly render) the PDF.
There are complications which you haven't covered (possibly they have not occurred to you). What do you consider to be the area of the PDF file? The MediaBox? The CropBox, TrimBox, ArtBox, or BleedBox? What if the PDF file contains, for example, a rectangular white fill which covers the page? What about a /Separation space called /White? Is that white (it generally gets rendered that way on the output) or not? And yes, this is a widely used ink in the T-shirt printing industry, amongst others.
To me the simplest solution would seem to be to use a tool which will give you the BoundingBox of marks on the page. I know the Ghostscript bbox device can do this, and I imagine there are other tools which can do so. But note (for Ghostscript at least): if there are any marks in white (whatever the colour space), these are considered as marking the page and will be counted in the bbox.
The same tool ought to be able to give the size of the various Boxes in the PDF file (you'd need the pdf_info.ps program for Ghostscript to get this, currently). You can then quickly calculate which areas are unmarked.
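As a rough illustration of that calculation, here is a minimal sketch. It assumes you have already run Ghostscript with `-sDEVICE=bbox`, captured its `%%BoundingBox:` line for the page, and obtained the page height in points separately. The class and method names are hypothetical, and the sketch assumes the lower-left corner of the page box is at (0, 0), which is not guaranteed for every PDF.

```csharp
using System;
using System.Globalization;

public static class BBoxMargins
{
    // Parses a line such as "%%BoundingBox: 36 40 576 756" into
    // { llx, lly, urx, ury } in points. Returns false for any other line.
    public static bool TryParse(string line, out double[] box)
    {
        box = null;
        const string prefix = "%%BoundingBox:";
        if (line == null || !line.StartsWith(prefix, StringComparison.Ordinal))
            return false;

        string[] parts = line.Substring(prefix.Length)
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 4)
            return false;

        var values = new double[4];
        for (int i = 0; i < 4; i++)
        {
            if (!double.TryParse(parts[i], NumberStyles.Float,
                                 CultureInfo.InvariantCulture, out values[i]))
                return false;
        }
        box = values;
        return true;
    }

    // Height of the unmarked band at the top of the page, in points.
    // Assumption: the page box starts at y = 0 and is pageHeight tall.
    public static double FreeTop(double[] box, double pageHeight)
    {
        return Math.Max(0, pageHeight - box[3]);
    }

    // Height of the unmarked band at the bottom of the page, in points.
    public static double FreeBottom(double[] box)
    {
        return Math.Max(0, box[1]);
    }
}
```

If the free band at the top or bottom is taller than the text you want to add (plus some padding), the text can go there without touching existing marks, subject to the caveat about white marks above.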
But 'unmarked' isn't the same thing as 'white'. If you don't want to count areas which are painted in 'white', then the problem becomes greater. You really need to render the content and then look at each image sample in the output to see if it's white or not, recording the maxima and minima of the x and y co-ordinates to determine the 'non-white' area of the page.
This is because there are complications like transfer functions, transparency blending, colour management, and image masking, any or all of which might cause an area which is marked with a non-white colour to be rendered white (a transparency SMask, for example) or an area marked with white to be rendered non-white (e.g. a transfer function).
Your question is unclear because you haven't defined whether any of these issues are important to you, and how you want to treat them.
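The render-and-scan approach described above could look something like the following sketch. It assumes the page has already been rendered by some PDF renderer into an 8-bit grayscale array (255 = white); the class name and the `whiteThreshold` default are illustrative assumptions, not part of any library.

```csharp
using System;

public static class NonWhiteBounds
{
    // pixels[y, x] holds 8-bit grayscale samples, 255 = white.
    // Returns { minX, minY, maxX, maxY } of all samples darker than
    // whiteThreshold, or null if every sample counts as white.
    public static int[] Find(byte[,] pixels, byte whiteThreshold = 250)
    {
        int height = pixels.GetLength(0);
        int width = pixels.GetLength(1);
        int minX = width, minY = height, maxX = -1, maxY = -1;

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                if (pixels[y, x] >= whiteThreshold)
                    continue; // treated as white, ignore

                if (x < minX) minX = x;
                if (x > maxX) maxX = x;
                if (y < minY) minY = y;
                if (y > maxY) maxY = y;
            }
        }

        return maxX < 0 ? null : new[] { minX, minY, maxX, maxY };
    }
}
```

Note that the result is in pixel co-ordinates with the origin at the top-left of the rendered image; converting back to PDF points depends on the resolution you rendered at.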

Related

Creating an image in C# then laying out on page

I need to take a spreadsheet and convert it into a price tag. I've done that part but I'm not too sure how to go about making an image that contains both the price and item name (this is all stored in a list), then laying it out on an 8 1/2 x 11 piece of paper.
I read this question here but it's using the size of the text, which may vary depending on the name of the item. The TextBox (or whatever is holding the text) needs to be the same size but have the text scale based on its size.

Take a look at these docs, in particular their example pd_PrintPage function. This takes a PrintPageEventArgs which contains a Graphics object that you can use to actually render your tag.
In particular, to leverage your linked question, there is a DrawImage(Image, Int32, Int32) method that renders the given image at a co-ordinate.
To handle scaling your text, you just need to compare how big your text would be with one font to how big you want it - work out the ratio of width/height, then scale the font you use to render so that it uses the smallest of those ratios. There's a good answer here which shows how to do that.
So:
Handle a print event
Find the right font size
Create your image
Print with your graphics object
I would do a mockup of the resulting code, but I don't have access to a C# IDE at the moment.
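For the font-scaling step, a minimal sketch of the ratio calculation is shown below. It assumes you have already measured the text at some base font size (for example with `Graphics.MeasureString`) and that rendered text size grows roughly in proportion to font size. The class and parameter names are hypothetical.

```csharp
using System;

public static class FontFit
{
    // Returns the font size at which text that measured
    // (measuredWidth x measuredHeight) at baseSize should fit inside
    // (boxWidth x boxHeight), using the smaller of the two ratios.
    public static float ScaledSize(float baseSize,
                                   float measuredWidth, float measuredHeight,
                                   float boxWidth, float boxHeight)
    {
        if (measuredWidth <= 0 || measuredHeight <= 0)
            throw new ArgumentOutOfRangeException(nameof(measuredWidth),
                "Measured text size must be positive.");

        float ratio = Math.Min(boxWidth / measuredWidth,
                               boxHeight / measuredHeight);
        return baseSize * ratio;
    }
}
```

Because the scaling is only approximately linear, it may be worth measuring again at the returned size and shrinking slightly if the text still overflows the box.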

What type of image filtering/processing do mobile PDF scanners use to convert a captured image into a monochrome/black and white image?

I am trying to implement my own monochrome/black and white filter in C# to scan text documents. My approach is to apply a threshold filter on the captured image. However, I often run into the problem that the varying brightness on the image causes a "shadowing effect" on the processed image. Refer to the link below (it is pretty blurry but it should suffice). The image to the far left is the original image. When I apply my threshold filter, I get the same result as the image in the middle; some of the text becomes unreadable because the brightness of the image varies, so some portions become really black or really white. However, with the right filter, you can obtain the processed image to the right where everything looks crystal clear.
https://www.google.dk/search?q=monochrome+image+processing&espv=2&biw=1706&bih=859&source=lnms&tbm=isch&sa=X&ved=0ahUKEwir8vXlhIzPAhUFiywKHeSBC1wQ_AUIBigB#imgrc=4UTzoIpyqTkwrM%3A
I would like to know what the process is to obtain the image to the far right. Another example can be seen in the image below. It shows a sample mobile PDF scanner in use. Scanning the image results in a very nice black and white image, where the text can be easily read and no "shadowing" occurs on the image. Does anyone know what this process is or what it is called? It is very often used in mobile PDF scanning applications. Thank you in advance.
EDIT: The filter is called "Adaptive Thresholding". You can use the BradleyLocalThresholding class to implement the filter, or you can write it yourself (which is what I did). Please refer to my response to the comment by Yves Daoust down below.
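For readers who want to write it themselves, here is a minimal sketch of adaptive thresholding in the style of Bradley's method, using an integral image so that each local mean is cheap to compute. This is not the original poster's code and is not claimed to match the BradleyLocalThresholding class; the class name, the `window` parameter and the 15 percent default are assumptions chosen for illustration.

```csharp
using System;

public static class AdaptiveThreshold
{
    // gray[y, x] holds 8-bit grayscale samples.
    // Returns true for white (background) and false for black (ink).
    // A pixel is black when it is more than `percent` percent darker
    // than the mean of the window around it.
    public static bool[,] Apply(byte[,] gray, int window, double percent = 15.0)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);

        // integral[y + 1, x + 1] = sum of gray over rows 0..y, columns 0..x
        var integral = new long[h + 1, w + 1];
        for (int y = 0; y < h; y++)
        {
            long rowSum = 0;
            for (int x = 0; x < w; x++)
            {
                rowSum += gray[y, x];
                integral[y + 1, x + 1] = integral[y, x + 1] + rowSum;
            }
        }

        var white = new bool[h, w];
        int half = Math.Max(1, window / 2);
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                // Clip the window to the image borders.
                int x0 = Math.Max(0, x - half), x1 = Math.Min(w - 1, x + half);
                int y0 = Math.Max(0, y - half), y1 = Math.Min(h - 1, y + half);
                long count = (long)(x1 - x0 + 1) * (y1 - y0 + 1);
                long sum = integral[y1 + 1, x1 + 1] - integral[y0, x1 + 1]
                         - integral[y1 + 1, x0] + integral[y0, x0];

                white[y, x] = gray[y, x] * count > sum * (100.0 - percent) / 100.0;
            }
        }
        return white;
    }
}
```

Because each pixel is compared with its own neighbourhood rather than one global value, a gradual brightness change across the page does not push whole regions to black or white.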

You need two ingredients.
One is "background reconstruction", i.e. retrieving the intensity of the white sheet "under the characters", for instance by morphological opening.
The other is "shading correction", i.e. compensating the unevenness of the background illumination by comparing to the reconstructed background, for instance by subtraction.
This will "flatten" the image, making it perfectly amenable to global thresholding.

A simple method is to convert the image to grayscale and then convert it to B/W using an error diffusion algorithm such as Floyd–Steinberg dithering.
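A minimal sketch of Floyd–Steinberg error diffusion on a grayscale array follows, assuming the image has already been converted to 8-bit grayscale. The class name and the fixed threshold of 128 are illustrative assumptions.

```csharp
using System;

public static class Dither
{
    // gray[y, x] holds 8-bit grayscale samples.
    // Returns true for white and false for black.
    public static bool[,] FloydSteinberg(byte[,] gray)
    {
        int h = gray.GetLength(0), w = gray.GetLength(1);

        // Work on a copy so diffused error can push values outside 0..255.
        var work = new double[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                work[y, x] = gray[y, x];

        var white = new bool[h, w];
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                double old = work[y, x];
                double quantized = old >= 128 ? 255 : 0;
                white[y, x] = quantized == 255;

                // Spread the quantization error over unvisited neighbours
                // with the usual 7/16, 3/16, 5/16, 1/16 weights.
                double err = old - quantized;
                if (x + 1 < w) work[y, x + 1] += err * 7 / 16;
                if (y + 1 < h)
                {
                    if (x > 0) work[y + 1, x - 1] += err * 3 / 16;
                    work[y + 1, x] += err * 5 / 16;
                    if (x + 1 < w) work[y + 1, x + 1] += err * 1 / 16;
                }
            }
        }
        return white;
    }
}
```

Dithering preserves the overall brightness of each region as a pattern of dots, so on unevenly lit scans it will reproduce the shading as a dot pattern rather than remove it.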

Measure the height of DrawHTMLText before drawing on document with Debenu Quick PDF Lib

I am using the latest version of Debenu Quick PDF Library.
Is it possible to calculate the height of DrawHTMLText before drawing it on the document?
I need it because I want my application to decide where (x,y coordinates) to draw the DrawHTMLText output according to its dimensions.
For example, if it would extend past the bottom border of the document, I want to pull it up to make it fit.
Thank you.

user3253797,
You can use the GetHTMLTextHeight function to determine the height before calling DrawHTMLText.
http://www.debenu.com/docs/pdf_library_reference/GetHTMLTextHeight.php
Note: DrawHTMLText returns, as a string, any overflow text that does not fit into the specified area. GetHTMLTextHeight should in theory return the text height for the text that can fit inside the box. If the text is too long to fit inside the one box then it sounds like you will need to modify the x,y positions and possibly the HTML text itself to make it all fit on one page.
Good luck.
Andrew.

printing solution for .NET front end, MYSQL backend

I will need to print an "x" according to the coordinates given to me from one of the tables in my database. I'm probably going to use C# to connect to MySQL.
I will probably have a WinForm that is 8.5 x 11 inches (the size of a regular sheet of paper), and I will populate the entire thing with labels of "x" that will be invisible.
Each individual table record will have the coordinates of those labels which should NOT be invisible.
The form for every record will show and will print. The printing will be on top of a paper that is actually a physical application itself.
The problem:
How to fill out a physical application using data from a MySQL database. (Don't tell me that I should be printing the entire app from scratch. This is not possible because the form is actually TRIPLE paper width (white, yellow, and pink copy), so I cannot print the entire app from scratch; I have to print on top of it.)
The question: how do I print "x" at specified regions? Is my solution the best way to go or is there a smarter approach?
In case you have no idea what I am talking about, here are some related questions:
ms-access: designing a report: printing text on specific x,y coordinates
Conditional formatting in Access

While labels would offer you the ability to make an X show up, I don't feel that having a bunch of hidden labels is the best way.
Does the "application" represent some kind of form? Are you looking to "check off" boxes using x's and then print this?
I would suggest using GDI+ (drawing) rather than labels.
Consider the following:
Locate the coordinates for your boxes. Then use the DrawString method within an overridden OnPaint event handler for your form, or for the panel which may represent your form's canvas.
This article talks about GDI+ and how to draw text as graphics.
http://www.functionx.com/vb/gdi+/objects/fonts.htm

How can I get the number of color pages in a PDF file using C#?

Given a PDF file with color and black & white pages, is there any way with C# to find out among the given pages which are color and which are black & white?

My recommendation is to render each page to an image and then check each pixel for RGB values not equal to each other. If R=G=B for each pixel then it's a grayscale image.
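A minimal sketch of that per-pixel check is shown below. It assumes the page has already been rendered by some PDF library into a packed array of R, G, B bytes; the class name and the `tolerance` parameter (to allow for small rendering differences) are assumptions added for illustration.

```csharp
using System;

public static class PageColor
{
    // rgb holds packed samples: R, G, B, R, G, B, ...
    // Returns true when every pixel has R, G and B within `tolerance`
    // of each other, i.e. the page rendered as grayscale.
    public static bool IsGrayscale(byte[] rgb, int tolerance = 0)
    {
        if (rgb.Length % 3 != 0)
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));

        for (int i = 0; i < rgb.Length; i += 3)
        {
            int r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
            int max = Math.Max(r, Math.Max(g, b));
            int min = Math.Min(r, Math.Min(g, b));
            if (max - min > tolerance)
                return false; // found a coloured pixel
        }
        return true;
    }
}
```

Running this once per rendered page and counting the pages where it returns false gives the number of colour pages.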
You could then perform actions (such as extracting a page to another document or printing the page) on the pages based on whether they are color pages or black and white pages, etc.
This can be achieved by using my company's PDF developer library, Quick PDF, or potentially by one of the open source PDF libraries that Kenneth suggested.

Short of parsing all of the page content (the PostScript-like content streams), probably not. There's no flag on a PDF page that says it is or is not b&w or color. So you'd have to check the color of every element placed on the page to figure out if it was color or not. I'm not sure what libraries exist for reading PDFs in C# but you would need one that will read all the elements.
Similarly, any images you have on the page would need to be checked for color and that is not simple. Color image formats can hold b&w images.

Check out:
PDF-Analyser
I use his tools for text extraction and pdf analysis. Very inexpensive, royalty free, and work well. I think GetPDFColourStyle as part of the PDFLayoutPlus library should do the trick.

Convert each page into a bitmap image and then look through each pixel in the image; you will be able to detect colours and so identify the colour pages.
Refer to this post for more details.
Note: If you are detecting these colours for printing's sake, then you have to detect CMYK colours, not RGB; CMYK is the standard printer colour mode, and RGB is a display colour mode.

There is a solution.
You can parse each page content bytes and look for color operators like 'rg, RG, k, K, sc, SC, scn, SCN' and read out all the color values and color spaces defined in each page.
Take a look at this example: http://habjan.blogspot.com/2013/09/proof-of-concept-converting-pdf-files.html
It implements / parses all color operators and I think it will be a good starting point and reference to help you code what you need.
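To give a feel for the operator-scanning idea, here is a deliberately simplified sketch. It assumes you already have the decoded (uncompressed) page content stream as text, and it only looks at the `rg`/`RG` and `k`/`K` operators. It ignores `sc`/`SC`/`scn`/`SCN` (whose meaning depends on the current colour space), images and shadings, and its whitespace tokenizer can be fooled by string literals or inline image data. The class name is hypothetical and this is not the code from the linked post.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ColorOperatorScan
{
    // Returns true if the content stream sets a non-neutral colour
    // with rg/RG (RGB) or k/K (CMYK).
    public static bool UsesColor(string content)
    {
        var operands = new List<double>();
        string[] tokens = content.Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            // Numbers are operands for the operator that follows them.
            if (double.TryParse(token, NumberStyles.Float,
                                CultureInfo.InvariantCulture, out double value))
            {
                operands.Add(value);
                continue;
            }

            int n = operands.Count;
            if ((token == "rg" || token == "RG") && n >= 3)
            {
                // RGB is neutral only when all three components are equal.
                if (operands[n - 3] != operands[n - 2] || operands[n - 2] != operands[n - 1])
                    return true;
            }
            else if ((token == "k" || token == "K") && n >= 4)
            {
                // Simplification: any cyan, magenta or yellow counts as colour.
                if (operands[n - 4] != 0 || operands[n - 3] != 0 || operands[n - 2] != 0)
                    return true;
            }

            // Any non-numeric token ends the current operand list.
            operands.Clear();
        }
        return false;
    }
}
```

A real implementation would need a proper content stream tokenizer and would also have to follow colour spaces, form XObjects and images referenced from the page.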
