How to read text from an image in a PDF using PDFBox - C#

I have a PDF document that contains an image, and the image contains text.
I want to read the text from that image using PDFBox.
I have tried PDFTextStripper, but it does not work for text inside an image.
Can you please give me some ideas?
PDDocument pDDocument = PDDocument.load(new java.io.File(fileName));
PDFTextStripper textStripper = new PDFTextStripper();
string text = textStripper.getText(pDDocument);
Console.WriteLine(text);
I want to read the text inside the image in the PDF using PDFBox from C# / .NET.

Related

Read word file content (image and text) with formatting information

I want to convert a complex Word document (which includes both text and images) into RTF text while preserving the formatting information. A sample image of the document content is shown below.
How can I read this information as RTF text so that the formatting and image location information are preserved exactly? I tried to read the Word file using the RichTextBox LoadFile method, but it doesn't preserve the formatting and image location information.
I look forward to receiving suggestions on how to maintain the formatting and image location information.
Thanks

You could simply save the document in the RTF format...

Convert Image file to PDF document letter size using PDFTron

I am trying to convert an image file to a PDF document with a defined page size (letter size).
Currently I am able to convert an image to a PDF document without defining any page dimensions (the default dimensions of the PDF are the image size). I would like to define the page dimensions when the document is created, and place the image on that page (possibly with margins).
The following code snippet shows how I am currently converting an image file to a PDF document without setting any dimensions for the page:
async static Task<bool> ConvertImageToPDF(TestFile file)
{
    pdftron.PDF.PDFDoc pdfdoc = new PDFDoc(); // Initialize a new PDF document
    pdftron.PDF.Convert.ToPdf(pdfdoc, file.InputFile); // Generate the PDF from the image file
    await pdfdoc.SaveAsync(file.OutputFile, SDFDocSaveOptions.e_linearized); // Save the generated PDF document
    pdfdoc.Destroy();
    pdfdoc = null;
    return true; // The method is declared as Task<bool>, so it must return a value
}
Any help or direction for assigning dimensions (letter size) to a PDF page and inserting the image file in that page would be more than welcome.

If Convert.ToPdf is given an image, PDFNet queries the DPI information in the image metadata and sets the page dimensions to match the DPI and resolution of the source.
If you like, you can always post-process the PDF generated by Convert.ToPdf.
Alternatively, you can follow the AddImage sample code and do everything yourself.
https://www.pdftron.com/pdfnet/samplecode.html#AddImage
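
The do-it-yourself route comes down to a little arithmetic before any PDF calls are made. The sketch below is a minimal illustration, assuming a US Letter page of 612 x 792 points (8.5 x 11 inches at 72 points per inch) and a uniform margin. `LetterPageLayout`, `FitImage`, and `Placement` are hypothetical names and are not part of PDFNet. The code only computes where a scaled image would sit on the page; creating the page and placing the image element is left to the AddImage sample.

```csharp
using System;

// Hypothetical helper, not part of PDFNet: computes where an image should be
// placed on a US Letter page so that it fits inside the margins and is centered.
public static class LetterPageLayout
{
    public const double PageWidth = 612.0;   // 8.5 in * 72 pt/in
    public const double PageHeight = 792.0;  // 11 in * 72 pt/in

    // Lower-left corner and size of the scaled image, in points.
    public readonly struct Placement
    {
        public Placement(double x, double y, double width, double height)
        {
            X = x; Y = y; Width = width; Height = height;
        }
        public double X { get; }
        public double Y { get; }
        public double Width { get; }
        public double Height { get; }
    }

    // Scales the image uniformly to fill the area inside the margins
    // (this also enlarges small images) and centers it on the page.
    public static Placement FitImage(double imageWidth, double imageHeight, double margin)
    {
        if (imageWidth <= 0) throw new ArgumentOutOfRangeException(nameof(imageWidth));
        if (imageHeight <= 0) throw new ArgumentOutOfRangeException(nameof(imageHeight));

        double availableWidth = PageWidth - 2 * margin;
        double availableHeight = PageHeight - 2 * margin;
        if (availableWidth <= 0 || availableHeight <= 0)
            throw new ArgumentOutOfRangeException(nameof(margin));

        // Use the smaller ratio so the aspect ratio is kept and nothing overflows.
        double scale = Math.Min(availableWidth / imageWidth, availableHeight / imageHeight);
        double width = imageWidth * scale;
        double height = imageHeight * scale;

        double x = margin + (availableWidth - width) / 2;
        double y = margin + (availableHeight - height) / 2;
        return new Placement(x, y, width, height);
    }
}
```

For a 1000 x 500 image with a 36-point (half-inch) margin, this works out to a 540 x 270 image whose lower-left corner is at (36, 261). Those four numbers are what the image placement step would need.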

Whitespace issue in text to image conversion

I have a C# WinForms application that parses a text file and converts it into an image. The application works fine for normal text files. The problem I am facing is with the whitespace in the text.
The code is:
string text = File.ReadAllText(file);
Image img = DrawText(text);
img.Save("c:\\LoRa Demo\\pic.jpg", ImageFormat.Jpeg);
I am using Graphics.DrawString() inside DrawText() to convert the text into an image.
When parsing the following text, the whitespace doesn't occupy the same width in the string buffer as it does in the text file.
Text file content:
***************************
****** **********
****** **********
***************************
***************************
Debugging shows the following image in the string buffer:
The output image is the same as the one in the buffer:
How can I parse the text file properly and convert it to an image that looks like the text file?

I think you should use monospaced fonts.
See: A list of monospaced fonts
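
As a rough illustration of the suggestion above, the sketch below renders text with a fixed-width font so that a space takes up the same width as any other character. The original DrawText body is not shown in the question, so `MonospaceTextImage.DrawText`, the 12-point size, and the white background are assumptions; the calls themselves are standard System.Drawing (GDI+) ones.

```csharp
using System;
using System.Drawing;

// Hypothetical stand-in for the question's DrawText helper.
public static class MonospaceTextImage
{
    public static Image DrawText(string text)
    {
        // GenericMonospace resolves to a fixed-width family (typically Courier New).
        using (var font = new Font(FontFamily.GenericMonospace, 12f))
        {
            SizeF size;

            // Measure the text on a throwaway 1x1 bitmap to find the required size.
            using (var probe = new Bitmap(1, 1))
            using (var g = Graphics.FromImage(probe))
            {
                size = g.MeasureString(text, font);
            }

            int width = Math.Max(1, (int)Math.Ceiling(size.Width));
            int height = Math.Max(1, (int)Math.Ceiling(size.Height));

            // Draw black text on a white background at the measured size.
            var image = new Bitmap(width, height);
            using (var g = Graphics.FromImage(image))
            {
                g.Clear(Color.White);
                g.DrawString(text, font, Brushes.Black, 0f, 0f);
            }
            return image;
        }
    }
}
```

Called as `Image img = MonospaceTextImage.DrawText(File.ReadAllText(file));`, it slots into the question's snippet in place of the original DrawText.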

How to convert Text containing Images into pdf using iText#

I have a RichTextBox and, so far, I have been successful in converting plain text into a PDF using iTextSharp. The problem arises when I copy some text from a source, say a website, and it contains images along with the text. When I try to convert that content (text + images) into a PDF, the resulting PDF doesn't show the images. I assume there is some property to set on the RichTextBox or in iTextSharp so that images are included as well. I also know that images can be inserted by giving the path to the image file, but that is not what I want. I want plain text along with images when converting directly from the RichTextBox to PDF. For example, I have this (in the resulting PDF I want the same as in the RichTextBox): http://i.stack.imgur.com/ts0ec.jpg
How can I get the same result as in the RichTextBox? How can I specify the alignment so that the text and image are justified?

How to copy/crop text using pdfsharp?

I want to copy some text, or an area defined by (x, y) coordinates, from an existing PDF and paste it into a new PDF.
I am using PDFsharp. How can I do this?
Can anybody help?

If you want to extract text, look at these threads:
C# PDFSharp: Examples of how to strip text from PDF?
C# Extract text from PDF using PdfSharp
If you want to extract an image of the page (or a part of it), look at this thread:
Export PDF to JPG(s) in C#
