Rewriting simple iTextSharp read page function using iText7 library

Rewriting simple iTextSharp read page function using iText7 library - c#

I've been using the iTextSharp library in my C# .NET program to read PDF files. A simplified version of my program uses code like the following procedure to get the text for a specified page number in a PDF File.
using iTextSharp.text.pdf.parser;
using System.Text;
namespace PDFReaderITextSharp
{
public static class PdfHelper
{
public static string GetText(string fileName, int pageNumber)
{
PdfReader pdfReader = new PdfReader(fileName);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
return currentText;
}
}
}
Now, I'd like to construct a similar type of function that does that same thing but using the iText7 library in a .NET Core 3 program, namely, return the text for a specified PDF page. However, when I look at the text that is returned and compare it to the text returned by the above function as well as visually looking at the PDF file using an Adobe Reader, I see document text being represented multiple times.
For example, when I look at the PDF file in Adobe, there are several fields displayed in the form of "Caption: Value". For example, Invoice #: 1234 Invoice Date: 12/32/2019. But the text returned using the iText7 library returns "Invoice#Invoice #: 1234 Invoice DateInvoice Date: 12/32/2019" (the labels are duplicated 2 or more times.)
I wish I could upload the PDF document to help.
Is there something wrong with the iText7 function? What might the iText7 library be doing?
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
namespace PDFReader.Helpers
{
public static class PdfHelperIText7
{
public static string GetPdfPageText(string pdfFilePath, int pageNumber)
{
using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(pdfFilePath)))
{
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy = listener.AttachEventListener(new LocationTextExtractionStrategy());
PdfCanvasProcessor pdfCanvasProcessor = new PdfCanvasProcessor(listener);
listener.AttachEventListener(extractionStrategy);
pdfCanvasProcessor.ProcessPageContent(pdfDocument.GetPage(pageNumber));
string actualText = extractionStrategy.GetResultantText();
pdfDocument.Close();
return actualText;
}
}
}
}

Related

ITextSharp 4.1.6 extract PDF content as text

The company would like to use the Itextsharp 4.1.6 version specifically and don't want to buy the license (version 5/7).
So, we had already implemented the TextExtract from pdf using the itextsharp 5 version. As we downgraded, this method doesn't support in the 4.16 LGPL version.
So, I looked into many StackOverflow and other sites for the answer. Looks like no custom implementation found other than the below code which exists in AGPL version.
PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy())
And byte[] pageContent = reader.GetPageContent(i); gives the byte content, when converted to string it won't give us the exact file text.
As, we do not wish to buy the AGPL version and need to implement the textextractor of pdf, any idea if any other tool supports this/ anybody has the implementation of textextractor.
Any suggestions would be greatly appreciated.
Edit: Refernce for the #jgoday's answer:

With iText 4.1 you can use PdfContentParser (https://github.com/schourode/iTextSharp-LGPL/blob/f75cdad88236d502af42458a420d48be2a47008f/src/core/iTextSharp/text/pdf/PdfContentParser.cs), to parse contents of every page.
using System;
using System.Text;
using iTextSharp.text.pdf;
namespace PdfExtractor
{
class Program
{
static void Main(string[] args)
{
var reader = new PdfReader(#"D:\Tmp\sample.pdf");
try
{
var parser = new PdfContentParser(new PRTokeniser(reader.GetPageContent(2)));
var sb = new StringBuilder();
while (parser.Tokeniser.NextToken())
{
if (parser.Tokeniser.TokenType == PRTokeniser.TK_STRING)
{
string str = parser.Tokeniser.StringValue;
sb.Append(str);
}
}
Console.WriteLine(sb.ToString());
}
finally {
reader.Close();
}
}
}
}

Edit PDF text using C#

How can I find and then hide (or delete) specific text phrase?
For example, I have created a PDF file containing all sorts of data such as images, tables, text etc.
Now, I want to find a specific phrase like "Hello World" wherever it is mentioned in the file and somehow hide it, or -better even- delete it from the PDF.
And finally get the PDF after deleting this phrase.
I have tried iTextSharp and Spire, but couldn't find anything that worked.

Try the following code snippets to hide the specifc text phrase on PDF using Spire.PDF.
using Spire.Pdf;
using Spire.Pdf.General.Find;
using System.Drawing;
namespace HideText
{
class Program
{
static void Main(string[] args)
{
//load PDF file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(#"C:\Users\Administrator\Desktop\Example.pdf");
//find all results where "Hello World" appears
PdfTextFind[] finds = null;
foreach (PdfPageBase page in doc.Pages)
{
finds = page.FindText("Hello World").Finds;
}
//cover the specific result with white background color
finds[0].ApplyRecoverString("", Color.White, false);
//save to file
doc.SaveToFile("output.pdf");
}
}
}
Result

The following snippet from here let you find and black-out the text in pdf document:
PdfDocument pdf = new PdfDocument(new PdfReader(SRC), new PdfWriter(DEST));
ICleanupStrategy cleanupStrategy = new RegexBasedCleanupStrategy(new Regex(#"Alice", RegexOptions.IgnoreCase)).SetRedactionColor(ColorConstants.PINK);
PdfAutoSweep autoSweep = new PdfAutoSweep(cleanupStrategy);
autoSweep.CleanUp(pdf);
pdf.Close();
Pay attention to the license. It is AGPL, if you don't buy license.

C# Pdf to Text with values in multiple line

Hi I have a pdf with content as following : -
Property Address: 123 Door Form Type: Miscellaneous
ABC City
Pin - XXX
So when I use itextSharp to get the content, it is obtained as follows -
Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX
The data is mixed since it is in next line. Please suggest a possible way to get the content as required. Thanks
Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous

The following code using iTextSharp helped in formatting the pdf -
PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string tt = PdfTextExtractor.GetTextFromPage(reader, page , strategy);
tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
File.AppendAllLines(outfile, tt, Encoding.UTF8);
}

I'm Using Below helper class to convert PDF to Text file. this one is working clam for me.
If any one need full working desktop application please refer this github repo
https://github.com/Kithuldeniya/PDFReader
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
namespace PDFReader.Helpers
{
public static class PdfHelper
{
public static string ManipulatePdf(string filePath)
{
PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
//CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.AttachEventListener(new LocationTextExtractionStrategy());
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.GetResultantText();
pdfDoc.Close();
return actualText;
}
}
}

ASP.NET + C#. Creating word document from template

I have a word document which contains only one page filled with text and graphic. The page also contains some placeholders like [Field1],[Field2],..., etc.
I get data from database and I want to open this document and fill placeholders with some data. For each data row I want to open this document, fill placeholders with row's data and then concatenate all created documents into one document.
What is the best and simpliest way to do this?

Instead of some third party i will suggest you openXML
add following namespaces System.Text.RegularExpressions;
DocumentFormat.OpenXml.Packaging; and DocumentFormat.OpenXml.Wordprocessing;
public static void SearchAndReplace(string document)
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
Regex regexText = new Regex("Hello world!");
docText = regexText.Replace(docText, "Hi Everyone!");
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
}

You'll probably need to use a third party library.
You might want to check out http://www.codeproject.com/Articles/660478/Csharp-Create-and-Manipulate-Word-Documents-Progra
The below section specifically discusses replacing values in a Word document.
http://www.typecastexception.com/post/2013/09/28/C-Create-and-Manipulate-Word-Documents-Programmatically-Using-DocX.aspx#Find-and-Replace-Text-Using-DocX---Merge-Templating--Anyone-

Need to create a PDF file from C# with another PDF file as the background watermark

I am looking for a solution that will allow me to create a PDF outfile from C# that also merges in a seperate, static PDF file as the background watermark.
I'm working on a system that will allow users to create a PDF version of their invoice. Instead of trying to recreate all of the invoice features within C# I think the easiest solution would be to use the PDF version fo the blank invoice (created from Adobe Illustrator) as a background watermark and simply overlay the dynamic invoice details on top.
I was looking at Active Reports from Data Dynamics, but it dow not appear they they have the capibility to overlay, or merge, a report onto an existing PDF file.
Is there any other .NET PDF report product that has this capibilty?

Thank you bhavinp. iText seems to do the trick and work exactly as I was hoping for.
For anyone else trying to merge to PDF files and overlay them the following example code based on using the the iTextPDF library might help.
The Result file is a combination of Original and Background
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
namespace iTextTest
{
class Program
{
/** The original PDF file. */
const String Original = #"C:\Jobs\InvoiceSource.pdf";
const String Background = #"C:\Jobs\InvoiceTemplate.pdf";
const String Result = #"C:\Jobs\InvoiceOutput.pdf";
static void Main(string[] args)
{
ManipulatePdf(Original, Background, Result);
}
static void ManipulatePdf(String src, String stationery, String dest)
{
// Create readers
PdfReader reader = new PdfReader(src);
PdfReader sReader = new PdfReader(stationery);
// Create the stamper
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
// Add the stationery to each page
PdfImportedPage page = stamper.GetImportedPage(sReader, 1);
int n = reader.NumberOfPages;
PdfContentByte background;
for (int i = 1; i <= n; i++)
{
background = stamper.GetUnderContent(i);
background.AddTemplate(page, 0, 0);
}
// CLose the stamper
stamper.Close();
}
}
}

I came across this question and couldn't use the iTextSharp library due to the license on the free version
The iText AGPL license is for developers who wish to share their entire application source code with the open-source community as free software under the AGPL “copyleft” terms.
However I found PDFSharp to work using the below code.
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
namespace PDFTest
{
class Program
{
static Stream Main(string[] args)
{
using (PdfDocument originalDocument= PdfReader.Open("C:\\MainDocument.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outputPdf = new PdfDocument())
{
foreach (PdfPage page in originalDocument.Pages)
{
outputPdf.AddPage(page);
}
var background = XImage.FromFile("C:\\Watermark.pdf");
foreach (PdfPage page in outputPdf.Pages)
{
XGraphics graphics = XGraphics.FromPdfPage(page);
graphics.DrawImage(background, 1, 1);
}
MemoryStream stream = new MemoryStream();
outputPdf.Save("C:\\OutputFile.pdf");
}
}
}
}

Use this.
http://itextpdf.com/
It works with both Java and .NET

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Rewriting simple iTextSharp read page function using iText7 library - c#

Related

ITextSharp 4.1.6 extract PDF content as text

Edit PDF text using C#

C# Pdf to Text with values in multiple line

ASP.NET + C#. Creating word document from template

Need to create a PDF file from C# with another PDF file as the background watermark

Categories

Resources