C# PDF to text with values on multiple lines - c#

Hi, I have a PDF with the following content:
Property Address: 123 Door Form Type: Miscellaneous
ABC City
Pin - XXX
When I use iTextSharp to extract the content, I get the following:
Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX
The data is mixed up because the address continues on the following lines. Please suggest a way to get the content in the required order, like this:
Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous
Thanks

The following code using iTextSharp helped in formatting the PDF:
PdfReader reader = new PdfReader(path);
int pageCount = reader.NumberOfPages;
for (int page = 1; page <= pageCount; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    // Round-trip through the default code page (kept from the original answer);
    // it can replace characters that the default code page cannot represent.
    tt = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // AppendAllText takes a single string; AppendAllLines expects a sequence of lines.
    File.AppendAllText(outfile, tt, Encoding.UTF8);
}
reader.Close();
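If the address and the form type sit in separate columns on the page, one possible approach is to extract each column on its own by restricting extraction to a rectangle. The sketch below uses iTextSharp 5's RegionTextRenderFilter together with LocationTextExtractionStrategy. The class name ColumnExtractor, the method name GetTextInRegion, and any coordinates passed to it are assumptions for illustration; the real rectangles would have to be measured for the actual PDF.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnExtractor  // hypothetical helper, not part of iTextSharp
{
    // Returns only the text positioned inside the given rectangle
    // (PDF user-space units, origin at the bottom-left of the page).
    public static string GetTextInRegion(string path, int page, System.util.RectangleJ region)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            RenderFilter[] filters = { new RegionTextRenderFilter(region) };
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), filters);
            return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling it once per column, for example with `new System.util.RectangleJ(0, 0, 300, 800)` for an assumed left column and `new System.util.RectangleJ(300, 0, 300, 800)` for an assumed right column, would keep "Property Address" and "Form Type" apart, provided the columns do not overlap horizontally.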

I'm using the helper class below to convert a PDF to a text file. It works well for me.
If anyone needs a full working desktop application, please refer to this GitHub repo:
https://github.com/Kithuldeniya/PDFReader
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;

namespace PDFReader.Helpers
{
    public static class PdfHelper
    {
        public static string ManipulatePdf(string filePath)
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
            //CustomFontFilter fontFilter = new CustomFontFilter(rect);
            FilteredEventListener listener = new FilteredEventListener();
            // Create a text extraction renderer
            LocationTextExtractionStrategy extractionStrategy = listener
                .AttachEventListener(new LocationTextExtractionStrategy());
            // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());
            // Get the resultant text after applying the custom filter
            String actualText = extractionStrategy.GetResultantText();
            pdfDoc.Close();
            return actualText;
        }
    }
}

Related

Rewriting simple iTextSharp read page function using iText7 library

I've been using the iTextSharp library in my C# .NET program to read PDF files. A simplified version of my program uses code like the following procedure to get the text for a specified page number in a PDF File.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.Text;

namespace PDFReaderITextSharp
{
    public static class PdfHelper
    {
        public static string GetText(string fileName, int pageNumber)
        {
            PdfReader pdfReader = new PdfReader(fileName);
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            return currentText;
        }
    }
}
Now, I'd like to construct a similar function that does the same thing using the iText7 library in a .NET Core 3 program, namely, return the text for a specified PDF page. However, when I compare the returned text with the text returned by the above function, and with what I see when viewing the PDF file in Adobe Reader, some document text appears multiple times.
For example, when I look at the PDF file in Adobe Reader, there are several fields displayed in the form "Caption: Value", such as Invoice #: 1234 Invoice Date: 12/32/2019. But the text returned by the iText7 library is "Invoice#Invoice #: 1234 Invoice DateInvoice Date: 12/32/2019" (the labels are duplicated two or more times).
I wish I could upload the PDF document to help.
Is there something wrong with the iText7 function? What might the iText7 library be doing?
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace PDFReader.Helpers
{
    public static class PdfHelperIText7
    {
        public static string GetPdfPageText(string pdfFilePath, int pageNumber)
        {
            using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(pdfFilePath)))
            {
                FilteredEventListener listener = new FilteredEventListener();
                LocationTextExtractionStrategy extractionStrategy = listener.AttachEventListener(new LocationTextExtractionStrategy());
                PdfCanvasProcessor pdfCanvasProcessor = new PdfCanvasProcessor(listener);
                listener.AttachEventListener(extractionStrategy);
                pdfCanvasProcessor.ProcessPageContent(pdfDocument.GetPage(pageNumber));
                string actualText = extractionStrategy.GetResultantText();
                pdfDocument.Close();
                return actualText;
            }
        }
    }
}
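One detail worth checking in the function above: the extraction strategy is passed to AttachEventListener twice, once when it is created and again after the PdfCanvasProcessor is constructed, so the listener may deliver every text event to it twice. Whether that accounts for the duplicated labels in this particular PDF is an assumption; duplicates can also come from the PDF itself drawing the same text more than once. A minimal sketch that hands the strategy to the parser exactly once, using iText 7's PdfTextExtractor (the class name PdfHelperIText7Sketch is hypothetical):

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace PDFReader.Helpers
{
    public static class PdfHelperIText7Sketch  // hypothetical name
    {
        public static string GetPdfPageText(string pdfFilePath, int pageNumber)
        {
            // The using block closes the document, so no explicit Close() call is needed.
            using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(pdfFilePath)))
            {
                // The strategy is registered once, so each text chunk is collected once.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                return PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(pageNumber), strategy);
            }
        }
    }
}
```

If the labels are still repeated with this version, the duplication is more likely to be in the page content than in the extraction code.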

Reading pdf content with itextsharp and C#

I'm looking for a way to select text n characters on either side of a keyword search using itextsharp (v5.5.8). I've gotten to the point where I can use the SimpleTextExtractionStrategy() and return a list of pages where the searched text is found (supposedly). When I do a manual search using PDF Viewer Search, sometimes it's there and sometimes it can't find it on the page itextsharp says it's on. Sometimes, not at all.
The idea is to be able to return 40 characters on either side of the found keyword to allow the user to be able to find the reference easier when they look at the actual document. In another question, I saw a reference to additional text retrieval functions (LocationTextExtractionStrategy, PdfTextExtractor.GetTextFromPage(myReader, pageNum) in combination with some Contains(word)).
Where can I find examples of how to use these functions? And how to create a better strategy?
My current code:
public List<int> ReadPdfFile(string fileName, String searchText)
{
    string rootPath = HttpContext.Current.Server.MapPath("~");
    string dirPath = rootPath + @"content\publications\";
    List<int> pages = new List<int>();
    string fullFile = dirPath + fileName;
    if (File.Exists(fullFile))
    {
        PdfReader pdfReader = new PdfReader(fullFile);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentPageText.Contains(searchText))
            {
                pages.Add(page);
            }
        }
        pdfReader.Close();
    }
    return pages;
}
And an example of the output using a simple response.write command...
Document 1.pdf 1 3
Document 2.pdf 1 2 3 4
The numbers after the file name are page numbers where the searched for keyword is found. However, in Document 1, the keyword is also found on the very top of page 4 in the "References" section that began on page 3. It should be found twice in the References.
Thanks,
Bob
P.S. apparently 5.5.8 doesn't have the iTextSharp.text.pdf.parser.TextExtractionStrategy method...
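The "40 characters on either side" idea can be sketched independently of iTextSharp, working on the page text that GetTextFromPage already returns. The helper below is an illustration only: the names KeywordContext and FindSnippets are hypothetical, and the case-insensitive comparison is an assumption that differs from the case-sensitive Contains call in the code above.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordContext  // hypothetical helper
{
    // Returns one snippet per occurrence of keyword, padded with up to
    // `radius` characters of surrounding text on each side.
    public static List<string> FindSnippets(string pageText, string keyword, int radius = 40)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(keyword))
            return snippets;

        int index = pageText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            // Clamp the window to the bounds of the page text.
            int start = Math.Max(0, index - radius);
            int end = Math.Min(pageText.Length, index + keyword.Length + radius);
            snippets.Add(pageText.Substring(start, end - start));
            // Continue searching after the current match.
            index = pageText.IndexOf(keyword, index + keyword.Length, StringComparison.OrdinalIgnoreCase);
        }
        return snippets;
    }
}
```

Calling `KeywordContext.FindSnippets(currentPageText, searchText)` inside the page loop would give one snippet per hit on that page. A keyword that is broken across a line break or hyphenated in the extracted text would still be missed, which may be one reason a viewer's search and this kind of search disagree.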

ASP.NET + C#. Creating word document from template

I have a Word document which contains only one page filled with text and graphics. The page also contains some placeholders like [Field1], [Field2], etc.
I get data from a database and I want to open this document and fill the placeholders with some data. For each data row I want to open this document, fill the placeholders with the row's data, and then concatenate all created documents into one document.
What is the best and simplest way to do this?
Instead of a third-party library, I suggest using Open XML.
Add the following namespaces: System.IO, System.Text.RegularExpressions,
DocumentFormat.OpenXml.Packaging and DocumentFormat.OpenXml.Wordprocessing.
public static void SearchAndReplace(string document)
{
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
    {
        string docText = null;
        using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        {
            docText = sr.ReadToEnd();
        }
        Regex regexText = new Regex("Hello world!");
        docText = regexText.Replace(docText, "Hi Everyone!");
        using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            sw.Write(docText);
        }
    }
}
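Adapting the search-and-replace above to placeholders such as [Field1] needs one extra step, because square brackets form a character class in a regular expression. The sketch below works on the document text only and uses nothing beyond System.Text.RegularExpressions; the names PlaceholderFiller and Fill are hypothetical, and the dictionary of values stands in for one database row.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderFiller  // hypothetical helper
{
    // Replaces each literal placeholder, such as "[Field1]", with its value.
    public static string Fill(string docText, IDictionary<string, string> values)
    {
        foreach (KeyValuePair<string, string> pair in values)
        {
            // Regex.Escape makes the brackets literal; unescaped, "[Field1]" would
            // match any single one of the characters F, i, e, l, d, 1.
            // The evaluator lambda keeps "$" in a value from being read as a substitution.
            docText = Regex.Replace(docText, Regex.Escape(pair.Key), m => pair.Value);
        }
        return docText;
    }
}
```

Inside SearchAndReplace, `docText = PlaceholderFiller.Fill(docText, rowValues);` could take the place of the "Hello world!" replacement, where rowValues is an assumed dictionary built from the current row. Two caveats: the values are written into the XML as-is, so characters such as & and < would need escaping first, and Word may split a placeholder across several runs in the underlying XML, in which case a plain text match will not find it.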
You'll probably need to use a third party library.
You might want to check out http://www.codeproject.com/Articles/660478/Csharp-Create-and-Manipulate-Word-Documents-Progra
The below section specifically discusses replacing values in a Word document.
http://www.typecastexception.com/post/2013/09/28/C-Create-and-Manipulate-Word-Documents-Programmatically-Using-DocX.aspx#Find-and-Replace-Text-Using-DocX---Merge-Templating--Anyone-

Using iTextSharp to get y co-ordinates of text box?

I'm using iTextSharp to return the text from a page in a PDF document,
using this :
var locationTextExtractionStrategy = new LocationTextExtractionStrategy();
string textFromPage = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, locationTextExtractionStrategy);
I understand from previous questions here that I need to access
renderInfo.GetBaseline().GetStartPoint();
But I don't understand how to call that method from LocationTextExtractionStrategy()
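One way to reach the render info is to subclass the strategy and override RenderText, which iTextSharp 5 calls for each text chunk while the page is parsed. The sketch below is an assumption about how this could be wired up, not a confirmed answer from the thread; the class name YTrackingStrategy and the ChunkPositions list are hypothetical.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical subclass: records the baseline Y coordinate of every text chunk
// while leaving the normal text extraction untouched.
public class YTrackingStrategy : LocationTextExtractionStrategy
{
    public readonly List<KeyValuePair<string, float>> ChunkPositions =
        new List<KeyValuePair<string, float>>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class collect the chunk for GetResultantText().
        base.RenderText(renderInfo);
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        // Vector.I2 is the index of the Y component.
        ChunkPositions.Add(new KeyValuePair<string, float>(renderInfo.GetText(), start[Vector.I2]));
    }
}
```

Passing an instance of this class to PdfTextExtractor.GetTextFromPage in place of the plain LocationTextExtractionStrategy would still return the page text, and ChunkPositions would afterwards hold each chunk's text together with its Y coordinate.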

Get Meta Data from PDF using PDFsharp

How do I get metadata from a PDF using PDFsharp? Refer to the image.
I want to extract the 'Document Restrictions Summary'.
private static void Method1(string strPDFAddress)
{
    PdfDocument pdfDoc = new PdfDocument(strPDFAddress);
    Console.WriteLine("--------------------------------------------------------------");
    Console.WriteLine("File: {0}", strPDFAddress);
    Console.WriteLine("Author: {0}", pdfDoc.Info.Author);
    Console.WriteLine("CreationDate: {0}", pdfDoc.Info.CreationDate);
    Console.WriteLine("Creator: {0}", pdfDoc.Info.Creator);
    Console.WriteLine("Keywords: {0}", pdfDoc.Info.Keywords);
    PdfDocumentSettings pdfDocSettings = pdfDoc.Settings;
    Console.WriteLine(pdfDocSettings.ToString());
    PdfSecuritySettings pdfSecuritySettings = pdfDoc.SecuritySettings;
    Console.WriteLine(pdfSecuritySettings.PermitExtractContent);
    //PdfSharp.Pdf.Advanced.PdfFormXObject xObj =
    PdfDictionary.DictionaryElements pdfDictionaryElements = pdfDoc.Info.Elements;
    Console.WriteLine(pdfDictionaryElements.ToString());
}
Try this (note that this snippet uses iTextSharp's PdfReader rather than PDFsharp). Hope it works.
PdfReader reader = new PdfReader("HelloWorldNoMetadata.pdf");
string s = reader.Info["Author"];
You can set these document restrictions with the PdfSecuritySettings class.
See this sample:
http://www.pdfsharp.net/wiki/ProtectDocument-sample.ashx
I'm not sure, but I would expect this structure to also be populated when an existing PDF document is opened.
