I'm using iTextSharp to return the text from a page in a PDF document,
using this:
var locationTextExtractionStrategy = new LocationTextExtractionStrategy();
string textFromPage = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, locationTextExtractionStrategy);
I understand from previous questions here that I need to access
renderInfo.GetBaseline().GetStartPoint();
But I don't understand how to call that method from LocationTextExtractionStrategy.
How does one add a PDF Form element to a PDFsharp PdfPage object?
I understand that AcroForm is the best format for form-fillable PDF elements, but the PDFsharp library doesn't seem to allow you to create instances of the AcroForm objects.
I have been able to use PDFsharp to generate simple documents, as here:
static void Main(string[] args) {
PdfDocument document = new PdfDocument();
document.Info.Title = "Created with PDFsharp";
// Create an empty page
PdfPage page = document.AddPage();
// Draw Text
XGraphics gfx = XGraphics.FromPdfPage(page);
XFont font = new XFont("Verdana", 20, XFontStyle.BoldItalic);
gfx.DrawString("Hello, World!", font, XBrushes.Black,
new XRect(0, 0, page.Width, page.Height), XStringFormats.Center);
// Save document
const string filename = "HelloWorld.pdf";
document.Save(filename);
}
But I cannot work out how to add a fillable form element. I gather it would likely use the page.Elements.Add(string key, PdfItem item) method, but how do you make an AcroForm PdfItem? (As classes like PdfTextField do not seem to have a public constructor)
The PDFsharp forums and documentation have not helped with this, and the closest answer I found on Stack Overflow was this one, which is answering with the wrong library.
So, in short: How would I convert the "Hello World" text above into a text field?
Is it possible to do this in PDFsharp, or should I be using a different C# PDF library? (I would very much like to stick with free - and preferably open-source - libraries)
Most of the relevant PDFsharp classes are sealed or have no public constructor, which makes it difficult to create new PDF objects directly. However, you can use its low-level classes to add PDF elements.
Below is an example of creating a text field.
Please refer to the PDF specification, starting on page 432, for the definitions of the key elements:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
public static void AddTextBox()
{
using (PdfDocument pdf = new PdfDocument())
{
PdfPage page1 = pdf.AddPage();
double left = 50;
double right = 200;
double bottom = 750;
double top = 725;
PdfArray rect = new PdfArray(pdf);
rect.Elements.Add(new PdfReal(left));
rect.Elements.Add(new PdfReal(bottom));
rect.Elements.Add(new PdfReal(right));
rect.Elements.Add(new PdfReal(top));
pdf.Internals.AddObject(rect);
PdfDictionary form = new PdfDictionary(pdf);
form.Elements.Add("/Filter", new PdfName("/FlateDecode"));
form.Elements.Add("/Length", new PdfInteger(20));
form.Elements.Add("/Subtype", new PdfName("/Form"));
form.Elements.Add("/Type", new PdfName("/XObject"));
pdf.Internals.AddObject(form);
PdfDictionary appearanceStream = new PdfDictionary(pdf);
appearanceStream.Elements.Add("/N", form);
pdf.Internals.AddObject(appearanceStream);
PdfDictionary textfield = new PdfDictionary(pdf);
textfield.Elements.Add("/FT", new PdfName("/Tx"));
textfield.Elements.Add("/Subtype", new PdfName("/Widget"));
textfield.Elements.Add("/T", new PdfString("fldHelloWorld"));
textfield.Elements.Add("/V", new PdfString("Hello World!"));
textfield.Elements.Add("/Type", new PdfName("/Annot"));
textfield.Elements.Add("/AP", appearanceStream);
textfield.Elements.Add("/Rect", rect);
textfield.Elements.Add("/P", page1);
pdf.Internals.AddObject(textfield);
PdfArray annotsArray = new PdfArray(pdf);
annotsArray.Elements.Add(textfield);
pdf.Internals.AddObject(annotsArray);
page1.Elements.Add("/Annots", annotsArray);
// draw rectangle around text field
//XGraphics gfx = XGraphics.FromPdfPage(page1);
//gfx.DrawRectangle(new XPen(XColors.DarkOrange, 2), left, 40, right, bottom - top);
// Save document
const string filename = @"C:\Downloads\HelloWorld.pdf";
pdf.Save(filename);
pdf.Close();
Process.Start(filename);
}
}
I have a simple question: how can you show a PDF file using PagePreview?
I have a full path name: document.FileName = @"c:\scans\Insurance_34345.pdf";
I would like to call pagePreview.Preview(document.FileName); or something similar.
If there is another way of showing a PDF, that's fine too. I want to show it on a WinForms Form.
This is what I tried, but I don't know what to do next.
in the Designer
private MigraDoc.Rendering.Forms.DocumentPreview dpvScannedDoc;
Part of the code
string fullPadnaam = Path.Combine(defaultPath, document.FileName);
//PdfDocument pdfDocument = new PdfDocument(fullPadnaam);
//PdfPage page = new PdfPage(pdfDocument);
//XGraphics gfx = XGraphics.FromPdfPage(page);
MigraDoc.DocumentObjectModel.Document pdfDocument = new MigraDoc.DocumentObjectModel.Document();
pdfDocument.ImagePath = fullPadnaam;
var docRenderer = new DocumentRenderer(pdfDocument);
docRenderer.PrepareDocument();
var inPdfDoc = PdfReader.Open(fullPadnaam, PdfDocumentOpenMode.ReadOnly);
for (var i = 0; i < inPdfDoc.PageCount; i++)
{
pdfDocument.AddSection();
docRenderer.PrepareDocument();
var page = inPdfDoc.Pages[i];
var gfx = XGraphics.FromPdfPage(page);
docRenderer.RenderPage(gfx, i + 1);
}
var renderer = new PdfDocumentRenderer();
renderer.Document = pdfDocument;
renderer.RenderDocument();
// MigraDoc.DocumentObjectModel.IO.DdlWriter dw = new MigraDoc.DocumentObjectModel.IO.DdlWriter("HelloWorld.mdddl");
// dw.WriteDocument(pdfDocument);
// dw.Close();
//renderer.PdfDocument.rea(outFilePath);
//string ddl = MigraDoc.DocumentObjectModel.IO.DdlWriter.WriteToString(document1);
dpvScannedDoc.Show( pdfDocument);
PDFsharp does not render PDF files. You cannot show PDF files using the PagePreview.
If you use the XGraphics class for drawing then you can use shared code that draws on the PagePreview and on PDF pages.
The PagePreview sample can be found in the sample package and here:
http://www.pdfsharp.net/wiki/Preview-sample.ashx
If you have code that creates a new PDF file using PDFsharp, then you can use the PagePreview to show on screen what you would otherwise draw on PDF pages. You cannot draw existing PDF pages using the PagePreview because PDFsharp does not render PDF.
The MigraDoc DocumentPreview can display MDDDL files (your sample code creates a file "HelloWorld.mdddl"), but it cannot display PDF files.
If the MDDDL uses PDF files as images, they will not show up in the preview. They will show when creating a PDF from the MDDDL.
Hi, I have a PDF with the following content:
Property Address: 123 Door Form Type: Miscellaneous
ABC City
Pin - XXX
When I use iTextSharp to get the content, it is returned as follows:
Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX
The data is mixed up because the address continues on the following lines. Please suggest a way to get the content in the required order, shown below. Thanks.
Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous
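The following code using iTextSharp helped to extract the text in the required order:
PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
    // SimpleTextExtractionStrategy keeps the text in the order it is drawn on the page
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    // Round-trips the text through the system default code page; characters outside it are lost
    tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // AppendAllLines expects a sequence of strings, so wrap the page text in an array
    File.AppendAllLines(outfile, new[] { tt }, Encoding.UTF8);
}
reader.Close();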
I'm using the helper class below to convert a PDF to a text file. It is working well for me.
If anyone needs a full working desktop application, please refer to this GitHub repo:
https://github.com/Kithuldeniya/PDFReader
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
namespace PDFReader.Helpers
{
public static class PdfHelper
{
public static string ManipulatePdf(string filePath)
{
PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
//CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.AttachEventListener(new LocationTextExtractionStrategy());
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.Reset()
new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.GetResultantText();
pdfDoc.Close();
return actualText;
}
}
}
I am working on converting PDF to text. I can get the text from the PDF correctly, but it gets complicated with table structures. I know PDF doesn't support table structure, but I think there is a way to get the cells correctly. For example:
I want to convert to text like this:
> This is first example.
> This is second example.
But when I convert the PDF to text, the data looks like this:
> This is This is
> first example. second example.
How can I get values correctly?
--EDIT:
Here is how I convert the PDF to text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
filepath = ofd.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page <= reader.NumberOfPages; page++) // <= so the last page is not skipped
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText += s;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.
Depending on the document in question there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that replacing LocationTextExtractionStrategy with SimpleTextExtractionStrategy made it work.
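To illustrate why the two strategies give different results for such a table, here is a minimal sketch that does not use iTextSharp at all. The Chunk class and the OrderingDemo methods are hypothetical stand-ins made up for this illustration, and the coordinates are invented; the sketch only mimics the two orderings described above and is not how iText implements them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a piece of text drawn on a page; not an iTextSharp type.
public sealed class Chunk
{
    public string Text { get; }
    public double X { get; }   // distance from the left edge
    public double Y { get; }   // baseline height; larger values are higher on the page
    public Chunk(string text, double x, double y) { Text = text; X = x; Y = y; }
}

public static class OrderingDemo
{
    // Drawing order: join the chunks exactly as the content stream lists them.
    public static string InDrawingOrder(IEnumerable<Chunk> chunks) =>
        string.Join(" ", chunks.Select(c => c.Text));

    // Position order: lines from top to bottom, each line read left to right.
    public static string InPositionOrder(IEnumerable<Chunk> chunks) =>
        string.Join("\n", chunks
            .GroupBy(c => c.Y)
            .OrderByDescending(line => line.Key)
            .Select(line => string.Join(" ", line.OrderBy(c => c.X).Select(c => c.Text))));

    public static void Main()
    {
        // Two table cells, each holding two lines, drawn cell by cell (assumed layout).
        var chunks = new List<Chunk>
        {
            new Chunk("This is", 50, 700), new Chunk("first example.", 50, 685),
            new Chunk("This is", 300, 700), new Chunk("second example.", 300, 685),
        };
        Console.WriteLine(InDrawingOrder(chunks));
        Console.WriteLine(InPositionOrder(chunks));
    }
}
```

With the cells drawn one after the other, the drawing order keeps each cell's text together, while the position order interleaves the two cells line by line, which is the symptom described in the question. If a document draws its text row by row instead, the drawing order would not help either.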
I'm looking for a way to select the text n characters on either side of a keyword match using iTextSharp (v5.5.8). I've gotten to the point where I can use the SimpleTextExtractionStrategy() and return a list of pages where the searched text is (supposedly) found. When I do a manual search using the PDF viewer's search, sometimes the text is there, and sometimes the viewer can't find it on the page iTextSharp says it's on. Sometimes it isn't found at all.
The idea is to be able to return 40 characters on either side of the found keyword to allow the user to be able to find the reference easier when they look at the actual document. In another question, I saw a reference to additional text retrieval functions (LocationTextExtractionStrategy, PdfTextExtractor.GetTextFromPage(myReader, pageNum) in combination with some Contains(word)).
Where can I find examples of how to use these functions? And how to create a better strategy?
My current code:
public List<int> ReadPdfFile(string fileName, String searthText)
{
string rootPath = HttpContext.Current.Server.MapPath("~");
string dirPath = rootPath + @"content\publications\";
List<int> pages = new List<int>();
string fullFile = dirPath + fileName;
if (File.Exists(fullFile))
{
PdfReader pdfReader = new PdfReader(fullFile);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
if (currentPageText.Contains(searthText))
{
pages.Add(page);
}
}
pdfReader.Close();
}
return pages;
}
And an example of the output using a simple response.write command...
Document 1.pdf 1 3
Document 2.pdf 1 2 3 4
The numbers after the file name are page numbers where the searched for keyword is found. However, in Document 1, the keyword is also found on the very top of page 4 in the "References" section that began on page 3. It should be found twice in the References.
Thanks,
Bob
P.S. apparently 5.5.8 doesn't have the iTextSharp.text.pdf.parser.TextExtractionStrategy method...
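One possible way to get the surrounding characters, sketched here under the assumption that the page text has already been extracted into a string (for example the currentPageText value above): the KeywordContext class and its Find method are hypothetical names for this illustration, not part of iTextSharp. The sketch uses a case-insensitive search, whereas string.Contains in the code above is case-sensitive, which may account for some of the missed matches.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordContext
{
    // Returns one snippet per occurrence of keyword, with up to `radius`
    // characters of surrounding text on each side.
    public static List<string> Find(string pageText, string keyword, int radius = 40)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(keyword))
            return snippets;

        int index = pageText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            // Clamp the window so it never runs past either end of the page text.
            int start = Math.Max(0, index - radius);
            int end = Math.Min(pageText.Length, index + keyword.Length + radius);
            snippets.Add(pageText.Substring(start, end - start));

            // Continue searching after the current match.
            index = pageText.IndexOf(keyword, index + keyword.Length, StringComparison.OrdinalIgnoreCase);
        }
        return snippets;
    }

    public static void Main()
    {
        string page = "See the References section for details on text extraction.";
        foreach (string snippet in Find(page, "references", 8))
            Console.WriteLine(snippet);
    }
}
```

Because the search runs on one page's text at a time, a keyword that is split by a line break, a hyphen, or a page boundary in the extracted text would still be missed; normalizing whitespace before searching may help in some of those cases.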