How to read manually added text within a pdf file with c#?


I'm using iTextSharp with this C# code:
string parsedText = string.Empty;
PdfReader reader = new PdfReader(pdfPath);
ITextExtractionStrategy its = new LocationTextExtractionStrategy();
parsedText = PdfTextExtractor.GetTextFromPage(reader, 1, its);
parsedText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(parsedText)));
It parses the PDF as expected, but it does not pick up text that was added manually with tools like Foxit Reader or Nuance PDF.
Our accounting department manually adds an internal invoice number to each PDF, and I need to parse that number. For some reason I can't find it.
It looks like it is on another "layer" or something that is not parsed.
Any ideas how to read those numbers?
Thanks

It is possible that the internal invoice number is being added as an annotation, rather than as actual text on the page.
Have you tried iText's facilities for extracting annotations to see if there are any on the page?
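If the number was added as an annotation (for example a FreeText or sticky-note annotation), its text usually sits in the annotation's /Contents entry rather than in the page content stream, which is why the text extractor never sees it. A minimal sketch with iTextSharp 5 that lists those entries for one page. The class and method names are made up for illustration, and it only covers annotations that carry a /Contents string:

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class AnnotationReader
{
    // Hypothetical helper: returns the /Contents text of every annotation on a page.
    public static List<string> ReadAnnotationContents(PdfReader reader, int pageNumber)
    {
        var result = new List<string>();
        PdfDictionary page = reader.GetPageN(pageNumber);

        // /Annots is optional; a page without annotations has no such entry.
        PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
        if (annots == null) return result;

        for (int i = 0; i < annots.Size; i++)
        {
            PdfDictionary annot = annots.GetAsDict(i);
            if (annot == null) continue;

            // Not every annotation type has a /Contents string.
            PdfString contents = annot.GetAsString(PdfName.CONTENTS);
            if (contents != null) result.Add(contents.ToUnicodeString());
        }
        return result;
    }
}
```

Called as AnnotationReader.ReadAnnotationContents(new PdfReader(pdfPath), 1), this returns an empty list when the page has no annotations. In that case the number may live somewhere else, such as a form field.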

Related

How to use LocationExtractionStrategy in iText to extract from standardized pdf?

I have a standardized PDF invoice. I would like to use iText7 to extract certain data elements from the PDF. After an extensive search on the net, I was not able to find any resources or walkthroughs that show how iText7 can be used to extract text from a PDF. I can extract text fine using SimpleTextExtractionStrategy, but I am looking for examples of how to use LocationTextExtractionStrategy to extract text from given boxes in a PDF.
I read that you have to locate the "boxes" in your document first, but I really couldn't find any code or information on that. Thanks
PdfReader pdfReader = new PdfReader(path);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
}
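One way to pull text from a known box in iText7 is to wrap LocationTextExtractionStrategy in a FilteredTextEventListener with a TextRegionEventFilter, so that only text inside the rectangle reaches the strategy. A minimal sketch, assuming you already know the box coordinates. The class and method names are made up for illustration:

```csharp
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class RegionExtractor
{
    // Hypothetical helper: returns only the text that falls inside the given region.
    public static string GetTextInRegion(PdfPage page, Rectangle region)
    {
        // The filter drops text render events outside the rectangle.
        var filter = new TextRegionEventFilter(region);
        var strategy = new FilteredTextEventListener(
            new LocationTextExtractionStrategy(), filter);
        return PdfTextExtractor.GetTextFromPage(page, strategy);
    }
}
```

The rectangle is given in PDF user-space units (points) as x, y, width, height, with the origin normally at the bottom-left corner of the page. For example, new Rectangle(36, 700, 200, 50) is a placeholder box that you would replace with the measured position of the invoice field.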

iText7 for .NET SetAuthor() is Doubling the Author Value

When using the iText7 library to set a PDF document's properties, the value of the built-in Author property gets doubled, like '"Lastname, Firstname"; Lastname; Firstname'. It should be 'Lastname, Firstname'. Double quotes are added, the name appears twice, and a comma is changed to a semicolon. This has happened in two versions, 7.1.17 and 7.2.1.
The steps in creating the PDF are:
1. Use Microsoft.Interop.Word Document.ExportAsFixedFormat() to create the first PDF, which is read by readerPDF. This does not populate the custom document properties. I need to set four custom properties used by a later step in this process.
2. Use iText7 to read the above PDF, add the custom document properties, reset the built-in properties, and write the result to a second file through writerPDF. iText7 only modifies a PDF by reading from one file and writing to a second.
In step 2 the code calls the method that sets the Author property, PdfDocument.PdfDocumentInfo.SetAuthor(authorvalue);
The problem seems to happen only with commas in the author value, and I need the commas to write Lastname, Firstname. That is a requirement. If I do not reset the property, Author has double quotes around it, which is not useful for our project. All other properties, built-in and custom, are working as expected.
The code looks like this:
iText.Kernel.Pdf.PdfReader readerPDF;
iText.Kernel.Pdf.PdfWriter writerPDF;
string authorValue = "Lastname, Firstname";
readerPDF = new PdfReader(saveAsPathAndNameTemp);
writerPDF = new PdfWriter(pSavedPathAndPDFName);
PdfDocument pdfdocument = new PdfDocument(readerPDF, writerPDF);
PdfDocumentInfo info = pdfdocument.GetDocumentInfo();
info.SetAuthor(string.Empty);
info.SetAuthor(authorValue);
pdfdocument.Close();
readerPDF.Close();
writerPDF.Close();

This is an issue in Adobe Acrobat Reader. I used the following Java code to reproduce your issue:
String filename = DESTINATION_FOLDER + "openSimpleDoc.pdf";
String author = "Test, Author";
String title = "Test, Title";
String subject = "Test, Subject";
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(filename));
pdfDoc.getDocumentInfo().setAuthor(author).setTitle(title).setSubject(subject);
pdfDoc.addNewPage();
pdfDoc.close();
PdfReader reader = new PdfReader(filename);
pdfDoc = new PdfDocument(reader);
Assert.assertEquals(author, pdfDoc.getDocumentInfo().getAuthor());
Assert.assertEquals(title, pdfDoc.getDocumentInfo().getTitle());
Assert.assertEquals(subject, pdfDoc.getDocumentInfo().getSubject());
pdfDoc.close();
As you can see, I didn't set the author twice. When I open the resulting PDF in Adobe Acrobat, the author's name is displayed enclosed in double quotes.
In the file itself, however, there are no quotes. You can confirm this by inspecting the file in PDF Studio, RUPS, or Notepad++.
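For readers working in .NET, the same check can be sketched in C# with iText7. This is a rough port of the Java test above, not code from the original answer. The class and method names are made up, and it works on an in-memory PDF instead of a file:

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class AuthorCheck
{
    // Hypothetical helper: writes a one-page PDF with the given author,
    // then reopens it and returns the author stored in the Info dictionary.
    public static string RoundTripAuthor(string author)
    {
        byte[] bytes;
        using (var ms = new MemoryStream())
        {
            var pdfDoc = new PdfDocument(new PdfWriter(ms));
            pdfDoc.GetDocumentInfo().SetAuthor(author);
            pdfDoc.AddNewPage();
            pdfDoc.Close();
            bytes = ms.ToArray();
        }

        var readDoc = new PdfDocument(new PdfReader(new MemoryStream(bytes)));
        string stored = readDoc.GetDocumentInfo().GetAuthor();
        readDoc.Close();
        return stored;
    }
}
```

If AuthorCheck.RoundTripAuthor("Lastname, Firstname") returns the string unchanged, the quotes seen in the Acrobat properties dialog come from the viewer's display and not from the stored value.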

I resolved the main issue, the duplication of the Author value when it contains a comma, by updating the BouncyCastle.Crypto reference to 1.9.0.0. The secondary issue, the double quotes in the properties dialog box, is addressed in Nikita Kovaliov's answer. I thank this poster for their input.

excluding header, footer and watermark from last page with openxml

I'm using Open XML (the DocumentFormat.OpenXml NuGet package) to generate a docx file. Here is my approach:
I have a file named template.docx. In this file I have a cover page and a blank page that has a header, a footer, and a background image. I first open the document, then append some text to it, then close it.
On the other hand, I have a file named template-back.docx, which I want to append to the end of the modified document (template.docx) above.
I'm able to do that, by using this snippet:
public static void MergeDocumentWithPagebreak(string sourceFile, string destinationFile, string altChunkID) {
    using (var myDoc = WordprocessingDocument.Open(sourceFile, true)) {
        var mainPart = myDoc.MainDocumentPart;
        //Append page break
        var para = new Paragraph(new Run((new Break() { Type = BreakValues.Page })));
        mainPart.Document.Body.InsertAfter(para, mainPart.Document.Body.LastChild);
        //Append file
        var chunk = mainPart.AddAlternativeFormatImportPart(
            AlternativeFormatImportPartType.WordprocessingML, altChunkID);
        using (var fileStream = File.Open(destinationFile, FileMode.Open))
            chunk.FeedData(fileStream);
        var altChunk = new AltChunk {
            Id = altChunkID
        };
        mainPart.Document
            .Body
            .InsertAfter(altChunk, mainPart.Document.Body.Elements<Paragraph>().Last());
        mainPart.Document.Save();
    }
}
But when I do that, the header, footer, and background image are applied to the last page as well. I want to exclude the last page from getting those designs. I want it to be clean, simple, and white. Googling the issue turned up nothing helpful. Do you have any idea, please? Thanks in advance.
P.S.
The original article about merging documents here:

It's a little bit tricky, but not that complicated.
First you have to understand how Word works:
By default, a Word document is a single section, and that section shares one header and footer. If you want different headers / footers, you have to create a break at the end of a page to indicate "the next page is a new section".
Once the new section is created, you must indicate that the new section doesn't share the same header / footer.
Some documentation on how to create different headers in Word: http://www.techrepublic.com/blog/microsoft-office/accommodate-different-headers-and-footers-in-a-word-document/
Translated to your code, before inserting your document at the end of the other, you have to:
Create a section break
Insert a new header / footer in this section (an empty one)
Insert your new document in the new section
To create the new header, some other documentation: https://msdn.microsoft.com/en-us/library/office/cc546917.aspx
Trick: if the document you insert doesn't contain a header / footer, create empty ones and copy them over.
Information: I tried deleting the <w:headerReference r:id="rIdX" w:type="default"/> and setting the r:id to 0, but neither works. Creating an empty header is the fastest way.

Replace your page break with the following code:
Paragraph PageBreakParagraph = new Paragraph(new DocumentFormat.OpenXml.Wordprocessing.Run(new DocumentFormat.OpenXml.Wordprocessing.Break() { Type = BreakValues.Page }));
I also saw that you are inserting after the last child, which is not essentially the same as appending, although it works for you. Use this instead:
wordprocessingDocument.MainDocumentPart.Document.Body.Append(PageBreakParagraph);
You need to add the section break to the section properties, then append the section properties to the paragraph properties, and finally append the paragraph properties to a paragraph.
Paragraph paragraph232 = new Paragraph();
ParagraphProperties paragraphProperties220 = new ParagraphProperties();
SectionProperties sectionProperties1 = new SectionProperties();
SectionType sectionType1 = new SectionType(){ Val = SectionMarkValues.NextPage };
sectionProperties1.Append(sectionType1);
paragraphProperties220.Append(sectionProperties1);
paragraph232.Append(paragraphProperties220);
//Replace your last but one line with this one.
mainPart.Document
.Body
.Append(altChunk);
The resulting Open XML is:
<w:p>
<w:pPr>
<w:sectPr>
<w:type w:val="nextPage" />
</w:sectPr>
</w:pPr>
</w:p>
The easiest way to do it is to create the document in Word and then open it in the Open XML Productivity Tool. There you can reflect the code and see what C# code would generate the various Open XML elements you are trying to achieve. Hope this helps!

Trying to insert an image into a pdf in c#

I need to insert an image based on a generated barcode file.
The problem I'm having is when using the iTextSharp library I can normally fill in text such as
PdfReader pdfReader = new PdfReader(oldFile);
PdfStamper pdfStamper = new PdfStamper(pdfReader, outFile);
AcroFields fields = pdfStamper.AcroFields;
fields.SetField("topmostSubform[0].Page1[0].BARCODE[0]", "X974005-1");
However, there is one field that, when I click on it in the PDF, prompts me for an image to insert, and I can't seem to accomplish this programmatically. Based on some Google searches and a Stack Overflow page I stumbled upon, I inserted the following code, expecting it to work as desired:
string fieldName = "topmostSubform[0].Page1[0].BARCODE[0]";
string imageFile = "test-barcode.jpg";
AcroFields.FieldPosition fieldPosition = pdfStamper.AcroFields.GetFieldPositions(fieldName)[0];
PushbuttonField imageField = new PushbuttonField(pdfStamper.Writer, fieldPosition.position, fieldName);
imageField.Layout = PushbuttonField.LAYOUT_ICON_ONLY;
imageField.Image = iTextSharp.text.Image.GetInstance(imageFile);
imageField.ScaleIcon = PushbuttonField.SCALE_ICON_ALWAYS;
imageField.ProportionalIcon = false;
imageField.Options = BaseField.READ_ONLY;
pdfStamper.AcroFields.RemoveField(fieldName);
pdfStamper.AddAnnotation(imageField.Field, fieldPosition.page);
The problem is that, while it removes the existing field as intended, the newly created PDF does not show the new pushbutton field with the image; the area is just blank. When I step through the code in debug mode, I can see that it at least picks up the correct dimensions of the image file, so I don't know what I'm doing wrong here.
Please advise, thanks.

If you read the official documentation (that is: my book), you'll find this example: ReplaceIcon.cs
You're removing the field using pdfStamper.AcroFields.RemoveField(fieldName); and subsequently you try adding the new field using pdfStamper.AddAnnotation(imageField.Field, fieldPosition.page);
That's wrong. You should replace the field using pdfStamper.AcroFields.ReplacePushbuttonField(fieldName, imageField.Field);
The ReplacePushbuttonField() method copies plenty of settings behind the scenes.
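A minimal sketch of that replacement with iTextSharp 5, loosely following the pattern of the ReplaceIcon example. The class and method names and the parameters are made up for illustration, and the sketch assumes the target field really is a pushbutton:

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class ButtonIconReplacer
{
    // Hypothetical helper: swaps the icon of an existing pushbutton field.
    // Returns false when fieldName does not refer to a pushbutton field.
    public static bool ReplaceIcon(PdfReader reader, Stream output,
                                   string fieldName, string imageFile)
    {
        PdfStamper stamper = new PdfStamper(reader, output);
        try
        {
            AcroFields form = stamper.AcroFields;

            // Start from the existing button so its position and settings are kept.
            PushbuttonField button = form.GetNewPushbuttonFromField(fieldName);
            if (button == null) return false;

            button.Layout = PushbuttonField.LAYOUT_ICON_ONLY;
            button.ProportionalIcon = true;
            button.Image = Image.GetInstance(imageFile);

            return form.ReplacePushbuttonField(fieldName, button.Field);
        }
        finally
        {
            // Closing the stamper writes the modified PDF to the output stream.
            stamper.Close();
        }
    }
}
```

Building the new button from the existing one with GetNewPushbuttonFromField avoids having to look up the field position by hand, and no call to RemoveField or AddAnnotation is needed.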

Extract text from pdf to c#

I'm looking for a way to extract text from a PDF and use it in a program. I've done some research on the net and got a few libraries working. These were not freeware, however, and I bumped into their limits.
So I'm looking for a free library. I thought of iTextSharp, but I have no idea how to get started.
Can you guys help me out here?

Something like this should work for you. You have to watch out, though: function names change between iTextSharp releases, which is a bit annoying.
public static string GetPDFText(String pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    StringWriter output = new StringWriter();
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    return output.ToString();
}

iTextSharp is open source but the licensing model changed after version 4.1.6. The old license was much less strict while the new one requires payment if you use it commercially and don't want to release your source code. This may or may not affect you.
Here's the most basic version of text extraction using the 5.1.2.0 version:
//Full path to the file to read
string fileToRead = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), @"file1.pdf");
//Bind a PdfReader to our file
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(fileToRead);
//Extract all of the text from the first page
string allPage1Text = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1);
//That's it!
Console.Write(allPage1Text);
