iTextSharp text extraction - C#

I'm using iTextSharp in VB.NET to get the text content from a PDF file. The solution works fine for some files but not for others, even quite simple ones. The problem is that the token's StringValue is set to null (it displays as a set of empty square boxes):
token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
tknType = token.TokenType()
tknValue = token.StringValue
I can measure the length of the content, but I cannot get the actual string content.
I realized that this depends on the font of the PDF. If I create a PDF using either Acrobat or PdfCreator with Courier (which, by the way, is the default font in my Visual Studio editor), I can get all the text content. If the same PDF is built using a different font, I get the empty square boxes.
Now the question is: how can I extract text regardless of the font setting?
Thanks

Complementing Mark's answer, which helped me a lot: the iTextSharp namespaces and classes are a bit different from the Java version.
public static string GetTextFromAllPages(String pdfPath)
{
PdfReader reader = new PdfReader(pdfPath);
StringWriter output = new StringWriter();
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
return output.ToString();
}

Check out PdfTextExtractor.
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum);
or
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());
Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.
PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.

Here is a variant using iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENTS, in case someone needs it.
string strFile = @"C:\my\path\tothefile.pdf";
iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
iTextSharp.text.pdf.PRTokeniser prtTokeneiser;
int pageFrom = 1;
int pageTo = pdfRida.NumberOfPages;
iTextSharp.text.pdf.PRTokeniser.TokType tkntype ;
string tknValue;
for (int i = pageFrom; i <= pageTo; i++)
{
iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);
iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if(cannots!=null)
foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList)
{
iTextSharp.text.pdf.PdfDictionary cAnnotationDictironary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);
iTextSharp.text.pdf.PdfObject contents = cAnnotationDictironary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
if (contents != null && contents.GetType() == typeof(iTextSharp.text.pdf.PdfString))
{
string cStringVal = ((iTextSharp.text.pdf.PdfString)contents).ToString();
if (cStringVal.ToUpper().Contains("LOS 8"))
{ // DO SOMETHING FUN
}
}
}
}
pdfRida.Close();

Related

Using ITextSharp TextRenderInfo.GetTextRenderMode ignoring hidden text in pdf

I have a series of PDF files I need to search for keywords, but many of them contain a huge amount of hidden text. What I mean is that when you use CTRL+F to count occurrences of the keyword "CJP", there are about 35 results, but in reality only about 9 are actually visible; the rest seem to be hidden at random all over the page. I have tried several APIs, and they all report 35 instead of 9, so I wanted to try the TextRenderInfo class in iTextSharp, because its GetTextRenderMode method is supposed to return 3 if the text is hidden, meaning I can use that to ignore strings that are invisible.
Here is my current code:
static void Main(string[] args)
{
Gerdau.ITextSharpCount(@"Source.pdf", "CJP");
}
public static int ITextSharpCount(string filePath, string searchString)
{
StringBuilder sb = new StringBuilder();
string file = filePath;
using (PdfReader reader = new PdfReader(file))
{
for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
{
textRenderInfo.GetTextRenderMode();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
text = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
sb.Append(text);
}
}
int numberOfMatches = Regex.Matches(sb.ToString(), searchString).Count;
return numberOfMatches;
}
The issue is that I don't know how to set up the TextRenderInfo class to check for the hidden text. If anyone knows how to do it, it would be a huge help, and the more code the merrier :).
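One possible way to wire this up, as a sketch under assumptions: in iTextSharp 5.x an extraction strategy receives a TextRenderInfo for each text chunk in its RenderText method, so a strategy can skip chunks whose render mode is 3 (invisible). The class names VisibleTextExtractionStrategy and VisibleTextCounter are made up for illustration, and the sketch assumes that SimpleTextExtractionStrategy.RenderText is overridable and that PdfReader is disposable in the version in use. Text hidden by other means (white fill, clipping, being covered by an image) would not be caught by this check.

```csharp
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical strategy: ignores text drawn with render mode 3 (invisible).
public class VisibleTextExtractionStrategy : SimpleTextExtractionStrategy
{
    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Render mode 3 means the glyphs are neither filled nor stroked.
        if (renderInfo.GetTextRenderMode() != 3)
        {
            base.RenderText(renderInfo);
        }
    }
}

public static class VisibleTextCounter
{
    // Counts occurrences of searchString in the visible text of the PDF.
    public static int Count(string filePath, string searchString)
    {
        var sb = new StringBuilder();
        using (var reader = new PdfReader(filePath))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not accumulate.
                var strategy = new VisibleTextExtractionStrategy();
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        // Escape the search term so it is matched literally, not as a pattern.
        return Regex.Matches(sb.ToString(), Regex.Escape(searchString)).Count;
    }
}
```

With this in place, the call in Main would become something like VisibleTextCounter.Count(@"Source.pdf", "CJP"). The Encoding.Convert round trip from the original code is left out because GetTextFromPage already returns a .NET string.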

How do I read Japanese characters from a PDF?

I'm parsing a PDF file that contains Japanese characters using iText 7 in C#, like so:
public static string ExtractTextFromPDF(string filePath)
{
var pdfReader = new PdfReader(filePath);
var pdfDoc = new PdfDocument(pdfReader);
var sb = new StringBuilder();
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
var strategy = new SimpleTextExtractionStrategy();
sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
}
pdfDoc.Close();
pdfReader.Close();
return sb.ToString();
}
But I run into the exception:
iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was
not found.'
I've searched around for a solution on how to add this but I haven't come up with anything that works for the Japanese characters. If there is any other library more suited that would also be ok. Any help?
Thanks
Encoding CMaps, in particular those for CJK scripts, are in a separate package.
For .Net use itext7.font-asian via nuget.
For Java use com.itextpdf:font-asian via maven.
The existence of this package is more visible for the Java version than for the .Net version.

Create PDF by copying it from template with PdfCopy (loss of data)

I'm trying to create a new pdf file based on another one using PdfCopy.
Everything works fine during generation, and the generated file can be opened without any problem on my desktop, but the file seems to be corrupted and isn't accepted by the service that I must use:
SignService error when calling 'sign', probably caused by a bad file format.
I noticed that the generated PDF is always lighter than the original template, so I compared the template with the generated version. Some big parts of the data are missing, especially a whole bunch of XML. I guess PdfCopy does not copy everything from my original PDF, but I cannot figure out what I am missing.
Here is my method:
byte[] completedDocument = null;
string originalUri = Path.Combine(this.PdfPath, pdfName);
string generatedUri = Path.Combine(this.PdfGeneratedPath, generatedPdfName);
using(MemoryStream streamCompleted = new MemoryStream())
{
using(Document doc = new Document())
{
PdfCopy copy = new PdfCopy(doc, streamCompleted);
copy.PdfVersion = PdfWriter.VERSION_1_6;
doc.Open();
copy.Open();
byte[] mergedDocument = null;
PdfReader pdfReader = new PdfReader(originalUri);
int pdfPageNumber = pdfReader.NumberOfPages;
using(MemoryStream streamTemplate = new MemoryStream())
{
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, streamTemplate))
{
AcroFields acrofields = pdfStamper.AcroFields;
foreach (KeyValuePair<string, AcroFields.Item> field in acrofields.Fields)
{
string data;
if (pdfFieldsValues.TryGetValue(field.Key, out data))
{
if (data == null)
{
data = string.Empty;
}
acrofields.SetField(field.Key, data);
}
}
pdfStamper.FormFlattening = true;
pdfStamper.Writer.CloseStream = false;
}
mergedDocument = streamTemplate.ToArray();
}
pdfReader = new PdfReader(mergedDocument);
for (int page = 1; page <= pdfPageNumber; page++)
{
if (!excludedPages.Any(s => s == page))
{
copy.AddPage(copy.GetImportedPage(pdfReader, page));
}
}
doc.Close();
copy.Close();
}
completedDocument = streamCompleted.ToArray();
}
File.WriteAllBytes(generatedUri, completedDocument);
I tried to upload the "mergedDocument" rather than the "completedDocument" and my service accepts it, so I'm pretty sure it has something to do with this part:
for (int page = 1; page <= pdfPageNumber; page++)
{
if (!excludedPages.Any(s => s == page))
{
copy.AddPage(copy.GetImportedPage(pdfReader, page));
}
}
Or with the PdfCopy initialization.
You start with a form. You fill out the form and you flatten it. By flattening it, you deliberately throw away all interactivity. I'm surprised that you're surprised that the file is getting smaller: you're throwing away the form infrastructure!
You then upload the flattened file to some service unknown to us. This service complains:
SignService error when calling 'sign', probably caused by a bad file format.
As we don't know which service you are talking about, we can only guess. An educated guess would be that the original form contains a signature field that needs to be signed by a signing service.
Obviously that field is gone: you flattened the form! I may be wrong, but I assume that the service also tries to read the fields you filled out, but that won't be possible either as you throw away all interactivity. Please remove the following line:
pdfStamper.FormFlattening = true;
Then there's Chris' comment: it seems that you're using PdfCopy. If you're using an old version of iTextSharp (before iText 5.5.1), you shouldn't expect the form to be preserved. If you are using a recent version, you should instruct PdfCopy to preserve the form (but that line is missing). You don't need to ask 'how do I preserve the form?' because you shouldn't be using PdfCopy anyway.
You only need PdfStamper. You already use PdfStamper to fill out the fields, now you can also use the selectPages() method to select the pages you want to keep (or to exclude the ones you want to remove).
Finally, it is unclear what you mean when you write:
There are some big parts of missing data, especially a whole bunch of xml.
Are you saying that the form isn't a pure AcroForm, but that it also contains an XFA stream? If so, then you most definitely can't use PdfCopy.
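A minimal sketch of the PdfStamper-only approach described above, under assumptions: it targets iTextSharp 5.x, applies the page selection on the PdfReader before stamping (assuming the SelectPages overload that takes a collection of page numbers is available), and leaves the form unflattened. The names FormFiller and FillAndSelect and the parameter names are hypothetical. Whether the signing service accepts the result, and whether an XFA form survives, is not something this sketch can guarantee.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

public static class FormFiller
{
    // Hypothetical helper: fills the form fields and drops the excluded
    // pages, without flattening the form and without PdfCopy.
    public static byte[] FillAndSelect(
        string templatePath,
        IDictionary<string, string> values,
        ICollection<int> excludedPages)
    {
        var reader = new PdfReader(templatePath);
        try
        {
            // Build the list of pages to keep (page numbers are 1-based).
            var pagesToKeep = new List<int>();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                if (!excludedPages.Contains(page))
                {
                    pagesToKeep.Add(page);
                }
            }
            reader.SelectPages(pagesToKeep);

            using (var output = new MemoryStream())
            {
                using (var stamper = new PdfStamper(reader, output))
                {
                    AcroFields fields = stamper.AcroFields;
                    foreach (KeyValuePair<string, string> entry in values)
                    {
                        // SetField returns false for unknown field names.
                        fields.SetField(entry.Key, entry.Value ?? string.Empty);
                    }
                    // FormFlattening is deliberately left at its default (false).
                }
                // ToArray still works after the stamper has closed the stream.
                return output.ToArray();
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The returned bytes can be written with File.WriteAllBytes, as in the question. Because the stamper keeps the original document structure, the signature field and other form infrastructure are not thrown away the way they are by flattening followed by PdfCopy.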

Using iTextSharp, trying to extract text from a PDF gives non-readable data

Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.
Here's the code I'm using...
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);
strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
Encoding.UTF8, Encoding.Default.GetBytes(strPage)));
pdfText.Add(strPage);
}
I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.
I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.
Any idea what is happening and how to fix it?
Please open the document in Adobe Reader, then try to copy/paste part of the text.
If you do this with the first page, you'll get:
The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger
jurisdiction, than is indicated by the policy. This policy covers the following states:
• INDIANA
• MICHIGAN
However, if you do this with the second page, you'll get:
In other words: copy/pasting from Adobe Reader gives you garbage.
And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
Regarding your additional question in the comments: if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?
This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE
Try this code:
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, page, its);
String strPage = its.GetResultantText();
pdfText.Add(strPage);
}
Try this code; it worked for me:
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
}
return text.ToString();
}

Combine PDFs c#

How can I combine multiple PDFs into one PDF without a 3rd party component?
I don't think you can.
The open-source component PDFSharp has that functionality, along with a nice source code sample on combining files.
The .NET Framework does not contain the ability to modify/create PDFs. You need a 3rd party component to accomplish what you are looking for.
As others have said, there is nothing built in to do that task. Use iTextSharp with this example code.
AFAIK C# has no built-in support for handling PDF so what you are asking can not be done without using a 3rd party component or a COTS library.
Regarding libraries there is a myriad of possibilities. Just to point a few:
http://csharp-source.net/open-source/pdf-libraries
http://www.codeproject.com/KB/graphics/giospdfnetlibrary.aspx
http://www.pdftron.com/net/index.html
I don't think the .NET Framework contains any such libraries. I used iTextSharp with C# to combine PDF files. I think iTextSharp is the easiest way to do this. Here is the code I used.
string[] lstFiles=new string[3];
lstFiles[0]=@"C:/pdf/1.pdf";
lstFiles[1]=@"C:/pdf/2.pdf";
lstFiles[2]=@"C:/pdf/3.pdf";
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage;
string outputPdfPath=@"C:/pdf/new.pdf";
sourceDocument = new Document();
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
//Open the output file
sourceDocument.Open();
try
{
//Loop through the files list
for (int f = 0; f < lstFiles.Length; f++) // include the last file as well
{
int pages =get_pageCcount(lstFiles[f]);
reader = new PdfReader(lstFiles[f]);
//Add pages of current file
for (int i = 1; i <= pages; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
reader.Close();
}
//At the end save the output file
sourceDocument.Close();
}
catch (Exception ex)
{
throw; // rethrow without resetting the stack trace
}
private int get_pageCcount(string file)
{
using (StreamReader sr = new StreamReader(File.OpenRead(file)))
{
Regex regex = new Regex(@"/Type\s*/Page[^s]");
MatchCollection matches = regex.Matches(sr.ReadToEnd());
return matches.Count;
}
}
ITextSharp is the way to go
Although it has already been said, you can't manipulate PDFs with the built-in libraries of the .NET Framework. I can however recommend iTextSharp, which is a .NET port of the Java iText. I have played around with it, and found it to be a very easy tool to use.
