I am converting a PDF to text using 'iText.PdfTextExtractor' and I am receiving this error ONLY on some of the PDF pages I am trying to convert:
'BuiltIn' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')
I've tried adding the following code before opening the file stream, but I am still receiving the error:
System.Text.EncodingProvider provider = System.Text.CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(provider);
Here is my code:
public void ExtractFromPdf(string pdfFile, ClaimInfo claimInfo, string memberId)
{
    System.Text.EncodingProvider provider = System.Text.CodePagesEncodingProvider.Instance;
    Encoding.RegisterProvider(provider);

    PdfReader pdfRead = new PdfReader(pdfFile);
    PdfDocument pdfDoc = new PdfDocument(pdfRead);

    // Pages are 1-based in iText7; use <= so the last page is not skipped
    for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
    {
        string convertToText = PdfToText(pdfDoc, page);
    }
}
private string PdfToText(PdfDocument pdfDoc, int pageNo)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    return PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(pageNo), strategy);
}
The error occurs at return PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(pageNo), strategy);
I've looked everywhere, and it seems 'BuiltIn' is some built-in encoding name that I don't know how to find or register. Any ideas?
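In case it helps narrow things down, here is a minimal diagnostic sketch (assuming the same iText7 setup as above) that wraps the per-page call in a try/catch, so one bad font encoding doesn't abort the whole run and the failing pages get logged:

for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        string convertToText = PdfToText(pdfDoc, page);
    }
    catch (ArgumentException ex)
    {
        // The "'BuiltIn' is not a supported encoding name" error surfaces here;
        // logging the page number shows which pages use the offending font
        Console.WriteLine($"Page {page} failed: {ex.Message}");
    }
}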
I have a series of PDF files I need to search for keywords, but many of them contain a huge amount of hidden text. What I mean is, when you CTRL+F to count how many times the keyword "CJP" appears, there are about 35 results, but in reality only about 9 are actually visible; the rest just seem to be randomly hidden all over the page. I have tried several APIs and they all read 35, not 9, so I wanted to try the class named TextRenderInfo in iTextSharp, because the method GetTextRenderMode is supposed to return 3 if the text is hidden, meaning I can use that to ignore strings that are invisible.
Here is my current code:
static void Main(string[] args)
{
    Gerdau.ITextSharpCount(@"Source.pdf", "CJP");
}

public static int ITextSharpCount(string filePath, string searchString)
{
    StringBuilder sb = new StringBuilder();
    string file = filePath;
    using (PdfReader reader = new PdfReader(file))
    {
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            // This is where I want to check TextRenderInfo.GetTextRenderMode(),
            // but I don't know how to hook it into the extraction below
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
            text = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
            sb.Append(text);
        }
    }
    int numberOfMatches = Regex.Matches(sb.ToString(), searchString).Count;
    return numberOfMatches;
}
The issue is I don't know how to set up the TextRenderInfo class to check for the hidden text. If anyone knows how to do it, it would be a huge help, and the more code the merrier :).
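For reference, one plausible way to wire this up is a wrapper strategy that inspects TextRenderInfo.GetTextRenderMode() and only forwards visible text to an inner strategy. This is a sketch against iTextSharp 5.x, not verified against these particular PDFs:

using iTextSharp.text.pdf.parser;

// Sketch: skip text drawn in render mode 3 (invisible) and delegate the rest
public class VisibleTextExtractionStrategy : ITextExtractionStrategy
{
    private readonly ITextExtractionStrategy inner = new SimpleTextExtractionStrategy();

    public void BeginTextBlock() { inner.BeginTextBlock(); }
    public void EndTextBlock() { inner.EndTextBlock(); }
    public void RenderImage(ImageRenderInfo renderInfo) { inner.RenderImage(renderInfo); }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Render mode 3 means "neither fill nor stroke", i.e. invisible text
        if (renderInfo.GetTextRenderMode() != 3)
            inner.RenderText(renderInfo);
    }

    public string GetResultantText() { return inner.GetResultantText(); }
}

Passing an instance of this to PdfTextExtractor.GetTextFromPage in place of SimpleTextExtractionStrategy should then count only the visible occurrences.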
I'm parsing a PDF file that contains Japanese characters, using iText7 in C#, like so:
public static string ExtractTextFromPDF(string filePath)
{
    var pdfReader = new PdfReader(filePath);
    var pdfDoc = new PdfDocument(pdfReader);
    var sb = new StringBuilder();
    for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
    {
        var strategy = new SimpleTextExtractionStrategy();
        sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
    }
    pdfDoc.Close();
    pdfReader.Close();
    return sb.ToString();
}
But I run into the exception:
iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was not found.'
I've searched around for a solution on how to add this, but I haven't come up with anything that works for Japanese characters. If there is another library more suited, that would also be OK. Any help?
Thanks
Encoding CMaps, in particular those for CJK scripts, are in a separate package.
For .Net use itext7.font-asian via nuget.
For Java use com.itextpdf:font-asian via maven.
The existence of this package is more visible for the Java version than for the .Net version.
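For example, with the .NET CLI (assuming a NuGet-based project), the CMaps become available without any code change:

dotnet add package itext7.font-asian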
Hi, I have a PDF with content as follows:
Property Address: 123 Door Form Type: Miscellaneous
ABC City
Pin - XXX
So when I use iTextSharp to get the content, it comes out as follows:
Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX
The data is mixed up because it continues on the next line. Please suggest a possible way to get the content as required, i.e.:
Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous
The following code using iTextSharp helped in formatting the PDF:
PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string tt = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
    // AppendAllText writes a single string; AppendAllLines expects a collection of lines
    File.AppendAllText(outfile, tt, Encoding.UTF8);
}
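If the two columns still interleave after that, a region-based approach may help: extract each column with its own rectangle so the address lines stay together. A sketch for iTextSharp 5.x follows; the rectangle coordinates are placeholders to be measured from the actual page:

// Extract only the text inside a rectangle (placeholder coordinates: x, y, width, height)
System.util.RectangleJ leftColumn = new System.util.RectangleJ(0, 0, 300, 842);
RenderFilter filter = new RegionTextRenderFilter(leftColumn);
ITextExtractionStrategy regionStrategy =
    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string addressBlock = PdfTextExtractor.GetTextFromPage(reader, page, regionStrategy);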
I'm using the helper class below to convert a PDF to a text file. This one is working fine for me.
If anyone needs a full working desktop application, please refer to this GitHub repo:
https://github.com/Kithuldeniya/PDFReader
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;

namespace PDFReader.Helpers
{
    public static class PdfHelper
    {
        public static string ManipulatePdf(string filePath)
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
            //CustomFontFilter fontFilter = new CustomFontFilter(rect);
            FilteredEventListener listener = new FilteredEventListener();

            // Create a text extraction renderer
            LocationTextExtractionStrategy extractionStrategy = listener
                .AttachEventListener(new LocationTextExtractionStrategy());

            // Note: if you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.Reset().
            // This processes only the first page; loop over pdfDoc.GetNumberOfPages() for the whole document.
            new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());

            // Get the resultant text after applying the custom filter
            String actualText = extractionStrategy.GetResultantText();
            pdfDoc.Close();
            return actualText;
        }
    }
}
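Usage is then a single call; the path here is just an example:

string text = PdfHelper.ManipulatePdf(@"C:\temp\sample.pdf");
Console.WriteLine(text);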
This issue is expected, as True Type Fonts are really an image, not a font. You would have to use image recognition techniques to accomplish reading it in.
This issue has come up multiple times, so I am placing an answer for it out to the public.
Q: How do you parse a PDF when you cannot read the font of the PDF for location purposes? E.g., an account number to identify page 1, or a page number as printed for duplex, not as the document counts it.
I had this issue when managing statements. I needed to know what page I was on, where I was, and what was on it. I began to realize that different print software produces different output, but you can normally find identifying markers in the comments of the PDF output file you are reading. For example, I am using the "Tray Call IDs" I find in the PDFs I am reading with iTextSharp. An example of this is demonstrated below.
I first use a simple method like this to test what font type the document uses:
public void SetFontType()
{
    this.PdfReaderContentParser = new PdfReaderContentParser(this.PdfReaderMain);

    // Here we see if we can read the text from the extraction. If not, we know it is a TT font.
    ITextExtractionStrategy iTextExtractionStrategy = this.PdfReaderContentParser.ProcessContent(1, new SimpleTextExtractionStrategy());
    String pdfText = iTextExtractionStrategy.GetResultantText();
    this.TextType = String.IsNullOrEmpty(pdfText) ? TextType.TrueTypeFont : TextType.Default;
}
When I have established that it is not readable and I have encountered a case of the True Type Font, I then do the following to read in the PDF [excluding unnecessary code].
The following code cycles through the annotations to find anything special to search on. In this case I am looking for MT3-type searches and/or a list of items I then use in an override. Every case will be unique, but this sums up the basic concept of stripping out the annotations of the document. This is also briefly explained in iText's documentation.
public static Boolean CycleAnnotations(PdfReader reader, int pageIndex, PdfJob job)
{
    List<string> keys = job.ConfigurationSettings.Where(cfs => cfs.Condition != null).Select(cs => cs.Condition).ToList();
    bool found = CycleAnnotations(reader, pageIndex, keys);
    if (found)
    {
        return found;
    }
    else
    {
        return CycleAnnotations(reader, pageIndex, "MT(TR3)"); //default key
    }
}
public static Boolean CycleAnnotations(PdfReader reader, int pageIndex, string key)
{
    PdfDictionary pdfDictionary = reader.GetPageN(pageIndex);
    PdfArray annots = pdfDictionary.GetAsArray(PdfName.ANNOTS);
    if (annots != null)
    {
        foreach (var iter in annots)
        {
            PdfDictionary annot = (PdfDictionary)PdfReader.GetPdfObject(iter);
            PdfString content = (PdfString)PdfReader.GetPdfObject(annot.Get(PdfName.CONTENTS));
            if (content != null)
            {
                if (Utilities.IsAnnotationFound(content, key))
                {
                    return true;
                }
            }
        }
    }
    return false;
}
public static Boolean CycleAnnotations(PdfReader reader, int pageIndex, List<string> keys)
{
    PdfDictionary pdfDictionary = reader.GetPageN(pageIndex);
    PdfArray annots = pdfDictionary.GetAsArray(PdfName.ANNOTS);
    foreach (string keyItem in keys)
    {
        if (annots != null)
        {
            foreach (var iter in annots)
            {
                PdfDictionary annot = (PdfDictionary)PdfReader.GetPdfObject(iter);
                PdfString content = (PdfString)PdfReader.GetPdfObject(annot.Get(PdfName.CONTENTS));
                if (content != null)
                {
                    if (Utilities.IsAnnotationFound(content, keyItem))
                    {
                        return true;
                    }
                }
            }
        }
    }
    return false;
}
Hopefully this helps someone, and have a great day!
Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.
Here's the code I'm using...
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);
    strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
        Encoding.UTF8, Encoding.Default.GetBytes(strPage)));
    pdfText.Add(strPage);
}
I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.
I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.
Any idea what is happening and how to fix it?
Please open the document in Adobe Reader, then try to copy/paste part of the text.
If you do this with the first page, you'll get:
The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger
jurisdiction, than is indicated by the policy. This policy covers the following states:
• INDIANA
• MICHIGAN
However, if you do this with the second page, you'll get only unreadable garbage.
In other words: copy/pasting from Adobe Reader gives you garbage.
And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
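For illustration only, a minimal OCR sketch in C#, assuming the Tesseract NuGet wrapper and that the page has already been rendered to an image (page1.png is a made-up name):

using Tesseract;

// OCR a pre-rendered page image; "./tessdata" must hold the trained data for the language ("eng" here)
using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
using (var img = Pix.LoadFromFile(@"page1.png"))
using (var ocrResult = engine.Process(img))
{
    Console.WriteLine(ocrResult.GetText());
}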
Regarding your additional question in the comments: if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?
This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE
Try this code:
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    PdfTextExtractor.GetTextFromPage(reader, page, its);
    String strPage = its.GetResultantText();
    pdfText.Add(strPage);
}
Try this code; it worked for me:
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder text = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }
    return text.ToString();
}