Reading text from PDF with iText7 + C#, text not recognized

I want to read data from a PDF document. I use iText7:
var src = "<file location>";
var pdfDocument = new PdfDocument(new PdfReader(src));
var strategy = new LocationTextExtractionStrategy();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    string text = PdfTextExtractor.GetTextFromPage(page, strategy);
    string processed = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
}
pdfDocument.Close();
It runs, but the letters are not recognized. All of the text looks like this:
"����������\n�������������������������\n�����������������������������������\n
The document is in English, so I didn't expect any encoding problems. What is the cause of this issue and how can I fix it?

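The string returned by GetTextFromPage is already a .NET string, so you don't need the conversion you're doing; routing it through Encoding.Default can only lose characters. Also create a new strategy for each page, because a reused LocationTextExtractionStrategy keeps the text of the pages it has already processed. Change the code to:
StringBuilder processed = new StringBuilder();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    // A fresh strategy per page, so earlier pages are not repeated in the output.
    var pageStrategy = new LocationTextExtractionStrategy();
    string text = PdfTextExtractor.GetTextFromPage(page, pageStrategy);
    processed.Append(text);
}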
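To see why the conversion in the question cannot help, here is a minimal sketch that uses only System.Text. The RoundTrip helper and the EncodingRoundTripDemo class are hypothetical names, and Encoding.ASCII stands in for Encoding.Default (whose value depends on the platform) so that the result is deterministic.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Mirrors the pattern from the question: string -> legacy bytes -> UTF-8 bytes -> string.
    // "legacy" is an assumption standing in for Encoding.Default.
    public static string RoundTrip(string text, Encoding legacy)
    {
        byte[] legacyBytes = legacy.GetBytes(text);                            // lossy step
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling EncodingRoundTripDemo.RoundTrip("Caf\u00E9 \u2013 na\u00EFve", Encoding.ASCII) returns "Caf? ? na?ve": every character the legacy encoding cannot represent is replaced, while pure ASCII text comes back unchanged. In other words, the round trip either does nothing or destroys characters; it never repairs text that was already extracted incorrectly.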

Related

HTML to Word or PDF with GEMBOX

I am using GemBox.Document to insert an HTML file into a Word or PDF document.
Unfortunately, in the resulting Word (or PDF) file, the table rows (cells) are taller than in the original HTML file, and I cannot change this with CSS or HTML.
Can you please suggest a solution?
string fileName="zzzz";
var destinationDocument = new DocumentModel();
var section = new Section(destinationDocument);
destinationDocument.Sections.Add(section);
var srcDocument = DocumentModel.Load(TempPath + fileName + ".html");
var pageSetup = srcDocument.Sections[0].PageSetup;
var destpagesPageSetup = destinationDocument.Sections[0].PageSetup;
destpagesPageSetup.Orientation = Orientation.Landscape;
destpagesPageSetup.PageWidth = 1000;
destpagesPageSetup.PageHeight = 1000;
destpagesPageSetup.RightToLeft = true;
destpagesPageSetup.PageMargins.Left = 20;
destpagesPageSetup.PageMargins.Right = 0;
destpagesPageSetup.PageMargins.Bottom = 0;
destpagesPageSetup.PageMargins.Top = 0;
destpagesPageSetup.PageMargins.Gutter = 0;
destpagesPageSetup.PageMargins.Footer = 0;
var mapping = new ImportMapping(srcDocument, destinationDocument, false);
var blocks = srcDocument.Sections[0].Blocks;
foreach (Block b in blocks)
{
    //b.ParentCollection.TableFormat.DefaultCellSpacing = 1;
    Block b1 = destinationDocument.Import(b, true, mapping);
    section.Blocks.Add(b1);
}
var pageSetup1 = section.PageSetup;
destinationDocument.Save(TempPath + fileName + ".pdf");
thanks
This issue was caused by the cell margins that came from the HTML content.
After investigating that HTML, the issue has been resolved, and the fix is available in the latest bugfix version:
https://www.gemboxsoftware.com/document/downloads/bugfixes.html
Or in the current latest NuGet package:
https://www.nuget.org/packages/GemBox.Document/

How to convert special characters from .PDF to .TXT in C# using iTextSharp?

I have a document, CODE-SUN.pdf.
I extracted the text from it using the source code below.
The extracted text differs from the text in CODE-SUN.pdf.
For example, "committed" was extracted as "commi\0ed".
Please see the attached CODE-SUN.pdf document: https://www.dropbox.com/s/p88695vel6zaa86/CODE-SUN.pdf?dl=0
I've tried the below source code:
public static string ConvertToTxtWithITextSharp(byte[] content)
{
    var pdfReader = new PdfReader(content);
    var s = new StringBuilder();
    foreach (var pageNo in Enumerable.Range(1, pdfReader.NumberOfPages))
    {
        s.AppendLine($"** Page {pageNo} **");
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        var currentText = "";
        currentText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
        currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.ASCII, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        s.AppendLine(ConcatLines(currentText));
    }
    return s.ToString();
}
The desired result is to extract the correct text from any .pdf document.

Using iTextSharp for C# and pdfreader returns only footer info

I have two libraries of PDF documents; I read every document and parse specific information from it. One library processes without issues. The other library returns only the footer of each page, as follows: Page 1 of 6Page 2 of 6Page 3 of 6Page4 of 6.....
The library that works has one document with multiple pages.
The following is the PdfReader code I am using. Has anyone experienced this behavior before? What is different between the documents, and how should I handle the case where only the footer is returned?
static string ReadPdfFile(string fileName)
{
    string curFile = @fileName;
    // Console.WriteLine(curFile);
    // Console.WriteLine(File.Exists(curFile) ? "File exists." : "File does not exist.");
    StringBuilder text = new StringBuilder();
    if (File.Exists(curFile))
    {
        Console.Error.WriteLine("in: " + fileName);
        PdfReader pdfReader = new PdfReader(fileName);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText =
                Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8,
                    Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned the text as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();
public void ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string page = "";
            page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
            string[] lines = page.Split('\n');
            foreach (string line in lines)
            {
                MessageBox.Show(line);
            }
        }
    }
}
I know this is an older post, but I spent a lot of time trying to figure this out, so I'm sharing this for future readers who search for it:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"Your said path\the file name.pdf";
            string outPath = @"the output said path\the text file name.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.
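The sample above reopens the output file for every single line. A possible variation, sketched here with only System.IO, opens the writer once and appends all lines of all pages. LineWriter and AppendPages are hypothetical names, and the page texts are assumed to have been extracted already (for example with GetTextFromPage as shown above).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineWriter
{
    // Appends every '\n'-separated line of each page text to outPath.
    // The file is opened once; returns the number of lines written.
    public static int AppendPages(string outPath, IEnumerable<string> pageTexts)
    {
        int written = 0;
        using (var writer = new StreamWriter(outPath, append: true))
        {
            foreach (string pageText in pageTexts)
            {
                foreach (string line in pageText.Split('\n'))
                {
                    writer.WriteLine(line);
                    written++;
                }
            }
        }
        return written;
    }
}
```

Usage would look like LineWriter.AppendPages(outPath, pages), where pages is a list holding one extracted string per page. The behavior matches the original loop; only the number of file opens changes.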
None of the other code samples here worked for me, probably because of changes in the iText7 API.
This minimal example, which reads only the first page, works:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());
LocationTextExtractionStrategy automatically inserts '\n' in the output text. However, it sometimes inserts '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
    return OrientationMagnitude == other.OrientationMagnitude &&
           DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when DistPerpendicular and other.DistPerpendicular differ only slightly, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Alternatively, you can put that check in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
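The tolerance idea can be illustrated without iText at all. The sketch below is not the library's implementation: Chunk and LineGrouper are hypothetical types, the chunks are assumed to arrive in reading order, and the tolerance value is something you would have to tune for your documents.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iText type.
public sealed class Chunk
{
    public Chunk(string text, float distPerpendicular)
    {
        Text = text;
        DistPerpendicular = distPerpendicular;
    }

    public string Text { get; }
    public float DistPerpendicular { get; }
}

public static class LineGrouper
{
    // Two chunks share a line when their baselines differ by less than the tolerance.
    public static bool SameLine(Chunk a, Chunk b, float tolerance)
    {
        return Math.Abs(a.DistPerpendicular - b.DistPerpendicular) < tolerance;
    }

    // Starts a new line whenever a chunk is not on the same line as the previous chunk.
    // Assumes the chunks are already in reading order.
    public static List<string> GroupIntoLines(IEnumerable<Chunk> chunksInReadingOrder, float tolerance)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        Chunk previous = null;
        foreach (Chunk chunk in chunksInReadingOrder)
        {
            if (previous != null && !SameLine(previous, chunk, tolerance))
            {
                lines.Add(current.ToString());
                current.Clear();
            }
            current.Append(chunk.Text);
            previous = chunk;
        }
        if (previous != null)
        {
            lines.Add(current.ToString());
        }
        return lines;
    }
}
```

With chunks ("Hello ", 100), ("world", 99.5) and ("Next", 80), a tolerance of 2 yields the two lines "Hello world" and "Next", whereas a tolerance of 0.1 splits them into three lines, which is the unwanted behavior described above.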
Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. The text extracted by LocationTextExtractionStrategy contains a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;
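Splitting on '\n' alone can leave stray '\r' characters or empty entries if the extracted text contains Windows line endings or blank lines. A small sketch of a more forgiving split, using only the standard library (PageLines and SplitLines are hypothetical names):

```csharp
using System;

public static class PageLines
{
    // Splits extracted page text into lines, accepting both "\r\n" and "\n"
    // and dropping empty entries.
    public static string[] SplitLines(string pageText)
    {
        return pageText.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Whether dropping empty lines is desirable depends on the analysis; pass StringSplitOptions.None instead if blank lines carry meaning in your documents.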
Try (this snippet is Java; the C# equivalents are GetTextFromPage and Split):
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

How to use the ABCPdf.NET to extract texts from all pages of a PDF file?

How can I use the ABCPdf.NET tool to extract the text content of a PDF file?
I tried the GetText method, but it doesn't extract all of the contents:
var doc = new Doc();
var url = @".../FileName.pdf";
doc.Read(url);
string xmlContents = doc.GetText("Text");
Response.Write(xmlContents);
doc.Clear();
doc.Dispose();
My pdf has almost 1000 words but the GetText only returns 4-5 words. I realized it returns only the texts of the first page.
So the question should be "how to extract the text from all pages of a pdf file?" -(changed the Title to make it clearer).
Thanks,
For your benefit, yes you!
public string ExtractTextsFromAllPages(string pdfFileName)
{
    var sb = new StringBuilder();
    using (var doc = new Doc())
    {
        doc.Read(pdfFileName);
        for (var currentPageNumber = 1; currentPageNumber <= doc.PageCount; currentPageNumber++)
        {
            doc.PageNumber = currentPageNumber;
            sb.Append(doc.GetText("Text"));
        }
    }
    return sb.ToString();
}
If you don't have the file path but have the bytes, then:
public string ExtractTextsFromAllPages(Byte[] pdfBytes)
{
    var sb = new StringBuilder();
    using (var doc = new Doc())
    {
        doc.Read(pdfBytes);
        for (var currentPageNumber = 1; currentPageNumber <= doc.PageCount; currentPageNumber++)
        {
            doc.PageNumber = currentPageNumber;
            sb.Append(doc.GetText("Text"));
        }
    }
    return sb.ToString();
}
Have you tried the GetText method?
doc.Read(.......);
var textOperation = new TextOperation(doc);
textOperation.PageContents.AddPages();
string allText = textOperation.GetText();
