iTextSharp and PDF - C#

I want to read a PDF file line by line while keeping its original format. Can I do this with iTextSharp?
I am using the following code:
private void button1_Click(object sender, EventArgs e)
{
    string text = string.Empty;
    string path = "C:\\Documents and Settings\\Rafael\\Desktop\\Imprimiendo\\Print1.pdf";
    PdfReader reader = new PdfReader(path);
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Append each page so earlier pages are not overwritten
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    richTextBox1.Text = text;
    reader.Close();
}
Thanks, I really need your help.

If you want to read a PDF file with a small amount of data in it, iTextSharp is a good choice; you may find an answer here:
Reading PDF content with itextsharp dll in VB.NET or C#
However, if your PDF file holds a large amount of data, iTextSharp will have problems with this task. In such a case you may need a third-party library. This article may help:
Read PDF file in C#
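For the question as posted, here is a minimal sketch (assuming iTextSharp 5.x; the ExtractAllPages helper name is just for illustration) that collects every page instead of overwriting the previous one, and uses LocationTextExtractionStrategy to keep the extracted text arranged in reading order:
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static string ExtractAllPages(string path)
{
    StringBuilder text = new StringBuilder();
    using (PdfReader reader = new PdfReader(path))
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // LocationTextExtractionStrategy sorts the text chunks into
            // top-to-bottom, left-to-right lines before returning them
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
    }
    return text.ToString();
}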

Related

Searching PDF file for specific word then saving that specific PDF page C#

I am trying to create an app that searches through a bulk PDF file for a letter ID (e.g. 1234567) entered via a textbox. If it locates the ID, it then saves that specific page to a new document. I'm currently using PDFSharp, but I'm struggling to find anything online that resembles what I am trying to achieve.
UPDATE: I have solved my issue and managed to get a working result! I will update the thread tomorrow with the code, as it may help others.
private void btnSearch_Click(object sender, EventArgs e)
{
    string letterID = txtLetterID.Text;
    // Open the input file in Import mode
    PdfDocument inputPDFFile = PdfReader.Open(path, PdfDocumentOpenMode.Import);
    PdfDictionary dictionary = new PdfDictionary(inputPDFFile);
    string id = dictionary.Elements.GetString(letterID);
    // Get the total number of pages in the PDF
    var totalPagesInInputPDFFile = inputPDFFile.PageCount;
    if (id.Equals(letterID))
    {
        // Create an instance of the PDF document in memory
        PdfDocument outputPDFDocument = new PdfDocument();
        // Add a specific page to the PdfDocument instance
        outputPDFDocument.AddPage(inputPDFFile.Pages[totalPagesInInputPDFFile - 1]);
        // Save the PDF document
        SaveOutputPDF(outputPDFDocument, totalPagesInInputPDFFile);
    }
    else
    {
        lblMessage.Text = "Letter ID not found in this set of letters";
    }
}
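Until the promised update arrives, here is a rough sketch of one way to achieve this, not the OP's actual solution: PDFSharp cannot read page text, so iTextSharp locates the page containing the ID and PDFSharp then copies it. SearchAndCopyPage is a hypothetical helper; path and lblMessage come from the question:
private void SearchAndCopyPage(string path, string letterID)
{
    // Find the page containing the letter ID by extracting each page's text
    int foundPage = -1;
    using (var itextReader = new iTextSharp.text.pdf.PdfReader(path))
    {
        for (int page = 1; page <= itextReader.NumberOfPages; page++)
        {
            string pageText = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(itextReader, page);
            if (pageText.Contains(letterID))
            {
                foundPage = page;
                break;
            }
        }
    }
    if (foundPage == -1)
    {
        lblMessage.Text = "Letter ID not found in this set of letters";
        return;
    }
    // Copy the matching page into a new document with PDFSharp
    PdfDocument inputPDFFile = PdfReader.Open(path, PdfDocumentOpenMode.Import);
    PdfDocument outputPDFDocument = new PdfDocument();
    outputPDFDocument.AddPage(inputPDFFile.Pages[foundPage - 1]); // Pages is zero-based
    outputPDFDocument.Save("letter-" + letterID + ".pdf");
}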

Convert bytes to PDF File

I am able to create a Word doc using the code below.
Question: how do I create a PDF instead of a Word doc?
Code:
using (StreamWriter outputFile = new StreamWriter(Path.Combine(docPath, tdindb.TDCode + "-test.doc")))
{
    string html = string.Format("<html>{0}</html>", sbHtml);
    outputFile.WriteLine(html);
}
string FileLocation = docPath + "\\" + tdindb.TDCode + "-test.doc";
byte[] fileBytes = System.IO.File.ReadAllBytes(FileLocation);
string fileName = Path.GetFileName(FileLocation);
return File(fileBytes, System.Net.Mime.MediaTypeNames.Application.Octet, fileName);
Thank you
You need to use one of the PDF creation libraries. I have tried iText, IronPDF, and PDFFlow; all of them create PDF documents from scratch. PDFFlow suited my case best because I needed automatic page creation and multi-page spread tables.
This is how to create a simple PDF file in C#:
DocumentBuilder.New()
    .AddSection()
    .AddParagraphToSection("your text goes here!")
    .ToSection()
    .ToDocument()
    .Build("Result.PDF");
Feel free to ask if you need more help.
Converting a Word document to HTML has been answered here before. The linked example is the first result of many on this site.
Once you have your HTML, you need a PDF creation library to produce the PDF. For this example we will use IronPDF, which requires only a few lines of code:
string html = string.Format("<html>{0}</html>", sbHtml);
var renderer = new IronPdf.ChromePdfRenderer();
// Get the PDF as a byte array
byte[] pdfData = renderer.RenderHtmlAsPdf(html).BinaryData;
// Or save the file straight to disk
renderer.RenderHtmlAsPdf(html).SaveAs("output.pdf");
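If, as in the question, the bytes must be returned from an MVC action, something along these lines should work (using the PDF MIME type instead of the generic octet-stream):
return File(pdfData, "application/pdf", tdindb.TDCode + "-test.pdf");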

iTextSharp: How to read a table in a PDF file

I am working on converting PDF to text. I can get the text from the PDF correctly, but it gets complicated with table structures. I know PDF doesn't support a table structure, but I think there is a way to get the cells correctly. For example, I want to convert to text like this:
> This is first example.
> This is second example.
But when I convert the PDF to text, the data looks like this:
> This is This is
> first example. second example.
How can I get the values correctly?
--EDIT:
Here is how I convert the PDF to text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
    filepath = ofd.FileName.ToString();
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filepath);
        // Use <= so the last page is not skipped
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText += s;
        }
        reader.Close();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all the text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables whose cells have multi-line content.
Depending on the document in question, there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document are already in the order one wants for text extraction.
Use a custom text extraction strategy that makes use of tagging information, if the document tables are properly tagged.
Use a complex custom text extraction strategy that tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that replacing the LocationTextExtractionStrategy with the SimpleTextExtractionStrategy made it work.
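For reference, the fix amounts to swapping the strategy class; a minimal sketch of the working variant:
// SimpleTextExtractionStrategy returns the text in the order the PDF
// content stream draws it, which in this document matches the cell order
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);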

PDF to Text: iTextSharp: Duplicate Pages in Extraction Results

Thanks in advance.
The Background:
I'm working on a console application that extracts data from specific sections in PDF documents. To do this I first need to convert the PDF into a string to work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using SimpleTextExtractionStrategy() (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy() but found it ineffective for this page layout).
Description of content being converted to text:
The pages I seem to be having trouble with have a "header" posted up on the side of the page. Pages with headers are intermittently dispersed through the document.
Image of page layout: http://postimg.org/image/b7i25v0g1/
The Problem:
It seems that when the extractor finishes looking through the columns on the page, it moves on to that side header. It then jumps to the next page with a side header, converts that to text, and then starts again from the top of the page where the first header was encountered.
I'd end up with text that looks like:
Page 1 Content
First Header
Second Header
Page 1 Content
Page 2 Content
etc.
Here is the pdf: http://www.filedropper.com/dd35-completeadventurer
I'm not married to iTextSharp I just need a reliable way to convert documents with this format to text. A work around or alternate method would be appreciated.
public static string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            // Convert PDF to text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); // LocationTextExtractionStrategy was ineffective here
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            strText = strText + s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}
As already conjectured in a comment, the duplicate text already is present in the PDF content!
Details
In your document, the page contents of pairs of facing pages often are identical, each time containing the whole spread, while the individual pages merely display the left or the right half respectively.
E.g. consider the two pages 6 and 7. Their contents are identical, filling the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or the right half respectively, only the expected content is shown on page 6 and on page 7.
Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restrict to these boxes, they extract all text drawn anywhere in the content. Thus, the duplicate text.
Preventing duplicate text in the extraction result
Knowing the cause for the text duplication, there are multiple ways to prevent it:
You can try to extract the content of only every other PDF page. Unfortunately, the above is not true for all pages: at least the initial pages (title page, contents, ...) are not created using the scheme explained above, and further into the book there are some artwork pages that do not follow the scheme either. Thus, this option would require quite some management of exceptional pages.
You can extract the contents of each page but keep the contents of the previously processed page in some variable. Now only add the newly extracted content to the result if it does not equal the content of the prior page.
You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
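A minimal sketch of that third approach, assuming the RegionTextRenderFilter and FilteredTextRenderListener classes from the linked example, restricting extraction to each page's crop box:
string strText = string.Empty;
using (PdfReader reader = new PdfReader(filePath))
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Only process text chunks drawn inside this page's crop box,
        // so the off-page half of the spread is ignored
        RenderFilter filter = new RegionTextRenderFilter(reader.GetCropBox(page));
        ITextExtractionStrategy strategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        strText += PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }
}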
Using mkl's second method (checking each page for a repeat of the previous one), I came up with the following, and it works brilliantly; an easy fix:
public static string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        string prevPage = "";
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            // Convert PDF to text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            // Only keep the page if it does not repeat the previous one
            if (prevPage != s)
                strText += s;
            prevPage = s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}

Using iTextSharp, trying to extract text from a PDF gives non-readable data

Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.
Here's the code I'm using...
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);
    strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strPage)));
    pdfText.Add(strPage);
}
I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.
I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.
Any idea what is happening and how to fix it?
Please open the document in Adobe Reader, then try to copy/paste part of the text.
If you do this with the first page, you'll get:
The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger
jurisdiction, than is indicated by the policy. This policy covers the following states:
• INDIANA
• MICHIGAN
However, if you do this with the second page, you'll get unreadable garbage. In other words: copy/pasting from Adobe Reader gives you garbage.
And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
Regarding your additional question in the comments: if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?
This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE
Try this code:
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    PdfTextExtractor.GetTextFromPage(reader, page, its);
    string strPage = its.GetResultantText();
    pdfText.Add(strPage);
}
Try this code; it worked for me:
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder text = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }
    return text.ToString();
}
