I am using iTextSharp to extract text from PDF documents, but text from some files encoded in ISO-8859-1 is not displayed correctly.
My code is below. I would be grateful for any help.
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
PdfReader pdfReader = null;
try
{
if (File.Exists(fileName))
{
pdfReader = new PdfReader(fileName);
Encoding encoding = Encoding.GetEncoding("iso8859-2");
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new LocationTextExtractionStrategy());
currentText = encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
catch
{
return string.Empty;
}
finally
{
if (pdfReader != null) pdfReader.Close();
}
}
Related
I am converting a PDF to a text file using iText, but the Euro currency symbol is missing from the final result.
public void TextExtraction()
{
StringBuilder allTextBuilder = new StringBuilder();
using (PdfReader pdfReader = new PdfReader(SourceFileName))
using (PdfDocument pdfDocument = new PdfDocument(pdfReader))
{
for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page), strategy);
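// Append, not AppendFormat: the page text is data, not a format string,
// and any '{' or '}' in it would throw a FormatException.
allTextBuilder.Append(currentPageText);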
}
}
File.WriteAllText(DestinationFileName, allTextBuilder.ToString(), Encoding.Unicode);
}
Does anyone have a solution for this?
In my project I need to read a PDF document that contains Ukrainian and Russian characters. PdfReader reads the document, but the Cyrillic characters are missing from the output. I tried converting the encoding, but it did not help. How can I extract these characters?
public static string GetText(string filePath)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
StringBuilder text = new StringBuilder();
if (File.Exists(filePath)){
PdfReader pdfReader = new PdfReader(filePath);
// Pages are 1-based, so use <= to include the last page.
for (int i = 1; i <= pdfReader.NumberOfPages; i++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string thePage = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
text.Append(System.Environment.NewLine);
thePage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(thePage)));
text.Append(thePage);
}
pdfReader.Close();
}
return text.ToString();
}
iTextSharp is an outdated product that is no longer supported, so it may well have problems with text extraction. Here is a simple example of how text extraction works in iText 7 (the code is in Java, but the C# API is the same apart from PascalCase method names).
String filePath = "test.pdf";
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filePath);
PdfDocument pdfDocument = new PdfDocument(pdfReader);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
PdfPage page = pdfDocument.getPage(i);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String thePage = PdfTextExtractor.getTextFromPage(page, strategy);
text.append(thePage);
}
// Closing the document also closes the underlying reader.
pdfDocument.close();
System.out.print(text);
The code is about the same as in your example, but the text extracts
This code converts only English PDFs to text correctly. I want to convert PDFs in any other language to text as well. How can I solve this problem?
My code is below:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
private string PDFReader(string url)
{
StringBuilder text = new StringBuilder();
PdfReader pdfReader;
try
{
ServicePointManager.Expect100Continue = true;
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
pdfReader = new PdfReader(url);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
if (currentText.Contains("Page " + page.ToString()))
{
currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
}
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append("\n----------------------------------------------------------------------\n");
text.Append(currentText);
}
pdfReader.Close();
}
catch (Exception ex)
{
}
return text.Replace("‘", "‘").Replace("’", "’").Replace("–", "–").ToString();
}
.NET strings are Unicode, specifically UTF-16. They don't need any kind of conversion.
The problems are caused by the attempt to convert Unicode to the local machine's locale and then back to Unicode as if it were UTF-8 (which it isn't; it's in the local machine's locale). That's also what produces the garbled strings beginning with "â€": each multi-byte UTF-8 sequence is decoded one byte at a time using a single-byte code page (most likely Western European).
This code extracts the text without any conversion issues:
static string GetPdfText(string url)
{
var separator="\n----------------------------------------------------------------------\n";
var text = new StringBuilder();
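using (var pdfReader = new PdfReader(url))
{
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
// Use a fresh strategy for each page: SimpleTextExtractionStrategy accumulates
// text, so a shared instance would repeat earlier pages in later results.
var strategy = new SimpleTextExtractionStrategy();
var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);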
text.Append(separator);
text.Append(currentText);
}
}
return text.ToString();
}
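To see why the round trip in the question corrupts text, here is a minimal sketch of the failure in isolation. It assumes ISO-8859-1 as the wrong encoding, because that encoding is available on every .NET runtime; the code page behind Encoding.Default on the asker's machine may differ. MojibakeDemo and DecodeUtf8BytesAs are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Reproduces the mistake: encode a string as UTF-8, then decode those
    // bytes with a single-byte encoding instead of UTF-8.
    public static string DecodeUtf8BytesAs(string text, Encoding wrongEncoding)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrongEncoding.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for the machine's local code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string original = "\u2019"; // right single quotation mark
        string garbled = DecodeUtf8BytesAs(original, latin1);

        // One character is three bytes in UTF-8 (E2 80 99), and each byte
        // is decoded as its own character by the single-byte encoding.
        Console.WriteLine(original.Length);        // 1
        Console.WriteLine(garbled.Length);         // 3
        Console.WriteLine(garbled[0] == '\u00E2'); // True: the leading "â"

        // Reversing the wrong step restores the original character.
        string repaired = Encoding.UTF8.GetString(latin1.GetBytes(garbled));
        Console.WriteLine(repaired == original);   // True
    }
}
```

The repair in the last step works here because ISO-8859-1 maps every byte value to a character, so no information is lost. The simpler fix is the one shown above: skip the conversion and use the string that GetTextFromPage returns.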
Please try this..
Using the WhatsMate PDF-to-Text REST API
I have a PDF file and need to extract particular data from it into Excel using C#. Is this possible?
I have written the code below to extract the text from the PDF:
private void ExportPDFToExcel(string fileName)
{
fileName = HttpContext.Current.Server.MapPath("~/Models/10310.pdf");
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(fileName);
//ExportPDFToExcel(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=ReceiptExport.xlsx");
HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
HttpContext.Current.Response.Write(text);
HttpContext.Current.Response.Flush();
HttpContext.Current.Response.End();
System.Diagnostics.Process.Start("ReceiptExport.xlsx");
}
I am using the iTextSharp library in my project. How can I get a PDF line's decoration or style (something that distinguishes my text from other text)?
public static string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
pdfReader.GetNamedDestinationFromStrings();
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}