iTextSharp: get text decoration from a PDF - C#

I am using the iTextSharp library in my project. How can I get the decoration or style of a line of text in a PDF (something that distinguishes my text from the rest)?
public static string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        pdfReader.GetNamedDestinationFromStrings();
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}

Related

How can I read the Euro symbol using iText

I am converting a PDF to a text file using iText, but the Euro currency symbol is missing from the final result.
public void TextExtraction()
{
    StringBuilder allTextBuilder = new StringBuilder();
    using (PdfReader pdfReader = new PdfReader(SourceFileName))
    using (PdfDocument pdfDocument = new PdfDocument(pdfReader))
    {
        for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page), strategy);
            // Append, not AppendFormat: the page text is data, not a format string,
            // and a '{' or '}' in it would throw a FormatException.
            allTextBuilder.Append(currentPageText);
        }
    }
    File.WriteAllText(DestinationFileName, allTextBuilder.ToString(), Encoding.Unicode);
}
I wonder if someone has any solution for me?

iTextSharp: extracting Cyrillic characters

In my project I need to read a PDF document that contains Ukrainian and Russian characters. PdfReader reads all the other characters in the PDF, but the Cyrillic characters are missing from the output. I tried changing the encoding, but it did not help. What can I do about these characters?
public static string GetText(string filePath)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    StringBuilder text = new StringBuilder();
    if (File.Exists(filePath))
    {
        PdfReader pdfReader = new PdfReader(filePath);
        // Pages are numbered from 1, so use <= to include the last page.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string thePage = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
            text.Append(System.Environment.NewLine);
            thePage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(thePage)));
            text.Append(thePage);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
iTextSharp is an outdated product that is no longer supported, so it probably has problems with text extraction. Here is a simple example of how text extraction works in iText 7 (the code is in Java, but it is essentially the same in C#).
String filePath = "test.pdf";
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filePath);
PdfDocument pdfDocument = new PdfDocument(pdfReader);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
    PdfPage page = pdfDocument.getPage(i);
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    String thePage = PdfTextExtractor.getTextFromPage(page, strategy);
    text.append(thePage);
}
pdfReader.close();
System.out.print(text);
The code is about the same as in your example, but the text extracts
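
One visible difference from the question's code is that this example does not pass the extracted string through an encoding conversion. The thread does not confirm it, but a likely contributor to the missing characters is that encoding a string to a single-byte legacy charset replaces every character that charset cannot represent. Below is a minimal sketch in plain Java, without iText. The class and the `roundTrip` helper are hypothetical names, and ISO-8859-1 is an assumption standing in for whatever `Encoding.Default` is on the asker's machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Hypothetical helper that mimics the question's pattern:
    // string -> legacy bytes -> re-encode as UTF-8 -> string.
    static String roundTrip(String text, Charset legacy) {
        // Characters the legacy charset cannot represent are replaced here ('?').
        byte[] legacyBytes = text.getBytes(legacy);
        String decoded = new String(legacyBytes, legacy);
        // Re-encoding as UTF-8 afterwards cannot bring the lost characters back.
        byte[] utf8Bytes = decoded.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0456\u0442"; // six Cyrillic letters
        String euro = "price: 10 \u20AC";                          // ends with the Euro sign
        System.out.println(roundTrip(cyrillic, StandardCharsets.ISO_8859_1)); // prints ??????
        System.out.println(roundTrip(euro, StandardCharsets.ISO_8859_1));     // prints price: 10 ?
    }
}
```

The string returned by the extractor is already Unicode, so in this sketch the only effect of the round trip is to destroy characters. Plain ASCII text passes through unchanged, which would explain why only some characters go missing.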

I am using iText to extract text from a PDF file; I can see the text, but the page structure is lost

I am using iText to extract text from a PDF file. I can see all the text values, but the structure is broken. Could you help me extract the text so that it is laid out exactly as in the PDF file? I have tried some online tools and they do the extraction correctly. What library are they using?
StringBuilder text = new StringBuilder();
if (File.Exists(ofd.FileName))
{
    PdfReader pdfReader = new PdfReader(ofd.FileName);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        //ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        text.Append(currentText);
    }
    pdfReader.Close();
}
rtxtboxInvoice.Text = text.ToString();

Can we extract particular PDF data to Excel

I have a PDF file and need to extract particular data from it to Excel using C#. Is that possible?
I have written the code below to extract the PDF text:
private void ExportPDFToExcel(string fileName)
{
    fileName = HttpContext.Current.Server.MapPath("~/Models/10310.pdf");
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(fileName);
    //ExportPDFToExcel(fileName);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
        text.Append(currentText);
    }
    pdfReader.Close();
    HttpContext.Current.Response.Clear();
    HttpContext.Current.Response.Buffer = true;
    HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
    HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=ReceiptExport.xlsx");
    HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
    HttpContext.Current.Response.Write(text);
    HttpContext.Current.Response.Flush();
    HttpContext.Current.Response.End();
    System.Diagnostics.Process.Start("ReceiptExport.xlsx");
}

iTextSharp - GetTextFromPage does not recognize ISO-8859 characters

I am using iTextSharp to extract text from PDF documents, but text from some files that are encoded in ISO-8859-1 is not displayed correctly.
Below is my code; I would be grateful for any help.
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = null;
    try
    {
        if (File.Exists(fileName))
        {
            pdfReader = new PdfReader(fileName);
            Encoding encoding = Encoding.GetEncoding("iso8859-2");
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new LocationTextExtractionStrategy());
                currentText = encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
    catch
    {
        return string.Empty;
    }
    finally
    {
        if (pdfReader != null) pdfReader.Close();
    }
}
