Can we extract particular PDF data to Excel - c#

I have a PDF file and need to extract particular data from it into Excel using C#. Is this possible?
I have written the code below to extract the text from the PDF:
private void ExportPDFToExcel(string fileName)
{
    fileName = HttpContext.Current.Server.MapPath("~/Models/10310.pdf");
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(fileName);
    //ExportPDFToExcel(fileName);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
        text.Append(currentText);
    }
    pdfReader.Close();
    HttpContext.Current.Response.Clear();
    HttpContext.Current.Response.Buffer = true;
    HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
    HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=ReceiptExport.xlsx");
    HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
    HttpContext.Current.Response.Write(text);
    HttpContext.Current.Response.Flush();
    HttpContext.Current.Response.End();
    System.Diagnostics.Process.Start("ReceiptExport.xlsx");
}
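One possible way to get only "particular data" into something Excel can open is to pick the fields out of the extracted text and emit CSV, since writing raw text under an .xlsx name does not produce a real workbook. The sketch below is not tied to iTextSharp: it takes the already extracted text as a string. The class name `ReceiptCsvSketch`, the field labels, and the `Label: value` line format are all assumptions made for illustration; the real labels depend on the PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReceiptCsvSketch
{
    // Hypothetical field labels; replace them with the labels printed in the PDF.
    private static readonly string[] Labels = { "Receipt No", "Date", "Total" };

    // Finds "Label: value" lines in the extracted text and returns label -> value.
    // A label that is not found maps to an empty string.
    public static Dictionary<string, string> PickFields(string extractedText)
    {
        var fields = new Dictionary<string, string>();
        foreach (string label in Labels)
        {
            string pattern = @"^[ \t]*" + Regex.Escape(label) + @"[ \t]*:[ \t]*(.+?)[ \t\r]*$";
            Match match = Regex.Match(extractedText, pattern, RegexOptions.Multiline);
            fields[label] = match.Success ? match.Groups[1].Value : string.Empty;
        }
        return fields;
    }

    // Quotes a value when it contains a comma, a quote, or a line break.
    public static string ToCsvField(string value)
    {
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Builds a two-line CSV document: a header row and one data row.
    public static string ToCsv(Dictionary<string, string> fields)
    {
        var header = new List<string>();
        var row = new List<string>();
        foreach (string label in Labels)
        {
            header.Add(ToCsvField(label));
            row.Add(ToCsvField(fields[label]));
        }
        return string.Join(",", header) + "\r\n" + string.Join(",", row) + "\r\n";
    }

    public static void Main()
    {
        // Stand-in for the text returned by PdfTextExtractor.GetTextFromPage.
        string extracted = "Receipt No: 10310\nDate: 2019-05-01\nTotal: 1,250.00\n";
        Console.Write(ToCsv(PickFields(extracted)));
    }
}
```

In the method above, the CSV string would presumably be written to the response with a `text/csv` content type and a `.csv` file name; producing a genuine .xlsx file would need a spreadsheet library instead.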

Related

iTextSharp extraction cyrillic characters

In my project I need to read a PDF document that contains Ukrainian and Russian characters. PdfReader reads all the other characters in the PDF, but the Cyrillic characters are missing from the output. I tried changing the encoding, but it did not help. What can I do about these characters?
public static string GetText(string filePath)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    StringBuilder text = new StringBuilder();
    if (File.Exists(filePath))
    {
        PdfReader pdfReader = new PdfReader(filePath);
        // Pages are 1-based, so <= is needed to include the last page.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string thePage = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
            text.Append(System.Environment.NewLine);
            thePage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(thePage)));
            text.Append(thePage);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
iTextSharp is an outdated product that is no longer supported, so problems with text extraction are likely. Here is a simple example of how text extraction works in iText 7 (the code is in Java, but the C# version is nearly identical).
String filePath = "test.pdf";
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filePath);
PdfDocument pdfDocument = new PdfDocument(pdfReader);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
    PdfPage page = pdfDocument.getPage(i);
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    String thePage = PdfTextExtractor.getTextFromPage(page, strategy);
    text.append(thePage);
}
pdfReader.close();
System.out.print(text);
The code is about the same as in your example, but the text extracts
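Independently of the library version, the encoding conversion line in the question may itself contribute to the loss. The string returned by the extractor is already a .NET string, so pushing it through a single-byte encoding can only drop characters that encoding cannot represent. The sketch below reproduces the same string to bytes to string round trip without any PDF library. It assumes Latin-1 with a `?` replacement as a stand-in for `Encoding.Default` on a machine whose ANSI code page has no Cyrillic letters; the class and method names are made up for this example.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Assumption: Latin-1 with a '?' replacement stands in for Encoding.Default
    // on a machine whose ANSI code page has no Cyrillic letters.
    public static Encoding SingleByteStandIn()
    {
        return Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
    }

    // Same shape as the conversion in the question:
    // string -> single-byte bytes -> UTF-8 bytes -> string.
    public static string RoundTrip(string extracted, Encoding singleByte)
    {
        byte[] raw = singleByte.GetBytes(extracted);   // unsupported characters become '?'
        byte[] utf8 = Encoding.Convert(singleByte, Encoding.UTF8, raw);
        return Encoding.UTF8.GetString(utf8);
    }

    public static void Main()
    {
        Encoding standIn = SingleByteStandIn();
        Console.WriteLine(RoundTrip("Invoice 42", standIn)); // ASCII text survives
        Console.WriteLine(RoundTrip("Привіт", standIn));     // Cyrillic letters are replaced
    }
}
```

Under these assumptions the ASCII line comes back unchanged and the Cyrillic word comes back as six question marks, which suggests trying the extraction with the conversion line removed before blaming the extractor.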

I am using iText to extract text from a PDF file; I can see the text, but the page structure is lost

I am using iText to extract text from a PDF file. I can see all the text values, but the structure is broken. Could you help me extract the text exactly as it is laid out in the PDF file? I've tried some online tools that do the extraction correctly; what library are they using?
StringBuilder text = new StringBuilder();
if (File.Exists(ofd.FileName))
{
    PdfReader pdfReader = new PdfReader(ofd.FileName);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        //ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        text.Append(currentText);
    }
    pdfReader.Close();
}
rtxtboxInvoice.Text = text.ToString();
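A PDF page stores positioned pieces of text rather than lines and columns, so keeping the visual structure generally means rebuilding it from coordinates. The sketch below illustrates that idea only: it groups chunks with a similar baseline into lines and pads each chunk to a column derived from its x position. `TextChunk`, `LayoutSketch`, the fixed character width, and the line tolerance are all hypothetical and are not iTextSharp types or settings; in practice the positions would have to come from a custom extraction strategy or render listener.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class TextChunk
{
    public string Text { get; }
    public double X { get; }   // left edge, in points
    public double Y { get; }   // baseline, in points (PDF origin is bottom-left)

    public TextChunk(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class LayoutSketch
{
    // Rebuilds a monospaced approximation of the page from positioned chunks.
    // charWidth and lineTolerance are guesses that would need tuning per document.
    public static string Render(IEnumerable<TextChunk> chunks, double charWidth = 6.0, double lineTolerance = 2.0)
    {
        // Group chunks into lines, top of the page first.
        var lines = new List<List<TextChunk>>();
        foreach (TextChunk chunk in chunks.OrderByDescending(c => c.Y))
        {
            List<TextChunk> last = lines.LastOrDefault();
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= lineTolerance)
                last.Add(chunk);
            else
                lines.Add(new List<TextChunk> { chunk });
        }

        var page = new StringBuilder();
        foreach (List<TextChunk> line in lines)
        {
            var row = new StringBuilder();
            foreach (TextChunk chunk in line.OrderBy(c => c.X))
            {
                // Map the x position to a character column.
                int column = (int)Math.Round(chunk.X / charWidth);
                // Keep at least one space between neighbouring chunks.
                if (row.Length > 0 && column <= row.Length) column = row.Length + 1;
                row.Append(' ', Math.Max(0, column - row.Length));
                row.Append(chunk.Text);
            }
            page.AppendLine(row.ToString());
        }
        return page.ToString();
    }

    public static void Main()
    {
        var chunks = new List<TextChunk>
        {
            new TextChunk("1.50", 60, 88),
            new TextChunk("Item", 0, 100),
            new TextChunk("Pen", 0, 88),
            new TextChunk("Price", 60, 100),
        };
        Console.Write(Render(chunks));
    }
}
```

With the sample chunks, the two values at x = 60 line up in the same column under each other even though the chunks were supplied out of order. Real pages with proportional fonts, rotated text, or overlapping chunks would need a more careful mapping than a single fixed character width.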

Itextsharp get text decoration from pdf

I am using the iTextSharp library in my project. How can I get the decoration or style of a line in a PDF (something that distinguishes my text from the rest)?
public static string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        pdfReader.GetNamedDestinationFromStrings();
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}

Itextsharp - GetTextFromPage does not recognize iso-8859 characters

I am using iTextSharp to extract text from PDF documents, but text in some files that are encoded in ISO-8859-1 is not displayed correctly.
Below is my code; if anyone can help me, I will be grateful.
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = null;
    try
    {
        if (File.Exists(fileName))
        {
            pdfReader = new PdfReader(fileName);
            Encoding encoding = Encoding.GetEncoding("iso8859-2");
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new LocationTextExtractionStrategy());
                currentText = encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
    catch
    {
        return string.Empty;
    }
    finally
    {
        if (pdfReader != null) pdfReader.Close();
    }
}

How to programmatically find position of character in PDF and place form field on top of it using iTextSharp

How can I find every SOH character, which looks like a box in the PDF, and place a checkbox form field on top of it? The question "Extract text and text rectangle coordinates from a Pdf file using itextsharp" came close, but I cannot get it to work. Below is some code showing what I am trying to do. Ideally, I would also skip adding a form field where one already exists.
StringBuilder text = new StringBuilder();
if (File.Exists(filePath))
{
    using (PdfReader pdfReader = new PdfReader(filePath))
    using (FileStream fileOut = new FileStream(@"C:\Projects\document.pdf", FileMode.Create, FileAccess.Write))
    using (PdfStamper stamp = new PdfStamper(pdfReader, fileOut))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new PdfHelper.LocationTextExtractionStrategyEx();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
            int count = 0;
            foreach (var strat in ((PdfHelper.LocationTextExtractionStrategyEx)(strategy)).TextLocationInfo)
            {
                RadioCheckField checkbox = new RadioCheckField(stamp.Writer, new iTextSharp.text.Rectangle(strat.TopLeft[0], strat.BottomRight[1], (strat.TopLeft[0] + 5), (strat.BottomRight[1] - 5)), ("CheckBoxInserted" + count), "On");
                checkbox.CheckType = RadioCheckField.TYPE_SQUARE;
                stamp.AddAnnotation(checkbox.CheckField, page);
            }
            RadioCheckField checkField = new RadioCheckField(stamp.Writer, new iTextSharp.text.Rectangle(450, 690, 460, 680), "checkboxname", "On");
            checkField.CheckType = RadioCheckField.TYPE_SQUARE;
            stamp.AddAnnotation(checkField.CheckField, 1);
        }
    }
}
return text.ToString();
