I am trying to convert a PDF to a text file using iText, but the Euro currency symbol is missing from the final result.
public void TextExtraction()
{
    StringBuilder allTextBuilder = new StringBuilder();
    using (PdfReader pdfReader = new PdfReader(SourceFileName))
    using (PdfDocument pdfDocument = new PdfDocument(pdfReader))
    {
        for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(page), strategy);
            // Append, not AppendFormat: the page text is data, not a format string,
            // and AppendFormat throws a FormatException on stray '{' or '}' characters.
            allTextBuilder.Append(currentPageText);
        }
    }
    File.WriteAllText(DestinationFileName, allTextBuilder.ToString(), Encoding.Unicode);
}
I wonder if someone has any solution for me?
Related
In my project I need to read a PDF document that contains Ukrainian and Russian characters. PdfReader reads the text of this PDF, but the Cyrillic characters are missing from the output. I tried changing the encoding, but it did not help. What can I do about these characters?
public static string GetText(string filePath)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    StringBuilder text = new StringBuilder();
    if (File.Exists(filePath))
    {
        PdfReader pdfReader = new PdfReader(filePath);
        // Pages are numbered from 1, so the bound must be <= or the last page is skipped.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string thePage = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
            text.Append(System.Environment.NewLine);
            thePage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(thePage)));
            text.Append(thePage);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
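The conversion line in the snippet above pushes the extracted text through the machine's default code page before it is appended. Here is a minimal sketch of why that step can lose Cyrillic text, written in Java like the iText 7 sample in this thread. It assumes ISO-8859-1 as a stand-in for a Western default code page (the asker's real code page is not stated), and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyRoundTrip {
    // Encode text into a single-byte code page and decode it again.
    // This mirrors the effect of Encoding.Default.GetBytes(...) followed by a decode.
    static String roundTrip(String text, Charset codePage) {
        byte[] bytes = text.getBytes(codePage); // unmappable characters are replaced with '?'
        return new String(bytes, codePage);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 stands in for the machine's default code page.
        Charset western = StandardCharsets.ISO_8859_1;
        // Cyrillic "Privet", written with Unicode escapes to keep the source ASCII-only.
        String cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442";
        String euro = "\u20ac"; // the Euro sign is also absent from ISO-8859-1

        System.out.println(roundTrip("price", western));  // prints "price"
        System.out.println(roundTrip(cyrillic, western)); // prints "??????"
        System.out.println(roundTrip(euro, western));     // prints "?"
    }
}
```

Plain ASCII text survives the round trip, which would explain why such code appears to work for English PDFs only. Once the characters have been replaced, no later conversion can bring them back.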
iTextSharp is an outdated product that is no longer supported, so there may well be problems with its text extraction. Here is a simple example of how text extraction works in iText 7 (the code is in Java, but the C# API is essentially the same).
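import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.ITextExtractionStrategy;
import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

String filePath = "test.pdf";
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filePath);
PdfDocument pdfDocument = new PdfDocument(pdfReader);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
    PdfPage page = pdfDocument.getPage(i);
    // Use a fresh strategy per page so text from earlier pages is not carried over.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    String thePage = PdfTextExtractor.getTextFromPage(page, strategy);
    text.append(thePage);
}
// Closing the document also closes the underlying reader.
pdfDocument.close();
System.out.print(text);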
The code is about the same as in your example, but the text extracts
This code only converts PDFs written in English into proper text. I also want to convert PDFs in other languages. How can I solve this problem?
Below is my code:
using System;
using System.Net;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PDFReader(string url)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader;
    try
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
        pdfReader = new PdfReader(url);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentText.Contains("Page " + page.ToString()))
            {
                currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
            }
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append("\n----------------------------------------------------------------------\n");
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    catch (Exception ex)
    {
    }
    return text.Replace("‘", "‘").Replace("’", "’").Replace("–", "–").ToString();
}
.NET strings are Unicode, specifically UTF-16. They don't need any kind of conversion.
The problems are caused by the attempt to convert the Unicode text to the local machine's code page and then read it back as if it were UTF-8 (which it isn't; the bytes are in the local machine's code page). That is also what produces the garbled â€™-style strings: the multi-byte UTF-8 sequences are decoded as single-byte characters of the local code page (most likely a Western European one).
This code extracts the text without any conversion issues:
static string GetPdfText(string url)
{
    var separator = "\n----------------------------------------------------------------------\n";
    var text = new StringBuilder();
    using (var pdfReader = new PdfReader(url))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Create a new strategy for each page: SimpleTextExtractionStrategy
            // accumulates text, so reusing one instance repeats earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(separator);
            text.Append(currentText);
        }
    }
    return text.ToString();
}
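The garbling described in this answer can be reproduced without any PDF library. The following is a rough sketch in Java (the language of the iText 7 sample in this thread), not the asker's actual code path. It assumes the machine's default code page is Windows-1252 and that the JDK ships that charset, which mainstream JDKs do. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then decode those bytes with the wrong single-byte charset.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: Windows-1252 stands in for Encoding.Default on a Western machine.
        Charset cp1252 = Charset.forName("windows-1252");
        String quote = "\u2019"; // right single quotation mark, bytes E2 80 99 in UTF-8

        String garbled = misdecode(quote, cp1252);
        // One character has turned into three: U+00E2, U+20AC, U+2122.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Each byte of the UTF-8 sequence is read as a separate character, which is why a single typographic quote shows up as three unrelated symbols. Skipping the conversion entirely, as the code above does, avoids the problem instead of patching it afterwards with string replacements.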
Please try this: use the WhatsMate PDF-to-Text REST API.
I am using iTextSharp version 5.5.3.0 and I am trying to extract text from a PDF in C#. The PDF is a form, not an image. This is the code:
    var text = new StringBuilder();
    // The PdfReader object implements IDisposable.Dispose, so you can
    // wrap it in the using keyword to automatically dispose of it
    using (var pdfReader = new PdfReader(inFileName))
    {
        // Loop through each page of the document
        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
    }
    return text.ToString();
}
The returned text is unusable. The PDF was generated with Ghostscript.
Does anyone have a suggestion as to what the problem could be?
I am using the iTextSharp library in my project. How can I get a PDF line's decoration or style (something that distinguishes my text from the rest)?
public static string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        pdfReader.GetNamedDestinationFromStrings();
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
How can I find every SOH character, which looks like a box in the PDF, and place a checkbox form field on top of it? The question "Extract text and text rectangle coordinates from a Pdf file using itextsharp" came close, but I cannot get it to work. Below is some code showing what I am trying to do. Ideally, I would also skip adding a form field where one already exists.
StringBuilder text = new StringBuilder();
if (File.Exists(filePath))
{
    using (PdfReader pdfReader = new PdfReader(filePath))
    using (FileStream fileOut = new FileStream(@"C:\Projects\document.pdf", FileMode.Create, FileAccess.Write))
    using (PdfStamper stamp = new PdfStamper(pdfReader, fileOut))
    {
        int count = 0; // running index so every inserted field gets a unique name
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new PdfHelper.LocationTextExtractionStrategyEx();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
            foreach (var strat in ((PdfHelper.LocationTextExtractionStrategyEx)(strategy)).TextLocationInfo)
            {
                RadioCheckField checkbox = new RadioCheckField(stamp.Writer, new iTextSharp.text.Rectangle(strat.TopLeft[0], strat.BottomRight[1], (strat.TopLeft[0] + 5), (strat.BottomRight[1] - 5)), ("CheckBoxInserted" + count), "On");
                checkbox.CheckType = RadioCheckField.TYPE_SQUARE;
                stamp.AddAnnotation(checkbox.CheckField, page);
                count++;
            }
            RadioCheckField checkField = new RadioCheckField(stamp.Writer, new iTextSharp.text.Rectangle(450, 690, 460, 680), "checkboxname", "On");
            checkField.CheckType = RadioCheckField.TYPE_SQUARE;
            stamp.AddAnnotation(checkField.CheckField, 1);
        }
    }
}
return text.ToString();