Get text from a specific layer of a PDF - C#

Here is a sample PDF with text on a layer. If I turn off the layer, all the text belonging to that layer becomes invisible as well.
I need to get all the text from one specific layer. Does anybody know how to achieve this?
Here is my sample PDF file: https://drive.google.com/file/d/1TcRyE8MQRhw-j89BbovV7fFIwZ0yks0N/view?usp=sharing
My code can get all the text, but I don't know how to get only the text that belongs to a specific layer.
public void CreateHyperLinkButton(string inPutPDF, string outPutPDF, List<ViewPortInfo> ViewportInfos)
{
using (FileStream pdf = new FileStream(outPutPDF, FileMode.Create))
{
using (PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(inPutPDF))
{
using (PdfStamper pdfStamper = new iTextSharp.text.pdf.PdfStamper(pdfReader, pdf))
{
//Get Text list on 2D PDF
List<TextRenderInfo> listTextInfor = GetAllTextInfor(inPutPDF, pdfReader);
listTextInfor.ForEach(item =>{
string btnName = item.GetText().Trim();
//Check btnName exist in ViewportInfos
for (var i = 0; i < ViewportInfos.Count; i++)
{
string szRes = GetTextContained(ViewportInfos[i].Hyperlinks.Keys.ToList(), btnName);
if (!string.IsNullOrEmpty(szRes))
{
iTextSharp.text.Rectangle box = GetRectOfText(item);
iTextSharp.text.pdf.PushbuttonField btnField = new iTextSharp.text.pdf.PushbuttonField(pdfStamper.Writer, box, szRes);
iTextSharp.text.pdf.PdfAnnotation pushbutton = btnField.Field;
//Add JS function and button in annotation
string js = "mapView('" + szRes + "');";
pushbutton.SetAdditionalActions(iTextSharp.text.pdf.PdfName.U, iTextSharp.text.pdf.PdfAction.JavaScript(js, pdfStamper.Writer));
pdfStamper.AddAnnotation(pushbutton, 1);
}
}
});
pdfStamper.Close();
}
pdfReader.Close();
}
pdf.Close();
}
}
private static List<TextRenderInfo> GetAllTextInfor(string inPutPDF, PdfReader pdfReader)
{
List<TextRenderInfo> listTextInfor = new List<TextRenderInfo>();
TextExtractionStrategy allTextInfo = new TextExtractionStrategy();
for (int i = 1; i <= pdfReader.NumberOfPages; i++)
{
PdfTextExtractor.GetTextFromPage(pdfReader, i, allTextInfo);
}
listTextInfor = allTextInfo.textList;
return listTextInfor;
}
public class TextExtractionStrategy : ITextExtractionStrategy
{
public List<TextRenderInfo> textList = new List<TextRenderInfo>();
public void BeginTextBlock()
{
}
public void EndTextBlock()
{
}
public string GetResultantText()
{
return "";
}
public void RenderImage(ImageRenderInfo renderInfo)
{
var a = renderInfo;
}
public void RenderText(TextRenderInfo renderInfo)
{
textList.Add(renderInfo);
}
}

You could use IronPDF for this purpose. Open the PDF as described in the documentation on their site and examine the loaded document in the debugger; from there you can develop code that retrieves the text from that layer only.

Related

iText7 PdfReader loading IFormFile

I am trying to use the iText7 library but for some reason, I cannot split pages into the list of strings.
Instead, I am getting a list of pages like this:
1,1+2,1+2+3,1+2+3+4
public List<string> PdfPages;
private ITextExtractionStrategy _Strategy;
public PdfExtractor(IFormFile pdf, ITextExtractionStrategy? strategy = default)
{
this._Strategy = strategy ?? new SimpleTextExtractionStrategy();
PdfPages = new List<string>();
ExtractTextFromPages(pdf);
}
private void ExtractTextFromPages(IFormFile pdf)
{
using (var stream = pdf.OpenReadStream())
{
using (var reader = new PdfReader(stream))
{
PdfDocument pdfDoc = new PdfDocument(reader);
for (int index = 1; index < pdfDoc.GetNumberOfPages(); index++)
{
string PdfPageToText = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(index), _Strategy);
PdfPages.Add(PdfPageToText);
}
}
}
}
Does anyone know how to correct that?
The problem, as @mkl mentioned in a comment, was that I did not create a new ITextExtractionStrategy object for each page. Once I did that, everything worked like a charm, without the need to save files anywhere.
using (var stream = pdf.OpenReadStream())
{
    using (var reader = new PdfReader(stream))
    {
        PdfDocument pdfDoc = new PdfDocument(reader);
        // Pages are numbered from 1, so use <= to include the last page
        for (int index = 1; index <= pdfDoc.GetNumberOfPages(); index++)
        {
            // A fresh strategy per page keeps earlier pages out of the result
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string PdfPageToText = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(index), strategy);
            PdfPages.Add(PdfPageToText);
        }
        pdfDoc.Close();
        reader.Close();
    }
}
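The cumulative output in the question (1, 1+2, 1+2+3, ...) is what any stateful object produces when it is reused across pages. The following is a toy sketch, not iText code: AccumulatingStrategy is a hypothetical stand-in that, by assumption, appends every page to one internal buffer, which mirrors the behavior reported above when a single strategy instance is shared.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a stateful extraction strategy (not an iText type):
// every call appends to the same buffer and returns everything seen so far.
public class AccumulatingStrategy
{
    private readonly StringBuilder _buffer = new StringBuilder();

    public string Extract(string pageText)
    {
        _buffer.Append(pageText);
        return _buffer.ToString();
    }
}

public static class StrategyReuseDemo
{
    // One strategy for all pages: each result also contains the earlier pages.
    public static List<string> ExtractWithSharedStrategy(IEnumerable<string> pages)
    {
        var strategy = new AccumulatingStrategy();
        var result = new List<string>();
        foreach (string page in pages)
        {
            result.Add(strategy.Extract(page));
        }
        return result;
    }

    // A new strategy per page: each result contains only that page.
    public static List<string> ExtractWithFreshStrategy(IEnumerable<string> pages)
    {
        var result = new List<string>();
        foreach (string page in pages)
        {
            result.Add(new AccumulatingStrategy().Extract(page));
        }
        return result;
    }
}
```

With the pages "1", "2", "3", the shared variant yields "1", "12", "123", while the per-page variant yields "1", "2", "3".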

SelectPDF.Document.Footer is always null?

Creating a PDF document from the stream of an HTTP request.
public class HomeController : Controller {
public HomeController() {
converter = new HtmlToPdf();
InitializeConverter();
}
public void Index() {
ConvertHtmlToPdf(new Uri("http://localhost:52328/CertificateOfOrigin?noCertificate=2691"));
}
public void ConvertHtmlToPdf(Uri toConvert) {
if(toConvert == null) throw new ArgumentNullException(nameof(toConvert));
using(var stream =new MemoryStream()) {
var doc = converter.ConvertUrl(toConvert.AbsoluteUri);
// The doc.AddTemplate returns a PdfTemplate and should be assigned to doc.Footer
doc.Footer = doc.AddTemplate(doc.Pages[0].ClientRectangle.Width, 100);
var pageNumbering = new PdfTextElement(20, 50, "Page {page_number} of {total_pages}", doc.Fonts[0], Color.Black);
// Once template defined, I add it to the doc Footer. But...
doc.Footer.Add(pageNumbering); // Throws a NullReferenceException?
doc.Footer = template;
doc.Save(stream);
doc.Close();
using(var ms = new MemoryStream(stream.ToArray())) {
Response.AddHeader("content-disposition", "filename=certificate-of-origin.pdf");
Response.ContentType = "application/pdf";
ms.CopyTo(Response.OutputStream);
Response.End();
Response.Close();
}
}
}
private void InitializeConverter() {
converter.Options.MarginBottom = 0;
converter.Options.MarginLeft = 0;
converter.Options.MarginRight = 0;
converter.Options.MarginTop = 0;
converter.Options.PdfPageSize = PdfPageSize.Letter;
}
private readonly HtmlToPdf converter;
}
I put a breakpoint and quick-watched the return value of the doc.AddTemplate call, and it returns an actual PdfTemplate, no problem!
Other than that, everything works fine. The document is generated without problems, except when I uncomment the page numbering, because doc.Footer remains null despite the assignment.
Could it be a bug? I don't know.
You need to either set the header/footer content before the conversion, like here:
https://selectpdf.com/demo-mvc/HtmlToPdfHeadersAndFooters
using System;
using System.Web.Mvc;
namespace SelectPdf.Samples.Controllers
{
public class HtmlToPdfHeadersAndFootersController : Controller
{
// GET: HtmlToPdfHeadersAndFooters
public ActionResult Index()
{
return View();
}
[HttpPost]
public ActionResult SubmitAction(FormCollection collection)
{
// get parameters
string headerUrl = Server.MapPath("~/files/header.html");
string footerUrl = Server.MapPath("~/files/footer.html");
bool showHeaderOnFirstPage = collection["ChkHeaderFirstPage"] == "on";
bool showHeaderOnOddPages = collection["ChkHeaderOddPages"] == "on";
bool showHeaderOnEvenPages = collection["ChkHeaderEvenPages"] == "on";
int headerHeight = 50;
try
{
headerHeight = Convert.ToInt32(collection["TxtHeaderHeight"]);
}
catch { }
bool showFooterOnFirstPage = collection["ChkFooterFirstPage"] == "on";
bool showFooterOnOddPages = collection["ChkFooterOddPages"] == "on";
bool showFooterOnEvenPages = collection["ChkFooterEvenPages"] == "on";
int footerHeight = 50;
try
{
footerHeight = Convert.ToInt32(collection["TxtFooterHeight"]);
}
catch { }
// instantiate a html to pdf converter object
HtmlToPdf converter = new HtmlToPdf();
// header settings
converter.Options.DisplayHeader = showHeaderOnFirstPage ||
showHeaderOnOddPages || showHeaderOnEvenPages;
converter.Header.DisplayOnFirstPage = showHeaderOnFirstPage;
converter.Header.DisplayOnOddPages = showHeaderOnOddPages;
converter.Header.DisplayOnEvenPages = showHeaderOnEvenPages;
converter.Header.Height = headerHeight;
PdfHtmlSection headerHtml = new PdfHtmlSection(headerUrl);
headerHtml.AutoFitHeight = HtmlToPdfPageFitMode.AutoFit;
converter.Header.Add(headerHtml);
// footer settings
converter.Options.DisplayFooter = showFooterOnFirstPage ||
showFooterOnOddPages || showFooterOnEvenPages;
converter.Footer.DisplayOnFirstPage = showFooterOnFirstPage;
converter.Footer.DisplayOnOddPages = showFooterOnOddPages;
converter.Footer.DisplayOnEvenPages = showFooterOnEvenPages;
converter.Footer.Height = footerHeight;
PdfHtmlSection footerHtml = new PdfHtmlSection(footerUrl);
footerHtml.AutoFitHeight = HtmlToPdfPageFitMode.AutoFit;
converter.Footer.Add(footerHtml);
// add page numbering element to the footer
if (collection["ChkPageNumbering"] == "on")
{
// page numbers can be added using a PdfTextSection object
PdfTextSection text = new PdfTextSection(0, 10,
"Page: {page_number} of {total_pages} ",
new System.Drawing.Font("Arial", 8));
text.HorizontalAlign = PdfTextHorizontalAlign.Right;
converter.Footer.Add(text);
}
// create a new pdf document converting an url
PdfDocument doc = converter.ConvertUrl(collection["TxtUrl"]);
// custom header on page 3
if (doc.Pages.Count >= 3)
{
PdfPage page = doc.Pages[2];
PdfTemplate customHeader = doc.AddTemplate(
page.PageSize.Width, headerHeight);
PdfHtmlElement customHtml = new PdfHtmlElement(
"<div><b>This is the custom header that will " +
"appear only on page 3!</b></div>",
string.Empty);
customHeader.Add(customHtml);
page.CustomHeader = customHeader;
}
// save pdf document
byte[] pdf = doc.Save();
// close pdf document
doc.Close();
// return resulted pdf document
FileResult fileResult = new FileContentResult(pdf, "application/pdf");
fileResult.FileDownloadName = "Document.pdf";
return fileResult;
}
}
}
Or use this approach, to add headers/footers to an already generated pdf:
https://selectpdf.com/demo-mvc/ExistingPdfHeadersAndFooters
using System.Web.Mvc;
using System.Drawing;
namespace SelectPdf.Samples.Controllers
{
public class ExistingPdfHeadersAndFootersController : Controller
{
// GET: ExistingPdfHeadersAndFooters
public ActionResult Index()
{
return View();
}
[HttpPost]
public ActionResult SubmitAction(FormCollection collection)
{
// the test file
string filePdf = Server.MapPath("~/files/selectpdf.pdf");
string imgFile = Server.MapPath("~/files/logo.png");
// resize the content
PdfResizeManager resizer = new PdfResizeManager();
resizer.Load(filePdf);
// add extra top and bottom margins
resizer.PageMargins = new PdfMargins(0, 0, 90, 40);
// add the header and footer to the existing (now resized pdf document)
PdfDocument doc = resizer.GetDocument();
// header template (90 points in height) with image element
PdfTemplate header = doc.AddTemplate(doc.Pages[0].ClientRectangle.Width, 90);
PdfImageElement img1 = new PdfImageElement(10, 10, imgFile);
header.Add(img1);
// footer template (40 points in height) with text element
PdfTemplate footer = doc.AddTemplate(new RectangleF(0,
doc.Pages[0].ClientRectangle.Height - 40,
doc.Pages[0].ClientRectangle.Width, 40));
// create a new pdf font
PdfFont font2 = doc.AddFont(PdfStandardFont.Helvetica);
font2.Size = 12;
PdfTextElement text1 = new PdfTextElement(10, 10,
"Generated by SelectPdf. Page number {page_number} of {total_pages}.",
font2);
text1.ForeColor = System.Drawing.Color.Blue;
footer.Add(text1);
// save pdf document
byte[] pdf = doc.Save();
// close pdf document
resizer.Close();
// return resulted pdf document
FileResult fileResult = new FileContentResult(pdf, "application/pdf");
fileResult.FileDownloadName = "Document.pdf";
return fileResult;
}
}
}
The best approach is the first, so try to move your footer setting before the conversion.

C# Adding an array or list into a List

I've got a List of Document
public class Document
{
    public string[] fullFilePath;
    public bool isPatch;
    public string destPath;
    public Document() { }
    public Document(string[] fullFilePath, bool isPatch, string destPath)
    {
        this.fullFilePath = fullFilePath;
        this.isPatch = isPatch;
        this.destPath = destPath;
    }
}
The fullFilePath should be a List or an array of paths.
For example:
Document 1
---> C:\1.pdf
---> C:\2.pdf
Document 2
---> C:\1.pdf
---> C:\2.pdf
---> C:\3.pdf
etc.
My problem: if I use a string array, all Documents get "null" in their fullFilePath.
If I use a List for fullFilePath, all Documents get the same entries as the last Document.
Here is how the List is filled:
int docCount = -1;
int i = 0;
List<Document> Documents = new List<Document>();
string[] sourceFiles = new string[1];
foreach (string file in filesCollected)
{
string bc;
string bcValue;
if (Settings.Default.barcodeEngine == "Leadtools")
{
bc = BarcodeReader.ReadBarcodeSymbology(file);
bcValue = "PatchCode";
}
else
{
bc = BarcodeReader.ReadBacrodes(file);
bcValue = "009";
}
if (bc == bcValue)
{
if(Documents.Count > 0)
{
Array.Clear(sourceFiles, 0, sourceFiles.Length);
Array.Resize<string>(ref sourceFiles, 1);
i = 0;
}
sourceFiles[i] = file ;
i++;
Array.Resize<string>(ref sourceFiles, i + 1);
Documents.Add(new Document(sourceFiles, true,""));
docCount++;
}
else
{
if (Documents.Count > 0)
{
sourceFiles[i] = file;
i++;
Array.Resize<string>(ref sourceFiles, i + 1);
Documents[docCount].fullFilePath = sourceFiles;
}
}
}
You are using the same array instance for every document. The instance is updated with a new list of files on each pass through the loop, but an array variable is a reference to an area of memory (an oversimplification, I know, but enough for the purpose of this answer), and if you change the content of that area of memory you change it for every document that references it.
You need to create a new instance for the source files of every new document you add to your documents list. Moreover, when you are not certain how many elements the array will hold, it is much better to use a generic List and remove all the code that handles resizing the array.
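The shared-reference behavior described above can be reproduced in isolation. This is a minimal sketch: Holder is a hypothetical type standing in for the Document class, and the file names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class AliasingDemo
{
    // Hypothetical holder type, standing in for the Document class in the question.
    public class Holder
    {
        public string[] Files;
        public Holder(string[] files) { Files = files; }
    }

    // Reuses one array: both holders end up pointing at the same storage.
    public static List<Holder> BuildShared()
    {
        var shared = new string[1];
        var holders = new List<Holder>();

        shared[0] = @"C:\1.pdf";
        holders.Add(new Holder(shared));

        // Clearing and refilling the array also changes what the first holder sees.
        Array.Clear(shared, 0, shared.Length);
        shared[0] = @"C:\2.pdf";
        holders.Add(new Holder(shared));

        return holders;
    }

    // Allocates a new array per holder: each one keeps its own contents.
    public static List<Holder> BuildSeparate()
    {
        var holders = new List<Holder>();
        holders.Add(new Holder(new[] { @"C:\1.pdf" }));
        holders.Add(new Holder(new[] { @"C:\2.pdf" }));
        return holders;
    }
}
```

In BuildShared both holders report C:\2.pdf, because they reference one array; in BuildSeparate the first holder still reports C:\1.pdf.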
First change the class definition
public class Document
{
    public List<string> fullFilePath;
    public bool isPatch;
    public string destPath;
    public Document() { }
    public Document(List<string> fullFilePath, bool isPatch, string destPath)
    {
        this.fullFilePath = fullFilePath;
        this.isPatch = isPatch;
        this.destPath = destPath;
    }
}
And now change your inner loop to
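foreach (string file in filesCollected)
{
    string bc;
    string bcValue;
    ....
    if (bc == bcValue)
    {
        // A new list per document, so documents never share storage
        List<string> files = new List<string>();
        files.Add(file);
        Documents.Add(new Document(files, true, ""));
        docCount++;
    }
    else if (Documents.Count > 0)
        // Same guard as the original loop: skip files seen before the first patch page
        Documents[docCount].fullFilePath.Add(file);
}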
Notice that when you need to add a new Document, I build a new List<string>, add the current file, and pass everything to the constructor (really, this should be moved inside the constructor of the Document class). When you just want to add a new file, you can add it directly to the public fullFilePath field.
With the file handling moved inside the Document class, it could be rewritten as
public class Document
{
    public List<string> fullFilePath;
    public bool isPatch;
    public string destPath;
    public Document()
    {
        // Every constructor initializes the List internally
        fullFilePath = new List<string>();
    }
    public Document(string aFile, bool isPatch, string destPath)
    {
        // Every constructor initializes the List internally
        fullFilePath = new List<string>();
        this.fullFilePath.Add(aFile);
        this.isPatch = isPatch;
        this.destPath = destPath;
    }
    public void AddFile(string aFile)
    {
        this.fullFilePath.Add(aFile);
    }
}
Of course, in your calling code you now pass only the new file or call AddFile, without needing to check whether the list has been initialized.
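The calling code could then look roughly like the sketch below. It repeats the Document class so that it compiles on its own; DocumentGrouping, Group, and the isSeparator callback are hypothetical names, with the callback standing in for the barcode check from the question.

```csharp
using System;
using System.Collections.Generic;

public class Document
{
    public List<string> fullFilePath;
    public bool isPatch;
    public string destPath;

    public Document(string aFile, bool isPatch, string destPath)
    {
        fullFilePath = new List<string>();
        fullFilePath.Add(aFile);
        this.isPatch = isPatch;
        this.destPath = destPath;
    }

    public void AddFile(string aFile)
    {
        fullFilePath.Add(aFile);
    }
}

public static class DocumentGrouping
{
    // isSeparator is an assumed callback that replaces the barcode comparison (bc == bcValue).
    public static List<Document> Group(IEnumerable<string> filesCollected, Func<string, bool> isSeparator)
    {
        var documents = new List<Document>();
        foreach (string file in filesCollected)
        {
            if (isSeparator(file))
            {
                // A separator page starts a new document with its own list.
                documents.Add(new Document(file, true, ""));
            }
            else if (documents.Count > 0)
            {
                // Any other page is appended to the most recent document.
                documents[documents.Count - 1].AddFile(file);
            }
        }
        return documents;
    }
}
```

Indexing the last element with documents.Count - 1 makes the separate docCount counter unnecessary.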
The issue is here:
string[] sourceFiles = new string[1];
This array is created once, outside the foreach, so every Document ends up referencing storage that a later iteration clears or overwrites. Moving the line to the top of the foreach would not help either, because the non-patch branch would then lose the files already collected. Instead, allocate a new array whenever a new Document starts, rather than clearing and reusing the old one:
int docCount = -1;
int i = 0;
List<Document> Documents = new List<Document>();
string[] sourceFiles = null;
foreach (string file in filesCollected)
{
    string bc;
    string bcValue;
    if (Settings.Default.barcodeEngine == "Leadtools")
    {
        bc = BarcodeReader.ReadBarcodeSymbology(file);
        bcValue = "PatchCode";
    }
    else
    {
        bc = BarcodeReader.ReadBacrodes(file);
        bcValue = "009";
    }
    if (bc == bcValue)
    {
        // Start a fresh array for each new document so earlier documents keep their own files
        sourceFiles = new string[1];
        sourceFiles[0] = file;
        i = 1;
        Documents.Add(new Document(sourceFiles, true, ""));
        docCount++;
    }
    else
    {
        if (Documents.Count > 0)
        {
            // Array.Resize allocates a new array, so store the new reference on the document
            Array.Resize<string>(ref sourceFiles, i + 1);
            sourceFiles[i] = file;
            i++;
            Documents[docCount].fullFilePath = sourceFiles;
        }
    }
}

Using word interop create multiple documents from a template for print preview

I have an app where the user selects from a list all the students they want to print an award document for.
I have a template .doc file that contains 3 text boxes that I populate from code. I can populate the file and show it in the print preview for a single student, but based on how many students are selected I want to create a large document with many pages I can see in print preview before printing and print all at once.
The following is my attempt to adapt my working single-document code so that it builds one combined document from the template and shows it in print preview. Any ideas?
public void AddStudentToDocument(IEnumerable<StudentToPrint> studentsToPrint )
{
_Application oWordApp = new Application();
string folder = Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData);
string specificFolder = Path.Combine(folder, "FoothillsAcademy");
string fileLocation = Path.Combine(specificFolder, "CertTemplate.doc");
if (File.Exists(fileLocation))
{
var oWordDoc = oWordApp.Documents.Open(fileLocation);
oWordDoc.Activate();
oWordApp.Selection.TypeParagraph();
foreach (var studentToPrint in studentsToPrint)
{
_Document oDoc = oWordApp.Documents.Add();
Selection oSelection = oWordApp.Selection;
string docText = oWordDoc.Content.Text;
if (docText != null)
{
int boxNumber = 1;
foreach (Shape shape in oWordApp.ActiveDocument.Shapes)
{
if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoTextBox)
{
if (boxNumber == 1)
{
shape.TextFrame.TextRange.Text = studentToPrint.StudentName;
}
if (boxNumber == 2)
{
shape.TextFrame.TextRange.Text = studentToPrint.Rank;
}
if (boxNumber == 3)
{
shape.TextFrame.TextRange.Text = studentToPrint.DateAcheved;
}
boxNumber++;
}
}
_Document oCurrentDocument = oWordApp.Documents.Add(oWordDoc);
copyPageSetup(oCurrentDocument.PageSetup, oDoc.Sections.Last.PageSetup);
oCurrentDocument.Range().Copy();
oSelection.PasteAndFormat(WdRecoveryType.wdFormatOriginalFormatting);
//if (!Object.ReferenceEquals(oWordDoc.Content, oWordDoc.Last()))
oSelection.InsertBreak(WdBreakType.wdSectionBreakNextPage);
}
oWordApp.Visible = true;
oWordApp.ShowStartupDialog = true;
oWordApp.ActiveDocument.PrintPreview();
}
}
}
private void copyPageSetup(PageSetup source, PageSetup target)
{
target.PaperSize = source.PaperSize;
if (!source.Orientation.Equals(target.Orientation))
target.TogglePortrait();
target.TopMargin = source.TopMargin;
target.BottomMargin = source.BottomMargin;
target.RightMargin = source.RightMargin;
target.LeftMargin = source.LeftMargin;
target.FooterDistance = source.FooterDistance;
target.HeaderDistance = source.HeaderDistance;
target.LayoutMode = source.LayoutMode;
}
public class StudentToPrint
{
public string StudentName { get; set; }
public string Rank { get; set; }
public string DateAcheved { get; set; }
}
Currently I am testing this with a collection of StudentsToPrint added below. Based on the data below I would expect to see 3 certificates personalized for each of the 3 students. Each certificate would be on its own page.
List<StudentToPrint> listOfStudents = new List<StudentToPrint>
{
new StudentToPrint
{
DateAcheved = DateTime.Now.ToShortDateString(),
Rank = "5th Degree",
StudentName = "Scott LaFoy"
},
new StudentToPrint
{
DateAcheved = DateTime.Now.ToShortDateString(),
Rank = "3rd Degree",
StudentName = "John Doe"
},
new StudentToPrint
{
DateAcheved = DateTime.Now.ToShortDateString(),
Rank = "2nd Degree",
StudentName = "Jane Doe"
}
};
The template is a Word document that has 3 text boxes. Using text boxes lets me set the font and position differently for each of them, as well as make the background transparent so the template background shows through. I am sure there is another way to do this as well, but there is not much on this topic around.

Read File and display contents

I want that when I click the button "List all Customers", the code should read the Customer.csv file and display the information on the form called "List All Customers".
How can I do that?
public static void ReadFile()
{
StreamReader sr = File.OpenText("Customer.csv");
}
public static void LoadCustomers()
{
try
{
if (File.Exists("Customer.csv"))
{
string temp = null;
int count = 0;
using (StreamReader sr = File.OpenText(@"Customer.csv"))
{
while ((temp = sr.ReadLine()) != null)
{
temp = temp.Trim();
string[] lineHolder = temp.Split(',');
Customer tempCust = new Customer();
tempCust.customerName = lineHolder[0];
tempCust.customerAddress = lineHolder[1];
tempCust.customerZip = Convert.ToInt32(lineHolder[2]);
myCustArray[count] = tempCust;
count++;
}//end while loop
}
}
else
{
File.Create("Customer.csv");
}
}
catch (Exception e)
{
System.Windows.Forms.MessageBox.Show("File Loading Error: " + e.Message);
}
}
I'm not sure what kind of control you want to display this data in, but your method could just return a list of Customer objects, which you can then add to a ListBox, ListView, or DataGrid.
public static IEnumerable<Customer> LoadCustomers(string filename)
{
if (File.Exists(filename))
{
foreach (var line in File.ReadAllLines(filename).Where(l => l.Contains(',')))
{
var splitLine = line.Split(',');
if (splitLine.Count() >= 3)
{
yield return new Customer
{
customerName = splitLine[0].Trim(),
customerAddress = splitLine[1].Trim(),
customerZip = Convert.ToInt32(splitLine[2].Trim())
};
}
}
}
}
ListBox
listBox1.DisplayMember = "customerName";
listBox1.Items.AddRange(LoadCustomers(@"G:\Customers.csv").ToArray());
First, take advantage of the List object and return it from the method:
public static void ReadFile()
{
    StreamReader sr = File.OpenText("Customer.csv");
}
public static List<Customer> LoadCustomers()
{
    var retList = new List<Customer>();
    try
    {
        if (File.Exists("Customer.csv"))
        {
            string temp = null;
            using (StreamReader sr = File.OpenText(@"Customer.csv"))
            {
                while ((temp = sr.ReadLine()) != null)
                {
                    temp = temp.Trim();
                    string[] lineHolder = temp.Split(',');
                    retList.Add(new Customer()
                    {
                        customerName = lineHolder[0],
                        customerAddress = lineHolder[1],
                        customerZip = Convert.ToInt32(lineHolder[2])
                    });
                }//end while loop
            }
        }
        else
        {
            // Dispose the returned FileStream so the new file is not left locked
            File.Create("Customer.csv").Dispose();
        }
    }
    catch (Exception e)
    {
        System.Windows.Forms.MessageBox.Show("File Loading Error: " + e.Message);
    }
    return retList;
}
Just wrap it up in a class, call it from the controller, and populate the results. Depending on how often you will be updating this data, you might look into caching it, so you don't have to run this process every X seconds for each user.
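One possible way to cache the file, sketched under assumptions: CachedCsvLines, Get, and LoadCount are hypothetical names, the cache holds raw lines rather than Customer objects to keep the example self-contained, and the file is re-read only when its last-write time changes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CachedCsvLines
{
    private static readonly object Gate = new object();
    private static string _path;
    private static DateTime _stamp;
    private static IReadOnlyList<string> _lines;

    // Counts real disk reads, only so the caching effect can be observed.
    public static int LoadCount { get; private set; }

    public static IReadOnlyList<string> Get(string path)
    {
        lock (Gate)
        {
            DateTime stamp = File.GetLastWriteTimeUtc(path);

            // Reload only on first use, for a different file, or after the file was modified.
            if (_lines == null || _path != path || _stamp != stamp)
            {
                _lines = File.ReadAllLines(path);
                _path = path;
                _stamp = stamp;
                LoadCount++;
            }
            return _lines;
        }
    }
}
```

Repeated calls for an unchanged file return the same in-memory list, so the parsing step only needs to run when the data actually changes.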
