I have a PDF document that contains a list of comments of two types:
1. Rectangle
2. Text Box
I want to get the values from the Text Boxes with C# and iTextSharp.
The text boxes and rectangles you're referring to are called Annotations. Annotations are defined as dictionaries and they are listed per page.
In other words: you need to create a PdfReader instance and get the ANNOTS from each page:
PdfReader reader = new PdfReader("your.pdf");
for (int i = 1; i <= reader.NumberOfPages; i++) {
    PdfArray array = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
    if (array == null) continue;
    for (int j = 0; j < array.Size; j++) {
        PdfDictionary annot = array.GetAsDict(j);
        PdfString text = annot.GetAsString(PdfName.CONTENTS);
        ...
    }
}
In the above code sample, I have a PdfDictionary named annot, from which I can extract the Contents. You may be interested in some other entries too (for instance the name of the annotation, if any). Please inspect all the keys that are available in the annot object in case the Contents entry isn't what you're looking for.
Replace the dots with whatever you want to do with the text. PdfString has several methods that reveal its contents (for instance ToUnicodeString()).
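To make the loop above concrete, here is a minimal sketch that prints only the text box comments. It assumes the text boxes are stored with the /FreeText subtype (rectangles would typically be /Square); that is an assumption about your file, so inspect the Subtype entry of your own annotations first. The class name TextBoxReader is made up for the example.

```csharp
using System;
using iTextSharp.text.pdf;

class TextBoxReader
{
    static void Main()
    {
        PdfReader reader = new PdfReader("your.pdf");
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfArray array = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
                if (array == null) continue; // page has no annotations

                for (int j = 0; j < array.Size; j++)
                {
                    PdfDictionary annot = array.GetAsDict(j);
                    if (annot == null) continue;

                    // Assumption: text boxes use the /FreeText subtype.
                    PdfName subtype = annot.GetAsName(PdfName.SUBTYPE);
                    if (subtype == null || !subtype.Equals(new PdfName("FreeText"))) continue;

                    PdfString text = annot.GetAsString(PdfName.CONTENTS);
                    if (text != null)
                        Console.WriteLine("Page {0}: {1}", i, text.ToUnicodeString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If nothing is printed, dump the keys of each annot dictionary to see which subtype and entries your document really uses.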
DISCLAIMER: I'm the original developer of iText (I always assume that people already know this, but I was once downvoted because I didn't add this disclaimer).
Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:
Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.
Consider this common example of a page header:
Billing Info Date: 02/02/20222
Company Ltd. Order Number: 0123456789
123 Main Street Name: Smith, John
Let's say, I want to get the order number (0123456789) from the document, knowing its precise position on the page. But in practice, often enough the entire line would be one single text element, with the content SO CompanyOrder Number:0123456789, and all positioning and spacing done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so I can combine them into "words" (= character sequences, separated by whitespace or large offsets).
I know this is definitely possible in other libraries. But this question is specific to GemBox. It seems to me that all the necessary implementations should already be there, just not much of it is exposed in the API.
In iTextSharp I can get the bounds for each single glyph, like this:
// iTextSharp 5.2.1.0
public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );
        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}
var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();
Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".
Currently, I can somewhat work around this using regex, but this is not always possible and also way too technical for end users to configure.
Try using this latest NuGet package, in which we added the PdfTextContent.GetGlyphOffsets method:
Install-Package GemBox.Pdf -Version 17.0.1128-hotfix
Here is how you can use it:
using (var document = PdfDocument.Load("input.pdf"))
{
var page = document.Pages[0];
var enumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
while (enumerator.MoveNext())
{
if (enumerator.Current.ElementType != PdfContentElementType.Text)
continue;
var textElement = (PdfTextContent)enumerator.Current;
var text = textElement.ToString();
int index = text.IndexOf("Number:");
if (index < 0)
continue;
index += "Number:".Length;
for (int i = index; i < text.Length; i++)
{
if (text[i] == ' ')
index++;
else
break;
}
var bounds = textElement.Bounds;
enumerator.Transform.Transform(ref bounds);
string orderNumber = text.Substring(index);
double position = bounds.Left + textElement.GetGlyphOffsets().Skip(index - 1).First();
// TODO ...
}
}
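Once you have one x position per character, combining glyphs into "words" is independent of the PDF library. The following is only a rough sketch of that step, not GemBox API: GlyphGrouping, Word and gapThreshold are hypothetical names, and it assumes glyphStarts[i] is the x position where text[i] starts (how you derive that from bounds.Left and the glyph offsets is up to you).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class Word
{
    public string Text; // characters of the word
    public double Left; // x position of the first glyph in the word
}

public static class GlyphGrouping
{
    // Splits text into words at whitespace, or where the distance between
    // two consecutive glyph start positions exceeds gapThreshold.
    // Assumption: glyphStarts has one entry per character of text.
    public static List<Word> GroupIntoWords(string text, IList<double> glyphStarts, double gapThreshold)
    {
        var words = new List<Word>();
        var current = new StringBuilder();
        double wordLeft = 0;
        double previousStart = 0;

        for (int i = 0; i < text.Length; i++)
        {
            bool isSpace = char.IsWhiteSpace(text[i]);
            bool largeGap = current.Length > 0 && glyphStarts[i] - previousStart > gapThreshold;

            // Close the current word on whitespace or on a large jump.
            if ((isSpace || largeGap) && current.Length > 0)
            {
                words.Add(new Word { Text = current.ToString(), Left = wordLeft });
                current.Clear();
            }

            if (!isSpace)
            {
                if (current.Length == 0) wordLeft = glyphStarts[i];
                current.Append(text[i]);
            }
            previousStart = glyphStarts[i];
        }

        if (current.Length > 0)
            words.Add(new Word { Text = current.ToString(), Left = wordLeft });
        return words;
    }
}
```

Note that the threshold is compared against start-to-start distances, so it has to be larger than the widest glyph you expect; a threshold based on the font's space width would be a refinement.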
Not sure if this is the best way to do this, but I've created a 2D array from values in an Excel file that I want to use as variables within my application. They control where to look for things like file paths and files.
The array is built and all data from the file is contained within it:
for (int i = 0; i <= bindingSourceConfig.Count - 1; i++) // for each row in the binding source data
{
for (int j = 0; j <= 2 - 1; j++) // for each column, only need one and 2
{
System_Var_Array[i, j] = (bindingSourceConfig.DataSource as DataTable).Rows[i][j].ToString();
}
}
Now I want to be able to look in the array for my variable, say "Project_Directory", and have it return "C:\Users\User\Dropbox\default\master\support".
Is this even possible?
EDIT 1:
The purpose of doing it this way is to make an easily customizable/configurable multi-project environment whereby anyone can simply edit the paths in the Excel file and import that file into the application.
I cannot, in my incompetence, see a way of setting these 'variables' at a class level without first extracting the 'variable' and the 'value' from the Excel data.
Is there a simple way of looking up the 'variable' in bindingSource?
var name = "Project_Directory";
string value = null;
for (int i = 0; i < System_Var_Array.GetLength(0); i++) {
    if (System_Var_Array[i, 0] == name) {
        value = System_Var_Array[i, 1];
        break;
    }
}
After the loop finishes, value will contain the data you need, assuming it exists (otherwise it will be null). A far simpler approach is to use a Dictionary<string,string>. If you generate it with
var systemVars = new Dictionary<string,string>();
var dt = bindingSourceConfig.DataSource as DataTable;
for (int i = 0; i < bindingSourceConfig.Count; i++) // for each row in the binding source data
{
systemVars[dt.Rows[i][0].ToString()] = dt.Rows[i][1].ToString();
}
then to get the value for "Project_Directory", simply call
var value = systemVars["Project_Directory"];
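One caveat with the indexer: it throws a KeyNotFoundException when the name is missing, whereas the array loop leaves value as null. If you want a similar "not found" behaviour, TryGetValue is the usual tool. A small sketch (SystemVarLookup and GetOrDefault are names made up for the example):

```csharp
using System;
using System.Collections.Generic;

public static class SystemVarLookup
{
    // Returns the stored value for name, or fallback when the key is absent.
    public static string GetOrDefault(IDictionary<string, string> vars, string name, string fallback)
    {
        string value;
        return vars.TryGetValue(name, out value) ? value : fallback;
    }
}

// Usage, with the dictionary built above:
//   string dir = SystemVarLookup.GetOrDefault(systemVars, "Project_Directory", null);
//   if (dir == null) { /* the Excel file did not define Project_Directory */ }
```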
You should use an appsettings file for this instead; it's the standard .NET solution for this sort of thing. Please see https://learn.microsoft.com/en-us/dotnet/core/extensions/configuration.
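For illustration, reading such a setting could look roughly like this. It is a sketch that assumes the Microsoft.Extensions.Configuration.Json NuGet package is referenced and that an appsettings.json file containing a "Project_Directory" entry is copied next to the executable; the key name is simply taken from the question.

```csharp
using System;
using Microsoft.Extensions.Configuration;

class Program
{
    static void Main()
    {
        // Assumed appsettings.json content:
        // { "Project_Directory": "C:\\Users\\User\\Dropbox\\default\\master\\support" }
        IConfiguration config = new ConfigurationBuilder()
            .AddJsonFile("appsettings.json", optional: false)
            .Build();

        // The indexer returns null when the key is not present.
        string projectDirectory = config["Project_Directory"];
        Console.WriteLine(projectDirectory ?? "(not set)");
    }
}
```

Users can then edit the JSON file per project instead of importing an Excel sheet, though whether that is easier for your users is a judgment call.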
I'm working with .NET Framework 4.7.2 and Windows Forms.
I have a DataGridView and I have managed to generate a PowerPoint (.pptx) file.
I made a first slide and I'd like to add the DataGridView content to the second slide. The data must remain editable within the PowerPoint slide.
Microsoft.Office.Interop.PowerPoint.Application pptApp = new Microsoft.Office.Interop.PowerPoint.Application();
pptApp.Visible = Microsoft.Office.Core.MsoTriState.msoTrue;
Microsoft.Office.Interop.PowerPoint.Slides slides;
Microsoft.Office.Interop.PowerPoint._Slide slide;
Microsoft.Office.Interop.PowerPoint._Slide slide2;
Microsoft.Office.Interop.PowerPoint.TextRange objText;
// Create File
Presentation pptPresentation = pptApp.Presentations.Add(Microsoft.Office.Core.MsoTriState.msoTrue);
CustomLayout customLayout = pptPresentation.SlideMaster.CustomLayouts[PpSlideLayout.ppLayoutText];
// new Slide
slides = pptPresentation.Slides;
slide = slides.AddSlide(1, customLayout);
slide2 = slides.AddSlide(2, customLayout);
// title
objText = slide.Shapes[1].TextFrame.TextRange;
objText.Text = "Bonds Screener Report";
objText.Font.Name = "Haboro Contrast Ext Light";
objText.Font.Size = 32;
Shape shape1 = slide.Shapes[2];
slide.Shapes.AddPicture("C:\\mylogo.png", Microsoft.Office.Core.MsoTriState.msoFalse, Microsoft.Office.Core.MsoTriState.msoTrue, shape1.Left, shape1.Top, shape1.Width, shape1.Height);
slide.NotesPage.Shapes[2].TextFrame.TextRange.Text = "Disclaimer";
dataGridViewBonds.ClipboardCopyMode = DataGridViewClipboardCopyMode.EnableAlwaysIncludeHeaderText;
dataGridViewBonds.SelectAll();
DataObject obj = dataGridViewBonds.GetClipboardContent();
Clipboard.SetDataObject(obj, true);
Shape shapegrid = slide2.Shapes[2];
I know I'm not far off, but I'm missing something. Any help would be appreciated!
I am familiar with Excel interop and have used it many times and most likely have become numb to the awkward ways in which interop works. Using PowerPoint interop can be very frustrating for numerous reasons, however, the biggest I feel is the lack of documentation and the differences between the different MS versions.
In addition, I looked for a third-party PowerPoint library and “Aspose” looked like the only option, unfortunately it is not a “free” option. I will assume there is a free third-party option and I just did not look in the right place… Or there may be a totally different way to do this possibly with XML. I am confident I am preaching to the choir.
Therefore, what I have been able to put together may work for you. For starters, looking at your current posted code, there is one part missing that you need to get the “copied” grid cells into the slide…
slide.Shapes.Paste();
This will paste the “copied” cells from the grid as an “unformatted” table in the slide. It will also copy the “row header” if it is displayed in the grid, as well as the “new row” if the grid's AllowUserToAddRows is set to true. If this “unformatted paste” works for you, then you are good to go.
If you prefer to have at least a minimally formatted table and ignore the row headers and last empty row… It may be easier to simply “create” a new Table in the slide with the size we want along with the correct number of rows and columns. Granted, this may be more work, however, using the paste is going to require this anyway “IF” you want the table formatted.
The method (below) takes a PowerPoint _Slide and a DataGridView. The code “creates” a new Table in the slide based on the number of rows and columns in the given grid. With this approach, the table will be “formatted” using the default “Table Style” in the presentation. So, this may give you the formatting you want by simply “creating” the table as opposed to “pasting” the table.
I have tried to “apply” one of the existing “Table Styles” in PowerPoint, however, the examples I saw used something like…
table.ApplyStyle("{5C22544A-7EE6-4342-B048-85BDC9FD1C3A}");
Which uses a GUID id to identify “which” style to use. I am not sure why MS decided on this GUID approach… this is beyond me, and it worked for “some” styles but not all.
Also, more common-sense solutions that showed something like…
table.StylePreset = TableStylePreset.MediumStyle2Accent2;
Unfortunately using my 2019 version of Office PowerPoint, this property does not exist. I have abandoned further research on this as it appears to be version dependent. Very annoying!
Given this, it may be easier if we format the cells individually as we want. We will need to add the cells text from the grid into the individual cells anyway, so we could also format the individual cells at the same time. Again, I am confident there is a better way, however, I could not find one.
Below the InsertTableIntoSlide(_Slide slide, DataGridView dgv) method takes a slide and a grid as parameters. It will add a table to the slide with data from the given grid. A brief code trace is below.
First, the total number of rows in the grid (not including the headers) is stored in totRows. If the grid's AllowUserToAddRows is true, then totRows is decremented by 1 to ignore the new row. Next, the number of columns in the grid is stored in totCols. The top-left X and Y point (topLeftX and topLeftY) positions the table in the slide, along with the table's width and height.
ADDED NOTE: Using the AllowUserToAddRows property to determine the number of rows … may NOT work as described above and will “miss” the last row… “IF” AllowUserToAddRows is true (default) AND the grid is data bound to a data source that does NOT allow new rows to be added. In that case you do NOT want to decrement the totRows variable.
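One way to sidestep that special case, sketched here as an alternative rather than something I used in the method below, is to count the rows that are not the “new row” placeholder by checking DataGridViewRow.IsNewRow. The helper name CountDataRows is made up for the example.

```csharp
using System.Linq;
using System.Windows.Forms;

public static class GridRowCounter
{
    // Counts the rows that hold data, skipping the empty "new row"
    // placeholder if the grid is currently showing one.
    public static int CountDataRows(DataGridView dgv)
    {
        return dgv.Rows.Cast<DataGridViewRow>().Count(row => !row.IsNewRow);
    }
}
```

Since the new row, when present, is always the last row, using this count as totRows keeps the later loops over dgv.Rows[i] valid.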
Next a “Table” “Shape” is added to the slide using the previous variables to define the base table dimensions. Next are two loops. The first loop adds the header cells to the first row in the table. Then a second loop to add the data from the cells in the grid… to the table cells in the slide.
The commented-out code is left as an example in case you want to do some specific formatting for the individual cells. This was not needed in my case since the “default” table style was close to the formatting I wanted.
Also, note that “ForeColor” is the “background” color of the cell/shape. Strange!
I hope this helps. And again, I could not sympathize more about having to use PowerPoint interop.
private void InsertTableIntoSlide(_Slide slide, DataGridView dgv) {
try {
int totRows;
if (dgv.AllowUserToAddRows) {
totRows = dgv.Rows.Count - 1;
}
else {
totRows = dgv.Rows.Count;
}
int totCols = dgv.Columns.Count;
int topLeftX = 10;
int topLeftY = 10;
int width = 400;
int height = 100;
// add extra row for header row
Shape shape = slide.Shapes.AddTable(totRows + 1, totCols, topLeftX, topLeftY, width, height);
Table table = shape.Table;
for (int i = 0; i < dgv.Columns.Count; i++) {
table.Cell(1, i+1).Shape.TextFrame.TextRange.Text = dgv.Columns[i].HeaderText;
//table.Cell(1, i+1).Shape.Fill.ForeColor.RGB = ColorTranslator.ToOle(Color.Blue);
//table.Cell(1, i+1).Shape.TextFrame.TextRange.Font.Bold = Microsoft.Office.Core.MsoTriState.msoTrue;
//table.Cell(1, i+1).Shape.TextFrame.TextRange.Font.Color.RGB = ColorTranslator.ToOle(Color.White);
}
int curRow = 2;
for (int i = 0; i < totRows; i++) {
for (int j = 0; j < totCols; j++) {
if (dgv.Rows[i].Cells[j].Value != null) {
table.Cell(curRow, j + 1).Shape.TextFrame.TextRange.Text = dgv.Rows[i].Cells[j].Value.ToString();
//table.Cell(curRow, j + 1).Shape.Fill.ForeColor.RGB = ColorTranslator.ToOle(Color.LightGreen);
//table.Cell(curRow, j + 1).Shape.TextFrame.TextRange.Font.Bold = Microsoft.Office.Core.MsoTriState.msoTrue;
//table.Cell(curRow, j + 1).Shape.TextFrame.TextRange.Font.Color.RGB = ColorTranslator.ToOle(Color.Black);
}
}
curRow++;
}
}
catch (Exception ex) {
MessageBox.Show("Error: " + ex.Message);
}
}
I'm trying to change the text in some PDF annotations using iTextSharp. Here is my code:
void changeAnnotations(string inputPath, string outputPath)
{
PdfReader pdfReader = new PdfReader(inputPath);
PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(outputPath, FileMode.Create));
//get the PdfDictionary of the 1st page
PdfDictionary pageDict = pdfReader.GetPageN(1);
//get annotation array
PdfArray annotArray = pageDict.GetAsArray(PdfName.ANNOTS);
//iterate through annotation array
int size = annotArray.Size;
for (int i = 0; i < size; i++)
{
//get value of /Contents
PdfDictionary dict = annotArray.GetAsDict(i);
PdfString contents = dict.GetAsString(PdfName.CONTENTS);
//check if /Contents key exists
if (contents != null)
{
//set new value
dict.Put(PdfName.CONTENTS, new PdfString("value has been changed"));
}
}
pdfStamper.Close();
}
When I open the output file in Adobe Reader, none of the text has changed in any of the annotations. How should I be setting the new value in an annotation?
UPDATE: I've found that the value is being changed in the popup box that appears when I click on the annotation. And in some cases, when I modify this value in the popup box, the change is then applied to the annotation.
As the OP clarified in a comment:
This annotation is a FreeText, how do I find and change the text that's displayed in this text box?
Free text annotations allow a number of mechanisms to set the displayed text:
1. A pre-formatted appearance stream, referenced by the N entry in the AP dictionary
2. A rich text string with a default style string, given in RC and DS respectively
3. A default appearance string applied to the contents, given in DA and Contents respectively
(For details cf. the PDF specification ISO 32000-1 section 12.5.6.6 Free Text Annotations)
If you want to change the text using one of these mechanisms, make sure you remove or adjust the contents of the entries for the other mechanisms; otherwise your change might not be visible at all, or might be visible in some viewers but not in others.
I can't figure out how to determine if there is an appearance stream. Is that the /AP property? I checked that for one of the annotations and it's a dictionary with a single entry whose value is 28 0 R.
So that one of the annotations indeed comes with an appearance stream. The single entry whose value is 28 0 R presumably has the N name to indicate the normal appearance. 28 0 R is a reference to the indirect object with object number 28 and generation 0.
If you want to change the text content but do not want to deal with the formatting details, you should remove the AP entry.
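Applied to the code from the question, that could look roughly like the sketch below. It assumes the annotations of interest are on page 1 and have the /FreeText subtype; it also drops the rich text entry RC so that it cannot contradict the new Contents. Whether a viewer rebuilds the appearance from DA and Contents afterwards depends on the viewer, so treat this as a starting point. The names FreeTextEditor and SetFreeTextContents are made up for the example.

```csharp
using System.IO;
using iTextSharp.text.pdf;

public static class FreeTextEditor
{
    public static void SetFreeTextContents(string inputPath, string outputPath, string newText)
    {
        PdfReader reader = new PdfReader(inputPath);
        using (FileStream output = new FileStream(outputPath, FileMode.Create))
        {
            PdfStamper stamper = new PdfStamper(reader, output);

            // Assumption: only the first page is processed, as in the question.
            PdfArray annots = reader.GetPageN(1).GetAsArray(PdfName.ANNOTS);
            if (annots != null)
            {
                for (int i = 0; i < annots.Size; i++)
                {
                    PdfDictionary annot = annots.GetAsDict(i);
                    if (annot == null) continue;

                    PdfName subtype = annot.GetAsName(PdfName.SUBTYPE);
                    if (subtype == null || !subtype.Equals(new PdfName("FreeText"))) continue;

                    annot.Put(PdfName.CONTENTS, new PdfString(newText));
                    annot.Remove(PdfName.AP);          // drop the pre-formatted appearance
                    annot.Remove(new PdfName("RC"));   // drop rich text that would go stale
                }
            }
            stamper.Close();
        }
        reader.Close();
    }
}
```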
I create a table of figures programmatically in a Word document.
However, the ToF style is centered and I would like it to be left-indented.
To do so (set the paragraph indentation) I have to get the paragraph where the ToF is located.
This is the way to access the ToF:
wordApp.ActiveDocument.TablesOfFigures[1]
Any ideas?
Try the code below. It assumes that TablesOfFigures[1] exists (otherwise accessing it will throw an exception).
// Check in which paragraph TablesOfFigures[1] is found
for (int i=1; i <= wordApp.ActiveDocument.Paragraphs.Count; i++)
{
if (IsInRange(wordApp.ActiveDocument.TablesOfFigures[1].Range, wordApp.ActiveDocument.Paragraphs[i].Range))
{
MessageBox.Show("ToF is in paragraph " + i);
}
}
// Returns true if 'target' is contained in 'source'
private bool IsInRange(Range target, Range source)
{
return target.Start >= source.Start && target.End <= source.End;
}
Provided that you only have one Table of Figures, you can try this:
With wordApp.ActiveDocument.TablesOfFigures(1).Range
'Setting the indent
.ParagraphFormat.LeftIndent = CentimetersToPoints(1)
End With
I have tested it using just Word: it takes the Table of Figures and indents it by 1 cm.
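Since the question uses C#, here is what I would expect the interop equivalent to look like. This is an untested sketch: it assumes a reference to Microsoft.Office.Interop.Word, and the helper name IndentFirstTableOfFigures is made up. It also sets the alignment to left, because the question says the ToF style is centered.

```csharp
using Word = Microsoft.Office.Interop.Word;

public static class TableOfFiguresFormatter
{
    // Left-aligns the first Table of Figures and indents it by the given
    // number of centimeters. Does nothing if the document has no ToF.
    public static void IndentFirstTableOfFigures(Word.Application wordApp, float centimeters)
    {
        Word.Document doc = wordApp.ActiveDocument;
        if (doc.TablesOfFigures.Count < 1) return;

        Word.Range range = doc.TablesOfFigures[1].Range;
        range.ParagraphFormat.Alignment = Word.WdParagraphAlignment.wdAlignParagraphLeft;
        range.ParagraphFormat.LeftIndent = wordApp.CentimetersToPoints(centimeters);
    }
}
```

Keep in mind that updating the Table of Figures may reapply its style, in which case changing the style itself would be the more durable fix.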