Is it possible in iText 7 to find out whether a table (added to a document) occupies one or more pages, and on which pages it has been placed?
I've tried the following in a handler on END_PAGE:
IRenderer pRenderer = TableData.CreateRendererSubTree().SetParent(doc.GetRenderer());
LayoutResult pLayoutResult = pRenderer.Layout(new LayoutContext(new LayoutArea(0, PageSize.A4)));
float y = pLayoutResult.GetOccupiedArea().GetBBox().GetY();
float x = pLayoutResult.GetOccupiedArea().GetBBox().GetX();
float xBottom = pLayoutResult.GetOccupiedArea().GetBBox().GetBottom();
float xHeight = pLayoutResult.GetOccupiedArea().GetBBox().GetHeight();
int pageNumber= pLayoutResult.GetOccupiedArea().GetPageNumber();
I've tried with a table that fits on the first page and with a table that extends over the first and second pages.
pageNumber is always 0.
Thanks in advance.
This is certainly possible. iText 7 allows you to override rendering logic, one of the simplest applications of which is knowing where on the page your elements are going to be placed.
To store the information about the pages on which our table is placed, we can define a small helper class:
private static class LayoutInfo {
Collection<Integer> occupiedPages = new ArrayList<>();
public void addPage(int pageNum) {
occupiedPages.add(pageNum);
}
public Collection<Integer> getOccupiedPages() {
return occupiedPages;
}
}
Now, we can define our custom table renderer which is going to store the page numbers into this LayoutInfo object:
private static class CustomTableRenderer extends TableRenderer {
private LayoutInfo layoutInfo;
public CustomTableRenderer(Table modelElement, LayoutInfo info) {
super(modelElement);
this.layoutInfo = info;
}
@Override
public void draw(DrawContext drawContext) {
super.draw(drawContext);
layoutInfo.addPage(occupiedArea.getPageNumber());
}
@Override
public IRenderer getNextRenderer() {
return new CustomTableRenderer((Table) modelElement, layoutInfo);
}
}
You have to set the custom renderer to the table after all the cells have been added into it and the table is ready to be added to the document:
table.setNextRenderer(new CustomTableRenderer(table, info));
Full high level code:
PdfDocument pdfDocument = new PdfDocument(new PdfWriter(outFileName));
Document document = new Document(pdfDocument);
LayoutInfo info = new LayoutInfo();
Table table = new Table(2);
for (int i = 0; i < 200; i++) {
table.addCell("Row number ");
table.addCell(i + "");
}
table.setNextRenderer(new CustomTableRenderer(table, info));
document.add(table);
document.close();
System.out.println("The table is placed on the following pages: " + info.getOccupiedPages().toString());
Output of the code:
The table is placed on the following pages: [1, 2, 3, 4, 5, 6]
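Once the page numbers are collected, answering the original question is just a matter of inspecting the collection. Here is a minimal sketch, assuming the pages were gathered as above. The class PageSpan and its methods are hypothetical helpers, not part of iText, and a TreeSet is used in case the same page is reported more than once:

```java
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of iText: summarizes the page numbers
// collected by a LayoutInfo-style object.
class PageSpan {
    private final TreeSet<Integer> pages;

    PageSpan(Collection<Integer> occupiedPages) {
        // TreeSet keeps the pages sorted and drops duplicates.
        this.pages = new TreeSet<>(occupiedPages);
    }

    // True when the element was drawn on more than one distinct page.
    boolean spansMultiplePages() {
        return pages.size() > 1;
    }

    // Both accessors throw NoSuchElementException if no page was recorded.
    int firstPage() {
        return pages.first();
    }

    int lastPage() {
        return pages.last();
    }

    public static void main(String[] args) {
        PageSpan span = new PageSpan(List.of(1, 2, 3, 4, 5, 6));
        System.out.println("Spans multiple pages: " + span.spansMultiplePages());
        System.out.println("First page: " + span.firstPage() + ", last page: " + span.lastPage());
    }
}
```

Keep in mind that the collection is only filled while the table is being drawn, so it should be read after the document has been closed, as in the code above.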
Note that the code is in Java but translating into C# is mostly a matter of changing cases of characters in a couple of places.
I'm using ColumnDocumentRenderer to draw content in two columns.
Here is the code:
public void ManipulatePdf(string dest)
{
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest));
Document doc = new Document(pdfDoc, PageSize.A4);
doc.SetMargins(55f, 55f, 45f, 55f);
var interval = 20f;
var columnWidth = (doc.GetPdfDocument().GetDefaultPageSize().GetWidth() - 110 - interval) / 2;
var pageHeight = doc.GetPdfDocument().GetDefaultPageSize().GetHeight() - 110;
var baseText = "We have seen too many reports, too many words, too many good intentions, too many families torn apart, and too many excruciatingly painful deaths to see yet more delays in taking collective action.";
var pTitle = new Paragraph(baseText);
pTitle.SetFontSize(20);
pTitle.SetTextAlignment(iText.Layout.Properties.TextAlignment.JUSTIFIED);
doc.Add(pTitle);
var currentYLine = doc.GetRenderer().GetCurrentArea().GetBBox().GetTop();
Rectangle[] columns = {
new Rectangle(55, 55,
columnWidth,
currentYLine - 55),
new Rectangle(55 + columnWidth + interval, 55,
columnWidth,
currentYLine - 55) };
doc.SetRenderer(new ColumnDocumentRenderer(doc, columns));
for (int i = 1; i <= 12; i++)
{
var text = baseText;
for (int j = 1; j <= i; j++)
{
text += " Additional Text. ";
}
var p = new Paragraph(text);
p.SetTextAlignment(iText.Layout.Properties.TextAlignment.JUSTIFIED);
doc.Add(p);
if (i == 6)
{
var tp = new Paragraph("Introduction");
tp.SetFontSize(20);
tp.SetMarginTop(50);
tp.SetTextAlignment(iText.Layout.Properties.TextAlignment.JUSTIFIED);
doc.Add(tp);
}
}
doc.Close();
}
Below is the created PDF:
Please take a look at the red line in this screenshot. My question is: how can I make the bottoms of the two columns align on a straight line?
Thanks & Regards.
As discussed in the comments to the original question, an algorithm to auto-align the bottom lines does not exist in iText, and defining its behavior turned out to be tricky even for the person asking the question.
However, it is possible to calculate the difference between the vertical positions of the bottom lines while the document is being rendered; you can then correct the content and create another document with the proper layout.
Here is an example implementation which analyzes the positions of the content as it is drawn on the document:
class CustomColumnDocumentRenderer : ColumnDocumentRenderer {
public CustomColumnDocumentRenderer(Document document, Rectangle[] columns) : base(document, columns) {
}
public CustomColumnDocumentRenderer(Document document, bool immediateFlush, Rectangle[] columns) : base(document, immediateFlush, columns) {
}
IDictionary<int, float> leftColumnBottom = new Dictionary<int, float>();
IDictionary<int, float> rightColumnBottom = new Dictionary<int, float>();
protected override void FlushSingleRenderer(IRenderer resultRenderer) {
TraverseRecursively(resultRenderer, leftColumnBottom, rightColumnBottom);
base.FlushSingleRenderer(resultRenderer);
}
void TraverseRecursively(IRenderer child, IDictionary<int, float> leftColumnBottom, IDictionary<int, float> rightColumnBottom) {
if (child is LineRenderer) {
int page = child.GetOccupiedArea().GetPageNumber();
if (!leftColumnBottom.ContainsKey(page)) {
leftColumnBottom[page] = 1000;
}
if (!rightColumnBottom.ContainsKey(page)) {
rightColumnBottom[page] = 1000;
}
bool isLeftColumn = !(child.GetOccupiedArea().GetBBox().GetX() > PageSize.A4.GetWidth() / 2);
if (isLeftColumn) {
leftColumnBottom[page] =
Math.Min(leftColumnBottom[page], child.GetOccupiedArea().GetBBox().GetBottom());
} else {
rightColumnBottom[page] = Math.Min(rightColumnBottom[page],
child.GetOccupiedArea().GetBBox().GetBottom());
}
} else {
foreach (IRenderer ownChild in (child is ParagraphRenderer ? (((ParagraphRenderer)child).GetLines().Cast<IRenderer>()) : child.GetChildRenderers())) {
TraverseRecursively(ownChild, leftColumnBottom, rightColumnBottom);
}
}
}
public List<float> getDiffs() {
List<float> ans = new List<float>();
foreach (int pageNum in leftColumnBottom.Keys) {
ans.Add(leftColumnBottom[pageNum] - rightColumnBottom[pageNum]);
}
return ans;
}
}
To use it, make sure to pass the customized document renderer to the Document instance:
CustomColumnDocumentRenderer renderer = new CustomColumnDocumentRenderer(doc, true, columns);
doc.SetRenderer(renderer);
Then you can get page-by-page diffs:
Console.WriteLine(renderer.getDiffs()[0]);
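The bookkeeping done by the renderer above can also be looked at in isolation. The following is a rough Java sketch (Java 16+ for records) of the same idea, not iText code: LineBox is a hypothetical stand-in for the occupied area of a laid-out line, and the "left of the page middle" test mirrors the one in the answer. Unlike the original, this version skips pages where one of the columns has no lines instead of using a sentinel value:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a laid-out line: page number, left x and bottom y.
record LineBox(int page, float x, float bottom) {}

class ColumnBottoms {
    // For every page that has lines in both columns, returns
    // (lowest bottom of the left column) - (lowest bottom of the right column).
    static Map<Integer, Float> diffs(List<LineBox> lines, float pageWidth) {
        Map<Integer, Float> left = new TreeMap<>();
        Map<Integer, Float> right = new TreeMap<>();
        for (LineBox line : lines) {
            // A line belongs to the right column if it starts past the page middle.
            Map<Integer, Float> column = line.x() > pageWidth / 2 ? right : left;
            // Keep the smallest bottom coordinate seen so far on this page.
            column.merge(line.page(), line.bottom(), (a, b) -> Math.min(a, b));
        }
        Map<Integer, Float> result = new TreeMap<>();
        for (Map.Entry<Integer, Float> entry : left.entrySet()) {
            Float rightBottom = right.get(entry.getKey());
            if (rightBottom != null) {
                result.put(entry.getKey(), entry.getValue() - rightBottom);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<LineBox> lines = List.of(
                new LineBox(1, 55f, 100f),
                new LineBox(1, 55f, 60f),
                new LineBox(1, 320f, 80f));
        System.out.println(diffs(lines, 595f));
    }
}
```

A negative value means the left column ends lower on the page than the right one, since PDF y coordinates grow upwards.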
I'm pretty new to TDD and I have a hard time understanding how to test private members of a class (I know! It's private, shouldn't be tested - but please keep reading). We might have a public function which sets a private property and another public function that returns "something" based on that private property.
Let me show you a basic example:
public class Cell
{
public int X { get; set; }
public int Y { get; set; }
public string Value { get; set; }
}
public class Table
{
private Cell[,] Cells { get; }
public Table(Cell[,] cells)
{
Cells = cells;
}
public void SetCell(int x, int y, string value)
{
Cells[x, y].Value = value;
}
public void Reset()
{
for (int i = 0; i < Cells.GetLength(0); i++)
{
for (int j = 0; j < Cells.GetLength(1); j++)
{
Cells[i, j].Value = "";
}
}
}
public bool AreNeighborCellsSet(int x, int y)
{
bool areNeighborCellsSet = false;
// checking...
return areNeighborCellsSet;
}
}
In this example Cells are private, because there's no reason to make them public. I don't need to know the value of a particular Cell outside this class. I just need to know whether neighbor cells are empty.
1. How can I test the Reset method?
Technically I should create a Table with a mocked array of cells, call Reset and then assert that every cell has an empty Value. But I can't actually check whether they are empty or not.
2. In this case I would call Assert many times (for every cell) - is that good practice? I've read that "It's not!", but Reset resets all cells, so I have to somehow check every cell.
EDIT:
Option 2:
public class Table
{
private ICell[,] Cells { get; }
public Table(int height, int width, ICellFactory cellFactory)
{
Cells = new ICell[height, width];
for (int i = 0; i < Cells.GetLength(0); i++)
{
for (int j = 0; j < Cells.GetLength(1); j++)
{
// Store the created cell itself; the array elements are null until assigned
Cells[i, j] = cellFactory.Create(i, j);
}
}
}
// Rest is the same...
}
Your class has three public methods:
void SetCell
void Reset
bool AreNeighborCellsSet
So all functionality should be tested only through those methods, with the possible help of constructor arguments.
I am afraid you are not doing TDD, because you are trying to test already implemented logic (the for loop over an internal member). With TDD you write unit tests using only the public API of the class under test.
When you test the Reset method, think about how it affects the results of the other public methods. The Table class has only one method that returns a value we can observe, bool AreNeighborCellsSet, so it is the only method against which we can execute our asserts.
For the Reset method you need to set cells so that AreNeighborCellsSet returns true. Then execute Reset and assert that AreNeighborCellsSet now returns false.
[Test]
public void AfterResetGivenCellShouldNotHaveNeighbors()
{
// Arrange
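var cell = new Cell { X = 1, Y = 1, Value = "central" };
var neighborCell = new Cell { X = 1, Y = 2, Value = "neighbor" };
// Table expects a fully populated Cell[,], so fill a 3x3 grid with empty cells first
var cells = new Cell[3, 3];
for (int i = 0; i < 3; i++)
{
for (int j = 0; j < 3; j++)
{
cells[i, j] = new Cell { X = i, Y = j, Value = "" };
}
}
cells[cell.X, cell.Y] = cell;
cells[neighborCell.X, neighborCell.Y] = neighborCell;
var table = new Table(cells);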
// table.AreNeighborCellsSet(cell.X, cell.Y) - should return true at this moment
// Act
table.Reset();
// Assert
table.AreNeighborCellsSet(cell.X, cell.Y).Should().BeFalse();
}
This is a good example of TDD (Test-Driven Development), where problems with testing are a good sign that something is wrong with the design.
Actually, I think that in your case you don't need a Reset method at all - just create a new instance of Table every time you need to reset it.
The answer of Ignas may be a workaround for the problem, but I feel a need to clarify some design issues here:
Basically there is no need to check whether the loop iterates through the whole collection. That is tested by the framework team at MS.
What you need to do is to check that your new type (in this case Cell) behaves properly.
In my opinion you're violating the SRP. There is really no need for the Table class to know how to reset this particular implementation of Cell. If some day you decide to create a cell able to contain, let's say, a picture, you'll most likely feel a need to clear it in some other way than by setting its Value property to an empty string.
Start with abstracting Cell to an interface. Then just add a Reset() method to the Cell and call it in the loop in the Table class for every cell.
That would allow you to create tests for your implementation of Cell, and there you can check that after calling Reset() the cell's value truly becomes null or empty or whatever you need :-)
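To make the suggestion concrete, here is a minimal Java sketch of that design. All names (ResettableCell, TextCell, CellTable, isSet) are hypothetical and chosen for illustration; the asker's C# types would look much the same. The point is that the table only delegates, so the "value becomes empty" rule is tested on the cell implementation itself:

```java
// Hypothetical abstraction of a cell: it knows how to reset itself.
interface ResettableCell {
    void reset();

    boolean isSet();
}

// One possible implementation holding a text value.
class TextCell implements ResettableCell {
    private String value;

    TextCell(String value) {
        this.value = value;
    }

    @Override
    public void reset() {
        value = "";
    }

    @Override
    public boolean isSet() {
        return !value.isEmpty();
    }
}

// The table no longer knows what "empty" means for a cell; it only delegates.
class CellTable {
    private final ResettableCell[][] cells;

    CellTable(ResettableCell[][] cells) {
        this.cells = cells;
    }

    void reset() {
        for (ResettableCell[] row : cells) {
            for (ResettableCell cell : row) {
                cell.reset();
            }
        }
    }
}

class CellResetDemo {
    public static void main(String[] args) {
        TextCell cell = new TextCell("abc");
        cell.reset();
        System.out.println("Set after reset: " + cell.isSet());
    }
}
```

A picture cell could then implement reset() in its own way without any change to the table.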
There are ways to test private properties without changing your code or adding extra code to the tested class: you can use testing tools that allow you to do so.
For example, I used Typemock to change the logic of the Table constructor so that it creates a populated table, and to read the private property Cells after calling the Reset method:
public void TestMethod1()
{
var handle = Isolate.Fake.NextInstance<Table>(Members.CallOriginal, context =>
{
var tempcells = context.Parameters[0] as Cell[,];
for (int i = 0; i < tempcells.GetLength(0); i++)
{
for (int j = 0; j < tempcells.GetLength(1); j++)
{
tempcells[i, j] = cellFactory.Create(i, j);
}
}
context.Parameters[0] = tempcells;
//calling the original ctor with tempcells as the parameter
context.WillCallOriginal();
});
// calling the ctor with the custom logic
var testTable = new Table(new Cell[2,2]);
testTable.Reset();
// calling the private property
var resTable = Isolate.Invoke.Method(testTable, "get_Cells") as Cell[,];
// for asserting
var emptyCell = new Cell { Value = string.Empty };
for (int i = 0; i < 2; i++)
{
for(int j=0; j<2; j++)
{
Assert.AreEqual(emptyCell.Value, resTable[i, j].Value);
}
}
}
As Zegar mentioned in the comments, there could be several code design considerations, and writing tests first (i.e. using TDD) would probably help you avoid running into such situations at all. However, I think there is a simple workaround as well.
You are passing the array by reference into the Table class and the class never replaces it, so you can still access the array outside of the Table class even though it is modified inside the class.
You don't need to do many asserts; you just need to arrange an expected array for the assertion. Use FluentAssertions, specifically ShouldBeEquivalentTo(), which is a very nice solution for comparing arrays. It is available as a NuGet package.
Sample test below.
[TestMethod]
public void TestMethod1()
{
// Arrange
var expectedCells = new Cell[2, 2];
expectedCells[0, 0] = new Cell { Value = string.Empty };
expectedCells[0, 1] = new Cell { Value = string.Empty };
expectedCells[1, 0] = new Cell { Value = string.Empty };
expectedCells[1, 1] = new Cell { Value = string.Empty };
var cells = new Cell[2,2];
cells[0,0] = new Cell { Value = "00" };
cells[0,1] = new Cell { Value = "01" };
cells[1,0] = new Cell { Value = "10" };
cells[1,1] = new Cell { Value = "11" };
var table = new Table(cells);
// Act
table.Reset();
// Assert
cells.ShouldBeEquivalentTo(expectedCells); // using FluentAssertions
}
To summarize and answer your questions:
1. Test the cells array you pass into the constructor.
2. Ideally you want to have a single assert per test, if possible.
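The same idea, relying on the shared array reference and comparing against one expected array, can be sketched in Java as follows. This is an illustration only: MutableCell, Grid and valuesOf are hypothetical names, and Arrays.deepEquals plays the role that ShouldBeEquivalentTo() plays in the C# test:

```java
import java.util.Arrays;

// Hypothetical mutable cell, mirroring the Value property of the C# Cell.
class MutableCell {
    String value;

    MutableCell(String value) {
        this.value = value;
    }
}

class Grid {
    // Same array object the caller passed in; it is never replaced.
    private final MutableCell[][] cells;

    Grid(MutableCell[][] cells) {
        this.cells = cells;
    }

    void reset() {
        for (MutableCell[] row : cells) {
            for (MutableCell cell : row) {
                cell.value = "";
            }
        }
    }
}

class SharedArrayDemo {
    // Extracts the values so that two grids can be compared structurally.
    static String[][] valuesOf(MutableCell[][] cells) {
        String[][] out = new String[cells.length][];
        for (int i = 0; i < cells.length; i++) {
            out[i] = new String[cells[i].length];
            for (int j = 0; j < cells[i].length; j++) {
                out[i][j] = cells[i][j].value;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        MutableCell[][] cells = {
            { new MutableCell("00"), new MutableCell("01") },
            { new MutableCell("10"), new MutableCell("11") }
        };
        new Grid(cells).reset();
        String[][] expected = { { "", "" }, { "", "" } };
        // A single comparison covers every cell.
        System.out.println(Arrays.deepEquals(valuesOf(cells), expected));
    }
}
```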
I am currently using the Map Sample from Microsoft. This sample creates clusters of pins on the map to save space.
I am finding it difficult to select a POI that is on the map and get the properties of that POI.
There are over 1000 POIs, so I need clustering, but the sample is not clear on how to select the POI in question.
Code is below:
private async Task LoadPlaceInfoAsync()
{
Uri dataUri = new Uri("ms-appx:///places.txt");
StorageFile file = await StorageFile.GetFileFromApplicationUriAsync(dataUri);
IList<string> lines = await FileIO.ReadLinesAsync(file);
// In the places.txt file, each place is represented by three lines:
// Place name, latitude, and longitude.
for (int i = 0; i < lines.Count; i += 3)
{
PlaceInfo place = new PlaceInfo
{
Name = lines[i],
Location = new PlaceLocation(double.Parse(lines[i + 1]), double.Parse(lines[i + 2]))
};
places.Add(place);
}
}
private void refreshMapIcons()
{
// Erase the old map icons.
myMap.MapElements.Clear();
// Create an icon for each cluster.
foreach (var cluster in GetClustersForZoomLevel(previousZoomLevel))
{
MapIcon mapIcon = new MapIcon
{
Location = new Geopoint(cluster.Location.Geoposition),
CollisionBehaviorDesired = MapElementCollisionBehavior.RemainVisible,
};
if (cluster.Places.Count > 1)
{
// The cluster represents more than one place. Use a custom marker that shows
// how many places are represented by this cluster, and place the marker
// centered at the cluster.
mapIcon.Image = numberIconReferences[Math.Min(cluster.Places.Count, 9) - 2];
mapIcon.NormalizedAnchorPoint = new Point(0.5, 0.5);
}
else
{
// The cluster represents a single place. Label the cluster with the place name.
mapIcon.Title = cluster.Places[0].Name;
}
myMap.MapElements.Add(mapIcon);
}
}
Does anyone know how to use the Visio insertListMember method (below) in C#?
https://msdn.microsoft.com/en-us/library/office/ff768115.aspx
I have tried to execute the method with the following commands but it gives a "Run Time Error- 424 object required"
I have also used the dropIntoList method and it works fine but for specific purposes I need to use the insertListMember method. (to determine the height of the list)
static void Main(string[] args)
{
//create the object that will do the drawing
visioDrawing.VisioDrawer Drawer = new visioDrawing.VisioDrawer();
Drawer.setUpVisio();
Visio.Shape testShape;
Visio.Shape testShape1;
testShape = Drawer.DropShape("abc", "lvl1Box");
testShape1 = Drawer.DropShape("ccc", "Capability");
Drawer.insertListMember(testShape, testShape1, 1);
}
public void insertListMember(Visio.Shape outerlist, Visio.Shape innerShape, int position)
{
ActiveDoc.ExecuteLine(outerlist + ".ContainerProperties.InsertListMember" + innerShape + "," + position);
}
To obtain the shape:
public Visio.Shape DropShape(string rectName, string masterShape)
{
//get the shape to drop from the masters collection
Visio.Master shapetodrop = GetMaster(stencilPath, masterShape);
// drop a shape on the page
Visio.Shape DropShape = acPage.Drop(shapetodrop, 1, 1);
//put name in the shape
Visio.Shape selShape = selectShp(DropShape.ID);
selShape.Text = rectName;
return DropShape;
}
private Visio.Master GetMaster(string stencilName, string mastername)
{
// open the page holding the masters collection so we can use it
MasterDoc = MastersDocuments.OpenEx(stencilName, (short)Visio.VisOpenSaveArgs.visOpenDocked);
// now get a masters collection to use
Masters = MasterDoc.Masters;
return Masters.get_ItemU(mastername);
}
From your code, 'Drawer' looks to be some kind of Visio app wrapper, but essentially InsertListMember allows you to add shapes that already exist on the page to a list. Here's an example of the method, and of the alternative Page.DropIntoList if you just want to drop directly from the stencil:
void Main()
{
// 'GetRunningVisio' as per
// http://visualsignals.typepad.co.uk/vislog/2015/12/getting-started-with-c-in-linqpad-with-visio.html
// but all you need is a reference to the app
var vApp = MyExtensions.GetRunningVisio();
var vDoc = vApp.Documents.Add("wfdgm_m.vstx");
var vPag = vDoc.Pages[1];
var vCtrlsStencil = vApp.Documents["WFCTRL_M.VSSX"];
var vListMst = vCtrlsStencil?.Masters["List box"];
if (vListMst != null)
{
var vListShp = vPag.Drop(vListMst, 2, 6);
var vListItemMst = vCtrlsStencil.Masters["List box item"];
var insertPosition = vListShp.ContainerProperties.GetListMembers().Length - 1;
//Use InsertListMember method
var firstListItem = vPag.Drop(vListItemMst, 4, 6);
vListShp.ContainerProperties.InsertListMember(firstListItem, insertPosition);
firstListItem.CellsU["FillForegnd"].FormulaU = "3"; //Green
//or use DropIntoList method on Page instead
var secondListItem = vPag.DropIntoList(vListItemMst, vListShp, insertPosition);
secondListItem.CellsU["FillForegnd"].FormulaU = "2"; //Red
}
}
This is using the Wireframe Diagram template (in Visio Professional) and should result in the following:
In case people were wondering, I ended up fixing the method. However, I believe that this method is not required if you are using the Visio interop assemblies v15. (I am using v14.)
public void insertListMember(int outerShpID, int innerShpID, int position)
{
acWindow.DeselectAll();
Visio.Page page = acWindow.Page;
acWindow.Select(page.Shapes.get_ItemFromID(innerShpID), (short)Microsoft.Office.Interop.Visio.VisSelectArgs.visSelect);
Debug.WriteLine("Application.ActivePage.Shapes.ItemFromID(" + outerShpID + ").ContainerProperties.InsertListMember ActiveWindow.Selection," + position);
ActiveDoc.ExecuteLine("Application.ActivePage.Shapes.ItemFromID(" + outerShpID + ").ContainerProperties.InsertListMember ActiveWindow.Selection," + position);
}
I have a PDF file that I am reading into a string using ITextExtractionStrategy. From that string I take a substring such as "My name is XYZ" and need to get the rectangular coordinates of the substring in the PDF file, but I am not able to do it. While searching I came across LocationTextExtractionStrategy, but I can't work out how to use it to get the coordinates.
Here is the code:
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
string getcoordinate="My name is XYZ";
How can I get the rectangular coordinates of this substring using iTextSharp?
Please help.
Here is a very, very simple version of an implementation.
Before implementing it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as:
Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)
It could also be written as
Draw Hello World at (10,10)
The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.
Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is because PDFs cannot wrap text to a new line. Any time that you see something that looks like a paragraph return you are actually seeing a brand new text drawing command that has a different y coordinate than the previous line. See this for further discussion.
The code below is a very simple implementation. For it I'm subclassing LocationTextExtractionStrategy which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class for storing these chunks and rectangles:
//Helper class that stores our rectangle and text
public class RectAndText {
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text) {
this.Rect = rect;
this.Text = text;
}
}
And here's the subclass:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
And finally an implementation of the above:
//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
doc.Add(new Paragraph("This is my sample file"));
doc.Close();
}
}
}
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}
//Loop through each chunk found
foreach (var p in t.myPoints) {
Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}
I can't stress enough that the above does not take "words" into account, that'll be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.
EDIT
(I had a great lunch so I'm feeling a little more helpful.)
Here's an updated version of MyLocationTextExtractionStrategy that does what my comments below say, namely it takes a string to search for and searches each chunk for that string. For all the reasons listed this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk it will also only return the first instance. Ligatures and diacritics could also mess with this.
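The "only the first instance" limitation comes from calling IndexOf once per chunk. If all instances within a chunk are wanted, the search can be repeated from the position after each hit. A small Java sketch of that loop follows; the class and method names are made up for illustration, and the comparison here is a plain case-sensitive one, unlike the culture-aware CompareOptions search in the C# code:

```java
import java.util.ArrayList;
import java.util.List;

class ChunkOccurrences {
    // Returns the start index of every occurrence of needle in chunkText,
    // including overlapping ones.
    static List<Integer> allIndexesOf(String chunkText, String needle) {
        List<Integer> starts = new ArrayList<>();
        if (needle.isEmpty()) {
            return starts; // an empty needle would match at every position
        }
        int from = 0;
        while (true) {
            int index = chunkText.indexOf(needle, from);
            if (index < 0) {
                break;
            }
            starts.add(index);
            from = index + 1; // continue right after the start of this hit
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(allIndexesOf("sample text, sample file", "sample"));
    }
}
```

Each start index could then be used in place of startPosition to pick the first and last character of that particular occurrence.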
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//The string that we're searching for
public String TextToSearchFor { get; set; }
//How to compare strings
public System.Globalization.CompareOptions CompareOptions { get; set; }
public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
this.TextToSearchFor = textToSearchFor;
this.CompareOptions = compareOptions;
}
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//See if the current chunk contains the text
var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
//If not found bail
if (startPosition < 0) {
return;
}
//Grab the individual characters
var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
//Grab the first and last character
var firstChar = chars.First();
var lastChar = chars.Last();
//Get the bounding box for the chunk of text
var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
var topRight = lastChar.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
}
}
You would use this the same as before but now the constructor has a single required parameter:
var t = new MyLocationTextExtractionStrategy("sample");
It's an old question, but I'm leaving my answer here as I could not find a correct one on the web.
As Chris Haas has explained, it is not easy to deal with words because iText deals with chunks. The code that Chris posted failed in most of my tests because a word is normally split across different chunks (he warns about that in his post).
To solve that problem, here is the strategy I have used:
1. Split chunks into characters (actually one TextRenderInfo object per character).
2. Group the characters by line. This is not straightforward, as you have to deal with chunk alignment.
3. Search each line for the word you need to find.
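The three steps can be sketched independently of iText. The Java snippet below (Java 16+ for records) is a simplification under stated assumptions: Glyph is a hypothetical stand-in for a per-character TextRenderInfo, text is assumed to be horizontal, characters are put on the same line when their rounded baseline y is equal, and every character of a chunk is assumed to have the same width. The real code further down compares orientation and perpendicular distance instead:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a per-character TextRenderInfo:
// the character and the start point of its baseline.
record Glyph(char ch, float x, float y) {}

class LineSearch {
    // Step 1: split a chunk into characters (fixed character width assumed).
    static List<Glyph> chunk(String text, float x, float y, float charWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), x + i * charWidth, y));
        }
        return glyphs;
    }

    // Steps 2 and 3: group characters by line, order each line left to right,
    // and return the first character of the first match on every line.
    static List<Glyph> find(List<Glyph> glyphs, String word) {
        // Higher y first, i.e. top of the page first.
        Map<Integer, List<Glyph>> lines = new TreeMap<>(Comparator.reverseOrder());
        for (Glyph glyph : glyphs) {
            lines.computeIfAbsent(Math.round(glyph.y()), key -> new ArrayList<>()).add(glyph);
        }
        List<Glyph> hits = new ArrayList<>();
        for (List<Glyph> line : lines.values()) {
            line.sort(Comparator.comparingDouble(Glyph::x));
            StringBuilder text = new StringBuilder();
            for (Glyph glyph : line) {
                text.append(glyph.ch());
            }
            int index = text.indexOf(word);
            if (index >= 0) {
                hits.add(line.get(index));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "Hello World" delivered as four chunks in arbitrary order.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.addAll(chunk("rld", 90f, 10f, 10f));
        glyphs.addAll(chunk("H", 10f, 10f, 10f));
        glyphs.addAll(chunk("o Wo", 50f, 10f, 10f));
        glyphs.addAll(chunk("ell", 20f, 10f, 10f));
        for (Glyph hit : find(glyphs, "World")) {
            System.out.println("Found at x=" + hit.x() + ", y=" + hit.y());
        }
    }
}
```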
Here is the code. I have tested it with several documents and it works pretty well, but it could fail in some scenarios because this chunk -> words transformation is a bit tricky.
Hope it helps someone.
class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
private String m_SearchText;
public const float PDF_PX_TO_MM = 0.3528f;
public float m_PageSizeY;
public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
: base()
{
this.m_SearchText = sSearchText;
this.m_PageSizeY = fPageSizeY;
}
private void searchText()
{
foreach (LineInfo aLineInfo in m_LinesTextInfo)
{
int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
if (iIndex != -1)
{
TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
this.m_SearchResultsList.Add(aSearchResult);
}
}
}
private void groupChunksbyLine()
{
LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
LocationTextExtractionStrategyEx.LineInfo textInfo = null;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
{
if (textChunk1 == null)
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
else if (textChunk2.sameLine(textChunk1))
{
textInfo.appendText(textChunk2);
}
else
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
textChunk1 = textChunk2;
}
}
public override string GetResultantText()
{
groupChunksbyLine();
searchText();
//In this case the return value is not useful
return "";
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment baseline = renderInfo.GetBaseline();
//Create ExtendedChunk
ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
this.m_DocChunks.Add(aExtendedChunk);
}
public class ExtendedTextChunk
{
public string m_text;
private Vector m_startLocation;
private Vector m_endLocation;
private Vector m_orientationVector;
private int m_orientationMagnitude;
private int m_distPerpendicular;
private float m_charSpaceWidth;
public List<TextRenderInfo> m_ChunkChars;
public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
{
this.m_text = txt;
this.m_startLocation = startLoc;
this.m_endLocation = endLoc;
this.m_charSpaceWidth = charSpaceWidth;
this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
this.m_ChunkChars = chunkChars;
}
public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
{
return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
}
}
public class SearchResult
{
public int iPosX;
public int iPosY;
public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
{
//Get position of upperLeft coordinate
Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
//PosX
float fPosX = vTopLeft[Vector.I1];
//PosY
float fPosY = vTopLeft[Vector.I2];
//Transform to mm and get y from top of page
iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
}
}
public class LineInfo
{
public string m_Text;
public List<TextRenderInfo> m_LineCharsList;
public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
{
this.m_Text = initialTextChunk.m_text;
this.m_LineCharsList = initialTextChunk.m_ChunkChars;
}
public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
{
m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
this.m_Text += additionalTextChunk.m_text;
}
}
}
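A note on the unit conversion used in SearchResult above: PDF default user space units are points (1/72 inch), not pixels, so the 0.3528 factor is 25.4 / 72 rounded. A tiny Java sketch of the same conversion, with hypothetical names, in case you want to check the numbers separately:

```java
class PdfUnits {
    // One PDF default user space unit is 1/72 inch; one inch is 25.4 mm.
    static final double MM_PER_POINT = 25.4 / 72.0;

    static double toMm(double points) {
        return points * MM_PER_POINT;
    }

    // PDF y grows upwards from the bottom of the page; this returns the
    // distance in mm measured downwards from the top edge instead.
    static double yFromTopMm(double pageHeightPoints, double yPoints) {
        return toMm(pageHeightPoints - yPoints);
    }

    public static void main(String[] args) {
        // An A4 page is about 842 points high.
        System.out.println(toMm(842));
        System.out.println(yFromTopMm(842, 770));
    }
}
```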
I know this is a really old question, but below is what I ended up doing. I'm posting it here in the hope that it will be useful to someone else.
The following code will tell you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of individual words.
Note: I tested this on iTextSharp 5.5.11.0; it won't work on some older versions.
As mentioned above, PDFs have no concept of words, lines or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of splitting text into lines and words, so my solution is based on that.
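To give a rough idea of what that line splitting involves, here is a small library-independent sketch in Java. It is not iText's implementation: the Chunk class, lineKey and groupIntoLines are hypothetical names made up for illustration, and it assumes horizontal text only. It sorts positioned fragments top to bottom and left to right, then joins fragments whose baselines round to the same whole point, in the spirit of the integer comparison a sameLine check performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineGroupingSketch {

    /** Hypothetical stand-in for a positioned text fragment; not an iText type. */
    static final class Chunk {
        final String text;
        final float x; // start of the fragment, in points
        final float y; // baseline height, in points (PDF y grows upward)

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        /** Baseline rounded to a whole point, so tiny float differences do not split a line. */
        int lineKey() {
            return Math.round(y);
        }
    }

    /** Orders chunks top-to-bottom and left-to-right, then joins chunks that share a line key. */
    static List<String> groupIntoLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> a.lineKey() != b.lineKey()
                ? Integer.compare(b.lineKey(), a.lineKey()) // higher y comes first
                : Float.compare(a.x, b.x));                 // then left to right

        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            // A change of line key closes the line collected so far.
            if (previous != null && previous.lineKey() != c.lineKey()) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(c.text);
            previous = c;
        }
        if (previous != null) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("world", 150f, 700.1f),
                new Chunk("Hello ", 100f, 700.0f),
                new Chunk("Second line", 100f, 680f));
        for (String line : groupIntoLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

The real strategy also takes text orientation and word spacing into account, which this sketch ignores.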
DISCLAIMER:
This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, and that file has a comment saying that it's a dev preview, so this might not work in future versions.
Anyway, here's the code.
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
namespace Logic
{
public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
{
private readonly List<TextChunk> locationalResult = new List<TextChunk>();
private readonly ITextChunkLocationStrategy tclStrat;
public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp())
{
}
/**
* Creates a new text extraction renderer, with a custom strategy for
* creating new TextChunkLocation objects based on the input of the
* TextRenderInfo.
* @param strat the custom strategy
*/
public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
{
tclStrat = strat;
}
private bool StartsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[0] == ' ';
}
private bool EndsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[str.Length - 1] == ' ';
}
/**
* Filters the provided list with the provided filter
* @param textChunks a list of all TextChunks that this strategy found during processing
* @param filter the filter to apply. If null, filtering will be skipped.
* @return the filtered list
* @since 5.3.3
*/
private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
{
if (filter == null)
{
return textChunks;
}
var filtered = new List<TextChunk>();
foreach (var textChunk in textChunks)
{
if (filter.Accept(textChunk))
{
filtered.Add(textChunk);
}
}
return filtered;
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment segment = renderInfo.GetBaseline();
if (renderInfo.GetRise() != 0)
{ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to
Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
segment = segment.TransformBy(riseOffsetTransform);
}
TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
locationalResult.Add(tc);
}
public IList<TextLocation> GetLocations()
{
var filteredTextChunks = filterTextChunks(locationalResult, null);
filteredTextChunks.Sort();
TextChunk lastChunk = null;
var textLocations = new List<TextLocation>();
foreach (var chunk in filteredTextChunks)
{
if (lastChunk == null)
{
//initial
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
else
{
if (chunk.SameLine(lastChunk))
{
var text = "";
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
text += ' ';
text += chunk.Text;
textLocations[textLocations.Count - 1].Text += text;
}
else
{
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
}
lastChunk = chunk;
}
//return one entry per line; the caller filters these by the search text
return textLocations;
}
}
public class TextLocation
{
public float X { get; set; }
public float Y { get; set; }
public string Text { get; set; }
}
}
How to call the method:
// Needs "using System.Linq;" and "using iTextSharp.text.pdf;" in addition to the usings above.
IList<TextLocation> res;
using (var reader = new PdfReader(inputPdf))
{
var parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
res = strategy.GetLocations();
}
// res is declared outside the using block so it is still in scope here.
var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderByDescending(p => p.Y).ToList();
inputPdf is a byte[] containing the PDF data
pageNumber is the (1-based) number of the page you want to search
searchText is the text you are looking for
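A note on the coordinates: the X and Y values in TextLocation are in millimetres, converted from PDF points (1 point = 1/72 inch), and by default PDF coordinates are measured from the bottom-left corner of the page with y growing upward. If you need the distance from the top of the page instead, you have to subtract from the page height. Below is a minimal sketch of that arithmetic in plain Java. PageCoordinates, pointsToMillimeters and yFromTopInMillimeters are hypothetical helper names, not part of iText, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0).

```java
public class PageCoordinates {

    // 1 inch = 25.4 mm and 1 inch = 72 points.
    static final float MM_PER_POINT = 25.4f / 72f;

    /** Converts a length in PDF points to millimetres. */
    static float pointsToMillimeters(float points) {
        return points * MM_PER_POINT;
    }

    /**
     * Converts a y coordinate measured upward from the bottom of the page (in points)
     * to a distance measured downward from the top of the page (in millimetres).
     */
    static float yFromTopInMillimeters(float yPoints, float pageHeightPoints) {
        return pointsToMillimeters(pageHeightPoints - yPoints);
    }

    public static void main(String[] args) {
        float pageHeight = 842f; // roughly the height of an A4 page in points
        System.out.println(pointsToMillimeters(72f));
        System.out.println(yFromTopInMillimeters(770f, pageHeight));
    }
}
```

In the C# code the page height would come from the reader, for example via GetPageSizeWithRotation(pageNumber).Height.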
Here is how you use LocationTextExtractionStrategy in VB.NET.
Class definition:
Class TextExtractor
Inherits LocationTextExtractionStrategy
Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy
Public oPoints As IList(Of RectAndText) = New List(Of RectAndText)
Public Overrides Sub RenderText(renderInfo As TextRenderInfo) 'Implements IRenderListener.RenderText
MyBase.RenderText(renderInfo)
Dim bottomLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
Dim topRight As Vector = renderInfo.GetAscentLine().GetEndPoint() 'GetBaseline
Dim rect As Rectangle = New Rectangle(bottomLeft(Vector.I1), bottomLeft(Vector.I2), topRight(Vector.I1), topRight(Vector.I2))
oPoints.Add(New RectAndText(rect, renderInfo.GetText()))
End Sub
Private Function GetLines() As Dictionary(Of Single, ArrayList)
Dim oLines As New Dictionary(Of Single, ArrayList)
For Each p As RectAndText In oPoints
Dim iBottom = p.Rect.Bottom
If oLines.ContainsKey(iBottom) = False Then
oLines(iBottom) = New ArrayList()
End If
oLines(iBottom).Add(p)
Next
Return oLines
End Function
Public Function Find(ByVal sFind As String) As iTextSharp.text.Rectangle
Dim oLines As Dictionary(Of Single, ArrayList) = GetLines()
For Each oEntry As KeyValuePair(Of Single, ArrayList) In oLines
'Dim iBottom As Integer = oEntry.Key
Dim oRectAndTexts As ArrayList = oEntry.Value
Dim sLine As String = ""
For Each p As RectAndText In oRectAndTexts
sLine += p.Text
If sLine.IndexOf(sFind) <> -1 Then
Return p.Rect
End If
Next
Next
Return Nothing
End Function
End Class
Public Class RectAndText
Public Rect As iTextSharp.text.Rectangle
Public Text As String
Public Sub New(ByVal rect As iTextSharp.text.Rectangle, ByVal text As String)
Me.Rect = rect
Me.Text = text
End Sub
End Class
Usage (inserts a signature box to the right of the found text):
Sub EncryptPdf(ByVal sInFilePath As String, ByVal sOutFilePath As String)
Dim oPdfReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sInFilePath)
Dim oPdfDoc As New iTextSharp.text.Document()
Dim oPdfWriter As PdfWriter = PdfWriter.GetInstance(oPdfDoc, New FileStream(sOutFilePath, FileMode.Create))
'oPdfWriter.SetEncryption(PdfWriter.STRENGTH40BITS, sPassword, sPassword, PdfWriter.AllowCopy)
oPdfDoc.Open()
oPdfDoc.SetPageSize(iTextSharp.text.PageSize.LEDGER.Rotate())
Dim oDirectContent As iTextSharp.text.pdf.PdfContentByte = oPdfWriter.DirectContent
Dim iNumberOfPages As Integer = oPdfReader.NumberOfPages
Dim iPage As Integer = 0
Dim iBottomMargin As Integer = txtBottomMargin.Text '10
Dim iLeftMargin As Integer = txtLeftMargin.Text '500
Dim iWidth As Integer = txtWidth.Text '120
Dim iHeight As Integer = txtHeight.Text '780
Dim oStrategy As New parser.SimpleTextExtractionStrategy()
Do While (iPage < iNumberOfPages)
iPage += 1
oPdfDoc.SetPageSize(oPdfReader.GetPageSizeWithRotation(iPage))
oPdfDoc.NewPage()
Dim oPdfImportedPage As iTextSharp.text.pdf.PdfImportedPage =
oPdfWriter.GetImportedPage(oPdfReader, iPage)
Dim iRotation As Integer = oPdfReader.GetPageRotation(iPage)
If (iRotation = 90) Or (iRotation = 270) Then
oDirectContent.AddTemplate(oPdfImportedPage, 0, -1.0F, 1.0F,
0, 0, oPdfReader.GetPageSizeWithRotation(iPage).Height)
Else
oDirectContent.AddTemplate(oPdfImportedPage, 1.0F, 0, 0, 1.0F, 0, 0)
End If
'Dim sPageText As String = parser.PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oStrategy)
'sPageText = System.Text.Encoding.UTF8.GetString(System.Text.ASCIIEncoding.Convert(System.Text.Encoding.Default, System.Text.Encoding.UTF8, System.Text.Encoding.Default.GetBytes(sPageText)))
'If txtFind.Text = "" OrElse sPageText.IndexOf(txtFind.Text) <> -1 Then
Dim oTextExtractor As New TextExtractor()
PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oTextExtractor) 'Initialize oTextExtractor
Dim oRect As iTextSharp.text.Rectangle = oTextExtractor.Find(txtFind.Text)
If oRect IsNot Nothing Then
Dim iX As Integer = oRect.Left + oRect.Width + iLeftMargin 'Move right
Dim iY As Integer = oRect.Bottom - iBottomMargin 'Move down
Dim field As PdfFormField = PdfFormField.CreateSignature(oPdfWriter)
field.SetWidget(New Rectangle(iX, iY, iX + iWidth, iY + iHeight), PdfAnnotation.HIGHLIGHT_OUTLINE)
field.FieldName = "myEmptySignatureField" & iPage
oPdfWriter.AddAnnotation(field)
End If
Loop
oPdfDoc.Close()
oPdfReader.Close() 'Release the source PDF as well
End Sub