I have a PDF file that I am reading into a string using ITextExtractionStrategy. From that string I take a substring such as "My name is XYZ", and I need the rectangular coordinates of that substring in the PDF file, but I have not been able to get them. From searching I learned about LocationTextExtractionStrategy, but I can't work out how to use it to get the coordinates.
Here is the code:
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
string getcoordinate="My name is XYZ";
How can I get the rectangular coordinates of this substring using iTextSharp?
Please help.
Here is a very, very simple version of an implementation.
Before implementing it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as:
Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)
It could also be written as
Draw Hello World at (10,10)
The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.
Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command that has a different y coordinate than the previous line. See this for further discussion.
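To make the chunk ordering problem concrete, here is a minimal sketch that puts chunks back into reading order using only their positions. It does not use iTextSharp: PositionedChunk and ChunkOrdering are hypothetical names invented for this illustration, and the tolerance value is an assumption. It also ignores rotated text and gaps between words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a chunk reported to RenderText: its text and the
// start of its baseline. This is not an iTextSharp type.
public sealed class PositionedChunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public PositionedChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ChunkOrdering
{
    // Groups chunks whose y coordinates are within 'tolerance' of the first
    // chunk of the line, orders lines top to bottom (PDF y grows upward)
    // and orders the chunks of each line left to right.
    public static List<string> ToLines(IEnumerable<PositionedChunk> chunks, float tolerance = 0.5f)
    {
        var lines = new List<string>();
        var current = new List<PositionedChunk>();
        float currentY = 0f;

        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            // A different y coordinate means a new line has started
            if (current.Count > 0 && Math.Abs(chunk.Y - currentY) > tolerance)
            {
                lines.Add(string.Concat(current.OrderBy(c => c.X).Select(c => c.Text)));
                current.Clear();
            }
            if (current.Count == 0)
            {
                currentY = chunk.Y;
            }
            current.Add(chunk);
        }

        if (current.Count > 0)
        {
            lines.Add(string.Concat(current.OrderBy(c => c.X).Select(c => c.Text)));
        }
        return lines;
    }
}
```

If the four "Hello World" chunks all share one baseline, passing them in any order gives back a single line reading "Hello World".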
The code below is a very simple implementation. For it I'm subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class for storing these chunks and rectangles:
//Helper class that stores our rectangle and text
public class RectAndText {
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text) {
this.Rect = rect;
this.Text = text;
}
}
And here's the subclass:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
And finally an implementation of the above:
//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
doc.Add(new Paragraph("This is my sample file"));
doc.Close();
}
}
}
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}
//Loop through each chunk found
foreach (var p in t.myPoints) {
Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}
I can't stress enough that the above does not take "words" into account; that will be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.
EDIT
(I had a great lunch so I'm feeling a little more helpful.)
Here's an updated version of MyLocationTextExtractionStrategy that does what my comments below say, namely it takes a string to search for and searches each chunk for that string. For all the reasons listed this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk it will also only return the first instance. Ligatures and diacritics could also mess with this.
//Needs "using System.Linq;" for Skip(), Take(), First() and Last()
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//The string that we're searching for
public String TextToSearchFor { get; set; }
//How to compare strings
public System.Globalization.CompareOptions CompareOptions { get; set; }
public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
this.TextToSearchFor = textToSearchFor;
this.CompareOptions = compareOptions;
}
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//See if the current chunk contains the text
var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
//If not found bail
if (startPosition < 0) {
return;
}
//Grab the individual characters
var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
//Grab the first and last character
var firstChar = chars.First();
var lastChar = chars.Last();
//Get the bounding box for the chunk of text
var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
var topRight = lastChar.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
}
}
You would use this the same as before but now the constructor has a single required parameter:
var t = new MyLocationTextExtractionStrategy("sample");
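The version above only reports the first match inside a chunk. If you need every match, one possible approach is to collect all start positions first and then build one rectangle per position. The helper below is only a sketch: SubstringSearch and AllIndexesOf are names made up for this example, it looks for non-overlapping matches only, and it assumes the matched text has the same length as the search string (which culture-sensitive comparison does not always guarantee).

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class SubstringSearch
{
    // Returns the start index of every non-overlapping occurrence of 'value'
    // in 'source', using the same culture-aware comparison as the strategy above.
    public static List<int> AllIndexesOf(string source, string value,
        CompareOptions options = CompareOptions.None)
    {
        var result = new List<int>();
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(value))
        {
            return result;
        }

        var compareInfo = CultureInfo.CurrentCulture.CompareInfo;
        int start = 0;
        while (start <= source.Length - value.Length)
        {
            int index = compareInfo.IndexOf(source, value, start, options);
            if (index < 0)
            {
                break;
            }
            result.Add(index);
            //Continue searching after the end of this match
            start = index + value.Length;
        }
        return result;
    }
}
```

Inside RenderText you could then loop over the returned indexes and, for each one, run the same Skip(index).Take(length) logic on GetCharacterRenderInfos() that the strategy uses for the single match.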
It's an old question, but I'm leaving my answer here because I could not find a correct one on the web.
As Chris Haas explained, dealing with words is not easy because iText deals with chunks. The code that Chris posted failed in most of my tests because a word is normally split across different chunks (he warns about that in his post).
To solve that problem, this is the strategy I used:
Split chunks into characters (actually one TextRenderInfo object per character)
Group the characters by line. This is not straightforward, as you have to deal with chunk alignment.
Search each line for the word you need to find
Here is the code. I tested it with several documents and it works pretty well, but it could fail in some scenarios because this chunk -> words transformation is a bit tricky.
Hope it helps someone.
class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
private String m_SearchText;
public const float PDF_PX_TO_MM = 0.3528f;
public float m_PageSizeY;
public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
: base()
{
this.m_SearchText = sSearchText;
this.m_PageSizeY = fPageSizeY;
}
private void searchText()
{
foreach (LineInfo aLineInfo in m_LinesTextInfo)
{
int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
if (iIndex != -1)
{
TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
this.m_SearchResultsList.Add(aSearchResult);
}
}
}
private void groupChunksbyLine()
{
LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
LocationTextExtractionStrategyEx.LineInfo textInfo = null;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
{
if (textChunk1 == null)
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
else if (textChunk2.sameLine(textChunk1))
{
textInfo.appendText(textChunk2);
}
else
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
textChunk1 = textChunk2;
}
}
public override string GetResultantText()
{
groupChunksbyLine();
searchText();
//In this case the return value is not useful
return "";
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment baseline = renderInfo.GetBaseline();
//Create ExtendedChunk
ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
this.m_DocChunks.Add(aExtendedChunk);
}
public class ExtendedTextChunk
{
public string m_text;
private Vector m_startLocation;
private Vector m_endLocation;
private Vector m_orientationVector;
private int m_orientationMagnitude;
private int m_distPerpendicular;
private float m_charSpaceWidth;
public List<TextRenderInfo> m_ChunkChars;
public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
{
this.m_text = txt;
this.m_startLocation = startLoc;
this.m_endLocation = endLoc;
this.m_charSpaceWidth = charSpaceWidth;
this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
this.m_ChunkChars = chunkChars;
}
public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
{
return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
}
}
public class SearchResult
{
public int iPosX;
public int iPosY;
public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
{
//Get position of upperLeft coordinate
Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
//PosX
float fPosX = vTopLeft[Vector.I1];
//PosY
float fPosY = vTopLeft[Vector.I2];
//Transform to mm and get y from top of page
iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
}
}
public class LineInfo
{
public string m_Text;
public List<TextRenderInfo> m_LineCharsList;
public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
{
this.m_Text = initialTextChunk.m_text;
this.m_LineCharsList = initialTextChunk.m_ChunkChars;
}
public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
{
m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
this.m_Text += additionalTextChunk.m_text;
}
}
}
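The SearchResult class above converts PDF units to millimetres and measures y from the top of the page. As a small worked example of that arithmetic, here is a sketch that uses the exact factor instead of the rounded 0.3528. PdfUnits and its members are hypothetical names introduced only for this illustration.

```csharp
using System;

public static class PdfUnits
{
    // 1 point = 1/72 inch and 1 inch = 25.4 mm, so 1 point is about 0.3528 mm.
    public const float MillimetersPerPoint = 25.4f / 72f;

    public static float PointsToMillimeters(float points)
    {
        return points * MillimetersPerPoint;
    }

    // PDF y coordinates grow upward from the bottom of the page, so the
    // distance from the top edge is the page height minus y.
    public static float FromTopInMillimeters(float yInPoints, float pageHeightInPoints)
    {
        return PointsToMillimeters(pageHeightInPoints - yInPoints);
    }
}
```

For example, a point 72 units below the top edge of a page that is 842 units tall is 25.4 mm from the top. The page height passed to the strategy's constructor would typically come from something like reader.GetPageSize(page).Height, assuming the page is not rotated.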
I know this is a really old question, but below is what I ended up doing. Just posting it here hoping that it will be useful for someone else.
The following code will tell you the starting coordinates of the line(s) that contains a search text. It should not be hard to modify it to give positions of words.
Note: I tested this on iTextSharp 5.5.11.0; it won't work on some older versions.
As mentioned above, PDFs have no concept of words, lines or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of splitting lines and words, so my solution is based on that.
DISCLAIMER:
This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs and that file has a comment saying that it's a dev preview, so this might not work in the future.
Anyway here's the code.
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
namespace Logic
{
public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
{
private readonly List<TextChunk> locationalResult = new List<TextChunk>();
private readonly ITextChunkLocationStrategy tclStrat;
public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp())
{
}
/**
* Creates a new text extraction renderer, with a custom strategy for
* creating new TextChunkLocation objects based on the input of the
* TextRenderInfo.
* @param strat the custom strategy
*/
public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
{
tclStrat = strat;
}
private bool StartsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[0] == ' ';
}
private bool EndsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[str.Length - 1] == ' ';
}
/**
* Filters the provided list with the provided filter
* @param textChunks a list of all TextChunks that this strategy found during processing
* @param filter the filter to apply. If null, filtering will be skipped.
* @return the filtered list
* @since 5.3.3
*/
private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
{
if (filter == null)
{
return textChunks;
}
var filtered = new List<TextChunk>();
foreach (var textChunk in textChunks)
{
if (filter.Accept(textChunk))
{
filtered.Add(textChunk);
}
}
return filtered;
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment segment = renderInfo.GetBaseline();
if (renderInfo.GetRise() != 0)
{ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to
Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
segment = segment.TransformBy(riseOffsetTransform);
}
TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
locationalResult.Add(tc);
}
public IList<TextLocation> GetLocations()
{
var filteredTextChunks = filterTextChunks(locationalResult, null);
filteredTextChunks.Sort();
TextChunk lastChunk = null;
var textLocations = new List<TextLocation>();
foreach (var chunk in filteredTextChunks)
{
if (lastChunk == null)
{
//initial
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
else
{
if (chunk.SameLine(lastChunk))
{
var text = "";
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
text += ' ';
text += chunk.Text;
textLocations[textLocations.Count - 1].Text += text;
}
else
{
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
}
lastChunk = chunk;
}
//now find the location(s) with the given texts
return textLocations;
}
}
public class TextLocation
{
public float X { get; set; }
public float Y { get; set; }
public string Text { get; set; }
}
}
How to call the method:
//Declared outside the using block so it is still in scope for the query below
IList<TextLocation> res;
using (var reader = new PdfReader(inputPdf))
{
var parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
res = strategy.GetLocations();
}
//Needs "using System.Linq;"
var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();
inputPdf is a byte[] that has the pdf data
pageNumber is the page where you want to search in
Here is how you use LocationTextExtractionStrategy in VB.NET.
Class definition:
Class TextExtractor
Inherits LocationTextExtractionStrategy
Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy
Public oPoints As IList(Of RectAndText) = New List(Of RectAndText)
Public Overrides Sub RenderText(renderInfo As TextRenderInfo) 'Implements IRenderListener.RenderText
MyBase.RenderText(renderInfo)
Dim bottomLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
Dim topRight As Vector = renderInfo.GetAscentLine().GetEndPoint() 'GetBaseline
Dim rect As Rectangle = New Rectangle(bottomLeft(Vector.I1), bottomLeft(Vector.I2), topRight(Vector.I1), topRight(Vector.I2))
oPoints.Add(New RectAndText(rect, renderInfo.GetText()))
End Sub
Private Function GetLines() As Dictionary(Of Single, ArrayList)
Dim oLines As New Dictionary(Of Single, ArrayList)
For Each p As RectAndText In oPoints
Dim iBottom = p.Rect.Bottom
If oLines.ContainsKey(iBottom) = False Then
oLines(iBottom) = New ArrayList()
End If
oLines(iBottom).Add(p)
Next
Return oLines
End Function
Public Function Find(ByVal sFind As String) As iTextSharp.text.Rectangle
Dim oLines As Dictionary(Of Single, ArrayList) = GetLines()
For Each oEntry As KeyValuePair(Of Single, ArrayList) In oLines
'Dim iBottom As Integer = oEntry.Key
Dim oRectAndTexts As ArrayList = oEntry.Value
Dim sLine As String = ""
For Each p As RectAndText In oRectAndTexts
sLine += p.Text
If sLine.IndexOf(sFind) <> -1 Then
Return p.Rect
End If
Next
Next
Return Nothing
End Function
End Class
Public Class RectAndText
Public Rect As iTextSharp.text.Rectangle
Public Text As String
Public Sub New(ByVal rect As iTextSharp.text.Rectangle, ByVal text As String)
Me.Rect = rect
Me.Text = text
End Sub
End Class
Usage (inserts a signature box to the right of the found text):
Sub EncryptPdf(ByVal sInFilePath As String, ByVal sOutFilePath As String)
Dim oPdfReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sInFilePath)
Dim oPdfDoc As New iTextSharp.text.Document()
Dim oPdfWriter As PdfWriter = PdfWriter.GetInstance(oPdfDoc, New FileStream(sOutFilePath, FileMode.Create))
'oPdfWriter.SetEncryption(PdfWriter.STRENGTH40BITS, sPassword, sPassword, PdfWriter.AllowCopy)
oPdfDoc.Open()
oPdfDoc.SetPageSize(iTextSharp.text.PageSize.LEDGER.Rotate())
Dim oDirectContent As iTextSharp.text.pdf.PdfContentByte = oPdfWriter.DirectContent
Dim iNumberOfPages As Integer = oPdfReader.NumberOfPages
Dim iPage As Integer = 0
Dim iBottomMargin As Integer = txtBottomMargin.Text '10
Dim iLeftMargin As Integer = txtLeftMargin.Text '500
Dim iWidth As Integer = txtWidth.Text '120
Dim iHeight As Integer = txtHeight.Text '780
Dim oStrategy As New parser.SimpleTextExtractionStrategy()
Do While (iPage < iNumberOfPages)
iPage += 1
oPdfDoc.SetPageSize(oPdfReader.GetPageSizeWithRotation(iPage))
oPdfDoc.NewPage()
Dim oPdfImportedPage As iTextSharp.text.pdf.PdfImportedPage =
oPdfWriter.GetImportedPage(oPdfReader, iPage)
Dim iRotation As Integer = oPdfReader.GetPageRotation(iPage)
If (iRotation = 90) Or (iRotation = 270) Then
oDirectContent.AddTemplate(oPdfImportedPage, 0, -1.0F, 1.0F,
0, 0, oPdfReader.GetPageSizeWithRotation(iPage).Height)
Else
oDirectContent.AddTemplate(oPdfImportedPage, 1.0F, 0, 0, 1.0F, 0, 0)
End If
'Dim sPageText As String = parser.PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oStrategy)
'sPageText = System.Text.Encoding.UTF8.GetString(System.Text.ASCIIEncoding.Convert(System.Text.Encoding.Default, System.Text.Encoding.UTF8, System.Text.Encoding.Default.GetBytes(sPageText)))
'If txtFind.Text = "" OrElse sPageText.IndexOf(txtFind.Text) <> -1 Then
Dim oTextExtractor As New TextExtractor()
PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oTextExtractor) 'Initialize oTextExtractor
Dim oRect As iTextSharp.text.Rectangle = oTextExtractor.Find(txtFind.Text)
If oRect IsNot Nothing Then
Dim iX As Integer = oRect.Left + oRect.Width + iLeftMargin 'Move right
Dim iY As Integer = oRect.Bottom - iBottomMargin 'Move down
Dim field As PdfFormField = PdfFormField.CreateSignature(oPdfWriter)
field.SetWidget(New Rectangle(iX, iY, iX + iWidth, iY + iHeight), PdfAnnotation.HIGHLIGHT_OUTLINE)
field.FieldName = "myEmptySignatureField" & iPage
oPdfWriter.AddAnnotation(field)
End If
Loop
oPdfDoc.Close()
End Sub
I'm using ColumnDocumentRenderer to draw content in two columns.
Here is the code:
public void ManipulatePdf(string dest)
{
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest));
Document doc = new Document(pdfDoc, PageSize.A4);
doc.SetMargins(55f, 55f, 45f, 55f);
var interval = 20f;
var columnWidth = (doc.GetPdfDocument().GetDefaultPageSize().GetWidth() - 110 - interval) / 2;
var pageHeight = doc.GetPdfDocument().GetDefaultPageSize().GetHeight() - 110;
var baseText = "We have seen too many reports, too many words, too many good intentions, too many families torn apart, and too many excruciatingly painful deaths to see yet more delays in taking collective action.";
var pTitle = new Paragraph(baseText);
pTitle.SetFontSize(20);
pTitle.SetTextAlignment(iText.Layout.Properties.TextAlignment.JUSTIFIED);
doc.Add(pTitle);
var currentYLine = doc.GetRenderer().GetCurrentArea().GetBBox().GetTop();
Rectangle[] columns = {
new Rectangle(55, 55,
columnWidth,
currentYLine - 55),
new Rectangle(55 + columnWidth + interval, 55,
columnWidth,
currentYLine - 55) };
doc.SetRenderer(new ColumnDocumentRenderer(doc, columns));
for (int i = 1; i <= 12; i++)
{
var text = baseText;
for (int j = 1; j <= i; j++)
{
text += " Additional Text. ";
}
var p = new Paragraph(text);
p.SetTextAlignment(iText.Layout.Properties.TextAlignment.JUSTIFIED);
doc.Add(p);
if (i == 6)
{
var tp = new Paragraph("Introduction");
tp.SetFontSize(20);
tp.SetMarginTop(50);
tp.SetTextAlignment(iText.Layout.Properties.TextAlignment.JUSTIFIED);
doc.Add(tp);
}
}
doc.Close();
}
Below is the created PDF:
Please look at the red line in this screenshot. My question is: how can I make the bottoms of the two columns line up on a straight line?
Thanks & Regards.
As discussed in the comments to the original question, an algorithm to auto-align the bottom lines does not exist in iText, and defining its behavior turned out to be tricky even for the person asking the question.
However, it is possible to calculate the difference between the vertical positions of the bottom lines when the document is being rendered and then you can correct the content and create another document with proper layout.
Here is the example implementation which analyzes the positions of the content when it's being drawn on the document:
class CustomColumnDocumentRenderer : ColumnDocumentRenderer {
public CustomColumnDocumentRenderer(Document document, Rectangle[] columns) : base(document, columns) {
}
public CustomColumnDocumentRenderer(Document document, bool immediateFlush, Rectangle[] columns) : base(document, immediateFlush, columns) {
}
IDictionary<int, float> leftColumnBottom = new Dictionary<int, float>();
IDictionary<int, float> rightColumnBottom = new Dictionary<int, float>();
protected override void FlushSingleRenderer(IRenderer resultRenderer) {
TraverseRecursively(resultRenderer, leftColumnBottom, rightColumnBottom);
base.FlushSingleRenderer(resultRenderer);
}
void TraverseRecursively(IRenderer child, IDictionary<int, float> leftColumnBottom, IDictionary<int, float> rightColumnBottom) {
if (child is LineRenderer) {
int page = child.GetOccupiedArea().GetPageNumber();
if (!leftColumnBottom.ContainsKey(page)) {
leftColumnBottom[page] = 1000;
}
if (!rightColumnBottom.ContainsKey(page)) {
rightColumnBottom[page] = 1000;
}
bool isLeftColumn = !(child.GetOccupiedArea().GetBBox().GetX() > PageSize.A4.GetWidth() / 2);
if (isLeftColumn) {
leftColumnBottom[page] =
Math.Min(leftColumnBottom[page], child.GetOccupiedArea().GetBBox().GetBottom());
} else {
rightColumnBottom[page] = Math.Min(rightColumnBottom[page],
child.GetOccupiedArea().GetBBox().GetBottom());
}
} else {
foreach (IRenderer ownChild in (child is ParagraphRenderer ? (((ParagraphRenderer)child).GetLines().Cast<IRenderer>()) : child.GetChildRenderers())) {
TraverseRecursively(ownChild, leftColumnBottom, rightColumnBottom);
}
}
}
public List<float> getDiffs() {
List<float> ans = new List<float>();
foreach (int pageNum in leftColumnBottom.Keys) {
ans.Add(leftColumnBottom[pageNum] - rightColumnBottom[pageNum]);
}
return ans;
}
}
To use it, make sure to pass the customized document renderer to the Document instance:
CustomColumnDocumentRenderer renderer = new CustomColumnDocumentRenderer(doc, true, columns);
doc.SetRenderer(renderer);
Then you can get page-by-page diffs:
Console.WriteLine(renderer.getDiffs()[0]);
Following this actual solution, I am trying to get all the words inside a TextChunk along with the coordinates of each one (page, top, bottom, left, right).
Since a TextChunk could be a phrase, a word or anything else, I tried to do this manually, starting from the previous word's rectangle and cutting it each time. I noticed this manual method is bug-prone (I would need to account for special characters by hand, and so on), so I wondered whether iTextSharp provides an easier way to do this.
My Chunk class and my class derived from LocationTextExtractionStrategy are the following:
public class Chunk
{
public Guid Id { get; set; }
public Rectangle Rect { get; set; }
public TextRenderInfo Render { get; set; }
public BaseFont BF { get; set; }
public string Text { get; set; }
public int FontSize { get; set; }
public Chunk(Rectangle rect, TextRenderInfo renderInfo)
{
this.Rect = rect;
this.Render = renderInfo;
this.Text = Render.GetText();
Initialize();
}
public Chunk(Rectangle rect, TextRenderInfo renderInfo, string text)
{
this.Rect = rect;
this.Render = renderInfo;
this.Text = text;
Initialize();
}
private void Initialize()
{
this.Id = Guid.NewGuid();
this.BF = Render.GetFont();
this.FontSize = ObtainFontSize();
}
private int ObtainFontSize()
{
return Convert.ToInt32(this.Render.GetSingleSpaceWidth() * 12 / this.BF.GetWidthPoint(" ", 12));
}
}
public class LocationTextExtractionPersonalizada : LocationTextExtractionStrategy
{
//Save each coordinate
public List<Chunk> ChunksInPage = new List<Chunk>();
//Automatically called on each chunk on PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//Check for null before calling GetText() on it
if (renderInfo == null
|| string.IsNullOrWhiteSpace(renderInfo.GetText()))
return;
//Get chunk Vectors
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create Rectangle based on previous Vectors
var rect = new Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]);
if (rect == null)
return;
//Add each chunk with its coordinates
ChunksInPage.Add(new Chunk(rect, renderInfo));
}
}
So once I get the file and so on, I proceed this way:
private void ProcessContent()
{
for (int page= 1; page <= pdfReader.NumberOfPages; page++)
{
var strategy = new LocationTextExtractionPersonalizada();
var currentPageText = PdfTextExtractor.GetTextFromPage(
pdfReader,
page,
strategy);
//Here is where I want to get each word with its coordinates
var chunksWords= ChunkRawToWord(strategy.ChunksInPage);
}
}
private List<Chunk> ChunkRawToWord(IList<Chunk> chunks)
{
if (chunks == null || chunks[0] == null)
return null;
var words = new List<Chunk>();
//Poor RegEx pattern to get the word and whatever follows it
string pattern = @"[#&\w+]*(-*\/*\s*\:*\;*\,*\.*\(*\)*\%*\>*\<*)?";
var something = chunks[0].Render.GetCharacterRenderInfos();
for (int i = 0; i < chunks.Count; i++)
{
var wordsInChunk = Regex.Matches(
chunks[i].Text,
pattern,
RegexOptions.IgnoreCase);
var rectangleChunk = new Rectangle(chunks[i].Rect);
for (int j = 0; j < wordsInChunk.Count; j++)
{
if (string.IsNullOrWhiteSpace(wordsInChunk[j].Value))
continue;
var word = new Chunk(
rectangleChunk,
chunks[i].Render,
wordsInChunk[j].ToString());
if (j == 0)
{
word.Rect.Right = word.BF.GetWidthPoint(word.Text, word.FontSize);
words.Add(word);
continue;
}
if (words.Count <= 0)
continue;
word.Rect.Left = words[j - 1].Rect.Right;
word.Rect.Right = words[j - 1].Rect.Right + word.BF.GetWidthPoint(word.Text, word.FontSize);
words.Add(word);
}
}
return words;
}
Afterwards, I commented on Mkl's solution and was told to "use getCharacterRenderInfos()", which I now do, and I get every single character as a list of TextRenderInfo objects.
I'm sorry, but I'm starting to mix up the concepts and the ways of applying that solution, and it is blowing my mind.
I would really appreciate a hand here.
You can use the method TextRenderInfo.GetCharacterRenderInfos() to get a collection of TextRenderInfo objects, one for each char in your chunk. Then you can regroup the individual characters into words and calculate the rectangle that contains each word using the coordinates of the first and last TextRenderInfo in that word.
In your custom text extraction strategy:
//A field cannot be declared with var, so the array type is spelled out
private readonly string[] _separators = { "-", "(", ")", "/", " ", ":", ";", ",", "."};
protected virtual void ParseRenderInfo(TextRenderInfo currentInfo)
{
var resultInfo = new List<TextRenderInfo>();
var chars = currentInfo.GetCharacterRenderInfos();
foreach (var charRenderInfo in chars)
{
resultInfo.Add(charRenderInfo);
var currentChar = charRenderInfo.GetText();
if (_separators.Contains(currentChar))
{
ProcessWord(currentInfo, resultInfo);
resultInfo.Clear();
}
}
ProcessWord(currentInfo, resultInfo);
}
private void ProcessWord(TextRenderInfo charChunk, List<TextRenderInfo> wordChunks)
{
var firstRender = wordChunks.FirstOrDefault();
var lastRender = wordChunks.LastOrDefault();
if (firstRender == null || lastRender == null)
{
return;
}
var startCoords = firstRender.GetDescentLine().GetStartPoint();
var endCoords = lastRender.GetAscentLine().GetEndPoint();
var wordText = string.Join("", wordChunks.Select(x => x.GetText()));
var wordLocation = new LocationTextExtractionStrategy.TextChunkLocationDefaultImp(startCoords, endCoords, charChunk.GetSingleSpaceWidth());
_chunks.Add(new CustomTextChunk(wordText, wordLocation));
}
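To see the grouping logic in isolation, here is a sketch that runs without iTextSharp. CharBox, WordBox and WordGrouping are hypothetical types made up for this illustration; CharBox stands in for the per-character TextRenderInfo and only carries the character and its bounding box. Unlike the snippet above, this version drops the separator characters instead of keeping them at the end of each word, and it assumes the characters arrive in reading order on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one character's render info
public sealed class CharBox
{
    public char Value; public float Left, Bottom, Right, Top;
    public CharBox(char value, float left, float bottom, float right, float top)
    {
        Value = value; Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

// A word together with the rectangle that encloses it
public sealed class WordBox
{
    public string Text; public float Left, Bottom, Right, Top;
    public WordBox(string text, float left, float bottom, float right, float top)
    {
        Text = text; Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

public static class WordGrouping
{
    // Walks the characters in order and closes the current word at each separator
    public static List<WordBox> ToWords(IEnumerable<CharBox> chars, ISet<char> separators)
    {
        var words = new List<WordBox>();
        var current = new List<CharBox>();
        foreach (var c in chars)
        {
            if (separators.Contains(c.Value))
            {
                Flush(current, words);
                continue;
            }
            current.Add(c);
        }
        Flush(current, words);
        return words;
    }

    // Builds the word rectangle from the first and last character boxes
    private static void Flush(List<CharBox> current, List<WordBox> words)
    {
        if (current.Count == 0)
        {
            return;
        }
        words.Add(new WordBox(
            new string(current.Select(c => c.Value).ToArray()),
            current.First().Left,
            current.Min(c => c.Bottom),
            current.Last().Right,
            current.Max(c => c.Top)));
        current.Clear();
    }
}
```

With real data, each CharBox would be filled from the descent line start point and ascent line end point of one entry of GetCharacterRenderInfos().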
I am displaying parts of a word by using:
public string GetPartialWord(string word)
{
if (string.IsNullOrEmpty(word))
{
return string.Empty;
}
char[] partialWord = word.ToCharArray();
int numberOfCharsToHide = word.Length / 2;
Random randomNumberGenerator = new Random();
HashSet<int> maskedIndices = new HashSet<int>();
for (int i = 0; i < numberOfCharsToHide; i++)
{
int rIndex = randomNumberGenerator.Next(0, word.Length);
while (!maskedIndices.Add(rIndex))
{
rIndex = randomNumberGenerator.Next(0, word.Length);
}
partialWord[rIndex] = '_';
}
return new string(partialWord);
}
Therefore, the word "game" might look like: _a_e
I am thinking of adding a hint button that reveals another character. Any ideas on how to proceed?
G_m_ -> Hint -> G_me
You can use something like this:
public string GetWordAfterHint(string wordToProcess, string originalWord)
{
List<int> emptyIndexes = new List<int>();
for (int a = 0; a < wordToProcess.Length; a++)
{
if (wordToProcess[a] == '_')
{
emptyIndexes.Add(a);
}
}
// in case if word doesn't have empty positions
if (emptyIndexes.Count == 0)
{
return wordToProcess;
}
Random random = new Random();
var indexForLetter = random.Next(emptyIndexes.Count);
// copy the string into a StringBuilder, because strings are immutable and a single character cannot be changed in place
StringBuilder sb = new StringBuilder(wordToProcess);
// put the letter from originalWord into the randomly chosen empty position
sb[emptyIndexes[indexForLetter]] = originalWord[emptyIndexes[indexForLetter]];
// convert the StringBuilder back to a string and return it
return sb.ToString();
}
The method returns the word after the hint: if you pass "_a_e" as wordToProcess and "game" as originalWord, it returns either ga_e or _ame.
Store the indexes of the characters displayed in an array or list. When the hint button is pressed, compute a random index. Compare that index with the indexes of letters already being displayed, and recalculate a new random index if necessary.
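A minimal sketch of that index-tracking idea, assuming the game keeps one tracker object per word. The class name HintTracker and its members are hypothetical and do not come from the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: remembers which positions of the word are visible,
// so a hint never picks a position that is already displayed.
public class HintTracker
{
    private readonly string _word;
    private readonly HashSet<int> _shown = new HashSet<int>();
    private readonly Random _random;

    public HintTracker(string word, IEnumerable<int> initiallyShown, Random random)
    {
        _word = word;
        _random = random;
        foreach (int i in initiallyShown)
        {
            // Ignore indexes that fall outside the word.
            if (i >= 0 && i < word.Length) _shown.Add(i);
        }
    }

    // Reveals one more random hidden character.
    // Returns false when every character is already displayed.
    public bool RevealOne()
    {
        if (_shown.Count >= _word.Length) return false;
        int index = _random.Next(_word.Length);
        // Recalculate until the index is not one that is already displayed.
        while (!_shown.Add(index))
        {
            index = _random.Next(_word.Length);
        }
        return true;
    }

    // Builds the display string: shown letters as-is, hidden ones as '_'.
    public string Display()
    {
        var sb = new StringBuilder(_word.Length);
        for (int i = 0; i < _word.Length; i++)
        {
            sb.Append(_shown.Contains(i) ? _word[i] : '_');
        }
        return sb.ToString();
    }
}
```

For example, new HintTracker("game", new[] { 1, 3 }, new Random()) starts out displaying _a_e, and each call to RevealOne() uncovers one of the two remaining letters.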
An Object Oriented approach:
From your GetPartialWord method, return an instance of this class instead of a plain string:
public class GameWord
{
public string OriginalWord { get; set; }
public string GuessWord { get; set; }
public string Hint()
{
int index = this.GuessWord.IndexOf('_');
if (index != -1)
{
var builder = new StringBuilder(this.GuessWord);
builder[index] = this.OriginalWord[index];
this.GuessWord = builder.ToString();
return this.GuessWord; // if needed
}
// No more hints, the word has no underscores
return this.GuessWord;
}
}
So in your method you will do this instead:
public GameWord GetPartialWord(string word)
{
// The rest of your code
// Change this line return new string(partialWord);
// to this
return new GameWord{ OriginalWord = word, GuessWord = new string(partialWord)};
}
And in your form, create a private field like this:
private GameWord currentGameWord;
When you have the random word, call your GetPartialWord method and store the returned word in currentGameWord:
this.currentGameWord = GetPartialWord(someWord);
And because the method now returns an object, bind your textbox like this:
this.textBox1.Text = this.currentGameWord.GuessWord;
And in your button's click handler do this (your handler will have a different name):
private void HintButton_Click(object sender, EventArgs e)
{
this.textBox1.Text = this.currentGameWord.Hint();
}
Does anyone know how to use the Visio InsertListMember method (below) in C#?
https://msdn.microsoft.com/en-us/library/office/ff768115.aspx
I have tried to call the method with the code below, but it gives a "Run Time Error- 424 object required" error.
I have also used the DropIntoList method, and it works fine, but for a specific purpose (determining the height of the list) I need to use the InsertListMember method.
static void Main(string[] args)
{
//create the object that will do the drawing
visioDrawing.VisioDrawer Drawer = new visioDrawing.VisioDrawer();
Drawer.setUpVisio();
Visio.Shape testShape;
Visio.Shape testShape1;
testShape = Drawer.DropShape("abc", "lvl1Box");
testShape1 = Drawer.DropShape("ccc", "Capability");
Drawer.insertListMember(testShape, testShape1, 1);
}
public void insertListMember(Visio.Shape outerlist, Visio.Shape innerShape, int position)
{
ActiveDoc.ExecuteLine(outerlist + ".ContainerProperties.InsertListMember" + innerShape + "," + position);
}
To obtain the shape:
public Visio.Shape DropShape(string rectName, string masterShape)
{
//get the shape to drop from the masters collection
Visio.Master shapetodrop = GetMaster(stencilPath, masterShape);
// drop a shape on the page
Visio.Shape DropShape = acPage.Drop(shapetodrop, 1, 1);
//put name in the shape
Visio.Shape selShape = selectShp(DropShape.ID);
selShape.Text = rectName;
return DropShape;
}
private Visio.Master GetMaster(string stencilName, string mastername)
{
// open the page holding the masters collection so we can use it
MasterDoc = MastersDocuments.OpenEx(stencilName, (short)Visio.VisOpenSaveArgs.visOpenDocked);
// now get a masters collection to use
Masters = MasterDoc.Masters;
return Masters.get_ItemU(mastername);
}
From your code, 'Drawer' looks to be some kind of Visio app wrapper, but essentially InsertListMember lets you add a shape that already exists on the page to a list. Here's an example of that method, plus the alternative Page.DropIntoList if you just want to drop directly from the stencil:
void Main()
{
// 'GetRunningVisio' as per
// http://visualsignals.typepad.co.uk/vislog/2015/12/getting-started-with-c-in-linqpad-with-visio.html
// but all you need is a reference to the app
var vApp = MyExtensions.GetRunningVisio();
var vDoc = vApp.Documents.Add("wfdgm_m.vstx");
var vPag = vDoc.Pages[1];
var vCtrlsStencil = vApp.Documents["WFCTRL_M.VSSX"];
var vListMst = vCtrlsStencil?.Masters["List box"];
if (vListMst != null)
{
var vListShp = vPag.Drop(vListMst, 2, 6);
var vListItemMst = vCtrlsStencil.Masters["List box item"];
var insertPosition = vListShp.ContainerProperties.GetListMembers().Length - 1;
//Use InsertListMember method
var firstListItem = vPag.Drop(vListItemMst, 4, 6);
vListShp.ContainerProperties.InsertListMember(firstListItem, insertPosition);
firstListItem.CellsU["FillForegnd"].FormulaU = "3"; //Green
//or use DropIntoList method on Page instead
var secondListItem = vPag.DropIntoList(vListItemMst, vListShp, insertPosition);
secondListItem.CellsU["FillForegnd"].FormulaU = "2"; //Red
}
}
This uses the Wireframe Diagram template (in Visio Professional) and should add two new items to the list box, the first filled green and the second filled red.
In case anyone is wondering, I ended up fixing the method as shown below. However, I believe this method is not required if you are using the Visio interop assemblies v15 (I am using v14).
public void insertListMember(int outerShpID, int innerShpID, int position)
{
acWindow.DeselectAll();
Visio.Page page = acWindow.Page;
acWindow.Select(page.Shapes.get_ItemFromID(innerShpID), (short)Microsoft.Office.Interop.Visio.VisSelectArgs.visSelect);
Debug.WriteLine("Application.ActivePage.Shapes.ItemFromID(" + outerShpID + ").ContainerProperties.InsertListMember ActiveWindow.Selection," + position);
ActiveDoc.ExecuteLine("Application.ActivePage.Shapes.ItemFromID(" + outerShpID + ").ContainerProperties.InsertListMember ActiveWindow.Selection," + position);
}
I am trying to write a program that reads a text file, sorts it by character, and keeps track of how many times each character appears in the document. This is what I have so far.
class Program
{
static void Main(string[] args)
{
CharFrequency[] Charfreq = new CharFrequency[128];
try
{
string line;
System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\User\Documents\Visual Studio 2013\Projects\Array_Project\wap.txt");
while ((line = file.ReadLine()) != null)
{
int ch = file.Read();
if (Charfreq.Contains(ch))
{
}
}
file.Close();
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("The process failed: {0}", e.ToString());
}
}
}
My question is, what should go in the if statement here?
I also have a CharFrequency class, which I'll include here in case it is helpful (and yes, I am required to use an array rather than a List or ArrayList).
public class CharFrequency
{
private char m_character;
private long m_count;
public CharFrequency(char ch)
{
Character = ch;
Count = 0;
}
public CharFrequency(char ch, long charCount)
{
Character = ch;
Count = charCount;
}
public char Character
{
set
{
m_character = value;
}
get
{
return m_character;
}
}
public long Count
{
get
{
return m_count;
}
set
{
if (value < 0)
value = 0;
m_count = value;
}
}
public void Increment()
{
m_count++;
}
public override bool Equals(object obj)
{
bool equal = false;
CharFrequency cf = new CharFrequency('\0', 0);
cf = (CharFrequency)obj;
if (this.Character == cf.Character)
equal = true;
return equal;
}
public override int GetHashCode()
{
return m_character.GetHashCode();
}
public override string ToString()
{
String s = String.Format("'{0}' ({1}) = {2}", m_character, (byte)m_character, m_count);
return s;
}
}
Have a look at this post.
https://codereview.stackexchange.com/questions/63872/counting-the-number-of-character-occurrences
It uses LINQ to achieve your goal.
You shouldn't use Contains here.
First, you need to initialize your Charfreq array before the try block:
CharFrequency[] Charfreq = new CharFrequency[128];
for (int i = 0; i < Charfreq.Length; i++)
{
Charfreq[i] = new CharFrequency((char)i);
}
try
Then, inside the try block, you can read the file one character at a time:
int ch;
// -1 means that there are no more characters to read,
// otherwise ch is the char read
while ((ch = file.Read()) != -1)
{
CharFrequency cf = new CharFrequency((char)ch);
// This works because CharFrequency overrides the
// Equals method, and Equals compares only
// the Character property of CharFrequency
int ix = Array.IndexOf(Charfreq, cf);
// if a matching CharFrequency was found
if (ix != -1)
{
Charfreq[ix].Increment();
}
}
Note that this isn't the way I would write the program. These are the minimum changes needed to make your program work.
As a side note, this program only counts the frequency of ASCII characters (characters with code <= 127).
Also, in your Equals method, this initialization is useless:
CharFrequency cf = new CharFrequency('\0', 0);
cf = (CharFrequency)obj;
This is enough:
CharFrequency cf = (CharFrequency)obj;
Otherwise you are creating a CharFrequency just to discard it on the next line.
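Because the array is filled so that index i holds the entry for character i, the Array.IndexOf search could also be replaced by direct indexing. Here is a minimal sketch of that idea. It assumes a plain long[] instead of the CharFrequency class so that it stays self-contained, and the name AsciiCounter is hypothetical:

```csharp
using System;
using System.IO;

public static class AsciiCounter
{
    // Counts characters with codes 0..127 read from any TextReader.
    // counts[c] holds the number of times (char)c was seen.
    public static long[] Count(TextReader reader)
    {
        long[] counts = new long[128];
        int ch;
        // Read() returns -1 when there are no more characters.
        while ((ch = reader.Read()) != -1)
        {
            // Skip anything outside the ASCII range.
            if (ch < counts.Length)
            {
                counts[ch]++;
            }
        }
        return counts;
    }
}
```

With the CharFrequency array from the question, the equivalent of counts[ch]++ would be Charfreq[ch].Increment(), guarded by the same range check.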
A dictionary is well suited for a task like this. You didn't say which character set and encoding the file uses. Because Unicode is so common, let's assume the Unicode character set and UTF-8 encoding. (After all, it is the default for .NET, Java, JavaScript, HTML, XML, and so on.) If that's not the case, read the file using the applicable encoding and fix your code, because your StreamReader currently uses UTF-8.
Next comes iterating over the "characters" and incrementing the count for each "character" in the dictionary as it is seen in the text.
Unicode does have a few complex features. One is combining characters, where a base character can be overlaid with diacritics and similar marks. Users view such combinations as one "character", or, as Unicode calls them, graphemes. Thankfully, .NET gives us the StringInfo class, which iterates over them as "text elements".
So, if you think about it, using an array would be quite difficult. You'd have to build your own dictionary on top of your array.
The example below uses a Dictionary and is runnable using a LINQPad script. After it creates the dictionary, it orders and dumps it with a nice display.
var path = Path.GetTempFileName();
// Get some text we know is encoded in UTF-8 to simplify the code below
// and contains combining codepoints as a matter of example.
using (var web = new WebClient())
{
web.DownloadFile("http://superuser.com/questions/52671/which-unicode-characters-do-smilies-like-%D9%A9-%CC%AE%CC%AE%CC%83-%CC%83%DB%B6-consist-of", path);
}
// since the question asks to analyze a file
var content = File.ReadAllText(path, Encoding.UTF8);
var frequency = new Dictionary<String, int>();
var itor = System.Globalization.StringInfo.GetTextElementEnumerator(content);
while (itor.MoveNext())
{
var element = (String)itor.Current;
if (!frequency.ContainsKey(element))
{
frequency.Add(element, 0);
}
frequency[element]++;
}
var histogram = frequency
.OrderByDescending(f => f.Value)
// jazz it up with the list of codepoints in each text element
.Select(pair =>
{
var bytes = Encoding.UTF32.GetBytes(pair.Key);
var codepoints = new UInt32[bytes.Length/4];
Buffer.BlockCopy(bytes, 0, codepoints, 0, bytes.Length);
return new {
Count = pair.Value,
textElement = pair.Key,
codepoints = codepoints.Select(cp => String.Format("U+{0:X4}", cp) ) };
});
histogram.Dump(); // For use in LINQPad
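For trying the same idea without a download or LINQPad, here is a smaller sketch that counts text elements in an in-memory string. The name TextElementCounter is hypothetical; StringInfo.GetTextElementEnumerator is the same .NET call used above:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementCounter
{
    // Counts how often each text element (a base character plus any
    // combining marks that follow it) occurs in the given string.
    public static Dictionary<string, int> Count(string content)
    {
        var frequency = new Dictionary<string, int>();
        TextElementEnumerator itor = StringInfo.GetTextElementEnumerator(content);
        while (itor.MoveNext())
        {
            string element = itor.GetTextElement();
            // TryGetValue leaves count at 0 when the element is new.
            frequency.TryGetValue(element, out int count);
            frequency[element] = count + 1;
        }
        return frequency;
    }
}
```

Calling TextElementCounter.Count("e\u0301e\u0301a") treats each "e" followed by the combining acute accent U+0301 as a single text element, so the dictionary has two keys rather than three.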