I have a method that repeatedly reads text from different pages of a PDF document, each time from within a given rectangle. The larger the file, the slower the processing. I tried Parallel.ForEach but did not get a substantial speed-up; everything seems to be held up by PdfReader.
The method is something like this:
var lst = new ConcurrentBag<Test3>();
using (var reader = new PdfReader(byteArr))
{
    Parallel.ForEach(areas, t =>
    {
        var pageSize = reader.GetPageSize(t.PageNumber);
        var rectangle = GetRectangle(t.AreaData, pageSize);
        var text = GetTextFromRectangle(reader, rectangle, t.PageNumber);
        lst.Add(text);
    });
}
public string GetTextFromRectangle(PdfReader reader, Rectangle rect, int pageNum)
{
    RenderFilter[] filter = {
        new RegionTextRenderText()
    };
    ITextExtractionStrategy strategy =
        new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filter);
    return PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
}
After you mentioned in a comment that there are "approximately 900 rectangle areas per page" and added your GetTextFromRectangle code, the cause of the problem became clear: for each of your pre-defined rectangles you make iText parse the whole content of the page the rectangle is on into a filtered text extraction strategy that you expect to be focused on the respective rectangle area.
By the way, even worse, I don't see you using the Rectangle rect parameter in your GetTextFromRectangle method, thus after all you actually do not even focus on the respective rectangle!
So you parse each page approximately 900 times, each time throwing away most of the parsed information, instead of parsing it only once and retrieving the text for each of those 900 rectangles from the pre-parsed data.
This is a waste of resources in its purest form!
What you should do instead is:
1. Sort and separate your areas by their respective page.
2. For each page:
   a. Parse the content of that page once (and only once) into an unfiltered LocationTextExtractionStrategy.
   b. For each rectangle on that page, call the GetResultantText(TextChunkFilter) method of the strategy instance with a TextChunkFilter that filters by position (whether the chunk in question is inside the rectangle at hand) to retrieve the area text.
As an aside, in case of iText 7 instead of iText 5 (for .Net, formerly called iTextSharp) that GetResultantText overload with a TextChunkFilter is missing but you can emulate it, cf. this answer.
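The parse-once approach described above can be sketched independently of iText. This is a minimal illustration, not iText code: Chunk, Area, AreaTextExtractor and the parsePage callback are hypothetical names introduced here, and the sketch assumes C# 9 or later for records. In a real program, parsePage would stand for the single expensive parse of a page, and the Where clause for the position-based chunk filter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the real types; none of these names come from iText.
public record Chunk(string Text, float X, float Y);   // one positioned piece of text
public record Area(int PageNumber, float Left, float Bottom, float Right, float Top);

public static class AreaTextExtractor
{
    // parsePage is assumed to do the expensive work: parse one page into chunks.
    public static Dictionary<Area, string> Extract(
        IEnumerable<Area> areas, Func<int, IReadOnlyList<Chunk>> parsePage)
    {
        var result = new Dictionary<Area, string>();
        foreach (var pageGroup in areas.GroupBy(a => a.PageNumber))
        {
            // Parse each page once, no matter how many areas lie on it.
            IReadOnlyList<Chunk> chunks = parsePage(pageGroup.Key);
            foreach (Area area in pageGroup)
            {
                // Cheap in-memory filter: keep chunks positioned inside the area.
                var inside = chunks.Where(c =>
                    c.X >= area.Left && c.X <= area.Right &&
                    c.Y >= area.Bottom && c.Y <= area.Top);
                result[area] = string.Join(" ", inside.Select(c => c.Text));
            }
        }
        return result;
    }
}
```

With 900 areas on a page, parsePage runs once for that page and the filter runs 900 times over data that is already in memory.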
I'm trying to write an extension method for Aspose's DocumentBuilder class that lets you check whether inserting a number of paragraphs into a document will cause a page break. I hoped this would be rather simple, but it turns out otherwise:
public static bool WillPageBreakAfter(this DocumentBuilder builder, int numParagraphs)
{
    // Get current number of pages
    int pageCountBefore = builder.Document.PageCount;
    for (int i = 0; i < numParagraphs; i++)
    {
        builder.InsertParagraph();
    }
    // Get the number of pages after adding those paragraphs
    int pageCountAfter = builder.Document.PageCount;
    // Delete the paragraphs, we don't need them anymore
    ...
    if (pageCountBefore != pageCountAfter)
    {
        return true;
    }
    else
    {
        return false;
    }
}
My problem is that inserting paragraphs does not seem to update the builder.Document.PageCount property. Even plugging in something crazy like 5000 paragraphs does not seem to modify that property. I've also tried InsertBreak() (including using BreakType.PageBreak) and Writeln(), but those don't work either.
What's going on here? Is there any way I can achieve the desired result?
UPDATE
It seems that absolutely nothing done on the DocumentBuilder parameter actually happens on the DocumentBuilder that is calling the method. In other words:
If I modify the for loop to do something like builder.InsertParagraph(i.ToString()); and then remove the code that deletes the paragraphs afterwards, I can call:
myBuilder.WillPageBreakAfter(10);
and expect to see 0-9 written to the document when it is saved; however, they are not. None of the Writeln()s in the extension method seem to do anything at all.
UPDATE 2
It appears that, for whatever reason, I cannot write anything with the DocumentBuilder after accessing the page count. Calling something like Writeln() before the int pageCountBefore = builder.Document.PageCount; line works, but trying to write after that line simply does nothing.
The Document.PageCount property invokes page layout. You are modifying the document after using this property. Note that when you modify the document after using this property, Aspose.Words will not update the page layout automatically. In this case you should call the Document.UpdatePageLayout method.
I work with Aspose as Developer Evangelist.
And it seems I've figured it out.
From the Aspose docs:
// This invokes page layout which builds the document in memory so note that with large documents this
// property can take time. After invoking this property, any rendering operation e.g rendering to PDF or image
// will be instantaneous.
int pageCount = doc.PageCount;
The most important line here:
This invokes page layout
By "invokes page layout", they mean it calls UpdatePageLayout(), for which the docs contain this note:
However, if you modify the document after rendering and then attempt to render it again - Aspose.Words will not update the page layout automatically. In this case you should call UpdatePageLayout() before rendering again.
So basically, given my original code, I have to call UpdatePageLayout() after my Writeln()s in order to get the updated page count.
// Get current number of pages
int pageCountBefore = builder.Document.PageCount;
for (int i = 0; i < numParagraphs; i++)
{
    builder.InsertParagraph();
}
// Update the page layout.
builder.Document.UpdatePageLayout();
// Get the number of pages after adding those paragraphs
int pageCountAfter = builder.Document.PageCount;
I am trying to build an application in C# that converts a PDF to an Excel file.
I have searched for a library to help me with this, but most of them are commercially licensed, so I ended up with iTextSharp.
It's good that it is free, but I can rarely find good freely available documentation for it.
These are some of the links I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them do not really explain what the code does.
This is the most common iText code I have seen in C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
    text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code I have. Please tell me whether my guesses are correct, and explain the parts of the code that I still do not understand.
Thank You.
Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in notepad for instance, you might see something like
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can
get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
get the graphics state when the object was rendered (which font, font-size, color, etc)
This can allow you to write your own parsing logic.
For instance:
gather all the PathRenderInfo objects
remove everything that is not a perfect horizontal or vertical line
make clusters of everything that intersects at 90 degree angles
if a cluster contains more than a given threshold of lines, consider it a table
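The heuristic above can be sketched in plain C# as follows. This is only an illustration under stated assumptions: Segment and TableFinder are hypothetical names, each path is assumed to have already been reduced to straight segments with known end points (in iText those coordinates would have to come from the path render information), and eps is an arbitrary tolerance. Records require C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical segment type; not part of iText.
public record Segment(float X1, float Y1, float X2, float Y2)
{
    public bool IsHorizontal(float eps) => Math.Abs(Y1 - Y2) <= eps;
    public bool IsVertical(float eps) => Math.Abs(X1 - X2) <= eps;
}

public static class TableFinder
{
    // True when horizontal segment h and vertical segment v cross or touch (within eps).
    static bool Crosses(Segment h, Segment v, float eps) =>
        v.X1 >= Math.Min(h.X1, h.X2) - eps && v.X1 <= Math.Max(h.X1, h.X2) + eps &&
        h.Y1 >= Math.Min(v.Y1, v.Y2) - eps && h.Y1 <= Math.Max(v.Y1, v.Y2) + eps;

    public static List<List<Segment>> FindTables(
        IEnumerable<Segment> segments, int threshold, float eps = 0.5f)
    {
        // Drop everything that is not a horizontal or vertical line.
        var lines = segments.Where(s => s.IsHorizontal(eps) || s.IsVertical(eps)).ToList();

        // Union-find: segments that meet at a right angle end up in the same cluster.
        int[] parent = Enumerable.Range(0, lines.Count).ToArray();
        int Find(int i)
        {
            while (parent[i] != i) i = parent[i];
            return i;
        }

        for (int i = 0; i < lines.Count; i++)
        {
            for (int j = i + 1; j < lines.Count; j++)
            {
                Segment a = lines[i], b = lines[j];
                bool cross =
                    (a.IsHorizontal(eps) && b.IsVertical(eps) && Crosses(a, b, eps)) ||
                    (a.IsVertical(eps) && b.IsHorizontal(eps) && Crosses(b, a, eps));
                if (cross) parent[Find(i)] = Find(j);
            }
        }

        // A cluster with more lines than the threshold is treated as a table candidate.
        return Enumerable.Range(0, lines.Count)
            .GroupBy(i => Find(i))
            .Where(g => g.Count() > threshold)
            .Select(g => g.Select(i => lines[i]).ToList())
            .ToList();
    }
}
```

The pairwise comparison is quadratic in the number of lines, which is acceptable for a sketch but may need a spatial index on pages with many thousands of paths.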
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/
I'm trying to use the watermark plugin to write text on images for my project. Right now I'm trying to find out how to set a "width" for a writing box so I can get automatic line returns. Is there a way to do this with the watermark plugin?
Also, I'm trying to see if I can get a "text-align: center" effect when I'm writing my text (possibly in relation to that set width). How could I set that up?
I'm thinking that the alternative to this would be to have code driven line returns and centering, but this would mean that I would have to count the width of my characters and this seems like a world of pain hehe
Here is a code sample that shows what I'm doing (this currently works):
var c = Config.Current;
var wp = c.Plugins.Get<WatermarkPlugin>();
var t = new TextLayer();
t.Text = panty.Message;
t.TextColor = (System.Drawing.Color) color;
t.Font = fonts[myFunObject.Font];
t.FontSize = fontSize[myFunObject.LogoPosition];
t.Left = new DistanceUnit(5, DistanceUnit.Units.Pixels);
t.Top = new DistanceUnit(5, DistanceUnit.Units.Pixels);
wp.NamedWatermarks["myFunObjectMessage"] = new Layer[] { t };
EDIT: I also have to mention that the text I'm writing is user submitted, so it's different every time. If you want a similar case, think about those funny cat images with funny text captions on them. This project is quite similar to that. (Minus the cats)
Thanks for the help!
Basically, System.Drawing (and therefore the current version of Watermark) are very primitive about line wrapping.
As you mentioned, you can do hacky stuff with character counting and separate MeasureString calls with loops, but the results are only barely acceptable.
You may try to fork the Watermark source code and hack support for your use case. I don't see a way to improve Watermark in a generic way without replacing the underlying graphics engine first (which may happen anyway).
System.Drawing has unsurpassed image resampling quality. Text wrapping, though, it kind of stinks at.
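For reference, the measure-and-loop approach mentioned above looks roughly like this. It is a minimal sketch, not part of the Watermark plugin: TextWrapper and the measure callback are hypothetical, and the callback is assumed to return the rendered width of a string for the chosen font (for example by wrapping Graphics.MeasureString).

```csharp
using System;
using System.Collections.Generic;

public static class TextWrapper
{
    // Greedy word wrap: keep adding words to a line until the next one would overflow.
    public static List<string> Wrap(string text, float maxWidth, Func<string, float> measure)
    {
        var lines = new List<string>();
        string current = "";
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string candidate = current.Length == 0 ? word : current + " " + word;
            if (current.Length > 0 && measure(candidate) > maxWidth)
            {
                lines.Add(current);   // the word does not fit: close the current line
                current = word;
            }
            else
            {
                current = candidate;  // the word fits, or it is alone on the line
            }
        }
        if (current.Length > 0) lines.Add(current);
        return lines;
    }

    // Left offset that centres a line of the given width inside the box.
    public static float CenterOffset(float lineWidth, float boxWidth) =>
        Math.Max(0f, (boxWidth - lineWidth) / 2f);
}
```

Each resulting line would then need its own text layer, positioned at CenterOffset plus the left edge of the box. A single word wider than the box is left on its own line without being broken.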
I have a bunch of PDF files. I read these as requested into a byte array and then pass it to an iTextSharp PdfReader instance. I then want to grab the dimensions of each page in pixels. From what I've read so far, it seems PDF files work in points, a point being a configurable unit stored in some kind of dictionary in an element called UserUnit.
After loading my PDF file into a PdfReader, what do I need to do to get the UserUnit for each page (apparently it can vary from page to page) so I can then get the page dimensions in pixels?
At present I have this code, which grabs the dimensions for each page in "points". I guess I just need the UserUnit and can then multiply these dimensions by it to get pixels or something similar.
//Create an object to read the PDF
PdfReader reader = new iTextSharp.text.pdf.PdfReader(file_content);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    Rectangle dim = reader.GetPageSize(i);
    int[] xy = new int[] { (int)dim.Width, (int)dim.Height }; // returns page size in "points"
    page_data[objectid + '-' + i] = xy;
}
Cheers!
Allow me to quote from my book:
iText in Action - Second Edition, page 9:
FAQ What is the measurement unit in PDF documents? Most of the measurements
in PDFs are expressed in user space units. ISO-32000-1 (section 8.3.2.3) tells us
“the default for the size of the unit in default user space (1/72 inch) is
approximately the same as a point (pt), a unit widely used in the printing
industry. It is not exactly the same; there is no universal definition of a point.”
In short, 1 in. = 25.4 mm = 72 user units (which roughly corresponds to 72 pt).
On the next page, I explain that it’s possible to change the default value of the user unit, and I add an example on how to create a document with pages that have a different user unit.
Now for your question: suppose you have an existing PDF, how do you find which user unit was used? Before we answer this, we need to take a look at ISO-32000-1.
In section 7.7.3.3 Page Objects, you'll find the description of UserUnit in Table 30, "Entries in a page object":
(Optional; PDF 1.6) A positive number that shall give the size of
default user space units, in multiples of 1⁄72 inch. The range of
supported values shall be implementation-dependent. Default value: 1.0
(user space unit is 1⁄72 inch).
This key was introduced in PDF 1.6; you won't find it in older files. It's optional, so you won't always find it in every page dictionary. In my book, I also explain that the maximum value of the UserUnit key is 75,000.
Now how to retrieve this value with iTextSharp?
You already have Rectangle dim = reader.GetPageSize(i); which returns the MediaBox. This may not be the size of the visual part of the page. If there's a CropBox defined for the page, viewers will show a much smaller size than what you have in xy (but you probably knew that already).
What you need now is the page dictionary, so that you can retrieve the value of the UserUnit key:
PdfDictionary pageDict = reader.GetPageN(i);
PdfNumber userUnit = pageDict.GetAsNumber(PdfName.USERUNIT);
Most of the time userUnit will be null, but if it isn't, you can use userUnit.FloatValue.
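To get from there to pixels you still have to pick a resolution, because a PDF page has a physical size but no inherent pixel size. A minimal sketch of the arithmetic follows; PageUnits, ToPixels and the dpi parameter are names introduced here for illustration and are not part of iTextSharp.

```csharp
using System;

public static class PageUnits
{
    // Converts a length in user space units to pixels.
    // userUnit: value of the page's UserUnit key, or 1.0 when the key is absent.
    // dpi: the resolution you render at (an assumption of this sketch).
    public static int ToPixels(float userSpaceUnits, float userUnit, float dpi)
    {
        float inches = userSpaceUnits * userUnit / 72f;  // one default unit is 1/72 inch
        return (int)Math.Round(inches * dpi);
    }
}
```

In the loop from the question this could be called as PageUnits.ToPixels(dim.Width, userUnit == null ? 1f : userUnit.FloatValue, 96f), with 96 being just one possible choice of resolution.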
I have a problem with some user-provided PDF documents. They are created from 3D packages and are basically a HUGE list of vector lines that take an age to render (over 60 secs).
How can I generate a report on the number of vector lines present in a PDF document using iTextSharp (5.0.5)?
I can get text and image data but can't see where to get a handle on vectors. They don't seem to be represented as an image.
iText[Sharp]'s parser package doesn't yet handle lineTo or curveTo commands. It's a goal, but not one that's been important enough to implement as yet. Other things are getting attention at the moment.
If you're feeling adventurous, you should check out PdfContentStreamProcessor. In a private function populateOperators, there's a long list of commands that are currently handled (in one fashion or another).
You'd need to write similar command classes for all the line art commands (moveTo, lineTo, rect, stroke, fill, clip), and expose them in some way.
Actually, if all you want to do is COUNT the number of paths, you could just implement stroke and fill to increment some static integer[s], then check them after parsing. Should be fairly simple (I'm writing in Java, but it's easy enough to translate):
private static class CountOps implements ContentOperator {
    public static int operationCount = 0;
    public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) {
        ++operationCount;
    }
}
Ah! registerContentOperator is a public function. You don't need to change iText's source at all:
PdfContentStreamProcessor proc = new PdfContentStreamProcessor(null);
CountOps countOps = new CountOps();
proc.registerContentOperator("S", countOps); // stroke the path
proc.registerContentOperator("s", countOps); // close & stroke
proc.registerContentOperator("F", countOps); // fill, backward compat
proc.registerContentOperator("f", countOps); // fill
proc.registerContentOperator("f*", countOps); // fill with even-odd winding rule
proc.registerContentOperator("B", countOps); // fill & stroke
proc.registerContentOperator("B*", countOps); // fill & stroke with even-odd
proc.registerContentOperator("b", countOps); // close, fill, & stroke
proc.registerContentOperator("b*", countOps); // close, fill, & stroke with even-odd
proc.processContent( contentBytes, pageResourceDict );
int totalStrokesAndFills = CountOps.operationCount; // note that stroke&fill operators will be counted once, not twice.
Something like that. The only catch is that a null RenderListener will cause a null pointer exception if you run into any text or images. You could whip up a no-op listener yourself or use one of the existing ones and ignore its output.
PS: iTextSharp 5.0.6 should be released any day now if it isn't out already.
There is no specific Vector image. Normally it is just added to the content stream which is essentially a Vector data stream for drawing the whole page.
There is a blog article which you might find useful for understanding this at http://www.jpedal.org/PDFblog/2010/11/grow-your-own-pdf-file-%E2%80%93-part-5-path-objects/