Get document vector count in a pdf document?

Get document vector count in a pdf document? - c#

I have a problem with some user provided pdf documents. They are created from 3d packages and are basically a HUGE list of vector lines that take and age to render (over 60 secs).
How can I generate a report on the number of vector lines present in a pdf document using iTextSharp (5.0.5)?
I can get text and image data but can't see where to get a handle on vector. They don't seems to be represented as an image.

iText[Sharp]'s parser package doesn't yet handle lineTo or curveTo commands. It's a goal, but not one that's been important enough to implement as yet. Other Things are getting attention at the moment.
If you're feeling adventurous, you should check out PdfContentStreamProcessor. In a private function populateOperators, there's a long list of commands that are currently handled (in one fashion or another).
You'd need to write similar command classes for all the line art commands (moveTo, lineTo, rect, stroke, fill, clip), and expose them in some way.
Actually, if all you want to do is COUNT the number of paths, you could just implement stroke and fill to increment some static integer[s], then check them after parsing. Should be fairly simple (I'm writing in Java, but it's easy enough to translate):
private static class CountOps implements ContentOperator {
public static int operationCount = 0;
public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) {
++operationCount;
}
}
Ah! registerContentOperator is a public function. You don't need to change iText's source at all:
PdfContentStreamProcessor proc = new PdfContentStreamProcessor(null);
CountOps counter = new CountOps();
proc.registerContentOperator("S", countOps); // stroke the path
proc.registerContentOperator("s", countOps); // close & stroke
proc.registerContentOperator("F", countOps); // fill, backward compat
proc.registerContentOperator("f", countOps); // fill
proc.registerContentOperator("f*", countOps); // fill with event-odd winding rule
proc.registerContentOperator("B", countOps); // fill & stroke
proc.registerContentOperator("B*", countOps); // fill & stroke with even-odd
proc.registerContentOperator("b", countOps); // close, fill, & stroke
proc.registerContentOperator("b*", countOps); // close, fill, & stroke with even-odd
proc.processContent( contentBytes, pageResourceDict );
int totalStrokesAndFills = CountOps.operationCount; // note that stroke&fill operators will be counted once, not twice.
Something like that. Only a null RenderListener will cause a null pointer exception if you run into any text or images. You could whip up a no-op listener yourself or use one of the existing ones and ignore its output.
PS: iTextSharp 5.0.6 should be released any day now if it isn't out already.

There is no specific Vector image. Normally it is just added to the content stream which is essentially a Vector data stream for drawing the whole page.
There is a blog article which you might find useful for understanding this at http://www.jpedal.org/PDFblog/2010/11/grow-your-own-pdf-file-%E2%80%93-part-5-path-objects/

Related

Is there any way to optimize pdfreader itextsharp?

There is a method that many times reads text from different pages of a pdf document using a rectangle. Accordingly, the larger the file, the slower everything is processed, I tried to use Parallel.Foreach, but I didn't get a substantial increase in processing speed, everything seems to be hampered by PdfReader.
The method is something like this:
var lst = new ConcurrentBag<Test3>();
using(var reader = new PdfReader(byteArr))
{
Parallel.Foreach(areas, t =>
{
var pageSize = reader.GetPageSize(t.PageNumber);
var rectangle = GetRectagle(t.AreaData, pageSize);
var text = GetTextFromRectangle(reader, rectagle, t.PageNumber);
lst.Add(text);
}
}
public string GetTextFromRectagle(PdfReader reader, Rectangle rect, int pageNum)
{
RenderFilter[] filter = {
new RegionTextRenderText()
};
ITextExtractionStrategy strategy =
new FilteredTextRenderListener(new
LocationTextExtractionStrategy(), filter);
return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
}

After you mentioned in a comment that there are
Approximately 900 rectangle areas per page
and added your GetTextFromRectangle code, the cause of the problem became clear: For each of your pre-defined rectangles you make iText parse the whole content of the page the rectangle is on into a filtered text extraction strategy you expect to be focused on the respective rectangle area.
By the way, even worse, I don't see you using the Rectangle rect parameter in your GetTextFromRectangle method, thus after all you actually do not even focus on the respective rectangle!
So you parse each page approximately 900 times, each time throwing away most of the parsed information, instead of only once and retrieving the text from the pre-parsed data from each of those 900 rectangles per page.
This is waste of resources in its purest form!
What you should do instead, is
sort and separate your areas by their respective page and
for each of the pages
once (and only once) parse the content of that page into an unfiltered LocationTextExtractionStrategy and
for each rectangle on that page use the GetResultantText(TextChunkFilter) method of the strategy instance with a TextChunkFilter that filters by position (whether the chunk in question is inside the rectangle at hand) to retrieve the area text.
As an aside, in case of iText 7 instead of iText 5 (for .Net, formerly called iTextSharp) that GetResultantText overload with a TextChunkFilter is missing but you can emulate it, cf. this answer.

How to determine whether a MS-Word paragraph is more than one line?

What I want is not to find where enter was pressed in a paragraph (the end of a paragraph). I need to determine whether a paragraph contains a single line or multiple lines so that it can be formatted accordingly (centered or left-justified).
Like this in the center if it's in one line
or left justify if in Multiline
How to determine whether a paragraph is more than one line in VSTO?

Since "lines" are not objects in the Word object model, due to its dynamic layout algorithms, this needs to be approached via the old WordBasic technology still built into the APIs. (WordBasic worked based on selections, rather than objects, which is why this capability is present in these old methods.)
In this case, the Word.WdInformation enumeration offers parameters that work with "lines", more specifically for this problem wdFirstCharacterLineNumber.
The following sample code contains a code snippet that calls IsParaOneLine on a specific paragraph of a document.
IsParaOneLIne duplicates the paragraph Range passed two it twice: once for the starting point and once for the end point. These Ranges are then collapsed to their starting and end points, respectively and the line number determined. If the two are the same, true is returned to the calling code, otherwise false.
Notes:
rngEnd.MoveEnd(Word.WdUnits.wdCharacter, -1); moves the end point back by one character because after collapsing to the end of a paragraph Range, the Range is at the start of the following paragraph. This moves it back to the original paragraph.
The example applies a style rather than "direct formatting". Rather than formatting with centered and left alignment throughout a document I strongly recommend using Styles. If there's not a built-in style with the formatting required, create the custom styles you need. If you're familiar with CSS you know the advantages of using styles. With Word there's an additional reason: it massively reduces the temp files Word generates so that you're less likely to run out of memory.
Word.Range rng = doc.Paragraphs[2].Range;
if (IsParaOneLine(rng))
{
rng.set_Style(Word.WdBuiltinStyle.wdStyleHeading1);
}
else
{
Debug.Print("Not one line");
}
public bool IsParaOneLine(Word.Range rng)
{
Word.Range rngStart = rng.Duplicate;
rngStart.Collapse(Word.WdCollapseDirection.wdCollapseStart);
Word.Range rngEnd = rng.Duplicate;
rngEnd.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
rngEnd.MoveEnd(Word.WdUnits.wdCharacter, -1);
int posLineStart = (int) rngStart.get_Information(Word.WdInformation.wdFirstCharacterLineNumber);
int posLineEnd = (int) rngEnd.get_Information(Word.WdInformation.wdFirstCharacterLineNumber);
bool isSameLine = false;
if (posLineStart == posLineEnd)
isSameLine = true;
return isSameLine;
}

Searching PDF for Underlined and Bolded text

Using iTextSharp, how can I determine if a parsed chunk of text is both bolded and underlined?
Details:
I'm trying to parse .PDF files in C# specifically for text that is both bolded and underlined. Using ITextSharp, I can derive from LocationTextExtractionStrategy and get the text, the location, the font, etc. from the iTextSharp.text.pdf.parser.TextRenderInfo object passed to the overridden .RenderText method.
However, determining if the text is Bold and/Underlined from the TextRenderInfo object has not been straight forward.
I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful
I can currently determine if the text is Bold or not, by accessing the private Graphics State field on the TextRenderInfo object and checking it's .Font.PostscriptFontName property for the word "Bold" (Ugly, but appears to work.)
Biggest issue: I haven't found anything to determine if the text is underlined. How can I determine this?
Here is my current attempt:
private FieldInfo _gsField = typeof(TextRenderInfo).GetField("gs",
BindingFlags.GetField | BindingFlags.NonPublic | BindingFlags.Instance);
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//UNDONE:Need to determine if text is underlined. How?
//NOTE: renderInfo.GetFont().FontWeight does not contain any actual information
var gs = (GraphicsState)_gsField.GetValue(renderInfo);
var textChunkInfo = new TextChunkInfo(renderInfo);
_allLocations.Add(textChunkInfo);
if (gs.Font.PostscriptFontName.Contains("Bold"))
//Add this to our found collection
FoundItems.Add(new TextChunkInfo(renderInfo));
if (!_lineHeights.Contains(textChunkInfo.LineHeight))
_lineHeights.Add(textChunkInfo.LineHeight);
}
Full source code of current attempt at: GitHub Repository (Two examples (example.pdf and example2.pdf) are included with text similar to what I'll be searching through.)

I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful
I can currently determine if the text is Bold or not, by accessing the private Graphics State field on the TextRenderInfo object and checking it's .Font.PostscriptFontName property for the word "Bold" (Ugly, but appears to work.)
I don't quite understand this differentiation. TextRenderInfo.GetFont() is exactly the same as the Font property of the private Graphics State field of TextRenderInfo.
That being said, though, this is indeed one of the major ways to determine boldness.
Bold writing in PDFs is achieved either using
explicitly bold fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold by
looking at the font name: it may contain a substring "bold" or something similar;
looking at some optional properties of the font, e.g. font weight, but beware, they are optional...
inspecting the embedded font file if applicable.
Neither of these methods is fool-proof;
the same font as for non-bold text but using special techniques to make them appear bold (aka poor man's bold), e.g.
not only filling the glyph contours but also drawing a thicker line along it for a bold impression,
drawing the glyph twice, the second time slightly displaced, also for a bold impression.
Underlined writing in PDFs is usually achieved by explicitly drawing a line or a very thin rectangle under the text. You can try and detect such lines by implementing IExtRenderListener, parsing the page in question with it to determine line locations, and then match with text positions during text extraction. Both can also be done in a single pass but beware, the underlines need not be drawn before the text or even shortly thereafter, the pdf producer may first draw all text and only then draw all underlines. Furthermore, I've also come across a funny construction, very short (e.g. 1pt) very wide (e.g. 50pt) vertical lines effectively are seen as horizontal ones...
IExtRenderListener extends the IRenderListener with three new methods, ModifyPath, RenderPath, and ClipPath. Whenever some path is drawn, be it a single line, a rectangle, or some very complex path, you'll first get a number of ModifyPath calls (at least one)
/**
* Called when the current path is being modified. E.g. new segment is being added,
* new subpath is being started etc.
*
* #param renderInfo Contains information about the path segment being added to the current path.
*/
void ModifyPath(PathConstructionRenderInfo renderInfo);
defining the lines and curves the path consists of, then at most one ClipPath call
/**
* Called when the current path should be set as a new clipping path.
*
* #param rule Either {#link PathPaintingRenderInfo#EVEN_ODD_RULE} or {#link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
*/
void ClipPath(int rule);
(if and only if the path shall serve as clip path for the following drawing operations), and finally exactly one RenderPath call
/**
* Called when the current path should be rendered.
*
* #param renderInfo Contains information about the current path which should be rendered.
* #return The path which can be used as a new clipping path.
*/
Path RenderPath(PathPaintingRenderInfo renderInfo);
defining how that path shall be drawn (any combination of filling its interior and stroking the path itself).
I.e. for recognizing underlines, you'll have to collect the path pieces provided via ModifyPath and decide whether they might describe one or more underlines as soon as the RenderPath call comes.
Theoretically underlines could also be created differently, e.g. using a bitmap image, but I'm not aware of pdf producers doing so.
By the way, in your example PDF underlines appear consistently to be drawn using a MoveTo to the line starting point, a LineTo to its end, and then a Stroke to simply stroke the path. Thus, you'll get two ModifyPath calls (one with operation value MOVETO, one with LINETO) and one RenderPath call (with operation STROKE) respectively for each underline.

In DOCOTIC.pdf library there is a method responding as true or false.
In C#
bool FONT_ITALIC = data.Font.Italic;
bool FONT_UNDERLINE = data.Font.Underline;
Check for the value of FONT_ITALIC/FONT_UNDERLINE.
I have tried to use the same, but couldn't get correct value always.
Any suggestions are welcome.

How to detect a string-overflow from a line in C#?

I wanted to know how to word-wrap in C# when I came across a very good solution to my problem on word-wrapping text in a line in this URL. Unfortunately, I do not have enough reputations to ask the OP directly about one specific problem I am having(and most likely people dealing with this will be having indefinitely)
Problem Description:
I can word-wrap a string if it needs to be word-wrapped with this code:
Graphics PAddress = e.Graphics;
SizeF PAddressLength = PAddress.MeasureString("Residential Address: " + RAddressTextBox.Text, new Font(fontface, fontsize, FontStyle.Regular),700);
PAddress.DrawString("Residential Address: "+PAddressLength + RAddressTextBox.Text, new Font(fontface, fontsize, FontStyle.Regular), Brushes.Black, new RectangleF(new Point(pagemarginX, newline()),PAddressLength),StringFormat.GenericTypographic);
However, I could not find the place to receive a trigger whenever the word-length overflows from a single line.
for example:
In LINE-2 of that code, whenever the wordlength exceeds 700px, it moves to the next line. It does that by following the RectangleF to wordwrap. It is doing so automatically, which is a problem since that makes it difficult to know whether it has crossed 700px or not.This is the format in which information is displayed whenever I tried to print PAddressLength:
{Width=633.1881, Height=47.14897}
I am thinking that If I can extract the value of width from that using PAddressLength.Width ,then I can partially solve this problem. But with that, I will need to calculate if the remaining space(i.e 700px - 633.1881px ) will accommodate the next word or not(if there is one)
BREAKING DOWN THE PROBLEM:
I already know how to word-wrap when there is a string longer than what specify by using Graphics.MeasureString as given in this solution in another question.
But that^ process happens automatically, so I want to know how to detect if the word-wrap has occured(and how may lines it has wrapped with each line being 700px width maximum)
I need to know the number of lines that have been wrapped in order to know the number of times to execute newline() function that I wrote, which gives appropriate line spacing upon executing each time.
ADDITIONALLY, (bonus question; may or maynot solve) Is there some way to extract the value 633.1881 and then calculate whether the next word fits in ( 700 - 633.1881 )px space or not?

There is an overload to MeasureString that returns the number of lines used in an out parameter: https://msdn.microsoft.com/en-us/library/957webty%28v=vs.110%29.aspx

How to detect superscript with ItextSharp?

Hy
I am using ITextSharp to parse a pdf file to text output.
I want to know if I can catch if the pdf contains subscript or superscript, does anyone knows how to make the difference between a normal character and a superscript in a pdf using ITextSharp, or other library ?
Thanks

Disclaimer: I don't actually have any evidence for this but...
I would expect super/subscript to be identical to normal text. It's the same font, just smaller. If it happens to be on the same line as other text, super/sub scripts are raised and lowered - but you won't be able to detect that with some explicit meta-tag in a layout-oriented format such as PDF.
In other words, I'd guess that you need to identify super/subscripts by heuristics: finding text that's smaller and vertically displaced compared to other text on the "same" line. Whether that's easy to do or not depends on the PDF creator and the details of ITextSharp, since even identifying a "line" is not necessarily straightforward.

You are going to have to implement a bit of custom logic here. There is no tag denoting superscript/subscript in PDF, it is simply sitting upon a different baseline. In cases such as this, you will have to note your baseline (along with your height).
Some quick pseudo-code:
//input -> curText
if(curText.Baseline > previousText.Baseline &&
curText.Baseline < (prevText.Baseline + prevText.Height))
{
// This is most likely superscript //
}
else if(curText.Baseline < previousText.Baseline &&
prevText.Baseline < (curText.Baseline + curText.Height))
{
// This is most likely subscript //
}
else
{
// This is probably normal text //
}
This solution requires you to organize the thoroughly unorganized nature of a PDF file. In the past I have used List<> of a custom class meant to organize all text of a given y coordinate into arrays. Using something like this you can then compare the separate lines and do whatever work to them you might want before painting or otherwise transmitting them.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Get document vector count in a pdf document? - c#

Related

Is there any way to optimize pdfreader itextsharp?

How to determine whether a MS-Word paragraph is more than one line?

Searching PDF for Underlined and Bolded text

How to detect a string-overflow from a line in C#?

How to detect superscript with ItextSharp?

Categories

Resources