Searching PDF for Underlined and Bolded text


Using iTextSharp, how can I determine if a parsed chunk of text is both bolded and underlined?
Details:
I'm trying to parse PDF files in C#, looking specifically for text that is both bolded and underlined. Using iTextSharp, I can derive from LocationTextExtractionStrategy and get the text, the location, the font, etc. from the iTextSharp.text.pdf.parser.TextRenderInfo object passed to the overridden RenderText method.
However, determining whether the text is bold and/or underlined from the TextRenderInfo object has not been straightforward.
I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful.
I can currently determine whether the text is bold by accessing the private graphics state field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold". (Ugly, but appears to work.)
Biggest issue: I haven't found any way to determine whether the text is underlined. How can I determine this?
Here is my current attempt:
    private FieldInfo _gsField = typeof(TextRenderInfo).GetField("gs",
        BindingFlags.GetField | BindingFlags.NonPublic | BindingFlags.Instance);

    // Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);

        // UNDONE: Need to determine if text is underlined. How?
        // NOTE: renderInfo.GetFont().FontWeight does not contain any actual information
        var gs = (GraphicsState)_gsField.GetValue(renderInfo);
        var textChunkInfo = new TextChunkInfo(renderInfo);
        _allLocations.Add(textChunkInfo);

        if (gs.Font.PostscriptFontName.Contains("Bold"))
            // Add this to our found collection
            FoundItems.Add(new TextChunkInfo(renderInfo));

        if (!_lineHeights.Contains(textChunkInfo.LineHeight))
            _lineHeights.Add(textChunkInfo.LineHeight);
    }
Full source code of current attempt at: GitHub Repository (Two examples (example.pdf and example2.pdf) are included with text similar to what I'll be searching through.)

> I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful
> I can currently determine if the text is Bold or not, by accessing the private Graphics State field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold" (Ugly, but appears to work.)
I don't quite understand this differentiation. TextRenderInfo.GetFont() is exactly the same as the Font property of the private Graphics State field of TextRenderInfo.
That being said, though, this is indeed one of the major ways to determine boldness.
Bold writing in PDFs is achieved using either
- explicitly bold fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold by
    - looking at the font name: it may contain a substring "bold" or something similar;
    - looking at some optional properties of the font, e.g. font weight, but beware, they are optional...
    - inspecting the embedded font file if applicable.
  None of these methods is foolproof;
- the same font as for non-bold text but using special techniques to make it appear bold (aka poor man's bold), e.g.
    - not only filling the glyph contours but also drawing a thicker line along them for a bold impression,
    - drawing the glyph twice, the second time slightly displaced, also for a bold impression.
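The font-name check from the first option can be sketched roughly as follows. This is only a minimal sketch: the class and method names are hypothetical, the keyword list is an assumption you would tune to the fonts you actually encounter, and names such as "Semibold" will also match. It strips the six-letter subset tag (e.g. "ABCDEF+") first, because that tag is arbitrary and could itself happen to spell "BOLD".

```csharp
using System;
using System.Linq;

public static class FontNameHeuristics // hypothetical helper, not part of iTextSharp
{
    // Assumed keyword list; extend it for the fonts you actually see.
    private static readonly string[] BoldMarkers = { "bold", "black", "heavy" };

    public static bool LooksBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName))
            return false;

        string name = postscriptFontName;

        // Subset fonts carry a tag of six uppercase letters plus '+',
        // e.g. "ABCDEF+Arial-BoldMT"; drop it before searching.
        int plus = name.IndexOf('+');
        if (plus == 6 && name.Take(6).All(c => c >= 'A' && c <= 'Z'))
            name = name.Substring(7);

        // Case-insensitive substring search for any of the markers.
        return BoldMarkers.Any(m =>
            name.IndexOf(m, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

A caller would pass in whatever name the font object reports, for example the PostscriptFontName used in the question. A false result does not prove the text is not bold, since the poor man's bold techniques leave the font name unchanged.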
Underlined writing in PDFs is usually achieved by explicitly drawing a line or a very thin rectangle under the text. You can try to detect such lines by implementing IExtRenderListener, parsing the page in question with it to determine line locations, and then matching them with text positions during text extraction. Both can also be done in a single pass, but beware: the underlines need not be drawn before the text or even shortly thereafter; the PDF producer may first draw all text and only then draw all underlines.
Furthermore, I've also come across a funny construction: very short (e.g. 1pt) but very wide (e.g. 50pt) vertical lines, which effectively are seen as horizontal ones...
IExtRenderListener extends IRenderListener with three new methods: ModifyPath, RenderPath, and ClipPath. Whenever some path is drawn, be it a single line, a rectangle, or some very complex path, you'll first get a number of ModifyPath calls (at least one)

    /**
     * Called when the current path is being modified. E.g. new segment is being added,
     * new subpath is being started etc.
     *
     * @param renderInfo Contains information about the path segment being added to the current path.
     */
    void ModifyPath(PathConstructionRenderInfo renderInfo);

defining the lines and curves the path consists of, then at most one ClipPath call

    /**
     * Called when the current path should be set as a new clipping path.
     *
     * @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
     */
    void ClipPath(int rule);

(if and only if the path shall serve as clip path for the following drawing operations), and finally exactly one RenderPath call

    /**
     * Called when the current path should be rendered.
     *
     * @param renderInfo Contains information about the current path which should be rendered.
     * @return The path which can be used as a new clipping path.
     */
    Path RenderPath(PathPaintingRenderInfo renderInfo);

defining how that path shall be drawn (any combination of filling its interior and stroking the path itself).
I.e. for recognizing underlines, you'll have to collect the path pieces provided via ModifyPath and decide whether they might describe one or more underlines as soon as the RenderPath call comes.
Theoretically underlines could also be created differently, e.g. using a bitmap image, but I'm not aware of pdf producers doing so.
By the way, in your example PDF the underlines appear to be drawn consistently using a MoveTo to the line starting point, a LineTo to its end, and then a Stroke to simply stroke the path. Thus, for each underline you'll get two ModifyPath calls (one with operation value MOVETO, one with LINETO) followed by one RenderPath call (with operation STROKE).
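Once line segments and text positions have been collected, the matching step described above could look roughly like this. It is only a sketch under stated assumptions: TextBox and Segment are hypothetical holder types you would fill from your own listener (they are not iTextSharp classes), all coordinates are in the same page coordinate system with y pointing up, and the tolerances are guesses. Thin rectangles, or the wide "vertical" lines mentioned earlier, would first have to be reduced to their centre line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder types; fill them from your own render listener.
public sealed class TextBox
{
    public string Text;
    public float Left, Right, Baseline, FontSize;
}

public sealed class Segment
{
    public float X1, Y1, X2, Y2;
}

public static class UnderlineMatcher
{
    // True if some roughly horizontal segment runs just below the
    // baseline and spans most of the text's width.
    public static bool IsUnderlined(TextBox text, IEnumerable<Segment> segments)
    {
        float maxDrop = 0.35f * text.FontSize;              // assumed tolerance
        float minCover = 0.8f * (text.Right - text.Left);   // assumed tolerance

        return segments.Any(s =>
        {
            float left = Math.Min(s.X1, s.X2);
            float right = Math.Max(s.X1, s.X2);
            float y = (s.Y1 + s.Y2) / 2;

            bool horizontal = Math.Abs(s.Y1 - s.Y2) <= 0.5f;
            bool justBelow = y <= text.Baseline && text.Baseline - y <= maxDrop;
            float overlap = Math.Min(right, text.Right) - Math.Max(left, text.Left);

            return horizontal && justBelow && overlap >= minCover;
        });
    }
}
```

Because the underlines may be drawn long after the text, it is simplest to collect all segments for the page first and run this check only after parsing has finished.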

The Docotic.Pdf library exposes these as properties that are either true or false.
In C#:

    bool FONT_ITALIC = data.Font.Italic;
    bool FONT_UNDERLINE = data.Font.Underline;

Check the value of FONT_ITALIC / FONT_UNDERLINE.
I have tried this myself, but it did not always give the correct value.
Any suggestions are welcome.

Related

Check a Font for a 32bit Unicode Glyph [duplicate]

How can I determine from the .NET runtime whether a given font has the glyph for a character? I want to switch the font to Arial Unicode MS if I have text that the specified font does not have a glyph for (very common for CJK).
Update: I'm looking for a C# (i.e. all managed code) solution. I think GlyphTypeface may be what I need, but I can't see a way to ask it whether a given character has a glyph. You can get the entire map back, but I assume that would be an expensive call.

I've written some Unicode tools, and the technique I use is to get the map and cache it for each font used.

    IDictionary<int, ushort> characterMap = GlyphTypeface.CharacterToGlyphMap;

will give you the defined glyph index per code point (see the MSDN reference for GlyphTypeface.CharacterToGlyphMap). Checking for a glyph is then a dictionary lookup:

    bool glyphExists = characterMap.ContainsKey(CodePoint);
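Since the question is about 32-bit code points, a whole string can be checked against the cached map by walking it code point by code point rather than char by char. The following is a minimal sketch; GlyphCoverage and HasGlyphsFor are hypothetical names, and the map is assumed to be the one obtained from CharacterToGlyphMap as shown above.

```csharp
using System;
using System.Collections.Generic;

public static class GlyphCoverage // hypothetical helper
{
    // True if every code point in text has an entry in the font's
    // character-to-glyph map. Throws ArgumentException for text that
    // contains an unpaired surrogate (behaviour of char.ConvertToUtf32).
    public static bool HasGlyphsFor(IDictionary<int, ushort> characterMap, string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = char.ConvertToUtf32(text, i);

            // A code point above U+FFFF occupies two chars; skip the second.
            if (char.IsHighSurrogate(text[i]))
                i++;

            if (!characterMap.ContainsKey(codePoint))
                return false;
        }
        return true;
    }
}
```

If this returns false for the preferred font, the caller can fall back to another font such as Arial Unicode MS, as the question intends.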

How to detect a string-overflow from a line in C#?

I was looking for a way to word-wrap in C# when I came across a very good solution for word-wrapping text in a line at this URL. Unfortunately, I do not have enough reputation to ask the OP directly about one specific problem I am having (and which most people dealing with this will probably run into as well).
Problem description:
I can word-wrap a string if it needs to be word-wrapped with this code:

    Graphics PAddress = e.Graphics;
    SizeF PAddressLength = PAddress.MeasureString("Residential Address: " + RAddressTextBox.Text, new Font(fontface, fontsize, FontStyle.Regular), 700);
    PAddress.DrawString("Residential Address: " + PAddressLength + RAddressTextBox.Text, new Font(fontface, fontsize, FontStyle.Regular), Brushes.Black, new RectangleF(new Point(pagemarginX, newline()), PAddressLength), StringFormat.GenericTypographic);

However, I could not find a way to be notified whenever the text overflows a single line.
For example:
In line 2 of that code, whenever the text exceeds 700px, it moves to the next line. It does that by wrapping within the RectangleF. This happens automatically, which is a problem because it makes it difficult to know whether the text has crossed 700px or not. This is what is displayed whenever I print PAddressLength:

    {Width=633.1881, Height=47.14897}

I am thinking that if I can extract the width from that using PAddressLength.Width, then I can partially solve this problem. But then I will need to calculate whether the remaining space (i.e. 700px - 633.1881px) will accommodate the next word or not (if there is one).
Breaking down the problem:
I already know how to word-wrap a string that is longer than the width I specify by using Graphics.MeasureString, as given in this solution to another question.
But that process happens automatically, so I want to know how to detect whether word-wrap has occurred (and how many lines the text has wrapped into, with each line being at most 700px wide).
I need to know the number of wrapped lines in order to know how many times to execute the newline() function that I wrote, which adds the appropriate line spacing each time it is executed.
Additionally (bonus question; may or may not be solvable): is there some way to extract the value 633.1881 and then calculate whether the next word fits in the remaining (700 - 633.1881)px of space or not?

There is an overload to MeasureString that returns the number of lines used in an out parameter: https://msdn.microsoft.com/en-us/library/957webty%28v=vs.110%29.aspx
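As a rough sketch of how that overload might be used: WrapProbe and CountLines below are hypothetical names, and the very large layout height is an assumption made so that only the width constrains the text. The overload itself is Graphics.MeasureString(string, Font, SizeF, StringFormat, out int, out int).

```csharp
using System.Drawing;

public static class WrapProbe // hypothetical helper
{
    // Returns the number of lines the text occupies when wrapped at
    // maxWidth, and reports the measured size through the out parameter.
    public static int CountLines(Graphics g, string text, Font font,
                                 float maxWidth, out SizeF size)
    {
        // Assumption: a very tall layout area, so only the width limits the text.
        var layoutArea = new SizeF(maxWidth, 999999f);

        int charactersFitted, linesFilled;
        size = g.MeasureString(text, font, layoutArea,
                               StringFormat.GenericTypographic,
                               out charactersFitted, out linesFilled);
        return linesFilled;
    }
}
```

A result greater than 1 means the text wrapped, and the result is the number of times a line-advance function such as newline() would need to run. The returned size carries the Width value the bonus question asks about.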

Is it possible to modify spot color names in a PDF using iTextSharp in C#?

Is it possible to modify spot color names in a PDF using iTextSharp in C#? It's just the colour name that needs changing.

So you have an existing PDF that uses some spot colors, for instance a color named "ABC", and you want to manipulate that PDF so that the name is "XYZ".
This is possible, but it requires low-level PDF syntax manipulation.
You need to create a PdfReader instance, look for the dictionary that defines the spot color, change the name, then use PdfStamper to create a new file based on the altered PdfReader instance.
There is no "ready-made" example on how to answer your specific question (I doubt somebody else but the original developer of iText will answer a question like this), but you can get some inspiration by looking at the code samples from chapter 13 of the second edition of "iText in Action": http://itextpdf.com/book/chapter.php?id=13
See for instance the manipulatePdf() method in this example: http://itextpdf.com/examples/iia.php?id=239
In this example an URL is replaced by another one using the principle explained above.
You need to adapt this example so that you find the path to the place where the spot color name is stored, change that name, and persist the changes.
Hint: the Spot color name will be in an array of which the first element is a name (/Separation), the second entry will be the name you want to change (this is the one you want to replace with a new PdfName instance), and so on.
How to find this /Separation array? I would loop over the pages (the getPageN() method will give you the page dictionary), get the resources of every page (pageDict.getAsDict(PdfName.RESOURCES)), look for the existence of a /ColorSpace dictionary, then look for all the /Separation colors in that dictionary. Replace the second element whenever you encounter a name you want to change.
The examples in chapter 13 in combination with ISO-32000-1 (can be downloaded from the Adobe.com site) will lead the way.

Get document vector count in a pdf document?

I have a problem with some user-provided PDF documents. They are created from 3D packages and are basically a HUGE list of vector lines that take an age to render (over 60 secs).
How can I generate a report on the number of vector lines present in a PDF document using iTextSharp (5.0.5)?
I can get text and image data but can't see where to get a handle on vectors. They don't seem to be represented as an image.

iText[Sharp]'s parser package doesn't yet handle lineTo or curveTo commands. It's a goal, but not one that's been important enough to implement as yet. Other things are getting attention at the moment.
If you're feeling adventurous, you should check out PdfContentStreamProcessor. In a private function populateOperators, there's a long list of commands that are currently handled (in one fashion or another).
You'd need to write similar command classes for all the line art commands (moveTo, lineTo, rect, stroke, fill, clip), and expose them in some way.
Actually, if all you want to do is COUNT the number of paths, you could just implement stroke and fill to increment some static integer[s], then check them after parsing. Should be fairly simple (I'm writing in Java, but it's easy enough to translate):

    private static class CountOps implements ContentOperator {
        public static int operationCount = 0;

        public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) {
            ++operationCount;
        }
    }

Ah! registerContentOperator is a public function. You don't need to change iText's source at all:

    PdfContentStreamProcessor proc = new PdfContentStreamProcessor(null);
    CountOps countOps = new CountOps();
    proc.registerContentOperator("S", countOps);  // stroke the path
    proc.registerContentOperator("s", countOps);  // close & stroke
    proc.registerContentOperator("F", countOps);  // fill, backward compat
    proc.registerContentOperator("f", countOps);  // fill
    proc.registerContentOperator("f*", countOps); // fill with even-odd winding rule
    proc.registerContentOperator("B", countOps);  // fill & stroke
    proc.registerContentOperator("B*", countOps); // fill & stroke with even-odd
    proc.registerContentOperator("b", countOps);  // close, fill, & stroke
    proc.registerContentOperator("b*", countOps); // close, fill, & stroke with even-odd
    proc.processContent(contentBytes, pageResourceDict);
    int totalStrokesAndFills = CountOps.operationCount; // note that stroke & fill operators will be counted once, not twice.

Something like that. Note, though, that a null RenderListener will cause a null pointer exception if you run into any text or images. You could whip up a no-op listener yourself or use one of the existing ones and ignore its output.
PS: iTextSharp 5.0.6 should be released any day now if it isn't out already.

There is no specific Vector image. Normally it is just added to the content stream which is essentially a Vector data stream for drawing the whole page.
There is a blog article which you might find useful for understanding this at http://www.jpedal.org/PDFblog/2010/11/grow-your-own-pdf-file-%E2%80%93-part-5-path-objects/

How to detect superscript with ItextSharp?

Hi,
I am using iTextSharp to parse a PDF file to text output.
I want to know whether I can detect that the PDF contains subscript or superscript. Does anyone know how to tell the difference between a normal character and a superscript in a PDF using iTextSharp or another library?
Thanks

Disclaimer: I don't actually have any evidence for this but...
I would expect super/subscript to be identical to normal text. It's the same font, just smaller. If it happens to be on the same line as other text, super/sub scripts are raised and lowered - but you won't be able to detect that with some explicit meta-tag in a layout-oriented format such as PDF.
In other words, I'd guess that you need to identify super/subscripts by heuristics: finding text that's smaller and vertically displaced compared to other text on the "same" line. Whether that's easy to do or not depends on the PDF creator and the details of ITextSharp, since even identifying a "line" is not necessarily straightforward.

You are going to have to implement a bit of custom logic here. There is no tag denoting superscript/subscript in PDF, it is simply sitting upon a different baseline. In cases such as this, you will have to note your baseline (along with your height).
Some quick pseudo-code:

    // input -> curText, prevText (the preceding text on the same line)
    if (curText.Baseline > prevText.Baseline &&
        curText.Baseline < (prevText.Baseline + prevText.Height))
    {
        // This is most likely superscript
    }
    else if (curText.Baseline < prevText.Baseline &&
             prevText.Baseline < (curText.Baseline + curText.Height))
    {
        // This is most likely subscript
    }
    else
    {
        // This is probably normal text
    }
This solution requires you to organize the thoroughly unorganized nature of a PDF file. In the past I have used List<> of a custom class meant to organize all text of a given y coordinate into arrays. Using something like this you can then compare the separate lines and do whatever work to them you might want before painting or otherwise transmitting them.
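The organizing step could be sketched along these lines. This is not the code referred to above, just one minimal way to do it: Chunk and LineGrouper are hypothetical types, coordinates are assumed to be PDF page coordinates with y pointing up, and the tolerance has to be chosen large enough that raised or lowered text still lands on the same line as its neighbours.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder type; fill it from your own text extraction.
public sealed class Chunk
{
    public string Text;
    public float X, Baseline, Height;
}

public static class LineGrouper
{
    // Groups chunks into lines: a chunk joins the current line if its
    // baseline is within `tolerance` of that line's first chunk.
    public static List<List<Chunk>> GroupIntoLines(IEnumerable<Chunk> chunks, float tolerance)
    {
        var lines = new List<List<Chunk>>();

        // Highest baseline first, i.e. top of the page first.
        foreach (var chunk in chunks.OrderByDescending(c => c.Baseline))
        {
            var last = lines.LastOrDefault();
            if (last != null && Math.Abs(last[0].Baseline - chunk.Baseline) <= tolerance)
                last.Add(chunk);
            else
                lines.Add(new List<Chunk> { chunk });
        }

        // Within each line, restore left-to-right reading order.
        foreach (var line in lines)
            line.Sort((a, b) => a.X.CompareTo(b.X));

        return lines;
    }
}
```

With the chunks grouped like this, the baseline comparison from the pseudo-code can be applied to neighbouring chunks within each line. One known weakness of this simple grouping is that a line containing both a superscript and a subscript may be split if the tolerance is too tight.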
