abcpdf - Reuse existing font in PDF

abcpdf - Reuse existing font in PDF - c#

I am wondering if it's possible to re-use an existing font that has already been embedded in a PDF. I ask this because when I add a font that I wish to use to the PDF, it looks like it has been added multiple times to the PDF file:
I can't seem to way to search for a font by it's name. I am embedding the font like so:
Doc theDoc = new Doc();
theDoc.Read("existing-pdf-file.pdf");
int FONT_MyriadPro = theDoc.EmbedFont("Myriad Pro");
theDoc.Font = FONT_MyriadPro;
theDoc.AddText("Example");
I note that the FONT_MyriadPro variable has value of 61, so I presume that it's possible to reference other existing fonts. But can I know what the font is? There doesn't seem to be any collection of fonts in the Doc object.

The document itself may contain different fonts. These are not accessible via the XFont.FindName method but you can find them by looking through the document ObjectSoup.
To a certain extent fonts in a document may be reused. However it is not uncommon to find fonts in a state where they cannot be sensibly reused. For example, font subsetting often removes crucial characters that you may wish to use.
In most cases it is just better to use a globally available font that you know is not going to have been mangled.
Later if you should wish to rationalize multiple font subset that may exist in the document you can use the ReduceSizeOperation to do so.

Related

Is there any way to assign Id's to paragraphs in Open XML SDK 2.5?

I'm working on an application which has to create word documents with the use of Office Open XML SDK 2.5. The idea that I'm having now is that I will start from a template with an empty body (so I have all the namespaces etc. defined already), and add Paragraphsto it. If I need images I will add the ImageParts and try to give the ImagePart the Id present in the predefined paragraphpart which will contain the image. I will store the paragraphs as xml in a database, fetch the ones I need, fill in/modify some values if needed and insert them into my word document. But this is the tricky part, how can I easily insert them in a way so I don't have to query on their content to later on find one of the paragraphs? In other words, I need Id's. I have some options in mind:
For each possible paragraph I have, manually create a SdtBlock. This SdtBlock will have an Id which matches the Id of each paragraph in the database. This seems like a lot of manual work though, and I'd rather be able to create future word documents easier...
I chose this approach but I insert Building Blocks which can be stored in templates with a specific tagname.
Create the paragraphs, copy the xml from the developer tool, and manually add a ParagraphId. This seems even more of a nightmare though, because for every future new paragraphs I will have to create new Id's etc. Also it would be impossible to insert tables as there is no way (afaik) to give those an Id.
Work with bookmarks to know where to insert the data. I don't really like this either as bookmarks are visible for everyone. I know I can replace them, but then I don't have any way to identify individual paragraphs later on.
**** my database and just add everything in the template :D Remove the paragraphs I don't need by deleting the bookmarks with their content. This idea seems the worst of all though as I don't want to depend on having a templatefile with all possible content per word-file I need to generate.
Anyone with experience in OpenXml who knows which approach would be the best? Maybe there is another approach which is better and I have completely overlooked? The ideal solution would be that I can add Ids in Office Word but that's a no-go as I haven't found anything to do that yet.
Thanks in advance!

Content Controls (std) were designed for this, although I'm not sure the designers ever contemplated "targeting" each and every paragraph in the document...
Back in the 2003/2007 days it was possible to add custom XML mark-up to a Word document, which would have been exactly what you're looking for. But Microsoft lost a patent court case around 2009 and had to pull the functionality. So content controls are really your only good choice.
Your approach could, possibly, be combined with the BuildingBlocks concept. BuildingBlocks are Word content stored in a Word template file as valid Word Open XML. They can be assigned to "galleries" and categorized. There is a Content Control of type BuildingBlock that can be associated with a specific Gallery and Category which might help you to a certain extent and would be an alternative to storing content in a database.

Ok, I did a small research, you can do it in strict OpenXML, but only before you open your file in Word. Word will remove everything it cannot read.
using (WordprocessingDocument document = WordprocessingDocument.Open(path, true)) {
document.MainDocumentPart.Document.Body.Ancestors().First()
.SetAttribute(new OpenXmlAttribute() {
LocalName = "someIdName",
Value = "111" });
}
Here, for example, I set attribute "someIdName", which doesn't exits in OpenXML, to some random element. You can set it anywhere and use it as id

Selecting text by font in Word

Is there a way of extracting all lines that are using a particular font (size, is it bolded, font name, etc) in word via C#?
In addition, is there a way to find out what is the font for some text that is in the document?
My hunch is that there are functions in the Microsoft.Office.Interop.Word namespace that can do this, but I cannot seem to find them.
Edit: I am using word 2010.

You can loop through the document using the Find object from Word Interop. You can set the Find.Font.Name property for a Selection or Range from your document. Note that the Font interface has several Name* properties for various encodings.
EDIT
Here's the equivalent VBA code:
Dim selectionRange As Range
Set selectionRange = Application.ActiveDocument.Range
With selectionRange.Find
.ClearFormatting
.Format = True
.Font.NameBi = "Narkisim" //for doc without bidirectional script, use Name
Do While .Execute
MsgBox selectionRange.Text
Loop
End With
The object model from Word Interop is the same, see the link above.
Don't go asking me for C# code now... this is SO, we don't do silver platters. And if you're ever going to do serious work with the Office Interop API, you will need to be able to read VBA code.

Best way to extracting only the bold text from a PDF

iTextSharp is a great tool, I can use
PdfTextExtractor.GetTextFromPage(reader, iPage) + " ";
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the pdf, and not everything?
Any solution is useful, regardless of the programing language. Thank you

From within iText, You need to use the classes from the com.itextpdf.text.pdf.parser package.
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the world "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programattically, though you have to handle it on a case by case basis. You need to call TextRenderInfo.getTextRenderMode(). 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If either of those is the case, it's part of on of your headings.
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.

One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure if it does indeed recognize "headlines" as such (because PDF does not know much of structural markups, only visual ones) -- but it surely can tell you exact position and font used by each string of characters.

iTextSharp international text

I have a table in asp.net page,and trying to export it as a PDF file,I have couple of international characters that are not shown in generated PDF file,any suggestions,
Thanks in advance

The key for proper display of alternate characters sets (Russian, Chinese, Japanese, etc.) is to use IDENTITY_H encoding when creating the BaseFont.
Dim bfR As iTextSharp.text.pdf.BaseFont
bfR = iTextSharp.text.pdf.BaseFont.CreateFont("MyFavoriteFont.ttf", iTextSharp.text.pdf.BaseFont.IDENTITY_H, iTextSharp.text.pdf.BaseFont.EMBEDDED)
IDENTITY_H provides unicode support for your chosen font, so you should be able to display pretty much any character. I've used it for Russian, Greek, and all the different European language letters.
EDIT - 2013-May-28
This also works for v5.0.2 of iTextSharp.
EDIT - 2015-June-23
Given below is a complete code sample (in C#):
private void CreatePdf()
{
string testText = "đĔĐěÇøç";
string tmpFile = #"C:\test.pdf";
string myFont = #"C:\<<valid path to the font you want>>\verdana.ttf";
iTextSharp.text.Rectangle pgeSize = new iTextSharp.text.Rectangle(595, 792);
iTextSharp.text.Document doc = new iTextSharp.text.Document(pgeSize, 10, 10, 10, 10);
iTextSharp.text.pdf.PdfWriter wrtr;
wrtr = iTextSharp.text.pdf.PdfWriter.GetInstance(doc,
new System.IO.FileStream(tmpFile, System.IO.FileMode.Create));
doc.Open();
doc.NewPage();
iTextSharp.text.pdf.BaseFont bfR;
bfR = iTextSharp.text.pdf.BaseFont.CreateFont(myFont,
iTextSharp.text.pdf.BaseFont.IDENTITY_H,
iTextSharp.text.pdf.BaseFont.EMBEDDED);
iTextSharp.text.BaseColor clrBlack =
new iTextSharp.text.BaseColor(0, 0, 0);
iTextSharp.text.Font fntHead =
new iTextSharp.text.Font(bfR, 12, iTextSharp.text.Font.NORMAL, clrBlack);
iTextSharp.text.Paragraph pgr =
new iTextSharp.text.Paragraph(testText, fntHead);
doc.Add(pgr);
doc.Close();
}
This is a screenshot of the pdf file that is created:
An important point to remember is that if the font you have chosen does not support the characters you are trying to send to the pdf file, nothing you do in iTextSharp is going to change that. Verdana nicely displays the characters from all the European fonts I know of.
Other fonts may not be able to display as many characters.

There are two potential reasons characters aren't rendered:
The encoding. As Stewbob pointed out, Identity-H is a great way to avoid the issue entirely, though it does require you to embed a subset of the font. This has two consequences.
It increases the file size a bit over unembedded fonts.
The font has to be licensed for embedded subsets. Most are, some are not.
The font has to contain that character. If you ask for some Arabic ligatures out of a Cyrillic (Russian) font, chances aren't good that it'll be there. There are very few fonts that cover a variety of languages, and they tend to be HUGE. The biggest/most comprehensive font I've run into was "Arial Unicode MS". Over 23 megabytes.
That's another good reason to require embedding SUBSETS. Tacking on a few megabytes because you wanted to add a couple Chinese glyphs is a bit steep.
If you're feeling paranoid, you can check your strings against a given BaseFont instance (which I believe takes the encoding into account as well) with myBaseFont.charExists(someChar). If you have a font you're confident in, I wouldn't bother.
PS: There's another good reason that Identity-H requires an embedded subset. Identity-H reads the bytes from the content stream as Glyph Indexes. The order of glyphs can vary wildly from one font to the next, or even between versions of the same font. Relying on a viewers system to have the EXACT same font is a bad idea, so its illegal... particularly when Acrobat/Reader starts substituting fonts because it couldn't find the exact font you asked for and you didn't embed it.

You can try setting the encoding for the font you are using. In Java would be something like this:
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.EMBEDDED);
where the BaseFont.CP1252 is the encoding. Try to search for the exact encoding you need for the characters to be displayed.

It caused by default iTextSharp font - Helvetica - that does not support other than base characters (or not support all other characters.
There are actually 2 options:
One is to rewrite the table content by hand into the code. This approach might look faster to you, but it requires any modification to the original table to be repeated in the code as well (breaking DRY principle). In this case, you can easily set-up font as you wish.
The other is to extract PDF from HTML extracted from HtmlEngine. This might sound a bit more complicated and complex (and it is), however, working solution is much more flexible and universal. I suffered the struggle with special characters myself just a while ago and decided to post a somewhat complete solution under other similar solution here on stackoverflow: https://stackoverflow.com/a/24587745/1138663

How to determine the O/S filename of an installed font?

We use a third-party PDF Generator library which requires that you specify the TrueType or Type1 file name when using a font other than the 14 or so that are part of the default PDF standard.
So if I want to use Bitstream Arrus Bold I have to know to reference arrusb.ttf.
Short of enumerating all of the files in the font folder and creating a disposable PrivateFontCollection to extract the relationships, is there a way to obtain the underlying font name from font information, i.e. given Courier New, Bold, Italic derive CourBI.ttf?
I've already looked at the InstalledFontCollection and there's nothing.

If you don't mind poking around in the registry, take a look at
HKLM\Software\Microsoft\Windows NT\CurrentVersion\Fonts
It has pairs like
Name = "Arial (TrueType)"
Data = "arial.ttf"
You can do this the necessary queries like this:
static RegistryKey fontsKey =
Registry.LocalMachine.OpenSubKey(
#"Software\Microsoft\Windows NT\CurrentVersion\Fonts");
static public string GetFontFile(string fontName)
{
return fontsKey.GetValue(fontName, string.Empty) as string;
}
A call to GetFontFile("Arial (TrueType)") returns "arial.ttf"
You could of course modify the code to append the (TrueType) portion to the fontName, or to look through everything returned by fontsKey.GetValueNames() to find the best match.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.