I have a table in an ASP.NET page and I am trying to export it as a PDF file. A couple of international characters are not shown in the generated PDF file. Any suggestions?
Thanks in advance.
The key for proper display of alternate characters sets (Russian, Chinese, Japanese, etc.) is to use IDENTITY_H encoding when creating the BaseFont.
Dim bfR As iTextSharp.text.pdf.BaseFont
bfR = iTextSharp.text.pdf.BaseFont.CreateFont("MyFavoriteFont.ttf", iTextSharp.text.pdf.BaseFont.IDENTITY_H, iTextSharp.text.pdf.BaseFont.EMBEDDED)
IDENTITY_H provides Unicode support for your chosen font, so you should be able to display pretty much any character. I've used it for Russian, Greek, and all the different European-language letters.
EDIT - 2013-May-28
This also works for v5.0.2 of iTextSharp.
EDIT - 2015-June-23
Given below is a complete code sample (in C#):
private void CreatePdf()
{
    string testText = "đĔĐěÇøç";
    string tmpFile = @"C:\test.pdf";
    string myFont = @"C:\<<valid path to the font you want>>\verdana.ttf";

    iTextSharp.text.Rectangle pgeSize = new iTextSharp.text.Rectangle(595, 792);
    iTextSharp.text.Document doc = new iTextSharp.text.Document(pgeSize, 10, 10, 10, 10);
    iTextSharp.text.pdf.PdfWriter wrtr;
    wrtr = iTextSharp.text.pdf.PdfWriter.GetInstance(doc,
        new System.IO.FileStream(tmpFile, System.IO.FileMode.Create));
    doc.Open();
    doc.NewPage();

    // IDENTITY_H + EMBEDDED is what makes the international characters render.
    iTextSharp.text.pdf.BaseFont bfR;
    bfR = iTextSharp.text.pdf.BaseFont.CreateFont(myFont,
        iTextSharp.text.pdf.BaseFont.IDENTITY_H,
        iTextSharp.text.pdf.BaseFont.EMBEDDED);

    iTextSharp.text.BaseColor clrBlack = new iTextSharp.text.BaseColor(0, 0, 0);
    iTextSharp.text.Font fntHead =
        new iTextSharp.text.Font(bfR, 12, iTextSharp.text.Font.NORMAL, clrBlack);
    iTextSharp.text.Paragraph pgr =
        new iTextSharp.text.Paragraph(testText, fntHead);
    doc.Add(pgr);
    doc.Close();
}
This is a screenshot of the pdf file that is created:
An important point to remember is that if the font you have chosen does not support the characters you are trying to send to the PDF file, nothing you do in iTextSharp is going to change that. Verdana nicely displays the characters from all the European languages I know of; other fonts may not be able to display as many characters.
There are two potential reasons characters aren't rendered:
1) The encoding. As Stewbob pointed out, Identity-H is a great way to avoid the issue entirely, though it does require you to embed a subset of the font. This has two consequences: it increases the file size a bit over unembedded fonts, and the font has to be licensed for embedded subsets (most are, some are not).
2) The font has to contain that character. If you ask for some Arabic ligatures out of a Cyrillic (Russian) font, chances aren't good that they'll be there. There are very few fonts that cover a variety of languages, and they tend to be HUGE. The biggest/most comprehensive font I've run into was "Arial Unicode MS", at over 23 megabytes.
That's another good reason to require embedding SUBSETS. Tacking on a few megabytes because you wanted to add a couple of Chinese glyphs is a bit steep.
If you're feeling paranoid, you can check your strings against a given BaseFont instance (which I believe takes the encoding into account as well) with myBaseFont.charExists(someChar). If you have a font you're confident in, I wouldn't bother.
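For example, a quick pre-flight check along these lines (a minimal C# sketch for iTextSharp 5.x, where the method is spelled CharExists; the font path and test string are placeholders) will tell you up front which characters your chosen font cannot render:
iTextSharp.text.pdf.BaseFont bf = iTextSharp.text.pdf.BaseFont.CreateFont(
    @"C:\Windows\Fonts\verdana.ttf",
    iTextSharp.text.pdf.BaseFont.IDENTITY_H,
    iTextSharp.text.pdf.BaseFont.EMBEDDED);
foreach (char c in "đĔĐěÇøç")
{
    // CharExists reports whether the font (with the chosen encoding) can render this character.
    if (!bf.CharExists(c))
        System.Console.WriteLine("Missing glyph for U+" + ((int)c).ToString("X4"));
}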
PS: There's another good reason that Identity-H requires an embedded subset. Identity-H reads the bytes from the content stream as glyph indexes. The order of glyphs can vary wildly from one font to the next, or even between versions of the same font. Relying on a viewer's system to have the EXACT same font is a bad idea, so it's illegal... particularly when Acrobat/Reader starts substituting fonts because it couldn't find the exact font you asked for and you didn't embed it.
You can try setting the encoding for the font you are using. In Java it would be something like this:
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.EMBEDDED);
where BaseFont.CP1252 is the encoding. Try to find the exact encoding you need for your characters to be displayed.
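The C# equivalent with iTextSharp would look roughly like this (a sketch; the built-in Helvetica and CP1252 are just examples, and the standard 14 fonts are never actually embedded):
iTextSharp.text.pdf.BaseFont bf = iTextSharp.text.pdf.BaseFont.CreateFont(
    iTextSharp.text.pdf.BaseFont.HELVETICA,
    iTextSharp.text.pdf.BaseFont.CP1252,
    iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED); // built-in fonts are referenced, not embedded
iTextSharp.text.Font font = new iTextSharp.text.Font(bf, 12, iTextSharp.text.Font.NORMAL);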
It is caused by the default iTextSharp font, Helvetica, which does not support anything beyond the basic character set (or at least does not support all of the other characters).
There are actually two options:
One is to rewrite the table content by hand in the code. This approach might look faster to you, but it requires every modification of the original table to be repeated in the code as well (breaking the DRY principle). In this case, you can easily set up the font as you wish.
The other is to generate the PDF from the HTML produced by the HtmlEngine. This might sound a bit more complicated (and it is); however, the resulting solution is much more flexible and universal. I struggled with special characters myself a while ago and posted a fairly complete solution under another similar question here on Stack Overflow: https://stackoverflow.com/a/24587745/1138663
I used this example in C#:
https://kb.itextpdf.com/home/it7kb/examples/replacing-pdf-objects
The problem is with this line of my code:
String replacedData = IO.Util.JavaUtil.GetStringForBytes(data).Replace(placeholder, replacetext);
The string replacetext is: 34,60
In the final PDF, the comma is not rendered correctly; it shows up as a framed question mark, with the following digit drawn on top of it.
What can I do? Any ideas?
Actually that example should have a mile-high warning sign. It only works under very benign circumstances, and depending on your search and replacement texts you can easily damage the content stream contents.
In your case the circumstances are not benign: It looks like the font in question is only subset-embedded, i.e. only the glyphs used in the original PDF are embedded. Apparently the comma glyph is not used originally, so it is not included in the embedded subset and cannot be displayed; instead that framed question mark is shown. (It could also be a case of a not quite standard encoded font.)
Additionally, the widths of the excluded glyphs appear to be set to 0, causing the '6' to be drawn over the replacement glyph for the comma.
I am using the following code to extract text from the first page of PDF files with iTextSharp:
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem to be quite similar, but one works and the other does not.
I guess the fact that their Producer tags are not the same is a clue here.
Another clue is that this function works for any other page of the PDF that does not contain a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences arrays, and neither uses a ToUnicode map. The glyph naming seems to be fairly straightforward: the number following the MT prefix is the ASCII code of the glyph used.
The first document works because the mapping is not changed at all, so iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means that, e.g., character code 2 should map to the glyph named MT76, which is an unofficial/private glyph name that iText doesn't know. iText therefore has no information beyond the character code 2 itself and will use that code in the final result (I guess).
Without implementing logic for the MT-prefixed glyph names, it's impossible to get the correct text out of this document. And it is nowhere defined that a glyph name consisting of MT followed by an integer maps to that ASCII value; that is simply an accident, or whatever the font designer/creation tool happened to do.
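If you do decide to implement that workaround, the name-to-character step itself is simple. Here is a minimal C# sketch (the helper name is made up, and you would still need to wire it into your own handling of the Differences array during extraction):
// Maps an unofficial glyph name like "MT76" to the character whose ASCII code follows the prefix.
// Returns null when the name does not follow the MT<number> pattern.
static char? CharFromMtGlyphName(string glyphName)
{
    int code;
    if (glyphName != null && glyphName.StartsWith("MT") &&
        int.TryParse(glyphName.Substring(2), out code) && code >= 32 && code <= 126)
    {
        return (char)code;
    }
    return null; // not an MT-style name; no safe mapping
}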
The 2nd PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not correctly encoded for some reason during the generation of this PDF. If you have a lot of files like this, a working approach could be:
detect broken pages while extracting text by searching for some phrase that should appear on every page, e.g. "service" (see the sketch after this list)
process these pages separately using OCR with tools like Tesseract with a .NET wrapper
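A minimal sketch of that detection step with iTextSharp 5.x (the method name and marker phrase are placeholders):
static bool PageTextLooksBroken(string fileName, int pageNumber, string markerPhrase)
{
    using (var reader = new PdfReader(fileName))
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, new SimpleTextExtractionStrategy());
        // If the phrase that should appear on every page is missing from the extracted text,
        // assume the text layer is unusable and queue this page for OCR instead.
        return text.IndexOf(markerPhrase, StringComparison.OrdinalIgnoreCase) < 0;
    }
}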
I am wondering if it's possible to re-use an existing font that has already been embedded in a PDF. I ask this because when I add a font that I wish to use to the PDF, it looks like it has been added multiple times to the PDF file:
I can't seem to find a way to search for a font by its name. I am embedding the font like so:
Doc theDoc = new Doc();
theDoc.Read("existing-pdf-file.pdf");
int FONT_MyriadPro = theDoc.EmbedFont("Myriad Pro");
theDoc.Font = FONT_MyriadPro;
theDoc.AddText("Example");
I note that the FONT_MyriadPro variable has a value of 61, so I presume that it's possible to reference other existing fonts. But how can I know what that font is? There doesn't seem to be any collection of fonts in the Doc object.
The document itself may contain different fonts. These are not accessible via the XFont.FindName method but you can find them by looking through the document ObjectSoup.
To a certain extent fonts in a document may be reused. However it is not uncommon to find fonts in a state where they cannot be sensibly reused. For example, font subsetting often removes crucial characters that you may wish to use.
In most cases it is just better to use a globally available font that you know is not going to have been mangled.
Later, should you wish to rationalize multiple font subsets that may exist in the document, you can use the ReduceSizeOperation to do so.
iTextSharp is a great tool. I can use
PdfTextExtractor.GetTextFromPage(reader, iPage) + " ";
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, and not everything?
Any solution is useful, regardless of the programming language. Thank you.
From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package.
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the word "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case-by-case basis. You need to call TextRenderInfo.getTextRenderMode(): 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case-insensitive) and the text render mode. If either matches, it's part of one of your headings.
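Putting that together, here is a minimal C# sketch for iTextSharp 5.x (the class name is made up; it simply collects text drawn with a font whose name contains "bold", or with the "stroke and fill" render mode):
public class BoldTextExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
    private readonly System.Text.StringBuilder result = new System.Text.StringBuilder();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo) { }

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
    {
        string fontName = renderInfo.GetFont().PostscriptFontName;
        bool boldFont = fontName != null &&
            fontName.IndexOf("bold", System.StringComparison.OrdinalIgnoreCase) >= 0;
        bool strokeAndFill = renderInfo.GetTextRenderMode() == 2; // 2 = stroke and fill
        if (boldFont || strokeAndFill)
            result.Append(renderInfo.GetText());
    }

    public string GetResultantText()
    {
        return result.ToString();
    }
}
You would then pass an instance of it to PdfTextExtractor.GetTextFromPage in place of the default strategy.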
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.
One of your best bets for this job is surely TET by pdflib.com, with its ability to extract to the TETML format. It is available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure if it does indeed recognize "headlines" as such (because PDF carries little structural markup, only visual markup) -- but it can certainly tell you the exact position and font used by each string of characters.
I'm trying to display ASCII art in a textbox.
If I open a specific .nfo file in Notepad with the font "Lucida Console", 9 pt, regular, it looks like this:
http://i48.tinypic.com/24zvvnr.png
In my app I set the font of the textbox to "Lucida Console", 9 pt, regular, and it looks like this:
http://i49.tinypic.com/2ihq8h0.png
What am I doing wrong?
(Or: what should I do to get it to look like it does in Notepad?)
Your problem can be summed up like this: ASCII is not UTF-8, and UTF-8 is not ASCII.
The StreamReader(string) constructor initializes the StreamReader to use Encoding.UTF8, which is a UTF-8 decoder that silently attempts to resolve invalid characters. A very quick glance at the Wikipedia page for .nfo files reveals that most .nfo files are generally encoded in Extended ASCII (aka Code Page 437). While the first 127 characters of ASCII map to the first 127 bit patterns of UTF-8, the encodings are not the same, and so you will get incorrect characters if you use one where the other is expected.
You probably want:
System.Text.Encoding encoding = System.Text.Encoding.GetEncoding(437);
System.IO.StreamReader file = new System.IO.StreamReader(fileName, encoding);
You're probably reading the file with the wrong encoding.
How are you opening the file?