We use a third-party PDF Generator library which requires that you specify the TrueType or Type1 file name when using a font other than the 14 or so that are part of the default PDF standard.
So if I want to use Bitstream Arrus Bold I have to know to reference arrusb.ttf.
Short of enumerating all of the files in the font folder and creating a disposable PrivateFontCollection to extract the relationships, is there a way to obtain the underlying font name from font information, i.e. given Courier New, Bold, Italic derive CourBI.ttf?
I've already looked at the InstalledFontCollection and there's nothing.
If you don't mind poking around in the registry, take a look at
HKLM\Software\Microsoft\Windows NT\CurrentVersion\Fonts
It has pairs like
Name = "Arial (TrueType)"
Data = "arial.ttf"
You can run the necessary queries like this:
using Microsoft.Win32;
static RegistryKey fontsKey =
    Registry.LocalMachine.OpenSubKey(
        @"Software\Microsoft\Windows NT\CurrentVersion\Fonts");
// Returns the file name registered under fontName, or an empty string if there is no such value.
public static string GetFontFile(string fontName)
{
    return fontsKey.GetValue(fontName, string.Empty) as string;
}
A call to GetFontFile("Arial (TrueType)") returns "arial.ttf".
You could of course modify the code to append the (TrueType) portion to the fontName, or to look through everything returned by fontsKey.GetValueNames() to find the best match.
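The "best match" idea could be sketched as follows. This is only one possible approach, not a definitive implementation: the class and method names (FontFileLookup, PickBestMatch, FindFontFile) are made up for illustration, and "best" is assumed here to mean the shortest value name that starts with the requested font name, ignoring case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Win32;

static class FontFileLookup
{
    const string FontsKeyPath =
        @"Software\Microsoft\Windows NT\CurrentVersion\Fonts";

    // Pure matching step: picks the shortest value name that starts with
    // fontName (case-insensitive), or null if none does.
    // "Shortest prefix match" is an assumption, not a Windows rule.
    public static string PickBestMatch(IEnumerable<string> valueNames, string fontName)
    {
        return valueNames
            .Where(n => n.StartsWith(fontName, StringComparison.OrdinalIgnoreCase))
            .OrderBy(n => n.Length)
            .FirstOrDefault();
    }

    // Looks the font up in the registry and returns the registered file name,
    // or null if the key is missing or nothing matches.
    public static string FindFontFile(string fontName)
    {
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(FontsKeyPath))
        {
            if (key == null) return null;
            string match = PickBestMatch(key.GetValueNames(), fontName);
            return match == null ? null : key.GetValue(match) as string;
        }
    }
}
```

With this sketch, FontFileLookup.FindFontFile("Arial") would pick "Arial (TrueType)" over "Arial Bold (TrueType)" when both are registered, because the former is shorter. The file name returned depends on what is installed on the machine.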
I am using the following code to extract text from the first page of PDF files with iTextSharp:
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem to be quite similar, but one works and the other does not.
I guess the fact that their producer tags are not the same is a clue here.
Another clue is that this function works for any other page of the PDF that does not contain a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF, using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences array, and neither uses a ToUnicode map. The glyph naming seems fairly straightforward: the number following the MT prefix is the ASCII code of the glyph.
The first document works, because the mapping is not changed at all and iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means that, for example, character code 2 should map to the glyph named MT76. That is an unofficial/private glyph name that iText doesn't know, so it has no information beyond the character code 2 and will use this code in the final result (I guess).
It's impossible to get the correct text out of this document without implementing logic for the MT-prefixed glyph names. However, nothing defines that a glyph name beginning with MT followed by an integer maps to that ASCII value. It is either a coincidence or a convention of the font designer or creation tool, wherever the font came from.
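As a rough illustration of what such logic might look like, the sketch below turns a Differences array into a character-code-to-character map. It assumes the undocumented "MT followed by an ASCII code" convention observed in these two files and that the first code and the glyph names have already been read from the PDF. The class and method names are hypothetical, and nothing here is part of iText.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class MtGlyphNames
{
    // Builds a map from PDF character code to character, given the first code
    // of a Differences run and the glyph names that follow it (without the
    // leading slash). Names that do not fit the assumed "MT<ascii>" pattern
    // are skipped, but still consume a character code.
    public static Dictionary<int, char> BuildMap(int firstCode, IEnumerable<string> glyphNames)
    {
        var map = new Dictionary<int, char>();
        int code = firstCode;
        foreach (string name in glyphNames)
        {
            int ascii;
            if (name.StartsWith("MT", StringComparison.Ordinal) &&
                int.TryParse(name.Substring(2), NumberStyles.None,
                             CultureInfo.InvariantCulture, out ascii) &&
                ascii >= 32 && ascii < 127)
            {
                map[code] = (char)ascii;
            }
            code++;
        }
        return map;
    }
}
```

For the second document's array, MtGlyphNames.BuildMap(2, new[] { "MT76", "MT105", "MT103", "MT104" }) maps code 2 to 'L', 3 to 'i', 4 to 'g' and 5 to 'h'. The extracted string could then be translated character by character through that map.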
The 2nd PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs but the text representation was not correctly encoded for some reason when this PDF was generated. If you have a lot of files like this, then a working approach could be:
detect broken pages while extracting text by searching for some phrase that should appear on every page, maybe like "service"
process these pages separately using OCR with a tool like Tesseract and a .NET wrapper
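The detection step could be as simple as the sketch below. The helper name and the choice to treat an empty page as broken are assumptions. The phrase to look for depends entirely on your documents.

```csharp
using System;

static class BrokenPageDetector
{
    // Returns true when the extracted text is empty or does not contain the
    // phrase that every correctly decoded page is expected to contain.
    public static bool LooksBroken(string pageText, string expectedPhrase)
    {
        if (string.IsNullOrWhiteSpace(pageText)) return true;
        return pageText.IndexOf(expectedPhrase, StringComparison.OrdinalIgnoreCase) < 0;
    }
}
```

Pages for which LooksBroken returns true would then be rendered to an image and passed to the OCR tool instead.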
I am wondering if it's possible to re-use an existing font that has already been embedded in a PDF. I ask this because when I add a font that I wish to use to the PDF, it looks like it has been added multiple times to the PDF file.
I can't seem to find a way to search for a font by its name. I am embedding the font like so:
Doc theDoc = new Doc();
theDoc.Read("existing-pdf-file.pdf");
int FONT_MyriadPro = theDoc.EmbedFont("Myriad Pro");
theDoc.Font = FONT_MyriadPro;
theDoc.AddText("Example");
I note that the FONT_MyriadPro variable has a value of 61, so I presume that it's possible to reference other existing fonts. But how can I know which font an ID refers to? There doesn't seem to be any collection of fonts in the Doc object.
The document itself may contain different fonts. These are not accessible via the XFont.FindName method but you can find them by looking through the document ObjectSoup.
To a certain extent fonts in a document may be reused. However it is not uncommon to find fonts in a state where they cannot be sensibly reused. For example, font subsetting often removes crucial characters that you may wish to use.
In most cases it is just better to use a globally available font that you know is not going to have been mangled.
Later, if you wish to rationalize multiple font subsets that may exist in the document, you can use the ReduceSizeOperation to do so.
I'm trying to show currency symbols in a dropdown in my monodroid application.
Some currency units use symbols such as "र", but when I run the application, the dropdown just shows a rectangle instead of "र".
How can I make it human-readable?
EDIT
Actually I parse this json for accessing to the unit ( saving the name attribute to a string variable):
{"id":"167","name":"\u0930","type":"4","enabled":"1","tosi":"0.0182","index":"1","extra":"INR","extra2":"Indian Rupee","extra3":"India","extra4":"Paisa","seperator":",","d_seperator":"","after_before":"0"},
When I parse it, the string variable contains "र" at run time, but when I show it in the dropdown the device shows a box.
So, following Sam's comment, I use this code. I pass the string variable to the method and show the returned string in the dropdown, but I still see a box :(
public static string ConvertUnitsEncoding(Activity act, string Encoded)
{
    try
    {
        if (Encoded == "र")
            return act.Resources.GetString(Resource.String.IndianUnit);
        else
            return Encoded;
    }
    catch (Exception ex)
    {
        RltLog.HandleException(ex);
        return Encoded;
    }
}
You've got two options:
Either load a custom font that includes that special character:
// Inside an Activity; put the font file in the Assets folder
TextView tv = FindViewById<TextView>(Resource.Id.tv);
Typeface tf = Typeface.CreateFromAsset(Assets, "Symbol.ttf");
tv.SetTypeface(tf, TypefaceStyle.Normal);
Most of the installed fonts on Windows have a currency subset which includes currency symbols but not Rupee. I read somewhere that Microsoft Update will add the Rupee to the fonts but I don't have it on my system. I have found Amty Currency Font with Rupee support but I'm not sure how useful it would be for your case. Try it.
Or simply use a small image for that purpose. I would prefer this approach because it's platform independent and you can find lots of symbol icons out there.
iTextSharp is a great tool, I can use
PdfTextExtractor.GetTextFromPage(reader, iPage) + " ";
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, and not everything?
Any solution is useful, regardless of the programming language. Thank you
From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package.
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the word "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case by case basis. You need to call TextRenderInfo.getTextRenderMode(). 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If the font name contains "bold" or the render mode is "stroke and fill", the text is probably part of one of your headings.
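In iTextSharp (the C# port, version 5.x assumed), a strategy along those lines might look like the sketch below. The class name BoldTextStrategy and the helper LooksBold are made up for illustration. Unlike the suggestion above, GetResultantText is not stubbed out here, because it is what returns the collected text. The chunks are appended as-is, with no attempt to restore spacing or line breaks.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf.parser;

// Collects only the text chunks that look bold.
class BoldTextStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder result = new StringBuilder();

    // Bold if the font name contains "bold" (case-insensitive) or the
    // text render mode is 2 (the "stroke and fill" mode described above).
    public static bool LooksBold(string fontName, int renderMode)
    {
        bool boldName = fontName != null &&
            fontName.IndexOf("bold", StringComparison.OrdinalIgnoreCase) >= 0;
        return boldName || renderMode == 2;
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        string fontName = renderInfo.GetFont().PostscriptFontName;
        if (LooksBold(fontName, renderInfo.GetTextRenderMode()))
            result.Append(renderInfo.GetText());
    }

    // Nothing to do for these callbacks in this sketch.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public string GetResultantText() { return result.ToString(); }
}
```

It would be used like the call in the question: PdfTextExtractor.GetTextFromPage(reader, iPage, new BoldTextStrategy()).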
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.
One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure if it does indeed recognize "headlines" as such (because PDF does not know much of structural markups, only visual ones) -- but it surely can tell you exact position and font used by each string of characters.
I've been having a hard time trying to find the right answer to my problem. I have spent numerous days searching online and in the documentation and have found nothing.
I have a text file that contains a bunch of text. One of the lines in the file contains some font info like this:
Tahoma,12.5,Regular
Note that the font info will not always have the same font name, size or style so I can't just set it manually.
When this file is opened into my app, it will parse the contents (which I have mostly covered), I just need some help converting the above font string into an actual font object and then assigning that font to a control, i.e. a label etc...
Can somebody please help me with this?
You will want to use the Font class. Assuming you use String.Split() to parse the text into an array you will want to take each part of the array and use it to create a Font object like:
string s = "Tahoma,12.5,Regular";
string[] fi = s.Split(',');
Font font = new Font(fi[0], fi[1],fi[2]);
I don't have a C# compiler on this Mac so it may not be exactly correct.
Example constructor:
public Font(
string familyName,
float emSize,
FontStyle style
)
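Here you need to specify the second argument as a float. A string cannot be cast to a float directly, so parse it instead (CultureInfo is in System.Globalization):
float.Parse(fi[1], CultureInfo.InvariantCulture)
Next you need to look up a FontStyle based on what fi[2] is: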
if (fi[2] == "Regular") {
// set font style
}
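Putting the pieces together, a minimal sketch could look like this. The class and method names (FontParser, Parse) are hypothetical, and it assumes the line always has exactly three comma-separated parts, with the size written using a dot as the decimal separator. Enum.Parse stands in for the chain of if statements.

```csharp
using System;
using System.Drawing;
using System.Globalization;

static class FontParser
{
    // Parses a line such as "Tahoma,12.5,Regular" into a Font.
    // Throws FormatException or ArgumentException on malformed input.
    public static Font Parse(string s)
    {
        string[] parts = s.Split(',');
        if (parts.Length != 3)
            throw new FormatException("Expected 'family,size,style'.");

        string family = parts[0].Trim();
        // InvariantCulture so "12.5" parses the same on every machine.
        float size = float.Parse(parts[1], CultureInfo.InvariantCulture);
        // Case-insensitive lookup of Regular, Bold, Italic, Underline, Strikeout.
        FontStyle style = (FontStyle)Enum.Parse(typeof(FontStyle), parts[2].Trim(), true);

        return new Font(family, size, style);
    }
}
```

Assigning the result to a control is then a one-liner, for example label1.Font = FontParser.Parse("Tahoma,12.5,Regular"); where label1 is whatever control you want to restyle. Combined styles such as bold italic cannot be expressed with this comma-separated format without changing the separator.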