I used this example in C#:
https://kb.itextpdf.com/home/it7kb/examples/replacing-pdf-objects
The problem is in this line of my code:
String replacedData = IO.Util.JavaUtil.GetStringForBytes(data).Replace(placeholder, replacetext);
The string replacetext is "34,60". In the final PDF, the comma is rendered as a framed question mark, and the '6' is drawn over it.
What can I do, any ideas?
Actually, that example should have a mile-high warning sign. It only works under very benign circumstances, and depending on your search and replacement texts you can easily damage the contents of the content stream.
In your case the circumstances are not benign: it looks like the font in question is only subset-embedded, i.e. only the glyphs used in the original PDF are embedded. Apparently the comma glyph is not used originally, so it is not included in the embedded subset and cannot be displayed; instead, that framed question mark is shown. (It could also be a case of a font with a not-quite-standard encoding.)
Additionally, the widths of the excluded glyphs appear to be set to 0, causing the '6' to be drawn over the replacement glyph for the comma.
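If you want to confirm the subset-embedding diagnosis yourself, you can list the fonts in the page's resources; subset-embedded fonts are recognizable by the six-letter tag and '+' prefixed to the base font name. A minimal diagnostic sketch, assuming the iText 7 API used by the linked example ("input.pdf" is a placeholder):

// Diagnostic sketch (iText 7 assumed): list fonts on page 1 and flag
// subset embedding, recognizable by an "ABCDEF+" style name prefix.
using System;
using iText.Kernel.Pdf;

using (var pdfDoc = new PdfDocument(new PdfReader("input.pdf")))
{
    PdfDictionary resources = pdfDoc.GetPage(1).GetPdfObject().GetAsDictionary(PdfName.Resources);
    PdfDictionary fonts = resources == null ? null : resources.GetAsDictionary(PdfName.Font);
    if (fonts != null)
    {
        foreach (PdfName key in fonts.KeySet())
        {
            PdfDictionary font = fonts.GetAsDictionary(key);
            PdfName baseFont = font == null ? null : font.GetAsName(PdfName.BaseFont);
            string name = baseFont == null ? "?" : baseFont.GetValue(); // e.g. "ABCDEF+ArialMT"
            bool subset = name.Length > 7 && name[6] == '+';            // six-letter subset tag
            Console.WriteLine(name + "  subset-embedded: " + subset);
        }
    }
}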
I am using the following code to extract text from the first page of PDF files with iTextSharp:
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem quite similar, but one works and the other does not.
I guess the fact that their Producer tags differ is a clue here.
Another clue is that this function works for any other page of the PDF without a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks
Both documents use fonts with unofficial glyph names in their Encoding/Differences arrays, and neither uses a ToUnicode map. The glyph naming seems to be fairly straightforward: the number following the MT prefix is the ASCII code of the glyph used.
The first document works because the mapping is not changed at all, so iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means that, e.g., character code 2 should map to the glyph named MT76. That is an unofficial/private glyph name that iText doesn't know, so it has no information beyond the character code 2 itself and will use that code in the final result (I guess).
Without implementing logic for the MT-prefixed glyph names, it's impossible to get the correct text out of this document. And it is nowhere defined that a glyph name consisting of MT followed by an integer maps to that integer as an ASCII value; that's simply an accident, or a convention implemented by the font designer/creation tool, wherever it came from. A sketch of such logic follows below.
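For completeness, here is a rough sketch of what such logic might look like, assuming iTextSharp 5 (which the question uses), simple fonts, and /Differences arrays exactly like the ones above; this is an illustration, not a general solution:

// Hypothetical sketch: read each font's /Differences array on a page and build
// a map from character code to the ASCII value implied by its "MT<n>" glyph name.
using System.Collections.Generic;
using iTextSharp.text.pdf;

static Dictionary<int, char> BuildMtGlyphMap(PdfReader reader, int page)
{
    var map = new Dictionary<int, char>();
    PdfDictionary fonts = reader.GetPageN(page)
        .GetAsDict(PdfName.RESOURCES)
        .GetAsDict(PdfName.FONT);
    if (fonts == null) return map;
    foreach (PdfName key in fonts.Keys)
    {
        PdfDictionary encoding = fonts.GetAsDict(key).GetAsDict(PdfName.ENCODING);
        PdfArray diffs = encoding == null ? null : encoding.GetAsArray(PdfName.DIFFERENCES);
        if (diffs == null) continue;
        int code = 0;
        for (int i = 0; i < diffs.Size; i++)
        {
            PdfObject item = diffs[i];
            if (item.IsNumber())
            {
                code = ((PdfNumber)item).IntValue; // a number sets the next code
            }
            else
            {
                string glyph = item.ToString();    // e.g. "/MT76"
                if (glyph.StartsWith("/MT"))
                    map[code] = (char)int.Parse(glyph.Substring(3));
                code++;                            // names occupy consecutive codes
            }
        }
    }
    return map;
}

Applying it would mean a post-processing pass over the extracted string, replacing each raw character code that appears in the map with its mapped character.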
The 2nd PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs, but the text representation was not correctly encoded for some reason during the generation of this PDF. If you have a lot of files like this, then a working approach could be:
detect broken pages while extracting text by searching for some phrase that should appear on every page, e.g. "service" (see the sketch after this list)
process these pages separately using OCR, with tools like Tesseract and a .NET wrapper
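A minimal sketch of the detection step, assuming iTextSharp 5 and that the chosen phrase really does appear on every correctly encoded page (both are assumptions to adapt):

// Sketch: flag pages whose extracted text is likely garbled by checking for a
// phrase that every good page should contain; route flagged pages to OCR.
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static bool PageLooksBroken(PdfReader reader, int page, string expectedPhrase)
{
    string text = PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy());
    return text.IndexOf(expectedPhrase, StringComparison.OrdinalIgnoreCase) < 0;
}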
I am using a StringBuilder in C# to append some text, which can be English (left-to-right) or Arabic (right-to-left):
stringBuilder.Append("(");
stringBuilder.Append(text);
stringBuilder.Append(") ");
stringBuilder.Append(text);
If text = "A", then output is "(A) A"
But if text = "بتث", then output is "(بتث) بتث"
Any ideas?
This is a well-known flaw in the Windows text rendering engine when it is asked to render right-to-left text, Arabic or Hebrew. It has a difficult problem to solve: people often fall back to Western words and punctuation when there is no good alternative word available in the language, brand and company names for example. The renderer tries to guess at the proper render order by looking at the code points, with characters in the Latin character set clearly having to be rendered left-to-right.
But it fumbles at punctuation, with brackets being the most visible. You have to be explicit about it so it knows what to do: you must use the Unicode right-to-left mark, U+200F, or \u200f in C# code. Conversely, use the left-to-right mark, U+200E, if you know you need LTR rendering.
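Applied to the code from the question, one possible placement of the marks looks like this (a sketch; the exact position may need tweaking for your renderer):

// Sketch: insert the right-to-left mark (U+200F) so the brackets are laid out
// as part of the RTL run instead of being reordered by the bidi algorithm.
stringBuilder.Append('\u200F'); // RLM: the run that follows is right-to-left
stringBuilder.Append("(");
stringBuilder.Append(text);
stringBuilder.Append('\u200F'); // RLM: keep the closing bracket with the RTL run
stringBuilder.Append(") ");
stringBuilder.Append(text);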
Use AppendFormat instead of just Append:
stringBuilder.AppendFormat("({0}) {0}", text);
This may fix the issue, but it may not; you need to look at the text value. It probably has LTR/RTL marker characters embedded, which need to be either removed or corrected in the value.
I had a similar issue and managed to solve it by creating a function that checks each char's Unicode value. If it is from the FE page (which contains the Arabic presentation forms), I add U+202C (Pop Directional Formatting) after it, as shown below. Without this, the RTL and LTR runs got mixed up for what I wanted.
string us = string.Format("\uFE9E\u202C\uFE98\u202C\uFEB8\u202C\uFEC6\u202C\uFEEB\u202C\u0020\u0660\u0662\u0664\u0668 Aa1");
I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew nor enough about how the involved technologies represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?
There are various ways to encode RTL languages. The most common way (and Windows' default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether the first letter visually appears on the left or right side of the screen doesn't affect the order in which the letters are stored.
Now as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, and it was apparent when you tried to select Hebrew text, which reversed in an odd way (visually confusing). It appears this issue no longer exists in Visual Studio 2010 (at least with SP1, which I just tested).
Let's take a Hebrew word whose direction is clearer to non-Hebrew speakers than the string specified in your test:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
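A short runnable demonstration of that ordering (the comments state what should print, assuming a console that renders Hebrew):

// Logical ordering demo: the first stored character is the letter that is
// read first, regardless of where it is drawn on screen.
using System;

string ion = "\u05D9\u05D5\u05DF";          // יון
Console.WriteLine(ion.Substring(0, 1));     // the short letter, read first
Console.WriteLine(ion[0] == '\u05D9');      // True: stored in logical order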
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or if it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears the actual result of your test is correct and the expected value is incorrect, so you should fix the test.
I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
Don't do half measures with RTL; it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty; I wish you the best of luck!
I'm writing a console app that needs to print some atypical (for a console app) unicode characters such as musical notes, box drawing symbols, etc.
Most characters show up correctly, or show a ? if the glyph doesn't exist for whatever font the console is using. However, I found one character that behaves oddly, which can be demonstrated with the lines below:
Console.Write("ABC");
Console.Write('♪'); //This is the same as: Console.Write((char)0x266A);
Console.Write("XYZ");
When this is run it will print ABC then move the cursor back to the start of the line and overwrite it with XYZ. Why does this happen?
The console doesn't use Unicode, so the characters have to be translated to an 8-bit code page. The ♪ character is converted to the character with code 13 (hex 0x0D), which is CR or Carriage Return.
In most code pages, for example code page 850, the CR character's glyph resembles a quarter note, and U+266A is specified as its Unicode equivalent.
However, if you write the CR character to the console, it will not display the quarter note glyph, instead it is interpreted as the control character CR which moves the cursor to the beginning of the line.
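One way around the code-page translation, assuming .NET 4.5 or later and a console font that actually has the glyph (both assumptions), is to switch the console to UTF-8 output:

// Sketch: write UTF-8 directly so ♪ is not squeezed through an 8-bit code
// page (where it maps to CR). Still requires a console font with the glyph.
using System;
using System.Text;

Console.OutputEncoding = Encoding.UTF8;
Console.Write("ABC");
Console.Write('\u266A'); // ♪ emitted as UTF-8, no longer translated to CR
Console.Write("XYZ");    // now prints on the same line, after the note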
Console.Write('♪'); is considered Unicode. My guess is that it translates it to the closest ASCII character. You should be using U+1D160 or the appropriate Unicode musical equivalent.
Unicode includes the primitives required to generate musical output (starting at U+1D100). For example, U+1D11A is a 5-line staff and U+1D158 is a closed notehead.
See http://www.unicode.org/charts/PDF/U1D100.pdf
...then the issue becomes making sure you have a typeface with the appropriate glyphs included (and dealing with spacing things correctly, etc.).
If you're looking to generate printed output, you should look at Lilypond, which is an OSS music notation package that uses a text file format to define the musical content and then generates gorgeous output.
iTextSharp is a great tool. I can use
PdfTextExtractor.GetTextFromPage(reader, iPage) + " ";
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, and not everything?
Any solution is useful, regardless of the programming language. Thank you.
From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package.
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the word "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case-by-case basis. You need to call TextRenderInfo.getTextRenderMode(): 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case-insensitive) and the text render mode. If either matches, it's part of one of your headings. A sketch follows below.
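A rough sketch of such a strategy, assuming iTextSharp 5 (the class name here is illustrative, not a drop-in solution):

// Sketch: collect only text drawn with a "bold" font or with text render
// mode 2 ("stroke and fill"), per the heuristics described above.
using System;
using System.Text;
using iTextSharp.text.pdf.parser;

public class BoldTextExtractionStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder bold = new StringBuilder();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        string fontName = renderInfo.GetFont().PostscriptFontName ?? "";
        bool boldFont = fontName.IndexOf("bold", StringComparison.OrdinalIgnoreCase) >= 0;
        bool strokeAndFill = renderInfo.GetTextRenderMode() == 2;
        if (boldFont || strokeAndFill)
            bold.Append(renderInfo.GetText());
    }

    public string GetResultantText()
    {
        return bold.ToString();
    }
}

It would be used the same way as the strategy in the question: PdfTextExtractor.GetTextFromPage(reader, iPage, new BoldTextExtractionStrategy()).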
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.
One of your best bets for this job is surely TET by pdflib.com, with its ability to extract to the TETML format. It is available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure whether it indeed recognizes "headlines" as such (because PDF does not know much about structural markup, only visual markup), but it can surely tell you the exact position and font used by each string of characters.