Hi,
I am using iTextSharp to parse a PDF file to text output.
I want to know whether I can detect that the PDF contains subscript or superscript text. Does anyone know how to tell the difference between a normal character and a superscript in a PDF using iTextSharp, or another library?
Thanks
Disclaimer: I don't actually have any evidence for this but...
I would expect super/subscript to be identical to normal text. It's the same font, just smaller. If it happens to be on the same line as other text, super/subscripts are raised or lowered - but there is no explicit meta-tag you could use to detect that in a layout-oriented format such as PDF.
In other words, I'd guess that you need to identify super/subscripts by heuristics: finding text that's smaller and vertically displaced compared to other text on the "same" line. Whether that's easy to do or not depends on the PDF creator and the details of ITextSharp, since even identifying a "line" is not necessarily straightforward.
You are going to have to implement a bit of custom logic here. There is no tag denoting superscript/subscript in PDF; the text is simply sitting on a different baseline. In cases such as this, you will have to track the baseline (along with the height) of each piece of text.
Some quick pseudo-code:
// input -> curText (the current chunk) and prevText (the previous chunk)
if (curText.Baseline > prevText.Baseline &&
    curText.Baseline < (prevText.Baseline + prevText.Height))
{
    // This is most likely superscript
}
else if (curText.Baseline < prevText.Baseline &&
         prevText.Baseline < (curText.Baseline + curText.Height))
{
    // This is most likely subscript
}
else
{
    // This is probably normal text
}
This solution requires you to impose order on the thoroughly unorganized nature of a PDF file. In the past I have used a List<> of a custom class that groups all text at a given y coordinate, so that the separate lines can be compared and processed however you want before painting or otherwise transmitting them.
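As a rough sketch of that approach with iTextSharp 5.x (the ChunkCollector class, its fields, and the grouping idea are my own invention; IRenderListener, TextRenderInfo and the baseline/ascent calls are part of the library), you can collect each chunk with its baseline and height, then group chunks into lines by y coordinate and apply the comparison above:

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class ChunkCollector : IRenderListener
{
    public class TextChunk
    {
        public string Text;
        public float Baseline; // y of the baseline in user space
        public float Height;   // approximate glyph height (ascent - baseline)
    }

    public readonly List<TextChunk> Chunks = new List<TextChunk>();

    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector baseline = renderInfo.GetBaseline().GetStartPoint();
        Vector ascent = renderInfo.GetAscentLine().GetStartPoint();
        Chunks.Add(new TextChunk
        {
            Text = renderInfo.GetText(),
            Baseline = baseline[Vector.I2],
            Height = ascent[Vector.I2] - baseline[Vector.I2]
        });
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

// Usage sketch:
// var collector = new ChunkCollector();
// new PdfReaderContentParser(reader).ProcessContent(pageNumber, collector);
// ...then sort collector.Chunks by Baseline, bucket them into "lines" with a
// small tolerance, and run the superscript/subscript comparison within each line.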
I am trying to build an application that can convert a PDF to an Excel file with C#.
I have searched for libraries to help me with this, but most of them are commercially licensed, so I ended up with iTextSharp.dll.
It's good that it's free, but I could rarely find any good open-source documentation for it.
These are some links that I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them did not really explain what the code does.
So this is the most common code in iText with C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code that I have. Tell me if my understanding is correct, and give me an answer for the code that I still do not understand.
Thank You.
Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in notepad for instance, you might see something like
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can
get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
get the graphics state when the object was rendered (which font, font-size, color, etc)
This can allow you to write your own parsing logic.
For instance (a rough sketch in C# follows this list):
gather all the PathRenderInfo objects
remove everything that is not a perfectly horizontal or vertical line
make clusters of everything that intersects at 90 degree angles
if a cluster contains more than a given threshold of lines, consider it a table
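A minimal sketch of that clustering idea in C# (my own simplification: it assumes the line segments have already been collected, e.g. with a render listener, and it only counts 90-degree intersections rather than building full clusters):

using System;
using System.Collections.Generic;
using System.Linq;

public class Segment
{
    public float X1, Y1, X2, Y2;
    public bool IsHorizontal { get { return Math.Abs(Y1 - Y2) < 0.01f; } }
    public bool IsVertical { get { return Math.Abs(X1 - X2) < 0.01f; } }
}

public static class TableDetector
{
    // A horizontal and a vertical segment intersect at 90 degrees
    // when each one's span covers the other's fixed coordinate.
    static bool Crosses(Segment h, Segment v)
    {
        return Math.Min(h.X1, h.X2) <= v.X1 && v.X1 <= Math.Max(h.X1, h.X2)
            && Math.Min(v.Y1, v.Y2) <= h.Y1 && h.Y1 <= Math.Max(v.Y1, v.Y2);
    }

    public static bool LooksLikeTable(IList<Segment> segments, int threshold)
    {
        var horizontal = segments.Where(s => s.IsHorizontal).ToList();
        var vertical = segments.Where(s => s.IsVertical).ToList();
        // Count segments participating in at least one intersection;
        // a dense grid of such segments is a strong hint of a table.
        int clustered = horizontal.Count(h => vertical.Any(v => Crosses(h, v)))
                      + vertical.Count(v => horizontal.Any(h => Crosses(h, v)));
        return clustered >= threshold;
    }
}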
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/
My question is pretty similar to this one and I'm afraid the answer is the same... I want to save all the shapes/images on a slide as a single png (or jpeg). Programmatically, I get as far as
slide.Shapes.SelectAll();
but don't see a way to save as an image. Is this possible? If not, any other suggestions, hopefully with examples? (not VBA - I need to automate the whole conversion)
There was a reference to OpenXML in the other post, but I'm not even sure how to pull that in.
I don't know how you'd do this in C# but I'd guess that you'd make use of the same methods as you would with VBA, where you can do:
ActiveWindow.Selection.ShapeRange.Export "c:\temp\delete-me.jpg", ppShapeFormatJPG
ppShapeFormatJPG is a PowerPoint constant, a VBA Long = 1; that'd be an int in C#.
The method can also take two more optional parameters, ScaleWidth and ScaleHeight, which govern the width and height of the exported image in undocumented ways. By default, with no parameters supplied, I get exports at 72 dpi. Larger numbers result in higher pixel-count exports but distorted proportions. I'm sure there's some strange logic to it, but it escapes me; all hints welcome!
There's a third optional parameter, ExportMode. In my tests, it makes no difference whether you supply it or not, nor which of the available values you choose.
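For what it's worth, a rough C# equivalent via the PowerPoint interop assemblies might look like this (a sketch only; I haven't verified that the optional scale parameters behave the same as in VBA):

using PowerPoint = Microsoft.Office.Interop.PowerPoint;

class ShapeExporter
{
    static void Main()
    {
        var app = new PowerPoint.Application();
        var pres = app.Presentations.Open(@"c:\temp\deck.pptx");
        try
        {
            PowerPoint.Slide slide = pres.Slides[1];
            // Range() with no index returns all shapes on the slide.
            PowerPoint.ShapeRange range = slide.Shapes.Range();
            range.Export(@"c:\temp\delete-me.jpg",
                         PowerPoint.PpShapeFormat.ppShapeFormatJPG);
            // Alternatively, slide.Export(@"c:\temp\slide.jpg", "JPG")
            // renders the entire slide, background included.
        }
        finally
        {
            pres.Close();
            app.Quit();
        }
    }
}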
We are developing a Pdf parser to be used along with our system.
The requirement is that we store all the information from any PDF document and are able to reproduce the document as such (with minimal changes from the original document).
We did some googling and found iTextSharp to be the best mate for our purpose.
We are developing our project using .NET.
As the title suggests, we need a comparison of specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license; the 5.x versions are AGPL.
We would like a good comparison between the versions before choosing the LGPL version or buying a license for the AGPL version (we don't want to publish our code).
I did some browsing through the revision changes in iTextSharp, but I would like to know if any content exists that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to the LineSegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in pdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException if resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel, it would also mean falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.
I am trying to cast AcroFields to the specific types so I can set properties on them.
When I call AcroFields.GetField(string name); all I get is a string.
When I call AcroFields.GetFieldItem(string name); I get an object but cannot cast it to the specific type.
I have also tried:
AcroFields.SetFieldProperty("myfield", "CheckType", RadioCheckField.TYPE_STAR, null);
This returned false every time.
To better explain my scenario:
I have an existing PDF (I am NOT generating this file).
There is a checkbox in it. I want to change the "CheckType" like this:
myRadioCheckField.CheckType = RadioCheckField.TYPE_STAR
But since I cannot cast the AcroField to the specific type, I cannot access the "CheckType" property.
Is there a way to achieve this?
Please provide a working sample if possible.
Thank you.
Your post contains two different questions. One question is easy to answer. The other question is impossible to answer.
Let's start with the easy question:
How to get specific types from AcroFields? Like PushButtonField, RadioCheckField, etc
This is explained in the FormInformation example, which is part of chapter 6 of my book:
PdfReader reader = new PdfReader(datasheet);
// Get the fields from the reader (read-only!!!)
AcroFields form = reader.AcroFields;
// Loop over the fields and get info about them
StringBuilder sb = new StringBuilder();
foreach (string key in form.Fields.Keys) {
sb.Append(key);
sb.Append(": ");
switch (form.GetFieldType(key)) {
case AcroFields.FIELD_TYPE_CHECKBOX:
sb.Append("Checkbox");
break;
case AcroFields.FIELD_TYPE_COMBO:
sb.Append("Combobox");
break;
case AcroFields.FIELD_TYPE_LIST:
sb.Append("List");
break;
case AcroFields.FIELD_TYPE_NONE:
sb.Append("None");
break;
case AcroFields.FIELD_TYPE_PUSHBUTTON:
sb.Append("Pushbutton");
break;
case AcroFields.FIELD_TYPE_RADIOBUTTON:
sb.Append("Radiobutton");
break;
case AcroFields.FIELD_TYPE_SIGNATURE:
sb.Append("Signature");
break;
case AcroFields.FIELD_TYPE_TEXT:
sb.Append("Text");
break;
default:
sb.Append("?");
break;
}
sb.Append(Environment.NewLine);
}
// Get possible values for field "CP_1"
sb.Append("Possible values for CP_1:");
sb.Append(Environment.NewLine);
string[] states = form.GetAppearanceStates("CP_1");
for (int i = 0; i < states.Length; i++) {
sb.Append(" - ");
sb.Append(states[i]);
sb.Append(Environment.NewLine);
}
// Get possible values for field "category"
sb.Append("Possible values for category:");
sb.Append(Environment.NewLine);
states = form.GetAppearanceStates("category");
for (int i = 0; i < states.Length - 1; i++) {
sb.Append(states[i]);
sb.Append(", ");
}
sb.Append(states[states.Length - 1]);
This code snippet stores the types of the fields in a StringBuilder, as well as the possible values of a radio field and a check box.
If you execute it on datasheet.pdf, you get form_info.txt as a result.
So far so good, but then comes the difficult question:
How do I find the check type? How do I change it?
That question reveals a lack of understanding of PDF. When we looked for the possible values of a check box (or radio button) in the previous answer, we asked for the different appearance states. These appearance states are small pieces of content that are expressed in PDF syntax.
For instance: take a look at the buttons.pdf form.
When we look at it on the outside, we see that the check box next to "English" can be an empty square or a square with a pinkish background and a cross. Now let's take a look at the inside.
There we see that this is the "table" check box, and that there are two appearance states: /Yes and /Off. What these states look like when selected is described in a stream.
The stream of the /Off state is rather simple:
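The original screenshot of that stream isn't reproduced here, but an /Off stream of this kind might read something like the following (illustrative values, not the actual bytes from buttons.pdf):

1 1 14 14 re
S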
You immediately see that we are constructing a rectangle (re) and drawing it without filling it (S).
The /Yes state is slightly more complex:
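Again an illustrative reconstruction rather than the actual stream:

1 0.75 0.8 rg
1 1 14 14 re
B
3 3 m
12 12 l
3 12 m
12 3 l
S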
We see that the fill color is being changed (rg), and that we stroke the rectangle in black and fill it using the fill color that was defined (B). Then we define two lines with moveTo (m) and lineTo (l) operations and we stroke them (S).
If you are proficient in PDF syntax, it is easy to see that we're drawing a cross inside a colored rectangle. So that answers your question on condition that you're proficient in PDF...
If you want to replace the appearance, then you have to replace the stream that draws the rectangle and the cross. That's not impossible, but it's a different question than the one you've posted.
Summarized: there is no such thing as a TYPE_STAR in the PDF reference (ISO-32000-1), nor in any PDF. If you have an existing PDF, you cannot cast a check box or radio button to a RadioCheckField. You could try to reconstruct the RadioCheckField object, but if you want to know whether a check box is visualized using a check mark, a star, etc., you have to interpret the PDF syntax. If you don't want to do that, you can't create the "original" RadioCheckField object that was used to create the PDF, due to the lack of ready-to-use information in the PDF.
After some research I have come to the conclusion that at this point in time, with the current version of iTextSharp v5.5.7.0:
It is NOT possible to grab an AcroField and cast it back to the original class (RadioCheckField) that was used to generate the field in the first place.
So the specific classes like PushButtonField and RadioCheckField are only useful for generating a new PDF but not for editing an existing PDF.
This cannot be done.
As far as I know, iTextSharp does not support casting from an AcroField back to the original class (RadioCheckField) that was used to generate the field.
You would have to write your own code to parse and inspect the PDF to achieve this.
We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.
We have noticed that there is keyword-stuffing abuse.
We have determined two main kinds of abuse:
Repeating the same term, again and again
Many, irrelevant terms added to the document en-masse
These two forms of abuse are enabled by either adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.
Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts. The same goes for font size: any cut-off seems potentially arbitrary, and we may accidentally remove valid text if we set the cut-off too large.
My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?
Any guidance would be appreciated!
There's a surprisingly simple solution to your problem using the notion of compressibility.
If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example with the zlib library, which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any important deviation would mean that they have been "stuffed". The analysis process is extremely easy; I have analyzed around 100k texts and it takes only about a minute using Python.
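As a C# sketch of that check (my own illustration; the 4.0 threshold is a placeholder you would tune on a sample of clean documents):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class Compressibility
{
    // Compression ratio = original size / compressed size.
    // Normal prose lands around 2; much higher suggests repetitive, stuffed text.
    public static double Ratio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)raw.Length / output.Length;
        }
    }
}

// Usage: if (Compressibility.Ratio(documentText) > 4.0) { /* flag for review */ }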
Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.
After you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will stand out as those with a few words that deviate very strongly from the mean (documents where one or two words are repeated many times) or with many words showing high deviations (documents with blocks of text repeated).
Here are some useful links about compressibility:
http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf
http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf
You could also probably use the concept of entropy, for example the Shannon entropy calculation: http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
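A quick C# version of that entropy idea (my own port; heavily repeated text tends toward unusually low character-level entropy):

using System;
using System.Linq;

static class Entropy
{
    // Shannon entropy in bits per character.
    public static double Shannon(string text)
    {
        return text.GroupBy(c => c)
                   .Select(g => (double)g.Count() / text.Length)
                   .Sum(p => -p * Math.Log(p, 2));
    }
}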
Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were higher or lower for some POS tags, then you could possibly detect "stuffed" documents.
As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, which is why they have an appeal process), it's best to flag likely keyword-stuffed documents for further human review.
If a page has 100 words and you count the occurrences of each keyword on the page (rendering stuffing via 1px or background-coloured text irrelevant), you obtain a keyword density. There is no hard and fast percentage that is "always" keyword stuffing; generally 3-7% is normal. Perhaps if you detect 10%+ you flag the page as "potentially stuffed" and set it aside for human review.
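A sketch of such a density check in C# (the 10% threshold is the rough figure above, not an established constant):

using System;
using System.Linq;

static class KeywordDensity
{
    // Fraction of all words in the text that match the given keyword.
    public static double Of(string text, string keyword)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return 0;
        int hits = words.Count(w => string.Equals(w, keyword, StringComparison.OrdinalIgnoreCase));
        return (double)hits / words.Length;
    }
}

// Usage: set aside for human review when Of(text, keyword) > 0.10.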
Furthermore consider these scenarios (taken from here):
Lists of phone numbers without substantial added value
Blocks of text listing cities and states a webpage is trying to rank for
and consider what the context of a keyword is.
Pretty damn difficult to do correctly.
Detect tag-abuse with forecolor/backcolor detection like you already do.
For size detection, calculate the average text size and remove the outliers.
Also set predefined limits on the text size (like you already do).
Next up is the structure of the tag "blobs".
For your first point, you can just count the words, and if one occurs too often (say, five times more often than the second most frequent word) you can flag it as a repeated tag.
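For instance, a sketch of that frequency test (the 5x ratio is just the guess from above):

using System;
using System.Linq;

static class RepeatDetector
{
    public static bool LooksRepeated(string text)
    {
        var counts = text.ToLowerInvariant()
                         .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                         .GroupBy(w => w)
                         .Select(g => g.Count())
                         .OrderByDescending(c => c)
                         .ToList();
        // Flag when the most frequent word dwarfs the runner-up.
        return counts.Count >= 2 && counts[0] > 5 * counts[1];
    }
}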
When adding tags en masse, the user often adds them all in one place, so you can check whether known "fraud tags" appear next to each other (maybe with one or two words in between).
If you could identify at least some common "fraud tags" and want to get a bit more advanced then you could do the following:
Split the document into parts with the same text size/font and analyze each part separately. For better results, group parts that use nearly the same font/size, not only those that have EXACTLY the same font/size.
Count the occurrences of each known tag, and when some limit you set is exceeded, that part of the document is removed or the document is flagged as "bad" (as in "uses excessive tags").
No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.
When that happens you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.
If you notice a pattern, such as the common stuffers always using a font size below a certain threshold (e.g. 1-5px, which is not really readable), then you could assume that that is the "stuffed" part.
You can then go on to check whether the font colour is also the same as the background colour, and remove that section.