How can I tell Ghostscript not to rasterize gradients in EPS files? - C#

I was searching for a solution that would allow me to read, edit and save .eps files, and found that Ghostscript provides all of these capabilities. The algorithm I need is simple: read several .eps files, concatenate them into one big file and save the result as a new .eps file. I can already do that, but there is a problem: the newly generated and saved files don't preserve gradients. Gradients are rasterized, and the shapes that use those gradients are converted to clipping masks. Is there a way to tell Ghostscript not to rasterize gradients in EPS output?
I'm using the latest 32-bit version of the Ghostscript library even though my Windows is 64-bit (there were problems running the solution against the 64-bit version of Ghostscript). It's not particularly important, but I'm writing in C# using Ghostscript.Net.
This is the sample code:
using (GhostscriptProcessor processor = new GhostscriptProcessor(lastInstalledVersion, true))
{
    List<string> switches = new List<string>();
    switches.Add("-o");
    switches.Add(outputFile);   // -o takes the output file as the next argument (and implies -dBATCH -dNOPAUSE)
    switches.Add("-sDEVICE=eps2write");
    switches.Add("-dUseCIEColor=true");
    switches.Add("-c");
    switches.Add("<</Install {0.5 0.5 scale}>> setpagedevice");
    switches.Add("-f");
    switches.Add(inputFile);

    processor.Process(switches.ToArray());
}

The answer to the question you have asked is simple: you can't. The eps2write device is called that for a reason; it only produces level 2 PostScript, and the shfill operator, or type 2 pattern (shading dictionary in PDF), is a level 3 PostScript primitive.
However, there seems to be no good reason to run the existing files through Ghostscript anyway. You say you already have a number of EPS files. The whole point of EPS files is that they can be treated as a 'black box'; you do not need to know what's in them in order to concatenate them, rearrange them, etc.
All you do is write some 'wrapper' PostScript that alters the CTM before including the EPS file in its entirety. You can work out what the arguments to scale and translate should be, because the EPS file will have a %%BoundingBox comment that tells you where it sits in user space. All you need to do is alter the scale, and offset the 0,0 origin (bottom left) using translate.
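For illustration, here is a minimal C# sketch of that wrapper approach, assuming each EPS exposes a standard %%BoundingBox comment. The side-by-side placement, the file names and the simple regex are illustrative only; a production version would also emit proper DSC comments (%%BeginDocument/%%EndDocument) around each included file.
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class EpsConcat
{
    // Builds wrapper PostScript that places each EPS next to the previous one.
    public static string BuildWrapper(string[] epsFiles)
    {
        var ps = new StringBuilder();
        ps.AppendLine("%!PS-Adobe-3.0");
        double xOffset = 0;

        foreach (string file in epsFiles)
        {
            string eps = File.ReadAllText(file);

            // The %%BoundingBox comment tells us where the EPS sits in user space.
            Match m = Regex.Match(eps, @"%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)");
            if (!m.Success) continue;
            double llx = double.Parse(m.Groups[1].Value);
            double lly = double.Parse(m.Groups[2].Value);
            double urx = double.Parse(m.Groups[3].Value);

            ps.AppendLine("save");
            ps.AppendLine("/showpage {} def");          // neutralise any showpage inside the EPS
            ps.AppendLine($"{xOffset} 0 translate");    // place this EPS to the right of the previous one
            ps.AppendLine($"{-llx} {-lly} translate");  // move its bottom-left corner onto the origin
            ps.AppendLine(eps);                         // include the EPS as a black box, gradients untouched
            ps.AppendLine("restore");

            xOffset += urx - llx;
        }

        ps.AppendLine("showpage");
        return ps.ToString();
    }
}
// Usage (hypothetical file names): File.WriteAllText("combined.ps", EpsConcat.BuildWrapper(new[] { "a.eps", "b.eps" }));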
Note that the eps2write device, because it is limited to producing level 2 PostScript, also does not support some other features of PostScript beyond the original level 2 specification, such as CIDFonts.

Related

How can I set a halftone (sethalftone) for each separation color with the tiffsep1 device and other separation devices?

The code below works, but uncommenting the commented lines produces an error. The error is not solved by changing -sDEVICE to tiffgray, for example.
String[] ARGS = new String[] {
    "",
    "-sDEVICE=tiffsep1",
    "-r1200",
    "-o out.tiff",
    "SOSample.pdf",
    //"-c",
    //"<< /HalftoneType 1 /Frequency 300 /Angle 45 /SpotFunction {180 mul cos exch 180 mul cos add 2 div} >> sethalftone",
    //"-f"
};
How can I define a halftone with sethalftone in Ghostscript, and how can I set it for each color of tiffsep1? What am I doing wrong with a single color, and how do I set it up for the separations?
I'm using:
[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
public static extern int INSTANCEStart(IntPtr instance, int argc, string[] argv);
and so on.
I'm working with Ghostscript 9.52.
Something that could help (note the escaped \" characters):
"-c",
"\"<</Orientation 1>> setpagedevice\"",
You need to use the sethalftone PostScript operator in order to change the halftone. Obviously this will involve writing some PostScript.
Not only that, but you really need to set the default halftone, or set the halftone at the start of the page, because the current PDF interpreter in Ghostscript does an initgraphics at the start of every page of a PDF file.
For all of this you are going to need a copy of the PostScript Language Reference Manual, which you can get from somewhere on the Adobe web site. They keep moving stuff around so I'm not going to try and post a link, just google for the name of the manual. You want the third edition.
So you need to write a BeginPage procedure, which you will find covered in Chapter 6 under device control, pages 427 onwards.
The BeginPage procedure will need to set a halftone, and you will find halftones covered in Section 7.4, page 480 onwards. You will presumably want to use either a type 2 or type 4 halftone dictionary.
When you've assembled that, you then need to pass it to Ghostscript before you process the PDF file. The simplest method is to put the PostScript program in a file (called eg setup.ps) and then put that filename on the command line immediately before the PDF filename.
Eg:
gs -r1200 -sDEVICE=tiffsep1 -o out%d.tif setup.ps sample.pdf
Note that PDF files can contain a halftone specification themselves (this is deprecated in PDF 2.0) and Ghostscript will honour any halftone in a PDF file.
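To make that concrete, here is a hedged C# sketch of what the setup file and argument list might look like when driving Ghostscript through gsapi_init_with_args. The Frequency/Angle/SpotFunction values, file names and output pattern are illustrative assumptions, and per-separation screens would go into the type 2/4 (or type 5) halftone dictionary forms described in the PLRM rather than the single type 1 dictionary shown here.
// Sketch only: write a setup.ps that installs a BeginPage procedure re-applying the
// halftone on every page (the PDF interpreter resets the graphics state per page).
string setupPs = @"
<<
  /BeginPage {
    pop                        % BeginPage receives the showpage count; discard it
    << /HalftoneType 1
       /Frequency 300
       /Angle 45
       /SpotFunction {180 mul cos exch 180 mul cos add 2 div}
    >> sethalftone
  }
>> setpagedevice
";
System.IO.File.WriteAllText("setup.ps", setupPs);

// Equivalent to: gs -r1200 -sDEVICE=tiffsep1 -o out%d.tif setup.ps SOSample.pdf
// argv[0] is ignored by gsapi_init_with_args; this array can be handed to the
// INSTANCEStart wrapper from the question once an instance has been created.
String[] ARGS = new String[] {
    "gs",
    "-sDEVICE=tiffsep1",
    "-r1200",
    "-o", "out%d.tif",
    "setup.ps",          // processed before the PDF, so BeginPage is installed first
    "SOSample.pdf"
};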
Finally; this is an unusual request and, given that you are writing code to link to the Ghostscript DLL, makes me think you may be using Ghostscript commercially. You should review the AGPL to ensure you are complying with the terms of the license. If you plan on distributing your application you will almost certainly need a commercial license.

Powerpoint "Save As Picture" from C# Microsoft.Office.Interop.PowerPoint

My question is pretty similar to this one and I'm afraid the answer is the same... I want to save all the shapes/images on a slide as a single png (or jpeg). Programmatically, I get as far as
slide.Shapes.SelectAll();
but don't see a way to save as an image. Is this possible? If not, any other suggestions, hopefully with examples? (not VBA - I need to automate the whole conversion)
There was a reference to OpenXML in the other post, but I'm not even sure how to pull that in.
I don't know how you'd do this in C# but I'd guess that you'd make use of the same methods as you would with VBA, where you can do:
ActiveWindow.Selection.ShapeRange.Export("c:\temp\delete-me.jpg", ppShapeFormatJPG)
ppShapeFormatJPG is a PowerPoint constant, a VBA Long = 1; IIRC that'd be an Integer in C#.
The method can also take two more optional parameters, ScaleWidth and ScaleHeight, which govern the width and height of the exported image in undocumented ways. By default, with no parameters supplied, I get exports at 72 dpi. Larger numbers result in higher pixel-count exports but distorted proportions. I'm sure there's some strange logic to it, but it escapes me; all hints welcome!
There's a third optional parameter, ExportMode. In my tests, it makes no difference whether you supply it or not, and if you do, which of the available values you choose.
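For completeness, a hedged C# sketch of the same Export call through Microsoft.Office.Interop.PowerPoint; the file paths, slide index and scale values are placeholders. Using Shapes.Range() avoids relying on a UI selection, which is why SelectAll() from the question isn't needed here.
using PowerPoint = Microsoft.Office.Interop.PowerPoint;
using Core = Microsoft.Office.Core;

class ShapeExportSketch
{
    static void Main()
    {
        var app = new PowerPoint.Application();
        PowerPoint.Presentation pres = app.Presentations.Open(
            @"c:\temp\deck.pptx", WithWindow: Core.MsoTriState.msoFalse);

        // Range() with no arguments returns all shapes on the slide, no UI selection required.
        PowerPoint.Slide slide = pres.Slides[1];
        PowerPoint.ShapeRange range = slide.Shapes.Range();

        // Same call as the VBA example; the trailing numbers scale the exported bitmap.
        range.Export(@"c:\temp\delete-me.jpg",
            PowerPoint.PpShapeFormat.ppShapeFormatJPG, 1024, 768);

        pres.Close();
        app.Quit();
    }
}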

How to extract text from PDF using iTextSharp version 4.1.6? [duplicate]

We are developing a PDF parser to be used along with our system.
The requirement is that we store all the information from any PDF document and should be able to reproduce the document as such (with minimal changes from the original document).
We did some googling and found iTextSharp to be the best fit for our purpose.
We are developing our project using .NET.
As you might have guessed from the title, we need a comparison between specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license. The 5.x versions are AGPL.
We would like to have a good comparison between the versions before choosing the LGPL version or buying a license for the AGPL version (we don't want to publish our code).
I did some browsing through the revision changes in iTextSharp, but I would like to know if any content exists that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to line segment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in pdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException if resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the five-year-old version wouldn't only be like reinventing the wheel, it would also mean falling into every pitfall we've fallen into over the last five years. I can assure you that buying a license will be less expensive.
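For reference, the 5.x extraction entry points mentioned in the list above (PdfTextExtractor, LocationTextExtractionStrategy) are typically used along these lines; a minimal sketch assuming iTextSharp 5.x, with a placeholder file name.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader("input.pdf");   // placeholder file name
try
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // The location-aware strategy mentioned in the 5.0.1 notes; SimpleTextExtractionStrategy also works.
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        System.Console.WriteLine(text);
    }
}
finally
{
    reader.Close();
}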

Consolidate the fonts between merged PDFs - iTextSharp C#

I need to merge multiple PDFs together. I am using iTextSharp to create all the PDFs, and I need to reduce the merged PDF to the smallest possible size. I know the fonts are being duplicated for each PDF. Is there a way to use only one set of fonts throughout the merged PDF? For example, pdf1 is 2.8 MB and pdf2 is 2.8 MB; I merge them together and the result is about 5.7 MB. I know for a fact that both of those PDFs use the same font, but the font data is duplicated even though it all ends up in the same PDF.
I tried setting the compression properties to best compression and setting full compression, and that barely reduced the size.
However, when I ran the PDF through Acrobat X Pro and optimized it, it shrank by more than 90%, from around 160 MB to 5 MB. The usage audit says fonts make up 90% of the PDF before optimizing.
So, is there a way to consolidate the fonts between merged PDFs?
My answer consists of two parts:
You're not telling us how you're merging the PDFs. Let's hope you've read the official documentation and that you're using PdfSmartCopy. If not, you're doing it wrong. PdfSmartCopy examines the content of the different PDFs and reuses possibly redundant objects (such as reused images, XObjects, fonts). Note that there were some bugs in earlier versions of PdfSmartCopy so please make sure you're using the latest version.
If the different PDFs use different subsets of the same font, you're out of luck. iText doesn't merge font subsets. Merging different font subsets would involve rewriting the content streams, creating new font programs if we're talking about simple fonts whose merged subsets would need more than 256 characters, etc.
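For the first point, a minimal PdfSmartCopy merge of several files might look like the following sketch; the file names are placeholders. Compare it with the single-file example further down.
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.IO;

// Hedged sketch: PdfSmartCopy writes duplicate objects (identical font programs, images,
// XObjects) only once while the pages are copied.
string[] inputs = { "pdf1.pdf", "pdf2.pdf" };
using (FileStream fs = new FileStream("merged.pdf", FileMode.Create))
{
    Document doc = new Document();
    PdfSmartCopy copy = new PdfSmartCopy(doc, fs);
    doc.Open();
    foreach (string input in inputs)
    {
        PdfReader reader = new PdfReader(input);
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            copy.AddPage(copy.GetImportedPage(reader, i));
        }
        copy.FreeReader(reader);   // release the reader's resources once its pages are copied
        reader.Close();
    }
    doc.Close();
}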
You could rename the subsets.
For example, if you had
Helvetica (subset)
and
Helvetica (subset)
you would create
Helvetica-1 (subset)
and
Helvetica-2 (subset)
provided they are different implementations (compare the binary streams).
According to
https://turreta.com/2013/12/13/remove-duplicate-fonts-in-pdf-files/
iTextSharp.text.Document tdocument = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfSmartCopy smart =
    new iTextSharp.text.pdf.PdfSmartCopy(tdocument,
        new FileStream(@"newAddressPath", FileMode.Create));
tdocument.Open();

iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(@"yourPDFMergeFile");

// Where the magic happens: PdfSmartCopy reuses identical objects (fonts, images) as pages are added
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    smart.AddPage(smart.GetImportedPage(reader, i));
}

tdocument.Close();
reader.Close();

How to detect keyword stuffing?

We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.
We have noticed that there is keyword-stuffing abuse.
We have determined two main kinds of abuse:
Repeating the same term again and again
Many irrelevant terms added to the document en masse
These two forms of abuse are enabled by either adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.
Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts; the same goes for font size, as any cut-off seems potentially arbitrary and we may accidentally remove valid text if we set the cut-off too large.
My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?
Any guidance would be appreciated!
There's a surprisingly simple solution to your problem using the notion of compressibility.
If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example with the zlib library, which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation would mean that they have been "stuffed". The analysis is extremely easy: I have analyzed around 100k texts, and it takes only around 1 minute using Python.
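A hedged C# sketch of that compressibility check, using DeflateStream in place of the zlib/Python combination mentioned above; the 4.0 flag threshold is an illustrative assumption to be calibrated against your own corpus, with ~2 as the baseline for normal text.
using System.IO;
using System.IO.Compression;
using System.Text;

static class StuffingByCompression
{
    // Deflate the text and return original size / compressed size.
    public static double CompressionRatio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)raw.Length / output.ToArray().Length;
        }
    }

    // Normal prose compresses roughly 2:1; heavily repeated text compresses far better.
    // The 4.0 cut-off is an assumption to tune on known-clean documents.
    public static bool LooksStuffed(string text) => CompressionRatio(text) > 4.0;
}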
Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.
After you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will be characterized either by a few words with a very high deviation from the mean (documents where one or two words are repeated many times) or by many words with high deviations (documents with blocks of text repeated).
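A rough C# sketch of that frequency comparison; the tokenisation, the (mean, standard deviation) table learned from clean documents, and the 3-sigma flag are all illustrative assumptions.
using System.Collections.Generic;
using System.Linq;

static class StuffingByFrequency
{
    public static Dictionary<string, int> WordCounts(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n', '.', ',', ';' },
                   System.StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

    // cleanStats maps each word to the mean count and standard deviation measured on clean documents.
    public static bool LooksStuffed(string text,
        Dictionary<string, (double Mean, double StdDev)> cleanStats)
    {
        var counts = WordCounts(text);
        int suspicious = counts.Count(kv =>
            cleanStats.TryGetValue(kv.Key, out var s) &&
            s.StdDev > 0 &&
            (kv.Value - s.Mean) / s.StdDev > 3.0);   // far above the mean seen in clean documents
        return suspicious > 0;
    }
}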
Here are some useful links about compressibility:
http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf
http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf
You could also probably use the concept of entropy, for example Shannon Entropy Calculation http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were higher or lower for some POS tags, then you could possibly detect "stuffed" documents.
As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, hence the appeal process), it's best to flag likely keyword-stuffed documents for further human review.
If a page has 100 words and you count the occurrences of each keyword (which makes stuffing via 1px fonts or background-coloured text irrelevant), you get a keyword density. There is no hard and fast percentage that 'always' means keyword stuffing; generally 3-7% is normal. Perhaps if you detect 10%+ you flag the document as 'potentially stuffed' and set it aside for human review.
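A small C# sketch of that density check; the 10% threshold and the whitespace-only tokenisation are assumptions to adjust.
using System;
using System.Linq;

static class KeywordDensity
{
    // Returns true when any single word exceeds the given share of the total word count.
    public static bool FlagForReview(string text, double threshold = 0.10)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return false;

        double topDensity = words.GroupBy(w => w)
                                 .Max(g => (double)g.Count() / words.Length);
        return topDensity > threshold;   // e.g. one word makes up more than 10% of the page
    }
}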
Furthermore consider these scenarios (taken from here):
Lists of phone numbers without substantial added value
Blocks of text listing cities and states a webpage is trying to rank for
and what the context of a keyword is.
Pretty damn difficult to do correctly.
Detect tag-abuse with forecolor/backcolor detection like you already do.
For size detection calculate the average text size and remove the outliers.
Also set predefined limits on the textsize (like you already do).
Next up is the structure of the tag "blobs".
For your first point you can just count the words, and if one occurs too often (maybe 5x more often than the 2nd most frequent word) you can flag it as a repeated tag (see the sketch below).
When adding tags en masse the user often adds them all in one place, so you can see if known "fraud tags" appear next to each other (maybe with one or two words in between).
If you could identify at least some common "fraud tags" and want to get a bit more advanced then you could do the following:
Split the document into parts with the same textsize / font and analyze each part separately. For better results group parts that use nearly the same font/size, not only those that have EXACTLY the same font/size.
Count the occurrences of each known tag, and when some limit set by you is exceeded, that part of the document is removed or the document is flagged as "bad" (as in "uses excessive tags").
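A brief C# sketch of those two checks; the 5x factor, the tag list and the limit are illustrative values, not tuned thresholds.
using System.Collections.Generic;
using System.Linq;

static class TagHeuristics
{
    // Flags documents where the most frequent word dwarfs the runner-up.
    public static bool HasRepeatedTag(string[] words)
    {
        int[] top = words.GroupBy(w => w)
                         .Select(g => g.Count())
                         .OrderByDescending(c => c)
                         .Take(2)
                         .ToArray();
        return top.Length == 2 && top[0] >= 5 * top[1];
    }

    // Counts hits against a (hypothetical) list of known "fraud tags".
    public static bool ExceedsFraudTagLimit(string[] words, HashSet<string> knownFraudTags, int limit)
    {
        int hits = words.Count(knownFraudTags.Contains);
        return hits > limit;
    }
}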
No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.
When that happens you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.
If you notice a pattern, such as the common stuffers always using a font size below a certain value, e.g. 1-5px, which is not really readable, then you could assume that that is the "stuffed" part.
You can then go on to check whether the font colour is also the same as the background colour and remove that section.
