I'm trying to sort through a collection of DeepZoom sub-images based on arbitrary data associated with each image. The sub-images get loaded automagically through an XML file generated by DeepZoom Composer. I don't see a clear way to associate arbitrary data with a DeepZoom sub-image.
The solutions that seem most obvious to me are brittle and don't scale well. Ideally, I'd like to put the relevant data in the generated XML file, but I'd lose that information on the next set of generated images.
Is there a well-established way of accomplishing this goal?
As you've noticed DeepZoomComposer supports a <Tag></Tag> element which you can use in your Silverlight MultiScaleImage control (filtering by tag example).
You are also right that you would 'lose' any information you add to the XML file when you edit in DeepZoomComposer and re-generate (however you don't lose it if you typed into DeepZoomComposer).
To get around this problem, I've written a little console application called TagUpdater -- basically it works like this:
You put your metadata IN THE IMAGES: JPG file format supports lots of different fields, but for now let's use Title, Keywords (tags), Description and Rating
You add your images to Microsoft's DeepZoomComposer (don't necessarily bother laying them out, since you will probably want to sort them dynamically; and don't bother entering any metadata) and Export as normal
Call TagUpdater.exe Metadata.xml via the console (DeepZoomComposer will have generated the Metadata.xml file).
TagUpdater extracts the metadata direct from your images and updates Metadata.xml (see below). It is destructive to the existing <Tag> data, but otherwise the file can be used as-before to provide metadata information for a DeepZoom collection in a MultiScaleImage control.
<Image>
<FileName>C:\Documents and Settings\xxxxxx\My Documents\Expression\Deep Zoom Composer Projects\Bhutan\source images\page01.jpg</FileName>
<x>0</x>
<y>0</y>
<Width>0.241254523522316</Width>
<Height>0.27256162721473</Height>
<ZOrder>1</ZOrder>
<Tag>Bhutan,Mask</Tag>
<Description>Land of the Thunder Dragon</Description>
<Title>Bhutan 2008</Title>
<Rating>3</Rating>
</Image>
You can keep adding images/regenerating because the metadata is coming from the images (not the DeepZoomComposer tag box).
Here's an example - notice how the tags and description on the right are updated as you hover over each image, as well as the visible images being filtered based on clicking a tag.
Kirupa's source can be modified to use this extra data...
string tagString = g.Element("Tag").Value;
// get new elements as well
string descriptionString = g.Element("Description").Value;
string titleString = g.Element("Title").Value;
string ratingString = g.Element("Rating").Value;
Hope that's of some interest - TagUpdater itself isn't the only way to accomplish this. It's pretty simple: it just opens the Metadata.XML file, loops through the <Image> elements to open the <FileName>, extract the metadata, add the additional XML elements and save the XML. Using the filename as a 'key' you could get additional information from a database (eg. a price or more description data) and expand the XML file as much as you want.
Metadata.xml has a Tag property that can be associated with each image. Hurray!
Related
We are using the classes in the Microsoft.VisualStudio.XmlEditor namespace (https://msdn.microsoft.com/en-us/library/microsoft.visualstudio.xmleditor.aspx) to parse an xml document in an Visual Studio Extension. Documents are parsed to an object structure, and only changes are parsed. This part is working perfectly.
Now we want to use the parsed data objects to show adornments (https://msdn.microsoft.com/en-us/library/microsoft.visualstudio.text.editor.intratextadornmenttag.aspx) on specific locations within the XML file, based on some conditions. The tagging system expects that the extension gives a list of snapshotspans, where the tags should be rendered.
The problem is that we cannot find a way to know the correct positions of the XML tags. The XmlModel class, provides a GetTextSpan() method (https://msdn.microsoft.com/en-us/library/microsoft.visualstudio.xmleditor.xmlmodel.gettextspan.aspx), but the result of that method call doesn't contain the ITextSnapshot to which the coordinates belong. Since the XML parser is running on a separate thread, there is no way to guarantee that these coordinates are still valid for the most recent snapshot of the text buffer when doing changes quickly after each other. If the result would have been a SnapshotSpan struct (https://msdn.microsoft.com/en-us/library/microsoft.visualstudio.text.snapshotspan.aspx), we would be able to translate that span to the current snapshot.
Does anybody know a proper way to resolve this issue?
We are developing a Pdf parser to be used along with our system.
The requirement is such that, we store all the information on any pdf documents and should be able to reproduce the document as such (with minimal changes from original document).
We did some googling and found iTextSharp be the best mate for our purpose.
We are developing our project using .net.
You might have guessed as i mentioned in my title requiring comparisons for specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license . The 5.x versions are AGPL.
We would like to have a good comparison between the versions before choosing the LGPL version or we buy the license for AGPL (we dont like to publish our code).
I did some browsing through the revision changes in the iTextSharp but i would like to know if any content exist, making a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaƫl who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to linesegment implementation; optimalization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into it's own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in pdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException of resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk(() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5 year old version wouldn't only be like reinventing the wheel, it would also be like falling in every pitfall we've fallen in in the last 5 years. I can assure you that buying a license will be less expensive.
i am having a trouble in retrieving images and text in a pdf file at the same, i was able to get images and text in a pdf file but not at the same time (this will cause a question of whether to render the image first or the text first for example in my panel control?), maybe if you guys can help me define what does each constants in pdfname means? i tried using pdfname.all but it returns null, but when using pdfname.resources it returns procset, font and xobject. i used xobject for image, but what are procset and font (could this be the style of the text? does it have pdfname.text for retrieving text)?
thanks in advance.
First of all,
i am having a trouble in retrieving images and text in a pdf file at the same
for this task you should use the iText(Sharp) parser API. In iTextSharp you essentially implement IRenderListener (an interface with methods for being informed about (bitmap) images and text fragments in a content stream) and process the page contents with it:
PdfReader reader = new PdfReader(...);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
int pageNumber = [... the number of the page you are interested in; may be a loop variable ...];
IRenderListener listener = new [... your IRenderListener implementation ...]
parser.ProcessContent(pageNumber, listener);
You ask
whether to render the image first or the text first for example in my panel control
The IRenderListener methods also retrieve information on the location of the bitmap or text fragment in question.
For ideas how the text fragments may be combined in your listener, you may want to be inspired by the implementations SimpleTextExtractionStrategy or LocationTextExtractionStrategy present in iTextSharp.
If you insist on doing it manually, though...
maybe if you guys can help me define what does each constants in pdfname means?
You find the definitions of what the names map to in the PDF specification ISO 32000-1:2008 a copy of which Adobe made available here.
when using pdfname.resources it returns procset, font and xobject. i used xobject for image, but what are procset and font (could this be the style of the text?
The contents of the page Resource Dictionaries are explained in section 7.8.3 of the specification.
does it have pdfname.text for retrieving text)?
You'll find how test is presented in page content streams and xobjects in section 9.
Is it possible to modify spot color names in a PDF using iTextSharp in C#, its just the Colour name that requires changing.
So you have an existing PDF that uses some spot colors, for instance a color named "ABC", and you want to manipulate that PDF so that the name is "XYZ".
This is possible, but it requires low-level PDF syntax manipulation.
You need to create a PdfReader instance, look for the dictionary that defines the spot color, change the name, then use PdfStamper to create a new file based on the altered PdfReader instance.
There is no "ready-made" example on how to answer your specific question (I doubt somebody else but the original developer of iText will answer a question like this), but you can get some inspiration by looking at the code samples from chapter 13 of the second edition of "iText in Action": http://itextpdf.com/book/chapter.php?id=13
See for instance the manipulatePdf() method in this example: http://itextpdf.com/examples/iia.php?id=239
In this example an URL is replaced by another one using the principle explained above.
You need to adapt this example so that you find the path to the place where the spot color name is stored, change that name, and persist the changes.
Hint: the Spot color name will be in an array of which the first element is a name (/Separation), the second entry will be the name you want to change (this is the one you want to replace with a new PdfName instance), and so on.
How to find this /Separation array? I would loop over the pages (the getPageN() method will give you the page dictionary), get the resources of every page (pageDict.getAsDict(PdfName.RESOURCES)), look for the existence of a /Colorspace dictionary, then look for all the /Separation colors in that dictionary. Replace the second element whenever you encounter a name you want to change.
The examples in chapter 13 in combination with ISO-32000-1 (can be downloaded from the Adobe.com site) will lead the way.
everyone,
I am trying to work with uploading images to my site, and I have successfully gotten it to work, however, I need to extend the functionality beyond that of one simple image. The examples I have seen use the WebMatrix File Helper (File Helper? Is that right? Oh well, it's a helper of some kind that auto plots the html necessary for the input=type"file" field). The line of code I have in the form:
#FileUpload.GetHtml(initialNumberOfFiles:1, allowMoreFilesToBeAdded:false, includeFormTag:false)
The line of code I have in (IsPost):
var UploadedPicture = Request.Files[0];
if(Path.GetFileName(UploadedPicture.FileName) != String.Empty)
{
var ContentType = UploadedPicture.ContentType;
var ContentLength = UploadedPicture.ContentLength;
var InputStream = UploadedPicture.InputStream;
Session["gMugshot"] = new byte[ContentLength];
InputStream.Read((byte[])Session["gMugshot"], 0, ContentLength);
}
else
{
Session["gMugshot"] = new byte[0];
}
More code in the (IsPost) after this stores it in the database as binary data, and I can get the image back on the page from that (I have no desire to save the actual image files in a folder on the server and use GUID, etc. etc. Binary data is fine, and I imagine takes up a lot less space).
I have it set up to click-ably scroll through pictures by using jQuery to read the clicks of manually created buttons and subsequently hide and unhide the divs that contain the images rendered by C# (which gets them from reading the database). Sorry if that's a little TMI, just trying to be thorough, but to refine the question: I don't know enough about file uploading to know how to work with the uploaded data that well yet. I tried researching this information, but I didn't find any information that seemed pertinent to me (actually, I didn't find much useful information on input type="file", or the FileUpload method, at all, really).
Would it be better to use input type="file" id="pic1id"? Is there something that I can use such as Request.Files["pic1id"] that could get the file from the id of the input element? Or does the program simply take all uploaded files, stick them in a logistical group somewhere waiting to be called by index like this: "Request.Files[0]". If so, what order does the index get put in? Do I need to use Request.Files.Count to test how many have been uploaded before I begin working with the data?
Please note that I want separate input type="file" fields (whether plotted by the helper or not). I do not want to accept multiple files in one input (mainly because of a lack of knowledge, e.g., I am afraid I won't know how to work with the data). So far, the plan is that the separate input type="file" fields will be within the divs that get hidden/unhidden upon scrolling through pictures, so each picture (space) will have its own input type="file" field. The hiding and unhiding of divs, (the one) picture being displayed, storing and receiving binary data from the database, and clicking through the picture placeholders all function great. Pretty much I just need to know how to work with more than one uploaded picture at a time for storage in their individual database "image" fields.
Any examples of the syntax I need to use will be much appreciated.
Sorry so many questions, but I just couldn't find much useful information on this at all.
Thanks to any who try to help!
Okay, in order to solve this, I had to test and test and test, until something finally worked for me. Here's what I did:
First, I abandoned my use of the part of the helper that plotted the html, that is I took out:
#FileUpload.GetHtml(initialNumberOfFiles:1, allowMoreFilesToBeAdded:false, includeFormTag:false)
And added a regular input type="file" with a certain id, such as id="pic1".
Next I was able to get the individual file post based on id, which was really the main thing I needed to know how to do, and it really was as simple as this:
Request.Files["pic1"];