Checking values within a Base64 Encoded Image - c#

Consider this Base64 encoded image string value:
data:application/pdf;base64,
This encoded string represents a 4 by 5 inch image with a string text in the middle "Image Title". Is there any way I can check this encoded Base64 string to validate that there is in fact a string with the value of "Image Title"?
The reason I need this is so I can unit test our Image Generation process to ensure that we are generating images properly.

You would need some form of OCR to detect the "words" written within the image's pixels. There are a number of products and services which can accomplish this for you. That said, I do not know if there are any free options available, short of rolling your own OCR engine.

The fact that it's base 64 is not even slightly relevant or related. It would be no different to loading the image as a byte array.
Anyway, the underlying question here is basically "How do I check these pixels form the textual characters I want?". You need some kind of OCR to achieve this.
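To illustrate why the Base64 part is incidental, here is a minimal sketch that turns a data URI back into the plain byte array an OCR engine would consume. The class and method names (DataUri, DecodeBase64Payload) are made up for this example, and the OCR step itself is not shown.

```csharp
using System;

public static class DataUri
{
    // Returns the raw bytes carried by a "data:<mime>;base64,<payload>" string.
    // Hypothetical helper for illustration; it does no MIME type validation.
    public static byte[] DecodeBase64Payload(string dataUri)
    {
        int comma = dataUri.IndexOf(',');
        if (comma < 0 || !dataUri.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
            throw new FormatException("Not a data URI.");

        // Everything after the first comma is the Base64 payload.
        return Convert.FromBase64String(dataUri.Substring(comma + 1));
    }
}
```

The returned array is the same thing you would get from File.ReadAllBytes() on the original file, so any check on the image content starts from there.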

Related

What's the difference between SFML Image.Pixels and File.ReadAllBytes

While trying to learn SFML, I tried to set a window icon with RenderWindow.SetIcon().
The method takes 3 parameters: the first two are the size and the third is a byte[]. I tried passing the result of File.ReadAllBytes()
and similar tools in C#, but that didn't work. After searching, I found the Image.Pixels property, which also returns a byte[]. Passing that works, but I don't understand why the two return different byte arrays.
In SFML.NET, Image.Pixels returns an array of bytes that are nicely organized RGBA pixel values that represent the image in memory.
.NET's own File.ReadAllBytes() function returns the bytes that come from the file itself in the system's storage device.
Every file has a format that defines the layout and meaning of the bytes that make up that file. Image files are an extension of that concept, as there are many different file formats for images. The pixel data for an image has to be encoded (and/or compressed) according to the format it is being saved as. This means that the bytes in the file no longer match the raw RGBA pixel data as it was in the computer's memory.
Files often contain lots of extra bytes for things like a file header, metadata, compression information, or possibly even an index for blocks of data that are smaller files or images within a file.
When you use File.ReadAllBytes(), you are given all of the bytes that represent this data in an array and you have to know exactly what the meaning of the byte at each index is.
SFML understands how to decode many different image formats, and will read the bytes of the file and process that into an array of pixel data. This is what the constructor for Image that takes a file is doing in the background. Once you have an SFML.Graphics.Image instance, you can use its Pixels property to access that decoded RGBA pixel data.
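The difference can be checked without SFML at all. The sketch below uses hypothetical helper names (ImageBytes, LooksLikePngFile, ExpectedRgbaLength) and relies only on two facts: every PNG file begins with a fixed eight-byte signature, and a decoded RGBA buffer holds exactly four bytes per pixel.

```csharp
using System;

public static class ImageBytes
{
    // Every PNG file starts with these eight signature bytes.
    static readonly byte[] PngSignature = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // True when the array looks like the contents of a PNG file on disk.
    public static bool LooksLikePngFile(byte[] bytes)
    {
        if (bytes == null || bytes.Length < PngSignature.Length) return false;
        for (int i = 0; i < PngSignature.Length; i++)
        {
            if (bytes[i] != PngSignature[i]) return false;
        }
        return true;
    }

    // Length of a decoded RGBA pixel buffer: four bytes per pixel.
    public static int ExpectedRgbaLength(int width, int height)
    {
        return checked(width * height * 4);
    }
}
```

An array from File.ReadAllBytes() on a PNG passes the first check and will almost never have the length given by the second, whereas a decoded pixel array for a 32 by 32 icon is always 4096 bytes long.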

How to extract text from PDF using iTextSharp version 4.1.6? [duplicate]

We are developing a Pdf parser to be used along with our system.
The requirement is such that, we store all the information on any pdf documents and should be able to reproduce the document as such (with minimal changes from original document).
We did some googling and found iTextSharp to be the best fit for our purpose.
We are developing our project using .NET.
As you might have guessed, we need a comparison between specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp under the LGPL/MPL license. The 5.x versions are AGPL.
We would like a good comparison between the versions before choosing between the LGPL version and buying a commercial license for 5.x (we don't want to publish our code).
I browsed through the revision history of iTextSharp, but I would like to know if there is any resource that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to linesegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in pdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException if resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5 year old version wouldn't only be like reinventing the wheel, it would also be like falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.

How to remove metadata from jpg and png images

This should be a pretty trivial programming task in C#, however after I have searched a while I simply cannot find anything relevant on how to remove metadata.
I want to remove jpg and png image metadata such as: folder path, shared with, owner and computer.
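My application is an MVC 4 application. On my website users can upload an image, and I receive this image in this ActionResult method:
if (image != null)
{
    photo.ImageFileName = image.FileName;
    photo.ImageMimeType = image.ContentType;
    photo.PhotoFile = new byte[image.ContentLength];
    // Stream.Read may return fewer bytes than requested, so loop until the buffer is full.
    int offset = 0, read;
    while (offset < photo.PhotoFile.Length &&
           (read = image.InputStream.Read(photo.PhotoFile, offset, photo.PhotoFile.Length - offset)) > 0)
    {
        offset += read;
    }
}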
Photo is a property in the model, goes like this.
public byte[] PhotoFile { get; set; }
I imagine the way to remove above mentioned metadata or just all metadata, would be to use some coding like this
if (image != null)
{
image = image.RemoveAllMetaData; !!!
I don't mind using a 3rd party DLL as long as it is compatible with .NET 4.
Thanks.
'Metadata' here is a bit ambiguous--Do you mean the data which is required for a viewer to properly determine the image format so it can be displayed, saving only the raw image data? Or, more likely, do you mean the extra information, such as author, camera type, GPS location, etc, that is often added via the EXIF tags?
If you mean something like the EXIF data, there's a lot of programming material already on the web about how to add/modify/remove EXIF tags, and even some apps which already strip such tags: http://www.steelbytes.com/?mid=30 for example.
If you mean you just want the raw image data, you'll probably have to read and process the image first, since neither JPEG nor PNG simply contains the raw image data. It's encoded with various methods, which is why the files contain metadata to tell you how to decode it in the first place. You'll have to learn/explore the JPEG and PNG data formats to extract the original raw image data (or a reasonable facsimile in the case of a "lossy" encoding).
All the above is well-documented on various websites which can be found on Google, and many include image manipulation libraries which can handle these chores for you. I suspect you just didn't know to search for something like "JPEG PNG EXIF METADATA".
BTW, EXIF applies to JPEGs. Loosely speaking, EXIF data is an extra block (an APP1 marker segment) stored near the start of the JPEG file, ahead of the compressed image data, so it can usually be removed by dropping that segment without recompressing the image. A quick Google search turned up something like libexif.sourceforge.net and other similar results.
I'm not entirely certain about the PNG format, but I believe PNG (which does call such items "metadata" as well) was designed to include such data as part of the file format itself, in its own chunks, rather than as an "extension" added after the fact like EXIF. PNG, however, is an open format, and you can obtain libraries and code for manipulating PNG files from the PNG website (www.libpng.org).
There's an app for that but it's written in Perl. It doesn't recompress the image and it's here http://www.sno.phy.queensu.ca/~phil/exiftool
Found it in this thread
How to remove EXIF data without recompressing the JPEG?
Do what all the social media websites do. Create a new image file, stream in the image pixel data, and use the file you created rather than the original one that was uploaded. Of course, you will need to find out the original image's color depth and so on, so that the image you create is not of lower quality, unless you need to resize the image or reduce its size on disk as well.
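For JPEG specifically, metadata can also be dropped without recompressing by copying the file segment by segment and skipping the ones that carry it. The following is a simplified sketch, not a complete JPEG parser: the names (JpegMetadataStripper, StripApp1) are made up for this example, it only removes APP1 segments (where EXIF and XMP are stored), and it assumes every segment before the scan data has a two-byte length field, with no padding bytes between segments.

```csharp
using System;
using System.IO;

public static class JpegMetadataStripper
{
    // Returns a copy of the JPEG bytes with all APP1 (0xFFE1) segments removed.
    // The compressed image data is copied verbatim, so nothing is re-encoded.
    public static byte[] StripApp1(byte[] jpeg)
    {
        if (jpeg == null || jpeg.Length < 4 || jpeg[0] != 0xFF || jpeg[1] != 0xD8)
            throw new ArgumentException("Not a JPEG (missing SOI marker).");

        using (var output = new MemoryStream(jpeg.Length))
        {
            output.Write(jpeg, 0, 2); // SOI marker
            int pos = 2;

            while (pos + 4 <= jpeg.Length && jpeg[pos] == 0xFF)
            {
                byte marker = jpeg[pos + 1];
                if (marker == 0xDA) break; // SOS: the rest is scan data

                // Big-endian segment length; it counts its own two bytes.
                int length = (jpeg[pos + 2] << 8) | jpeg[pos + 3];
                int segmentSize = 2 + length;
                if (length < 2 || pos + segmentSize > jpeg.Length)
                    throw new InvalidDataException("Corrupt JPEG segment.");

                if (marker != 0xE1) // keep everything except APP1
                    output.Write(jpeg, pos, segmentSize);

                pos += segmentSize;
            }

            // Copy the scan data and the end-of-image marker unchanged.
            output.Write(jpeg, pos, jpeg.Length - pos);
            return output.ToArray();
        }
    }
}
```

In the upload handler this would be applied to the PhotoFile byte array before saving it. PNG stores its metadata differently, so it needs its own handling, and a tested library is the safer choice for production use.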

Dissolve string bytes into a fixed length formula based pattern by using keys, and even extract those bytes

Suppose there is a string containing 255 characters, and a fixed-length byte pattern of, say, 64-128 bytes. I want to "dissolve" that 255-character string, byte by byte, into the fixed-length byte pattern. The byte pattern is like a formula-based "hash" or something similar, into which a formula-based algorithm dissolves the bytes. Later, when I need to extract the dissolved bytes from that fixed-length pattern, I would use the same algorithm's reverse, or extract, function. The algorithm works through special keys or passwords and uses them to dissolve the bytes into the pattern; the same keys are used to extract the bytes in their original value from the pattern. I ask for help from the coders here. Please also guide me through the steps so that I can understand what needs to be done. I only know VB .NET and C#.
For instance:
I have these three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve", "submerge" the characters "A", "B", "C", one by one into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like the hash:
EJS83HDLG89DB2G9L47
I would be able to extract the characters, from the last dissolved character to the first, by using an extraction algorithm and the original keys which were used to dissolve the characters into this super pattern. After the extraction of all the characters, the super pattern resets to the original initial state. Each character insertion and removal has a unique pattern state.
After extraction of all characters, the super pattern goes back to the original state. This happens upon the removal of the character by the extraction algo:
AJE83HDL389SB4VS9L3
This looks a lot like your previous question(s). The problem with them is that you seem to start asking from a half-baked solution.
So, what do you really want? Input, Output, Constraints?
To encrypt a string, use encryption (Rijndael). To transform the resulting byte[] data to a string (for transport), use base64.
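A minimal sketch of that suggestion, using the framework's Aes class (AES is the standardized form of Rijndael) with its default CBC mode and PKCS7 padding. The class and method names (StringCipher, EncryptToBase64, DecryptFromBase64) are made up for this example, and key and IV management is left out entirely.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StringCipher
{
    // Encrypts the text with AES and returns the ciphertext as Base64.
    // The key must be 16, 24 or 32 bytes long and the IV 16 bytes long.
    public static string EncryptToBase64(string plainText, byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (ICryptoTransform encryptor = aes.CreateEncryptor())
            {
                byte[] plain = Encoding.UTF8.GetBytes(plainText);
                byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
                return Convert.ToBase64String(cipher);
            }
        }
    }

    // Reverses EncryptToBase64 when given the same key and IV.
    public static string DecryptFromBase64(string base64, byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (ICryptoTransform decryptor = aes.CreateDecryptor())
            {
                byte[] cipher = Convert.FromBase64String(base64);
                byte[] plain = decryptor.TransformFinalBlock(cipher, 0, cipher.Length);
                return Encoding.UTF8.GetString(plain);
            }
        }
    }
}
```

Note that the output is never shorter than the input: 255 single-byte characters are padded to 256 bytes of ciphertext, which becomes 344 Base64 characters. Encryption gives reversibility with a key, but not a fixed 64-128 byte result.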
If you're happy having the 'keys' for the individual bits of data being determined for you, this can be done similarly to a one-time-pad (though it's not one-time!) - generate a random string as your 'base', then xor your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't result in output data that's any smaller than the input, however (and this is impossible in the general case anyway), if that's what you're going for.
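The XOR idea can be sketched in a few lines. XorPad and Apply are hypothetical names for this example; the pad is repeated when the data is longer than it, which is an assumption of this sketch and one more reason it offers no real security.

```csharp
using System;

public static class XorPad
{
    // XORs each data byte with the pad, repeating the pad if it is shorter.
    // Applying the same pad a second time restores the original bytes.
    public static byte[] Apply(byte[] data, byte[] pad)
    {
        if (data == null) throw new ArgumentNullException("data");
        if (pad == null || pad.Length == 0) throw new ArgumentException("Pad must not be empty.", "pad");

        var result = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            result[i] = (byte)(data[i] ^ pad[i % pad.Length]);
        }
        return result;
    }
}
```

Calling Apply on the data with the 'base' gives the 'key', and calling Apply on that 'key' with the same 'base' gives the data back. The output is exactly as long as the input, so nothing is compressed.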
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible.
Here are 2 cases:
Lossless compression (the exact bytes are decoded from the compressed data)
In this case Shannon entropy
clearly states that no algorithm can compress data below the size that its information entropy predicts.
Lossy compression (some of the original bytes are lost forever in the compression scheme, as in JPG image files; remember the 'image quality' setting?)
With this type of compression you can make the compressed output smaller and smaller, with the penalty that you lose more and more of the original bytes.
(Taken to the extreme, you get compression to zero bytes, where zero bytes are restored afterwards. That compression has already been invented too: the magical DELETE button, which moves information into a black hole. Sorry for the sarcasm. ;)

Really simple short string compression

Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?
I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.
I think the key question here is "Why do you want to compress URLs?"
Trying to shorten long urls for the address bar?
You're better off storing the original URL somewhere (database, text file ...) alongside a hashcode of the non-domain part (MD5 is fine). You can then have a simple page (or some HTTPModule if you're feeling flashy) to read the MD5 and look up the real URL. This is how TinyURL and others work.
For example:
http://mydomain.com/folder1/folder2/page1.aspx
Could be shortened to:
http://mydomain.com/2d4f1c8a
Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting this back to a string which needs to be valid as part of a URL (e.g. Base64) will negate any benefit you gained from the compression.
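The store-and-look-up approach might look roughly like this. It is a sketch under stated assumptions: UrlShortener, Shorten and Resolve are made-up names, an in-memory dictionary stands in for the database, and the key is simply the first 8 hex characters of the MD5 hash, with no collision handling.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public class UrlShortener
{
    // Stand-in for a database table mapping short keys to original paths.
    private readonly Dictionary<string, string> store = new Dictionary<string, string>();

    // Stores the path and returns its short key (first 4 hash bytes as hex).
    public string Shorten(string path)
    {
        string key;
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(path));
            key = BitConverter.ToString(hash, 0, 4).Replace("-", "").ToLowerInvariant();
        }
        store[key] = path; // a real system would detect and resolve collisions here
        return key;
    }

    // Returns the original path, or null when the key is unknown.
    public string Resolve(string key)
    {
        string path;
        return store.TryGetValue(key, out path) ? path : null;
    }
}
```

The key stays 8 characters long no matter how long the original path is, which is something no reversible compression scheme can promise.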
Storing lots of URLs in memory or on disk?
Use the built-in compression library in System.IO.Compression, or the ZLib library, which is simple and incredibly good. Since you will be storing binary data, the compressed output will be fine as-is. You'll need to decompress it to use it as a URL.
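With System.IO.Compression this takes two short methods. A minimal sketch, with made-up names (UrlCompressor, Compress, Decompress) and UTF-8 assumed as the text encoding:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class UrlCompressor
{
    // Compresses the text with DEFLATE and returns the binary result.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // Disposing the DeflateStream flushes the final compressed block.
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return buffer.ToArray();
        }
    }

    // Restores the original text from the bytes produced by Compress.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For very short strings the compressed form can come out longer than the input, so it is worth measuring on real URLs before committing to this.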
As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.
DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:
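using System;
using Ionic.Zlib; // DotNetZip's zlib namespace

public class CompressDemo
{
    string[] orig = {
        "folder1/folder2/page1.aspx",
        "folderBB/folderAA/page2.aspx",
    };

    public void Run()
    {
        foreach (string s in orig)
        {
            Console.WriteLine("original : {0}", s);
            byte[] compressed = DeflateStream.CompressString(s);
            Console.WriteLine("compressed : {0}", ByteArrayToHexString(compressed));
            string uncompressed = DeflateStream.UncompressString(compressed);
            Console.WriteLine("uncompressed: {0}\n", uncompressed);
        }
    }

    // Helper not shown in the original answer: lowercase hex without separators.
    static string ByteArrayToHexString(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }
}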
Using that code, here are my test results:
original : folder1/folder2/page1.aspx
compressed : 4bcbcf49492d32d44f03d346fa0589e9a9867a89c5051500
uncompressed: folder1/folder2/page1.aspx
original : folderBB/folderAA/page2.aspx
compressed : 4bcbcf49492d7272d24f03331c1df50b12d3538df4128b0b2a00
uncompressed: folderBB/folderAA/page2.aspx
So you can see the "compressed" byte array, when represented in hex, is longer than the original, about 2x as long. The reason is that a hex byte is actually 2 ASCII chars.
You could compensate somewhat for that by using base-62, instead of base-16 (hex), to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.
EDIT
OK, I tested the base-62 encoder. It shortens the hex string by about half. I figured it would cut it to 25% (62/16 =~ 4), but I think I am losing something with the discretization. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach. You really want a hash value.
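The size of the saving follows from bits per character rather than from the ratio of the bases: a hex digit carries 4 bits, while a base-62 digit carries log2(62), about 5.95 bits, so the base-62 form is roughly two thirds the length of the hex form. For the 24-byte result above that is roughly 32 characters against a 26-character original. Below is a minimal encoder sketch; Base62 and Encode are made-up names, it needs the System.Numerics assembly, and because it treats the bytes as one big number, leading zero bytes are not preserved.

```csharp
using System;
using System.Numerics;
using System.Text;

public static class Base62
{
    const string Digits = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Reads the bytes as one big-endian unsigned integer and writes it in base 62.
    public static string Encode(byte[] data)
    {
        BigInteger value = BigInteger.Zero;
        foreach (byte b in data)
        {
            value = value * 256 + b;
        }
        if (value.IsZero) return "0";

        var sb = new StringBuilder();
        while (value > 0)
        {
            // Prepend the least significant base-62 digit.
            sb.Insert(0, Digits[(int)(value % 62)]);
            value /= 62;
        }
        return sb.ToString();
    }
}
```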
I'd suggest looking in the System.IO.Compression Namespace. There's an article on CodeProject that may help.
I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).
see http://blog.alivate.com.au/packed-url/
It would be great if someone from a big tech company built this out properly and published it for all to use. Google championed Protocol buffers. This tool can save a lot of disk space for someone like Google, while still being scannable. Or perhaps the great captain himself? https://twitter.com/capnproto
Technically, I would call this a binary (bitwise) serialisation scheme for the data that underlies a URL. Treat the URL as text-representation of conceptual data, then serialize that conceptual data model with a specialised serializer. The outcome is a more compressed version of the original of course. This is very different to how a general-purpose compression algorithm works.
What's your goal?
A shorter URL? Try URL shorteners like http://tinyurl.com/ or http://is.gd/
Storage space? Check out System.IO.Compression. (Or SharpZipLib)
You can use deflate algorithm directly, without any headers checksums or footers, as described in this question: Python: Inflate and Deflate implementations
This cuts down a 4100 character URL to 1270 base64 characters, in my test, allowing it to fit inside IE's 2000 limit.
And here's an example of a 4000-character URL, which can't be solved with a hashtable since the applet can exist on any server.
I would start with trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/
Zip should work well for text strings, and I am not sure it is worth implementing a compression algorithm yourself.
Have you tried just using gzip?
No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
The open source library SharpZipLib is easy to use and will provide you with compression tools.
