How to create uncompressed PDF file? - c#

An unorthodox question for sure. :-) During development testing, I'd prefer to create uncompressed, non-binary PDF files with iTextSharp so that I can check their internals easily. I can always convert the finished PDF with various utilities, but that takes an extra step that I'd rather avoid. There are some places (e.g. PdfImage) where I can decide on the compression level, but I couldn't find anything about the compression used to output the individual PDF objects into the stream. Do you think this is possible with iTextSharp?

You have two options:
[1.] Disable compression for all your documents at once:
Please take a look at the Document class. You'll find:
public static bool Compress = true;
If you change this static bool to false, PDF syntax streams won't be compressed.
Of course, images are de facto already compressed. For instance: if you add a JPEG to your document, you are adding an image that is already compressed, and iText won't uncompress it.
[2.] Disable compression for a single document at a time:
Please take a look at the PdfWriter class. You'll find:
protected internal int compressionLevel = PdfStream.DEFAULT_COMPRESSION;
If you change the value of compressionLevel to PdfStream.NO_COMPRESSION before opening the document, your PDF syntax streams won't be compressed.

Related

Convert iTextSharp.text.pdf.BarcodeQRCode to System.Drawing.Image

Looking for a way to convert iTextSharp.text.pdf.BarcodeQRCode to System.Drawing.Image
This is what I have so far...
public System.Drawing.Image GetQRCode(string content)
{
iTextSharp.text.pdf.BarcodeQRCode qrcode = new iTextSharp.text.pdf.BarcodeQRCode(content, 115, 115, null);
iTextSharp.text.Image img = qrcode.GetImage();
MemoryStream ms = new MemoryStream(img.OriginalData);
return System.Drawing.Image.FromStream(ms);
}
In line 3 above, using img.OriginalData returns null.
Using img.RawData on line 3 instead throws an invalid parameter error on line 4.
I've googled some of the code samples on how to do what you want, and your code (the "OriginalData" approach) is basically the same: https://csharp.hotexamples.com/examples/iTextSharp.text.pdf/BarcodeQRCode/-/php-barcodeqrcode-class-examples.html .
However, I don't see how it could work. From my investigations of BarcodeQRCode#getImage it seems that OriginalData is not set while processing such a barcode, so it will always be null.
More than that, the code you mention belongs to iText 5, which is end of life and no longer maintained (with the exception of major security fixes), so it's recommended to update to iText 7.
As for iText 7, I do see how to achieve the same in Java, since barcode classes do have a createAwtImage method there. .NET, on the other hand, lacks such functionality, so I'd say that one unfortunately can't do it in .NET.
There are some good reasons for that. iText's Images (and a barcode can easily be converted to an iText Image object as shown here: https://kb.itextpdf.com/home/it7kb/faq/how-to-generate-2d-barcode-as-vector-image) represent a PDF XObject. In PDF syntax, an image file (jpg, png, etc.) is an XObject with the raw image data stored inside. However, an XObject can also contain PDF syntax content (it is not just used for image files). So to render such content one needs to process the data from PDF syntax to image syntax, which is not that easy. There are some means in Java's awt to do so, which is why it's implemented in Java. As for .NET, since there is no out-of-the-box means to convert PDF images to System.Drawing.Image, it was decided not to implement it.
To conclude, there is another iText product, pdfRender, which allows one to convert PDF files (and you could create a page just for a barcode) to images. Perhaps you might want to play with it: https://itextpdf.com/en/products/itext-7/convert-pdf-to-image-pdfrender

Image to PDF conversion without using third party library in C#

I need to convert image files to PDF without using third party libraries in C#. The images can be in any format like (.jpg, .png, .jpeg, .tiff).
I am successfully able to do this with the help of itextsharp; here is the code.
string value = string.Empty; // value contains the data from a json file
List<string> sampleData;
public void convertdata()
{
    //sampleData = Newtonsoft.Json.JsonConvert.DeserializeObject<List<string>>(value);
    var jsonD = System.IO.File.ReadAllLines(@"json.txt");
    sampleData = Newtonsoft.Json.JsonConvert.DeserializeObject<List<string>>(jsonD[0]);
    Document document = new Document();
    using (var stream = new FileStream("test111.pdf", FileMode.Create, FileAccess.Write, FileShare.None))
    {
        PdfWriter.GetInstance(document, stream);
        document.Open();
        foreach (var item in sampleData)
        {
            // each entry is one Base64-encoded image
            byte[] newdata = Convert.FromBase64String(item);
            var image = iTextSharp.text.Image.GetInstance(newdata);
            document.Add(image);
            Console.WriteLine("Conversion done check folder");
        }
        document.Close();
    }
}
But now I need to perform the same without using third party library.
I have searched the internet but I am unable to get something that can suggest a proper answer. All I am getting is to use it with either "itextsharp" or "PdfSharp" or with the "GhostScriptApi".
Would someone suggest a possible solution?
This is doable but not practical in the sense that it would very likely take way too much time for you to implement. The general procedure is:
Open the image file format
Either copy the encoded bytes verbatim to a stream in a PDF document you have created or decode the image data and re-encode it in a PDF stream (whether it's the former or latter depends on the image format)
Save the PDF
This looks easy (it's only three points after all :-)) but when you start to investigate you'll see that it's very complicated.
First of all you need to understand enough of the PDF specification to write a new PDF file from scratch, doing all of the right things. The PDF specification is way over 1000 pages by now; you don't need all of it but you need to support a good portion of it to write a proper PDF document.
Secondly you will need to understand every image file format you want to support. That by itself is not trivial (the TIFF file format for example is so broad that it's a nightmare to support a reasonable fraction of TIFF files out there). In some cases you'll be able to simply copy the bulk of an image file format into your PDF document (JPEG files fall in that category for example). That's a complication you want to support, because uncompressing the JPEG file and then recompressing it in a PDF stream will cause quality loss.
So... possible? Yes. Plausible? No. Unless you have lots and lots of time to complete this project.
The structure of the simplest PDF document with one single page and one single image is the following:
- pdf header
- pdf document catalog
- pages info
- image
  - image header
  - image data
- page
  - reference to image
- list of references to objects inside pdf document
Check this Python code that is doing the following steps to convert image to PDF:
Writes PDF header;
Checks image data to find which filter to use. You would do better to pick just one format, such as the FlateDecode codec (used by PDF to compress images without loss);
Writes "catalog" object, which is basically the array of references to page objects.
Writes image object header;
Writes image data (pixels by pixels, converted to the given codec format) as the "stream" object in pdf;
Writes "page" object which contains "image" object;
Writes "trailer" section with the set of references to objects inside PDF and their starting offsets. PDF format stores references of objects at the end of PDF document.
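The steps above can be sketched in C# with nothing but the base class library. This is a minimal sketch, not a general converter: it assumes the input is an 8-bit RGB JPEG whose bytes can be copied verbatim (DCTDecode filter), and that the caller already knows the pixel dimensions. The names MinimalPdf and FromJpeg are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MinimalPdf
{
    // Wraps an already-encoded JPEG in a one-page PDF without re-encoding it.
    // Assumptions: 8-bit RGB JPEG; width/height are its pixel dimensions.
    public static byte[] FromJpeg(byte[] jpeg, int width, int height)
    {
        var ms = new MemoryStream();
        var offsets = new List<long>(); // byte offset of every object, needed for the xref table

        void Write(string s) { byte[] b = Encoding.ASCII.GetBytes(s); ms.Write(b, 0, b.Length); }
        void BeginObj() { offsets.Add(ms.Position); Write(offsets.Count + " 0 obj\n"); }

        Write("%PDF-1.4\n");                                   // PDF header

        BeginObj();                                            // 1: document catalog
        Write("<< /Type /Catalog /Pages 2 0 R >>\nendobj\n");

        BeginObj();                                            // 2: pages info
        Write("<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n");

        BeginObj();                                            // 3: page, refers to image and content
        Write("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 " + width + " " + height + "] " +
              "/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>\nendobj\n");

        BeginObj();                                            // 4: image header + image data
        Write("<< /Type /XObject /Subtype /Image /Width " + width + " /Height " + height +
              " /ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode" +
              " /Length " + jpeg.Length + " >>\nstream\n");
        ms.Write(jpeg, 0, jpeg.Length);
        Write("\nendstream\nendobj\n");

        BeginObj();                                            // 5: content stream that paints the image
        string content = "q " + width + " 0 0 " + height + " 0 0 cm /Im0 Do Q";
        Write("<< /Length " + content.Length + " >>\nstream\n" + content + "\nendstream\nendobj\n");

        long xrefOffset = ms.Position;                         // cross-reference table and trailer
        Write("xref\n0 " + (offsets.Count + 1) + "\n");
        Write("0000000000 65535 f \n");
        foreach (long offset in offsets) Write(offset.ToString("D10") + " 00000 n \n");
        Write("trailer\n<< /Size " + (offsets.Count + 1) + " /Root 1 0 R >>\n");
        Write("startxref\n" + xrefOffset + "\n%%EOF\n");

        return ms.ToArray();
    }
}
```

A call such as File.WriteAllBytes("out.pdf", MinimalPdf.FromJpeg(File.ReadAllBytes("in.jpg"), 640, 480)) would produce a page sized to the image at one point per pixel. PNG and TIFF input would need real decoding and re-encoding, which is where most of the effort described above goes.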
I would write my own ASP.NET Web Service or Web API service and call it within the app :)

Remove Byte Order Mark from signed PDF file?

I am using iTextSharp 5.5.1 in order to sign PDF files digitally with a detached signature (obtained from a third party authority). Everything seems to work fine, the file is valid and e.g. Adobe Reader reports no problems, displays the signatures as valid etc.
The problem is that the Java Clients have apparently some problems with those files - the file can be neither opened nor parsed.
The files have a byte order mark in the header which seems to cause the behavior (0xEF 0xBB 0xBF).
I could identify the BOM like this:
PdfReader reader = new PdfReader(path);
byte[] metadata = reader.Metadata;
// metadata[0], metadata[1], metadata[2] contain the BOM
How can I either remove the BOM (without losing the validity of the signature), or force the iTextSharp library not to append these bytes into the files?
First things first: once a PDF is signed, you shouldn't change any byte of that PDF, because you invalidate the signature if you do.
Second observation: the byte order mark is not part of the PDF header (a PDF always starts with %PDF-1.). In this context, it is the value of the begin attribute in the processing instruction of XMP metadata. I don't know of any Java client that has a problem with that byte sequence anywhere in a file. If they do have a problem with it, there's a problem with that client, not with the file.
The Byte Order Mark is an indication of the presence of UTF-8 characters. In the context of XMP, we have a stream inside the PDF that contains a clear text XML file that can be consumed by software that is not "PDF aware". For instance:
2 0 obj
<</Type/Metadata/Subtype/XML/Length 3492>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
dc:format="application/pdf"
pdf:Keywords="Metadata, iText, PDF"
pdf:Producer="iText® 5.5.4-SNAPSHOT ©2000-2014 iText Group NV (AGPL-version); modified using iText® 5.5.4-SNAPSHOT ©2000-2014 iText Group NV (AGPL-version)"
xmp:CreateDate="2014-11-07T16:36:55+01:00"
xmp:CreatorTool="My program using iText"
xmp:ModifyDate="2014-11-07T16:36:56+01:00"
xmp:MetadataDate="2014-11-07T16:36:56+01:00">
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">This example shows how to add metadata</rdf:li>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li>Bruno Lowagie</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:subject>
<rdf:Bag>
<rdf:li>Metadata</rdf:li>
<rdf:li>iText</rdf:li>
<rdf:li>PDF</rdf:li>
</rdf:Bag>
</dc:subject>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Hello World example</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
Such non-PDF-aware software will look for the sequence W5M0MpCehiHzreSzNTczkc9d, which is a sequence that is unlikely to appear by accident in a data stream.
The begin attribute is there to indicate that the characters in the stream use UTF-8 encoding. They are there because it is good practice for them to be there, but they are not mandatory (ISO-16684-1).
You could retrieve the metadata the way you do (byte[] metadata = reader.Metadata;), remove the bytes, and change the stream with a PdfStamper instance like this:
stamper.XmpMetadata = metadata;
After you have changed the metadata, you can sign the PDF.
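The "remove the bytes" step can be sketched as a small helper that only touches the array when it really starts with the UTF-8 byte order mark. The names BomHelper and StripUtf8Bom are hypothetical; the result is what you would then assign to stamper.XmpMetadata before signing.

```csharp
using System;

public static class BomHelper
{
    // Returns a copy of data without a leading UTF-8 BOM (EF BB BF).
    // If there is no BOM at the start, the input is returned unchanged.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        bool hasBom = data != null && data.Length >= 3
                      && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
        if (!hasBom) return data;

        var result = new byte[data.Length - 3];
        Array.Copy(data, 3, result, 0, result.Length);
        return result;
    }
}
```

Because the check is on the first three bytes only, the BOM that legitimately appears inside the begin attribute of the xpacket processing instruction is left alone.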
Note that one aspect of your question surprises me. You write:
// metadata[0], metadata[1], metadata[2] contain the BOM
It is very strange that the first three bytes of the XMP metadata contain the BOM. XMP metadata is supposed to start with <?xpacket. If it doesn't, you are doing the right thing by removing those bytes.
Caveat: a PDF can contain XMP metadata at different levels. Right now, you are examining the most common one: document-level metadata. You may encounter PDFs with page-level XMP metadata, with XMP inside an image, etc...
Just a quick approach:
First: save both files un-encrypted.
Second: remove metadata 0 through 2 before saving the file
There are some considerations however: does the signing method require a BOM? Does the encryption method require a BOM?
You will also have to ascertain at what stage the BOM is added before you can determine whether you can/should remove the BOM.
I will have a quick hunt through my PDF structure docs and see what I can find. However, the simplest way (untried) would be to load the whole thing as a byte array and simply remove 0xEF 0xBB 0xBF from the start of the file, then do any signing/encryption. They may add it in again, though...
I will post an update over the weekend:)

How to remove metadata from jpg and png images

This should be a pretty trivial programming task in C#, however after I have searched a while I simply cannot find anything relevant on how to remove metadata.
I want to remove jpg and png image metadata such as: folder path, shared with, owner and computer.
My application is an MVC 4 application. On my website users can upload an image, and I receive this image in this ActionResult method:
if (image != null)
{
photo.ImageFileName = image.FileName;
photo.ImageMimeType = image.ContentType;
photo.PhotoFile = new byte[image.ContentLength];
image.InputStream.Read(photo.PhotoFile, 0, image.ContentLength);
}
Photo is a property in the model, goes like this.
public byte[] PhotoFile { get; set; }
I imagine the way to remove above mentioned metadata or just all metadata, would be to use some coding like this
if (image != null)
{
image = image.RemoveAllMetaData; !!!
I don't mind using some 3rd party DLL as long as it is compatible with .NET 4.
Thanks.
'Metadata' here is a bit ambiguous--Do you mean the data which is required for a viewer to properly determine the image format so it can be displayed, saving only the raw image data? Or, more likely, do you mean the extra information, such as author, camera type, GPS location, etc, that is often added via the EXIF tags?
If you mean something like the EXIF data, there's a lot of programming material already on the web about how to add/modify/remove EXIF tags, and even some apps which already strip such tags: http://www.steelbytes.com/?mid=30 for example.
If you mean you just want the raw image data, you'll probably have to read and process the image first, since both JPEG and PNG do not contain simply the raw image data; It's encoded with various methods--which is why they contain metadata to tell you how to decode it in the first place. You'll have to learn/explore the JPEG and PNG data formats to extract the original raw image data (or a reasonable facsimile in the case of a "lossy" encoding).
All the above is well-documented on various websites which can be found on Google, and many include image manipulation libraries which can handle these chores for you. I suspect you just didn't know to search for something like "JPEG PNG EXIF METADATA".
BTW, EXIF applies to JPEGs. Loosely speaking, EXIF is an extra block of data added to the JPEG file: it sits in its own application segment near the start of the file, separate from the compressed image data, so it can usually be removed without touching the picture itself. A quick Google search for me turned up something like libexif.sourceforge.net and other similar results.
I'm not entirely certain about the PNG format, but I believe the PNG format (which does call such items "metadata" as well) was written to include such data as part of the file format rather than as an "extension" tagged on after the fact like EXIF is. PNG, however, is an open format, and you can obtain libraries and code for manipulating PNG files from the PNG website (www.libpng.org).
There's an app for that but it's written in Perl. It doesn't recompress the image and it's here http://www.sno.phy.queensu.ca/~phil/exiftool
Found it in this thread
How to remove EXIF data without recompressing the JPEG?
Do what all the social media websites do. Create a new image file, stream in the image byte data, and use the file you created rather than the original one that was uploaded. Of course, you will need to find out the original image's color depth and so on, so that the image you create is not of a lower quality, unless you need to do a disk or image resize as well.
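For JPEG specifically, the "without recompressing" idea from the linked thread can be sketched by walking the segments at the start of the file and leaving out the ones that carry metadata. This is only a sketch under stated assumptions: the input is a well-formed JPEG, and "metadata" means APP1 (Exif/XMP), APP13 (Photoshop/IPTC) and comment segments. PNG would need its own chunk-level handling. The names JpegMetadata and Strip are made up.

```csharp
using System;
using System.IO;

public static class JpegMetadata
{
    // Copies a JPEG while dropping APP1, APP13 and COM segments.
    // The compressed image data after the SOS marker is copied byte for byte.
    public static byte[] Strip(byte[] jpeg)
    {
        if (jpeg == null || jpeg.Length < 4 || jpeg[0] != 0xFF || jpeg[1] != 0xD8)
            throw new ArgumentException("Not a JPEG stream.");

        var output = new MemoryStream();
        output.Write(jpeg, 0, 2);                  // SOI marker
        int pos = 2;

        while (pos + 4 <= jpeg.Length)
        {
            if (jpeg[pos] != 0xFF)
                throw new InvalidDataException("Expected a marker at offset " + pos);

            byte marker = jpeg[pos + 1];
            if (marker == 0xDA) break;             // SOS: everything after this is image data

            // The segment length is big-endian and includes its own two bytes.
            int length = (jpeg[pos + 2] << 8) | jpeg[pos + 3];
            int segmentSize = 2 + length;
            if (length < 2 || pos + segmentSize > jpeg.Length)
                throw new InvalidDataException("Corrupt segment at offset " + pos);

            bool isMetadata = marker == 0xE1       // APP1: Exif, XMP
                           || marker == 0xED       // APP13: Photoshop / IPTC
                           || marker == 0xFE;      // COM: free-text comment
            if (!isMetadata) output.Write(jpeg, pos, segmentSize);

            pos += segmentSize;
        }

        output.Write(jpeg, pos, jpeg.Length - pos); // SOS, scan data and EOI, untouched
        return output.ToArray();
    }
}
```

In the upload action from the question, this would be applied to photo.PhotoFile after reading the stream, for JPEG uploads only.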

Handling strings more than 2 GB

I have an application where an XLS file with lots of data entered by the user is opened and the data in it is converted to XML. I have already mapped the columns in the XLS file to XML Maps. When I try to use the ExportXml method in XMLMaps, I get a string with the proper XML representation of the XLS file. I parse this string a bit and upload it to my server.
The problem is, when my XLS file is really large, the string produced for the XML is over 2 GB and I get an OutOfMemoryException. I understand that the limit for CLR objects is 2 GB. But in my case I need to handle this scenario. Presently I just show a message asking the user to send less data.
Any ideas on how I can do this?
EDIT:
This is just the gist of the operations I need to do on the generated XML.
Remove certain fields which are not needed for the server data.
Add something like ID numbers for each row of data.
Modify the values of certain elements.
Do validation on the data.
While the XMLReader stream is a good idea, I cannot perform these operations by that method. While data validation can be done by Excel itself, the other things cannot be done here.
Using XMLTextReader and XMLTextWriter and creating a custom method for each of the steps is a solution I had thought of. But to go through the gist above, it requires the XML document to be gone through or processed 4 times. This is just not efficient.
If the XML is that large, then you might be able to use Export to a temporary file, rather than using ExportXML to a string - http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.xmlmap.export.aspx
If you then need to parse/handle the XML in C#, then for handling such large XML structures, you'll probably be better off implementing a custom XMLReader (or XMLWriter) which works at the stream level. See this question for some similar advice - What is the best way to parse large XML (size of 1GB) in C#?
I guess there is no other way than using an x64 OS and framework if you really need to hold the whole thing in RAM, but processing the data some other way, as suggested by Stuart, may be the better way to go...
What you need to do is to use "stream chaining", i.e. you open up an input stream which reads from your excel file and an output stream that writes to your xml file. Then your conversion class/method will take the two streams as input and read sufficient data from the input stream to be able to write to the output.
Edit: very simple minimal Example
Converting from file:
123
1244125
345345345
4566
11
to
<List>
<ListItem>123</ListItem>
<ListItem>1244125</ListItem>
...
</List>
using
void Convert(Stream fromStream, Stream toStream)
{
    using (StreamReader from = new StreamReader(fromStream))
    using (StreamWriter to = new StreamWriter(toStream))
    {
        to.WriteLine("<List>");
        while (!from.EndOfStream)
        {
            string bulk = from.ReadLine(); // in this case, a single line is sufficient
            // some code to parse the bulk or clean it up, e.g. remove '\r\n'
            to.WriteLine(string.Format("<ListItem>{0}</ListItem>", bulk));
        }
        to.WriteLine("</List>");
    }
}
// File.Create truncates an existing target; File.OpenWrite would leave stale bytes past the new end
Convert(File.OpenRead("source.xls"), File.Create("source.xml"));
Of course you could do this in a much more elegant, abstract manner, but this is only to show my point.
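The same chaining idea also works when the input is already XML, and it does not need one pass per operation. The sketch below copies nodes from an XmlReader to an XmlWriter one at a time, dropping unwanted elements and numbering rows on the way through, so memory use stays flat regardless of file size. The class name, the rowName and dropped parameters, and the "Id" attribute are assumptions for illustration; it also assumes rows do not already carry an Id attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlStreamFilter
{
    // Single pass: removes every element whose name is in 'dropped' (with its subtree)
    // and adds a running Id attribute to every element named 'rowName'.
    public static void Transform(XmlReader reader, XmlWriter writer,
                                 string rowName, ISet<string> dropped)
    {
        int nextId = 1;
        reader.Read();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && dropped.Contains(reader.LocalName))
            {
                reader.Skip();                     // jumps past the whole subtree
                continue;
            }

            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    bool isEmpty = reader.IsEmptyElement;
                    bool isRow = reader.LocalName == rowName;
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, false);
                    if (isRow) writer.WriteAttributeString("Id", XmlConvert.ToString(nextId++));
                    if (isEmpty) writer.WriteEndElement();
                    break;
                case XmlNodeType.Text:
                case XmlNodeType.CDATA:
                    writer.WriteString(reader.Value); // values could be validated or modified here
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                // comments, processing instructions and the XML declaration are not copied
            }
            reader.Read();
        }
        writer.Flush();
    }
}
```

With XmlReader.Create and XmlWriter.Create opened over file streams (for example the temporary file produced by Export), removing fields, adding IDs, modifying values and validating can all happen in this one loop instead of four separate passes.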
