I have a file format that was generated by an enterprise, legacy (10+ year old) C# application. It was almost certainly compressed by some form of zlib, and was found in a zipped wrapper package much like .docx. The files are generated and named something.xmlzip, but I have not found a way to decompress the stream with zip/gzip-type tools, or with Python's deflate/gzip methods while trying to work around the lack of any stream headers. The contents are certainly an XML document.
The main identifying characteristics of the data are a consistent header/magic number and trailer:
$ xxd thedoc | head -1
00000000: 0404 0a04 3ae4 706e 03c4 0585 3a1b 3a0c ....:.pn....:.:.
$ xxd thedoc | tail -n 2
00003320: 1c0c 6d8d 6d7d 6458 0bfe 61d7 7a5d 7c38 ..m.m}dX..a.z]|8
00003330: 338b 2640 fffe fffe 3.&#....
The 0404 0a04 header and fffe fffe trailer appear in every file. How might I inflate these files?
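For reference, here is the kind of brute-force scan I have been trying, as a minimal C# sketch; it assumes the payload is a raw (headerless) DEFLATE stream starting at some unknown offset past the 0404 0a04 header, which DeflateStream can consume directly (a zlib 0x78 xx header would simply be skipped over by the scan):

using System;
using System.IO;
using System.IO.Compression;

class InflateScan
{
    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);

        // Try each offset in the first 64 bytes as the start of a raw
        // DEFLATE stream and report anything that inflates successfully.
        for (int offset = 0; offset < Math.Min(64, data.Length); offset++)
        {
            try
            {
                using (var input = new MemoryStream(data, offset, data.Length - offset))
                using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
                using (var output = new MemoryStream())
                {
                    inflater.CopyTo(output);
                    byte[] result = output.ToArray();
                    // An XML payload should start with '<' or a BOM.
                    if (result.Length > 0)
                        Console.WriteLine("offset {0}: {1} bytes, first bytes: {2}",
                            offset, result.Length,
                            BitConverter.ToString(result, 0, Math.Min(8, result.Length)));
                }
            }
            catch (InvalidDataException)
            {
                // Not a valid DEFLATE stream at this offset; keep scanning.
            }
        }
    }
}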
Related
I have a PDF file stored at a server URL, and I want to get each line of the file.
I want to export it to an Excel file later, so I need to get every line, one by one.
I will put the code here. Note: the URL of the PDF stops working after 3 hours; I will keep updating it here in the comments. Thanks.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        var pdfUrl = "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=4baa2515293382eb41b2a95e121550490b5b154f1c4c06e8b0469eff082311e6&hash=3112f8451af24a1a5c3e69afab09f079&termosPesquisados=";
        var client = new HttpClient();
        var response = await client.GetAsync(pdfUrl);

        using (var stream = await response.Content.ReadAsStreamAsync())
        {
            Console.WriteLine("print each line of my pdf file");
        }
    }
}
Well, extracting text from a PDF is not an ordinary task. If you need a really generic solution that works with any PDF, then the state-of-the-art approach here is to use an AI-based API provided by a cloud platform such as Google, AWS, or Azure:
https://cloud.google.com/vision/docs/pdf
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automatically-extract-content-from-pdf-files-using-amazon-textract.html
So: read the PDF as bytes, send the bytes to the external AI-based API, and receive the parsed content back.
Of course, you will need to do some preparation to use the cloud services mentioned above, and they also cost some money.
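For illustration only, a minimal sketch of that flow in C#; the endpoint URL here is a placeholder, since each provider has its own API, request format, and authentication:

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        // Placeholder endpoint: substitute the real URL and credentials
        // required by whichever provider you choose.
        var apiUrl = "https://example.com/v1/document:analyze";

        byte[] pdfBytes = File.ReadAllBytes("input.pdf");

        using (var client = new HttpClient())
        using (var content = new ByteArrayContent(pdfBytes))
        {
            content.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");

            var response = await client.PostAsync(apiUrl, content);
            response.EnsureSuccessStatusCode();

            // Each provider returns its own (usually JSON) structure with the
            // recognized text; the lines then need to be parsed out of that.
            string parsed = await response.Content.ReadAsStringAsync();
            Console.WriteLine(parsed);
        }
    }
}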
The best way I can explain why you need a PDF decompressor like pdftotext is this: the first line, when decoded by an app (this is not the raw byte stream), comes in three separate parts. Luckily they arrive as whole word strings (they do not need to), and also luckily, in this case, from the same ASCII font table.
BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET
So, when converted to ASCII, we can easily see that all three parts sit at the same vertical position, 793.70; thus a library can assume they form one line with three different horizontal offsets. Hence you need a third-party library to decode and reassemble a line of text as if it were just one string. That requires first saving the PDF as a file, parsing the whole file into several common encodings such as ASCII, hex, and mixed UTF-16 (there is generally no UTF-8), then saving those as a plain-text file with UTF-8 encoding. Then you can extract the UTF-8 lines as required.
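To make the reassembly idea concrete, here is a small illustrative sketch (not any particular library's code) that groups the fragments above by their shared Td y coordinate and orders them by x:

using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

class TdLineGrouper
{
    // Matches fragments of the form: BT <x> <y> Td (<text>) Tj ET
    static readonly Regex Fragment =
        new Regex(@"BT\s+([\d.]+)\s+([\d.]+)\s+Td\s+\((.*?)\)\s+Tj\s+ET");

    static void Main()
    {
        string[] ops =
        {
            "BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET",
            "BT 439.37 793.70 Td (Data: ) Tj ET",
            "BT 481.89 793.70 Td (05/07/2021) Tj ET",
        };

        // Fragments with the same y value belong to the same visual line;
        // sorting each group by x restores left-to-right reading order.
        var lines = ops
            .Select(op => Fragment.Match(op))
            .Where(m => m.Success)
            .GroupBy(m => m.Groups[2].Value)
            .Select(g => string.Concat(g
                .OrderBy(m => double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture))
                .Select(m => m.Groups[3].Value)));

        foreach (var line in lines)
            Console.WriteLine(line);
        // Prints: Espelho de Valores Atualizados. Data: 05/07/2021
    }
}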
It is unclear what format of line output you are hoping for, since a PDF does not have numbered lines. However, if we assign numbers to lines with text (and some without) based on the human concept of layout, we can run a few commands using Poppler utils and native OS text parsing. Here Cme.bat could take loops and arguments, but it is hardcoded for demonstration. Note the console output would need a local chcp code-page change, but the text file is good.
Poppler\poppler-22.04.0\Library\bin>Cme.bat |more
#curl -o brtemp.pdf "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=c6c5f83e942a3ee021a874f6287505c1cb484235935ff1305c6081893e3481b1&hash=922cacb9024f200d13d3f819e2e906f4&termosPesquisados="
#pdftotext -f 1 -l 1 -nopgbrk -layout -enc UTF-8 brtemp.pdf page1.txt
#pdftotext -f 2 -l 2 -nopgbrk -layout -enc UTF-8 brtemp.pdf page2.txt
#find /N /V "Never2BFound" page1.txt
#find /N /V "Never2BFound" page2.txt
This responds:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3749 100 3749 0 0 4051 0 --:--:-- --:--:-- --:--:-- 4052
---------- PAGE1.TXT
[1]Espelho de Valores Atualizados. Data: 05/07/2021
[2]
Page 1.txt
Espelho de Valores Atualizados. Data: 05/07/2021
PROCESSO : 5018290-57.2021.4.04.9388
ORIGINÁRIO : 5002262-05.2018.4.04.7000/PR
TIPO : Precatório
REQUERENTE : ERCILIA GRACIE RIBEIRO
ADVOGADO : ANA PAULA HORIGUCHI - PR064269
REQUERIDO : INSTITUTO NACIONAL DO SEGURO SOCIAL - INSS
PROCURADOR : PROCURADORIA REGIONAL FEDERAL DA 4 REGIÃO - PRF4
DEPRECANTE : Juízo Substituto da 10ª VF de Curitiba
etc.....
I need:
To print a large number of PDFs in duplex to a specific output printer feeder.
I have:
Printing working, using Ghostscript with the 'mswinpr2' device:
using (GhostscriptProcessor processor = new GhostscriptProcessor(new GhostscriptVersionInfo("gsdll32.dll")))
{
    List<string> switches = new List<string>();
    switches.Add("-dPrinted");
    switches.Add("-dBATCH");
    switches.Add("-dNOPAUSE");
    switches.Add("-dNumCopies=1");
    switches.Add("-dPDFFitPage");
    switches.Add("-dFIXEDMEDIA");
    switches.Add("-dNoCancel");
    switches.Add("-sFONTPATH=C:\\Windows\\Fonts");
    switches.Add("-sDEVICE=mswinpr2");
    switches.Add($"-sOutputFile=%printer%{settings.PrinterName}");
    switches.Add("D:\\11.pdf");

    processor.StartProcessing(switches.ToArray(), null);
}
Problem:
One job in the print queue consisting of 2 pages takes more than 50 MB, while I have more than 1500 PDFs totalling 1,000,000 pages.
What I am thinking of doing:
Convert the PDF to PCL or PS, edit those files, and somehow pass the settings (duplex and a specific feeder). Then send the edited PCL or PS file as RAW data to the printer.
Question:
How can I pass the settings to PCL or PS?
Since PDF files can't contain device-specific information, you clearly don't need to pick such information from the input, which makes life simpler.
Ghostscript's ps2write device is capable of inserting document wide or page specific PostScript into its output. So you can 'pass the settings' using that.
For PCL you (probably) need to write some device-specific PJL and insert that into the PCL output. However, PCL is nowhere near as uniform as PostScript; it'll be up to you to find out what needs to be prefixed to the file.
[EDIT]
You don't use -sPSDocOptions; PSDocOptions is a distiller param, so you need:
gswin64c.exe -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=D:\out.ps -c "<</PSDocOptions (<</Duplex true /NumCopies 10>> setpagedevice)>> setdistillerparams" -f D:\0.pdf
Notice that you don't need -f (as you have in your command line) unless you have first set -c. The -f switch is used as a terminator for the -c.
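Untested, but translated back into the GhostscriptProcessor style from the question, that command line might look roughly like this (with the same example Duplex and NumCopies values as above):

List<string> switches = new List<string>();
switches.Add("-dSAFER");
switches.Add("-dBATCH");
switches.Add("-dNOPAUSE");
switches.Add("-sDEVICE=ps2write");
switches.Add("-sOutputFile=D:\\out.ps");
// -c passes PostScript to the interpreter; -f terminates it and
// switches back to reading input files.
switches.Add("-c");
switches.Add("<</PSDocOptions (<</Duplex true /NumCopies 10>> setpagedevice)>> setdistillerparams");
switches.Add("-f");
switches.Add("D:\\0.pdf");

processor.StartProcessing(switches.ToArray(), null);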
OK, before we start: I work for a company that has a license to redistribute PDF files from various publishers in any media form. That being said, the extraction of embedded fonts from the given PDF files is not only legal, but also vital to the presentation.
I am using code found on this site; however, I do not recall the author, and when I find it I will reference them. I have located the stream within the PDF file that contains the embedded fonts, and I have isolated this encoded stream as a string and then as a byte[]. When I use the following code I get an error:
Block length does not match with its complement.
Code (the error occurs in the while line below):
private static byte[] DecodeFlateDecodeData(byte[] data)
{
    MemoryStream outputStream;
    using (outputStream = new MemoryStream())
    {
        using (var compressedDataStream = new MemoryStream(data))
        {
            // Remove the first two bytes to skip the header (it isn't recognized by the DeflateStream class)
            compressedDataStream.ReadByte();
            compressedDataStream.ReadByte();

            var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
            var decompressedBuffer = new byte[compressedDataStream.Length];
            int read;

            // The error occurs in the following line
            while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
            {
                outputStream.Write(decompressedBuffer, 0, read);
            }

            outputStream.Flush();
            compressedDataStream.Close();
        }

        return ReadFully(outputStream);
    }
}
After using the usual tools (Google, Bing, the archives here) I found that the majority of the time this occurs when one has not consumed the first two bytes of the encoded stream. But that is done here, so I cannot find the source of this error. Below is the encoded stream:
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü
Please help, I am beating my head against the wall here!
NOTE: The stream above is the encoded version of Arial Black - according to the specs inside the PDF:
661 0 obj
<<
/Type /FontDescriptor
/FontFile3 662 0 R
/FontBBox [ -194 -307 1688 1083 ]
/FontName /HLJOBA+ArialBlack
/Flags 4
/StemV 0
/CapHeight 715
/XHeight 518
/Ascent 0
/Descent -209
/ItalicAngle 0
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>>
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >>
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü
Is there a particular reason why you're not using the GetStreamBytes() method that is provided with iText? What about data? Are you sure you are looking at the correct bytes? Did you create the PRStream object correctly and did you get the bytes with PdfReader.GetStreamBytesRaw()? If so, why decode the bytes yourself? Which brings me to my initial counter-question: is there a particular reason why you're not using the GetStreamBytes() method?
Looks like GetStreamBytes() might solve your problem out right, but let me point out that I think you're doing something dangerous concerning end-of-line markers. The PDF Specification in 7.3.8.1 states that:
The keyword stream that follows the stream dictionary shall be
followed by an end-of-line marker consisting of either a CARRIAGE
RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE
RETURN alone.
In your code it looks like you always skip two bytes while the spec says it could be either one or two (CR LF or LF).
You should be able to catch whether you are running into this by comparing the exact number of bytes you want to decode with the value of the (Required) "Length" key in the stream dictionary.
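For completeness, a minimal sketch of the GetStreamBytes() route, assuming iTextSharp 5.x and using object number 662 from the dump in the question:

using System.IO;
using iTextSharp.text.pdf;

class FontStreamDump
{
    static void Main()
    {
        PdfReader reader = new PdfReader("mypdf.pdf");

        // Object 662 is the /FontFile3 stream referenced by the
        // /FontDescriptor shown in the question.
        PRStream stream = (PRStream)reader.GetPdfObject(662);

        // GetStreamBytes applies the /FlateDecode filter itself, so there is
        // no need to skip header bytes or drive DeflateStream by hand.
        byte[] decoded = PdfReader.GetStreamBytes(stream);

        File.WriteAllBytes("HLJOBA_ArialBlack.cff", decoded);
        reader.Close();
    }
}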
Okay, for anyone who might stumble across this issue themselves, allow me to warn you: this is a rocky road without a great many good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command-line calls to mutool.exe:
mutool extract C:\mypdf.pdf
This pulls all of the fonts into the folder mutool resides in (it also extracts some images; these are the fonts that could not be converted, usually small subsets, I think). I then wrote a method to move those from that folder into the one I wanted them in.
Of course, converting these into anything usable is a headache in itself, but I have found it to be doable.
As a reminder, font piracy IS piracy.
How should I convert a PDF file into an image (.jpg, .gif, etc.) using ImageMagick on the C# programming platform? Or is there a third-party library aside from ImageMagick that can be used to do this?
Ghostscript can read PDF (as well as PostScript and EPS) and convert it to many different image formats.
BTW, ImageMagick cannot do that by itself: ImageMagick utilizes Ghostscript for exactly that conversion, as an external 'delegate'. ImageMagick is great for continuing to process and manipulate image files, at which jobs it really excels!
The command gs -h (or on Windows: gswin32c.exe -h) should give you an overview about the different devices that are built into your Ghostscript:
GPL Ghostscript GIT PRERELEASE 9.05 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
Usage: gs [switches] [file1.ps file2.ps ...]
Most frequently used switches: (you can use # in place of =)
-dNOPAUSE no pause after page | -q `quiet', fewer messages
-g<width>x<height> page size in pixels | -r<res> pixels/inch resolution
-sDEVICE=<devname> select device | -dBATCH exit after last file
-sOutputFile=<file> select output file: - for stdout, |command for pipe,
embed %d or %ld for page #
Input formats: PostScript PostScriptLevel1 PostScriptLevel2 PostScriptLevel3 PDF
Default output device: x11alpha
Available devices:
alc1900 [....] bmp16 bmp16m [...]
bmp256 bmp32b bmpgray bmpmono bmpsep1 bmpsep8 [....] jpeg jpegcmyk jpeggray
pamcmyk32 pamcmyk4 pbm pbmraw pcl3 pcx16 pcx24b [....]
pcx256 pcx2up pcxcmyk pcxgray pcxmono pdfwrite pgm pgmraw pgnm pgnmraw
png16 png16m png256 png48 pngalpha
pnggray pngmono pnm pnmraw ppm ppmraw [....] tiff12nc tiff24nc tiff32nc tiff48nc
tiff64nc tiffcrle tiffg3 tiffg32d tiffg4 tiffgray tifflzw tiffpack
tiffscaled tiffscaled24 tiffscaled8 tiffsep tiffsep1 [....]
So, to create a series of PNGs from the multipage PDF my_pdf.pdf with a certain image size (I chose DIN A4 paper format at 72 dpi) and resolution, use the pngalpha device. Try this command:
gswin32c ^
-o my_pdf_page_%03d.png ^
-sDEVICE=pngalpha ^
-dPDFFitPage ^
-g595x842 ^
-r72x72 ^
my_pdf.pdf
I need to convert a PDF file to images. For testing purposes I used "Total PDF Converter", which offers a command line, but it's shareware and I need to find a free alternative.
Does anyone know of such a tool, or maybe even a free C# library?
The convert tool (or magick since version 7) from the ImageMagick bundle can do this (and a whole lot more).
In its simplest form, it's just
convert myfile.pdf myfile.png
or
magick myfile.pdf myfile.png
As a Ghostscript answer is missing and there is no hint for multipage PDF export yet, I think adding another variant is OK.
gs -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -dUseCropBox -sOutputFile=item-%03d.png examples.pdf
Options description:
-dBATCH and -dNOPAUSE just tell gs to run in batch mode, which more or less means it will not ask any questions. These parameters are also important if you want to run the command in a bash script.
-sDEVICE tells gs what output format to produce. pnggray is for grayscale, png16m for 24-bit RGB color. If you insist on creating JPEGs, use -sDEVICE=jpeg to produce color JPEG files, and use the -dJPEGQ=N parameter (N is an integer from 0 to 100, default 75) to control the JPEG quality.
-r300 sets the scan resolution to 300 dpi. If you prefer smaller output sizes, use -r70; if your input PDF has a high resolution, use -r600. If you have a PDF at 300 dpi and specify -r600, your images will be upscaled.
-dUseCropBox tells gs to use the CropBox if one is defined. A CropBox specifies an area of interest on a page. If you have a PDF with a large white margin and you don't want this margin in your output, this option might help.
-sOutputFile defines the name(s) of the output file. The %03d.png part tells gs to include a counter for multiple files; a two-page PDF would result in two files named item-001.png and item-002.png.
The last (unnamed) parameter is the input file.
Availability:
The convert command of ImageMagick uses the gs command internally. If you can convert a PDF with ImageMagick, you already have gs installed.
Install ghostscript:
RHEL:
yum install ghostscript
SLES:
zypper install ghostscript
Debian/Ubuntu:
sudo apt-get install ghostscript
Windows:
You can find Windows binaries under http://www.ghostscript.com/download/gsdnld.html
I have found this solution, which worked for me: https://github.com/jhabjan/Ghostscript.NET. It is also available as a NuGet download.
Here is the sample code for converting all PDF pages into PNG images:
// Requires the Ghostscript.NET and Ghostscript.NET.Rasterizer namespaces,
// plus System.Drawing and System.Drawing.Imaging for Image/ImageFormat.
private static void Test()
{
    var localGhostscriptDll = Path.Combine(Environment.CurrentDirectory, "gsdll64.dll");
    var localDllInfo = new GhostscriptVersionInfo(localGhostscriptDll);

    int desired_x_dpi = 96;
    int desired_y_dpi = 96;

    string inputPdfPath = "test.pdf";
    string outputPath = Environment.CurrentDirectory;

    GhostscriptRasterizer _rasterizer = new GhostscriptRasterizer();
    _rasterizer.Open(inputPdfPath, localDllInfo, false);

    for (int pageNumber = 1; pageNumber <= _rasterizer.PageCount; pageNumber++)
    {
        string pageFilePath = Path.Combine(outputPath, "Page-" + pageNumber.ToString() + ".png");

        Image img = _rasterizer.GetPage(desired_x_dpi, desired_y_dpi, pageNumber);
        img.Save(pageFilePath, ImageFormat.Png);
    }

    _rasterizer.Close();
}
The @Thomas answer didn't work in my case. I guess it works only if you have images in your PDF.
In my case, what worked was pdftoppm (source: https://askubuntu.com/a/50180/37527):
pdftoppm input.pdf outputname -png
This will output each page in the PDF using the format outputname-01.png, with 01 being the index of the page.
Converting a single page of the PDF
pdftoppm input.pdf outputname -png -f {page} -singlefile
Change {page} to the page number. It's indexed at 1, so -f 1 would be the first page.
Specifying the converted image's resolution
The default resolution for this command is 150 DPI. Increasing it will result in both a larger file size and more detail.
To increase the resolution of the converted PDF, add the options -rx {resolution} and -ry {resolution}. For example:
pdftoppm input.pdf outputname -png -rx 300 -ry 300
You may want to check out this free solution:
http://www.codeproject.com/Articles/32274/How-To-Convert-PDF-to-Image-Using-Ghostscript-API
It easily converts PDF to images (a single file or multiple files), is open source, and uses Ghostscript (a free download).
Example of its use:
converter = new PDFConverter();
converter.JPEGQuality = 90;
converter.OutputFormat = "jpg";
string output = "output.jpg";
converter.Convert("input.pdf", output);
You should use iTextSharp. It's a port of an open-source Java project for manipulating PDFs.
http://sourceforge.net/projects/itextsharp/
The 2JPEG command-line tool can do it, for example:
2jpeg.exe -src "C:\In\*.pdf" -dst "C:\Out"