How to Convert PDF File into Image File using ImageMagick

How should I convert a PDF file into an image (.jpg, .gif, etc.) using ImageMagick in C#? Or is there a third-party library other than ImageMagick that can do this?

Ghostscript can read PDF (as well as PostScript and EPS) and convert it to many different image formats.
By the way, ImageMagick cannot do that conversion by itself: it calls Ghostscript as an external 'delegate' for exactly this job. Where ImageMagick really excels is in further processing and manipulating the resulting image files.
The command gs -h (or on Windows: gswin32c.exe -h) gives you an overview of the devices built into your Ghostscript:
GPL Ghostscript GIT PRERELEASE 9.05 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
Usage: gs [switches] [file1.ps file2.ps ...]
Most frequently used switches: (you can use # in place of =)
-dNOPAUSE no pause after page | -q `quiet', fewer messages
-g<width>x<height> page size in pixels | -r<res> pixels/inch resolution
-sDEVICE=<devname> select device | -dBATCH exit after last file
-sOutputFile=<file> select output file: - for stdout, |command for pipe,
embed %d or %ld for page #
Input formats: PostScript PostScriptLevel1 PostScriptLevel2 PostScriptLevel3 PDF
Default output device: x11alpha
Available devices:
alc1900 [....] bmp16 bmp16m [...]
bmp256 bmp32b bmpgray bmpmono bmpsep1 bmpsep8 [....] jpeg jpegcmyk jpeggray
pamcmyk32 pamcmyk4 pbm pbmraw pcl3 pcx16 pcx24b [....]
pcx256 pcx2up pcxcmyk pcxgray pcxmono pdfwrite pgm pgmraw pgnm pgnmraw
png16 png16m png256 png48 pngalpha
pnggray pngmono pnm pnmraw ppm ppmraw [....] tiff12nc tiff24nc tiff32nc tiff48nc
tiff64nc tiffcrle tiffg3 tiffg32d tiffg4 tiffgray tifflzw tiffpack
tiffscaled tiffscaled24 tiffscaled8 tiffsep tiffsep1 [....]
So, to create a series of PNGs from the multipage PDF my_pdf.pdf with a fixed image size and resolution (I chose DIN A4 at 72 dpi, which is 595x842 pixels), use the pngalpha device. Try this command:
gswin32c ^
-o my_pdf_page_%03d.png ^
-sDEVICE=pngalpha ^
-dPDFFitPage ^
-g595x842 ^
-r72x72 ^
my_pdf.pdf
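If you would rather launch this from a program than type it, the same call can be assembled as an argument list. Below is a minimal sketch in Python (used here only for illustration; the same idea carries over to starting a process from C#). The function names and default values are my own, not part of Ghostscript; only the switches come from the command above.

```python
import subprocess


def build_gs_png_command(pdf_path, out_pattern="my_pdf_page_%03d.png",
                         width_px=595, height_px=842, dpi=72,
                         gs_exe="gswin32c"):
    """Return the argument list for the Ghostscript call shown above.

    gs_exe is an assumption: use "gs" on Linux/macOS or the full path
    to gswin32c.exe / gswin64c.exe on Windows.
    """
    return [
        gs_exe,
        "-o", out_pattern,               # output file pattern, %03d = page number
        "-sDEVICE=pngalpha",             # PNG with alpha channel
        "-dPDFFitPage",                  # scale each page to the requested size
        f"-g{width_px}x{height_px}",     # page size in pixels
        f"-r{dpi}x{dpi}",                # resolution in pixels per inch
        pdf_path,
    ]


def convert_pdf_to_pngs(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_gs_png_command(pdf_path, **options), check=True)
```

Passing a list avoids shell quoting problems. If you put the original command into a .bat file instead, the %03d has to be written as %%03d.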

Related

Print each Line of a PDF File with c#

I have a PDF file stored at a server URL, and I want to read each line of the file.
Later I want to export it to an Excel file, so I need to get the lines one by one.
Here is my code. Note: the URL of the PDF stops working after 3 hours; I will keep updating it in the comments. Thanks.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        var pdfUrl = "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=4baa2515293382eb41b2a95e121550490b5b154f1c4c06e8b0469eff082311e6&hash=3112f8451af24a1a5c3e69afab09f079&termosPesquisados=";
        var client = new HttpClient();
        var response = await client.GetAsync(pdfUrl);
        using (var stream = await response.Content.ReadAsStreamAsync())
        {
            Console.WriteLine("print each line of my pdf file");
        }
    }
}

Well, extracting text from a PDF is not a trivial task. If you need a truly generic solution that works with any PDF, the state-of-the-art approach is an AI-based API, such as those offered by cloud platforms like Google, AWS or Azure:
https://cloud.google.com/vision/docs/pdf
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automatically-extract-content-from-pdf-files-using-amazon-textract.html
So: read the PDF as bytes, send the bytes to the external AI-based API, and receive the parsed content back.
Of course, you will need some setup to use the cloud services mentioned above, and they cost money.

The best way I can explain why you need a PDF text extractor like pdftotext is this: the first line, once decoded by an app (this is not the raw byte stream), comes in three separate parts. Luckily they are whole-word strings (they need not be), and luckily in this case they all use the same ASCII font table.
BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET
Once converted to ASCII, we can easily see that all three parts sit at the same height, 793.70, so a library can assume they form one line with three different horizontal offsets. That is why you need a third-party library to decode the fragments and reassemble them into a single line string. That requires first saving the PDF as a file, parsing the whole file from its mix of common encodings such as ASCII, hex and UTF-16 (there is generally no UTF-8), and saving the result as a plain text file with UTF-8 encoding. Then you can extract the UTF-8 lines as required.
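To make the "same height means same line" idea concrete, here is a small Python sketch that regroups the decoded fragments shown above. It is only an illustration under strong assumptions: it handles just this exact `BT x y Td (text) Tj ET` pattern, and it joins fragments with a single space, whereas real extractors such as pdftotext use font metrics and gaps to decide on spacing. The names FRAGMENT and rebuild_lines are my own.

```python
import re
from collections import defaultdict

# Matches only simple fragments of the form: BT <x> <y> Td (<text>) Tj ET
FRAGMENT = re.compile(r"BT\s+([\d.]+)\s+([\d.]+)\s+Td\s+\((.*?)\)\s+Tj\s+ET")


def rebuild_lines(content_stream):
    """Group text fragments by their y coordinate and join them left to right."""
    by_y = defaultdict(list)
    for x, y, text in FRAGMENT.findall(content_stream):
        by_y[float(y)].append((float(x), text))

    lines = []
    # PDF coordinates start at the bottom of the page, so a larger y is higher up.
    for y in sorted(by_y, reverse=True):
        fragments = sorted(by_y[y])  # left to right by x offset
        lines.append(" ".join(text.strip() for _, text in fragments))
    return lines


sample = """BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET"""

print(rebuild_lines(sample))
# prints ['Espelho de Valores Atualizados. Data: 05/07/2021']
```

Anything beyond this toy case (hex strings, other text operators, kerning arrays, compressed streams) is exactly why a dedicated library or tool is the practical choice.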
It is unclear what line format you are hoping for, since a PDF does not have numbered lines. However, if we assign numbers to the lines with text (and some without), based on the human notion of layout, we can get there with a few commands using the Poppler utilities and native OS text tools. Here Cme.bat could use loops and arguments, but it is hardcoded for demonstration. Note that the console output would need the right local code page (chcp), but the text file is fine.
Poppler\poppler-22.04.0\Library\bin>Cme.bat |more
#curl -o brtemp.pdf "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=c6c5f83e942a3ee021a874f6287505c1cb484235935ff1305c6081893e3481b1&hash=922cacb9024f200d13d3f819e2e906f4&termosPesquisados="
#pdftotext -f 1 -l 1 -nopgbrk -layout -enc UTF-8 brtemp.pdf page1.txt
#pdftotext -f 2 -l 2 -nopgbrk -layout -enc UTF-8 brtemp.pdf page2.txt
#find /N /V "Never2BFound" page1.txt
#find /N /V "Never2BFound" page2.txt
responds
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3749 100 3749 0 0 4051 0 --:--:-- --:--:-- --:--:-- 4052
---------- PAGE1.TXT
[1]Espelho de Valores Atualizados. Data: 05/07/2021
[2]
Page 1.txt
Espelho de Valores Atualizados. Data: 05/07/2021
PROCESSO : 5018290-57.2021.4.04.9388
ORIGINÁRIO : 5002262-05.2018.4.04.7000/PR
TIPO : Precatório
REQUERENTE : ERCILIA GRACIE RIBEIRO
ADVOGADO : ANA PAULA HORIGUCHI - PR064269
REQUERIDO : INSTITUTO NACIONAL DO SEGURO SOCIAL - INSS
PROCURADOR : PROCURADORIA REGIONAL FEDERAL DA 4 REGIÃO - PRF4
DEPRECANTE : Juízo Substituto da 10ª VF de Curitiba
etc.....
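The same two steps (run pdftotext on one page, then number the resulting lines the way find /N does) could be sketched in Python as below. This assumes pdftotext from Poppler is on the PATH; the function names are hypothetical, and only the switches are taken from the batch file above.

```python
import subprocess


def build_pdftotext_command(pdf_path, page, txt_path):
    """Argument list for extracting a single page with layout preserved."""
    return ["pdftotext", "-f", str(page), "-l", str(page),
            "-nopgbrk", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def number_lines(text):
    """Prefix every line, including empty ones, with [n] like find /N does."""
    return [f"[{n}]{line}" for n, line in enumerate(text.splitlines(), start=1)]


def extract_numbered_page(pdf_path, page, txt_path):
    """Run pdftotext for one page and return its numbered lines."""
    subprocess.run(build_pdftotext_command(pdf_path, page, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return number_lines(handle.read())
```

For example, extract_numbered_page("brtemp.pdf", 1, "page1.txt") would return the lines of page 1, ready to be written into spreadsheet rows.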

Printer settings into PostScript or PCL file

I need:
To print a large number of PDFs in duplex, using a specific printer feeder.
I have:
Printing via Ghostscript with the 'mswinpr2' device:
using (GhostscriptProcessor processor = new GhostscriptProcessor(new GhostscriptVersionInfo("gsdll32.dll")))
{
    List<string> switches = new List<string>();
    switches.Add("-dPrinted");
    switches.Add("-dBATCH");
    switches.Add("-dNOPAUSE");
    switches.Add("-dNumCopies=1");
    switches.Add("-dPDFFitPage");
    switches.Add("-dFIXEDMEDIA");
    switches.Add("-dNoCancel");
    // No spaces around '=': Ghostscript expects the form -sNAME=value
    switches.Add("-sFONTPATH=C:\\Windows\\Fonts");
    switches.Add("-sDEVICE=mswinpr2");
    switches.Add($"-sOutputFile=%printer%{settings.PrinterName}");
    switches.Add("D:\\11.pdf");
    processor.StartProcessing(switches.ToArray(), null);
}
Problem:
A single job of 2 pages takes more than 50 MB in the print queue, and I have more than 1,500 PDFs with 1,000,000 pages.
What I plan to do:
Convert the PDFs to PCL or PS, edit those files to somehow pass the settings (duplex and a specific feeder), and then send the edited PCL or PS file to the printer as RAW data.
Question:
How can I pass the settings in PCL or PS?

Since PDF files can't contain device-specific information, you clearly don't need to pick such information from the input, which makes life simpler.
Ghostscript's ps2write device is capable of inserting document wide or page specific PostScript into its output. So you can 'pass the settings' using that.
For PCL you (probably) need to write some device-specific PJL and insert that into the PCL output. However, PCL is nowhere near as uniform as PostScript, so it'll be up to you to find out what needs to be prefixed to the file.
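As a rough illustration of what "prefixing PJL" can look like, here is a Python sketch that wraps an existing PCL byte stream in a PJL job header. The UEL sequence and the SET DUPLEX / ENTER LANGUAGE commands are standard PJL, but whether your printer honours them, and what the command for your feeder is, is device-specific: check the printer's PJL/PCL reference. The function name and the extra_commands hook are my own invention.

```python
UEL = b"\x1b%-12345X"  # Universal Exit Language sequence that brackets a PJL job


def wrap_pcl_with_pjl(pcl_data, duplex=True, extra_commands=()):
    """Return pcl_data preceded by a PJL header and followed by a closing UEL.

    extra_commands is a hypothetical hook for device-specific lines
    (for example a tray selection command from your printer's manual).
    """
    header = [UEL + b"@PJL"]  # the first @PJL must follow the UEL directly
    header.append(b"@PJL SET DUPLEX=" + (b"ON" if duplex else b"OFF"))
    header.extend(command.encode("ascii") for command in extra_commands)
    header.append(b"@PJL ENTER LANGUAGE=PCL")
    return b"\r\n".join(header) + b"\r\n" + pcl_data + UEL
```

The wrapped bytes are what you would then send to the printer as RAW data.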
[EDIT]
You don't use -sPSDocOptions; PSDocOptions is a distiller parameter, so you need:
gswin64c.exe -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=D:\out.ps -c "<</PSDocOptions (<</Duplex true /NumCopies 10>> setpagedevice)>> setdistillerparams" -f D:\0.pdf
Notice that you don't need -f (as you have in your command line) unless you have first set -c. The -f switch is used as a terminator for the -c.
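With more than 1500 files you will want to generate that command per PDF. A minimal Python sketch of assembling it is below; the function name and parameters are assumptions of mine, while the switches and the PSDocOptions string are the ones from the command above. Feeder selection is left out on purpose, since the right setpagedevice key and value depend on the printer.

```python
import subprocess


def build_ps2write_command(pdf_path, ps_path, duplex=True, num_copies=1,
                           gs_exe="gswin64c.exe"):
    """Argument list that converts a PDF to PostScript with page device settings."""
    page_device = (f"<</Duplex {'true' if duplex else 'false'} "
                   f"/NumCopies {num_copies}>> setpagedevice")
    distiller = f"<</PSDocOptions ({page_device})>> setdistillerparams"
    return [
        gs_exe, "-q", "-dSAFER", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=ps2write",
        f"-sOutputFile={ps_path}",
        "-c", distiller,   # PostScript run before the input file
        "-f", pdf_path,    # -f terminates the -c block
    ]


def convert_to_ps(pdf_path, ps_path, **options):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_ps2write_command(pdf_path, ps_path, **options), check=True)
```

Because the arguments are passed as a list, the PostScript string needs no extra shell quoting.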

Identify This Compressed Bytestream

I have a file format generated by a legacy (10+ years old) enterprise C# application. It was almost certainly compressed with some form of zlib, and it was found in a zipped wrapper package much like .docx. The files are named something.xmlzip, but I have not found a way to decompress the stream with zip/gzip-type tools, or with Python's deflate/gzip methods while trying to work around the lack of any stream headers. The contents are certainly an XML document.
The main identifying characteristics of the data are a consistent header/magic number and trailer:
$ xxd thedoc | head -1
00000000: 0404 0a04 3ae4 706e 03c4 0585 3a1b 3a0c ....:.pn....:.:.
$ xxd thedoc | tail -n 2
00003320: 1c0c 6d8d 6d7d 6458 0bfe 61d7 7a5d 7c38 ..m.m}dX..a.z]|8
00003330: 338b 2640 fffe fffe 3.&@....
The 0404 0a04 header and fffe fffe trailer appear in every file. How might I inflate these files?
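For reference, the kind of header-skipping attempt described above can be sketched in Python like this: try raw deflate, zlib and gzip framing at every offset near the start of the file and keep whatever decodes. This is only a way to structure the experiment, not a claim about the actual format; the function name and thresholds are arbitrary, and raw deflate can "succeed" on junk, so hits need to be inspected by eye.

```python
import zlib


def scan_for_deflate(data, max_offset=64, min_output=16):
    """Try to inflate data starting at each of the first max_offset bytes.

    Returns a list of (offset, wbits, output) for every attempt that
    produced at least min_output bytes without raising zlib.error.
    wbits: -15 = raw deflate, 15 = zlib header, 31 = gzip header.
    """
    hits = []
    for offset in range(min(max_offset, len(data))):
        for wbits in (-15, 15, 31):
            try:
                output = zlib.decompressobj(wbits).decompress(data[offset:])
            except zlib.error:
                continue
            if len(output) >= min_output:
                hits.append((offset, wbits, output))
    return hits
```

A typical use would be to read thedoc in binary mode, run the scan, and print the first 80 bytes of each hit to see whether anything looks like XML.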

Convert PDF to JPG or PNG using C# or Command Line [closed]

I need to convert a PDF file to images. For testing purposes I used "Total PDF Converter", which offers a command line, but it's shareware and I need to find a free alternative.
Does anyone know of such a tool, or maybe even a free C# library?

The convert tool (or magick since version 7) from the ImageMagick bundle can do this (and a whole lot more).
In its simplest form, it's just
convert myfile.pdf myfile.png
or
magick myfile.pdf myfile.png

Since a Ghostscript answer is missing and nothing here covers multipage PDF export yet, I think adding another variant is OK.
gs -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -dUseCropBox -sOutputFile=item-%03d.png examples.pdf
Options description:
-dBATCH and -dNOPAUSE tell gs to run in batch mode, which more or less
means it will not ask any questions. These parameters are also
important if you want to run the command in a bash script.
-sDEVICE tells gs what output format to produce. pnggray is for
grayscale, png16m for 24-bit RGB color. If you insist on creating
JPEGs, use -sDEVICE=jpeg to produce color JPEG files. Use the -dJPEGQ=N
parameter (N is an integer from 0 to 100, default 75) to control the JPEG quality.
-r300 sets the rendering resolution to 300 dpi. If you prefer smaller
output files use -r70, or if your input PDF has a high resolution use
-r600. If you have a PDF with 300 dpi content and specify -r600, your images will be upscaled.
-dUseCropBox tells gs to use the CropBox if one is defined. A CropBox
specifies an area of interest on a page. If you have a PDF with a
large white margin and you don't want this margin in your output, this
option might help.
-sOutputFile defines the name(s) of the output file. The %03d part
tells gs to include a page counter for multiple files. A two-page PDF
would result in two files named item-001.png and item-002.png.
The last (unnamed) parameter is the input file.
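To get a feel for what -r does to the output size: page sizes in PDF are given in points (1/72 inch), so the pixel dimensions are roughly points × dpi / 72. A small Python sketch of that arithmetic follows; the function name is mine, and Ghostscript's own rounding may differ by a pixel.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Approximate raster size in pixels for a page given in points."""
    return (round(width_pt * dpi / 72), round(height_pt * dpi / 72))


# An A4 page is about 595 x 842 points.
print(pixel_size(595, 842, 72))   # prints (595, 842)
print(pixel_size(595, 842, 300))  # prints (2479, 3508)
```

So going from -r72 to -r300 multiplies the pixel count by about 17, which is worth keeping in mind for file size and memory.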
Availability:
ImageMagick's convert command uses the gs command internally. If you can convert a PDF with ImageMagick, you already have gs installed.
Install Ghostscript:
RHEL:
yum install ghostscript
SLES:
zypper install ghostscript
Debian/Ubuntu:
sudo apt-get install ghostscript
Windows:
You can find Windows binaries under http://www.ghostscript.com/download/gsdnld.html

I found this solution, which worked for me: https://github.com/jhabjan/Ghostscript.NET. It is also available as a NuGet package.
Here is sample code for converting all PDF pages into PNG images:
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using Ghostscript.NET;
using Ghostscript.NET.Rasterizer;

private static void Test()
{
    // Path to the native Ghostscript DLL shipped next to the application
    var localGhostscriptDll = Path.Combine(Environment.CurrentDirectory, "gsdll64.dll");
    var localDllInfo = new GhostscriptVersionInfo(localGhostscriptDll);
    int desired_x_dpi = 96;
    int desired_y_dpi = 96;
    string inputPdfPath = "test.pdf";
    string outputPath = Environment.CurrentDirectory;
    GhostscriptRasterizer _rasterizer = new GhostscriptRasterizer();
    _rasterizer.Open(inputPdfPath, localDllInfo, false);
    // Page numbers are 1-based
    for (int pageNumber = 1; pageNumber <= _rasterizer.PageCount; pageNumber++)
    {
        string pageFilePath = Path.Combine(outputPath, "Page-" + pageNumber.ToString() + ".png");
        Image img = _rasterizer.GetPage(desired_x_dpi, desired_y_dpi, pageNumber);
        img.Save(pageFilePath, ImageFormat.Png);
    }
    _rasterizer.Close();
}

@Thomas's answer didn't work in my case.
I guess it works only if you have images in your PDF.
In my case what worked was pdftoppm (source from https://askubuntu.com/a/50180/37527):
pdftoppm input.pdf outputname -png
This will output each page in the PDF using the format outputname-01.png, with 01 being the index of the page.
Converting a single page of the PDF
pdftoppm input.pdf outputname -png -f {page} -singlefile
Change {page} to the page number. It's indexed at 1, so -f 1 would be the first page.
Specifying the converted image's resolution
The default resolution for this command is 150 DPI. Increasing it will result in both a larger file size and more detail.
To increase the resolution of the converted PDF, add the options -rx {resolution} and -ry {resolution}. For example:
pdftoppm input.pdf outputname -png -rx 300 -ry 300
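If you need to call this from code, the three variants above can be folded into one helper. This is a Python sketch with a made-up function name; it only uses the pdftoppm options already shown (-png, -f, -singlefile, -rx, -ry) and assumes pdftoppm is on the PATH.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, page=None, dpi=None):
    """Argument list for pdftoppm; page and dpi are optional."""
    cmd = ["pdftoppm", pdf_path, output_prefix, "-png"]
    if page is not None:
        # Convert only this page (1-based) and write a single file
        cmd += ["-f", str(page), "-singlefile"]
    if dpi is not None:
        # Same resolution horizontally and vertically
        cmd += ["-rx", str(dpi), "-ry", str(dpi)]
    return cmd


def pdf_to_png(pdf_path, output_prefix, **options):
    """Run pdftoppm; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, **options), check=True)
```

For example, pdf_to_png("input.pdf", "outputname", page=1, dpi=300) would write only the first page at 300 DPI.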

You may want to check this free solution
http://www.codeproject.com/Articles/32274/How-To-Convert-PDF-to-Image-Using-Ghostscript-API
It easily converts PDF to images (single file or multiple files),
is open source, and uses Ghostscript (free download).
Example of its use:
converter = new PDFConverter();
converter.JPEGQuality = 90;
converter.OutputFormat = "jpg";
string output = "output.jpg";
converter.Convert("input.pdf", output);

You should use iTextSharp. It's a port of an open source Java project for manipulating PDFs.
http://sourceforge.net/projects/itextsharp/

The 2JPEG command-line tool can do it, for example:
2jpeg.exe -src "C:\In\*.pdf" -dst "C:\Out"

convert video files to flv format in c#

How do I get a video file from the user and convert it into FLV format?

One way to get multimedia support that converts virtually ANYTHING you throw at it is to use FFmpeg and MEncoder, which together give you quite good input/output support:
Input image sequence: jpg, pgm, png, ppm (filenames must be purely sequential numbers,
all in the same format). Example: 0001.jpg --up to--> 0999.jpg
Note that this format will NOT work: name_0001.jpg --up to--> name_0999.jpg (take out name_)
Input video formats: 3gp, 3g2, amv, asf, avi, dat, dvr-ms, fli, flc, flv, m2ts, mpg, mkv, mov,
m4v, mp4, nsv, ogm, qt, rm(vb), str, swf, ts, trp, ty, ty+, tmf, viv, vob, wmv ..
Input audio formats: aac, ac3, amr, flac, mmf, m4a, mp2, mp3, mpc, ogg, ra, wav, wma ..
Input AviSynth script files: avs. This lets you write a script and specify advanced encoding commands using AviSynth!
(taken from SUPER's website)
NOTE: it supports MUCH more than "just" that. Those are just some of the capabilities!! :D

Here is some example code and a C#/.NET library that will convert audio to video, and vice versa using an FFMpeg wrapper. It's explained here:
http://ivolo.mit.edu/post/Convert-Audio-Video-to-Any-Format-using-C.aspx
