I have a PDF file stored at a server URL, and I want to get each line of the file.
I want to export it to an Excel file later, so I need to get every line, one by one.
I will put the code here. Note: the URL of the PDF stops working after 3 hours; I will keep updating it here in the comments. Thanks.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        var pdfUrl = "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=4baa2515293382eb41b2a95e121550490b5b154f1c4c06e8b0469eff082311e6&hash=3112f8451af24a1a5c3e69afab09f079&termosPesquisados=";
        var client = new HttpClient();
        var response = await client.GetAsync(pdfUrl);
        using (var stream = await response.Content.ReadAsStreamAsync())
        {
            Console.WriteLine("print each line of my pdf file");
        }
    }
}
Well, extracting text from a PDF is not an ordinary task. If you need a truly generic solution that works with any PDF, the state-of-the-art approach is to use an AI-based API provided by a cloud platform such as Google, AWS or Azure:
https://cloud.google.com/vision/docs/pdf
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/
https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automatically-extract-content-from-pdf-files-using-amazon-textract.html
So: read the PDF as bytes, send the bytes to the external AI-based API, and receive the parsed content back.
Of course, you will need to do some preparation to use the cloud services mentioned above, and they also cost some money.
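As a minimal sketch of that read-bytes-then-POST flow: the endpoint URL, the API key and the `Ocp-Apim-Subscription-Key` header below are illustrative assumptions (that header is Azure-style; Google and AWS authenticate differently), so substitute your provider's real endpoint and auth scheme.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class OcrUpload
{
    // Builds the upload request only; endpoint and apiKey are placeholders, not a real service.
    public static HttpRequestMessage BuildOcrRequest(byte[] pdfBytes, string endpoint, string apiKey)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, endpoint)
        {
            Content = new ByteArrayContent(pdfBytes)
        };
        request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
        // Azure-style auth header; other clouds use OAuth bearer tokens or request signing.
        request.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);
        return request;
    }

    public static async Task<string> SendAsync(HttpClient client, byte[] pdfBytes, string endpoint, string apiKey)
    {
        var response = await client.SendAsync(BuildOcrRequest(pdfBytes, endpoint, apiKey));
        response.EnsureSuccessStatusCode();
        // The body would be provider-specific JSON containing the parsed text.
        return await response.Content.ReadAsStringAsync();
    }
}
```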
The best way to explain why you need a PDF text extractor like pdftotext is this: the first line, when decoded by an app (this is not the raw byte stream), comes in three separate parts. Luckily they arrive as whole word strings (they do not need to), and luckily, in this case, from the same ASCII font table.
BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET
When converted to ASCII we can easily see that all three parts sit at level 793.70, so a library can assume they form one line with three different offsets. Hence you need a third-party library to decode and reassemble the text as if it were a single line string. That requires first saving the PDF as a file, parsing the whole file into several common encodings such as ASCII, hex and mixed UTF-16 (there is generally no UTF-8), then saving those as a plain-text file with UTF-8 encoding. Then you can extract the UTF-8 lines as required.
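That reassembly step can be sketched for this simplified, uncompressed case: match each `BT x y Td (text) Tj ET` chunk, group chunks by their y coordinate (same baseline means same visual line), and order each group by x. This is only a toy for content streams that look exactly like the fragment above; real PDFs compress their streams and use many more operators, which is why a real library or pdftotext is needed.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineAssembler
{
    // Matches the simplified "BT <x> <y> Td (<text>) Tj ET" form shown above.
    static readonly Regex TextOp =
        new Regex(@"BT\s+([\d.]+)\s+([\d.]+)\s+Td\s+\((.*?)\)\s+Tj\s+ET");

    public static List<string> AssembleLines(string content)
    {
        return TextOp.Matches(content).Cast<Match>()
            .Select(m => new
            {
                X = double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture),
                Y = double.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture),
                Text = m.Groups[3].Value.TrimEnd()
            })
            .GroupBy(t => t.Y)               // same baseline => same visual line
            .OrderByDescending(g => g.Key)   // PDF y grows upward, so top of page first
            .Select(g => string.Join(" ", g.OrderBy(t => t.X).Select(t => t.Text)))
            .ToList();
    }
}
```

Feeding it the five operators from the fragment above yields the single line "Espelho de Valores Atualizados. Data: 05/07/2021".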
It is unclear what line-output format you are hoping for, since a PDF does not have numbered lines. However, if we allocate numbers to lines with text (and some without) based on the human concept of layout, we can run a few commands using Poppler utils and native OS text parsing. Here Cme.bat could take loops and arguments, but it is hardcoded for demonstration. Note that the console output would need a local chcp, but the text file is good.
Poppler\poppler-22.04.0\Library\bin>Cme.bat |more
#curl -o brtemp.pdf "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=c6c5f83e942a3ee021a874f6287505c1cb484235935ff1305c6081893e3481b1&hash=922cacb9024f200d13d3f819e2e906f4&termosPesquisados="
#pdftotext -f 1 -l 1 -nopgbrk -layout -enc UTF-8 brtemp.pdf page1.txt
#pdftotext -f 2 -l 2 -nopgbrk -layout -enc UTF-8 brtemp.pdf page2.txt
#find /N /V "Never2BFound" page1.txt
#find /N /V "Never2BFound" page2.txt
which responds:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3749 100 3749 0 0 4051 0 --:--:-- --:--:-- --:--:-- 4052
---------- PAGE1.TXT
[1]Espelho de Valores Atualizados. Data: 05/07/2021
[2]
Page 1.txt
Espelho de Valores Atualizados. Data: 05/07/2021
PROCESSO : 5018290-57.2021.4.04.9388
ORIGINÁRIO : 5002262-05.2018.4.04.7000/PR
TIPO : Precatório
REQUERENTE : ERCILIA GRACIE RIBEIRO
ADVOGADO : ANA PAULA HORIGUCHI - PR064269
REQUERIDO : INSTITUTO NACIONAL DO SEGURO SOCIAL - INSS
PROCURADOR : PROCURADORIA REGIONAL FEDERAL DA 4 REGIÃO - PRF4
DEPRECANTE : Juízo Substituto da 10ª VF de Curitiba
etc.....
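The same pipeline can be driven from C#. A sketch, assuming pdftotext (from Poppler) is on the PATH; the argument string mirrors the batch commands above, and the resulting text file gives exactly the one-string-per-line output the question asks for:

```csharp
using System.Diagnostics;
using System.IO;

public static class PdfToText
{
    // Mirrors: pdftotext -f N -l N -nopgbrk -layout -enc UTF-8 in.pdf out.txt
    public static string BuildArgs(string pdfPath, string txtPath, int page) =>
        $"-f {page} -l {page} -nopgbrk -layout -enc UTF-8 \"{pdfPath}\" \"{txtPath}\"";

    public static string[] ExtractPageLines(string pdfPath, int page)
    {
        var txtPath = Path.Combine(Path.GetTempPath(), $"page{page}.txt");
        var psi = new ProcessStartInfo("pdftotext", BuildArgs(pdfPath, txtPath, page))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var p = Process.Start(psi))
        {
            p.WaitForExit();
        }
        // One array entry per laid-out line, ready for the Excel export step.
        return File.ReadAllLines(txtPath);
    }
}
```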
Is there a possibility to get the names of the spot colors used in a PDF?
I'm using C#. Maybe there is a workaround with Ghostscript?
Color separation.
I searched the Ghostscript docs but didn't find anything. I also tried with iTextSharp.
The output of -dPDFINFO is determined by the file contents, so start with a valid empty file, using the OP's Windows version (Ghostscript 10.0.0, gswin64c):
gswin64c -dPDFINFO blank.pdf -o
should look like this (note this is a console copy):
GPL Ghostscript 10.0.0 (2022-09-21)
Copyright (C) 2022 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
File has 1 page.
Producer: GPL Ghostscript 10.00.0
CreationDate: D:20230115003354Z00'00'
ModDate: D:20230115003354Z00'00'
Processing pages 1 through 1.
Page 1 MediaBox: [0 0 595 842]
C:\Apps\PDF\GS\gs1000w64\bin>
To suppress the copyright banner, use -q.
To save the output to a file, use stderr (level 2) redirection:
gswin64c -q -dBATCH -dPDFINFO blank.pdf 2>out.txt
To filter the output of the text file, use pipe filters.
Does it have spot colours
What are they
As long as no open standard for spot colours exists, TCPDF users will have to buy a colour book by one of the colour manufacturers and insert the values and names of spot colours directly
So here the names are on a RGB scale
- Dark Green is 0,71,57
- Light Yellow is 255,246,142
- Black is 39,36,37
- Red is 166,40,52
- Green is 0,132,75
- Blue is 0,97,157
- Yellow is 255,202,9
But that black is not full black. Is there a better way? Yes, of course:
type example_037.pdf | find /i "/separation"
Now we can see the CMYK spots. In this simplified case the CMYK values after each name are shown, for example Full Black = [0.000000 0.000000 0.000000 1.000000].
Note that the separation data is often compressed inside the PDF, so you need to decompress the file before you can search it. There are several tools to do the decompression; common cross-platform ones are qpdf (FOSS), mutool (companion to Ghostscript) and PDFtk, amongst others.
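Driving that decompression from C# could look like the sketch below, assuming qpdf is on the PATH. qpdf's --qdf mode with --object-streams=disable rewrites the file with uncompressed, human-readable streams, so the output can then be searched for /Separation with plain text tools:

```csharp
using System;
using System.Diagnostics;

public static class PdfDecompress
{
    // Mirrors: qpdf --qdf --object-streams=disable in.pdf out.pdf
    public static string BuildArgs(string inPdf, string outPdf) =>
        $"--qdf --object-streams=disable \"{inPdf}\" \"{outPdf}\"";

    public static void Run(string inPdf, string outPdf)
    {
        var psi = new ProcessStartInfo("qpdf", BuildArgs(inPdf, outPdf))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var p = Process.Start(psi))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException($"qpdf failed with exit code {p.ExitCode}");
        }
        // outPdf now contains uncompressed streams and can be searched as text.
    }
}
```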
Yes, you can extract the spot color names used in a PDF using Ghostscript. Here is an example of how to do it with C# and Ghostscript:
using System;
using System.Diagnostics;
using System.IO;

namespace GetSpotColorsFromPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            var filePath = @"path\to\pdf";
            var outputFile = @"path\to\output.txt";
            var gsProcess = new Process
            {
                StartInfo = new ProcessStartInfo
                {
                    FileName = @"path\to\gswin64c.exe",
                    Arguments = $"-dNODISPLAY -dDumpSpotColors -sOutputFile={outputFile} {filePath}",
                    RedirectStandardOutput = true,
                    UseShellExecute = false,
                    CreateNoWindow = true,
                }
            };
            gsProcess.Start();
            gsProcess.WaitForExit();
            var output = File.ReadAllText(outputFile);
            Console.WriteLine(output);
        }
    }
}
Note that you need to have Ghostscript installed on your machine and to specify the path to the gswin64c.exe executable in the code. The -dNODISPLAY and -dDumpSpotColors arguments are used to extract the spot color information from the PDF, and the -sOutputFile argument specifies the output file for the extracted information. The extracted information is saved to the specified output file, then read into a string and printed to the console.
cpdf -list-spot-colors in.pdf will list them, one per line, to standard output.
I need:
Print a large number of PDFs in duplex to a specific printer output feeder.
I have:
Printing using Ghostscript with the 'mswinpr2' device:
using (GhostscriptProcessor processor = new GhostscriptProcessor(new GhostscriptVersionInfo("gsdll32.dll")))
{
    List<string> switches = new List<string>();
    switches.Add("-dPrinted");
    switches.Add("-dBATCH");
    switches.Add("-dNOPAUSE");
    switches.Add("-dNumCopies=1");
    switches.Add("-dPDFFitPage");
    switches.Add("-dFIXEDMEDIA");
    switches.Add("-dNoCancel");
    switches.Add("-sFONTPATH=C:\\Windows\\Fonts");
    switches.Add("-sDEVICE=mswinpr2");
    switches.Add($"-sOutputFile=%printer%{settings.PrinterName}");
    switches.Add("D:\\11.pdf");
    processor.StartProcessing(switches.ToArray(), null);
}
Problem:
One job in the print queue consisting of 2 pages takes more than 50 MB, while I have more than 1,500 PDFs with 1,000,000 pages.
What I think to do:
Convert the PDF to PCL or PS, edit these files and somehow pass the settings (duplex and a specific feeder). Then send the edited PCL or PS file as RAW data to the printer.
Question:
How can i pass the settings to PCL or PS?
Since PDF files can't contain device-specific information, you clearly don't need to pick such information up from the input, which makes life simpler.
Ghostscript's ps2write device is capable of inserting document wide or page specific PostScript into its output. So you can 'pass the settings' using that.
For PCL you (probably) need to write some device-specific PJL and insert that into the PCL output. However, PCL is nowhere near as uniform as PostScript; it will be up to you to find out what needs to be prefixed to the file.
[EDIT]
You don't use -sPSDocOptions; PSDocOptions is a distiller param, so you need:
gswin64c.exe -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=D:\out.ps -c "<</PSDocOptions (<</Duplex true /NumCopies 10>> setpagedevice)>> setdistillerparams" -f D:\0.pdf
Notice that you don't need -f (as you have in your command line) unless you have first set -c. The -f switch is used as a terminator for the -c.
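Translated into the question's GhostscriptProcessor switch-list style, that command might be built like this (a sketch; the Duplex and NumCopies values are illustrative, and note that -c and its PostScript payload are separate switches with -f terminating them):

```csharp
using System.Collections.Generic;

public static class Ps2WriteSwitches
{
    // Builds the Ghostscript argument list that converts a PDF to PostScript,
    // injecting document-wide page-device settings via ps2write's PSDocOptions.
    public static List<string> Build(string inputPdf, string outputPs)
    {
        return new List<string>
        {
            "-q",
            "-dSAFER",
            "-dNOPAUSE",
            "-dBATCH",
            "-sDEVICE=ps2write",
            $"-sOutputFile={outputPs}",
            "-c",
            "<</PSDocOptions (<</Duplex true /NumCopies 10>> setpagedevice)>> setdistillerparams",
            "-f",   // terminates the -c block
            inputPdf
        };
    }
}
```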
I have a file format that was generated by an enterprise, legacy (10+ year old) C# application. It was almost certainly compressed by some form of zlib, and was found in a zipped wrapper package, much like .docx. The files are generated and named something.xmlzip, but I have not found a way of decompressing the stream with zip/gzip-type tools, or by using Python's deflate/gzip methods and trying to bypass the lack of any stream headers. The contents are certainly an XML document.
The main identifying characteristics of the data are a consistent header/magic number and trailer:
$ xxd thedoc | head -1
00000000: 0404 0a04 3ae4 706e 03c4 0585 3a1b 3a0c ....:.pn....:.:.
$ xxd thedoc | tail -n 2
00003320: 1c0c 6d8d 6d7d 6458 0bfe 61d7 7a5d 7c38 ..m.m}dX..a.z]|8
00003330: 338b 2640 fffe fffe 3.&#....
The 0404 0a04 header and fffe fffe trailer appear in every file. How might I inflate these files?
Ok, before we start: I work for a company that has a license to redistribute PDF files from various publishers in any media form. So, that being said, the extraction of embedded fonts from the given PDF files is not only legal, but also vital to the presentation.
I am using code found on this site; however, I do not recall the author. When I find it, I will reference them. I have located the stream within the PDF file that contains the embedded fonts, and I have isolated this encoded stream as a string and then as a byte[]. When I use the following code I get an error:
Block length does not match with its complement.
Code (the error occurs in the while line below):
private static byte[] DecodeFlateDecodeData(byte[] data)
{
    MemoryStream outputStream;
    using (outputStream = new MemoryStream())
    {
        using (var compressedDataStream = new MemoryStream(data))
        {
            // Remove the first two bytes to skip the zlib header (it isn't recognized by the DeflateStream class)
            compressedDataStream.ReadByte();
            compressedDataStream.ReadByte();

            var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
            var decompressedBuffer = new byte[compressedDataStream.Length];
            int read;

            // The error occurs in the following line
            while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
            {
                outputStream.Write(decompressedBuffer, 0, read);
            }

            outputStream.Flush();
            compressedDataStream.Close();
        }
        return ReadFully(outputStream);
    }
}
After using the usual tools (Google, Bing, the archives here) I found that the majority of the time this error occurs when one has not consumed the first two bytes of the encoded stream, but that is done here, so I cannot find the source of the error. Below is the encoded stream:
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü
Please help, I am beating my head against the wall here!
NOTE: The stream above is the encoded version of Arial Black - according to the specs inside the PDF:
661 0 obj
<<
/Type /FontDescriptor
/FontFile3 662 0 R
/FontBBox [ -194 -307 1688 1083 ]
/FontName /HLJOBA+ArialBlack
/Flags 4
/StemV 0
/CapHeight 715
/XHeight 518
/Ascent 0
/Descent -209
/ItalicAngle 0
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>>
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >>
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü
Is there a particular reason why you're not using the GetStreamBytes() method that is provided with iText? What about data? Are you sure you are looking at the correct bytes? Did you create the PRStream object correctly and did you get the bytes with PdfReader.GetStreamBytesRaw()? If so, why decode the bytes yourself? Which brings me to my initial counter-question: is there a particular reason why you're not using the GetStreamBytes() method?
Looks like GetStreamBytes() might solve your problem outright, but let me point out that I think you're doing something dangerous concerning end-of-line markers. The PDF specification states in 7.3.8.1 that:
The keyword stream that follows the stream dictionary shall be
followed by an end-of-line marker consisting of either a CARRIAGE
RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE
RETURN alone.
In your code it looks like you always skip two bytes while the spec says it could be either one or two (CR LF or LF).
You should be able to catch whether you are running into this by comparing the exact number of bytes you want to decode with the value of the (Required) "Length" key in the stream dictionary.
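A defensive variant, sketched here, is to check for the two-byte zlib header before skipping rather than skipping unconditionally (0x78 is the usual first byte of a zlib stream, since FlateDecode data is zlib-wrapped deflate):

```csharp
using System.IO;
using System.IO.Compression;

public static class Flate
{
    // Inflates FlateDecode data; skips the 2-byte zlib header only when present,
    // because DeflateStream expects raw deflate data with no header.
    public static byte[] Inflate(byte[] data)
    {
        int offset = (data.Length > 1 && data[0] == 0x78) ? 2 : 0;
        using (var input = new MemoryStream(data, offset, data.Length - offset))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```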
Okay, for anyone who might stumble across this issue themselves, allow me to warn you: this is a rocky road without a great many good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command-line calls to mutool.exe:
mutool extract C:\mypdf.pdf
This pulls all of the fonts into the folder mutool resides in (it also extracts some images; these are the fonts that could not be converted, usually small subsets, I think). I then wrote a method to move those from that folder into the one I wanted them in.
Of course, to convert these to anything usable is a headache in itself - but I have found it to be doable.
As a reminder, font piracy IS piracy.
I must send a font file to my Zebra RW420 printer via Bluetooth. I'm using the Zebra Windows Mobile SDK, but I can't find any way to send the file and store it on the printer. I could do it manually with Label Vista, but it must be done on 200+ printers.
Anyone have any suggestion or know what method from the SDK I could use?
Thanks in advance.
CISDF is the correct answer; it's probably the checksum value that you are computing that is incorrect. I put a port sniffer on my RW420 attached to a USB port and found this to work. I actually sent some PCX images to the printer, then used them in a label later on.
! CISDF
<filename>
<size>
<cksum>
<data>
There is a CRLF at the end of each of the first four lines. Using 0000 as the checksum causes the printer to ignore checksum verification (I found some really obscure references to this in some ZPL manuals, tried it, and it worked). <filename> is the 8.3 name of the file as it will be stored in the file system on the printer, and <size> is the size of the file, 8 characters long and formatted as a hexadecimal number. <cksum> is the two's complement of the sum of the data bytes. <data> is, of course, the contents of the file to be stored on the printer.
Here is the actual C# code that I used to send my sample images to the printer:
// calculate the checksum for the file
// get the sum of all the bytes in the data stream
UInt16 sum = 0;
for (int i = 0; i < Properties.Resources.cmlogo.Length; i++)
{
    sum += Convert.ToUInt16(Properties.Resources.cmlogo[i]);
}

// compute the two's complement of the checksum
sum = (UInt16)~sum;
sum += 1;

// create a new printer connection
MP2Bluetooth bt = new MP2Bluetooth();
bt.ConnectPrinter("<MAC ADDRESS>", "<PIN>");

// write the header and data to the printer
bt.Write("! CISDF\r\n");
bt.Write("cmlogo.pcx\r\n");
bt.Write(String.Format("{0:X8}\r\n", Properties.Resources.cmlogo.Length));
bt.Write(String.Format("{0:X4}\r\n", sum)); // checksum, 0000 => ignore checksum
bt.Write(Properties.Resources.cmlogo);

// gracefully close our connection and disconnect
bt.Close();
bt.DisconnectPrinter();
MP2Bluetooth is a class we use internally to abstract BT connections and communications - you have your own as well, I'm sure!
You can use the SDK to send any kind of data. A Zebra font is just a font file with a header on it. So if you capture the output cpf file from Label Vista, you can send that file from the SDK. Just create a connection, and call write(byte[]) with the contents of the file