Issues Decoding Flate from PDF Embedded Font - c#

Ok, before we start. I work for a company that has a license to redistribute PDF files from various publishers in any media form. So, that being said, the extraction of embedded fonts from the given PDF files is not only legal - but also vital to the presentation.
I am using code found on this site, however I do not recall the author, when I find it I will reference them. I have located the stream within the PDF file that contains the embedded fonts, I have isolated this encoded stream as a string and then into a byte[]. When I use the following code I get an error
Block length does not match with its complement.
Code (the error occurs in the while line below):
private static byte[] DecodeFlateDecodeData(byte[] data)
{
MemoryStream outputStream;
using (outputStream = new MemoryStream())
{
using (var compressedDataStream = new MemoryStream(data))
{
// Remove the first two bytes to skip the header (it isn't recognized by the DeflateStream class)
compressedDataStream.ReadByte();
compressedDataStream.ReadByte();
var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
var decompressedBuffer = new byte[compressedDataStream.Length];
int read;
// The error occurs in the following line
while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
{
outputStream.Write(decompressedBuffer, 0, read);
}
outputStream.Flush();
compressedDataStream.Close();
}
return ReadFully(outputStream);
}
}
After using the usual tools (Google, Bing, archives here) I found that the majority of the time that this occurs is when one has not consumed the first two bytes of the encoding stream - but this is done here so i cannot find the source of this error. Below is the encoded stream:
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅ­P9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü
Please help, I am beating my head against the wall here!
NOTE: The stream above is the encoded version of Arial Black - according to the specs inside the PDF:
661 0 obj
<<
/Type /FontDescriptor
/FontFile3 662 0 R
/FontBBox [ -194 -307 1688 1083 ]
/FontName /HLJOBA+ArialBlack
/Flags 4
/StemV 0
/CapHeight 715
/XHeight 518
/Ascent 0
/Descent -209
/ItalicAngle 0
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>>
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >>
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlDZ“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ý݇Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅ­P9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

Is there a particular reason why you're not using the GetStreamBytes() method that is provided with iText? What about data? Are you sure you are looking at the correct bytes? Did you create the PRStream object correctly and did you get the bytes with PdfReader.GetStreamBytesRaw()? If so, why decode the bytes yourself? Which brings me to my initial counter-question: is there a particular reason why you're not using the GetStreamBytes() method?

Looks like GetStreamBytes() might solve your problem out right, but let me point out that I think you're doing something dangerous concerning end-of-line markers. The PDF Specification in 7.3.8.1 states that:
The keyword stream that follows the stream dictionary shall be
followed by an end-of-line marker consisting of either a CARRIAGE
RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE
RETURN alone.
In your code it looks like you always skip two bytes while the spec says it could be either one or two (CR LF or LF).
You should be able to catch whether you are running into this by comparing the exact number of bytes you want to decode with the value of the (Required) "Length" key in the stream dictionary.

Okay, for anyone who might stumble across this issue themselves allow me to warn you - this is a rocky road without a great deal of good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command line calls to mutool.exe:
mutool extract C:\mypdf.pdf
This pulls all of the fonts into the folder mutool resides in (it also extracts some images (these are the fonts that could not be converted (usually small subsets I think))). I then wrote a method to move those from that folder into the one I wanted them in.
Of course, to convert these to anything usable is a headache in itself - but I have found it to be doable.
As a reminder, font piracy IS piracy.

Related

ZLIB.net (.NET) zeros after inflating a byte array (randomly working)

I am confused by the behavior of Deflate algorithm, such as, the first chunk of bytes (size 12~13k) always decompress successfully. But the second decompression never turns out successful..
I am using DotNetZip (DeflateStream) with a simple code, later I switched to ZLIB.Net (component ace), Org.Bouncycastle, and variety of c# libraries.
The compression goes in c++ (the server that sends the packets) with deflateInit2, windowSize (-15) -> (15 - nowrap).
What could be incorrectly going so that I'm having zeros at the end of the buffer despite the fact that the decompression went successfully?
a example code with "Org.BouncyCastle.Utilities.Zlib"
it's pretty much the same code for almost any lib (DotNetZip, ZLIB.Net, ...)
internal static bool Inflate(byte[] compressed, out byte[] decompressed)
{
using (var inputStream = new MemoryStream(compressed))
using (var zInputStream = new ZInputStream(inputStream, true))
using (var outputStream = new MemoryStream())
{
zInputStream.CopyTo(outputStream);
decompressed = outputStream.ToArray();
}
return true;
}
To make sure everything working correctly, you should check for the following:
Both zlib versions match on both sides (compression - server, and decompression - client).
Flush mode is set to sync, that means the buffer must be synchronized in order to decompress further packets sent by the server.
Ensure that the packets you received is actually correct, and at my specific case, I was appending a different array size (a constant one in fact, of 0xFFFF) which might be different than the size of the received data (and that happens in most cases).
[EDIT 13th Nov. 19']
Keep in mind that conventionally the server may not send the last 4 bytes (sync 00 00 ff ff) if both has a contract that the flush type is sync, so be aware of adding them manually.

load screenshot from adb through c#

I want to get a screenshot into c# using adb without saving files to the filesystem all the time.
I'm using the SharpAdbClient to talk with the device.
I'm on a windows platform.
This is what i got so far:
AdbServer server = new AdbServer();
StartServerResult result = server.StartServer(#"path\to\adb.exe", restartServerIfNewer: false);
DeviceData device = AdbClient.Instance.GetDevices().First();
ConsoleOutputReceiver receiver = new ConsoleOutputReceiver();
AdbClient.Instance.ExecuteRemoteCommand("screencap -p", device, receiver);
string str_image = receiver.ToString().Replace("\r\r", "");
byte[] bytes = Encoding.ASCII.GetBytes(str_image);
Image image = Image.FromStream(new MemoryStream(bytes));
I can successfully load both str_image, and create the byte array but it keeps saying System.ArgumentException when trying to load it into an Image.
I also tried saving the data to a file, but the file is corrupt.
I tried both replacing "\r\r" and "\r\n", both same result.
Anyone has some insight in how to load this file?
It's actually preferred if it could be loaded into a Emgu image since i'm gonna do some CV on it later.
One possible cause is the nonprintable ASCII characters in the string.
Look at the code below
string str_image = File.ReadAllText("test.png");
byte[] bytes = Encoding.ASCII.GetBytes(str_image);
byte[] actualBytes = File.ReadAllBytes("test.png");
str_image is shown in the below screencap, note that there are some non-printable chars (displayed as question mark).
The first eight bytes of a PNG file are always
137 80 78 71 13 10 26 10
While you read the console output as a string, then use ASCII to encode the string, the first byte becomes 63 (0x3F), which is the ASCII code for a question mark.
Also note that the size of the two byte arrays vary hugely (7828/7378).
And other thing is you are replace "\r\r", while actually a new line character in Windows is "\r\n".
So my conclusion is some image data is lost or modified in the output redirection by the ConsoleOutputReceiver, and you cannot recover the original data from the output string.

Zlib compression incompatibile C vs C# implementations

I'm trying to decompress data compressed with zlib algorithm in C# using 2 most legitimate libraries compatible with zlib algorithm and I got similar exception thrown.
Using DotNetZip:
Ionic.Zlib.ZlibException: Bad state (invalid stored block lengths)
Using Zlib.Net:
inflate: invalid stored block lenghts
but using same data as input to zlib-flate command on linux using only default parameters, works great and decompressed without any warnings (output is correct):
zlib-flate -uncompress < ./dbgZlib
Any suggestions what I can do in order to decompress this data in C# or why actually decompression failing in this case?
Compressed data as hex:
root#localhost:~# od -t x1 -An ./dbgZlib |tr -d '\n '
789c626063520b666060606262d26160d05307329999e70a6400e93c2066644080cf8c938c0c0c4d0d0d0d2d839c437c02dcfd0c0c0c11d28ea121013e7e41860ce18e210640e06810141669c080051840012eb970d790800090f99eee409ea189025e806c8e8b5354a89b13d81c136ca60f3a000e5fd6af0fb14a3221873e96400506374cd6c7d52dc8d98980657e7e06460ace0a4ce86e80da9f0249030edf816c16481ab06b60404f03931169c0cdc728c0db0fd928681a3042a481480347336c6e21320d78fb8155195a9090067ca3420387771a400a546aa70100000000ffff
Compressed data as base64:
root#localhost:~# base64 ./dbgZlib
eJxiYGNSC2ZgYGBiYtJhYNBTBzKZmecKZADpPCBmZECAz4yTjAwMTQ0NDS2DnEN8Atz9DAwMEdKO
oSEBPn5BhgzhjiEGQOBoEBQWacCABRhAAS65cNeQgACQ+Z7uQJ6hiQJegGyOi1NUqJsT2BwTbKYP
OgAOX9avD7FKMiGHPpZABQY3TNbH1S3I2YmAZX5+BkYKzgpM6G6A2p8CSQMO34FsFkgasGtgQE8D
kxFpwM3HKMDbD9koaBowQqSBSANHM2xuITINePuBVRlakJAGfKNCA4d3GkAKVGqnAQAAAAD//w==
Data after decompression, encoded with base64 look like this:
root#localhost:~# zlib-flate -uncompress < ./dbgZlib | base64
AAYCJlMAAAACAgIsAAAuJwAAAAMDnRBoAAAAbgAAAAEAAAAAAAAAAAAAAPMBkjIwMTUxMTE5UkNU
TFBHTjAwMQAAAAAAAAAAAABBVVRQTE5SMQBXQVQwMDAwQTBSVlkwAAAAAAAAAAAAAAAAAAAAAAAA
AAAwMDAwMDAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAMDAwMFdFVFBQTFBHTklHMDAwMTQgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAwMAAAAAAAAAAAAABEQlpVRkIAAAAAMDQAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAX14QAAAAAAAAAA
AAAAAAAAAAAAAAAAAAIBAAAAAAAAAAAAAABBVDAwMDBBMFJWWTAAAAAAAAAAUExOAAAAAAAAAABM
RUZSQ0IAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATk4wMiBDIAIAAAAAAAAAAAAAAAAA
AAABBfXhAAAAAGQCAgIsAABA9wAAAAQDnRBoAAA+gAAAAAEAAAAAAAAAAAAAAPMBkzIwMTUxMTE5
UkNGTDJQS04AAAAAAAAAAAAAAABBVVRQTE5SMgBXQVQwMDAwQTBZMEE2AAAAAAAAAAAAAAAAAAAA
AAAAAAAwMDAwMDAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAMDAwMFdFVFBQTFBLTjAwMDAwMTggICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAwMAAAAAAAAAAAAABETVpVUUIAAAAAMDQAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAAAAAX14QAAAAAA
AAAAAAAAAAAAAAAAAAAAAAIBAAAAAAAAAAAAAABBVDAwMDBBMFkwQTYAAAAAAAAAUExOAAAAAAAA
AABMRUZSQ0IAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATk4wMiBDIAIAAAAAAAAAAAAA
AAAAAAABBfXhAAAAAGQ=
The problem is that you are using zlib-flate as a general-purpose compression algorithm which, according to the manpage for it, you should not do:
This program should not be used as a general purpose compression tool.
Use something like gzip(1) instead.
So perhaps you should follow the instructions given by your tools and not use them for things that they are not intended for. Use gzip and the System.IO.Compression.GZipStream instead, it's much simpler, especially when you're looking for cross-platform compatible compression algorithms.
That said...
The reason that you can't inflate the data is that it lacks a correct GZIP header. If you add the right header to it you will get something that can be decompressed.
For instance:
public static byte[] DecompressZLibRaw(byte[] bCompressed)
{
byte[] bHdr = new byte[] { 0x1F, 0x8b, 0x08, 0, 0, 0, 0, 0 };
using (var sOutput = new MemoryStream())
using (var sCompressed = new MemoryStream())
{
sCompressed.Write(bHdr, 0, bHdr.Length);
sCompressed.Write(bCompressed, 0, bCompressed.Length);
sCompressed.Position = 0;
using (var decomp = new GZipStream(sCompressed, CompressionMode.Decompress))
{
decomp.CopyTo(sOutput);
}
return sOutput.ToArray();
}
}
Adding the header makes all the difference.
NB: There are two bytes in the 10-byte GZIP header that are not stripped from your source. These are normally used to store the compression flags and the source file system. In the compressed data you present they are invalid values. Additionally the file footer is abbreviated to 5 bytes instead of 8 bytes... all of which is not actually required for decompression. Which probably has a lot to do with why the manpage says not to use this for general compression.
The stream you provided is not complete. It appears that you ended it with a Z_SYNC_FLUSH or Z_FULL_FLUSH in your C# code, instead of a Z_FINISH like you're supposed to. That is causing the error. If you terminate the stream properly, you won't have a problem.
zlib-flate is simply ignoring that error.
If you are not in control of the generation of the stream, you can still use zlib to decompress what's there. You just need to use it at a lower level where you operate on blocks of data and get the decompressed data available given the provided input.

Is there an easy way to manually decode a FlateDecode Filter to extract text in a PDF? C#

I posted a question related to this a while back but got no responses. Since then, I've discovered that the PDF is encoded using FlateDecode, and I was wondering if there is a way to manually decode the PDF in C# (Windows Phone 8)? I'm getting output like the following:
%PDF-1.5
%????
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
5 0 obj
<<
/Filter /FlateDecode
/Length 9
>>
stream x^+
The PDF has been created using the SyncFusion PDF controls for Windows Phone 8. Unfortunately, they do not currently have a text extraction feature, and I couldn't find that feature in other WP PDF controls either.
Basically, all I want is to download the PDF from OneDrive and read the PDF contents. Curious if this is easily doable?
private static string decompress(byte[] input)
{
byte[] cutinput = new byte[input.Length - 2];
Array.Copy(input, 2, cutinput, 0, cutinput.Length);
var stream = new MemoryStream();
using (var compressStream = new MemoryStream(cutinput))
using (var decompressor = new DeflateStream(compressStream, CompressionMode.Decompress))
decompressor.CopyTo(stream);
return Encoding.Default.GetString(stream.ToArray());
}
According to below similar question the first 2 bytes of the stream has to be cut from the stream. This is done in above function. Just pass all bytes of the stream to input. Make sure the bytecount is the same as the length specified.
C# decode (decompress) Deflate data of PDF File
The easiest solution is to use DeflateStream provided by .NET framework. Example can be found in similar thread. This approach might have some pitfalls.
If this doesn't work, there are libraries (like DotNetZip), capable of deflate stream decompression. Please check this link for performance comparison.
The last possible option I see, without reinventing wheel is to use other PDF parsing libraries and use them for stream decompression, or even for whole PDF processing.

How to detect if a file is PDF or TIFF?

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.
Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.
The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.
For example, lets say that there are 2 files I need to display, one is tiff and one is PDF. The page should show up with a tiff image, and perhaps a link that would open up the PDF in a new tab/window.
The problem:
As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.
http://support.microsoft.com/kb/326965
Is this problem easier than I think or is it as nasty as I am expecting?
OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:
private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;
private bool IsTiff(Stream stm)
{
stm.Seek(0);
if (stm.Length < kMinimumTiffSize)
return false;
byte[] header = new byte[kHeaderSize];
stm.Read(header, 0, header.Length);
if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
return false;
bool isIntel = header[0] == kIntelMark;
ushort magicNumber = ReadShort(stm, isIntel);
if (magicNumber != kTiffMagicNumber)
return false;
return true;
}
private ushort ReadShort(Stream stm, bool isIntel)
{
byte[] b = new byte[2];
_stm.Read(b, 0, b.Length);
return ToShort(_isIntel, b[0], b[1]);
}
private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
if (isIntel)
{
return (ushort)(((int)b1 << 8) | (int)b0);
}
else
{
return (ushort)(((int)b0 << 8) | (int)b1);
}
}
I hacked apart some much more general code to get this.
For PDF, I have code that looks like this:
public bool IsPdf(Stream stm)
{
stm.Seek(0, SeekOrigin.Begin);
PdfToken token;
while ((token = GetToken(stm)) != null)
{
if (token.TokenType == MLPdfTokenType.Comment)
{
if (token.Text.StartsWith("%PDF-1."))
return true;
}
if (stm.Position > 1024)
break;
}
return false;
}
Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking at substring to avoid a problem like this:
% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage
this code is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.
I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
TIFF can be detected by peeking at first bytes http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/
The first 8 bytes forms the header.
The first two bytes of which is either
"II" for little endian byte ordering
or "MM" for big endian byte ordering.
About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf
The header contains just one line that
identifies the version of PDF.
Example: %PDF-1.6
Reading the specification for each file format will tell you how to identify files of that format.
TIFF files - Check bytes 1 and 2 for 0x4D4D or 0x4949 and bytes 2-3 for the value '42'.
Page 13 of the spec reads:
A TIFF file begins with an 8-byte
image file header, containing the
following information: Bytes 0-1: The
byte order used within the file. Legal
values are: “II” (4949.H) “MM”
(4D4D.H) In the “II” format, byte
order is always from the least
significant byte to the most
significant byte, for both 16-bit and
32-bit integers This is called
little-endian byte order. In the “MM”
format, byte order is always from most
significant to least significant, for
both 16-bit and 32-bit integers. This
is called big-endian byte order. Bytes
2-3 An arbitrary but carefully chosen
number (42) that further identifies
the file as a TIFF file. The byte
order depends on the value of Bytes
0-1.
PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)
Section 7.5.2
The first line of a PDF file shall be
a header consisting of the 5
characters %PDF– followed by a version
number of the form 1.N, where N is a
digit between 0 and 7. A conforming
reader shall accept files with any of
the following headers: %PDF–1.0,
%PDF–1.1, %PDF–1.2, %PDF–1.3, %PDF–1.4,
%PDF–1.5, %PDF–1.6, %PDF–1.7 Beginning
with PDF 1.4, the Version entry in the
document’s catalog dictionary (located
via the Root entry in the file’s
trailer, as described in 7.5.5, "File
Trailer"), if present, shall be used
instead of the version specified in
the Header.
If a PDF file contains binary data, as
most do (see 7.2, "Lexical
Conventions"), the header line shall
be immediately followed by a comment
line containing at least four binary
characters—that is, characters whose
codes are 128 or greater. This ensures
proper behaviour of file transfer
applications that inspect data near
the beginning of a file to determine
whether to treat the file’s contents
as text or as binary.
Of course you could do a "deeper" check on each file by checking more file specific items.
A very useful list of File Signatures aka "magic numbers" by Gary Kessler is available http://www.garykessler.net/library/file_sigs.html
Internally, the file header information should help. if you do a low-level file open, such as StreamReader() or FOPEN(), look at the first two characters in the file... Almost every file type has its own signature.
PDF always starts with "%P" (but more specifically would have like %PDF)
TIFF appears to start with "II"
Bitmap files with "BM"
Executable files with "MZ"
I've had to deal with this in the past too... also to help prevent unwanted files from being uploaded to a given site and immediately aborting it once checked.
EDIT -- Posted sample code to read and test file header types
String fn = "Example.pdf";
StreamReader sr = new StreamReader( fn );
char[] buf = new char[5];
sr.Read( buf, 0, 4);
sr.Close();
String Hdr = buf[0].ToString()
+ buf[1].ToString()
+ buf[2].ToString()
+ buf[3].ToString()
+ buf[4].ToString();
String WhatType;
if (Hdr.StartsWith("%PDF"))
WhatType = "PDF";
else if (Hdr.StartsWith("MZ"))
WhatType = "EXE or DLL";
else if (Hdr.StartsWith("BM"))
WhatType = "BMP";
else if (Hdr.StartsWith("?_"))
WhatType = "HLP (help file)";
else if (Hdr.StartsWith("\0\0\1"))
WhatType = "Icon (.ico)";
else if (Hdr.StartsWith("\0\0\2"))
WhatType = "Cursor (.cur)";
else
WhatType = "Unknown";
If you go here, you will see that the TIFF usually starts with "magic numbers" 0x49 0x49 0x2A 0x00 (some other definitions are also given), which is the first 4 bytes of the file.
So just use these first 4 bytes to determine whether file is TIFF or not.
EDIT, it is probably better to do it the other way, and detect PDF first. The magic numbers for PDF are more standardized: As Plinth kindly pointed out they start with "%PDF" somewhere in the first 1024 bytes (0x25 0x50 0x44 0x46). source
You are going to have to write an ashx to get the file requested.
then, your handler should read the first few bytes (or so) to determine what the file type really is-- PDF and TIFF's have "magic numers" in the beginning of the file that you can use to determin this, then set your Response Headers accordingly.
you can use Myrmec to identify the file type, this library use the file byte head. this library avaliable on nuget "Myrmec",and this is the repo, myrmec also support mime type,you can try it. the code will like this :
// create a sniffer instance.
Sniffer sniffer = new Sniffer();
// populate with mata data.
sniffer.Populate(FileTypes.CommonFileTypes);
// get file head byte, may be 20 bytes enough.
byte[] fileHead = ReadFileHead();
// start match.
List<string> results = sniffer.Match(fileHead);
and get mime type :
List<string> result = sniffer.Match(head);
string mimeType = MimeTypes.GetMimeType(result.First());
but that support tiff only "49 49 2A 00" and "4D 4D 00 2A" two signature, if you have more you can add your self, may be you can see the readme file of myrmec for help. myrmec github repo

Categories