How to detect if a file is PDF or TIFF?

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.
Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.
The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.
For example, let's say that there are 2 files I need to display, one is TIFF and one is PDF. The page should show up with a TIFF image, and perhaps a link that would open up the PDF in a new tab/window.
The problem:
As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.
http://support.microsoft.com/kb/326965
Is this problem easier than I think or is it as nasty as I am expecting?

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:
private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;
private bool IsTiff(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;
    // Bytes 0-1: byte-order mark, "II" (Intel) or "MM" (Motorola).
    byte[] header = new byte[kHeaderSize];
    stm.Read(header, 0, header.Length);
    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;
    // Bytes 2-3: the magic number 42, stored in the file's byte order.
    ushort magicNumber = ReadShort(stm, isIntel);
    return magicNumber == kTiffMagicNumber;
}
private ushort ReadShort(Stream stm, bool isIntel)
{
    byte[] b = new byte[2];
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}
private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        // Little-endian: the second byte is the most significant.
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        // Big-endian: the first byte is the most significant.
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}
I hacked apart some much more general code to get this.
For PDF, I have code that looks like this:
public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null)
    {
        if (token.TokenType == MLPdfTokenType.Comment)
        {
            if (token.Text.StartsWith("%PDF-1."))
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}
Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking at substring to avoid a problem like this:
% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage
This file is marked as NOT a PDF by the above code snippet, whereas a more simplistic check (a plain substring search) would incorrectly mark it as a PDF.
I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.

TIFF can be detected by peeking at the first bytes: http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/
The first 8 bytes form the header.
The first two bytes are either
"II" for little endian byte ordering
or "MM" for big endian byte ordering.
About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf
The header contains just one line that
identifies the version of PDF.
Example: %PDF-1.6

Reading the specification for each file format will tell you how to identify files of that format.
TIFF files - Check bytes 0 and 1 for 0x4D4D or 0x4949 and bytes 2-3 for the value '42'.
Page 13 of the spec reads:
A TIFF file begins with an 8-byte image file header, containing the following information:
Bytes 0-1: The byte order used within the file. Legal values are:
“II” (4949.H)
“MM” (4D4D.H)
In the “II” format, byte order is always from the least significant byte to the most significant byte, for both 16-bit and 32-bit integers. This is called little-endian byte order. In the “MM” format, byte order is always from most significant to least significant, for both 16-bit and 32-bit integers. This is called big-endian byte order.
Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file. The byte order depends on the value of Bytes 0-1.
PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)
Section 7.5.2
The first line of a PDF file shall be
a header consisting of the 5
characters %PDF– followed by a version
number of the form 1.N, where N is a
digit between 0 and 7. A conforming
reader shall accept files with any of
the following headers: %PDF–1.0,
%PDF–1.1, %PDF–1.2, %PDF–1.3, %PDF–1.4,
%PDF–1.5, %PDF–1.6, %PDF–1.7 Beginning
with PDF 1.4, the Version entry in the
document’s catalog dictionary (located
via the Root entry in the file’s
trailer, as described in 7.5.5, "File
Trailer"), if present, shall be used
instead of the version specified in
the Header.
If a PDF file contains binary data, as
most do (see 7.2, "Lexical
Conventions"), the header line shall
be immediately followed by a comment
line containing at least four binary
characters—that is, characters whose
codes are 128 or greater. This ensures
proper behaviour of file transfer
applications that inspect data near
the beginning of a file to determine
whether to treat the file’s contents
as text or as binary.
Of course you could do a "deeper" check on each file by checking more file specific items.
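The two header rules quoted above can be combined into one check. The following is only a minimal sketch: the names FileKind and SignatureSniffer are made up for illustration, and the PDF test is a plain byte search for "%PDF-" in the first 1024 bytes, so it will also accept the PostScript false positive described in the tokenizer answer.

```csharp
using System;
using System.Text;

// Hypothetical names, introduced for illustration only.
public enum FileKind { Unknown, Tiff, Pdf }

public static class SignatureSniffer
{
    // Classifies a buffer holding the first bytes of a file (1024 bytes is enough).
    public static FileKind Detect(byte[] head)
    {
        if (head == null) return FileKind.Unknown;
        if (IsTiff(head)) return FileKind.Tiff;
        if (IsPdf(head)) return FileKind.Pdf;
        return FileKind.Unknown;
    }

    // TIFF: "II" + 42 little-endian (49 49 2A 00) or "MM" + 42 big-endian (4D 4D 00 2A).
    private static bool IsTiff(byte[] h)
    {
        if (h.Length < 8) return false; // the TIFF header is 8 bytes long
        bool little = h[0] == 0x49 && h[1] == 0x49 && h[2] == 0x2A && h[3] == 0x00;
        bool big = h[0] == 0x4D && h[1] == 0x4D && h[2] == 0x00 && h[3] == 0x2A;
        return little || big;
    }

    // PDF: "%PDF-" anywhere in the first 1024 bytes (the lenient Acrobat rule).
    private static bool IsPdf(byte[] h)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%PDF-");
        int limit = Math.Min(h.Length, 1024) - marker.Length;
        for (int i = 0; i <= limit; i++)
        {
            int j = 0;
            while (j < marker.Length && h[i + j] == marker[j]) j++;
            if (j == marker.Length) return true;
        }
        return false;
    }
}
```

TIFF is tested first because its signature sits at a fixed offset, while the PDF marker may float within the first kilobyte.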

A very useful list of File Signatures aka "magic numbers" by Gary Kessler is available http://www.garykessler.net/library/file_sigs.html

Internally, the file header information should help. If you do a low-level file open, such as StreamReader() or FOPEN(), look at the first two characters in the file... Almost every file type has its own signature.
PDF always starts with "%P" (more specifically, with "%PDF")
TIFF appears to start with "II"
Bitmap files with "BM"
Executable files with "MZ"
I've had to deal with this in the past too... also to help prevent unwanted files from being uploaded to a given site and immediately aborting it once checked.
EDIT -- Posted sample code to read and test file header types
String fn = "Example.pdf";
StreamReader sr = new StreamReader( fn );
char[] buf = new char[5];
// Read all 5 characters. StreamReader decodes text, so this is only
// reliable for signatures made of plain ASCII characters.
sr.Read( buf, 0, buf.Length );
sr.Close();
String Hdr = new String( buf );
String WhatType;
if (Hdr.StartsWith("%PDF"))
WhatType = "PDF";
else if (Hdr.StartsWith("MZ"))
WhatType = "EXE or DLL";
else if (Hdr.StartsWith("BM"))
WhatType = "BMP";
else if (Hdr.StartsWith("?_"))
WhatType = "HLP (help file)";
else if (Hdr.StartsWith("\0\0\1"))
WhatType = "Icon (.ico)";
else if (Hdr.StartsWith("\0\0\2"))
WhatType = "Cursor (.cur)";
else
WhatType = "Unknown";

If you go here, you will see that the TIFF usually starts with "magic numbers" 0x49 0x49 0x2A 0x00 (some other definitions are also given), which is the first 4 bytes of the file.
So just use these first 4 bytes to determine whether file is TIFF or not.
EDIT, it is probably better to do it the other way, and detect PDF first. The magic numbers for PDF are more standardized: As Plinth kindly pointed out they start with "%PDF" somewhere in the first 1024 bytes (0x25 0x50 0x44 0x46). source

You are going to have to write an ashx handler to serve the requested file.
Your handler should then read the first few bytes (or so) to determine what the file type really is. PDFs and TIFFs have "magic numbers" at the beginning of the file that you can use to determine this. Then set your response headers accordingly.
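As a rough sketch of the sniffing step only (not the handler itself), something like the following could pick the Content-Type. ContentTypeSniffer and Guess are hypothetical names, and falling back to application/octet-stream for unrecognized files is my assumption, not something the question requires.

```csharp
using System;
using System.IO;

// Hypothetical helper, introduced for illustration only.
public static class ContentTypeSniffer
{
    // Reads up to 1024 bytes from the start of a seekable stream and
    // returns a MIME type guess. The stream is rewound afterwards.
    public static string Guess(Stream stream)
    {
        byte[] head = new byte[1024];
        stream.Seek(0, SeekOrigin.Begin);
        int total = 0, n;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (total < head.Length &&
               (n = stream.Read(head, total, head.Length - total)) > 0)
        {
            total += n;
        }
        stream.Seek(0, SeekOrigin.Begin);

        if (total >= 4)
        {
            bool tiffLittle = head[0] == 0x49 && head[1] == 0x49 && head[2] == 0x2A && head[3] == 0x00;
            bool tiffBig = head[0] == 0x4D && head[1] == 0x4D && head[2] == 0x00 && head[3] == 0x2A;
            if (tiffLittle || tiffBig) return "image/tiff";
        }
        // "%PDF" is 25 50 44 46; accept it anywhere in the bytes that were read.
        for (int i = 0; i + 4 <= total; i++)
        {
            if (head[i] == 0x25 && head[i + 1] == 0x50 && head[i + 2] == 0x44 && head[i + 3] == 0x46)
                return "application/pdf";
        }
        return "application/octet-stream"; // assumed fallback for anything else
    }
}
```

Inside the handler's ProcessRequest, the result would be assigned to context.Response.ContentType before the file contents are written to the response.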

You can use Myrmec to identify the file type; the library matches on the leading bytes of the file. It is available on NuGet as "Myrmec", and it also supports MIME types. The code looks like this:
// Create a sniffer instance.
Sniffer sniffer = new Sniffer();
// Populate it with metadata.
sniffer.Populate(FileTypes.CommonFileTypes);
// Get the file head bytes; 20 bytes may be enough.
byte[] fileHead = ReadFileHead();
// Start matching.
List<string> results = sniffer.Match(fileHead);
To get the MIME type:
List<string> result = sniffer.Match(head);
string mimeType = MimeTypes.GetMimeType(result.First());
For TIFF it only supports the two signatures "49 49 2A 00" and "4D 4D 00 2A"; if you have more, you can add them yourself. See the readme in the Myrmec GitHub repo for help.

Related

Reading a file signature and telling the difference between a zip file and a docx file

I have an upload routine where I read the first few bytes into an array and convert it to a hex string to get the file signature.
I have been reading the first 4 bytes into the array and everything seemed to be going fine until I ran across a problem with a .zip file and a .docx file. They both have the same signature in the first 4 bytes: "50-4b-03-04".
So I looked at the next byte, and for .docx it is "14", but it was that on some .zip files as well. I looked up this file signature and found this sequence is used by a lot of file types, including JAR, ZIP, DOCX, XLSX, and Open Office documents.
Does anyone know of a good way to read the file signature and determine the file type accurately? How does Windows know the difference? It has to be more than just the first 4 bytes. I'm looking to read the file signatures for file uploads to ensure only approved file types are allowed to be uploaded.

What I did was put the file signatures into a database, along with the signature length and the extension for each file type. If the file doesn't have an extension, it isn't uploaded. If the file extension doesn't match the signature, the routine rejects the file. Here is the code in the routine that pulls the signatures and does the comparison:
using var fileStream = file.OpenReadStream();
var signature = _context.FileSignatures.Select(f => new { f.FileSignature, f.AllowedFileType.FileExtension, f.SignatureLength })
.Where(x => x.FileExtension == fileType);
byte[] bytes = new byte[signature.Max(x => x.SignatureLength)];
fileStream.Read(bytes, 0, signature.Max(x => x.SignatureLength));
string hexData = BitConverter.ToString(bytes);
var foundFile = await signature.FirstAsync(x => x.FileSignature == hexData);
return foundFile.FileExtension;
File signatures are stored in the table like this:
File Extension   FileSignature   SignatureLength
.PDF             25-50-44-46     4
This way I can make sure to read the maximum number of bytes for the signature and get the extension. If I want to include more files, I just add them to the database.
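Signature matching alone still cannot separate .zip from .docx, since both begin with 50-4B-03-04. One possible approach, sketched here under the assumption that the upload is small enough to open as an archive, is to look at the entry names inside the ZIP container. ZipContainerSniffer is a hypothetical name, and the folder names checked (word/, xl/, ppt/) are the conventional Office layout rather than a guarantee.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, introduced for illustration only.
public static class ZipContainerSniffer
{
    // Returns "docx", "xlsx", "pptx", "jar", or "zip" for a ZIP-based stream.
    // ZipArchive throws InvalidDataException if the stream is not a ZIP file.
    public static string Classify(Stream zipStream)
    {
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            var names = archive.Entries.Select(e => e.FullName).ToList();
            // Office Open XML packages carry a [Content_Types].xml part.
            if (names.Contains("[Content_Types].xml"))
            {
                if (names.Any(n => n.StartsWith("word/", StringComparison.Ordinal))) return "docx";
                if (names.Any(n => n.StartsWith("xl/", StringComparison.Ordinal))) return "xlsx";
                if (names.Any(n => n.StartsWith("ppt/", StringComparison.Ordinal))) return "pptx";
            }
            // JAR files normally contain a manifest.
            if (names.Contains("META-INF/MANIFEST.MF")) return "jar";
            return "zip";
        }
    }
}
```

This only inspects names, so a deliberately crafted archive can still pass; it narrows the type, it does not validate the content.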

load screenshot from adb through c#

I want to get a screenshot into c# using adb without saving files to the filesystem all the time.
I'm using the SharpAdbClient to talk with the device.
I'm on a windows platform.
This is what i got so far:
AdbServer server = new AdbServer();
StartServerResult result = server.StartServer(@"path\to\adb.exe", restartServerIfNewer: false);
DeviceData device = AdbClient.Instance.GetDevices().First();
ConsoleOutputReceiver receiver = new ConsoleOutputReceiver();
AdbClient.Instance.ExecuteRemoteCommand("screencap -p", device, receiver);
string str_image = receiver.ToString().Replace("\r\r", "");
byte[] bytes = Encoding.ASCII.GetBytes(str_image);
Image image = Image.FromStream(new MemoryStream(bytes));
I can successfully load str_image and create the byte array, but it keeps throwing System.ArgumentException when trying to load it into an Image.
I also tried saving the data to a file, but the file is corrupt.
I tried replacing both "\r\r" and "\r\n", with the same result.
Does anyone have some insight into how to load this file?
It would actually be preferable to load it into an Emgu image, since I'm going to do some CV on it later.

One possible cause is the non-printable ASCII characters in the string.
Look at the code below:
string str_image = File.ReadAllText("test.png");
byte[] bytes = Encoding.ASCII.GetBytes(str_image);
byte[] actualBytes = File.ReadAllBytes("test.png");
Inspecting str_image shows that it contains some non-printable characters (displayed as question marks).
The first eight bytes of a PNG file are always
137 80 78 71 13 10 26 10
When you read the console output as a string and then use ASCII to encode the string, the first byte becomes 63 (0x3F), which is the ASCII code for a question mark.
Also note that the sizes of the two byte arrays differ (7828 vs. 7378).
Another thing: you are replacing "\r\r", while a new line in Windows is actually "\r\n".
So my conclusion is that some image data is lost or modified in the output redirection by the ConsoleOutputReceiver, and you cannot recover the original data from the output string.
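The loss can be reproduced without a device. This is a small sketch (AsciiRoundTrip is a hypothetical name) that pushes bytes through an ASCII string and back, which is effectively what happens to the screencap output in the question.

```csharp
using System;
using System.Text;

// Hypothetical helper, introduced for illustration only.
public static class AsciiRoundTrip
{
    // Decodes raw bytes as ASCII text, then encodes the text back to bytes.
    // Any byte above 127 is not valid ASCII and is replaced with '?' (63).
    public static byte[] ThroughAsciiString(byte[] raw)
    {
        string text = Encoding.ASCII.GetString(raw);
        return Encoding.ASCII.GetBytes(text);
    }
}
```

Passing the PNG signature { 137, 80, 78, 71, 13, 10, 26, 10 } through ThroughAsciiString returns an array whose first byte is 63, so the result is no longer a valid PNG header. The original value 137 cannot be recovered from the string afterwards.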

Saving a byte array to PDF file with OfficeJs

Using OfficeJs I want to save a Word document as a PDF and post that file to an Api.
Office.context.document.getFileAsync will let you get the entire document in a choice of 3 formats:
compressed: returns the entire document (.pptx or .docx) in Office Open XML (OOXML) format as a byte array
pdf: returns the entire document in PDF format as a byte array
text: returns only the text of the document as a string. (Word only)
I am posting the PDF byte array to a WebApi action that looks like this:
public async Task<IHttpActionResult> Upload([FromBody]byte[] bytes)
{
File.WriteAllBytes(@"C:\temp\testpdf.pdf", bytes);
return Ok();
}
On inspection the byte array is the same array created by the getFileAsync from Office Js.
The problem is the file written in File.WriteAllBytes is corrupt. If I open it with notepad, it is a string of the bytes - 37,80,68,70,45,49,46,53,13,10,37... and so on.
Any idea why the method WriteAllBytes does not create a PDF file from the OfficeJS pdf byte stream?
UPDATE 25/5/16
As hawkeye @StefanHegny pointed out, the byte array appears to be ASCII characters. Converting each byte to char and writing that out to PDF like this creates a blank PDF, but on inspection with Notepad, the contents do look like a PDF document, though quite different to that when saving the same .docx as a .pdf.
var content = "";
foreach (var b in model.Bytes)
{
content += (char) b;
}
File.WriteAllText(@"C:\temp\testpdf.pdf", content);
Also note, this is extremely slow: about 5 minutes for a 500 KB PDF byte array on my dev machine.

I had the same empty PDF problem, and it was because I was converting to a string and writing the string to the file (an encoding problem). I solved it by sending the comma-separated byte codes to the C# code instead of converting to a string, then parsing the bytes and using File.WriteAllBytes().
C# code:
string[] strings = HttpUtility.HtmlDecode(pdf).Split(',');
byte[] bytes = strings.Select(s => byte.Parse(s)).ToArray();
System.IO.File.WriteAllBytes("filename.pdf", bytes);

How to determine encoding of image using header bytes

So I am using C#, and I need to determine the actual encoding of an image file. Most images can be in one format while having a different extension and still work in general.
My needs require precise knowledge of the image format.
There is one other thread that deals with this: Determine Image Encoding of Image File
This shows how to find the actual encoding once you have the image's header information. I need to open the image and extract this header information.
FileStream imageFile = new FileStream("myImage.gif", FileMode.Open);
After this bit, how do I open only the bytes which contain the header?
Thank you.

You can't really read "just the header" unless you know its size.
Instead, determine the minimum number of bytes you need to be able to distinguish between the formats you need to support, and read only those bytes. Most likely all of the formats you need will have a unique header.
For example, if you need to support PNG and JPEG, those formats start with:
PNG: 89 50 4E 47 0D 0A 1A 0A
JPEG: FF D8 FF E0
So in that case you'd only have to read a single byte to distinguish between the two. In reality I'd say use a few more bytes, just in case you encounter other file formats.
To read, say 8 bytes, from the beginning of a file:
using( var sr = new FileStream( "file", FileMode.Open ) )
{
var data = new byte[8];
int numRead = sr.Read( data, 0, data.Length );
// numRead gives you the number of bytes read
}
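Once the bytes are in hand, the comparison itself is short. Below is a sketch under a few assumptions: ImageHeaderSniffer is a made-up name, the GIF and BMP signatures ("GIF8" and "BM") are added beyond the two formats listed above, and JPEG is matched on its first three bytes only because the fourth byte is not always E0.

```csharp
using System;

// Hypothetical helper, introduced for illustration only.
public static class ImageHeaderSniffer
{
    // Identifies an image format from the first bytes of the file (8 is enough).
    public static string Identify(byte[] head)
    {
        if (StartsWith(head, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "png";
        if (StartsWith(head, 0xFF, 0xD8, 0xFF)) return "jpeg";
        if (StartsWith(head, 0x47, 0x49, 0x46, 0x38)) return "gif"; // "GIF8" (GIF87a / GIF89a)
        if (StartsWith(head, 0x42, 0x4D)) return "bmp";              // "BM"
        return "unknown";
    }

    // True when data begins with the given prefix bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

The data array filled by the FileStream read above can be passed straight to Identify; a short read simply fails the longer prefixes.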

Well, I figured it out in the end, so I'm going to update the thread and close it. The only issue with my solution is that it requires opening the entire image file, rather than just the required bytes. This uses a lot more memory and takes longer, so it isn't the optimal solution when speed is a concern.
Just to give credit where it's due, this code was created from a couple of sources here on Stack Overflow; you can find the links in the OP and earlier comments. The rest of the code was written by me.
If anyone feels like modifying the code to only open the correct number of bytes, feel free.
TextWriterTraceListener writer = new TextWriterTraceListener(System.Console.Out);
Debug.Listeners.Add(writer);
// PNG file contains 8 - bytes header.
// JPEG file contains 2 - bytes header(SOI) followed by series of markers,
// some markers can be followed by data array. Each type of marker has different header format.
// The bytes where the image is stored follows SOF0 marker(10 - bytes length).
// However, between JPEG header and SOF0 marker there can be other segments.
// BMP file contains 14 - bytes header.
// GIF file contains at least 14 bytes in its header.
FileStream memStream = new FileStream(@"C:\a.png", FileMode.Open);
Image fileImage = Image.FromStream(memStream);
//get image format
var fileImageFormat = typeof(System.Drawing.Imaging.ImageFormat).GetProperties(System.Reflection.BindingFlags.Public | System.Reflection.BindingFlags.Static).ToList().ConvertAll(property => property.GetValue(null, null)).Single(image_format => image_format.Equals(fileImage.RawFormat));
MessageBox.Show("File Format: " + fileImageFormat);
//get image codec
var fileImageFormatCodec = System.Drawing.Imaging.ImageCodecInfo.GetImageDecoders().ToList().Single(image_codec => image_codec.FormatID == fileImage.RawFormat.Guid);
MessageBox.Show("MimeType: " + fileImageFormatCodec.MimeType + " \n" + "Extension: " + fileImageFormatCodec.FilenameExtension + "\n" + "Actual Codec: " + fileImageFormatCodec.CodecName);
Output is as Expected:
file_image_format: Png
Built-in PNG Codec, mime: image/png, extension: *.PNG

Issues Decoding Flate from PDF Embedded Font

Ok, before we start. I work for a company that has a license to redistribute PDF files from various publishers in any media form. So, that being said, the extraction of embedded fonts from the given PDF files is not only legal - but also vital to the presentation.
I am using code found on this site, however I do not recall the author, when I find it I will reference them. I have located the stream within the PDF file that contains the embedded fonts, I have isolated this encoded stream as a string and then into a byte[]. When I use the following code I get an error
Block length does not match with its complement.
Code (the error occurs in the while line below):
private static byte[] DecodeFlateDecodeData(byte[] data)
{
MemoryStream outputStream;
using (outputStream = new MemoryStream())
{
using (var compressedDataStream = new MemoryStream(data))
{
// Remove the first two bytes to skip the header (it isn't recognized by the DeflateStream class)
compressedDataStream.ReadByte();
compressedDataStream.ReadByte();
var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
var decompressedBuffer = new byte[compressedDataStream.Length];
int read;
// The error occurs in the following line
while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
{
outputStream.Write(decompressedBuffer, 0, read);
}
outputStream.Flush();
compressedDataStream.Close();
}
return ReadFully(outputStream);
}
}
After using the usual tools (Google, Bing, archives here) I found that the majority of the time this occurs when one has not consumed the first two bytes of the encoded stream. That is done here, though, so I cannot find the source of this error. Below is the encoded stream:
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlÇ±“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ýÝ‡Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü
Please help, I am beating my head against the wall here!
NOTE: The stream above is the encoded version of Arial Black - according to the specs inside the PDF:
661 0 obj
<<
/Type /FontDescriptor
/FontFile3 662 0 R
/FontBBox [ -194 -307 1688 1083 ]
/FontName /HLJOBA+ArialBlack
/Flags 4
/StemV 0
/CapHeight 715
/XHeight 518
/Ascent 0
/Descent -209
/ItalicAngle 0
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>>
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >>
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlÇ±“ºu“°tƒ¦t0ÊD¶jˆ
Ö m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ýÝ‡Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

Is there a particular reason why you're not using the GetStreamBytes() method that is provided with iText? What about data? Are you sure you are looking at the correct bytes? Did you create the PRStream object correctly and did you get the bytes with PdfReader.GetStreamBytesRaw()? If so, why decode the bytes yourself? Which brings me to my initial counter-question: is there a particular reason why you're not using the GetStreamBytes() method?

Looks like GetStreamBytes() might solve your problem outright, but let me point out that I think you're doing something dangerous concerning end-of-line markers. The PDF Specification in 7.3.8.1 states that:
The keyword stream that follows the stream dictionary shall be
followed by an end-of-line marker consisting of either a CARRIAGE
RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE
RETURN alone.
In your code it looks like you always skip two bytes while the spec says it could be either one or two (CR LF or LF).
You should be able to catch whether you are running into this by comparing the exact number of bytes you want to decode with the value of the (Required) "Length" key in the stream dictionary.
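To make the end-of-line point concrete, here is a sketch of locating the first data byte and inflating it. FlateDecoder and its method names are hypothetical. It works on the raw file bytes rather than a string, since a round trip through a string can alter binary data, and it assumes the data carries the usual 2-byte zlib header, which DeflateStream does not understand.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, introduced for illustration only.
public static class FlateDecoder
{
    // Given the index just past the "stream" keyword, returns the index of the
    // first data byte, skipping either CR LF or a single LF.
    public static int SkipStreamEol(byte[] pdf, int afterKeyword)
    {
        if (afterKeyword < pdf.Length && pdf[afterKeyword] == 0x0D) afterKeyword++; // optional CR
        if (afterKeyword < pdf.Length && pdf[afterKeyword] == 0x0A) afterKeyword++; // LF
        return afterKeyword;
    }

    // Inflates zlib-wrapped data: skips the 2-byte zlib header, then lets
    // DeflateStream decode the raw deflate data that follows.
    public static byte[] Inflate(byte[] data, int offset, int length)
    {
        if (length < 2) throw new ArgumentException("too short for a zlib header");
        using (var input = new MemoryStream(data, offset + 2, length - 2))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

The length argument would come from the /Length entry in the stream dictionary (1700 in the object shown in the question), starting at the offset returned by SkipStreamEol.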

Okay, for anyone who might stumble across this issue themselves allow me to warn you - this is a rocky road without a great deal of good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command line calls to mutool.exe:
mutool extract C:\mypdf.pdf
This pulls all of the fonts into the folder mutool resides in. It also extracts some images; these are the fonts that could not be converted (usually small subsets, I think). I then wrote a method to move those from that folder into the one I wanted them in.
Of course, to convert these to anything usable is a headache in itself - but I have found it to be doable.
As a reminder, font piracy IS piracy.
