Saving a byte array to PDF file with OfficeJs

Saving a byte array to PDF file with OfficeJs - c#

Using OfficeJs I want to save a Word document as a PDF and post that file to an Api.
Office.context.document.getFileAsync will let you get the entire document in a choice of 3 formats:
compressed: returns the entire document (.pptx or .docx) in Office Open XML (OOXML) format as a byte array
pdf: returns the entire document in PDF format as a byte array
text: returns only the text of the document as a string. (Word only)
I am posting the PDF byte array to a WebApi action that looks like this:
public async Task<IHttpActionResult> Upload([FromBody]byte[] bytes)
{
File.WriteAllBytes(#"C:\temp\testpdf.pdf", bytes);
return Ok();
}
On inspection the byte array is the same array created by the getFileAsync from Office Js.
The problem is the file written in File.WriteAllBytes is corrupt. If I open it with notepad, it is a string of the bytes - 37,80,68,70,45,49,46,53,13,10,37... and so on.
Any idea why the method WriteAllBytes does not create a PDF file from the OfficeJS pdf byte stream?
UPDATE 25/5/16
As hawkeye #StefanHegny pointed out, the byte array appears to be Ascii characters. Converting each byte to char and writing that out to PDF like this creates a blank PDF, but on inspection with NotePad, the contents do like a like a PDF document, though quite different to that when saving the same .docx as a .pdf.
var content = "";
foreach (var b in model.Bytes)
{
content += (char) b;
}
File.WriteAllText(#"C:\temp\testpdf.pdf", content);
Also note, this is extremely slow - about 5 minutes for 500kb PDF byte array on my dev machine.

I had the same pdf empty problem, and it was because I was converting to string and writing string to file(encoding problem), I solved by sending to the c# code the comma separated byte codes instead of converting to string, parsing bytes and using File.WriteAllBytes()
C# code:
string[] strings = HttpUtility.HtmlDecode(pdf).Split(',');
byte[] bytes = strings.Select(s => byte.Parse(s)).ToArray();
System.IO.File.WriteAllBytes("filename.pdf", bytes);

Related

Retrieve byte array from c# webservice into Qt

I use webservice written in c# that exposes bytearray of audio file (mp3 file) that is stored in database (using entity framework).
When I retrieve it in c# and save it into file using File.WriteAllBytes() I can listen to audio (file's size is 10kB).
I need to do the same with Qt. I parse the xml and save audio byte array to QByteArray like this:
QByteArray bytes = readValue().toUtf8();
where readValue() is QStringRef and then I save it to file
qint64 bytesWritten = file.write(bytes);
File hes 14kB then and I suppose that there is some format problem but not sure where.

I solved this problem. In webservice I return byte array converter to hex string
string hex = BitConverter.ToString(ba);
string hexString = hex.Replace("-", "");
then in Qt I parse data
QByteArray bytes = QByteArray::fromHex(readValue().toString().toLatin1()));
and save it to file
qint64 bytesWritten = file.write(bytes);

How to convert byte array of word doc into byte array of pdf document in C#

I have a sequence of bytes as follows (Word format) exported from crystal report
Stream stream = cryRpt.ExportToStream(CrystalDecisions.Shared.ExportFormatType.WordForWindows);
I would like to convert that stream to a PDF stream to be stored in a varbinary column in sql server database.
There is a barcode in the document and i am unable to embed it when I export to pdf. Exporting to word works though so I want to try from Word to PDF
Can anyone assist me in getting this PDF byte[]?

You can do that using the Microsoft.Office.Interop.Word NuGet Package. Once you added it on your application you can flush your Byte Array to a temporary file, then open the temp file with Interop.Word, work with it, save the result and read the result back into a Byte Array.
Here's a sample code snippet that does just that:
// byte[] fileBytes = getFileBytesFromDB();
var tmpFile = Path.GetTempFileName();
File.WriteAllBytes(tmpFile, fileBytes);
Application app = new word.Application();
Document doc = app.Documents.Open(filePath);
// Save DOCX into a PDF
var pdfPath = "path-to-pdf-file.pdf";
doc.SaveAs2(pdfPath, word.WdSaveFormat.wdFormatPDF);
doc.Close();
app.Quit(); // VERY IMPORTANT: do this to close the MS Word instance
byte[] pdfFileBytes = File.ReadAllBytes(pdfPath);
File.Delete(tmpFile);
For more info regarding the topic you can also read this post on my blog.

Use Byte[] was converted from a string and sent through webservice

I convert a file to a byte array and then I transfer it using a web service. When I receive the byte array, I convert it back to a file.
It's works ok, but the problem is when I convert a string to byte, like this:
<b>My Content</b>
and transfer it through the web service, how can I get a file with an HTML format (like doc, docx...) like:
My Content
I can save it as .txt but it's not the result I want. Here is the code I used to save it:
public ActionResult Download(byte[] bytes)
{
return File(bytes, System.Net.Mime.MediaTypeNames.Application.Octet,"Filename");
}

PDF content inside XML response file

I receive XML file that includes PDF content:
<pdf>
<pdfContent>JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PCAvV.......
How can I save the content into PDF file?
I'm using C# 4.0

That string value is the PDF in base64. If you convert the base64 to a byte array you can just write that byte array to disk.
Convert.FromBase64String
var buffer = Convert.FromBase64String(xmlStringValue);
File.WriteAllBytes(yourFileName, buffer);

It looks like the pdf content is encoded in base64. You will have to decode it and save it to a file.
Edit: indeed, when I use base64 to encode a pdf file, the first few characters are JVBERi0x...

It seems encoded with Base64, But not sure. if it is, you can take that long string and convert with the function Convert.FromBase64. You will obtain a byte[] that you can save as the actual pdf.

How to detect if a file is PDF or TIFF?

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.
Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.
The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.
For example, lets say that there are 2 files I need to display, one is tiff and one is PDF. The page should show up with a tiff image, and perhaps a link that would open up the PDF in a new tab/window.
The problem:
As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.
http://support.microsoft.com/kb/326965
Is this problem easier than I think or is it as nasty as I am expecting?

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:
private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;
private bool IsTiff(Stream stm)
{
stm.Seek(0);
if (stm.Length < kMinimumTiffSize)
return false;
byte[] header = new byte[kHeaderSize];
stm.Read(header, 0, header.Length);
if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
return false;
bool isIntel = header[0] == kIntelMark;
ushort magicNumber = ReadShort(stm, isIntel);
if (magicNumber != kTiffMagicNumber)
return false;
return true;
}
private ushort ReadShort(Stream stm, bool isIntel)
{
byte[] b = new byte[2];
_stm.Read(b, 0, b.Length);
return ToShort(_isIntel, b[0], b[1]);
}
private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
if (isIntel)
{
return (ushort)(((int)b1 << 8) | (int)b0);
}
else
{
return (ushort)(((int)b0 << 8) | (int)b1);
}
}
I hacked apart some much more general code to get this.
For PDF, I have code that looks like this:
public bool IsPdf(Stream stm)
{
stm.Seek(0, SeekOrigin.Begin);
PdfToken token;
while ((token = GetToken(stm)) != null)
{
if (token.TokenType == MLPdfTokenType.Comment)
{
if (token.Text.StartsWith("%PDF-1."))
return true;
}
if (stm.Position > 1024)
break;
}
return false;
}
Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking at substring to avoid a problem like this:
% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage
this code is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.
I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.

TIFF can be detected by peeking at first bytes http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/
The first 8 bytes forms the header.
The first two bytes of which is either
"II" for little endian byte ordering
or "MM" for big endian byte ordering.
About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf
The header contains just one line that
identifies the version of PDF.
Example: %PDF-1.6

Reading the specification for each file format will tell you how to identify files of that format.
TIFF files - Check bytes 1 and 2 for 0x4D4D or 0x4949 and bytes 2-3 for the value '42'.
Page 13 of the spec reads:
A TIFF file begins with an 8-byte
image file header, containing the
following information: Bytes 0-1: The
byte order used within the file. Legal
values are: “II” (4949.H) “MM”
(4D4D.H) In the “II” format, byte
order is always from the least
significant byte to the most
significant byte, for both 16-bit and
32-bit integers This is called
little-endian byte order. In the “MM”
format, byte order is always from most
significant to least significant, for
both 16-bit and 32-bit integers. This
is called big-endian byte order. Bytes
2-3 An arbitrary but carefully chosen
number (42) that further identifies
the file as a TIFF file. The byte
order depends on the value of Bytes
0-1.
PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)
Section 7.5.2
The first line of a PDF file shall be
a header consisting of the 5
characters %PDF– followed by a version
number of the form 1.N, where N is a
digit between 0 and 7. A conforming
reader shall accept files with any of
the following headers: %PDF–1.0,
%PDF–1.1, %PDF–1.2, %PDF–1.3, %PDF–1.4,
%PDF–1.5, %PDF–1.6, %PDF–1.7 Beginning
with PDF 1.4, the Version entry in the
document’s catalog dictionary (located
via the Root entry in the file’s
trailer, as described in 7.5.5, "File
Trailer"), if present, shall be used
instead of the version specified in
the Header.
If a PDF file contains binary data, as
most do (see 7.2, "Lexical
Conventions"), the header line shall
be immediately followed by a comment
line containing at least four binary
characters—that is, characters whose
codes are 128 or greater. This ensures
proper behaviour of file transfer
applications that inspect data near
the beginning of a file to determine
whether to treat the file’s contents
as text or as binary.
Of course you could do a "deeper" check on each file by checking more file specific items.

A very useful list of File Signatures aka "magic numbers" by Gary Kessler is available http://www.garykessler.net/library/file_sigs.html

Internally, the file header information should help. if you do a low-level file open, such as StreamReader() or FOPEN(), look at the first two characters in the file... Almost every file type has its own signature.
PDF always starts with "%P" (but more specifically would have like %PDF)
TIFF appears to start with "II"
Bitmap files with "BM"
Executable files with "MZ"
I've had to deal with this in the past too... also to help prevent unwanted files from being uploaded to a given site and immediately aborting it once checked.
EDIT -- Posted sample code to read and test file header types
String fn = "Example.pdf";
StreamReader sr = new StreamReader( fn );
char[] buf = new char[5];
sr.Read( buf, 0, 4);
sr.Close();
String Hdr = buf[0].ToString()
+ buf[1].ToString()
+ buf[2].ToString()
+ buf[3].ToString()
+ buf[4].ToString();
String WhatType;
if (Hdr.StartsWith("%PDF"))
WhatType = "PDF";
else if (Hdr.StartsWith("MZ"))
WhatType = "EXE or DLL";
else if (Hdr.StartsWith("BM"))
WhatType = "BMP";
else if (Hdr.StartsWith("?_"))
WhatType = "HLP (help file)";
else if (Hdr.StartsWith("\0\0\1"))
WhatType = "Icon (.ico)";
else if (Hdr.StartsWith("\0\0\2"))
WhatType = "Cursor (.cur)";
else
WhatType = "Unknown";

If you go here, you will see that the TIFF usually starts with "magic numbers" 0x49 0x49 0x2A 0x00 (some other definitions are also given), which is the first 4 bytes of the file.
So just use these first 4 bytes to determine whether file is TIFF or not.
EDIT, it is probably better to do it the other way, and detect PDF first. The magic numbers for PDF are more standardized: As Plinth kindly pointed out they start with "%PDF" somewhere in the first 1024 bytes (0x25 0x50 0x44 0x46). source

You are going to have to write an ashx to get the file requested.
then, your handler should read the first few bytes (or so) to determine what the file type really is-- PDF and TIFF's have "magic numers" in the beginning of the file that you can use to determin this, then set your Response Headers accordingly.

you can use Myrmec to identify the file type, this library use the file byte head. this library avaliable on nuget "Myrmec",and this is the repo, myrmec also support mime type,you can try it. the code will like this :
// create a sniffer instance.
Sniffer sniffer = new Sniffer();
// populate with mata data.
sniffer.Populate(FileTypes.CommonFileTypes);
// get file head byte, may be 20 bytes enough.
byte[] fileHead = ReadFileHead();
// start match.
List<string> results = sniffer.Match(fileHead);
and get mime type :
List<string> result = sniffer.Match(head);
string mimeType = MimeTypes.GetMimeType(result.First());
but that support tiff only "49 49 2A 00" and "4D 4D 00 2A" two signature, if you have more you can add your self, may be you can see the readme file of myrmec for help. myrmec github repo

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.