Convert ByteArray of Office document to ByteArray of PDF in C#

How can I convert a byte[] of an Office document (.doc, .docx, .xlsx, .pptx) to a byte[] of a PDF document assuming Office is installed and Microsoft.Office.Interop is used?
I fetch each file's byte array and its name from the database.
I would like to first convert each file to a PDF and then combine all of the PDFs to one single PDF using PDFSharp (this part is already implemented).
Code:
foreach (Entity en in res.Entities)
{
    byte[] fileByteArray = Convert.FromBase64String(en.GetAttributeValue<string>("documentbody"));
    string fileName = en.GetAttributeValue<string>("filename");
    // Path.GetExtension copes with names that contain more than one dot.
    string extension = Path.GetExtension(fileName).TrimStart('.').ToLowerInvariant();
    switch (extension)
    {
        case "doc":
        case "docx":
            byteArr.Add(ConvertWordToPdf(fileName, fileByteArray));
            break;
        case "xlsx":
            byteArr.Add(ConvertExcelToPdf(fileName, fileByteArray));
            break;
    }
}
The problem is I'm not too sure how to implement these two methods.
I tried using the following code:
private byte[] ConvertWordToPdf(string fileName, byte[] fileByteArray)
{
    string tmpFile = Path.GetTempFileName();
    File.WriteAllBytes(tmpFile, fileByteArray);
    Microsoft.Office.Interop.Word.Application app = new Microsoft.Office.Interop.Word.Application();
    Document doc = app.Documents.Open(tmpFile);
    // Save Word doc into a PDF
    string pdfPath = fileName.Split('.')[0] + ".pdf";
    doc.SaveAs2(pdfPath, Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatPDF);
    doc.Close();
    app.Quit();
    byte[] pdfFileBytes = File.ReadAllBytes(pdfPath);
    File.Delete(tmpFile);
    return pdfFileBytes;
}
But it saves the file to disk and that's something I would like to avoid. Is doing the same operation without saving to disk possible?

If you check the documentation for Documents.Open, there is no mention of opening a document directly from a stream. This is unfortunately an all too common limitation in libraries. There may be other libraries that do allow it.
I would not expect saving to a file to be a major performance issue, since the conversion itself will probably be the dominating factor. It might cause permission issues, though, if your program runs in a very restrictive environment.
If you keep the file-based approach, add exception handling to ensure the temporary files are deleted even if an exception occurs. I have also seen external programs release their file locks only after some time, so it can be useful to retry the delete a few times.
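A minimal sketch of that cleanup pattern, using only System.IO so it does not depend on Office being installed. The names TempFileConversion, ConvertViaTempFiles and DeleteWithRetry are hypothetical, and the retry count and delay are arbitrary assumptions rather than recommended values:

```csharp
using System;
using System.IO;
using System.Threading;

public static class TempFileConversion
{
    // Writes the input bytes to a temp file, runs a file-based converter,
    // reads the result back, and removes both temp files in all cases.
    public static byte[] ConvertViaTempFiles(byte[] input, string inputExtension, Action<string, string> convert)
    {
        string baseName = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        string inputPath = baseName + inputExtension; // e.g. ".docx"
        string outputPath = baseName + ".pdf";
        try
        {
            File.WriteAllBytes(inputPath, input);
            convert(inputPath, outputPath);
            return File.ReadAllBytes(outputPath);
        }
        finally
        {
            DeleteWithRetry(inputPath);
            DeleteWithRetry(outputPath);
        }
    }

    // Best-effort delete: retries while another process still holds a lock,
    // then gives up silently so cleanup never hides the original exception.
    private static void DeleteWithRetry(string path, int attempts = 5, int delayMs = 200)
    {
        for (int i = 0; i < attempts; i++)
        {
            try
            {
                File.Delete(path); // does nothing if the file does not exist
                return;
            }
            catch (IOException) { Thread.Sleep(delayMs); }
            catch (UnauthorizedAccessException) { Thread.Sleep(delayMs); }
        }
    }
}
```

With Word interop, the convert delegate would be the place to call app.Documents.Open(inputPath) and doc.SaveAs2(outputPath, WdSaveFormat.wdFormatPDF), followed by doc.Close() and app.Quit(), as in the question's code.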

Related

Writing and Reading CustomValues takes too long in PdfSharp

I have a C# method that writes a custom value to a given PDF file, using PdfSharp 1.50.5147.
The problem is that PdfReader.Open takes too long for the PDF below:
https://www.mouser.com.tr/catalog/English/103/dload/pdf/mouser.pdf
public bool WritePropertyToFile(string filePath, string extension, string key, string value)
{
    try
    {
        document = PdfReader.Open(filePath); // This call takes 2.5 minutes!
        var properties = document.CustomValues.Elements;
        properties.SetString("/" + key, value);
        document.Save(filePath);
        document = null;
        return true;
    }
    catch (Exception)
    {
        if (document != null)
            document = null;
        throw;
    }
}
My requirement is to write and read custom values in milliseconds for a given file. Although custom values can be written and read in milliseconds for many PDF files, some files, such as this one, cause problems.
Do I need to open whole document for writing or reading a custom value? Is there a different technique for this? Do you have suggestion for this problem?

Currently, PdfSharp has no way to open large PDFs like this one quickly, because it first loads the entire PDF into memory. The PDF you're trying to open is a whopping 168 MB.
You could extend PdfSharp to load the trailer first and then read each block of content according to the trailer entries.
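As a rough illustration of the first step only: a PDF ends with a startxref keyword followed by the byte offset of the cross-reference section, so that offset can be found by reading just the tail of the file. The sketch below is not part of PdfSharp; PdfTail and FindStartXref are hypothetical names, and the 1024-byte tail size is an assumption:

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfTail
{
    // Returns the offset that follows the last "startxref" keyword,
    // or -1 if the keyword is not found in the final tailSize bytes.
    public static long FindStartXref(Stream pdf, int tailSize = 1024)
    {
        long start = Math.Max(0, pdf.Length - tailSize);
        pdf.Seek(start, SeekOrigin.Begin);
        byte[] tail = new byte[(int)(pdf.Length - start)];
        int read = 0;
        while (read < tail.Length)
        {
            int n = pdf.Read(tail, read, tail.Length - read);
            if (n == 0) break;
            read += n;
        }

        // ASCII decoding yields one char per byte, which is enough to find the keyword.
        string text = Encoding.ASCII.GetString(tail, 0, read);
        int idx = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (idx < 0) return -1;

        string rest = text.Substring(idx + "startxref".Length).TrimStart();
        int end = 0;
        while (end < rest.Length && char.IsDigit(rest[end])) end++;
        return end > 0 ? long.Parse(rest.Substring(0, end)) : -1;
    }
}
```

Opening the file with File.OpenRead and passing the stream reads only the last kilobyte from disk. Parsing the cross-reference data at that offset and locating the objects that hold the custom values is the substantial part of the work and is not shown here.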

C# generated PDF files doesn't open in PDF readers. Error shows damaged or corrupt file

I am trying to save image data as a PDF file in a server directory. The HTML5 canvas image data is sent to the server and, after conversion to a byte array, saved as a PDF file. The file is generated successfully at the specified path, but it doesn't open in most PDF readers (e.g. Adobe Reader, Foxit Reader), which report that the file is damaged or corrupt; it does open correctly in the MS Edge browser. I want it to open in common PDF readers too. Can you please suggest a solution? Here is my server-side code.
public static string SaveImage(string imageData, string userEmail, int quantity)
{
    string completePath = @"~\user-images\";
    string imageName = "sample_file2.pdf";
    string fileNameWitPath = completePath + imageName;
    byte[] bytes = Convert.FromBase64String(imageData);
    File.WriteAllBytes(HttpContext.Current.Server.MapPath(fileNameWitPath), bytes);
}
Same output generated for this code
FileStream fs = new FileStream(HttpContext.Current.Server.MapPath(fileNameWitPath), FileMode.OpenOrCreate);
fs.Write(bytes, 0, bytes.Length);
fs.Close();
and for this too.
using (FileStream fs = new FileStream(HttpContext.Current.Server.MapPath(fileNameWitPath), FileMode.Create))
{
    using (BinaryWriter bw = new BinaryWriter(fs))
    {
        byte[] data = Convert.FromBase64String(imageData);
        bw.Write(data);
        bw.Close();
    }
}

If you just save a raster image format file (like PNG or JPG one) with a .PDF file extension it doesn't make it a PDF file; it still remains an image file just with another extension. So it probably works in some browsers because they may do file format detection that is not based on extension alone.
To generate an actual PDF file you will need to employ some conversion. Consider one of the following libraries for this:
iTextSharp: https://sourceforge.net/projects/itextsharp/ (AGPL, a copyleft license: commercial use is allowed, but an application built on it must also be released under the AGPL unless you purchase a commercial license)
PDFSharp: http://www.pdfsharp.net/ (free, MIT license)
ABCpdf.NET: http://www.websupergoo.com/products.htm#abcpdf (proprietary)
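To confirm what the decoded bytes really are before choosing a conversion route, the first few bytes can be compared against well-known file signatures. This is a small sketch with hypothetical names (FileSniffer, DetectFormat); it only covers PDF, PNG and JPEG:

```csharp
using System;
using System.Text;

public static class FileSniffer
{
    private static readonly byte[] PngSignature = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
    private static readonly byte[] JpegSignature = { 0xFF, 0xD8, 0xFF };
    private static readonly byte[] PdfSignature = Encoding.ASCII.GetBytes("%PDF-");

    // Returns "pdf", "png", "jpeg" or "unknown" based on the leading bytes.
    public static string DetectFormat(byte[] data)
    {
        if (StartsWith(data, PdfSignature)) return "pdf";
        if (StartsWith(data, PngSignature)) return "png";
        if (StartsWith(data, JpegSignature)) return "jpeg";
        return "unknown";
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If FileSniffer.DetectFormat(Convert.FromBase64String(imageData)) reports "png" or "jpeg", the bytes are an image, and writing them to a .pdf path cannot produce a file that PDF readers accept.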

Programmatically save open document in MS Word Add-In

I'm trying to create an add-in in C# for MS Word 2010 that will add a new ribbon and a click event-handler. This click event-handler should save the active file in c:\temp, for example. And then I need to load the file content into a byte array.
Probably something like this:
public void ClickEventHandler(Office.IRibbonControl control)
{
    // Verbatim string, so that \t is not interpreted as a tab character.
    string fileLocation = @"c:\temp\test.docx";
    Word.Document document = this.Document;
    document.SaveAs(fileLocation);
    byte[] byteArray = File.ReadAllBytes(fileLocation);
}
The point is, this is pseudo-code and I don't know how to load the active document into a byte array. If there is a way to do it without saving the document, that would be even better.
A way to check whether the active file is a .docx (and not a .doc) would be nice as well.

Word.Document document = Globals.ThisAddIn.Application.ActiveDocument;
document.SaveAs2(goldenpath + "\\" + name + "." + id + ".docx");
document.Close();

I use this generic function in my program to serialize arbitrary objects to a byte array:
private byte[] MakeByteSize<U>(U obj)
{
    if (obj == null) return null;
    var bf = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
    var ms = new System.IO.MemoryStream();
    bf.Serialize(ms, obj);
    return ms.ToArray();
}
Edit:
After reading your additional content, I'm confident that serializing the Word.Document object won't get you what you need, since the byte array representing that object in the program (which is probably a wrapper around some COM interop) won't be the same as the byte array representing the document as stored in the file.
Looking at the MSDN article you referenced, it looks like what we really need is a WordprocessingDocument instance representing the document that we can pass to the HtmlConverter class. So I think the question you really want to ask is "How can I create a DocumentFormat.OpenXml.Packaging.WordprocessingDocument from an open document without saving the file first?"
Unfortunately, I'm not sure that's possible, since I can't spot any methods on that class that would do it.
On the .doc vs .docx issue, the Open XML SDK for Microsoft Office says that it works with documents that adhere to the "Office Open XML File Formats Specification", which I believe means it will only work with the .docx file format. You might have to try a different route for this, like exporting to PDF perhaps. Good luck!

Replacing strings stored in a byte array that represent Word/Excel document

I'm storing Word and Excel documents inside a SQL Server database table. These documents are pulled from the database with my C# application and are put into byte[] arrays.
I want to replace certain strings found in the Word/Excel documents. What is the best way to do this with the byte array available?
I was looking at something like this:
string fileString = System.Text.Encoding.UTF8.GetString(image.ImageObject);
fileString = fileString.Replace("FROM", "TO");
byte[] newImageObject = System.Text.Encoding.UTF8.GetBytes(fileString);

I believe you will have to save the bytes as a Word/Excel file and use office automation tools to make the changes.
If you go changing bytes willy-nilly in binary files, you could mess up offsets, checksums, CRC checks, trigger anti-virus software, etc.

I would recommend you using the Open XML SDK.
With the library, you can do the following to replace text from a Word document, considering that documentByteArray is your document byte content taken from database:
// Requires: using System.IO; using System.Text.RegularExpressions;
// using DocumentFormat.OpenXml.Packaging;
using (MemoryStream mem = new MemoryStream())
{
    mem.Write(documentByteArray, 0, documentByteArray.Length);
    // Open the package directly from the in-memory stream, in editable mode.
    using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(mem, true))
    {
        string docText = null;
        using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        {
            docText = sr.ReadToEnd();
        }
        Regex regexText = new Regex("Hello world!");
        docText = regexText.Replace(docText, "Hi Everyone!");
        using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            sw.Write(docText);
        }
    }
    // Once the package is disposed, mem.ToArray() yields the modified document bytes.
}
The example above was taken from here. You can do similarly with Excel spreadsheets.

Your approach is likely to fail.
If you are talking about .doc and .xls, these file formats are binary, making it most likely that the byte stream contains byte sequences that are not valid UTF-8.
Even if that's not the case, replacing strings of different lengths will make offsets and length fields invalid, thus causing the documents to fail when opening them.
If, on the other hand, you are talking about .docx and .xlsx, these files are in fact zipped XML files, which again cannot simply be searched and replaced: just consider what happens when the search string matches an XML element or attribute name (or a part thereof). Again, the replace operation cannot operate on the whole file.
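The ZIP structure can be seen directly with System.IO.Compression. The following sketch reads a single part out of a package held in a byte array; OoxmlPeek and ReadPart are hypothetical names, and "word/document.xml" is the usual location of the main part in a .docx rather than something the code verifies:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class OoxmlPeek
{
    // Returns the text of one part (e.g. "word/document.xml") from a
    // .docx/.xlsx package held in memory, or null if the part is absent.
    public static string ReadPart(byte[] package, string partName)
    {
        using (var ms = new MemoryStream(package))
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = zip.GetEntry(partName);
            if (entry == null) return null;
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Inspecting the returned XML shows why a blind replace is risky: the visible text sits between element and attribute names, and a single phrase may be split across several elements.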

Unable to read a byte array (created from a .docx file) into a Doc object using ABCPDF

I am retrieving a .docx file as a byte array. I then try to call the Doc's Read() function with that byte array as the data parameter, but I get an unrecognized file extension error.
I retrieve the byte array with the following (c#) code:
WebClient testWc = new WebClient();
testWc.Credentials = CredentialCache.DefaultCredentials;
byte[] data = testWc.DownloadData("http://localhost/Lists/incidents/Attachments/1/Testdocnospaces.docx");
If at this point I output the byte array as a .docx file, my program correctly lets me open or save the file. For this reason, I believe the byte array has been retrieved correctly. Here is a sample of what I mean by outputting a .docx file:
Response.ClearHeaders();
Response.Clear();
Response.AppendHeader("Content-Disposition", "attachment;Filename=test.docx");
Response.BinaryWrite(data);
Response.Flush();
Response.End();
However, if I try to read the byte array into a Doc like so:
Doc doc = new Doc();
XReadOptions xr = new XReadOptions();
xr.ReadModule = ReadModuleType.MSOffice;
doc.Read(data, xr);
My program will error out at the last line of said code, throwing the following: “FileExtension '' was invalid for ReadModuleType.MSOffice.”
The Doc.Read() function seems to be finding an empty string where it would typically be finding the file type.
Also, I do have Office 2007 installed on this machine.

If you know the file extension of your file bytes (which you should) you can solve your problem by:
Doc doc = new Doc();
string extension = Path.GetExtension("your file name/path").Substring(1).ToUpper();
XReadOptions opts = new XReadOptions();
opts.FileExtension = extension;
doc.Read(fileBytes, opts);
This approach worked for me. When you provide correct file extension you won't need to set ReadModule property of your XReadOptions object. ToUpper() is not mandatory.
