Large PDFsharp (MigraDoc) PdfDocument to byte[]

Large PDFsharp (MigraDoc) PdfDocument to byte[] - c#

I have been attempting to save a large PdfDocument into a byte array using various means but always come back to an out of memory exception (file is 200 MB and 2.5K pages).
My initial attempt was to simply use MemoryStream
public static byte[] ProcessLargePdfDocument(PdfDocument pdfDocument)
{
using (MemoryStream stream = new MemoryStream())
{
pdfDocument.Save(stream, true);
return stream.ToArray();
}
}
Then I tried adding in some buffering
public static byte[] ProcessLargePdfDocument(PdfDocument pdfDocument, long whereToStartReading = 0)
{
List<byte> byteList = new List<byte>();
using (MemoryStream stream = new MemoryStream())
{
pdfDocument.Save(stream, false);
byte[] buffer = new byte[megabyte];
stream.Seek(whereToStartReading, SeekOrigin.Begin);
int bytesRead = stream.Read(buffer, 0, megabyte);
while (bytesRead > 0)
{
byteList.AddRange(buffer);
bytesRead = stream.Read(buffer, 0, megabyte);
}
}
return byteList.ToArray();
}
No matter what I try I get an out of memory exception on the pdfDocument.Save call. I am able to write it to a file location and to read it back using a buffered FileStream in dev but I'm not able to do this on the production environment due to permissions (yet).

Two tips:
Make sure your process runs as a 64-bit process to allow it to use more than 2 GiB of RAM.
stream.ToArray() creates a copy, stream.GetBuffer() lets you access the internal buffer of the MemoryStream. If the exception occurs after the Save() this may make a difference.

Related

Convert Stream to byte[] c# for large files of 2GB

Trying to convert Stream object to byte[] and using the below method for the same:
public static byte[] ReadFully(System.IO.Stream input)
{
byte[] buffer = new byte[16*1024];
using (System.IO.MemoryStream ms = new System.IO.MemoryStream())
{
int read;
while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
{
ms.Write(buffer, 0, read);
}
return ms.ToArray();
}
}
However the input parameter "input" is for large file that is of 2 GB and hence the code does not get enter into while loop and hence does not convert it to byte array.
For smaller files it is working fine

That's what a Stream is for.
You don't load the whole content into a byte[], you read a small buffer from the Stream into memory and handle it, then dispose and read the next buffer.
If you still need to use a byte[]:
It seems like your app can't handle more than 2^32 Bytes Memory, meaning it's 32bit.
Try changing it to 64bit (in Project Properties go to Build and disable Prefer 32 bit)

Object limited below 32bit. (that is why all index using int)
how about use list contains byte array to deal entire data?
public List<byte[]> ReadBytesList(string fileName)
{
List<byte[]> rawDataBytes= new List<byte[]>();
byte[] buff;
FileStream fs = new FileStream(fileName,
FileMode.Open,
FileAccess.Read);
BinaryReader br = new BinaryReader(fs);
long numBytes = new FileInfo(fileName).Length;
int arrayCount= (int)(numBytes / 2100000000); //2147483648 is max
int arrayRest = (int)(numBytes % 2100000000);
if(arrayCount>0)
{
for (int i = 0; i < arrayCount; i++)
{
buff = br.ReadBytes(2100000000);
rawDataBytes.Add(buff);
}
buff = br.ReadBytes(arrayRest);
rawDataBytes.Add(buff);
}
else
{
buff = br.ReadBytes(arrayRest);
rawDataBytes.Add(buff);
}
return rawDataBytes;
}

SharpZipLib not compressing memory stream

I have a memory stream that I want to compress:
public static MemoryStream ZipChunk(MemoryStream unZippedChunk) {
MemoryStream zippedChunk = new MemoryStream();
ZipOutputStream zipOutputStream = new ZipOutputStream(zippedChunk);
zipOutputStream.SetLevel(3);
ZipEntry entry = new ZipEntry("name");
zipOutputStream.PutNextEntry(entry);
Utils.StreamCopy(unZippedChunk, zippedChunk, new byte[4096]);
zipOutputStream.CloseEntry();
zipOutputStream.IsStreamOwner = false;
zipOutputStream.Close();
zippedChunk.Close();
return zippedChunk;
}
public static void StreamCopy(Stream source, Stream destination, byte[] buffer, bool bFlush = true) {
bool flag = true;
while (flag) {
int num = source.Read(buffer, 0, buffer.Length);
if (num > 0) {
destination.Write(buffer, 0, num);
}
else {
if (bFlush) {
destination.Flush();
}
flag = false;
}
}
}
It's supposed to be quite simple. You provide it with a stream you want to compress. The methods compresses the stream and returns it. Great.
However, I don't get compressed stream back. What I get are streams that have about 20ish bytes added at the beginning and end, which seem to have something to do with the zip library. But the data in the middle is completely uncompressed (ranges of 256 bytes that have same value, etc). I tried upping the level to 9, but nothing changed.
Why aren't my streams compressing?

You yourself copy original stream right into output stream via:
Utils.StreamCopy(unZippedChunk, zippedChunk, new byte[4096]);
You should copy to zipOutputStream instead:
StreamCopy(unZippedChunk, zipOutputStream, new byte[4096]);
Side note: instead of using custom copy stream methods - use a default one:
unZippedChunk.CopyTo(zipOutputStream);

Convert a VERY LARGE binary file into a Base64String incrementally

I need help converting a VERY LARGE binary file (ZIP file) to a Base64String and back again. The files are too large to be loaded into memory all at once (they throw OutOfMemoryExceptions) otherwise this would be a simple task. I do not want to process the contents of the ZIP file individually, I want to process the entire ZIP file.
The problem:
I can convert the entire ZIP file (test sizes vary from 1 MB to 800 MB at present) to Base64String, but when I convert it back, it is corrupted. The new ZIP file is the correct size, it is recognized as a ZIP file by Windows and WinRAR/7-Zip, etc., and I can even look inside the ZIP file and see the contents with the correct sizes/properties, but when I attempt to extract from the ZIP file, I get: "Error: 0x80004005" which is a general error code.
I am not sure where or why the corruption is happening. I have done some investigating, and I have noticed the following:
If you have a large text file, you can convert it to Base64String incrementally without issue. If calling Convert.ToBase64String on the entire file yielded: "abcdefghijklmnopqrstuvwx", then calling it on the file in two pieces would yield: "abcdefghijkl" and "mnopqrstuvwx".
Unfortunately, if the file is a binary then the result is different. While the entire file might yield: "abcdefghijklmnopqrstuvwx", trying to process this in two pieces would yield something like: "oiweh87yakgb" and "kyckshfguywp".
Is there a way to incrementally base 64 encode a binary file while avoiding this corruption?
My code:
private void ConvertLargeFile()
{
FileStream inputStream = new FileStream("C:\\Users\\test\\Desktop\\my.zip", FileMode.Open, FileAccess.Read);
byte[] buffer = new byte[MultipleOfThree];
int bytesRead = inputStream.Read(buffer, 0, buffer.Length);
while(bytesRead > 0)
{
byte[] secondaryBuffer = new byte[buffer.Length];
int secondaryBufferBytesRead = bytesRead;
Array.Copy(buffer, secondaryBuffer, buffer.Length);
bool isFinalChunk = false;
Array.Clear(buffer, 0, buffer.Length);
bytesRead = inputStream.Read(buffer, 0, buffer.Length);
if(bytesRead == 0)
{
isFinalChunk = true;
buffer = new byte[secondaryBufferBytesRead];
Array.Copy(secondaryBuffer, buffer, buffer.length);
}
String base64String = Convert.ToBase64String(isFinalChunk ? buffer : secondaryBuffer);
File.AppendAllText("C:\\Users\\test\\Desktop\\Base64Zip", base64String);
}
inputStream.Dispose();
}
The decoding is more of the same. I use the size of the base64String variable above (which varies depending on the original buffer size that I test with), as the buffer size for decoding. Then, instead of Convert.ToBase64String(), I call Convert.FromBase64String() and write to a different file name/path.
EDIT:
In my haste to reduce the code (I refactored it into a new project, separate from other processing to eliminate code that isn't central to the issue) I introduced a bug. The base 64 conversion should be performed on the secondaryBuffer for all iterations save the last (Identified by isFinalChunk), when buffer should be used. I have corrected the code above.
EDIT #2:
Thank you all for your comments/feedback. After correcting the bug (see the above edit), I re-tested my code, and it is actually working now. I intend to test and implement #rene's solution as it appears to be the best, but I thought that I should let everyone know of my discovery as well.

Based on the code shown in the blog from Wiktor Zychla the following code works. This same solution is indicated in the remarks section of Convert.ToBase64String as pointed out by Ivan Stoev
// using System.Security.Cryptography
private void ConvertLargeFile()
{
//encode
var filein= #"C:\Users\test\Desktop\my.zip";
var fileout = #"C:\Users\test\Desktop\Base64Zip";
using (FileStream fs = File.Open(fileout, FileMode.Create))
using (var cs=new CryptoStream(fs, new ToBase64Transform(),
CryptoStreamMode.Write))
using(var fi =File.Open(filein, FileMode.Open))
{
fi.CopyTo(cs);
}
// the zip file is now stored in base64zip
// and decode
using (FileStream f64 = File.Open(fileout, FileMode.Open) )
using (var cs=new CryptoStream(f64, new FromBase64Transform(),
CryptoStreamMode.Read ) )
using(var fo =File.Open(filein +".orig", FileMode.Create))
{
cs.CopyTo(fo);
}
// the original file is in my.zip.orig
// use the commandlinetool
// fc my.zip my.zip.orig
// to verify that the start file and the encoded and decoded file
// are the same
}
The code uses standard classes found in System.Security.Cryptography namespace and uses a CryptoStream and the FromBase64Transform and its counterpart ToBase64Transform

You can avoid using a secondary buffer by passing offset and length to Convert.ToBase64String, like this:
private void ConvertLargeFile()
{
using (var inputStream = new FileStream("C:\\Users\\test\\Desktop\\my.zip", FileMode.Open, FileAccess.Read))
{
byte[] buffer = new byte[MultipleOfThree];
int bytesRead = inputStream.Read(buffer, 0, buffer.Length);
while(bytesRead > 0)
{
String base64String = Convert.ToBase64String(buffer, 0, bytesRead);
File.AppendAllText("C:\\Users\\test\\Desktop\\Base64Zip", base64String);
bytesRead = inputStream.Read(buffer, 0, buffer.Length);
}
}
}
The above should work, but I think Rene's answer is actually the better solution.

Use this code:
public void ConvertLargeFile(string source , string destination)
{
using (FileStream inputStream = new FileStream(source, FileMode.Open, FileAccess.Read))
{
int buffer_size = 30000; //or any multiple of 3
byte[] buffer = new byte[buffer_size];
int bytesRead = inputStream.Read(buffer, 0, buffer.Length);
while (bytesRead > 0)
{
byte[] buffer2 = buffer;
if(bytesRead < buffer_size)
{
buffer2 = new byte[bytesRead];
Buffer.BlockCopy(buffer, 0, buffer2, 0, bytesRead);
}
string base64String = System.Convert.ToBase64String(buffer2);
File.AppendAllText(destination, base64String);
bytesRead = inputStream.Read(buffer, 0, buffer.Length);
}
}
}

how to buffer an input stream until it is complete

i'm implementing a wcf service that accepts image streams. however i'm currently getting an exception when i run it. as its trying to get the length of the stream before the stream is complete. so what i'd like to do is buffer the stream until its complete. however i cant find any examples of how to do this...
can anyone help?
my code so far:
public String uploadUserImage(Stream stream)
{
Stream fs = stream;
BinaryReader br = new BinaryReader(fs);
Byte[] bytes = br.ReadBytes((Int32)fs.Length);// this causes exception
File.WriteAllBytes(filepath, bytes);
}

Rather than try to fetch the length, you should read from the stream until it returns that it's "done". In .NET 4, this is really easy:
// Assuming we *really* want to read it into memory first...
MemoryStream memoryStream = new MemoryStream();
stream.CopyTo(memoryStream);
memoryStream.Position = 0;
File.WriteAllBytes(filepath, memoryStream);
In .NET 3.5 there's no CopyTo method, but you can write something similar yourself:
public static void CopyStream(Stream input, Stream output)
{
byte[] buffer = new byte[8192];
int bytesRead;
while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
{
output.Write(buffer, 0, bytesRead);
}
}
However, now we've got something to copy a stream, why bother reading it all into memory first? Let's just write it straight to a file:
using (FileStream output = File.OpenWrite(filepath))
{
CopyStream(stream, output); // Or stream.CopyTo(output);
}

I'm not sure what you are returning (or not returning), but something like this might work for you:
public String uploadUserImage(Stream stream) {
const int KB = 1024;
Byte[] bytes = new Byte[KB];
StringBuilder sb = new StringBuilder();
using (BinaryReader br = new BinaryReader(stream)) {
int len;
do {
len = br.Read(bytes, 0, KB);
string readData = Encoding.UTF8.GetString(bytes);
sb.Append(readData);
} while (len == KB);
}
//File.WriteAllBytes(filepath, bytes);
return sb.ToString();
}
A string can hold up to 2 GB, I believe.

Try this :
using (StreamWriter sw = File.CreateText(filepath))
{
stream.CopyTo(sw);
sw.Close();
}

Jon Skeets answer for .Net 3.5 and below using a Buffer Read is actually done incorrectly.
The buffer isn't cleared between reads which can result in issues on any read that returns less than 8192, for example if the 2nd read, read 192 bytes, the 8000 last bytes from the first read would STILL be in the buffer which would then be returned to the stream.
My code below you supply it a Stream and it will return a IEnumerable array.
Using this you can for-each it and Write to a MemoryStream and then use .GetBuffer() to end up with a compiled merged byte[].
private IEnumerable<byte[]> ReadFullStream(Stream stream) {
while(true) {
byte[] buffer = new byte[8192];//since this is created every loop, its buffer is cleared
int bytesRead = stream.Read(buffer, 0, buffer.Length);//read up to 8192 bytes into buffer
if (bytesRead == 0) {//if we read nothing, stream is finished
break;
}
if(bytesRead < buffer.Length) {//if we read LESS than 8192 bytes, resize the buffer to essentially remove everything after what was read, otherwise you will have nullbytes/0x00bytes at the end of your buffer
Array.Resize(ref buffer, bytesRead);
}
yield return buffer;//yield return the buffer data
}//loop here until we reach a read == 0 (end of stream)
}

Copy between two streams in .net 2.0

I have been using the following code to Compress data in .Net 4.0:
public static byte[] CompressData(byte[] data_toCompress)
{
using (MemoryStream outFile = new MemoryStream())
{
using (MemoryStream inFile = new MemoryStream(data_toCompress))
using (GZipStream Compress = new GZipStream(outFile, CompressionMode.Compress))
{
inFile.CopyTo(Compress);
}
return outFile.ToArray();
}
}
However, in .Net 2.0 Stream.CopyTo method is not available. So, I tried making a replacement:
public static byte[] CompressData(byte[] data_toCompress)
{
using (MemoryStream outFile = new MemoryStream())
{
using (MemoryStream inFile = new MemoryStream(data_toCompress))
using (GZipStream Compress = new GZipStream(outFile, CompressionMode.Compress))
{
//inFile.CopyTo(Compress);
Compress.Write(inFile.GetBuffer(), (int)inFile.Position, (int)(inFile.Length - inFile.Position));
}
return outFile.ToArray();
}
}
The compression fails, though, when using the above attempt - I get an error saying:
MemoryStream's internal buffer cannot be accessed.
Could anyone offer any help on this issue? I'm really not sure what else to do here.
Thank you,
Evan

This is the code straight out of .Net 4.0 Stream.CopyTo method (bufferSize is 4096):
byte[] buffer = new byte[bufferSize];
int count;
while ((count = this.Read(buffer, 0, buffer.Length)) != 0)
destination.Write(buffer, 0, count);

Since you have access to the array already, why don't you do this:
using (MemoryStream outFile = new MemoryStream())
{
using (GZipStream Compress = new GZipStream(outFile, CompressionMode.Compress))
{
Compress.Write(data_toCompress, 0, data_toCompress.Length);
}
return outFile.ToArray();
}
Most likely in the sample code you are using inFile.GetBuffer() will throw an exception since you do not use the right constructor - not all MemoryStream instances allow you access to the internal buffer - you have to look for this in the documentation:
Initializes a new instance of the MemoryStream class based on the
specified region of a byte array, with the CanWrite property set as
specified, and the ability to call GetBuffer set as specified.
This should work - but is not needed anyway in the suggested solution:
using (MemoryStream inFile = new MemoryStream(data_toCompress,
0,
data_toCompress.Length,
false,
true))

Why are you constructing a memory stream with an array and then trying to pull the array back out of the memory stream?
You could just do Compress.Write(data_toCompress, 0, data_toCompress.Length);
If you need to replace the functionality of CopyTo, you can create a buffer array of some length, read data from the source stream and write that data to the destination stream.

You can try
infile.WriteTo(Compress);

try to replace the line:
Compress.Write(inFile.GetBuffer(), (int)inFile.Position, (int)(inFile.Length - inFile.Position));
with:
Compress.Write(data_toCompress, 0, data_toCompress.Length);
you can get rid of this line completely:
using (MemoryStream inFile = new MemoryStream(data_toCompress))
Edit: find an example here: Why does gzip/deflate compressing a small file result in many trailing zeroes?

You should manually read and write between these 2 streams:
private static void CopyStream(Stream from, Stream to)
{
int bufSize = 1024, count;
byte[] buffer = new byte[bufSize];
count = from.Read(buffer, 0, bufSize);
while (count > 0)
{
to.Write(buffer, 0, count);
count = from.Read(buffer, 0, bufSize);
}
}

The open-source NuGet package Stream.CopyTo implements Stream.CopyTo for all versions of the .NET Framework.
Available on GitHub and via NuGet (Install-Package Stream.CopyTo)

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Large PDFsharp (MigraDoc) PdfDocument to byte[] - c#

Two tips: Make sure your process runs as a 64-bit process to allow it to use more than 2 GiB of RAM. stream.ToArray() creates a copy, stream.GetBuffer() lets you access the internal buffer of the MemoryStream. If the exception occurs after the Save() this may make a difference.

Related

Convert Stream to byte[] c# for large files of 2GB

SharpZipLib not compressing memory stream

Convert a VERY LARGE binary file into a Base64String incrementally

how to buffer an input stream until it is complete

Copy between two streams in .net 2.0

Categories

Resources