I'm trying to write a script to merge ~10,000 PDFs into a single file using iText 7 and C#.
My test files are ~5 MB each, and at around the 270-file mark I start getting System.OutOfMemoryException errors, even though I can see from Task Manager that I'm using less than 25% of the available memory.
Here's the code:
string sourceFolder = @"C:\Work\Generated5\";
string outputPath = @"C:\Work\MergeTest.pdf";
int i = 0;
string[] files = Directory.GetFiles(sourceFolder, "*.pdf");
if (files.Length > 0)
{
    Array.Sort(files);
    using (PdfDocument pdf = new PdfDocument(new PdfWriter(outputPath)))
    {
        foreach (var file in files)
        {
            try
            {
                using (var reader = new PdfReader(file))
                {
                    using (PdfDocument sourceDoc = new PdfDocument(reader))
                    {
                        sourceDoc.CopyPagesTo(1, sourceDoc.GetNumberOfPages(), pdf);
                    }
                    reader.Close();
                }
            }
            catch (Exception e)
            {
                e.Message.Dump(file);
            }
            i++;
            if (i % 200 == 0)
            {
                // desperate attempt to free some memory - doesn't really help
                GC.Collect(3);
            }
        }
    }
}
I've found many examples online and here on Stack Overflow for doing this type of thing. However, the documentation and previous answers I've found are out of date, written for iTextSharp or iText 5, and the classes they use don't seem to be supported anymore in iText 7. The only iText 7 example I've been able to find is "How NOT to merge.."
Some stuff I've tried:
- Enabling smart mode on the PdfWriter
- Enabling compression on the PdfWriter (got to about 1,600 PDFs before throwing exceptions)
- Using PdfMerger instead of PdfDocument.CopyPagesTo (rough sketch below)
- Forcing some garbage collection every X documents
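For reference, the PdfMerger variant looked roughly like this (a minimal sketch only, assuming iText.Kernel.Utils.PdfMerger and reusing the outputPath and files variables from the snippet above):

using iText.Kernel.Pdf;
using iText.Kernel.Utils;

// Merge each source PDF into the single target document via PdfMerger.
using (PdfDocument target = new PdfDocument(new PdfWriter(outputPath)))
{
    PdfMerger merger = new PdfMerger(target);
    foreach (var file in files)
    {
        using (PdfDocument source = new PdfDocument(new PdfReader(file)))
        {
            merger.Merge(source, 1, source.GetNumberOfPages());
        }
    }
}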
Related
Using iTextSharp 5, I am trying to get the number of pages of a PDF file that I am pulling in through a MemoryStream.
using (var inms = new MemoryStream(file.Image)) // file.Image is a byte array
{
    var reader = new PdfReader(inms);
    var pageCount = reader.NumberOfPages;
}
When I do this, pageCount always comes out as 1, even though there are 18 pages in the document.
using (var pdfReader = new PdfReader(filePath))
{
    var pageCount = pdfReader.NumberOfPages;
}
When I use the second method and read the document as a file from the file system, it returns the expected 18 pages.
Any ideas on why this is and how to get around it?
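In case it helps, one sanity check I could run (a sketch only, using the file.Image byte array and filePath from the snippets above) is to compare the in-memory bytes against the file on disk; a truncated byte array would explain the different page counts:

// Needs "using System.Linq;" for SequenceEqual.
var diskBytes = File.ReadAllBytes(filePath);
Console.WriteLine("byte[] length: {0}, file length: {1}, identical: {2}",
    file.Image.Length,
    diskBytes.Length,
    diskBytes.SequenceEqual(file.Image));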
I'm using the SharpCompress library to extract .7z files, but it takes about 35 minutes to extract a 60 MB .7z file. Is this normal, or am I doing something wrong in terms of performance? The .7z file is compressed with the high compression setting and the LZMA method.
using (var archive2 = ArchiveFactory.Open(source))
{
    foreach (var entry in archive2.Entries)
    {
        if (!entry.IsDirectory)
        {
            entry.WriteToDirectory(destination, ExtractOptions.ExtractFullPath | ExtractOptions.Overwrite);
        }
    }
}
This is an old post, but I just had the same problem.
This line is the problem:
foreach (var entry in archive2.Entries)
The problem is described here (i.e. if there are 100 files, it decompresses the 1st file 100 times, the 2nd file 99 times, and so on).
The solution is to use the reader (forward-only) API. See the API.
But the sample code there doesn't support 7z.
For 7z you can use archive.ExtractAllEntries(), e.g.:
using (var archive = ArchiveFactory.Open(movedZipFile))
{
    var reader = archive.ExtractAllEntries();
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
            reader.WriteEntryToDirectory(extractDir, new ExtractionOptions() { ExtractFullPath = false, Overwrite = true });
    }
}
It will be much faster.
I'm reading in a .docx file using the Novacode API, and I'm unable to create or display any of the file's images in a WinForms app, because I can't convert a Novacode Picture (pic) or Image to a System.Drawing image. I've noticed that there's very little information inside the Picture itself, with no way to get any pixel data that I can see, so I haven't been able to use any of the usual conversion approaches.
I've also looked at how Word stores images inside its files, as well as the Novacode source, for any hints, and I've come up with nothing.
My question, then, is: is there a way to convert a Novacode Picture to a System.Drawing one, or should I use something different, like OpenXML, to gather the image data? If so, would Novacode and OpenXML conflict in any way?
There's also this answer that might be another place to start.
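For context, this is roughly what I imagine the OpenXML route would look like (a sketch only, assuming the DocumentFormat.OpenXml package; I don't know yet whether mixing it with Novacode causes problems):

using System.Collections.Generic;
using System.Drawing;
using DocumentFormat.OpenXml.Packaging;

// Pull every embedded image out of the .docx via the OpenXML SDK,
// keyed by the part URI (e.g. /word/media/image1.png).
static Dictionary<string, Image> ExtractImages(string path)
{
    var images = new Dictionary<string, Image>();
    using (var doc = WordprocessingDocument.Open(path, false))
    {
        foreach (ImagePart part in doc.MainDocumentPart.ImageParts)
        {
            using (var stream = part.GetStream())
            using (var temp = new Bitmap(stream))
            {
                // Clone into a standalone Bitmap so the part stream can be closed.
                images[part.Uri.ToString()] = new Bitmap(temp);
            }
        }
    }
    return images;
}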
Any help is much appreciated.
Okay. This is what I ended up doing. Thanks to gattsbr for the advice. This only works if you can grab all the images in order, and have descending names for all the images.
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.IO.Compression; // Had to add an assembly reference for this
using System.Linq;
using Novacode;

// Have to fully qualify System.Drawing.Image to avoid the ambiguity with Novacode.Image
Dictionary<string, System.Drawing.Image> images = new Dictionary<string, System.Drawing.Image>();

void LoadTree()
{
    // In case of a previous exception
    if (File.Exists("Images.zip")) { File.Delete("Images.zip"); }

    // Allow the file to stay open elsewhere while parsing
    using (FileStream stream = File.Open("Images.docx", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        using (DocX doc = DocX.Load(stream))
        {
            // Work the rest of the document here.
            // Still parse here to get the names of the images.
            // Might have to drag and drop images into the file, rather than insert through Word.
            foreach (Picture pic in doc.Pictures)
            {
                string name = pic.Description;
                if (null == name) { continue; }
                name = name.Substring(name.LastIndexOf("\\") + 1);
                name = name.Substring(0, name.Length - 4); // strip the 4-character extension, e.g. ".png"
                images[name] = null;
            }

            // Save while still open
            doc.SaveAs("Images.zip");
        }
    }

    // Use the temp zip copy to extract the images
    using (ZipArchive zip = ZipFile.OpenRead("Images.zip"))
    {
        // Gather all image names, in order.
        // They're retrieved from the bottom up, so reverse.
        string[] keys = images.Keys.OrderByDescending(o => o).Reverse().ToArray();
        for (int i = 1; ; i++)
        {
            // Also had to add an assembly reference for ZipArchiveEntry
            ZipArchiveEntry entry = zip.GetEntry(String.Format("word/media/image{0}.png", i));
            if (null == entry) { break; }
            Stream stream = entry.Open();
            // The entry stream is left open: Bitmap needs its source stream to stay open.
            images[keys[i - 1]] = new Bitmap(stream);
        }
    }

    // Remove the temp zip file
    File.Delete("Images.zip");
}
I have an SSIS script task, written in C#, that zips files.
I have a problem when zipping an approximately 1 GB file.
I tried to implement this code and I still get a 'System.OutOfMemoryException':
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at ST_4cb59661fb81431abcf503766697a1db.ScriptMain.AddFileToZipUsingStream(String sZipFile, String sFilePath, String sFileName, String sBackupFolder, String sPrefixFolder) in c:\Users\dtmp857\AppData\Local\Temp\vsta\84bef43d323b439ba25df47c365b5a29\ScriptMain.cs:line 333
at ST_4cb59661fb81431abcf503766697a1db.ScriptMain.Main() in c:\Users\dtmp857\AppData\Local\Temp\vsta\84bef43d323b439ba25df47c365b5a29\ScriptMain.cs:line 131
This is the snippet of code that zips the file:
protected bool AddFileToZipUsingStream(string sZipFile, string sFilePath, string sFileName, string sBackupFolder, string sPrefixFolder)
{
    bool bIsSuccess = false;
    try
    {
        if (File.Exists(sZipFile))
        {
            using (ZipArchive addFile = ZipFile.Open(sZipFile, ZipArchiveMode.Update))
            {
                addFile.CreateEntryFromFile(sFilePath, sFileName);
                //Move File after zipping it
                BackupFile(sFilePath, sBackupFolder, sPrefixFolder);
            }
        }
        else
        {
            //from https://stackoverflow.com/questions/28360775/adding-large-files-to-io-compression-ziparchiveentry-throws-outofmemoryexception
            using (var zipFile = ZipFile.Open(sZipFile, ZipArchiveMode.Update))
            {
                var zipEntry = zipFile.CreateEntry(sFileName);
                using (var writer = new BinaryWriter(zipEntry.Open()))
                using (FileStream fs = File.Open(sFilePath, FileMode.Open))
                {
                    var buffer = new byte[16 * 1024];
                    using (var data = new BinaryReader(fs))
                    {
                        int read;
                        while ((read = data.Read(buffer, 0, buffer.Length)) > 0)
                            writer.Write(buffer, 0, read);
                    }
                }
            }
            //Move File after zipping it
            BackupFile(sFilePath, sBackupFolder, sPrefixFolder);
        }
        bIsSuccess = true;
    }
    catch (Exception ex)
    {
        throw ex;
    }
    return bIsSuccess;
}
What am I missing? Please give me a suggestion, maybe a tutorial or a best practice for handling this problem.
I know this is an old post, but what can I say, it helped me sort out some stuff and it still comes up as a top hit on Google.
So there is definitely something wrong with the System.IO.Compression library!
First and foremost...
You must make sure to turn off "Prefer 32-bit". Having this set (in my case with an "AnyCPU" build) causes so many inconsistent issues.
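A quick way to confirm which mode the process actually ended up in (just Environment.Is64BitProcess, nothing from the compression library):

// Prints True/True when "Prefer 32-bit" is off and you're on a 64-bit OS.
Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
Console.WriteLine("64-bit OS:      " + Environment.Is64BitOperatingSystem);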
With that said, I took some demo files (several under 500 MB, one at 500 MB, and one at 1 GB) and created a sample program with 3 buttons that exercised the 3 methods:
Button 1 - ZipFile.CreateFromDirectory(AbsolutePath, TargetFile);
Button 2 - ZipArchive.CreateEntryFromFile(AbsolutePath, RelativePath);
Button 3 - Using the [16 * 1024] byte buffer method from above
(Rough sketches of the first two calls follow below; Button 3 is the buffer loop from the question.)
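For clarity, here's roughly what the first two buttons called (a sketch with hypothetical paths; I'm using ZipArchiveMode.Create here, and Create mode requires the target file not to exist yet):

using System.IO;
using System.IO.Compression;

// Hypothetical paths, for illustration only.
string sourceDir  = @"C:\Temp\DemoFiles";
string zipA       = @"C:\Temp\DemoA.zip";
string zipB       = @"C:\Temp\DemoB.zip";
string singleFile = @"C:\Temp\DemoFiles\Big1GB.bin";

// Method 1: one call, whole directory -> zip.
ZipFile.CreateFromDirectory(sourceDir, zipA);

// Method 2: add a single file as one entry (CreateEntryFromFile is the
// ZipFileExtensions extension method on ZipArchive).
using (ZipArchive archive = ZipFile.Open(zipB, ZipArchiveMode.Create))
{
    archive.CreateEntryFromFile(singleFile, Path.GetFileName(singleFile));
}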
Now here is where it gets interesting. Assuming the program is built as "AnyCPU" with "Prefer 32-bit" unchecked, all 3 methods worked on a 64-bit Windows OS, regardless of how much memory it had.
However, as soon as I ran the same test on a 32-bit OS, regardless of how much memory it had, ONLY method 1 worked!
Methods 2 and 3 blew up with the OutOfMemoryException, and to rub salt in the wound, method 3 (the supposedly preferred chunking method) actually corrupted more files than method 2!
By corrupted, I mean that the 500 MB and the 1 GB files ended up in the zipped archive, but at a smaller size than the originals (they were basically truncated).
So I don't know... since there aren't many 32-bit OSes around anymore, I guess maybe it's a moot point.
But it seems like there are some bugs in the System.IO.Compression framework!
While troubleshooting a performance problem, I came across an issue in Windows 8 which relates to file names containing .dat (e.g. file.dat, file.data.txt).
I found that it takes over 6x as long to create them as files with any other extension.
The same issue occurs in Windows Explorer, where copying folders containing .dat* files takes significantly longer.
I have created some sample code to illustrate the issue:
internal class DatExtnIssue
{
    internal static void Run()
    {
        CreateFiles("txt");
        CreateFiles("dat");
        CreateFiles("dat2");
        CreateFiles("doc");
    }

    internal static void CreateFiles(string extension)
    {
        var folder = Path.Combine(@"c:\temp\FileTests", extension);
        if (!Directory.Exists(folder))
            Directory.CreateDirectory(folder);

        var sw = new Stopwatch();
        sw.Start();
        for (var n = 0; n < 500; n++)
        {
            var fileName = Path.Combine(folder, string.Format("File-{0:0000}.{1}", n, extension));
            using (var fileStream = File.Create(fileName))
            {
                // Left empty to show the problem is due to creation alone
                // Same issue occurs regardless of writing, closing or flushing
            }
        }
        sw.Stop();
        Console.WriteLine(".{0} = {1,6:0.000}secs", extension, sw.ElapsedMilliseconds / 1000.0);
    }
}
Results from creating 500 files with each of the following extensions:
.txt = 0.847secs
.dat = 5.200secs
.dat2 = 5.493secs
.doc = 0.806secs
I got similar results using:
using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
{ }
and:
File.WriteAllText(fileName, "a");
This caused a problem, as I had a batch application that was taking far too long to run, and I finally tracked it down to this.
Does anyone have any idea why this would be happening? Is this by design? I hope not, as it could cause problems for high-volume applications creating .dat files.
It could be something on my PC, but I have checked the Windows registry and found no unusual settings for these extensions.
If all else fails, try a kludge:
Write all the files out as .txt and then rename *.txt to *.dat. Maybe it will be faster :)
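A minimal sketch of that kludge, reusing the c:\temp\FileTests layout from the question:

// Sketch only: create the files with the cheap .txt extension, then rename to .dat.
var folder = Path.Combine(@"c:\temp\FileTests", "renamed");
Directory.CreateDirectory(folder);

for (var n = 0; n < 500; n++)
{
    var txtName = Path.Combine(folder, string.Format("File-{0:0000}.txt", n));
    using (File.Create(txtName)) { }                             // fast path: .txt
    File.Move(txtName, Path.ChangeExtension(txtName, ".dat"));   // rename to .dat
}

Whether the rename itself triggers the same slowdown is, of course, exactly the open question.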