Parsing Large List of Excel Files Failing - C#

This is a C#/VSTO program. I've been working on a data capture project. The scope is basically 'process Excel files sent by a variety of third-party companies.' Practically, this means:
Locate columns that contain the data I want through a search method.
Grab data out of the workbooks
Clean the data, run some calculations, etc
Output cleaned data into new workbook
The program I have written works great for small-medium data sets, ~25 workbooks with a combined total of ~1000 rows of relevant data. I'm grabbing 7 columns' worth of data out of these workbooks. One edge case I have, though, is occasionally I will need to run a much larger data set, ~50 workbooks with a combined total of ~8,000 rows of relevant data (and possibly another ~2000 rows of duplicate data that I also have to remove).
I am currently putting a list of the files through a Parallel.ForEach loop inside of which I open a new Excel.Application() to process each file with multiple ActiveSheets. The parallel process runs much faster on the smaller data set than going through each one sequentially. But on the larger data set, I seem to hit a wall.
I start getting the message: Microsoft Excel is waiting for another application to complete an OLE action, and eventually it just fails. Switching back to a sequential foreach does allow the program to finish, but it just grinds along - going from 1-3 minutes for a parallel medium-sized data set to 20+ minutes for a sequential large data set. If I set ParallelOptions.MaxDegreeOfParallelism to 10, it completes the cycle but still takes 15 minutes. If I set it to 15, it fails. I also really don't like messing with TPL settings if I don't have to. I've also tried inserting a Thread.Sleep to manually slow things down, but that only made the failure happen further out.
I close the workbook, quit the application, then call ReleaseComObject on the Excel object, and GC.Collect and GC.WaitForPendingFinalizers at the end of each loop.
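For reference, the loop is shaped roughly like this (ProcessWorkbook and filePaths are placeholder names for illustration, not the actual project code; Marshal comes from System.Runtime.InteropServices):

Parallel.ForEach(filePaths, filePath =>
{
    // one Excel instance per file, cleaned up at the end of each iteration
    var excel = new Microsoft.Office.Interop.Excel.Application();
    Microsoft.Office.Interop.Excel.Workbook workbook = null;
    try
    {
        workbook = excel.Workbooks.Open(filePath);
        ProcessWorkbook(workbook); // locate columns, grab and clean the data
    }
    finally
    {
        if (workbook != null)
        {
            workbook.Close(false); // close without saving
            Marshal.ReleaseComObject(workbook);
        }
        excel.Quit();
        Marshal.ReleaseComObject(excel);
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
});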
My ideas at the moment are:
Split the list in half and run them separately
Open some number of new Excel.Application() instances in parallel, but run a list of files sequentially inside each instance (so kinda like #1, but using a different path; see the sketch after this list)
Separate the list by file size, and run a small set of very large files independently/sequentially, run the rest as I have been
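Here is a rough sketch of idea #2, with the same placeholder names and Interop usings as above: split the files into N round-robin chunks and give each chunk its own long-lived Excel instance.

int instanceCount = 4; // tune for the machine
var chunks = filePaths
    .Select((path, index) => new { path, index })
    .GroupBy(x => x.index % instanceCount, x => x.path)
    .ToList();
Parallel.ForEach(chunks, chunk =>
{
    var excel = new Microsoft.Office.Interop.Excel.Application();
    try
    {
        foreach (var filePath in chunk) // sequential within this instance
        {
            var workbook = excel.Workbooks.Open(filePath);
            try { ProcessWorkbook(workbook); }
            finally
            {
                workbook.Close(false);
                Marshal.ReleaseComObject(workbook);
            }
        }
    }
    finally
    {
        excel.Quit();
        Marshal.ReleaseComObject(excel);
    }
});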
Things I am hoping to get some help with:
Suggestions on making really sure my memory is getting cleared (maybe the Process.Id is getting twisted up in all the opening and closing?)
Suggestions on ordering a parallel process - I'm wondering if throwing the 'big' guys in first will make the longer-running process more stable.
I have been looking at: http://reedcopsey.com/2010/01/26/parallelism-in-net-part-5-partitioning-of-work/ and he says "With prior knowledge about your work, it may be possible to partition data more meaningfully than the default Partitioner." But I'm having a hard time knowing what partitioning, if any, makes sense here.
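For what it's worth, one way to apply that prior knowledge is to order the files largest-first so the long-running workbooks start early, then hand the ordered list to a load-balancing partitioner. This is only a sketch with placeholder names (filePaths, ProcessFile):

// largest files first, so the slowest work starts immediately
var largestFirst = filePaths
    .OrderByDescending(path => new FileInfo(path).Length)
    .ToList();
// loadBalance: true hands out items one at a time instead of in fixed chunks
var partitioner = Partitioner.Create(largestFirst, loadBalance: true);
Parallel.ForEach(partitioner, path => ProcessFile(path));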
Really appreciate any insights!
UPDATE
So as a general rule I test against Excel 2010, as we have both 2010 and 2013 in use here. I ran it against 2013 and it works fine - run time about 4 minutes, which is about what I would expect. Before I just abandon 2010 compatibility, any other ideas? The 2010 machine is a 64-bit machine with 64-bit Office, and the 2013 machine is a 64-bit machine with 32-bit Office. Would that matter at all?

A few years ago I worked with Excel files and automation. I had problems back then with zombie processes in Task Manager: although our program ended and I thought I had quit Excel properly, the processes were not quitting.
The solution was not something I liked, but it was effective. I can summarize it like this:
1) Never use two dots consecutively, like:
workBook.ActiveSheet.PageSetup
Instead, use variables; when you are done, release and null them.
For example, instead of doing this:
m_currentWorkBook.ActiveSheet.PageSetup.LeftFooter = str.ToString();
follow the practices in this function. (This function adds a barcode to an Excel footer.)
private bool SetBarcode(string text)
{
    Excel._Worksheet sheet = (Excel._Worksheet)m_currentWorkbook.ActiveSheet;
    try
    {
        StringBuilder str = new StringBuilder();
        str.Append(@"&""IDAutomationHC39M,Regular""&22(");
        str.Append(text);
        str.Append(")");
        Excel.PageSetup setup = sheet.PageSetup;
        try
        {
            setup.LeftFooter = str.ToString();
        }
        finally
        {
            RemoveReference(setup);
            setup = null;
        }
    }
    finally
    {
        RemoveReference(sheet);
        sheet = null;
    }
    return true;
}
Here is the RemoveReference function (simply setting the variable to null, without ReleaseComObject, did not work):
private void RemoveReference(object o)
{
    try
    {
        System.Runtime.InteropServices.Marshal.ReleaseComObject(o);
    }
    catch
    {
    }
    finally
    {
        o = null; // only clears the local copy; callers also null their own variables
    }
}
If you follow this pattern EVERYWHERE, it guarantees no leaks, no zombie processes, etc.
2) In order to create Excel files you can use the Excel application; however, to get data out of Excel, I suggest using OleDb. You can approach Excel like a database and get data from it with SQL queries, DataTables, etc.
Sample code (instead of filling a DataSet, you can use a DataReader for better memory performance):
private List<DataTable> getMovieTables()
{
    List<DataTable> movieTables = new List<DataTable>();
    var connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + excelFilePath +
        ";Extended Properties=\"Excel 12.0;IMEX=1;HDR=NO;TypeGuessRows=0;ImportMixedTypes=Text\"";
    using (var conn = new OleDbConnection(connectionString))
    {
        conn.Open();
        DataRowCollection sheets = conn.GetOleDbSchemaTable(
            OleDbSchemaGuid.Tables, new object[] { null, null, null, "TABLE" }).Rows;
        foreach (DataRow sheet in sheets)
        {
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT * FROM [" + sheet["TABLE_NAME"].ToString() + "]";
                var adapter = new OleDbDataAdapter(cmd);
                var ds = new DataSet();
                try
                {
                    adapter.Fill(ds);
                    movieTables.Add(ds.Tables[0]);
                }
                catch (Exception ex)
                {
                    //Debug.WriteLine(ex.ToString());
                    continue;
                }
            }
        }
    }
    return movieTables;
}

As an alternative to the solution proposed by @Mustafa Düman, I recommend using version 4 beta of EPPlus. I have used it without problems in several projects.
Pros:
Fast
No memory leaks (I can't tell the same for versions <4)
Does not require Office to be installed on the machine where you use it
Cons:
Can be used only for .xlsx files (Excel 2007/2010)
I tested it with the following code on 20 Excel files of around 12.5 MB each (over 50k records in each file), and I think it's enough to mention that it didn't crash :)
Console.Write("Path: ");
var path = Console.ReadLine();
var dirInfo = new DirectoryInfo(path);
while (string.IsNullOrWhiteSpace(path) || !dirInfo.Exists)
{
Console.WriteLine("Invalid path");
Console.Write("Path: ");
path = Console.ReadLine();
dirInfo = new DirectoryInfo(path);
}
string[] files = null;
try
{
files = Directory.GetFiles(path, "*.xlsx", SearchOption.AllDirectories);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.ReadLine();
return;
}
Console.WriteLine("{0} files found.", files.Length);
if (files.Length == 0)
{
Console.ReadLine();
return;
}
int succeded = 0;
int failed = 0;
Action<string> LoadToDataSet = (filePath) =>
{
try
{
FileInfo fileInfo = new FileInfo(filePath);
using (ExcelPackage excel = new ExcelPackage(fileInfo))
using (DataSet dataSet = new DataSet())
{
int workSheetCount = excel.Workbook.Worksheets.Count;
for (int i = 1; i <= workSheetCount; i++)
{
var worksheet = excel.Workbook.Worksheets[i];
var dimension = worksheet.Dimension;
if (dimension == null)
continue;
bool hasData = dimension.End.Row >= 1;
if (!hasData)
continue;
DataTable dataTable = new DataTable();
//add columns
foreach (var firstRowCell in worksheet.Cells[1, 1, 1, dimension.End.Column])
dataTable.Columns.Add(firstRowCell.Start.Address);
for (int j = 0; j < dimension.End.Row; j++)
dataTable.Rows.Add(worksheet.Cells[j + 1, 1, j + 1, dimension.End.Column].Select(erb => erb.Value).ToArray());
dataSet.Tables.Add(dataTable);
}
dataSet.Clear();
dataSet.Tables.Clear();
}
Interlocked.Increment(ref succeded);
}
catch (Exception)
{
Interlocked.Increment(ref failed);
}
};
Stopwatch sw = new Stopwatch();
sw.Start();
files.AsParallel().ForAll(LoadToDataSet);
sw.Stop();
Console.WriteLine("{0} succeded, {1} failed in {2} seconds", succeded, failed, sw.Elapsed.TotalSeconds);
Console.ReadLine();

Related

Method in Service that uses Excel isn't saving and doesn't appear to produce an exception

I have a problem with some code in a service. The method creates a DataTable of the report that needs to be created and then writes it to an existing Excel file. The problem is that it fails when saving the file. Somewhat more oddly, it doesn't appear to be catching errors, and I don't know why. I've included the code below. Of note:
excel.Visible = true; doesn't seem to make the Excel sheet visible each time, so I can't really watch what's going on in Excel itself. I assume it's not becoming visible because it's a service, but I don't really know.
I know that the DataTable is producing output, as I've had the log (which is just a text file where I can write events and errors) write both each cell's value and its location, and it has never stopped in the foreach loop or the for loop within it.
I know that it's failing at the wb.Save(); because the Log.WriteLine("WriteTIByEmp 4"); successfully writes to the text file, but the Log.WriteLine("WriteTIByEmp 5"); does not.
The catch (Exception ex) also doesn't seem to be working: it doesn't write anything about the exception, and it doesn't even write the Log.WriteLine("Catching Exception");, so I'm a bit lost.
Edit: Just to note, this all uses Interop for the Excel portion; it's just that "using Microsoft.Office.Interop.Excel" is declared at the top of the class to save time and typing, as almost all the methods in this particular class use Excel. The Excel file is always open in this process - I can see it in Task Manager - and I have had other methods successfully write to Excel files in this process. Only this particular method has had issues.
public void WriteTIByEmp(CI.WriteReport Log)
{
    try
    {
        System.Data.DataTable Emps = Pinpoint.TICardsByEmpStatsDaily.GetTICardsByEmployer();
        Log.WriteLine("WriteTIByEmp 1");
        Application excel = new Application();
        excel.Visible = true;
        Workbook wb = excel.Workbooks.Open(TIEMPPath);
        Worksheet ws = wb.Worksheets[1];
        ws.Range["A:G"].Clear();
        Log.WriteLine("WriteTIByEmp 2");
        int RowNum = 0;
        int ColCount = Emps.Columns.Count;
        Log.WriteLine("WriteTIByEmp 3");
        foreach (DataRow dr in Emps.Rows)
        {
            RowNum++;
            for (int i = 0; i < ColCount; i++)
            {
                ws.Cells[RowNum, i + 1] = dr[i].ToString();
                Log.WriteLine("Cell Val:" + dr[i].ToString() + ". Cell Location: " + RowNum + "," + i);
            }
        }
        Log.WriteLine("WriteTIByEmp 4");
        wb.Save();
        Log.WriteLine("WriteTIByEmp 5");
        wb.Close();
        Log.WriteLine("WriteTIByEmp 6");
        excel = null;
        Log.WriteLine("WriteTIByEmp 7");
    }
    catch (Exception ex)
    {
        Log.WriteLine("Catching Exception");
        var st = new StackTrace(ex, true);
        var frame = st.GetFrame(0);
        var line = frame.GetFileLineNumber();
        string msg = "Component Causing Error:" + ex.Source + System.Environment.NewLine +
            "Error Message: " + ex.Message + System.Environment.NewLine +
            "Line Number: " + line + System.Environment.NewLine + System.Environment.NewLine;
        Log.WriteLine(msg, true);
    }
}
Maybe try to use Interop explicitly. I don't understand how your code can start the Excel application with this line:
Application excel = new Application();
Try this:
Microsoft.Office.Interop.Excel.Application xlexcel;
xlexcel = new Microsoft.Office.Interop.Excel.Application();
xlexcel.Visible = true;
I have been in a similar situation.
Note: with Interop, Excel is a dependency, and other processes accessing the file can cause issues. Therefore, I recommend using the EPPlus NuGet package, as it works wonders.
https://www.nuget.org/packages/EPPlus/
Please refer to the below sample code.
FileInfo fi = new FileInfo(ExcelFilesPath + "myExcelFile.xlsx");
using (ExcelPackage pck = new ExcelPackage(fi)) // open the existing workbook
{
    // Using existing worksheet 1.
    ExcelWorksheet ws = pck.Workbook.Worksheets[1];
    // Loading data from a DataTable called dt.
    ws.Cells["A1"].LoadFromDataTable(dt, true);
    // If you want to enable auto filter
    ws.Cells[ws.Dimension.Address].AutoFilter = true;
    // Some formatting
    Color colFromHex = System.Drawing.ColorTranslator.FromHtml("#00B388");
    ws.Cells[ws.Dimension.Address].Style.Fill.PatternType = ExcelFillStyle.Solid;
    ws.Cells[ws.Dimension.Address].Style.Fill.BackgroundColor.SetColor(colFromHex);
    ws.Cells[ws.Dimension.Address].Style.Font.Color.SetColor(Color.White);
    ws.Cells[ws.Dimension.Address].Style.Font.Bold = true;
    ws.Cells["D:K"].Style.Numberformat.Format = "0";
    ws.Cells["M:N"].Style.Numberformat.Format = "mm-dd-yyyy hh:mm:ss";
    ws.Cells[ws.Dimension.Address].AutoFitColumns();
    pck.SaveAs(fi);
}
You can refer to the above code from my project. I am loading the DataTable data into my Excel file by providing the range or starting cell.
I found the issue. The Excel file was open on someone else's computer, which meant it couldn't save the file, but it couldn't display the usual Excel popup because Excel wouldn't become visible (I still don't know why, but probably something to do with it being a service). So it just couldn't save, but since it wasn't a code error it didn't show up in the log, and since it couldn't continue the code was just stuck. I'll make a copy for the purposes of updating in the future.

Get files information from given directory and process N file for Uploading on server simultaneously and in sequence

I need to process N files at a time, so I've stored all the file information in a Dictionary with FileName, Size, and SequenceNo. Now I have to select 5 files from that Dictionary and process them; meanwhile, whenever processing of any file completes, another file should be selected from the dictionary.
For example:
If I have 10 files in the dictionary, I select the first 5 files (File 1, File 2, File 3, File 4, File 5) and process them. If processing of File 3 completes, then processing of File 6 should start.
Please help me.
Thank you.
Thanks, @netmage. I finally found my answer with the use of ConcurrentBag, so I'll post the answer to my own question.
The System.Collections.Concurrent namespace provides several thread-safe collection classes; I have used ConcurrentBag from it.
Unlike List, a ConcurrentBag allows modification while we are iterating over it. It's also thread-safe and allows concurrent access.
I implemented the following code to solve my problem.
I declared the ConcurrentBag object FileData as a global:
ConcurrentBag<string[]> FileData = new ConcurrentBag<string[]>();
Then I created one function to get the file information and store it in FileData:
private void GetFileInfoIntoBag(string DirectoryPath)
{
    var files = Directory.GetFiles(DirectoryPath, "*", SearchOption.AllDirectories);
    foreach (var file in files)
    {
        FileInfo f1 = new FileInfo(file);
        // i and GetFileSize are members of the surrounding class in the original project
        string[] fileData = new string[4];
        fileData[0] = f1.Name;
        fileData[1] = GetFileSize.ToActualFileSize(f1.Length, 2);
        fileData[2] = Convert.ToString(i);
        fileData[3] = f1.FullName;
        i++;
        FileData.Add(fileData);
    }
}
Then, for the upload process, I created N tasks as required and implemented the upload logic inside them:
private void Upload_Click(object sender, EventArgs e)
{
    List<Task> tskCopy = new List<Task>();
    for (int i = 0; i < N; i++)
    {
        tskCopy.Add(Task.Run(() =>
        {
            while (FileData.Count > 0)
            {
                string[] file;
                FileData.TryTake(out file);
                if (file != null && file.Length > 3)
                {
                    /* Upload logic */
                    GC.Collect();
                }
            }
        }));
    }
    Task.WaitAll(tskCopy.ToArray());
    MessageBox.Show("Upload Completed Successfully");
}
Thank you all for your support.
Apparently, you wish to process your files in a specific order, at most five at a time.
So far, information about your files is stored sequentially in a List<T>.
One straightforward way to move across the list is to store the index of the next element to access in an int variable, e.g. nextFileIndex. You initialize it to 0.
When starting to process one of your files, you take the information from your list:
MyFileInfo currentFile = null; // null means no file was left to take
lock (myFiles)
{
    if (nextFileIndex < myFiles.Count)
    {
        currentFile = myFiles[nextFileIndex++];
    }
}
You start five "processes" like that in the beginning, and whenever one of them has ended, you start a new one.
Now, for these "processes" to run in parallel (it seems like that is what you intend), please read about multithreading, e.g. the Task Parallel Library that is part of .NET. My suggestion would be to create five tasks that each grab the next file as long as nextFileIndex has not exceeded the maximum index in the list, and use something like Task.WaitAll to wait until none of the tasks has anything left to do.
Be aware of multi-threading issues.
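A minimal sketch of that approach (MyFileInfo, myFiles, and ProcessFile are placeholder names, and nextFileIndex is the shared index from above) might look like:

int nextFileIndex = 0;
var workers = new List<Task>();
for (int w = 0; w < 5; w++)
{
    workers.Add(Task.Run(() =>
    {
        while (true)
        {
            MyFileInfo currentFile;
            lock (myFiles) // only one task advances the index at a time
            {
                if (nextFileIndex >= myFiles.Count)
                    return; // nothing left to do
                currentFile = myFiles[nextFileIndex++];
            }
            ProcessFile(currentFile); // the actual upload logic goes here
        }
    }));
}
Task.WaitAll(workers.ToArray());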

Code works much slower on production server

I have some legacy code which is pretty basic: it extracts files from a ZIP file, deserializes the contents of each extracted file from XML to objects, and does something with those objects.
The ZIP file is around 90 MB. The problem is that this code executes in around 3 seconds on my local machine (1.5 s to extract, around 1.3 s to deserialize all the files), but when I publish that code on a Windows server with IIS 6.1, it takes around 28 seconds to do the same action with the same file: 14 s to extract and 13 s to deserialize.
The server is a VPS with 8 cores and 16 GB RAM.
Does anyone have any ideas?
public List<FileNameStream> UnzipFilesTest(List<string> files, string zippedPathAndFile)
{
    //var result = new Dictionary<string, MemoryStream>();
    var unzipedFiles = new List<FileNameStream>();
    string file1 = System.Web.Hosting.HostingEnvironment.MapPath(zippedPathAndFile);
    if (File.Exists(file1))
    {
        using (MemoryStream data = new MemoryStream())
        {
            using (Ionic.Zip.ZipFile zipFile = Ionic.Zip.ZipFile.Read(file1))
            {
                zipFile.ParallelDeflateThreshold = -1;
                foreach (ZipEntry e in zipFile)
                {
                    if (files.Contains(e.FileName, StringComparer.OrdinalIgnoreCase))
                    {
                        e.Extract(data);
                        unzipedFiles.Add(new FileNameStream()
                        {
                            FileContent = Encoding.UTF8.GetString(data.ToArray()),
                            FileName = e.FileName
                        });
                    }
                }
            }
        }
    }
    return unzipedFiles;
}
Optimizing the foreach loop with a Parallel.ForEach loop will schedule the unzipping work across multiple threads, which can speed it up considerably. I am not saying it isn't a hardware, network, firewall, or antivirus issue on the server - but it isn't wise to throw hardware at a software problem.
Here is an MSDN link that may prove useful.
Your code would look something like:
Parallel.ForEach(zipEntries, e =>
{
    if (files.Contains(e.FileName, StringComparer.OrdinalIgnoreCase))
    {
        e.Extract(data);
        unzipedFiles.Add(new FileNameStream()
        {
            FileContent = Encoding.UTF8.GetString(data.ToArray()),
            FileName = e.FileName
        });
    }
});
// note: data and unzipedFiles are shared across threads here - real code would
// need a per-entry stream and a thread-safe collection (or a lock) around the add
It was something in the VPS itself. After 7 days of research, the hosting provider's staff offered to migrate us to a new machine, and everything seems to be in order now.

Interop.Excel 0x800A03EC - Out of memory error code 7

I am trying to cycle through an Excel workbook and copy/paste values over the top of each sheet within that workbook. But I am running into a memory issue on the line ws.Select(true) when going to the next sheet.
Errors encountered:
Exception from HRESULT: 0x800A03EC
and when I go to close the workbook at this point, Excel throws:
Out of memory (Error 7)
Additional information:
File size 3 MB, 20 tabs, and lots of database formulas reading from the OLAP database TM1.
Microsoft Office 2007.
The code I am running is below. Is there a more efficient way of running it that may prevent the Out of Memory error, or is this something different?
Any help would be much appreciated!
public bool wbCopyValueOnly(string workBook)
{
    Workbook wb = excel.ActiveWorkbook;
    Worksheet ws;
    Range rng;
    int WS_Count = wb.Sheets.Count;
    for (int i = 0; i < WS_Count; i++)
    {
        try
        {
            ws = wb.Sheets[i + 1];
            ws.Activate();
            ws.Select(true);
            //ws.Select(Type.Missing);
            //ws.Cells.Select();
            ws.UsedRange.Select();
            //ws.Cells.Copy();
            ws.UsedRange.Copy(Type.Missing);
            ws.UsedRange.PasteSpecial(Excel.XlPasteType.xlPasteValues,
                Excel.XlPasteSpecialOperation.xlPasteSpecialOperationNone, false, false);
            // clear the clipboard to get around the memory issue
            excel.Application.CutCopyMode = (Excel.XlCutCopyMode)0;
            //rng = ws.get_Range("A1");
            //rng.Select();
            NAR(ws);
        }
        catch (System.Runtime.InteropServices.COMException err)
        {
            cLogging.write(LogLevel.Error, err.Message);
            Debug.Print(err.Message);
            return false;
        }
    }
    NAR(wb);
    return true;
}
private void NAR(object o)
{
    try
    {
        while (System.Runtime.InteropServices.Marshal.ReleaseComObject(o) > 0) ;
    }
    catch { }
    finally
    {
        o = null;
    }
}
I have a program that copies from up to 100 tabs back to the first summary page, and I found that using .Copy() caused several problems - especially if your program runs for a length of time, since the current user cannot use copy-paste without strange results. I recommend using variables to store what you need and then writing to the intended range. Recording macros is invaluable if you need to change the format of the range.
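As a sketch of that variable-based approach (assuming ws is the current worksheet and the used range spans more than one cell), the whole sheet can be converted to values with two COM calls and no clipboard at all:

// Read the used range into an array in one call, then write the values
// back over the same range - formulas are replaced by their current values.
Excel.Range used = ws.UsedRange;
try
{
    object[,] values = (object[,])used.Value2;
    used.Value2 = values;
}
finally
{
    Marshal.ReleaseComObject(used);
}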
ws.Activate();
ws.Select(true);
ws.UsedRange.Select();
I think this code isn't necessary.
You can record a macro to learn how to modify your code.

how to create a fresh database before tests run?

How do you create a fresh database (every time) before tests run, from a schema file?
You can use the SchemaExport class in NHibernate to do this in code:
var schema = new SchemaExport(config);
schema.Drop(true, true);
schema.Execute(true, true, false);
drop the entire database - don't drop table by table - that adds too much maintenance overhead
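As a hedged sketch of wiring that in (assuming an NUnit-style fixture, the NHibernate.Tool.hbm2ddl and NUnit.Framework usings, and an NHibernate Configuration named cfg built elsewhere in the fixture), the export can simply run before each test:

[SetUp]
public void ResetDatabase()
{
    // cfg is the NHibernate Configuration for the test database;
    // Execute(false, true, false) recreates the schema without echoing the script.
    new SchemaExport(cfg).Execute(false, true, false);
}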
I have used the following utility methods for running SQL scripts to set up databases and test data in a project that I work with every now and then. It has worked rather well:
internal static void RunScriptFile(SqlConnection conn, string fileName)
{
    long fileSize = 0;
    using (FileStream stream = File.OpenRead(fileName))
    {
        fileSize = stream.Length;
        using (StreamReader reader = new StreamReader(stream))
        {
            StringBuilder sb = new StringBuilder();
            string line = string.Empty;
            while (!reader.EndOfStream)
            {
                line = reader.ReadLine();
                if (string.Compare(line.Trim(), "GO", StringComparison.InvariantCultureIgnoreCase) == 0)
                {
                    RunCommand(conn, sb.ToString());
                    sb.Length = 0;
                }
                else
                {
                    sb.AppendLine(line);
                }
            }
        }
    }
}
private static void RunCommand(SqlConnection connection, string commandString)
{
    using (SqlCommand command = new SqlCommand(commandString, connection))
    {
        try
        {
            command.ExecuteNonQuery();
        }
        catch (Exception ex)
        {
            Console.WriteLine(string.Format("Exception while executing statement: {0}", commandString));
            Console.WriteLine(ex.ToString());
        }
    }
}
I have used the Database Publishing Wizard to generate SQL scripts (and in some cases edited them to include only the data I want to use in the test), and just pass the script file paths into the RunScriptFile method before the tests. The method parses the script file and executes each part that is separated by a GO line separately (I found that this greatly helped in troubleshooting errors that happened while running the SQL scripts).
It has been a while since I wrote the code, but I think it requires that the script file end with a GO line in order for the last part of it to be executed.
Have a look at these posts:
Ayende Rahien - nhibernate-unit-testing
Scott Muc - unit-testing-domain-persistence-with-ndbunit-nhibernate-and-sqlite
I have found them to be very useful; basically they extend the example by Mike Glenn.
I use Proteus (a unit test utility), available on Google Code here:
http://code.google.com/p/proteusproject/
You create a set of data. Each time you run a unit test, the current data are saved, the set of data is loaded, and you run your tests against that same set of data every time. At the end, the original data are restored.
Very powerful.
HTH
