How to get around OutOfMemory exception in C#?

I've got a few huge XML files, 1+ GB each. I need to do some filtering operations on them. The easiest idea I've come up with is to save them as txt, read them with ReadAllText, and start doing operations like
var a = File.ReadAllText("file path");
a = a.Replace("<", "\r\n<");
The moment I try to do that, however, the program crashes with an out-of-memory error. I've watched Task Manager while it runs: RAM usage climbs to about 50%, and the moment it reaches that point the program dies.
Does anyone have any ideas on how to operate on this file while avoiding the OutOfMemory exception, or on how to let the program use more memory?

If you can do it line by line, instead of saying "Read everything into memory" with File.ReadAllText, you can say "Yield me one line at a time" with File.ReadLines.
This returns an IEnumerable<string> that uses deferred execution. You can do it like this:
using (StreamWriter sw = new StreamWriter(newFilePath))
{
    foreach (var line in File.ReadLines(path))
    {
        sw.WriteLine(line.Replace("<", "\r\n<"));
        sw.Flush();
    }
}
If you want to learn more about deferred execution, you can check this github page.
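As a small illustration of that deferred execution, chaining a LINQ operator onto File.ReadLines only builds the pipeline; lines are pulled from disk one at a time when something enumerates the result. A rough sketch (the paths are made up):
using System.Collections.Generic;
using System.IO;
using System.Linq;

class DeferredExecutionSketch
{
    static void Main()
    {
        // Nothing is read from disk yet; this only builds the query.
        IEnumerable<string> fixedLines = File.ReadLines(@"C:\data\huge.txt")
            .Select(line => line.Replace("<", "\r\n<"));

        // The file is streamed line by line while the output is written,
        // so memory use stays roughly constant regardless of file size.
        File.WriteAllLines(@"C:\data\huge-fixed.txt", fixedLines);
    }
}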

Related

System.IO.Compression.ZipArchive keeps file locked after dispose?

I have a class that takes data from several sources and writes them to a ZIP file. I've benchmarked the class to check whether using CompressionLevel.Optimal would be much slower than CompressionLevel.Fastest. But the benchmark throws an exception, on a different iteration and with a different CompressionLevel value, each time I run it.
I started removing the methods that add the file content, step by step, until I ended up with the code below (inside the for loop), which does basically nothing besides create an empty zip file and delete it.
Simplified code:
var o = @"e:\test.zip";
var result = new FileInfo(o);
for (var i = 0; i < 1_000_000; i++)
{
    // Alternate approach
    // using (var archive = ZipFile.Open(o, ZipArchiveMode.Create))
    using (var archive = new ZipArchive(result.OpenWrite(), ZipArchiveMode.Create, false, Encoding.UTF8))
    {
    }
    result.Delete();
}
The loop runs for about 100 to 15k iterations on my PC and then throws an IOException when trying to delete the file, saying that the file (result) is locked.
So... did I miss something about how to use System.IO.Compression.ZipArchive? There is no Close method on ZipArchive, and using should dispose/close the archive... I've tried different .NET versions: 4.6, 4.6.1, 4.7 and 4.7.2.
EDIT 1:
The result.Delete() is not part of the code that is benchmarked
EDIT 2:
Also tried playing around with Thread.Sleep(5/10/20) after the using block (and then result.Delete(), to check whether the lock persists), but up to 20 ms the file is still locked at some point. I didn't try values higher than 20 ms.
EDIT 3:
Can't reproduce the problem at home. Tried a dozen times at work and the loop never hit 20k iterations; tried once here and it completed.
EDIT 4:
jdweng (see comments) was right. Thanks! It's somehow related to my "e:" partition on a local HDD. The same code runs fine on my "c:" partition on a local SSD and also on a network share.
In my experience, files may not be consistently unlocked by the time the stream's Dispose method returns. My best guess is that this is due to the file system doing some operation asynchronously. The best solution I have found is to retry the delete operation multiple times, i.e. something like this:
public static void DeleteRetrying(this FileInfo self, int delayMs = 100, int numberOfAttempts = 3)
{
    for (int i = 0; i < numberOfAttempts - 1; i++)
    {
        try
        {
            self.Delete();
            return; // deleted successfully, no need to retry
        }
        catch (IOException)
        {
            // Consider making the method async and
            // replacing this with Task.Delay
            Thread.Sleep(delayMs);
        }
    }
    // Final attempt; let the exception propagate
    self.Delete();
}
This is not an ideal solution, and I would love it if someone could provide a better one. But it might be good enough for testing, where the impact of a non-deleted file is manageable.
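As the comment in the code suggests, an asynchronous variant could use Task.Delay instead of Thread.Sleep so the retries don't block a thread. A rough, untested sketch of that idea (the class and method names are my own):
using System.IO;
using System.Threading.Tasks;

public static class FileInfoExtensions
{
    public static async Task DeleteRetryingAsync(this FileInfo self, int delayMs = 100, int numberOfAttempts = 3)
    {
        for (int i = 0; i < numberOfAttempts - 1; i++)
        {
            try
            {
                self.Delete();
                return; // deleted successfully
            }
            catch (IOException)
            {
                // Wait without blocking the calling thread.
                await Task.Delay(delayMs);
            }
        }
        // Final attempt; let the exception propagate.
        self.Delete();
    }
}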

Last batch never uploads to Solr when uploading batches of data from json file stream

This might be a long shot, but I might as well try here. There is a block of C# code that rebuilds a Solr core. The steps are as follows:
Delete all the existing documents
Get the core entities
Split the entities into batches of 1000
Spin off threads to perform the next set of processes:
Serialize each batch to JSON and write the JSON to a file on the server hosting the core
Send a command to the core to upload that file using System.Net.WebClient: solrurl/corename/update/json?stream.file=myfile.json&stream.contentType=application/json;charset=utf-8
Delete the file. I've also tried deleting the files after all the batches are done, as well as not deleting the files at all
After all batches are done it commits. I've also tried committing after each batch is done.
My problem is that the last batch will not upload if it's much smaller than the batch size. It flows through as if the command was called, but nothing happens. It throws no exceptions and I see no errors in the Solr logs. My questions are: why, and how can I ensure the last batch always gets uploaded? We think it's a timing issue, but we've added Thread.Sleep(30000) in many parts of the code to test that theory and it still happens.
The only time it doesn't happen is:
when the batch is full or almost full
when we don't run it on multiple threads
when we put a breakpoint at the File.Delete line on the last batch, wait for 30 seconds or so, then continue
Here is the code for writing the file and calling the update command. This is called for each batch.
private const string
    FileUpdateCommand = "{1}/update/json?stream.file={0}&stream.contentType=application/json;charset=utf-8",
    SolrFilesDir = @"\\MYSERVER\SolrFiles",
    SolrFileNameFormat = SolrFilesDir + @"\{0}-{1}.json",
    _solrUrl = "http://MYSERVER:8983/solr/",
    CoreName = "MyCore";

public void UpdateCoreByFile(List<CoreModel> items)
{
    if (items.Count == 0)
        return;
    var settings = new JsonSerializerSettings { DateTimeZoneHandling = DateTimeZoneHandling.Utc };
    var dir = new DirectoryInfo(SolrFilesDir);
    if (!dir.Exists)
        dir.Create();
    var filename = string.Format(SolrFileNameFormat, Guid.NewGuid(), CoreName);
    using (var sw = new StreamWriter(filename))
    {
        sw.Write(JsonConvert.SerializeObject(items, settings));
    }
    var file = HttpUtility.UrlEncode(filename);
    var command = string.Format(FileUpdateCommand, file, CoreName);
    using (var client = _clientFactory.GetClient()) // System.Net.WebClient
    {
        client.DownloadData(new Uri(_solrUrl + command));
    }
    //Thread.Sleep(30000); // doesn't work if I add this
    File.Delete(filename); // works here if I add a breakpoint and wait 30 sec or so
}
I'm just trying to figure out why this is happening and how to address it. I hope this makes sense, and I have provided enough information and code. Thanks for any help.
Since changing the size of the data set and adding a breakpoint "fixes" it, this is most certainly a race condition. Since you haven't added the code that actually indexes the content, it's impossible to say what the issue really is, but my guess is that the last commit happens before all the threads have finished, and it only works when all threads are done (if you sleep the threads, the issue will still be there, since all the threads sleep for the same amount of time).
The easy fix: use commitWithin instead, and never issue explicit commits. The commitWithin parameter makes sure that the documents become available in the index within the given time frame (given in milliseconds). To make sure that the documents you submit become available within ten seconds, append &commitWithin=10000 to your URL.
If there are already documents pending a commit, the documents added will be committed before the ten seconds have elapsed, but even if there's just one last document being submitted as the last batch, it'll never be more than ten seconds before it becomes visible (.. and there will be no documents left forever in a non-committed limbo).
That way you won't have to keep your threads synchronized or issue a final commit, as long as you wait until all threads have finished before exiting your application (if it's an application that actually terminates).
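Applied to the code in the question, that just means appending the parameter to the update command string. A sketch of what the constant might look like (the 10000 ms value is only an example):
// Same command as in the question, with commitWithin appended so every
// batch (including the last, partial one) becomes searchable within 10 s.
private const string FileUpdateCommand =
    "{1}/update/json?stream.file={0}" +
    "&stream.contentType=application/json;charset=utf-8" +
    "&commitWithin=10000";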

Sharing a memory map simultaneously between processes

After overcoming some other difficulties, I'm now stuck with this (probably simple) problem.
My goal: multiple instances of my application are running and performing operations (read & write) on the same file at the same time.
Solution (what I want to do): use memory-mapped files to speed up the execution. When the app starts, it attempts to call CreateFromFile() (I understand that if the file has already been mapped by someone else, this will do nothing but return the "handle"), then proceeds to perform its work on the file.
The problem is that I'm getting an IOException (The file is being used by another process) upon an attempt to CreateFromFile() when another process has already done so and has not disposed of the MemoryMappedFile object. I could, theoretically, use mutexes and constantly dispose of the objects with the using statement, but that dumps the contents to disk, slowing the execution down again and defeating the purpose.
What are the possible solutions to my problem?
I could, theoretically, CreateOrOpen() a non-persisted file, then dump it to disk when needed, but I believe there's a better approach?
EDIT: I feel I didn't make myself clear. All instances of the app need to work on the file constantly; they open it and use it for the next 10, 20, 60 minutes without closing it. Eventually all of them will be killed, and that's when the OS steps in, flushing the mapped file to disk.
Example programs:
Program A)
static void Main(string[] args)
{
    var mmf = MemoryMappedFile.CreateFromFile(@"C:\Users\admin\Documents\Visual Studio 2010\Projects\serialize\serialize\bin\Debug\File.dat", FileMode.Open, "name");
    var accessor = mmf.CreateViewAccessor();
    accessor.Write(5, 'd');
    Console.Read();
}
Program B)
static void Main(string[] args)
{
    var mmf = MemoryMappedFile.CreateFromFile(@"C:\Users\admin\Documents\Visual Studio 2010\Projects\serialize\serialize\bin\Debug\File.dat", FileMode.Open, "name");
    var accessor = mmf.CreateViewAccessor();
    accessor.Write(2, 'x');
    Console.Read();
}
I run both at the same time and let them hang waiting on Console.Read(). Ideally, they should be able to access the file at the same time (i.e. either of them could open it after the other one has), but as stated above, I'm getting an IOException.
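For what it's worth, a rough, untested sketch of one way the sharing violation might be avoided: each process opens its own FileStream with FileShare.ReadWrite and passes it to the CreateFromFile overload that takes a stream, so the backing file is never opened exclusively (the path is the one from the example programs; whether this satisfies the coherence requirements here is an open question):
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main(string[] args)
    {
        // Open the backing file with read/write sharing so another process
        // can open it the same way at the same time.
        var stream = new FileStream(
            @"C:\Users\admin\Documents\Visual Studio 2010\Projects\serialize\serialize\bin\Debug\File.dat",
            FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);

        // Map the file; capacity 0 means "use the file's length", and
        // leaveOpen: false lets the MemoryMappedFile dispose the stream.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            stream, null, 0, MemoryMappedFileAccess.ReadWrite,
            null, HandleInheritability.None, leaveOpen: false))
        using (var accessor = mmf.CreateViewAccessor())
        {
            accessor.Write(5, 'd');
            Console.Read();
        }
    }
}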

Check If File Is In Use By Other Instances of Executable Run

Before I go into too much detail: my program is written in Visual Studio 2010 using C# and .NET 4.0.
I wrote a program that generates a separate log file for each run. The log file is named after the time, accurate to the millisecond (for example, 20130726103042375.log). The program also generates a master log file for the day if it does not already exist (for example, 20130726_Master.log).
At the end of each run, I want to append the log file to a master log file. Is there a way to check if I can append successfully? And retry after Sleep for like a second or something?
Basically, I have 1 executable, and multiple users (let's say there are 5 users).
All 5 users will access and run this executable at the same time. Since it's nearly impossible for all users to start at the exact same time (down to the millisecond), there is no problem generating the individual log files.
However, the issue comes in when I attempt to merge those log files into the master log file. Though it is unlikely, I think the program will crash if multiple users are appending to the same master log file.
The method I use is
File.AppendAllText(masterLogFile, File.ReadAllText(individualLogFile));
I have looked into the lock statement, but I think it doesn't work in my case, as there are multiple instances running instead of multiple threads in one instance.
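(For reference, I gather the cross-process counterpart of lock would be a named system-wide Mutex, something like the rough sketch below, though I'm not sure it's the right fit here; the names are made up.)
using System.IO;
using System.Threading;

static class MasterLogAppender
{
    // A named mutex is visible to every process on the machine, unlike lock,
    // which only coordinates threads inside a single process.
    private static readonly Mutex MasterLogMutex = new Mutex(false, @"Global\MyApp_MasterLog");

    public static void AppendToMaster(string masterLogFile, string individualLogFile)
    {
        MasterLogMutex.WaitOne();
        try
        {
            File.AppendAllText(masterLogFile, File.ReadAllText(individualLogFile));
        }
        finally
        {
            MasterLogMutex.ReleaseMutex();
        }
    }
}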
Another way I looked into is try/catch, something like this:
try
{
    stream = file.Open(FileMode.Open, FileAccess.ReadWrite, FileShare.None);
}
catch { }
But I don't think this solves the problem, because the status of the masterLogFile can change in that brief millisecond.
So my overall question is: is there a way to append to masterLogFile if it's not in use, and retry after a short timeout if it is? Or is there an alternative way to create the masterLogFile?
Thank you in advance, and sorry for the long message. I want to make sure I get my message across and explain what I've tried or looked into, so we are not wasting anyone's time.
Please let me know if there's anymore information I can provide to help you help me.
Your try/catch is the way to do things. If the call to File.Open succeeds, then you can write to the file. The idea is to keep the file open. I would suggest something like:
bool openSuccessful = false;
while (!openSuccessful)
{
    try
    {
        using (var writer = new StreamWriter(masterlog, true)) // append
        {
            // Successfully opened the file.
            openSuccessful = true;
            try
            {
                foreach (var line in File.ReadLines(individualLogFile))
                {
                    writer.WriteLine(line);
                }
            }
            catch (IOException)
            {
                // Something unexpected happened while writing.
                // Handle the error and exit the loop.
                break;
            }
        }
    }
    catch (IOException)
    {
        // Couldn't open the file.
        // If the exception is because it's opened in another process,
        // then delay and retry. Otherwise exit.
        Thread.Sleep(1000);
    }
}
if (!openSuccessful)
{
    // notify of error
}
So if you fail to open the file, you sleep and try again.
See my blog post, File.Exists is only a snapshot, for a little more detail.
I would do something along the lines of this, as I think it incurs the least overhead. Try/catch is going to generate a stack trace (which could take a whole second) if an exception is thrown. There has to be a better way to do this atomically, still. If I find one I'll post it.

Parse output from a process that updates a single console line

Greetings Stack Overflow members,
in a BackgroundWorker of a WPF frontend, I run SoX (an open-source console sound-processing tool) in a System.Diagnostics.Process. In the same way, I use several other command-line tools and parse their output to populate progress bars in my frontend.
This works fine for the other tools, but not for SoX: instead of emitting a new line for each progress step, it updates a single console line using only carriage returns (\r) and no line feeds (\n). I tried both asynchronous and synchronous reads on process.StandardError.
Using the async process.ErrorDataReceived += (sender, args) => FadeAudioOutputHandler(clip, args); in combination with process.BeginErrorReadLine(); doesn't produce any individual status updates, because for some reason the carriage returns do not trigger ReadLine, even though the MSDN docs suggest they should. The output is spat out in one chunk when the process finishes.
I then tried the following code for synchronous char by char reads on the stream:
char[] c;
var line = new StringBuilder();
while (process.StandardError.Peek() > -1)
{
    c = new char[1];
    process.StandardError.Read(c, 0, c.Length);
    if (c[0] == '\r')
    {
        var percentage = 0;
        var regex = new Regex(@"%\s([^\s]+)");
        var match = regex.Match(line.ToString());
        if (match.Success)
        {
            myProgressObject.ProgressType = ProgressType.FadingAudio;
            // ... some calculations omitted for brevity
            percentage = (int)Math.Round(result);
        }
        else
        {
            myProgressObject.ProgressType = ProgressType.UndefinedStep;
        }
        _backGroundWorker.ReportProgress(percentage, myProgressObject);
        line.Clear();
    }
    else
    {
        line.Append(c[0]);
    }
}
The above code does not seem to read the stream in real time; it stalls for a while, then spits out a small chunk, and finally deadlocks halfway through the process.
Any hints towards the right direction would be greatly appreciated!
UPDATE with (sloppy?) solution:
This drove me crazy because nothing I tried on the C# side of things seemed to have any effect on the results. My original implementation, before changing it 15 times and introducing new dependencies, was fine.
The problem is with sox and RedirectStandardError alone. I discovered that after grabbing the sox source code and building my own version. First I removed all of sox's output entirely, except for the stuff I was really interested in, and then changed the output to full lines followed by a newline \n. I assumed that this would fix my issues. Well, it didn't. I don't know enough C++ to find out exactly why, but they seem to have tampered with how stdio writes to that stream, or how it's buffered, or they do it in such a special way that the stream on the C# side is not flushed until the default 4096-byte buffer is full. I confirmed that by padding each line to at least 4096 bytes. So in conclusion, all I had to do was manually flush stderr in sox.c after each fprintf(stderr, ...) call in display_status(...):
fflush(stderr);
Though, I'm not sure this is anywhere close to an elegant solution.
Thanks to Erik Dietrich for his answer which made me look at this from a different angle.
The situation you describe is a known problem - for a solution including source code see http://www.codeproject.com/KB/threads/ReadProcessStdoutStderr.aspx
It solves both problems (deadlock and the problem with \n)...
I've had to deal with a similar issue with a bespoke build tool in Visual Studio. I found that using a regex and doing the parsing in the same thread as the reading is a problem: the output processing grinds to a halt. I ended up with a standard producer-consumer solution where you read lines from the output and stick them onto a queue, then have the queue dequeued and processed on some other thread. I can't offer source code, but this site has some fantastic resources: http://www.albahari.com/threading/part2.aspx
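Since that answer doesn't include code, here is a rough sketch of that producer-consumer idea using BlockingCollection; the class and method names are made up, and it assumes the Process was started with RedirectStandardError = true:
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

class OutputPump
{
    private readonly BlockingCollection<string> _lines = new BlockingCollection<string>();

    public void Start(Process process)
    {
        // Producer: the async read callback only enqueues raw lines.
        process.ErrorDataReceived += (sender, args) =>
        {
            if (args.Data != null)
                _lines.Add(args.Data);
        };
        process.EnableRaisingEvents = true;
        process.Exited += (sender, args) => _lines.CompleteAdding();
        process.BeginErrorReadLine();

        // Consumer: parsing (regex, progress reporting, ...) runs on its own
        // task, so slow parsing never blocks the reader.
        Task.Factory.StartNew(() =>
        {
            foreach (var line in _lines.GetConsumingEnumerable())
            {
                ParseProgress(line);
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Hypothetical parsing/reporting method; regex + ReportProgress go here.
    private void ParseProgress(string line) { }
}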
It's a little kludgy, but perhaps you could pipe the output of the uncooperative process to a process that does nothing but process input by characters, insert line feeds, and write to standard out... So, in terms of (very) pseudo-code:
StartProcess("sox | littleguythatIwrote")
ReadStandardOutTheWayYouAlreadyAre()
Could be that just moves the goalposts (I'm a lot more familiar with stdin/out/err in the *nix world), but it's a different way to look at the problem, anyway.
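A rough sketch of what that little in-between program might look like in C#: it copies standard input to standard output character by character, rewriting bare carriage returns as line feeds so a line-oriented reader downstream sees one line per progress update (purely illustrative, not tested against sox):
using System;

class CrToLfFilter
{
    static void Main()
    {
        // Read stdin one character at a time and turn bare \r into \n,
        // so line-oriented readers downstream get one line per update.
        int ch;
        while ((ch = Console.In.Read()) != -1)
        {
            if (ch == '\r')
                Console.Out.Write('\n');
            else
                Console.Out.Write((char)ch);
        }
        Console.Out.Flush();
    }
}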
