Why does calling the Tesseract process cause this service to crash randomly? - c#

I have a .NET Core 2.1 service which runs on an Ubuntu 18.04 VM and calls Tesseract OCR 4.00 via a Process instance. I would like to use an API wrapper, but I could only find one available and it is only in beta for the latest version of Tesseract -- the stable wrapper uses version 3 instead of 4. In the past, this service worked well enough, but I have been changing it so that document/image data is written and read from disk less frequently in an attempt to improve speed. The service used to call many more external processes (such as ImageMagick) which were unnecessary due to the presence of an API, so I have been replacing those with API calls.
Recently I've been testing this with a sample file taken from real data. It's a faxed PDF document that has 133 pages, but it is only 5.8 MB because it is grayscale and low resolution. The service takes a document, splits it into individual pages, then uses Parallel.For to assign multiple threads (one thread per page) that call Tesseract and process them. The thread limits are configurable. I am aware that Tesseract has its own multithreading environment variable (OMP_THREAD_LIMIT). I found in prior testing that setting it to "1" is ideal for our setup at the moment, but in my recent testing for this issue I have tried leaving it unset (dynamic value) with no improvement.
The issue is that unpredictably, when Tesseract is called, the service will hang for about a minute and then crash, with the only error showing in journalctl being:
dotnet[32328]: Error while reaping child. errno = 10
dotnet[32328]: at System.Environment.FailFast(System.String, System.Exception)
dotnet[32328]: at System.Environment.FailFast(System.String)
dotnet[32328]: at System.Diagnostics.ProcessWaitState.TryReapChild()
dotnet[32328]: at System.Diagnostics.ProcessWaitState.CheckChildren(Boolean)
dotnet[32328]: at System.Diagnostics.Process.OnSigChild(Boolean)
I can't find anything at all online for this particular error. It would seem to me, based on related research I've done on the Process class, that this is occurring when the process is exiting and dotnet is trying to clean up the resources it was using. I'm really at a loss as to how to even approach this problem, although I have tried a number of "guesses" such as changing thread limit values. There is no cross-over between threads. Each thread has its own partition of pages (based on how Parallel.For partitions a collection) and it sets to work on those pages, one at a time.
Here is the process call, called from within multiple threads (8 is the limit we normally set):
private bool ProcessOcrPage(IMagickImage page, int pageNumber, object instanceId)
{
    var inputPageImagePath = Path.Combine(_fileOps.GetThreadWorkingDirectory(instanceId), $"ocrIn_{pageNumber}.{page.Format.ToString().ToLower()}");
    string outputPageFilePathWithoutExt = Path.Combine(_fileOps.GetThreadOutputDirectory(instanceId),
        $"pg_{pageNumber.ToString().PadLeft(3, '0')}");
    page.Write(inputPageImagePath);
    var cmdArgs = $"-l eng \"{inputPageImagePath}\" \"{outputPageFilePathWithoutExt}\" pdf";
    bool success;
    _logger.LogStatement($"[Thread {instanceId}] Executing the following command:{Environment.NewLine}tesseract {cmdArgs}", LogLevel.Debug);
    var psi = new ProcessStartInfo("tesseract", cmdArgs)
    {
        RedirectStandardError = true,
        RedirectStandardOutput = true,
        UseShellExecute = false,
        CreateNoWindow = true
    };
    // 0 is not the default value for this environment variable. It should remain unset if there
    // is no config value, as it is determined dynamically by default within OpenMP.
    if (_processorConfig.TesseractThreadLimit > 0)
        psi.EnvironmentVariables.Add("OMP_THREAD_LIMIT", _processorConfig.TesseractThreadLimit.ToString());
    using (var p = new Process() { StartInfo = psi })
    {
        string standardErr, standardOut;
        int exitCode;
        p.Start();
        standardOut = p.StandardOutput.ReadToEnd();
        standardErr = p.StandardError.ReadToEnd();
        p.WaitForExit();
        exitCode = p.ExitCode;
        if (!string.IsNullOrEmpty(standardOut))
            _logger.LogStatement($"Tesseract stdOut:\n{standardOut}", LogLevel.Debug, nameof(ProcessOcrPage));
        if (!string.IsNullOrEmpty(standardErr))
            _logger.LogStatement($"Tesseract stdErr:\n{standardErr}", LogLevel.Debug, nameof(ProcessOcrPage));
        success = p.ExitCode == 0;
    }
    return success;
}
EDIT 4: After much testing and discussion with Clint in chat, here is what we learned. The error is raised from a Process event "OnSigChild," that much is obvious from the stack trace, but there is no way to hook into the same event that raises this error. The process never times out given a timeout of 10 seconds (Tesseract typically only takes a few seconds to process a given page). Curiously, if the process timeout is removed and I wait on the standard output and error streams to close, it will hang for a good 20-30 seconds, but the process does not appear in ps auxf during this hang time. From the best that I can tell, Linux is able to determine that the process is done executing, but .NET is not. Otherwise, the error seems to be raised at the very moment that the process is done executing.
The most baffling thing to me is still that the process handling part of the code really hasn't changed very much compared to the working version of this code we have in production. This suggests that it's an error I made somewhere, but I am simply unable to find it. I think I will have to open up an issue on the dotnet GitHub tracker.

"Error while reaping child"
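A terminated process keeps some resources in the kernel until its parent collects its exit status, which is known as reaping the child; until then it is a zombie. On Unix, if the parent dies first, the init process adopts the orphaned child and reaps it. .NET Core reaps child processes as soon as they terminate. On Linux, errno 10 is ECHILD ("No child processes"), so the message suggests that by the time .NET tried to reap the child, there was no longer a child to wait for.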
"I have discovered that removing the stdout and stderr stream ReadToEnd
calls causes the processes to end immediately instead of hang, with
the same error"
The error occurs because p.ExitCode is being read before the process has finished; the ReadToEnd calls only delay that.
Summary of updated code
StartInfo.FileName should point to the file that you want to start.
Set UseShellExecute to false if the process should be created directly from the executable file, and to true if the shell should be used to start the process.
Added asynchronous read operations for the standard output and error streams.
AutoResetEvents signal when the output and error read operations complete.
Process.Close() releases the resources.
It is easier to set and use ArgumentList than the Arguments property.
Redhat Blog on NetProcess on Linux
Revised Module
private bool ProcessOcrPage(IMagickImage page, int pageNumber, object instanceId)
{
    StringBuilder output = new StringBuilder();
    StringBuilder error = new StringBuilder();
    int exitCode;
    var inputPageImagePath = Path.Combine(_fileOps.GetThreadWorkingDirectory(instanceId), $"ocrIn_{pageNumber}.{page.Format.ToString().ToLower()}");
    string outputPageFilePathWithoutExt = Path.Combine(_fileOps.GetThreadOutputDirectory(instanceId),
        $"pg_{pageNumber.ToString().PadLeft(3, '0')}");
    page.Write(inputPageImagePath);
    var cmdArgs = $"-l eng \"{inputPageImagePath}\" \"{outputPageFilePathWithoutExt}\" pdf";
    _logger.LogStatement($"[Thread {instanceId}] Executing the following command:{Environment.NewLine}tesseract {cmdArgs}", LogLevel.Debug);
    using (var outputWaitHandle = new AutoResetEvent(false))
    using (var errorWaitHandle = new AutoResetEvent(false))
    {
        try
        {
            using (var process = new Process())
            {
                process.StartInfo = new ProcessStartInfo
                {
                    WindowStyle = ProcessWindowStyle.Hidden,
                    FileName = "tesseract", // Verify that this is the executable you want to start (no .exe suffix on Linux)
                    RedirectStandardOutput = true,
                    RedirectStandardError = true,
                    UseShellExecute = false,
                    CreateNoWindow = true,
                    Arguments = cmdArgs,
                    // Assumption: run from the directory that holds the input page
                    WorkingDirectory = Path.GetDirectoryName(inputPageImagePath)
                };
                if (_processorConfig.TesseractThreadLimit > 0)
                    process.StartInfo.EnvironmentVariables.Add("OMP_THREAD_LIMIT", _processorConfig.TesseractThreadLimit.ToString());
                process.OutputDataReceived += (sender, e) =>
                {
                    if (e.Data == null)
                    {
                        // A null line marks the end of the stream
                        outputWaitHandle.Set();
                    }
                    else
                    {
                        output.AppendLine(e.Data);
                    }
                };
                process.ErrorDataReceived += (sender, e) =>
                {
                    if (e.Data == null)
                    {
                        errorWaitHandle.Set();
                    }
                    else
                    {
                        error.AppendLine(e.Data);
                    }
                };
                process.Start();
                process.BeginOutputReadLine();
                process.BeginErrorReadLine();
                // The run only counts as complete if the process exits and both streams reach their end in time
                bool completed = process.WaitForExit(ProcessTimeOutMiliseconds)
                    && outputWaitHandle.WaitOne(ProcessTimeOutMiliseconds)
                    && errorWaitHandle.WaitOne(ProcessTimeOutMiliseconds);
                if (!completed)
                {
                    // Cancel the reads if the process is still writing after the timeout; this prevents an ObjectDisposedException
                    process.CancelOutputRead();
                    process.CancelErrorRead();
                    Console.ForegroundColor = ConsoleColor.Red;
                    Console.WriteLine("Timed Out");
                    // Release the resources allocated for the Process
                    process.Close();
                    return false;
                }
                Console.ForegroundColor = ConsoleColor.Green;
                Console.WriteLine("Completed On Time");
                exitCode = process.ExitCode;
                if (output.Length > 0)
                    _logger.LogStatement($"Tesseract stdOut:\n{output}", LogLevel.Debug, nameof(ProcessOcrPage));
                if (error.Length > 0)
                    _logger.LogStatement($"Tesseract stdErr:\n{error}", LogLevel.Debug, nameof(ProcessOcrPage));
                process.Close();
                return exitCode == 0;
            }
        }
        catch (Exception)
        {
            // Handle exception
            return false;
        }
    }
}
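As an alternative to the event-based version above, here is a minimal sketch that drains both streams with ReadToEndAsync before waiting, so that neither pipe can fill up and block the child. ProcessRunner, Result and Run are hypothetical names for illustration and are not part of the original service; the 5-second drain limit is an arbitrary assumption.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original service.
public static class ProcessRunner
{
    public sealed class Result
    {
        public bool TimedOut;
        public int ExitCode;
        public string StdOut;
        public string StdErr;
    }

    public static Result Run(string fileName, string arguments, int timeoutMs)
    {
        var psi = new ProcessStartInfo(fileName, arguments)
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = new Process { StartInfo = psi })
        {
            process.Start();

            // Start both reads before waiting so that both pipes are drained concurrently.
            Task<string> stdOutTask = process.StandardOutput.ReadToEndAsync();
            Task<string> stdErrTask = process.StandardError.ReadToEndAsync();

            bool exited = process.WaitForExit(timeoutMs);
            if (!exited)
            {
                try { process.Kill(); }
                catch (InvalidOperationException) { /* exited between the check and Kill */ }
                process.WaitForExit();
            }

            // Give the reads a bounded amount of time to reach end-of-stream (assumed limit).
            bool drained = Task.WaitAll(new Task[] { stdOutTask, stdErrTask }, 5000);

            return new Result
            {
                TimedOut = !exited,
                ExitCode = process.ExitCode, // read only after the process has exited
                StdOut = drained ? stdOutTask.Result : string.Empty,
                StdErr = drained ? stdErrTask.Result : string.Empty
            };
        }
    }
}
```

A call such as ProcessRunner.Run("tesseract", cmdArgs, 10000) would then replace the body of the using block; the exit code is read only after WaitForExit has returned.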

Related

How to make a process crash if it doesn't log anything for 5 minutes

I work at Ubisoft and we use a very old program to manipulate some files. Since it's legacy software, it's really bad and it may happen that the software has crashed and keeps on running. We sadly don't have access to the code, so we're unable to fix that. I was wondering, is it possible to use System.Diagnostics.Process with a "no log timeout"? Here's what I'm trying to achieve
var legacySoftwareProcess = new Process
{
StartInfo =
{
UseShellExecute = false,
RedirectStandardOutput = true,
WorkingDirectory = localPackageFolder,
FileName = CiConfig.DataGeneration.RebuildUnstrippedBatName
}
};
legacySoftwareProcess.IdleLogTimeout = 5 * 60; // 5 minutes
legacySoftwareProcess.Start();
var output = proc.StandardOutput.ReadToEnd();
legacySoftwareProcess.WaitForExit();
if (legacySoftwareProcess.ExitCode != 0)
{
Context.LogMessage(output);
Context.LogError("The process exited with non 0 code");
}
Rather than using:
var output = proc.StandardOutput.ReadToEnd();
You can listen for the event when output data is received from the process:
proc.OutputDataReceived += ResetTimer;
proc.Start();
proc.BeginOutputReadLine(); // starts asynchronous reads; OutputDataReceived is raised as each line arrives
And in the handler method ResetTimer, as the method name implies, reset a 5-minute timer:
static void ResetTimer(object sender, DataReceivedEventArgs e)
{
if (e.Data != null)
{
// reset the timer
}
}
If the timer has elapsed, it means nothing has been output for 5 minutes, and you can take action accordingly, i.e. kill the process.
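The idea above could be sketched roughly as follows. IdleWatchdog and LegacyRunner are hypothetical names introduced for illustration; the sketch assumes a one-shot System.Threading.Timer that is pushed back on every output line and kills the process when it finally fires.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: invokes onIdle once if Reset() is not called within idleTimeout.
public sealed class IdleWatchdog : IDisposable
{
    private readonly Timer _timer;
    private readonly TimeSpan _idleTimeout;

    public IdleWatchdog(TimeSpan idleTimeout, Action onIdle)
    {
        _idleTimeout = idleTimeout;
        // One-shot timer: it fires once unless Reset() pushes the deadline back.
        _timer = new Timer(_ => onIdle(), null, idleTimeout, Timeout.InfiniteTimeSpan);
    }

    public void Reset()
    {
        try { _timer.Change(_idleTimeout, Timeout.InfiniteTimeSpan); }
        catch (ObjectDisposedException) { /* watchdog already stopped */ }
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}

public static class LegacyRunner
{
    // Runs the process and kills it if it prints nothing for idleTimeout.
    public static int RunWithIdleTimeout(ProcessStartInfo psi, TimeSpan idleTimeout)
    {
        psi.UseShellExecute = false;
        psi.RedirectStandardOutput = true;

        using (var proc = new Process { StartInfo = psi })
        {
            proc.Start();
            Action kill = () =>
            {
                try { proc.Kill(); }
                catch (InvalidOperationException) { /* already exited */ }
                catch (Win32Exception) { /* could not be terminated */ }
            };

            using (var watchdog = new IdleWatchdog(idleTimeout, kill))
            {
                proc.OutputDataReceived += (s, e) =>
                {
                    if (e.Data != null) watchdog.Reset(); // any output counts as a sign of life
                };
                proc.BeginOutputReadLine();
                proc.WaitForExit();
            }
            return proc.ExitCode;
        }
    }
}
```

With the question's settings this would be called with TimeSpan.FromMinutes(5); the exit code returned after a kill is platform dependent, so the caller should treat any non-zero value as a failure.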

Sending message from one C# console application to another

First of all, I've read all related topics and they gave general idea but implementation doesn't work for me:
Send strings from one console application to another
How to send input to the console as if the user is typing?
Sending input/getting output from a console application (C#/WinForms)
I have a console application that is doing some actions in background until cancellation is requested. Typical usage scenario is :
1) Execute application
2) Enter input data
3) Issue start command
4) After some time passes, enter stop command
5) Exit application
Child application Program.cs :
static void Main()
{
Console.WriteLine("Enter input parameter : ");
var inputParameter = Console.ReadLine();
Console.WriteLine("Entered : " + inputParameter);
var tokenSource = new CancellationTokenSource();
var token = tokenSource.Token;
Task.Factory.StartNew(() =>
{
while (true)
{
if (token.IsCancellationRequested)
{
Console.WriteLine("Stopping actions");
return;
}
// Simulating some actions
Console.Write("*");
}
}, token);
if (Console.ReadKey().KeyChar == 'c')
{
tokenSource.Cancel();
Console.WriteLine("Stop command");
}
Console.WriteLine("Finished");
Console.ReadLine();
}
What I'm looking for is some sort of host utility to control this application - spawn multiple instances and perform required user actions on each instance.
Host application Program.cs :
static void Main()
{
const string exe = "Child.exe";
var exePath = System.IO.Path.GetFullPath(exe);
var startInfo = new ProcessStartInfo(exePath)
{
RedirectStandardOutput = true,
RedirectStandardInput = true,
// WindowStyle = ProcessWindowStyle.Hidden,
WindowStyle = ProcessWindowStyle.Maximized,
CreateNoWindow = true,
UseShellExecute = false
};
var childProcess = new Process { StartInfo = startInfo };
childProcess.OutputDataReceived += readProcess_OutputDataReceived;
childProcess.Start();
childProcess.BeginOutputReadLine();
Console.WriteLine("Waiting 5s for child process to start...");
Thread.Sleep(5000);
Console.WriteLine("Enter input");
var msg = Console.ReadLine();
// Sending input parameter
childProcess.StandardInput.WriteLine(msg);
// Sending start command aka any key
childProcess.StandardInput.Write("s");
// Wait 5s while child application is working
Thread.Sleep(5000);
// Issue stop command
childProcess.StandardInput.Write("c");
// Wait for child application to stop
Thread.Sleep(20000);
childProcess.WaitForExit();
Console.WriteLine("Batch finished");
Console.ReadLine();
}
When I run this tool, after first input it crashes with "has stopped working" error and prompt to send memory dump to Microsoft. Output window in VS shows no exceptions.
Guess this problem occurs somewhere between applications and may be because of output stream buffer overflow (child app is writing a lot of stars each second which mimics real output which may be huge) and I yet have no idea how to fix it. I don't really need to pass child's output to host (only send start-stop commands to child), but commenting RedirectStandardOutput and OutputDataReceived doesn't fix this problem. Any ideas how to make this work?
I would recommend using NamedPipeServerStream and NamedPipeClientStream, which allows you to open a stream which will communicate between processes on a given machine.
First, this will create a pipe server stream and wait for someone to connect to it:
var stream = new NamedPipeServerStream(this.PipeName, PipeDirection.InOut);
stream.WaitForConnection();
return stream;
Then, this will connect to that stream (from your other process), allowing you to read / write in either direction:
var stream = new NamedPipeClientStream(".", this.PipeName, PipeDirection.InOut);
stream.Connect(100);
return stream;
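To show how the two halves fit together, here is a minimal sketch that runs the server and the client in one process and sends a single command line across the pipe. PipeDemo and RoundTrip are hypothetical names; in the real scenario the server side would live in the child application and the client side in the host.

```csharp
using System;
using System.IO;
using System.IO.Pipes;
using System.Threading.Tasks;

// Hypothetical demo: sends one line through a named pipe and returns what the server received.
public static class PipeDemo
{
    public static string RoundTrip(string pipeName, string command)
    {
        // Server side (the child application in the question): wait for a client, read one command.
        Task<string> server = Task.Run(() =>
        {
            using (var stream = new NamedPipeServerStream(pipeName, PipeDirection.InOut))
            {
                stream.WaitForConnection();
                using (var reader = new StreamReader(stream))
                {
                    return reader.ReadLine();
                }
            }
        });

        // Client side (the host application): connect and write the command.
        using (var client = new NamedPipeClientStream(".", pipeName, PipeDirection.InOut))
        {
            client.Connect(5000); // wait up to 5 seconds for the server to appear
            using (var writer = new StreamWriter(client) { AutoFlush = true })
            {
                writer.WriteLine(command);
            }
        }

        return server.Result;
    }
}
```

Each child instance would need its own pipe name, for example one derived from its process ID, so that the host can address the instances individually.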
Another alternative is to use MSMQ.
I would advise looking at "Working with memory mapped files in .NET 4":
http://blogs.msdn.com/b/salvapatuel/archive/2009/06/08/working-with-memory-mapped-files-in-net-4.aspx
It's fast and efficient.

Queuing installations via Process.Start

I need to queue approximately 20 installations that are fully unattended (using a C# WinForms application). Each installation has its own manually created INI file that contains the arguments each installer requires for this procedure (read in before that program is executed). I'm running into issues with many applications: when setup.exe is executed, the process closes immediately and launches its MSI (if applicable), causing my procedure to carry on with the next installation on the assumption that the first is complete. I have read about similar problems while snooping around the web, but found no real solution to the issue (some workarounds included using a batch file with the /Wait option, which should have kept setup.exe in memory until its MSI has completed). The setup.exe files must be launched because they contain bootstrappers.
What options do I have to resolve this dilemma?
Here is some sample code that demonstrates the procedure:
foreach (ListViewItem itm in this.lstSoftwares.Items)
{
    try
    {
        if (itm.Checked)
        {
            lblStatus.Text = "Status: Installing " + current.ToString() + " of " + count.ToString();
            string InstallPath = Path.Combine(Application.StartupPath, "Software",
                itm.Text, itm.Tag.ToString());
            string CommandLine = itm.SubItems[1].Text;
            Process process = new Process();
            process.StartInfo.FileName = InstallPath;
            process.StartInfo.Arguments = CommandLine;
            process.StartInfo.WindowStyle = ProcessWindowStyle.Normal;
            process.Start();
            process.WaitForExit();
            itm.SubItems[2].Text = "Complete";
            current++;
        }
    }
    catch (Exception)
    {
        // Error handling is not shown in the sample
    }
}
Update
Right after WaitForExit() I'm using a loop that checks whether msiexec is running:
private bool MSIRunning()
{
    try
    {
        using (var mutex = Mutex.OpenExisting(@"Global\_MSIExecute"))
        {
            return true;
        }
    }
    catch (Exception)
    {
        return false;
    }
}
This is a hack in my opinion, but it is doing the trick so far...
Querying the MSI mutex in a loop after Process.Start (check every 3 seconds whether the mutex exists; if not, return and proceed with the next install) seemed to solve the problem (noted above).
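The polling loop described above is not shown in the post; a minimal sketch of it might look like this. MsiWait and WaitUntilFree are hypothetical names, and the busy check is passed in as a delegate so that the loop can be exercised without a real installer running.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper illustrating the polling loop; not the poster's actual code.
public static class MsiWait
{
    // Returns true while the Windows Installer mutex exists.
    public static bool MsiRunning()
    {
        try
        {
            using (Mutex.OpenExisting(@"Global\_MSIExecute"))
            {
                return true;
            }
        }
        catch (WaitHandleCannotBeOpenedException)
        {
            return false; // the mutex does not exist, so no installation is in progress
        }
        catch (UnauthorizedAccessException)
        {
            return true; // the mutex exists but cannot be opened by this user
        }
    }

    // Polls isBusy until it returns false or maxWait elapses.
    // Returns true if the resource became free, false on timeout.
    public static bool WaitUntilFree(Func<bool> isBusy, TimeSpan pollInterval, TimeSpan maxWait)
    {
        var elapsed = Stopwatch.StartNew();
        while (isBusy())
        {
            if (elapsed.Elapsed >= maxWait)
                return false;
            Thread.Sleep(pollInterval);
        }
        return true;
    }
}
```

After process.WaitForExit() the install loop could call MsiWait.WaitUntilFree(MsiWait.MsiRunning, TimeSpan.FromSeconds(3), TimeSpan.FromMinutes(30)) before starting the next installer; the 30-minute limit is an arbitrary assumption.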
Already answered, but I have a slightly more robust implementation of the MSI mutex check:
public bool IsMsiExecFree(TimeSpan maxWaitTime)
{
    _logger.Info(@"Waiting up to {0}s for Global\_MSIExecute mutex to become free...", maxWaitTime.TotalSeconds);
    // The _MSIExecute mutex is used by the MSI installer service to serialize installations
    // and prevent multiple MSI based installations happening at the same time.
    // For more info: http://msdn.microsoft.com/en-us/library/aa372909(VS.85).aspx
    const string installerServiceMutexName = "Global\\_MSIExecute";
    Mutex msiExecuteMutex = null;
    var isMsiExecFree = false;
    try
    {
        msiExecuteMutex = Mutex.OpenExisting(installerServiceMutexName,
            MutexRights.Synchronize);
        isMsiExecFree = msiExecuteMutex.WaitOne(maxWaitTime, false);
    }
    catch (WaitHandleCannotBeOpenedException)
    {
        // Mutex doesn't exist, do nothing
        isMsiExecFree = true;
    }
    catch (ObjectDisposedException)
    {
        // Mutex was disposed between opening it and attempting to wait on it, do nothing
        isMsiExecFree = true;
    }
    finally
    {
        if (msiExecuteMutex != null && isMsiExecFree)
            msiExecuteMutex.ReleaseMutex();
    }
    _logger.Info(@"Global\_MSIExecute mutex is free, or {0}s has elapsed.", maxWaitTime.TotalSeconds);
    return isMsiExecFree;
}

.Exited event problems

The .Exited event does not fire in all cases. For example, when I open C:\foo.png and then close the application that displays the image, I never get the MessageBox.Show("Exited!").
Here's my code:
public static void TryOpenFile(string filename)
{
    Process proc = new Process();
    proc.StartInfo = new ProcessStartInfo(filename);
    proc.EnableRaisingEvents = true;
    proc.Exited += (a, b) => { MessageBox.Show("Exited!"); };
    proc.Start();
}
This is how I call the function: TryOpenFile(@"C:\foo.png"); How can I fix this?
Is it possible that you already have your image editing program open? When you call proc.Start(), if the process is already running, then the existing process is reused. You should check the return value of proc.Start() to see if this is the case.
From MSDN:
Return Value
true if a process resource is started; false if no new
process resource is started (for example, if an existing process is
reused).
...
Remarks
...
If the process resource specified by the FileName member of the StartInfo property is
already running on the computer, no additional process resource is started. Instead, the
running process resource is reused and false is returned.

Hanging process when run with .NET Process.Start -- what's wrong?

I wrote a quick and dirty wrapper around svn.exe to retrieve some content and do something with it, but for certain inputs it occasionally and reproducibly hangs and won't finish. For example, one call is to svn list:
svn list "http://myserver:84/svn/Documents/Instruments/" --xml --no-auth-cache --username myuser --password mypassword
This command line runs fine when I just do it from a command shell, but it hangs in my app. My c# code to run this is:
string cmd = "svn.exe";
string arguments = "list \"http://myserver:84/svn/Documents/Instruments/\" --xml --no-auth-cache --username myuser --password mypassword";
int ms = 5000;
ProcessStartInfo psi = new ProcessStartInfo(cmd);
psi.Arguments = arguments;
psi.RedirectStandardOutput = true;
psi.WindowStyle = ProcessWindowStyle.Normal;
psi.UseShellExecute = false;
Process proc = Process.Start(psi);
StreamReader output = new StreamReader(proc.StandardOutput.BaseStream, Encoding.UTF8);
proc.WaitForExit(ms);
if (proc.HasExited)
{
return output.ReadToEnd();
}
This takes the full 5000 ms and never finishes. Extending the time doesn't help. In a separate command prompt, it runs instantly, so I'm pretty sure it's unrelated to an insufficient waiting time. For other inputs, however, this seems to work fine.
I also tried running a separate cmd.exe here (where exe is svn.exe and args is the original arg string), but the hang still occurred:
string cmd = "cmd";
string arguments = "/S /C \"" + exe + " " + args + "\"";
What could I be screwing up here, and how can I debug this external process stuff?
EDIT:
I'm just now getting around to addressing this. Mucho thanks to Jon Skeet for his suggestion, which indeed works great. I have another question about my method of handling this, though, since I'm a multi-threaded novice. I'd like suggestions on improving any glaring deficiencies or anything otherwise dumb. I ended up creating a small class that contains the stdout stream, a StringBuilder to hold the output, and a flag to tell when it's finished. Then I used ThreadPool.QueueUserWorkItem and passed in an instance of my class:
ProcessBufferHandler bufferHandler = new ProcessBufferHandler(proc.StandardOutput.BaseStream,
Encoding.UTF8);
ThreadPool.QueueUserWorkItem(ProcessStream, bufferHandler);
proc.WaitForExit(ms);
if (proc.HasExited)
{
bufferHandler.Stop();
return bufferHandler.ReadToEnd();
}
... and ...
private class ProcessBufferHandler
{
public Stream stream;
public StringBuilder sb;
public Encoding encoding;
public State state;
public enum State
{
Running,
Stopped
}
public ProcessBufferHandler(Stream stream, Encoding encoding)
{
this.stream = stream;
this.sb = new StringBuilder();
this.encoding = encoding;
state = State.Running;
}
public void ProcessBuffer()
{
sb.Append(new StreamReader(stream, encoding).ReadToEnd());
}
public string ReadToEnd()
{
return sb.ToString();
}
public void Stop()
{
state = State.Stopped;
}
}
This seems to work, but I'm doubtful that this is the best way. Is this reasonable? And what can I do to improve it?
One standard issue: the process could be waiting for you to read its output. Create a separate thread to read from its standard output while you're waiting for it to exit. It's a bit of a pain, but that may well be the problem.
Jon Skeet is right on the money!
If you don't mind polling after you launch your svn command try this:
Process command = new Process();
command.EnableRaisingEvents = false;
command.StartInfo.FileName = "svn.exe";
command.StartInfo.Arguments = "your svn arguments here";
command.StartInfo.UseShellExecute = false;
command.StartInfo.RedirectStandardOutput = true;
command.Start();
while (!command.StandardOutput.EndOfStream)
{
Console.WriteLine(command.StandardOutput.ReadLine());
}
I had to drop an exe on a client's machine and use Process.Start to launch it.
The calling application would hang - the issue ended up being their machine assuming the exe was dangerous and preventing other applications from starting it.
Right click the exe and go to properties. Hit "Unblock" toward the bottom next to the security warning.
Based on Jon Skeet's answer this is how I do it in modern day (2021) .NET 5
var process = Process.Start(processStartInfo);
var stdErr = process.StandardError;
var stdOut = process.StandardOutput;
var resultAwaiter = stdOut.ReadToEndAsync();
var errResultAwaiter = stdErr.ReadToEndAsync();
await process.WaitForExitAsync();
await Task.WhenAll(resultAwaiter, errResultAwaiter);
var result = resultAwaiter.Result;
var errResult = errResultAwaiter.Result;
Note that you can't await the standard output before the standard error, because the wait will hang if the standard error buffer fills up first (and the same applies the other way around).
A reliable approach is to start reading both streams asynchronously, wait for the process to exit, and then complete the reads by awaiting Task.WhenAll.
I know this is an old post but maybe this will assist someone. I used this to execute some AWS (Amazon Web Services) CLI commands using .Net TPL tasks.
I did something like this in my command execution, which runs inside a .NET TPL Task created in my WinForms background worker's bgwRun_DoWork method; that method holds a loop with while(!bgwRun.CancellationPending). The standard output of the Process is read on a separate thread via the .NET ThreadPool class.
private void bgwRun_DoWork(object sender, DoWorkEventArgs e)
{
    while (!bgwRun.CancellationPending)
    {
        //build TPL Tasks
        var tasks = new List<Task>();
        //work to add tasks here
        tasks.Add(new Task(() =>
        {
            //build .Net ProcessInfo, Process and start Process here
            ThreadPool.QueueUserWorkItem(state =>
            {
                while (!process.StandardOutput.EndOfStream)
                {
                    var output = process.StandardOutput.ReadLine();
                    if (!string.IsNullOrEmpty(output))
                    {
                        bgwRun_ProgressChanged(this, new ProgressChangedEventArgs(0, new ExecutionInfo
                        {
                            Type = "ExecutionInfo",
                            Text = output,
                            Configuration = s3SyncConfiguration
                        }));
                    }
                    if (cancellationToken.GetValueOrDefault().IsCancellationRequested)
                    {
                        break;
                    }
                }
            });
        }));//work Task
        //loop through and start tasks here and handle completed tasks
    } //end while
}
I know my SVN repos can run slow sometimes, so maybe 5 seconds isn't long enough? Have you copied the string you are passing to the process from a break point so you are positive it's not prompting you for anything?
