Thread-safe writing to Lucene index files - C#

I have an application that crawls a site and writes the content to a Lucene index on disk.
When I use threads for this, I get write errors or lock-related errors.
I want to use multiple threads and write to the index without losing the work of any thread.
public class WriteDocument
{
    private static Analyzer _analyzer;
    private static IndexWriter indexWriter;
    private static string Host;

    public WriteDocument(string _Host)
    {
        Host = _Host;
        Lucene.Net.Store.Directory _directory = FSDirectory.GetDirectory(Host, false);
        _analyzer = new StandardAnalyzer();
        bool indexExists = IndexReader.IndexExists(_directory);
        bool createIndex = !indexExists;
        indexWriter = new IndexWriter(_directory, _analyzer, true);
    }

    public void AddDocument(object obj)
    {
        DocumentSettings doc = (DocumentSettings)obj;
        Document document = new Document();
        Field urlField = new Field("Url", doc.downloadedDocument.Uri.ToString(), Field.Store.YES, Field.Index.TOKENIZED);
        document.Add(urlField);
        indexWriter.AddDocument(document);
        document = null;
        doc.downloadedDocument = null;
        indexWriter.Optimize();
        indexWriter.Close();
    }
}
To the above class, I am passing the values like this:
DocumentSettings writedoc = new DocumentSettings()
{
    Host = Host,
    downloadedDocument = downloadDocument
};

Thread t = new Thread(() =>
{
    doc.AddDocument(writedoc);
});
t.Start();
If I add t.Join(); after t.Start();, the code works without any errors, but it slows the process down so much that it is effectively the same as not using threads at all.
I am getting errors like:
Cannot rename /indexes/Segments.new to /indexes/Segments - the file is being used by another process.
Can anyone help me with this code?

An IndexWriter is not thread-safe, so this is not possible.
If you want to use multiple threads for downloading, you will need to build some sort of single-threaded "message pump": the download threads feed it the Documents they create, and it puts them in a queue.
For example, in your AddDocument method, instead of writing to the index directly, just send the documents off to a service that will index them eventually.
That service should keep indexing whatever is in its queue, and sleep for a while whenever the queue is empty.
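Roughly, such a service could look like the sketch below. This is a minimal, hypothetical example (the class name, the use of BlockingCollection, and the thread handling are all assumptions, not code from the question): crawler threads only enqueue documents, and a single worker thread owns the IndexWriter. GetConsumingEnumerable blocks while the queue is empty, which takes the place of the sleep-and-poll loop described above.

using System.Collections.Concurrent;
using System.Threading;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class IndexingService
{
    private readonly BlockingCollection<Document> _queue = new BlockingCollection<Document>();
    private readonly IndexWriter _writer;
    private readonly Thread _worker;

    public IndexingService(IndexWriter writer)
    {
        _writer = writer;
        _worker = new Thread(Consume) { IsBackground = true };
        _worker.Start();
    }

    // Called from any crawler thread.
    public void Enqueue(Document doc)
    {
        _queue.Add(doc);
    }

    private void Consume()
    {
        // Blocks while the queue is empty; only this thread ever touches the writer.
        foreach (Document doc in _queue.GetConsumingEnumerable())
        {
            _writer.AddDocument(doc);
        }
    }

    // Call once when the crawl is finished.
    public void Complete()
    {
        _queue.CompleteAdding();
        _worker.Join();
        _writer.Optimize();
        _writer.Close();
    }
}

Each crawler thread would then call Enqueue(document) instead of touching the IndexWriter, and the main program calls Complete() once the crawl is done.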

Another approach you can take is to create a separate index for each thread, e.g. index1, index2, ..., indexn (corresponding to threads 1..n), and merge them all back together at the end.
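A rough sketch of the merge step, assuming one sub-directory per thread (the paths are placeholders, and the exact merge method depends on your Lucene.Net version: later versions expose AddIndexesNoOptimize(Directory[]), older ones AddIndexes, so use whichever your version has):

// Each thread has written its own directory with its own IndexWriter.
Lucene.Net.Store.Directory[] threadIndexes =
{
    FSDirectory.GetDirectory(@"C:\indexes\thread1", false),
    FSDirectory.GetDirectory(@"C:\indexes\thread2", false)
    // ...one directory per thread
};

// A final, single-threaded pass merges them into one index.
IndexWriter merged = new IndexWriter(FSDirectory.GetDirectory(@"C:\indexes\merged", true),
                                     new StandardAnalyzer(), true);
merged.AddIndexesNoOptimize(threadIndexes);
merged.Optimize();
merged.Close();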

Related

ThreadPool race-condition, closure, locking or something else?

I'm still pretty new to threading, so I'm sure this is one of those little gotchas and a repeat question, but I have been unable to find the answer by browsing existing threads.
I have a port scanner app in C#.
I'm using threadpools to spin up a new TcpClient for each port and probe if it's open.
After working through the concepts of closures and thread synchronization, I'm hitting an issue when multiple threads try to save their results to different indexes in the Orchestrator.hosts List.
I have multiple threads trying to update a single List results object. My understanding is that this is fine as long as I lock the object on write; however, I'm finding that on some updates multiple entries receive the same update.
For example, Thread #1 is supposed to update Hosts[0].Ports[0].Status to "Open".
What actually happens: Thread #1 updates multiple hosts with the port result, despite being passed a specific index into Hosts:
Hosts[0].Ports[0].Status becomes "Open",
Hosts[1].Ports[0].Status becomes "Open",
Hosts[2].Ports[0].Status becomes "Open".
I'm not sure where my problem is. Here is the static method that queues up a probe for each port:
public static void ScanTCPPorts()
{
    // Create a list of portsToScan objects to send to thread workers
    //List<ScanPortRequest> portsToScan = new List<ScanPortRequest>();
    using (ManualResetEvent resetEvent = new ManualResetEvent(false))
    {
        int toProcess = 0;
        for (var i = 0; i < hostCount; i++) // Starting at the beginning
        {
            // Hold our current host's ID (assigned outside of the threaded function to avoid a race condition)
            int currentHostId = i;
            if (hosts[i].IsAlive || scanDefinition.isForced())
            {
                int portCount = hosts[i].Ports.Count;
                for (int p = 0; p < portCount; p++)
                {
                    // Thread-safe increment of our work queue counter
                    Interlocked.Increment(ref toProcess);
                    int currentPortPosition = p;
                    // We need to send the array index in to the thread function
                    PortScanRequestResponse portRequestResponse = new PortScanRequestResponse(hosts[currentHostId], currentHostId, hosts[currentHostId].Ports[currentPortPosition], currentPortPosition);
                    ThreadPool.QueueUserWorkItem(
                        new WaitCallback(threadedRequestResponseInstance =>
                        {
                            PortScanRequestResponse portToScan = threadedRequestResponseInstance as PortScanRequestResponse;
                            PortScanRequestResponse threadResult = PortScanner.scanTCPPort(portToScan);
                            // Lock so the result update is thread-safe
                            lock (Orchestrator.hosts[portToScan.hostResultIndex])
                            {
                                if (threadResult.port.status == PortStatus.Open)
                                {
                                    // Update result
                                    Orchestrator.hosts[portToScan.hostResultIndex].Ports[portToScan.portResultIndex].status = PortStatus.Open;
                                    //Logger.Log(hosts[currentHostId].IPAddress + " " + hosts[currentHostId].Ports[currentPortPosition].type + " " + hosts[currentHostId].Ports[currentPortPosition].portNumber + " is open");
                                }
                                else
                                {
                                    Orchestrator.hosts[portToScan.hostResultIndex].Ports[portToScan.portResultIndex].status = PortStatus.Closed;
                                }
                                // Check if this was the last scan for the given host
                                if (Orchestrator.hosts[portToScan.hostResultIndex].PortScanComplete != true)
                                {
                                    if (Orchestrator.hosts[portToScan.hostResultIndex].isCompleted())
                                    {
                                        Orchestrator.hosts[portToScan.hostResultIndex].PortScanComplete = true;
                                        // Logger.Log(hosts[currentHostId].IPAddress + " has completed a port scan");
                                        Orchestrator.hosts[portToScan.hostResultIndex].PrintPortSummery();
                                    }
                                }
                            }
                            // Safely decrement the counter
                            if (Interlocked.Decrement(ref toProcess) == 0)
                                resetEvent.Set();
                        }), portRequestResponse); // Pass in our port to scan
                }
            }
        }
        resetEvent.WaitOne();
    }
}
Here is the worker process in a separate public static class.
public static PortScanRequestResponse scanTCPPort(object portScanRequest)
{
    PortScanRequestResponse portScanResponse = portScanRequest as PortScanRequestResponse;
    HostDefinition host = portScanResponse.host;
    ScanPort port = portScanResponse.port;
    try
    {
        using (TcpClient threadedClient = new TcpClient())
        {
            try
            {
                IAsyncResult result = threadedClient.BeginConnect(host.IPAddress, port.portNumber, null, null);
                Boolean success = result.AsyncWaitHandle.WaitOne(Orchestrator.scanDefinition.GetPortTimeout(), false);
                if (threadedClient.Client != null)
                {
                    if (success)
                    {
                        threadedClient.EndConnect(result);
                        threadedClient.Close();
                        portScanResponse.port.status = PortStatus.Open;
                        return portScanResponse;
                    }
                }
            }
            catch { }
        }
    }
    catch { }
    portScanResponse.port.status = PortStatus.Closed;
    return portScanResponse;
}
Originally I was pulling the host index from a free variable; thinking this was the problem, I moved it inside the delegate.
I tried locking the Hosts object everywhere there was a write.
I have tried different thread synchronization techniques (CountdownEvent and ManualResetEvent).
I think there is either some fundamental threading principle I haven't been introduced to yet, or I have made a very simple logic mistake.
I have multiple threads trying to update a single List results object. My understanding is this is fine as long as I lock the object on write.
I haven't studied your code, but the above statement alone is incorrect. When a List<T>, or any other non-thread-safe object, is used in a multithreaded environment, all interactions with the object must be synchronized: only one thread at a time should be allowed to interact with it. Both writes and reads must be enclosed in lock statements using the same locker object; even reading the Count must be synchronized. Otherwise the usage is erroneous and the behavior of the program is undefined.
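As an illustration of that rule, every access to the shared list, reads included, goes through the same lock. The helper methods below are purely illustrative; only the type and member names are taken from the question:

static readonly object hostsLock = new object();

static void MarkOpen(int hostIndex, int portIndex)
{
    lock (hostsLock)
    {
        Orchestrator.hosts[hostIndex].Ports[portIndex].status = PortStatus.Open;
    }
}

static bool AllScanned(int hostIndex)
{
    lock (hostsLock) // even a read-only check takes the same lock
    {
        return Orchestrator.hosts[hostIndex].isCompleted();
    }
}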
I was hyper-focused on it being a threading issue because this was my first threaded project. It turned out that I hadn't realized a copy of a List<> is just another reference to the original object (it's a reference type). I assumed my threads were accessing my save structure in an unpredictable way, but in fact my arrays of ports were all referencing the same object.
This was a "reference type" vs "value type" issue on my List<> of ports.
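A minimal illustration of that pitfall, using the question's types (the initialization code itself is hypothetical, and BuildDefaultPorts is a made-up helper):

using System.Linq;

List<ScanPort> templatePorts = BuildDefaultPorts();

// Bug: List<T> is a reference type, so every host ends up pointing at the SAME list.
foreach (HostDefinition host in hosts)
{
    host.Ports = templatePorts;
}

// Fix: give each host its own copy of the port list.
foreach (HostDefinition host in hosts)
{
    host.Ports = templatePorts
        .Select(p => new ScanPort { portNumber = p.portNumber, type = p.type })
        .ToList();
}

Giving each host its own copy (the second loop) is what makes the per-index updates independent of one another.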

How to output data from Threadpool to ConcurrentQueue (C#)?

I have written an application for my company that reads records from a text or CSV file and sends them to a web service in arrays of 100 records; the service returns responses, also in arrays of 100 records, which are written to another file. Currently it is single threaded (processing sequentially using a BackgroundWorker in Windows Forms), but I am looking to multi-thread it using a thread pool and a ConcurrentQueue.
#region NonDataCollectionService
if (!userService.needsAllRecords)
{
    ConcurrentQueue<Record[]> outputQueue = new ConcurrentQueue<Record[]>();
    while ((!inputFile.checkForEnd()) && (!backgroundWorker1.CancellationPending))
    {
        //Get array of (typically) 100 records from input file
        Record[] serviceInputRecord = inputFile.getRecords(userService.maxRecsPerRequest);
        //Queue array to be processed by threadpool, send in output concurrentqueue to be filled by threads
        ThreadPool.QueueUserWorkItem(new WaitCallback(sendToService), new object[] { serviceInputRecord, outputQueue });
    }
}
#endregion
void sendToService(Object stateInfo)
{
    //The following block is in progress, I basically need to create copies of my class that calls the service for each thread
    IWS threadService = Activator.CreateInstance(userService.GetType()) as IWS;
    threadService.errorStatus = userService.errorStatus;
    threadService.inputColumns = userService.inputColumns;
    threadService.outputColumns = userService.outputColumns;
    threadService.serviceOptions = userService.serviceOptions;
    threadService.userLicense = userService.userLicense;

    object[] objectArray = stateInfo as object[];
    //Send input records to service
    threadService.sendToService((Record[])objectArray[0]);
    //The line below returns correctly
    Record[] test123 = threadService.outputRecords;
    ConcurrentQueue<Record[]> threadQueue = objectArray[1] as ConcurrentQueue<Record[]>;
    //threadQueue has records here
    threadQueue.Enqueue(test123);
}
However, when I check outputQueue in the top block, it is empty, even when threadQueue has records queued. Is this because I'm passing by value instead of by reference? If so, how would I fix that syntactically with ThreadPool.QueueUserWorkItem?

Can this code have bottleneck or be resource-intensive?

It's code that runs 4 threads at 15-minute intervals. The last time I ran it, the first 15-minute batch copied quickly (20 files in 6 minutes), but the second 15-minute batch was much slower. It's sporadic, and I want to make sure that any bottleneck is a bandwidth limitation with the remote server rather than something in my code.
EDIT: I'm monitoring the latest run; the :00 and :45 batches copied in under 8 minutes each. The :15 batch hasn't finished and neither has :30, and both began at least 10 minutes before :45.
Here's my code:
static void Main(string[] args)
{
    Timer t0 = new Timer((s) =>
    {
        Class myClass0 = new Class();
        myClass0.DownloadFilesByPeriod(taskRunDateTime, 0, cts0.Token);
        Copy0Done.Set();
    }, null, TimeSpan.FromMinutes(20), TimeSpan.FromMilliseconds(-1));

    Timer t1 = new Timer((s) =>
    {
        Class myClass1 = new Class();
        myClass1.DownloadFilesByPeriod(taskRunDateTime, 1, cts1.Token);
        Copy1Done.Set();
    }, null, TimeSpan.FromMinutes(35), TimeSpan.FromMilliseconds(-1));

    Timer t2 = new Timer((s) =>
    {
        Class myClass2 = new Class();
        myClass2.DownloadFilesByPeriod(taskRunDateTime, 2, cts2.Token);
        Copy2Done.Set();
    }, null, TimeSpan.FromMinutes(50), TimeSpan.FromMilliseconds(-1));

    Timer t3 = new Timer((s) =>
    {
        Class myClass3 = new Class();
        myClass3.DownloadFilesByPeriod(taskRunDateTime, 3, cts3.Token);
        Copy3Done.Set();
    }, null, TimeSpan.FromMinutes(65), TimeSpan.FromMilliseconds(-1));
}
public struct FilesStruct
{
    public string RemoteFilePath;
    public string LocalFilePath;
}
private void DownloadFilesByPeriod(DateTime TaskRunDateTime, int Period, Object obj)
{
    FilesStruct[] Array = GetAllFiles(TaskRunDateTime, Period);
    //Array has 20 files for the specific period.
    using (Session session = new Session())
    {
        // Connect
        session.Open(sessionOptions);
        TransferOperationResult transferResult;
        foreach (FilesStruct u in Array)
        {
            if (session.FileExists(u.RemoteFilePath)) //File exists remotely
            {
                if (!File.Exists(u.LocalFilePath)) //File does not exist locally
                {
                    transferResult = session.GetFiles(u.RemoteFilePath, u.LocalFilePath);
                    transferResult.Check();
                    foreach (TransferEventArgs transfer in transferResult.Transfers)
                    {
                        //Log that the file has been transferred
                    }
                }
                else
                {
                    using (StreamWriter w = File.AppendText(Logger._LogName))
                    {
                        //Log that the file already exists locally
                    }
                }
            }
            else
            {
                using (StreamWriter w = File.AppendText(Logger._LogName))
                {
                    //Log that the file does not exist remotely
                }
            }
            if (token.IsCancellationRequested)
            {
                break;
            }
        }
    }
}
Something is not quite right here. First, you're setting up 4 timers to run in parallel. If you think about it, there is no need: you don't need 4 threads running in parallel all the time, you just need to initiate tasks at specific intervals. So how many timers do you need? One.
The second problem: why TimeSpan.FromMilliseconds(-1)? What is the purpose of that? I can't figure out why you put it there, but I wouldn't.
The third problem, not related to multithreading but worth pointing out anyway, is that you create a new instance of Class each time, which is unnecessary. It would be necessary if your class needed constructor setup, or if your logic accessed different methods or fields of the class in some particular order. In your case all you want to do is call the method, so you don't need a new instance every time; just make the method you're calling static.
Here is what I would do:
1. Store the files you need to download in an array / List<>. Can't you see that you're doing the same thing every time? There's no need to write 4 different versions of that code; store the items in an array and just change the index in the call.
2. Set up the timer at, say, a 5-second interval. When it reaches the 20-min / 35-min / etc. mark, spawn a new thread to do the task (see the sketch below). That way a new task can start even if the previous one has not finished.
3. Wait for all threads to complete (terminate). When they do, check whether they threw exceptions, and handle/log them if necessary.
4. After everything is done, terminate the program.
For step 2, you have the option to use the new async keyword if you're using .NET 4.5. But it won't make a noticeable difference if you use threads manually.
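A minimal sketch of steps 1 and 2 could look like the following. All names here are illustrative, and it assumes DownloadFilesByPeriod has been made static as suggested above:

static readonly TimeSpan[] Offsets =
{
    TimeSpan.FromMinutes(20), TimeSpan.FromMinutes(35),
    TimeSpan.FromMinutes(50), TimeSpan.FromMinutes(65)
};
static readonly bool[] Started = new bool[Offsets.Length];

static Timer StartScheduler(DateTime taskRunDateTime, CancellationToken token)
{
    DateTime runStart = DateTime.UtcNow;
    // Keep the returned Timer referenced for as long as it should keep firing.
    return new Timer(_ =>
    {
        TimeSpan elapsed = DateTime.UtcNow - runStart;
        for (int period = 0; period < Offsets.Length; period++)
        {
            if (!Started[period] && elapsed >= Offsets[period])
            {
                Started[period] = true;
                int p = period; // copy for the closure
                // Run the download on its own task so a slow transfer never blocks the schedule.
                Task.Factory.StartNew(() => DownloadFilesByPeriod(taskRunDateTime, p, token));
            }
        }
    }, null, TimeSpan.Zero, TimeSpan.FromSeconds(5));
}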
And as for why it's so slow: why don't you check your system status using Task Manager? Is the CPU pegged, or is the network throughput being consumed by something else? You can easily tell from there.
The problem was the SFTP client.
The purpose of the console application was to loop through a List<> and download the files. I tried WinSCP and, even though it did the job, it was very slow. I also tested SharpSSH and it was even slower than WinSCP.
I finally ended up using SSH.NET which, at least in my particular case, was much faster than both WinSCP and SharpSSH. I think the problem with WinSCP is that there was no evident way of disconnecting after I was done. With SSH.NET I could connect/disconnect after every file download, something I couldn't do with WinSCP.
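For reference, the per-file connect/download/disconnect pattern with SSH.NET looks roughly like the sketch below; the host, the credentials, and the filesToDownload collection are placeholders rather than code from the original application:

using Renci.SshNet;

using (var sftp = new SftpClient("host", "user", "password"))
{
    foreach (FilesStruct u in filesToDownload)
    {
        sftp.Connect();
        using (var local = File.Create(u.LocalFilePath))
        {
            sftp.DownloadFile(u.RemoteFilePath, local);
        }
        sftp.Disconnect(); // drop the connection between files
    }
}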

Writing to file in a thread safe manner

Writing a StringBuilder to a file asynchronously. This code takes control of a file, writes a stream to it, and releases it. It deals with requests from asynchronous operations, which may come in at any time.
The FilePath is set per class instance (so the lock object is per instance), but there is potential for conflict since these classes may share FilePaths. That sort of conflict, as well as all other kinds from outside the class instance, would be dealt with by retries.
Is this code suitable for its purpose? Is there a better way to handle this that means less (or no) reliance on the catch-and-retry mechanic?
Also, how do I avoid catching exceptions that have occurred for other reasons?
public string Filepath { get; set; }
private Object locker = new Object();

public async Task WriteToFile(StringBuilder text)
{
    int timeOut = 100;
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    while (true)
    {
        try
        {
            //Wait for resource to be free
            lock (locker)
            {
                using (FileStream file = new FileStream(Filepath, FileMode.Append, FileAccess.Write, FileShare.Read))
                using (StreamWriter writer = new StreamWriter(file, Encoding.Unicode))
                {
                    writer.Write(text.ToString());
                }
            }
            break;
        }
        catch
        {
            //File not available, conflict with other class instances or application
        }
        if (stopwatch.ElapsedMilliseconds > timeOut)
        {
            //Give up.
            break;
        }
        //Wait and Retry
        await Task.Delay(5);
    }
    stopwatch.Stop();
}
How you approach this is going to depend a lot on how frequently you're writing. If you're writing a relatively small amount of text fairly infrequently, then just use a static lock and be done with it. That might be your best bet in any case because the disk drive can only satisfy one request at a time. Assuming that all of your output files are on the same drive (perhaps not a fair assumption, but bear with me), there's not going to be much difference between locking at the application level and the lock that's done at the OS level.
So if you declare locker as:
static object locker = new object();
You'll be assured that there are no conflicts with other threads in your program.
If you want this thing to be bulletproof (or at least reasonably so), you can't get away from catching exceptions. Bad things can happen. You must handle exceptions in some way. What you do in the face of error is something else entirely. You'll probably want to retry a few times if the file is locked. If you get a bad path or filename error or disk full or any of a number of other errors, you probably want to kill the program. Again, that's up to you. But you can't avoid exception handling unless you're okay with the program crashing on error.
By the way, you can replace all of this code:
using (FileStream file = new FileStream(Filepath, FileMode.Append, FileAccess.Write, FileShare.Read))
using (StreamWriter writer = new StreamWriter(file, Encoding.Unicode))
{
    writer.Write(text.ToString());
}
With a single call:
File.AppendAllText(Filepath, text.ToString());
Assuming you're using .NET 4.0 or later. See File.AppendAllText.
One other way you could handle this is to have the threads write their messages to a queue, and have a dedicated thread that services that queue. You'd have a BlockingCollection of messages and associated file paths. For example:
class LogMessage
{
    public string Filepath { get; set; }
    public string Text { get; set; }
}

BlockingCollection<LogMessage> _logMessages = new BlockingCollection<LogMessage>();
Your threads write data to that queue:
_logMessages.Add(new LogMessage { Filepath = "foo.log", Text = "this is a test" });
You start a long-running background task that does nothing but service that queue:
foreach (var msg in _logMessages.GetConsumingEnumerable())
{
    // of course you'll want your exception handling in here
    File.AppendAllText(msg.Filepath, msg.Text);
}
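One way to start that consumer is shown below; ProcessLogMessages is a hypothetical method that wraps the loop above, and LongRunning hints the scheduler to give the consumer its own thread:

Task consumer = Task.Factory.StartNew(
    ProcessLogMessages,
    CancellationToken.None,
    TaskCreationOptions.LongRunning,
    TaskScheduler.Default);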
Your potential risk here is that threads create messages too fast, causing the queue to grow without bound because the consumer can't keep up. Whether that's a real risk in your application is something only you can say. If you think it might be a risk, you can put a maximum size (number of entries) on the queue so that if the queue size exceeds that value, producers will wait until there is room in the queue before they can add.
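If you do decide to bound the queue, the capacity is just a constructor argument (the figure of 10,000 below is an arbitrary illustration):

// Once the queue holds 10,000 pending messages, producers calling Add will
// block until the consumer catches up.
BlockingCollection<LogMessage> _logMessages = new BlockingCollection<LogMessage>(boundedCapacity: 10000);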
You could also use ReaderWriterLock; it is considered a more "appropriate" way to control thread safety when dealing with read/write operations...
To debug my web apps (when remote debugging fails) I use the following ('debug.txt' ends up in the \bin folder on the server):
public static class LoggingExtensions
{
    static ReaderWriterLock locker = new ReaderWriterLock();

    public static void WriteDebug(string text)
    {
        try
        {
            locker.AcquireWriterLock(int.MaxValue);
            System.IO.File.AppendAllLines(Path.Combine(Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase).Replace("file:\\", ""), "debug.txt"), new[] { text });
        }
        finally
        {
            locker.ReleaseWriterLock();
        }
    }
}
Hope this saves you some time.

Performance issues using task factory in C#

I am working on a logging system for a web application which logs a sequence of events into a dictionary object before sending it to my logging object using Task.Factory.StartNew(() => iLogEventSave()). The logger seemed to work fine, but in some instances some events were not being saved properly, so I used a lock() statement to correct the issue. That seemed to do the trick, but the application's performance has decreased dramatically as a result. How can I have the UI/page render without having to wait for the tasks to finish their work?
Below is the code:
private static readonly object Locker = new object();

public void iLogEventSave(object state)
{
    XmlDocument doc = new XmlDocument();
    IDictionary<string, string> EventDetails = (IDictionary<string, string>)state;
    string logFile = "";
    if (ConfigurationManager.AppSettings["Log_File_Path"].ToString() == "")
    {
        logFile = HttpRuntime.AppDomainAppPath + "Logs\\" + DateTime.Now.ToString("yyyy_MM_dd") + ".txt";
    }
    else
    {
        logFile = ConfigurationManager.AppSettings["Log_File_Path"].ToString() + DateTime.Now.ToString("yyyy_MM_dd") + ".txt";
    }
    lock (Locker)
    {
        if (File.Exists(logFile))
        {
            doc.Load(logFile);
        }
        else
        {
            var root = doc.CreateElement("Log");
            doc.AppendChild(root);
        }
        var el = (XmlElement)doc.DocumentElement.AppendChild(doc.CreateElement("Event"));
        foreach (KeyValuePair<string, string> item in EventDetails)
        {
            XmlElement Desc = doc.CreateElement("Details");
            Desc.SetAttribute(item.Key.ToString(), item.Value);
            el.AppendChild(Desc);
        }
        doc.Save(logFile);
    }
}
If your log did not save several events while being executed asynchronously, you have an unhandled error that you did not address. Considering that you're using a file, I'm going to go out on a limb and say that it failed because two threads were competing for access to the same log file and the first thread to grab it locked the other one out. This is why your lock now works: it prevents other threads from trying to grab the file at the same time.
But logging to a file means that you've effectively restricted yourself to one thread at a time, and to dealing with the entire file as it grows: you have to load more and more, append more and more, and holding the lock means that the more threads are waiting to log events, the higher your overhead. All of this can certainly add up to a decrease in performance.
May I recommend using a database table to log events? File I/O is very expensive, resource- and time-wise. Databases have less overhead and far better throughput in exactly these scenarios.
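A rough sketch of what that can look like with plain ADO.NET is shown below. The "Logging" connection string and the EventLog table and columns are assumptions, not part of the original code; each event becomes a set of parameterized INSERTs, and there is no shared file to lock or document to reload and re-save on every write.

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;

public void iLogEventSaveToDb(IDictionary<string, string> eventDetails)
{
    // One row per key/value pair of the event.
    using (var conn = new SqlConnection(ConfigurationManager.ConnectionStrings["Logging"].ConnectionString))
    using (var cmd = new SqlCommand("INSERT INTO EventLog (LoggedAt, Name, Value) VALUES (@at, @name, @value)", conn))
    {
        conn.Open();
        cmd.Parameters.Add("@at", SqlDbType.DateTime);
        cmd.Parameters.Add("@name", SqlDbType.NVarChar, 100);
        cmd.Parameters.Add("@value", SqlDbType.NVarChar, -1); // nvarchar(max)
        foreach (var item in eventDetails)
        {
            cmd.Parameters["@at"].Value = DateTime.Now;
            cmd.Parameters["@name"].Value = item.Key;
            cmd.Parameters["@value"].Value = item.Value;
            cmd.ExecuteNonQuery();
        }
    }
}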
