We are having locking issues with Lucene.Net throwing a LockObtainFailedException. It is a multi-tenanted site where each customer gets their own physical search index on disk, and a static list of IndexWriters is used, one per index, to control changes.
We call the following methods on the IndexWriter:
AddDocument();
DeleteDocuments();
DeleteAll();
Optimize();
Commit();
I have noticed that we never call Close() or Dispose() on the IndexWriter, and wanted to know if this was good practice and could be the cause of the issues.
Thanks Dave
The docs say yes, but only when you're killing off the application itself - otherwise, no. Here are the docs for IndexWriter.Dispose in Lucene.Net 4.8:
Commits all changes to an index, waits for pending merges to complete,
and closes all associated files.
This is a "slow graceful shutdown" which may take a long time ...
Note that this may be a costly operation, so, try to re-use a single
writer instead of closing and opening a new one. See Commit() for
caveats about write caching done by some IO devices.
https://github.com/apache/lucenenet/blob/master/src/Lucene.Net/Index/IndexWriter.cs#L996
So, you should call .Dispose(), but typically only once, when you're shutting down the app. It is not clear, however, whether you also need to Dispose() its underlying objects.
You're already calling .Commit(), which they recommend instead. I would guess your problem is actually related to threading. I'm just learning Lucene, but if I were in your position I'd try putting a standard .NET lock around any write calls to Lucene, so that only one thread has write access at a time. If that solves your issue, you know it was threading.
Locks are awfully painful, and Lucene writes may take a long time, so even if the lock solves this issue it may introduce other problems, like two threads attempting to write and one hanging or failing, depending on how your code is written. If that arises, you'd probably want to implement a write queue: threads quickly hand off what they'd like written to a cheap data structure like ConcurrentQueue, have each enqueue start the write operation if none is running, and keep dequeuing until everything's written out - then back to sleep.
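A minimal sketch of that write-queue idea (the class and method names here are hypothetical, not part of Lucene.Net's API):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical single-writer queue: callers enqueue write operations and
// at most one thread at a time drains them against the shared IndexWriter.
public class IndexWriteQueue
{
    private readonly ConcurrentQueue<Action> _pending = new ConcurrentQueue<Action>();
    private int _draining; // 0 = idle, 1 = some thread is currently draining

    public void Enqueue(Action writeOp)
    {
        _pending.Enqueue(writeOp);

        // Only become the drainer if nobody else already is.
        if (Interlocked.CompareExchange(ref _draining, 1, 0) == 0)
        {
            try
            {
                while (_pending.TryDequeue(out var op))
                    op(); // e.g. () => writer.AddDocument(doc)
            }
            finally
            {
                Volatile.Write(ref _draining, 0);
            }
        }
    }
}
```

Note one subtlety with this simple flag check: an item enqueued just as the drainer exits can sit in the queue until the next Enqueue call. A BlockingCollection with a dedicated consumer thread avoids that race at the cost of a permanently parked thread.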
Calling Close()/Dispose() when you don't need the object any longer is always a good idea. There is a reason why a developer exposes these methods, and typically the documentation gives additional hints about when to use them.
I also advise wrapping every IDisposable object in a using-block, which calls Dispose() automatically.
This gives objects the ability to clean up and free resources. For purely managed objects this isn't really important, since the garbage collector will take care of them sooner or later, but for system objects or handles, such as file-system handles, Dispose becomes important: those handles might otherwise stay open.
In the case of the Lucene IndexWriter I'm not perfectly sure, but since it uses files for its index (which is what I assume), you have a reason why Dispose should be called.
When handles/connections/etc. stay open, it can lead to exactly such exceptions. So yes, you should use Close()/Dispose().
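For IDisposable objects whose lifetime fits inside a single method, the using-block pattern mentioned above looks like this (the file name is just an example):

```csharp
using System.IO;

// Dispose() is called automatically when the block exits,
// even if an exception is thrown, so the file handle is released.
using (var reader = new StreamReader("data.txt"))
{
    string firstLine = reader.ReadLine();
}
```

A long-lived object like a per-tenant IndexWriter doesn't fit a using-block, of course - there you'd dispose it once, at application shutdown, as the other answer describes.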
How do I ensure that the StringBuilder object is GC'ed in a multi-threaded environment, given that I can't use the using keyword?
I'm using StringBuilder across multiple threads and I create a new instance of StringBuilder for each thread being called.
I'm concerned about the performance, specifically the memory taken by so many instances of StringBuilder being created.
Is there an alternative to make sure the GC frees the memory for me? (Like calling Dispose etc.)
EDIT: The StringBuilder class does not implement IDisposable, so I can't use the using keyword.
EDIT: I don't want to force the GC.Collect() since there are multiple threads running at the same time, and they're sharing resources between them. Some of these threads are lying dormant and are just listening for events to turn active again.
No, StringBuilder doesn't implement IDisposable, so it can't be used with using.
However, it wouldn't help anyway. Dispose has nothing to do with the GC. There may be cases where it helps, but it cannot do anything that dropping the reference doesn't already do - the only exception being unmanaged resources (which must also be released in the finalizer).
If there is some managed internal resource, Dispose could get rid of that reference - however, that isn't really the proper pattern. A Disposed object should be dead for good, so you shouldn't keep a reference to it anyway.
The few cases where Dispose is completely necessary are in things like event handlers and CancellationToken (though that leak has been fixed in .NET 4.5). In any case, though, the whole point is to get rid of the references, so that GC can collect them - it will not cause the GC to collect them.
If you find yourself doing a lot of operations on StringBuilder (and it actually causes a performance problem!), you should probably think about reusing older StringBuilders rather than finding out how to dispose of them faster. You can reuse a StringBuilder as many times as you like, which is very handy when doing high-performance string processing.
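A quick sketch of that reuse pattern - Clear() resets the length but keeps the already-allocated internal buffer:

```csharp
using System.Collections.Generic;
using System.Text;

var sb = new StringBuilder(capacity: 1024);
var results = new List<string>();

for (int i = 0; i < 3; i++)
{
    sb.Clear(); // reuse the same internal buffer instead of allocating a new builder
    sb.Append("item-").Append(i);
    results.Add(sb.ToString());
}
// results now holds "item-0", "item-1", "item-2"
```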
The most important thing when considering application performance, especially on platforms like .NET, is profiling. Don't guess - measure. .NET's GC is actually pretty impressive and rarely needs manual "help" from you.
EDIT:
As for your edit, if you're thinking that creating a StringBuilder for each thread is too much work, you're probably creating way more threads than is reasonable. If you have a performance problem, look into better work scheduling rather than better StringBuilder disposal patterns. There is no reason to keep a thread around just to wait for some event - that's what asynchronous I/O is for. So in truth, there really isn't much of a reason to have more than about 2 * CPU-core-count threads, give or take.
There's no way to explicitly collect any memory in .NET - and it really is a good thing. The only thing that comes close is GC.Collect, but it is by far more likely to cause a performance problem rather than solving it. The only place I've seen it being useful is in performance tests. It's also used in WPF, and it causes quite interesting performance issues.
You shouldn't have to worry about an object being GC'ed as long as you don't keep any unnecessary references to it. Hence, to ensure an object gets GC'ed, make sure you don't hold any reference to it: set any variables referencing the object to null, and the GC will take care of the rest.
Another approach to consider is to have thread specific cache of string builder object(s).
Assuming you are not manually spawning a new thread every time (i.e. threads are not being created and destroyed, but instead come from a thread pool), the following implementation of the internal StringBuilderCache class in the .NET Framework may be a good starting point -
http://referencesource.microsoft.com/#mscorlib/system/text/stringbuildercache.cs
A more sophisticated implementation for reference can be found at -
http://source.roslyn.codeplex.com/#Microsoft.CodeAnalysis.Workspaces/Formatting/StringBuilderPool.cs
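A stripped-down sketch of that per-thread cache idea, loosely modeled on the framework's internal StringBuilderCache (the capacity limit is an arbitrary example):

```csharp
using System;
using System.Text;

public static class PerThreadStringBuilderCache
{
    private const int MaxCachedCapacity = 360; // arbitrary cap, so huge buffers aren't kept alive

    [ThreadStatic]
    private static StringBuilder t_cached; // one slot per thread, so no locking is needed

    public static StringBuilder Acquire()
    {
        var sb = t_cached;
        if (sb != null)
        {
            t_cached = null; // hand it out; the slot stays empty while the builder is in use
            sb.Clear();
            return sb;
        }
        return new StringBuilder(MaxCachedCapacity);
    }

    public static string GetStringAndRelease(StringBuilder sb)
    {
        string result = sb.ToString();
        if (sb.Capacity <= MaxCachedCapacity)
            t_cached = sb; // only cache small builders back into the slot
        return result;
    }
}
```

Because each pooled thread keeps its own slot, builders are recycled across work items without any cross-thread coordination.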
You can try GC.Collect() (see the MSDN documentation).
Make sure you know what you are doing before you start manipulating the GC manually. It does a pretty good job of balancing time vs. space. Calling GC.Collect will free memory, but it will definitely slow things down. What are you trying to optimize for?
I'm making a cool (imo) T4 template that will make caching a lot easier. One of the options I'm considering for this template is a "load once" type of functionality, though I'm not sure how safe it is.
Basically, I want to make it so you can do something like this:
var post = MyCache.PostsCache.GetOrLockLoad(id, () => LoadPost(id));
and basically make it so that when the cache must be loaded, it places a blocking lock across PostsCache. Other threads would then block until LoadPost() is done, so LoadPost is executed only once per cache miss. The traditional way, LoadPost is executed any time the cache is empty - possibly multiple times, if multiple requests come in before the cache is loaded the first time.
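For concreteness, a minimal sketch of what such a GetOrLockLoad could look like (the names and the single cache-wide lock are my assumptions, not an existing API):

```csharp
using System;
using System.Collections.Concurrent;

public class SimpleCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _items = new ConcurrentDictionary<TKey, TValue>();
    private readonly object _loadLock = new object(); // one lock for the whole cache, as described above

    public TValue GetOrLockLoad(TKey key, Func<TValue> load)
    {
        // Fast path: no locking when the value is already cached.
        if (_items.TryGetValue(key, out var value))
            return value;

        lock (_loadLock)
        {
            // Double-check: another thread may have loaded it while we waited.
            if (_items.TryGetValue(key, out value))
                return value;

            value = load();      // runs at most once per cache miss
            _items[key] = value;
            return value;
        }
    }
}
```

One trade-off to weigh: a single lock serializes misses even for unrelated keys; a per-key lock, or a Lazy&lt;TValue&gt; stored in the dictionary, narrows the blocking to threads that actually want the same entry.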
Is this a reasonable thing to do, or is blocking other threads for something like this dangerous or wasteful? I'm thinking something along the lines that the thread locking overheads are greater than most operations, but maybe not?
Has anyone seen this kind of thing done and is it a good idea or just dangerous?
Also, although it's designed to run on any cache and application type, it's initially being targeted at ASP.NET's built-in caching mechanism.
This seems ok, since in theory the requests after the first will only wait about as long as it would have taken for them to load the data themselves anyway.
But it still feels a bit iffy - what if the first loader thread gets held up by some intermittent issue that doesn't affect other threads? It feels like it would be safer to let each thread attempt the load independently.
It's also adding the complexity and overhead of the locking mechanisms. Keep in mind the more locking you do, the more risk you introduce of getting a deadlock condition (in general). Although in your case, as long as there's no funky locking going on in the LoadPost method it shouldn't be an issue.
Given the risks, I think you would be better off going with a non-locking option.
After all, for any given thread the wait time is pretty much the same - either the time taken to load, or the time spent waiting for the first thread to load.
I'm always a little uncomfortable when a non-concurrent option is used over a concurrent one, especially if the gain seems marginal.
This question has been bugging me for a while: I've read in MSDN's DirectX article the following:
The destructor (of the application) should release any (Direct2D) interfaces stored...
DemoApp::~DemoApp()
{
    SafeRelease(&m_pDirect2dFactory);
    SafeRelease(&m_pRenderTarget);
    SafeRelease(&m_pLightSlateGrayBrush);
    SafeRelease(&m_pCornflowerBlueBrush);
}
Now, if all of the application's data is getting released/deallocated at termination (source), why would I go through the trouble of writing a function to release each of them individually? It makes no sense!
I keep seeing this more and more over time, and it's really bugging me.
The MSDN article above is the first time I've encountered this, so it made sense to mention it of all other cases.
Well, since so far I didn't actually ask my questions, here they are:
Do I need to release something before termination? (do explain why please)
Why did the author of the MSDN article choose to do that?
Does the answer differ between native and managed code? I.e., do I need to make sure everything's disposed at the end of the program when writing a C# program? (I don't know about Java, but if disposal exists there, I'm sure other members would appreciate an answer for that too.)
Thank you!
You don't need to worry about managed content when your application is terminating. When the entire process's memory is torn down all of that goes with it.
What matters is unmanaged resources.
If you have a lock on a file and the managed wrapper for the file handler is taken down when the application closes without you ever releasing the lock, you've now thrown away the only key that would allow access to the file.
If you have an internal buffer (say for logging errors) you may want to flush it before the application terminates. Not doing so would potentially mean the fatal error that caused the application to end isn't logged. That could be...bad.
If you have network connections open you'll want to close them. If you don't then the OS likely won't do it for you (at least not for a while; eventually it might notice the inactivity) and that's rather rude to whoever's on the other end. They may be continuing to listen for a response, or continuing to send you information, not knowing that you're not there anymore.
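For the buffer-flush case, that cleanup can be as simple as a try/finally around the program's main work (the log file and RunApplication are hypothetical stand-ins):

```csharp
using System;
using System.IO;

static void Main()
{
    // StreamWriter buffers in memory; anything buffered is lost
    // if the process dies before Flush()/Dispose() runs.
    using (var log = new StreamWriter("app.log"))
    {
        try
        {
            RunApplication(log);
        }
        catch (Exception ex)
        {
            log.WriteLine($"Fatal: {ex}"); // record the error that killed us
            throw;
        }
        finally
        {
            log.Flush(); // push the last entries to disk even on a fatal error
        }
    }
}

static void RunApplication(StreamWriter log) => log.WriteLine("started");
```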
Now, if all of the application's data is getting released/deallocated
at the termination (source) why would I go through the trouble to make
a function in-order-to/and release them individually?
A number of reasons. One immediate reason is because not all resources are memory. Only memory gets reclaimed at process termination. If some of your resources are things like shared mutexes or file handles, not releasing those resources could mess up other programs or subsequent runs of your program.
I think there's a more important, more fundamental reason though. Not cleaning up after yourself is just lazy, sloppy programming. If you are lazy and sloppy in cleanup at termination, are you lazy and sloppy at other times? If your tendency is to be lazy and sloppy, and you only override that tendency in specific areas where you're cognizant of potential problems, then your tendency is to be lazy and sloppy. What if there are potential problems you're not cognizant of? How can you rely on an overall philosophy of lazy, sloppy programming to write correct, robust programs?
Don't be that guy. Clean up after yourself.
My .NET service cleans up all its unmanaged resources by calling resourceName.Dispose() in a finally block before the Main() loop exits.
Do I really have to do this?
Am I correct in thinking that I can’t leak any resources because the process is ending? Windows will close any handles that are no longer being used, right?
There is no limit to the types of resources that may be encapsulated by an object implementing IDisposable. The vast majority of resources encapsulated by IDisposable objects will be cleaned up by the operating system when a process shuts down, but some programs may use resources the operating system knows nothing about.

For example, a database application which requires a locking pattern that isn't supported by the underlying database might use one or more tables to keep track of what things are "checked out" and by whom. A class which "checks out" resources using such tables could ensure in its Dispose method that everything gets checked back in, but if the program shuts down without the class having a chance to clean up the tables, the resources guarded by those tables would be left dangling. Since the operating system would have no clue what any of those tables mean, it would have no way of cleaning them up.
It's probably okay to skip this, in that specific case.
The first thing to understand is that while ending the process should by itself be enough to cleanup most things, it's possible for some unmanaged resources to be left in a bad or unclosed state. For example, you might have an app that is licensed per seat, and when the app closes you need to update a database record somewhere to release your license. If a process terminates incorrectly, nothing will make that update happen, and you could end up locking people out of your software. Just because your process terminates isn't an excuse not to do cleanup.
However, in the .Net world with the IDisposable pattern you can get a little more insurance. When the process exits, all remaining finalizers will run. If the Dispose() pattern is implemented properly (and that's a bigger "if" than it should be), the finalizers are still there to take care of any remaining unmanaged resources for their objects...
However, it's good practice to always be in the habit of correctly disposing these things yourself. And FWIW, just calling .Dispose() is not enough to do this correctly: your .Dispose() call must be part of a finally block (including the implicit finally block you get with a using statement).
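To illustrate that last point - without the finally, an exception between construction and disposal leaks the handle (Process and the file name are placeholders):

```csharp
using System.IO;

static void Process(Stream s) { /* placeholder for whatever work uses the stream */ }

static void Demo()
{
    // Risky: if Process() throws, stream.Dispose() never runs and the handle leaks
    // until a finalizer eventually notices.
    var stream = new FileStream("data.bin", FileMode.Open);
    Process(stream);
    stream.Dispose();

    // Correct: Dispose() runs on every exit path, including exceptions.
    using (var safe = new FileStream("data.bin", FileMode.Open))
    {
        Process(safe);
    }
}
```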
I am trying to write to different pieces of a large file using multiple threads, just like a segmented file downloader would do.
My question is, what is the safe way to do this? Do I open the file for writing, create my threads, passing the Stream object to each thread? I don't want an error to occur because multiple threads are accessing the same object at potentially the same time.
This is C# by the way.
I would personally suggest that you fetch the data in multiple threads, but actually write it from a single thread. It's likely to be considerably simpler that way. You could use a producer/consumer queue (which is really easy in .NET 4), and each producer would feed pairs of (index, data). The consumer thread could then just sequentially seek, write, seek, write, etc.
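A sketch of that producer/consumer arrangement using BlockingCollection (the file name, segment count, and FetchSegment are illustrative stand-ins for the real download logic):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class SegmentedWriter
{
    public static void Download(string path, long fileLength)
    {
        var queue = new BlockingCollection<(long Offset, byte[] Data)>(boundedCapacity: 64);

        // The single consumer owns the stream: seek + write, no shared-stream issues.
        var writerTask = Task.Run(() =>
        {
            using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
            {
                foreach (var (offset, data) in queue.GetConsumingEnumerable())
                {
                    fs.Seek(offset, SeekOrigin.Begin);
                    fs.Write(data, 0, data.Length);
                }
            }
        });

        // Producers (e.g. one per segment) just enqueue (offset, data) pairs.
        Parallel.For(0, 4, segment =>
        {
            long offset = segment * (fileLength / 4);
            byte[] data = FetchSegment(segment); // hypothetical network fetch
            queue.Add((offset, data));
        });

        queue.CompleteAdding(); // tell the consumer no more items are coming
        writerTask.Wait();
    }

    private static byte[] FetchSegment(int segment) => new byte[0]; // stub for the real download
}
```

The bounded capacity also gives you back-pressure for free: fast downloaders block briefly rather than buffering the whole file in memory ahead of the disk.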
If this were Linux programming, I would recommend you look into the pwrite() call, which writes a buffer to a file at a given offset. A cursory search of the C# documentation doesn't turn up anything like this, however. Does anyone know if a similar function exists?
Although one might be able to open multiple streams pointing to the same file, and use a different stream in each thread, I would second the advice of using a single thread for the writing absent some reason to do otherwise. Even if two or more threads can safely write to the same file simultaneously, that doesn't mean it's a good idea. It may be helpful to have the unified thread attempt to sequence writes in a sensible order to avoid lots of random seeking; the performance benefit from that would depend upon how effectively the OS could cache and schedule random writes. Don't go crazy optimizing such things if it turns out the OS does a good job, but be prepared to add some optimization if the OS default behavior turns out to perform poorly.