I was wondering if I could safely read from an XmlDocument object using SelectNodes() and SelectSingleNode() from multiple threads with no problems. MSDN says that they are not guaranteed to be thread safe. If SelectNodes() and SelectSingleNode() do present problems when run from multiple threads, could I use proper locking to avoid any issues? I have a WCF service set up that needs to grab a chunk of XML from a database and select some info out of this XML. I'd like to cache the XML to avoid hitting the database so often, but I'm concerned about thread safety and performance. Is there a better way to go about doing this? Thanks
Here's the deal. If the documentation says that instance methods are not guaranteed to be threadsafe then you better take note. And if you do decide to use the class in a multithreaded scenario without the proper synchronization mechanisms then you need to be 1) aware of the consequences of ignoring the documentation and 2) prepared for all of your assumptions to be invalidated in future versions of the class. This advice is valid even for methods that seem to only be reading internal state.
How do you know that SelectNodes and SelectSingleNode do not modify an internal variable? Because if they do then they are definitely not threadsafe! Now, I happened to use Reflector to look inside and I can see that they do not modify any internal variables. But how do you know that will not change in a future version?
Now, since we know that in reality SelectNodes and SelectSingleNode do not modify the internal state of the class, they may be safe for multithreaded operations despite the warning if and only if the following conditions apply.
After the XmlDocument is initialized no other method besides SelectNodes or SelectSingleNode is called...ever. Because I have not examined all methods on the XmlDocument class, I cannot say which ones modify the internal state of the class and which do not. As a result, I would consider every method other than the two I just mentioned a possible risk to your lock-free approach.
An explicit or implicit memory barrier is created after the XmlDocument is initialized on one thread and before SelectNodes or SelectSingleNode is called on another thread. I should note that a memory barrier will most likely be created implicitly for you as a result of getting the multithreaded environment set up. But I can think of some subtle scenarios where this breaks down.
My advice...take the warning in the documentation literally and use the appropriate synchronization mechanisms.
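As a rough illustration of "appropriate synchronization", here is a minimal sketch in which every access to the cached document goes through a single lock. The XmlCache class and its method names are hypothetical; only XmlDocument, LoadXml, SelectSingleNode, and InnerText come from the framework.

```csharp
using System;
using System.Xml;

// Hypothetical wrapper: all access to the cached XmlDocument goes through one lock.
public sealed class XmlCache
{
    private readonly object _sync = new object();
    private readonly XmlDocument _doc = new XmlDocument();

    // Replace the cached XML, e.g. after re-reading it from the database.
    public void Load(string xml)
    {
        lock (_sync)
        {
            _doc.LoadXml(xml);
        }
    }

    // Return the text of the first node matching the XPath, or null if none matches.
    public string GetValue(string xpath)
    {
        lock (_sync)
        {
            XmlNode node = _doc.SelectSingleNode(xpath);
            // Copy the string out while still holding the lock;
            // XmlNode references are deliberately not handed to callers.
            return node == null ? null : node.InnerText;
        }
    }
}
```

Returning plain strings rather than XmlNode objects keeps callers from touching the document outside the lock. Whether a single lock is fast enough depends on how long the XPath queries take relative to the rest of the request.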
If you are going to both write to and read from the XML document, you need to synchronize those two operations if you don't want to run into race conditions. And if you care about performance (who doesn't?), a ReaderWriterLockSlim might perform better than a plain lock.
SelectNodes / SelectSingleNode should be safe (they only read data). Of course you need to synchronize those with any method that actually modifies the XML.
You could also use the MsXml FreeThreadedDOMDocument model instead of the classical DomDocument when you call createInstance.
Beware that, according to this article, FreeThreadedDOMDocument is 7x to 10x slower than the classical DomDocument.
There are some cases where I really like using Guava's Striped class.
Is there an equivalent in C#?
It doesn't look like there is a direct equivalent, but there are some lockless thread-safe collection options (I'm not sure what you're trying to achieve, so I can't say if they will work for your scenario). Have a look at the System.Collections.Concurrent Namespace.
In particular, ConcurrentBag, ConcurrentQueue, ConcurrentStack, and ConcurrentDictionary all have different locking/lockless thread-safe strategies. Some are explained in this blog post.
You might be able to get what you want via the Partitioner class, although I am unsure of the implementation.
@Behrooz is incorrect in saying that all .NET Framework types only use a single lock for the entire list. Take a look at the source for ConcurrentDictionary. Line 71 suggests that this class is implemented using multiple locks.
If you really want to, you could write your own version. The source for the Guava Striped is: https://github.com/google/guava/blob/master/guava/src/com/google/common/util/concurrent/Striped.java
I think the best you can do is implement your own, because all .NET Framework types offer only one lock for the entire list.
To do that, you can take the result of GetHashCode(), apply modulus (%) with the number of stripes you want, and use the result as an index into a Tuple<TLock, List<T>>[], where TLock can be any kind of lock defined in the System.Threading namespace and T is the type you want to store/access.
With this you can decide how you want your stripes to be stored. There are choices like HashSet (inefficient in your case since you already use some of the bits to calculate the stripe index), SortedSet, List, Array.
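The hash-and-modulus idea can be sketched as follows. This is only a minimal stand-in for Guava's Striped, not a port of it: the StripedLock class and GetLock method are hypothetical names, and the stripes are plain objects meant to be used with the lock statement rather than a Tuple of lock and list.

```csharp
using System;

// Hypothetical minimal stand-in for Guava's Striped<Lock>: a fixed array of lock objects.
public sealed class StripedLock
{
    private readonly object[] _stripes;

    public StripedLock(int stripeCount)
    {
        if (stripeCount <= 0) throw new ArgumentOutOfRangeException(nameof(stripeCount));
        _stripes = new object[stripeCount];
        for (int i = 0; i < stripeCount; i++) _stripes[i] = new object();
    }

    // Keys with equal hash codes always map to the same lock object.
    public object GetLock(object key)
    {
        if (key == null) throw new ArgumentNullException(nameof(key));
        // Mask off the sign bit so the index is never negative.
        int index = (key.GetHashCode() & 0x7FFFFFFF) % _stripes.Length;
        return _stripes[index];
    }
}
```

Usage would look like lock (stripes.GetLock(customerId)) { ... }. Unrelated keys can share a stripe, so more stripes means less false contention at the cost of a little memory.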
btw, Thank you for the question, It's gonna help me solve a problem I'm having.
Have you tried Tamarind from NuGet?
It's a C# port of Google's Guava library.
I think the ConcurrentDictionary can achieve a similar result.
Based on their documentation:
All these operations are atomic and are thread-safe with regards to all other operations on the ConcurrentDictionary class. The only exceptions are the methods that accept a delegate, that is, AddOrUpdate and GetOrAdd. For modifications and write operations to the dictionary, ConcurrentDictionary uses fine-grained locking to ensure thread safety. (Read operations on the dictionary are performed in a lock-free manner.) However, delegates for these methods are called outside the locks to avoid the problems that can arise from executing unknown code under a lock. Therefore, the code executed by these delegates is not subject to the atomicity of the operation.
As you can see, read operations are lock-free. That will allow you to avoid blocking the reading threads while others are inserting, for example.
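One way ConcurrentDictionary can stand in for striping is to hand out one lock object per key via GetOrAdd. This is a sketch under that assumption; the KeyedLocks class is a hypothetical name. As the quoted documentation notes, the factory delegate runs outside the dictionary's locks and may run more than once for the same key, but only one value is ever stored, so all callers get the same lock object.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper: one lock object per key, created on first use.
public sealed class KeyedLocks
{
    private readonly ConcurrentDictionary<string, object> _locks =
        new ConcurrentDictionary<string, object>();

    public object GetLock(string key)
    {
        // The factory may be invoked more than once under contention,
        // but GetOrAdd always returns the single value that was stored.
        return _locks.GetOrAdd(key, _ => new object());
    }
}
```

Unlike a fixed number of stripes, this dictionary grows with the number of distinct keys and never shrinks on its own, so it suits a bounded key space better than an open-ended one.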
Consider that I have a custom class called Terms and that class contains a number of string properties. Then I create a fairly large (say 50,000) List<Terms> object. This List<Terms> only needs to be read from but it needs to be read from by multiple instances of Task.Factory.StartNew (the number of instances could vary from 1 to 100s).
How would I best pass that list into the long running task? Memory isn't too much of a concern as this is a custom application for a specific use on a specific server with plenty of memory. Should I reference it or should I just pass it off as a normal argument into the method doing the work?
Since you're passing a reference it doesn't really matter how you pass it, it won't copy the list itself. As Ket Smith said, I would pass it as a parameter to the method you are executing.
The issue is List<T> is not entirely thread-safe. Reads by multiple threads are safe but a write can cause some issues:
It is safe to perform multiple read operations on a List, but issues can occur if the collection is modified while it’s being read. To ensure thread safety, lock the collection during a read or write operation.
From List<T>
You say your list is read-only so that may be a non-issue, but a single unpredictable change could lead to unexpected behavior and so it's bug-prone.
I recommend using ImmutableList<T> which is inherently thread-safe since it's immutable.
So long as you don't try to copy it into each separate task, it shouldn't make much difference: more a matter of coding style than anything else. Each task will still be working with the same list in memory: just a different reference to the same underlying list.
That said, sheerly as a matter of coding style and maintainability, I'd probably try to pass it in as a parameter to whatever method you're executing in your Task.Factory.StartNew() (or better yet, Task.Run() - see here). That way, you've clearly called out your task's dependencies, and if you decide that you need to get the list from some other place, it's more clear what you've got to change. (But you could probably find 20 places in my own code where I haven't followed that rule: sometimes I go with what's easier for me now than with what's likely to be easier for the me six months from now.)
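Putting the two suggestions together, a minimal sketch might pass an ImmutableList<Terms> as an ordinary parameter to the method that starts the task. The Terms shape, the TermsSearch class, and CountMatchesAsync are hypothetical; ImmutableList<T> lives in System.Collections.Immutable, which may need to be added as a NuGet package depending on the target framework.

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the question's Terms class, with read-only properties.
public sealed class Terms
{
    public Terms(string name) { Name = name; }
    public string Name { get; }
}

public static class TermsSearch
{
    // Every task receives the same list reference; the list itself is never copied.
    public static Task<int> CountMatchesAsync(ImmutableList<Terms> terms, string prefix)
    {
        return Task.Run(() =>
            terms.Count(t => t.Name.StartsWith(prefix, StringComparison.Ordinal)));
    }
}
```

ImmutableList<T> only guarantees that the list structure cannot change. The elements are safe to share here because this Terms has no setters; a Terms class with mutable properties would still need care.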
It looks like the mono implementation has no MemoryBarrier calls inside the ReaderWriterLockSlim methods. So when I make any changes inside a write lock, I can receive old cached values in another thread which uses a read lock.
Is it really possible? Should I insert MemoryBarrier before and after the code inside Read and Write lock?
Looking at (what I think is) the mono source, the Mono ReaderWriterLockSlim is implemented using Interlocked calls.
These calls include a memory barrier on x86, so you shouldn't need to add one.
As Peter correctly points out, the implementation does introduce a memory barrier, just not explicitly.
More generally: the C# language specification requires that certain side effects be well ordered with respect to locks. Though that rule only applies to locks entered with the C# lock statement, it would be exceedingly strange for a provider of a custom locking primitive to make a locking object that did not follow the same rules. You are wise to double-check, but in general you can assume that if it's a threading primitive then it has been designed to ensure that important side effects are well-ordered around it.
Is it necessary to lock LINQ statements as follows? If the lock is omitted, will any exceptions be encountered when multiple threads execute it concurrently?
lock (syncKey)
{
    return (from keyValue in dictionary
            where keyValue.Key > versionNumber
            select keyValue.Value).ToList();
}
PS: Writer threads do exist to mutate the dictionary.
Most types are thread-safe to read, but not thread-safe during mutation.
If none of the threads is changing the dictionary, then you don't need to do anything - just read away.
If, however, one of the threads is changing it then you have problems and need to synchronize. The simplest approach is a lock; however, this prevents concurrent readers even when there is no writer. If there is a good chance you will have more readers than writers, consider using a ReaderWriterLockSlim to synchronize - this will allow any number of readers (with no writer), or: one writer.
In 4.0 you might also consider a ConcurrentDictionary<,>
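For the reader/writer case, a minimal sketch of the question's query guarded by a ReaderWriterLockSlim could look like this. The VersionedStore class, its method names, and the Dictionary<int, string> element types are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Hypothetical wrapper around the question's dictionary.
public sealed class VersionedStore
{
    private readonly ReaderWriterLockSlim _rw = new ReaderWriterLockSlim();
    private readonly Dictionary<int, string> _dictionary = new Dictionary<int, string>();

    // Writers take the exclusive lock.
    public void Set(int version, string value)
    {
        _rw.EnterWriteLock();
        try { _dictionary[version] = value; }
        finally { _rw.ExitWriteLock(); }
    }

    // Any number of readers may hold the read lock at once while no writer is active.
    public List<string> GetNewerThan(int versionNumber)
    {
        _rw.EnterReadLock();
        try
        {
            // ToList() forces the query to run while the read lock is still held.
            return (from keyValue in _dictionary
                    where keyValue.Key > versionNumber
                    select keyValue.Value).ToList();
        }
        finally { _rw.ExitReadLock(); }
    }
}
```

The ToList() call matters: LINQ queries are lazily evaluated, so returning the bare query would let the caller enumerate the dictionary after the lock has been released.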
So long as the query has no side-effects (such as any of the expressions calling code that makes changes), there is no need to lock a LINQ statement.
Basically, if you don't modify the data (and nothing else is modifying the data you are using) then you don't need locks.
If you are using .NET 4.0, there is a ConcurrentDictionary that is thread safe. Here is an example of using a concurrent dictionary (admittedly not in a LINQ statement)
UPDATE
If you are modifying data then you need to use locks. If two or more threads attempt to access a locked section of code there will be a small performance loss as one or more of the threads waits for the lock to be released. NOTE: If you over-lock then you may end up with worse performance than you would if you had just built the code using a sequential algorithm from the start.
If you are only ever reading data then you don't need locks as there is no mutable shared state to protect.
If you do not use locks then you may end up with intermittent bugs where the data is not quite right or exceptions are thrown when collisions occur between readers and writers. In my experience, most of the time you may never get an exception, you just get corrupt data (except you don't necessarily know it is corrupt). Here is another example showing how data can be corrupted if you don't use locks or redesign your algorithm to cope.
You often get the best out of a system if you consider the constraints of developing in a parallel system from the outset. Sometimes you can re-write your code so it uses no shared data. Sometimes you can split the data up into chunks and have each thread/task work on its own chunk, then have some process at the end stitch it all back together again.
If your dictionary is static and the method where you run the query is not (or in any other concurrent-access scenario), and the dictionary can be modified from another thread, then yes, a lock is required; otherwise it is not.
Yes, you need to lock your shared resources when using LINQ in multi-threaded scenarios (EDIT: of course, only if your source collection is being modified, as Marc said; if you are only reading it, you don't need to worry about it). If you are using .NET 4 or the parallel extensions for 3.5 you could look at replacing your Dictionary with a ConcurrentDictionary (or use some other custom implementation anyway).
In the current implementation of CPython, there is an object known as the "GIL" or "Global Interpreter Lock". It is essentially a mutex that prevents two Python threads from executing Python code at the same time. This prevents two threads from being able to corrupt the state of the Python interpreter, but also prevents multiple threads from really executing together. Essentially, if I do this:
# Thread A
some_list.append(3)
# Thread B
some_list.append(4)
I can't corrupt the list, because at any given time, only one of those threads is executing, since they must hold the GIL to do so. Now, the items in the list might be added in some indeterminate order, but the point is that the list isn't corrupted, and two things will always get added.
So, now to C#. C# essentially faces the same problem as Python, so, how does C# prevent this? I'd also be interested in hearing Java's story, if anyone knows it.
Clarification: I'm interested in what happens without explicit locking statements, especially to the VM. I am aware that locking primitives exist for both Java & C# - they exist in Python as well: The GIL is not used for multi-threaded code, other than to keep the interpreter sane. I am interested in the direct equivalent of the above, so, in C#, if I can remember enough... :-)
List<String> s;
// Reference to s is shared by two threads, which both execute this:
s.Add("hello");
// State of s?
// State of the VM? (And if sane, how so?)
Here's another example:
class A
{
    public String s;
}
// Thread A & B
some_A.s = some_other_value;
// some_A's state must change: how does it change?
// Is the VM still in good shape afterwards?
I'm not looking to write bad C# code, I understand the lock statements. Even in Python, the GIL doesn't give you magic-multi-threaded code: you must still lock shared resources. But the GIL prevents Python's "VM" from being corrupted - it is this behavior that I'm interested in.
Most other languages that support threading don't have an equivalent of the Python GIL; they require you to use mutexes, either implicitly or explicitly.
Using lock, you would do this:
lock(some_list)
{
    some_list.Add(3);
}
and in thread 2:
lock(some_list)
{
    some_list.Add(4);
}
The lock statement ensures that only one thread at a time can execute a block guarded by the same lock object, some_list in this case. It does not protect the list itself: code that touches some_list without taking the lock is not blocked. See http://msdn.microsoft.com/en-us/library/c5kehkcz(VS.80).aspx for more information.
C# does not have an equivalent of Python's GIL.
Though they face the same issue, their design goals make them different.
With the GIL, CPython ensures that operations such as appending to a list from two threads are simple. It also means that only one thread is allowed to run at any time. This makes lists and dictionaries thread safe. Though this makes the job simpler and more intuitive, it makes it harder to exploit the advantage of multithreading on multicore machines.
With no GIL, C# does the opposite. It places the burden of integrity on the developer of the program but allows you to take advantage of running multiple threads simultaneously.
As per one of the discussions:
The GIL in CPython is purely a design choice of having a big lock vs a lock per object and synchronisation to make sure that objects are kept in a coherent state. This consists of a trade-off: giving up the full power of multithreading.
It has been observed that most problems do not suffer from this disadvantage, and there are libraries which help you solve this issue specifically when required.
That means that for a certain class of problems, the burden of utilizing multiple cores is passed to the developer so that the rest can enjoy the simpler, more intuitive approach.
Note: Other implementations like IronPython do not have a GIL.
It may be instructive to look at the documentation for the Java equivalent of the class you're discussing:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more elements, or explicitly resizes the backing array; merely setting the value of an element is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method. This is best done at creation time, to prevent accidental unsynchronized access to the list:
List list = Collections.synchronizedList(new ArrayList(...));
The iterators returned by this class's iterator and listIterator methods are fail-fast: if the list is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
Most complex data structures (for example lists) can be corrupted when used from multiple threads without locking.
Since changes of references are atomic, a reference always stays a valid reference.
But there is a problem when interacting with security-critical code. So any data structure used by critical code must be one of the following:
Inaccessible from untrusted code, and locked/used correctly by trusted code
Immutable (String class)
Copied before use (valuetype parameters)
Written in trusted code and uses internal locking to guarantee a safe state
For example, critical code cannot trust a list accessible from untrusted code. If it gets passed a List, it has to create a private copy, do its precondition checks on the copy, and then operate on the copy.
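The copy-then-check pattern can be sketched roughly as below. TrustedCode and SumPositive are hypothetical names, and the "all values positive" rule is just a placeholder precondition.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical trusted routine that must not rely on the caller's list staying unchanged.
public static class TrustedCode
{
    public static int SumPositive(List<int> untrusted)
    {
        if (untrusted == null) throw new ArgumentNullException(nameof(untrusted));

        // Snapshot first: the caller may keep mutating its own list on another thread.
        int[] copy = untrusted.ToArray();

        // Precondition checks run against the private copy...
        foreach (int value in copy)
        {
            if (value <= 0)
                throw new ArgumentException("All values must be positive.", nameof(untrusted));
        }

        // ...and so does the actual work, so the checked data is the data that gets used.
        int sum = 0;
        foreach (int value in copy) sum += value;
        return sum;
    }
}
```

If the caller mutates the list while the snapshot is being taken, the copy may be inconsistent or the copy may fail, but the check and the work still see the same data, which is the property the critical code needs.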
I'm going to take a wild guess at what the question really means...
In Python, data structures in the interpreter would get corrupted without the GIL because CPython uses a form of reference counting.
Both C# and Java use garbage collection and in fact they do use a global lock when doing a full heap collection.
Data can be marked and moved between "generations" without a lock. But to actually clean it up everything must come to a stop. Hopefully a very short stop, but a full stop.
Here is an interesting link on CLR garbage collection as of 2007:
http://vineetgupta.spaces.live.com/blog/cns!8DE4BDC896BEE1AD!1104.entry