There is a system that puts files into a folder on a disk.
I'm writing an executable (C#) that takes these files and sends them to a database.
My executable can be started multiple times at once (in parallel), and I have a concurrency problem when processing the files.
Example:
There are 50 files in folder.
The executable 1 takes 10 files to process.
The executable 2 takes 10 files to process.
My questions are:
How can I be sure that executable 2 doesn't take executable 1's files?
How can I lock the 10 files taken by executable 1?
How can I make this process thread safe?
Instead of using multiple processes, a more elegant solution would be to use one process with multiple threads.
You could for example have one thread that lists the files on the disk and adds them to a ConcurrentQueue, with multiple processing threads taking file names from the queue and doing the processing. Or use something like TPL Dataflow to do essentially the same thing. The absolute simplest option might be to create a list of all files and use PLINQ's AsParallel to parallelize the processing.
Note that I/O typically does not parallelize very well, so if "processing" mostly involves reading from and writing to disk, your gains might be smaller than expected.
If you insist on using multiple processes you could follow the same pattern by using an actor model, or otherwise having one management process handing out work to multiple worker processes.
I would probably not recommend relying only on file-locks, since it would be difficult to ensure that files are not processed twice.
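A rough sketch of that single-process, queue-based approach (the QueueProcessor helper and its parameters are illustrative names, not an established API):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class QueueProcessor
{
    // List the work once, then let N worker tasks drain a shared queue.
    // ConcurrentQueue.TryDequeue hands each item to exactly one worker,
    // which is what prevents a file from being processed twice.
    public static void ProcessAll(IEnumerable<string> files, int workerCount,
                                  Action<string> processFile)
    {
        var queue = new ConcurrentQueue<string>(files);
        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() =>
            {
                while (queue.TryDequeue(out string file))
                    processFile(file); // e.g. parse and insert into the database
            }))
            .ToArray();
        Task.WaitAll(workers);
    }
}
```

It would be called as, e.g., QueueProcessor.ProcessAll(Directory.EnumerateFiles(folder), 4, SendToDatabase). The exactly-once guarantee holds only within this one process, which is why a single process is preferable here.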
It is better to read the list of all the files first, then divide it between the threads. Otherwise, an exception can be used:
you can check whether a file is in use or not.
try
{
    // Opening with an exclusive share mode fails if any other
    // process currently has the file open.
    using (Stream stream = new FileStream("File.txt", FileMode.Open,
                                          FileAccess.Read, FileShare.None))
    {
        // File is ready for processing.
    }
}
catch (IOException)
{
    // The file is in use.
}
Hope this helps!
Hi, I am trying to mimic multithreading with a Parallel.ForEach loop. Below is my function:
public void PollOnServiceStart()
{
    constants = new ConstantsUtil();
    constants.InitializeConfiguration();
    HashSet<string> newFiles = new HashSet<string>();
    //string serviceName = MetadataDbContext.GetServiceName();
    var dequeuedItems = MetadataDbContext
        .UpdateOdfsServiceEntriesForProcessingOnStart();
    var handlers = Producer.GetParserHandlers(dequeuedItems);
    while (handlers.Any())
    {
        Parallel.ForEach(handlers,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            handler =>
            {
                Logger.Info($"Started processing a file remaining in Parallel ForEach");
                handler.Execute();
                Logger.Info($"Enqueing one file for next process");
                dequeuedItems = MetadataDbContext
                    .UpdateOdfsServiceEntriesForProcessingOnPollInterval(1);
                handlers = Producer.GetParserHandlers(dequeuedItems);
            });
        int filesRemovedCount = Producer.RemoveTransferredFiles();
        Logger.Info($"{filesRemovedCount} files removed from {Constants.OUTPUT_FOLDER}");
    }
}
So, to explain what's going on: the function UpdateOdfsServiceEntriesForProcessingOnStart() gets 4 file names (4 because of the parallel count) and adds them to a thread-safe object called ParserHandler. These objects are then put into the list var handlers.
My idea here is to loop through this handler list and call the handler.Execute().
Handler.Execute() copies files from the network location onto a local drive, parses each file and creates multiple output files, then sends those files to a network location and updates a DB table.
What I expect in this Parallel.ForEach loop is that after the handler.Execute() call, the UpdateOdfsServiceEntriesForProcessingOnPollInterval(1) function will add a new file name from the DB table it reads to the dequeued-items container, which will then be passed as one item to the recreated handler list. In this way, after one file is done executing, a new file will take its place in each parallel loop.
However, what happens is that while I do get a new file added, it doesn't get executed by the next available thread. Instead, the Parallel.ForEach has to finish executing the first 4 files before it picks up the very next file. Meaning, after the first 4 are run in parallel, only 1 file is run at a time, thereby nullifying the whole point of the parallel looping. The files added before all 4 initial files finish the Execute() call are never executed.
I.e.:
(Start1, Start2, Start3, Start4) all at once. What should happen is something like (End2, Start5), then (End3, Start6). But what is happening is (End2, End3, End1, End4), then Start5, End5, Start6, End6.
Why is this happening?
Because we want to deploy multiple instances of this service app on a machine, it is not beneficial to have a giant list waiting in a queue. This is wasteful, as the other app instances won't be able to process anything.
I am writing what should be a long comment as an answer, although it's an awful answer because it doesn't answer the question.
Be aware that parallelizing filesystem operations is unlikely to make them faster, especially if the storage is a classic hard disk. The head of the disk cannot be in N places at the same moment, and if you tell it to be, it will just waste most of its time traveling instead of reading or writing.
The best way to overcome the bottleneck imposed by accessing the filesystem is to make sure that there is work for the disk to do at all moments. Don't stop the disk's work to make a computation or to fetch/save data from/to the database. To make this happen you must have multiple workflows running concurrently: one workflow does only I/O with the disk, another talks continuously to the database, a third utilizes the CPU by doing one calculation after the other, etc. This approach is called task parallelism (doing heterogeneous work in parallel), as opposed to data parallelism (doing homogeneous work in parallel, the speciality of Parallel.ForEach). It is also called pipelining, because in order to make all workflows run concurrently you must place intermediate buffers between them, so you create a pipeline with the data flowing from buffer to buffer. Another term used for this kind of operation is the producer-consumer pattern, which describes a short pipeline consisting of only two building blocks, the first being the producer and the second the consumer.
The most powerful tool currently available¹ for creating pipelines is the TPL Dataflow library. It offers a variety of "blocks" (pipeline segments) that can be linked with each other and can cover most scenarios. You instantiate the blocks that will compose your pipeline, configure them, tell each one what work it should do, link them together, feed the first block with the initial raw data to be processed, and finally await the Completion of the last block. You can look at an example of using the TPL Dataflow library here.
¹ Available as built-in library in the .NET platform. Powerful third-party tools also exist, like the Akka.NET for example.
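To make the pipeline idea concrete, here is a minimal three-stage sketch; the stage bodies are placeholders, and it assumes the System.Threading.Tasks.Dataflow package is referenced:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class PipelineSketch
{
    static async Task Main()
    {
        // Stage 1: disk I/O, kept sequential so the disk head doesn't thrash.
        var read = new TransformBlock<string, string>(
            path => "contents of " + path, // placeholder for File.ReadAllText
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

        // Stage 2: CPU-bound parsing, parallelized across cores.
        var parse = new TransformBlock<string, int>(
            text => text.Length, // placeholder for the real parsing work
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 3: database writes, again sequential.
        var save = new ActionBlock<int>(n => Console.WriteLine($"saved {n}"));

        var opts = new DataflowLinkOptions { PropagateCompletion = true };
        read.LinkTo(parse, opts);
        parse.LinkTo(save, opts);

        foreach (var path in new[] { "a.csv", "b.csv" })
            read.Post(path); // feed the first block with the raw input

        read.Complete();       // no more input
        await save.Completion; // wait for the last block to drain
    }
}
```

The queues between linked blocks are the intermediate buffers described above; setting BoundedCapacity on the block options keeps them from growing without limit when one stage is slower than another.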
I have many processes that write files (any file is written only once).
They open, write and close the files.
I also have many processes that read files. File sizes can vary.
I need the following: when some process tries to read a file that is being written at that moment, it must read the full content only once the file has been closed after writing. I need to lock the file on write and make readers wait until it is unlocked.
Important: if a process tries to read a file and can't do that, it writes the file itself.
1. Try to read the file
2. If the file does not exist, write the file
So in async mode there can be more than one process that wants to write the file because it couldn't read it. I need to lock the file for writing, and all readers should wait for this.
File locking is an operating-system specific thing.
https://en.wikipedia.org/wiki/File_locking
Unix-Like systems
Unix-like systems generally support the flock(), fcntl() and lockf() system calls. However, of these only the fcntl()/lockf() advisory locks are part of the POSIX standard (flock() is not), so you need to consult operating-system specific documentation.
Documentation for Linux is here:
http://linux.die.net/man/3/lockf
http://linux.die.net/man/2/fcntl
http://linux.die.net/man/2/flock
Note that fcntl() does many things not just locking.
Note also that in most cases locking on Unix-like systems is advisory, i.e. a cooperative effort: both parties must participate, and simply ignoring the lock is a possibility. Mandatory locking is possible, but not used in the typical paradigm.
Windows
On Windows, mandatory file locks (share modes with CreateFile) and range locks (LockFileEx) are the norm, and advisory locks are not available, though they can be emulated (typically with a one-byte range lock at 0xffffffff or 0xffffffffffffffff; the locked portion does not have to actually exist, so this does not imply that the file is that big).
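From C#, those Windows share modes are expressed through the FileShare argument when constructing a FileStream; a small sketch of the effect, using a throwaway temp file:

```csharp
using System;
using System.IO;

class ShareModeDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();

        // FileShare.None corresponds to a zero share mode in CreateFile:
        // while this handle is open, every other open attempt fails.
        using (var exclusive = new FileStream(path, FileMode.Open,
                                              FileAccess.ReadWrite, FileShare.None))
        {
            try
            {
                using (var second = new FileStream(path, FileMode.Open)) { }
                Console.WriteLine("unexpected: second open succeeded");
            }
            catch (IOException)
            {
                Console.WriteLine("second open was rejected, as expected");
            }
        }

        File.Delete(path); // the lock disappears with the handle
    }
}
```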
Alternatives
An alternative for your described scenario is to simply create the file with a different name, then rename it when done.
E.g. if the file is to be called "data-20130719-112258-99823.csv", instead create one called "tmpdata-20130719-112258-99823.csv.tmp", and when it has been fully written, rename it.
The standard way to handle this issue is to write to a temp file name, then rename the file when writing is complete.
Other processes waiting for the file need to watch for the existence of the real file (using a file system watcher, or similar mechanism). When the file "appears", the writing is already complete.
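A minimal sketch of that write-then-rename pattern (the AtomicWriter name and ".tmp" suffix are illustrative, matching the convention in the answers above):

```csharp
using System.IO;

static class AtomicWriter
{
    // Write under a temporary name, then rename. Readers that only look
    // for the final name can never observe a half-written file, because
    // the final name does not exist until the rename happens.
    public static void WriteAtomically(string finalPath, string contents)
    {
        string tempPath = finalPath + ".tmp";
        File.WriteAllText(tempPath, contents);
        File.Move(tempPath, finalPath); // on the same volume this is a rename
    }
}
```

If the final name may already exist, the last step needs the newer File.Move(source, dest, overwrite: true) overload or File.Replace instead of a plain Move.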
I have a piece of code that is executed by n threads. The code contains,
for (;;) // repeat about 10,000 times
{
    lock (padlock)
    {
        File.AppendAllText(fileName, text);
    }
}
Basically, all threads write to the same set of 10,000 files, and hence the files are the shared resource. The issue is that the 10,000 open/write/close cycles performed by each thread are slowing down my program considerably. If I could share the file handles across threads, I'd be able to keep them open and write from different threads. Can someone throw some light on how I can proceed?
Let all threads write to a synchronized list.
Let one thread 'eat' the items on the list and write them to the file with a single file writer.
Presto, problem solved, in exchange for some extra memory usage.
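A sketch of that pattern with BlockingCollection playing the role of the synchronized list (StartWriter is an illustrative name, not an existing API):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class SingleWriter
{
    // Producers add lines to the collection from any thread; exactly one
    // consumer task owns the TextWriter, so the file is opened only once.
    public static Task StartWriter(BlockingCollection<string> lines, TextWriter output)
    {
        return Task.Run(() =>
        {
            // GetConsumingEnumerable blocks until items arrive and ends
            // once CompleteAdding() has been called and the queue is empty.
            foreach (string line in lines.GetConsumingEnumerable())
                output.WriteLine(line);
        });
    }
}
```

The producers then just call lines.Add(text) inside their loops, call lines.CompleteAdding() when finished, and Wait() on the returned task; the underlying StreamWriter stays open for the consumer's whole lifetime.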
I suggest you open the file using a FileStream and share that instead of the fileName.
I have 3 processes each of which listens to a data feed. All the 3 processes need to read and update the same file after receiving data. The 3 processes keep running whole day. Obviously, each process needs to have exclusive read/write lock on the file.
I could use "named mutex" to protect the read/write of the file or I can open the file using FileShare.None.
Would these 2 approaches work? Which one is better?
The program is written in C# and runs on Windows.
Use a named mutex for this. If you open the file with FileShare.None in one process, the other processes will get an exception thrown when they attempt to open the file, which means you have to deal with waiting, retrying, etc. in those processes.
I agree with the named mutex. Waiting / retrying on file access is very tedious and exception-prone, whereas the named mutex solution is very clean and straightforward.
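A sketch of the named-mutex approach (the mutex name and the UpdateFile helper are illustrative; all cooperating processes must construct the mutex with the same name):

```csharp
using System;
using System.IO;
using System.Threading;

static class SharedFileUpdater
{
    // The name makes this mutex visible across processes on the machine;
    // the "Global\" prefix additionally spans terminal-server sessions.
    static readonly Mutex FileMutex = new Mutex(false, @"Global\MyApp.DataFile");

    public static void UpdateFile(string path, Func<string, string> transform)
    {
        FileMutex.WaitOne(); // blocks until no other process holds the lock
        try
        {
            string contents = File.Exists(path) ? File.ReadAllText(path) : "";
            File.WriteAllText(path, transform(contents));
        }
        finally
        {
            FileMutex.ReleaseMutex(); // always release, even on exceptions
        }
    }
}
```

One caveat: WaitOne can throw AbandonedMutexException if another process died while holding the mutex; catching it still grants ownership, but the file's state should then be treated with suspicion.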
"monitor" is one of you choise since it is the solution of synchronization problem.
I am writing a c# windows service which will be polling an SFTP folder for new files (one file = one job) and processing them. Multiple instances of the service may be running at the same time, so it is important that they do not step on each other.
I realize that an SFTP folder does not make an ideal queue, but that's what I have to work with. What do I need to do to either use this SFTP folder as a concurrent message queue, or safely represent it in a way that can be used concurrently?
Seems like your biggest problem would be dealing with multiple instances of the program stepping on each other and processing the same files.
The way I've handled this in the past is to have the program grab the first file and immediately rename it from, say, 'filename.txt' to 'filename.txt.processing'. The processes are set up to ignore any file ending in '.processing' so that they don't step on each other. I don't think a file rename is perfectly atomic, but I've never had any problems with it.
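That rename-to-claim idea can be sketched as follows (TryClaim is an illustrative name; the '.processing' suffix matches the convention described above):

```csharp
using System.IO;

static class FileClaimer
{
    // Renaming doubles as a claim: once one instance has moved the file,
    // the source name is gone and every other instance's Move throws.
    public static bool TryClaim(string path, out string claimedPath)
    {
        claimedPath = path + ".processing";
        try
        {
            File.Move(path, claimedPath);
            return true;  // this instance owns the file now
        }
        catch (IOException) // includes FileNotFoundException
        {
            return false; // another instance claimed it first
        }
    }
}
```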
"Multiple instances of the service may be running at the same time"
On the same machine, or different ones?
I'm not sure if moving a file in Windows is an atomic operation.
If it is, then when a service chooses to work on a file, it should attempt to move the file to another folder.
If the move operation is successful, then it is safe to work on the file.
You could also leverage a database to keep track of which files are being processed, have been processed, or are awaiting processing.
This adds the complication of updating the table with new files.