I have many processes that write files (any given file is written only once). They open, write, and close the file.
I also have many processes that read these files. File sizes can vary.
What I need is this: when a process tries to read a file that is being written at that moment, it should wait and read the full content once the file has been closed after writing. That is, I need to lock on write and have readers wait until the lock is released.
Important: if a process tries to read a file and cannot, it writes the file itself:
1. Try to read the file.
2. If the file does not exist, write the file.
So in asynchronous operation there can be more than one process that wants to write a file because it could not read it. I need to lock the file for writing, and all readers should wait for that lock.
File locking is an operating-system specific thing.
https://en.wikipedia.org/wiki/File_locking
Unix-like systems
Unix-like systems generally support the flock(), fcntl() and lockf() system calls. However, apart from lockf() advisory locks, these are not part of the POSIX standard, so you need to consult operating-system-specific documentation.
Documentation for Linux is here:
http://linux.die.net/man/3/lockf
http://linux.die.net/man/2/fcntl
http://linux.die.net/man/2/flock
Note that fcntl() does many things not just locking.
Note also that in most cases locking on Unix-like systems is advisory, i.e. a cooperative effort: both parties must participate, and simply ignoring the lock is possible. Mandatory locking exists, but it is not used in the typical paradigm.
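For illustration, here is a minimal Python sketch of cooperative advisory locking with flock() on a Unix-like system (the path is made up). The writer takes an exclusive lock and the reader a shared one, so a cooperating reader only ever sees complete contents:

```python
import fcntl

path = "/tmp/demo-advisory-lock.txt"  # illustrative path

# Writer: take an exclusive advisory lock, write, then release.
with open(path, "w") as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # blocks until no other lock is held
    f.write("complete contents\n")
    f.flush()
    fcntl.flock(f, fcntl.LOCK_UN)

# Reader: take a shared advisory lock; this blocks while a writer
# holds LOCK_EX, so the reader only sees fully written contents.
with open(path, "r") as f:
    fcntl.flock(f, fcntl.LOCK_SH)
    data = f.read()
    fcntl.flock(f, fcntl.LOCK_UN)
```

Remember that this only works if every process that touches the file uses the same locking calls; a process that ignores flock() will read partial data anyway.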
Windows
On Windows, mandatory file locks (share mode with CreateFile) and byte-range locks (LockFileEx) are the norm, and advisory locks are not available, though they can be emulated (typically with a one-byte range lock at offset 0xffffffff or 0xffffffffffffffff; the locked portion does not have to actually exist, so this does not imply that the file is that big).
Alternatives
An alternative for your described scenario, is to simply create the file with a different name, then rename it when done.
E.g. if the file is to be called "data-20130719-112258-99823.csv" instead create one called "tmpdata-20130719-112258-99823.csv.tmp" then when it has been fully written, rename it.
The standard way to handle this issue is to write to a temp file name, then rename the file when writing is complete.
Other processes waiting for the file need to watch for the existence of the real file (using a file system watcher, or similar mechanism). When the file "appears", the writing is already complete.
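The write-then-rename pattern is easy to sketch. Here is a hedged Python version (paths are illustrative): os.replace() performs an atomic rename within one volume, so the file only "appears" under its final name once it is complete, and the reader can simply poll or watch for that name:

```python
import os
import time

final_path = "/tmp/data-20130719-112258-99823.csv"  # name from the example above
tmp_path = final_path + ".tmp"

# Writer: write everything under the temporary name, then rename.
with open(tmp_path, "w") as f:
    f.write("col1,col2\n1,2\n")
    f.flush()
    os.fsync(f.fileno())          # make sure the data is on disk first
os.replace(tmp_path, final_path)  # atomic: the file "appears" fully written

# Reader: poll (or use a file-system watcher) for the final name only;
# it never sees a half-written file because the rename is atomic.
while not os.path.exists(final_path):
    time.sleep(0.05)
with open(final_path) as f:
    content = f.read()
```

Note that the atomicity guarantee only holds when the temporary and final names are on the same volume; a rename across volumes degrades to a copy plus delete.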
Related
I have a webservice that is writing files that are being read by a different program.
To keep the reader program from reading them before they're done writing, I'm writing them with a .tmp extension, then using File.Move to rename them to a .xml extension.
My problem is when we are running at volume - thousands of files in just a couple of minutes.
I've successfully written file "12345.tmp", but when I try to rename it, File.Move() throws an IOException:
File.Move("12345.tmp", "12345.xml")
Exception: The process cannot access the file because it is being used
by another process.
For my situation, I don't really care what the filenames are, so I retry:
File.Move("12345.tmp", "12346.xml")
Exception: Could not find file '12345.tmp'.
Is File.Move() deleting the source file, if it encounters an error in renaming the file?
Why?
Is there someway to ensure that the file either renames successfully or is left unchanged?
The answer is that it depends much on how the file system itself is implemented. Also, if the Move() is between two file systems (possibly even between two machines, if the paths are network shares) - then it also depends much on the O/S implementation of Move(). Therefore, the guarantees depend less on what System.IO.File does, and more about the underlying mechanisms: the O/S code, file-system drivers, file system structure etc.
Generally, in the vast majority of cases Move() will behave the way you expect it to: either the file is moved or it remains as it was. This is because a Move within a single file system is an act of removing the file reference from one directory (an on-disk data structure), and adding it to another. If something bad happens, then the operation is rolled back: the removal from the source directory is undone by an opposite insert operation. Most modern file systems have a built-in journaling mechanism, which ensures that the move operation is either carried out completely or rolled back completely, even in case the machine loses power in the midst of the operation.
All that being said, it still depends, and not all file systems provide these guarantees. See this study
If you are running on Windows, and the file system is local (not a network share), then you can use the Transactional NTFS (TxF) feature of Windows to ensure the atomicity of your move operation. (Note, though, that Microsoft has since deprecated TxF and discourages its use in new development.)
I have multiple Windows programs (running on Windows 2000, XP and 7), which handle text files of different formats (csv, tsv, ini and xml). It is very important not to corrupt the content of these files during file IO. Every file should be safely accessible by multiple programs concurrently, and should be resistant to system crashes. This SO answer suggests using an in-process database, so I'm considering to use the Microsoft Jet Database Engine, which is able to handle delimited text files (csv, tsv), and supports transactions. I used Jet before, but I don't know whether Jet transactions really tolerate unexpected crashes or shutdowns in the commit phase, and I don't know what to do with non-delimited text files (ini, xml). I don't think it's a good idea to try to implement fully ACIDic file IO by hand.
What is the best way to implement transactional handling of text files on Windows? I have to be able to do this in both Delphi and C#.
Thank you for your help in advance.
EDIT
Let's see an example based on @SirRufo's idea. Forget about concurrency for a second, and let's concentrate on crash tolerance.
I read the contents of a file into a data structure in order to modify some fields. When I'm in the process of writing the modified data back into the file, the system can crash.
File corruption can be avoided if I never write the data back into the original file. This can be easily achieved by creating a new file, with a timestamp in the filename every time a modification is saved. But this is not enough: the original file will stay intact, but the newly written one may be corrupt.
I can solve this by putting a "0" character after the timestamp, which would mean that the file hasn't been validated. I would end the writing process by a validation step: I would read the new file, compare its contents to the in-memory structure I'm trying to save, and if they are the same, then change the flag to "1". Each time the program has to read the file, it chooses the newest version by comparing the timestamps in the filename. Only the latest version must be kept, older versions can be deleted.
Concurrency could be handled by waiting on a named mutex before reading or writing the file. When a program gains access to the file, it must start with checking the list of filenames. If it wants to read the file, it will read the newest version. On the other hand, writing can be started only if there is no version newer than the one read last time.
This is a rough, oversimplified, and inefficient approach, but it shows what I'm thinking about. Writing files is unsafe, but maybe there are simple tricks like the one above which can help to avoid file corruption.
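As a rough illustration of the timestamp-plus-validation-flag scheme described above, here is a small Python sketch (the directory and naming convention are invented for the example; renaming from "-0" to "-1" stands in for flipping the validation flag):

```python
import glob
import os
import time

DIRECTORY = "/tmp/versioned-demo"   # hypothetical location for this sketch
os.makedirs(DIRECTORY, exist_ok=True)

def save(data: str) -> str:
    # Write under a timestamped name ending in "-0" (unvalidated).
    path = os.path.join(DIRECTORY, f"data-{time.time_ns()}-0.txt")
    with open(path, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Validation step: re-read and compare with the in-memory copy.
    with open(path) as f:
        if f.read() != data:
            raise IOError("validation failed; keeping the previous version")
    # Flip the flag to "1" by renaming, which is atomic on one volume.
    validated = path[:-len("-0.txt")] + "-1.txt"
    os.rename(path, validated)
    return validated

def load() -> str:
    # Readers pick the newest *validated* version by its timestamp.
    versions = sorted(glob.glob(os.path.join(DIRECTORY, "data-*-1.txt")))
    with open(versions[-1]) as f:
        return f.read()

save("first version")
save("second version")
```

This keeps the spirit of the original idea: a crash mid-write leaves at worst an unvalidated "-0" file behind, and readers never select it.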
UPDATE
Open-source solutions, written in Java:
Atomic File Transactions: article-1, article-2, source code
Java Atomic File Transaction (JAFT): project home
XADisk: tutorial, source code
AtomicFile: description, source code
How about using NTFS file streams? Write multiple named (numbered/timestamped) streams to the same file name. Every version would be stored in a different stream but actually live in the same "file", preserving the data and providing a roll-back mechanism.
When you reach a point of certainty, delete some of the previous streams.
Alternate data streams were introduced back in NT 4 or earlier, so they cover all the Windows versions you list. This should be crash-proof: you will always have the previous version/stream, plus the original, to recover or roll back to.
Just a late-night thought.
http://msdn.microsoft.com/en-gb/library/windows/desktop/aa364404%28v=vs.85%29.aspx
What you are asking for is transactionality, which is not possible without building the mechanisms of an RDBMS yourself to satisfy your requirement:
"It is very important not to corrupt the content of these files during file IO"
Pick a DBMS instead.
See a related post Accessing a single file with multiple threads
My suggestion, however, is to use a database like RavenDB for this kind of transaction. RavenDB supports concurrent access to the same document, as well as batching multiple operations into a single request. However, everything is persisted as JSON documents, not text files. It supports .NET/C# very well (including JavaScript and HTML), but not Delphi.
First of all, this question has nothing to do with C# or Delphi. You have to treat your file structure as if it were a database.
Assumptions:
Moving files is a cheap operation, and the operating system guarantees that files are not corrupted during a move.
You have a single directory of files that need to be processed (d:\filesDB\*.*).
A controller application is a must.
Simplified worker process:
- initialization
Get a process ID from the operating system.
Create directories in d:\filesDB:
d:\filesDB\<processID>
d:\filesDB\<processID>\inBox
d:\filesDB\<processID>\outBox
- process, for each file
Select a file to process.
Move it to the inBox directory (this ensures single access to the file).
Open the file.
Create the new file in outBox and close it properly.
Delete the file in the inBox directory.
Move the newly created file from outBox back to d:\filesDB.
- finalization
Remove the created directories.
Controller application
Runs only at system startup and initializes the applications that will do the work:
Scan the d:\filesDB directory for subdirectories.
For each subdirectory:
2.1 If a file exists in inBox, move it back to d:\filesDB and skip outBox.
2.2 If a file exists in outBox, move it to d:\filesDB.
2.3 Delete the whole subdirectory.
Start each worker process that needs to be started.
I hope that this will solve your problem.
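The claim-via-move trick in the worker process above relies on renames being atomic within one volume, so only one worker can win the move into its inBox. A minimal Python sketch of that cycle (using /tmp/filesDB as a stand-in for d:\filesDB; all names are illustrative):

```python
import os
import shutil

FILES_DB = "/tmp/filesDB"             # stands in for d:\filesDB
pid = str(os.getpid())
inbox = os.path.join(FILES_DB, pid, "inBox")
outbox = os.path.join(FILES_DB, pid, "outBox")
os.makedirs(inbox, exist_ok=True)
os.makedirs(outbox, exist_ok=True)

# Seed one file to process (in real use, other processes drop files here).
with open(os.path.join(FILES_DB, "job1.txt"), "w") as f:
    f.write("payload")

def process_one(name: str) -> None:
    # Claim the file: the rename is atomic within one volume, so at most
    # one worker can win this move, which gives single access to the file.
    claimed = os.path.join(inbox, name)
    os.rename(os.path.join(FILES_DB, name), claimed)
    # Create the result in outBox and close it properly.
    done = os.path.join(outbox, name)
    with open(claimed) as src, open(done, "w") as dst:
        dst.write(src.read().upper())   # stand-in for real processing
    # Delete the input, then publish the result back to the pool.
    os.remove(claimed)
    os.rename(done, os.path.join(FILES_DB, name))

process_one("job1.txt")
shutil.rmtree(os.path.join(FILES_DB, pid))   # finalization: remove work dirs
```

A crashed worker leaves its file in inBox or outBox, which is exactly what the controller's startup scan recovers.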
You are creating a nightmare for yourself trying to handle these transactions and states in your own code across multiple systems. This is why Larry Ellison (Oracle's CEO) is a billionaire and most of us are not. If you absolutely must use files, then set up an Oracle or other database that supports LOB and CLOB objects. I store very large SVG files in such a table for my company, so that we can add and render large maps in our systems without any code changes. The files can be pulled from the table and passed to your users in a buffer, then returned to the database when they are done. Set up the appropriate security and record locking, and your problem is solved.
OK, you are stuck unless you can drop XP. It is as simple as that.
Windows versions after XP support Transactional NTFS, though it is not exposed to .NET natively (you can still use it via P/Invoke). This allows you to roll back or commit changes on an NTFS file system, even in coordination with a database through the DTC. Pretty nice. On XP, though: no way, it is not there.
Start with Any real-world, enterprise-grade experience with Transactional NTFS (TxF)? The question there lists a lot of resources to get you started.
Note that this DOES have a performance overhead, obviously. It is not that bad, though, unless you need a SECOND transactional resource: there is a very thin kernel-level transaction coordinator, and transactions are only promoted to full DTC transactions when a second resource is added.
For a direct link, http://msdn.microsoft.com/en-us/magazine/cc163388.aspx has some nice information.
I have 3 processes, each of which listens to a data feed. All 3 processes need to read and update the same file after receiving data, and they keep running all day. Obviously, each process needs an exclusive read/write lock on the file.
I could use a named mutex to protect the reads/writes, or I could open the file with FileShare.None.
Would these 2 approaches work? Which one is better?
The program is written in C# and runs on Windows.
Use a named mutex for this. If you open the file with FileShare.None in one process, the other processes will get an exception thrown when they attempt to open the file, which means you have to deal with waiting and retrying etc. in these processes.
I agree with the named mutex. Waiting / retrying on file access is very tedious and exception-prone, whereas the named mutex solution is very clean and straightforward.
A "monitor" is another choice, since it is a standard solution to this kind of synchronization problem.
I have a text file and multiple threads/processes will write to it (it's a log file).
The file gets corrupted sometimes because of concurrent writings.
I want all of the threads to use a file-writing mode that is sequential at the file-system level itself.
I know it's possible to use locks (a mutex for multiple processes) and synchronize writing to this file, but I would prefer to open the file in the correct mode and leave the task to System.IO.
Is it possible ? what's the best practice for this scenario ?
Your best bet is just to use locks/mutexes. It's a simple approach, it works, and you can easily understand it and reason about it.
When it comes to synchronization it often pays to start with the simplest solution that could work and only try to refine if you hit problems.
To my knowledge, Windows doesn't have what you're looking for. There is no file handle object that does automatic synchronization by blocking all other users while one is writing to the file.
If your logging involves the three steps, open file, write, close file, then you can have your threads try to open the file in exclusive mode (FileShare.None), catch the exception if unable to open, and then try again until success. I've found that tedious at best.
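The open-exclusively-and-retry loop described above can be sketched as follows. This is a Python analogue (a non-blocking flock() stands in for FileShare.None, since Python has no direct share-mode equivalent; the path and retry parameters are invented):

```python
import errno
import fcntl
import time

LOG_PATH = "/tmp/retry-lock-demo.log"   # illustrative path

def append_line(line: str, retries: int = 50, delay: float = 0.1) -> None:
    """Open, write, close with an exclusive lock, retrying on contention:
    roughly the open/catch-exception/retry loop described above."""
    for _ in range(retries):
        with open(LOG_PATH, "a") as f:
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast
            except OSError as e:
                if e.errno in (errno.EACCES, errno.EAGAIN):
                    time.sleep(delay)   # someone else holds it; try again
                    continue
                raise
            f.write(line + "\n")
            return                      # lock released when the file closes
    raise TimeoutError("could not acquire the log file lock")

append_line("first entry")
append_line("second entry")
```

The tedium the answer mentions is visible even in this toy version: every caller has to carry the retry-and-back-off logic around.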
In my programs that log from multiple threads, I created a TextWriter descendant that is essentially a queue. Threads call the Write or WriteLine methods on that object, which formats the output and places it into a queue (using a BlockingCollection). A separate logging thread services that queue--pulling things from it and writing them to the log file. This has a few benefits:
Threads don't have to wait on each other in order to log
Only one thread is writing to the file
It's trivial to rotate logs (i.e. start a new log file every hour, etc.)
There's zero chance of an error because I forgot to do the locking on some thread
Doing this across processes would be a lot more difficult. I've never even considered trying to share a log file across processes. Were I to need that, I would create a separate application (a logging service). That application would do the actual writes, with the other applications passing the strings to be written. Again, that ensures that I can't screw things up, and my code remains simple (i.e. no explicit locking code in the clients).
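The queue-based logger design above can be sketched in a few lines of Python (the class name and path are invented; queue.Queue plays the role of BlockingCollection, and a single background thread owns the file):

```python
import queue
import threading

class QueuedLogger:
    """One dedicated thread owns the file; other threads only enqueue,
    mirroring the TextWriter-over-BlockingCollection design above."""

    def __init__(self, path: str):
        self._q = queue.Queue()
        self._file = open(path, "a")
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write_line(self, msg: str) -> None:
        self._q.put(msg)          # callers never block on the file itself

    def _drain(self) -> None:
        while True:
            msg = self._q.get()
            if msg is None:       # sentinel: flush and stop
                break
            self._file.write(msg + "\n")
        self._file.flush()
        self._file.close()

    def close(self) -> None:
        self._q.put(None)
        self._thread.join()

log = QueuedLogger("/tmp/queued-demo.log")
workers = [threading.Thread(target=log.write_line, args=(f"message {i}",))
           for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
log.close()
```

Because only the drain thread ever touches the file, no locking is needed around the writes themselves, which is exactly the "zero chance of forgetting the lock" benefit listed above.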
You might be able to use File.Open() with a FileShare value set to None, and make each thread wait if it can't get access to the file.
Say I want to be informed whenever a file copy is launched on my system and get the file name, the destination where it is being copied or moved and the time of copy.
Is this possible? How would you go about it? Should you hook CopyFile API function?
Is there any software that already accomplishes this?
Windows has the concept of I/O filter drivers, which allow you to intercept all I/O operations and perform additional actions as a result. They are primarily used for antivirus-type scenarios but can be programmed for a wide variety of tasks. The Sysinternals Process Monitor, for example, uses an I/O filter to see file-level access.
You can view your current filters using MS Filter Manager, (fltmc.exe from a command prompt)
There is a kit to help you write filters; you can get the samples and develop your own driver.
http://www.microsoft.com/whdc/driver/filterdrv/default.mspx is a starting place to get in depth info
As there is a .NET tag on this question, I would simply use System.IO.FileSystemWatcher that's in the .NET Framework. I'm guessing it is implemented using the I/O Filters that Andrew mentions in his answer, but I really do not know (nor care, exactly). Would that fit your needs?
As Andrew says a filter driver is the way to go.
There is no foolproof way of detecting a file copy as different programs copy files in different ways (some may use the CopyFile API, others may just read one file and write out the contents to another themselves). You could try calculating a hash in your filter driver of any file opened for reading, and then do the same after a program finishes writing to a file. If the hashes match you know you have a file copy. However this technique may be slow. If you just hook the CopyFile API you will miss file copies made without that API. Java programs (to name but one) have no access to the CopyFile API.
This is likely impossible as there is no guaranteed central method for performing a copy/move. You could hook into a core API (like CopyFile) but of course that means that you will still miss any copy/move that any application does without using this API.
Maybe you could watch the entire filesystem with IO filters for open files and then just draw conclusions yourself if two files with same names and same filesizes are open at the same time. But that no 100% solution either.
As previously mentioned, a file copy operation can be implemented in various ways and may involve several disk and memory transfers; therefore, it is not possible to simply get notified by the system when such an operation occurs.
Even for the user, there are multiple ways to duplicate content and entire files. Copy commands, "save as", "send to", move, using various tools. Under the hood the copy operation is a succession of read / write, correlated by certain parameters. That is the only way to guarantee successful auditing. Hooking on CopyFile will not give you the copy operations of Total Commander, for example. Nor will it give you "Save as" operations which are in fact file create -> file content moved -> closing of original file -> opening of the new file. Then, things are different when dealing with copy over network, impersonated copy operations where the file handle security context is different than the process security context, and so on. I do not think that there is a straightforward way to achieve all of the above.
However, there is a software that can notify you for most of the common copy operations (i.e. when they are performed through windows explorer, total commander, command prompt and other applications). It also gives you the source and destination file name, the timestamp and other relevant details. It can be found here: http://temasoft.com/products/filemonitor.
Note: I work for the company which develops this product.