I have a webservice that is writing files that are being read by a different program.
To keep the reader program from reading them before they're done writing, I'm writing them with a .tmp extension, then using File.Move to rename them to a .xml extension.
My problem is when we are running at volume - thousands of files in just a couple of minutes.
I've successfully written file "12345.tmp", but when I try to rename it, File.Move() throws an IOException:
File.Move("12345.tmp", "12345.xml")
Exception: The process cannot access the file because it is being used
by another process.
For my situation, I don't really care what the filenames are, so I retry:
File.Move("12345.tmp", "12346.xml")
Exception: Exception: Could not find file '12345.tmp'.
Is File.Move() deleting the source file, if it encounters an error in renaming the file?
Why?
Is there someway to ensure that the file either renames successfully or is left unchanged?
The answer is that it depends much on how the file system itself is implemented. Also, if the Move() is between two file systems (possibly even between two machines, if the paths are network shares) - then it also depends much on the O/S implementation of Move(). Therefore, the guarantees depend less on what System.IO.File does, and more about the underlying mechanisms: the O/S code, file-system drivers, file system structure etc.
Generally, in the vast majority of cases Move() will behave the way you expect it to: either the file is moved or it remains as it was. This is because a Move within a single file system is an act of removing the file reference from one directory (an on-disk data structure), and adding it to another. If something bad happens, then the operation is rolled back: the removal from the source directory is undone by an opposite insert operation. Most modern file systems have a built-in journaling mechanism, which ensures that the move operation is either carried out completely or rolled back completely, even in case the machine loses power in the midst of the operation.
All that being said, it still depends, and not all file systems provide these guarantees. See this study
If you are running over Windows, and the file system is local (not a network share), then you can use the Transacitonal File System (TxF) feature of Windows to ensure the atomicity of your move operation.
Related
After having read questions dealing with how to tell whether two files are on the same physical volume or not, and seeing that it's (almost) impossible (e.g. here), I'm wondering how the OS knows whether a file move operation should update a master file table (or its equivalent) or whether to copy and delete.
Does Windows delegate that to the drives somehow? (Or perhaps the OS does have information about every file, and it's just not accessible by programs? Unlikely.)
Or - Does Windows know only about certain types of drives (and copies and deletes in other cases)? In which case we could also assume the same. Which means allowing a file move without using a background thread, for example. (Because it will be near instantaneous.)
I'm trying to better understand this subject. If I'm making some basic incorrect assumption - please, correcting that in itself would be an answer.
If needed to limit the scope, let's concentrate on Windows 7 and up, and NTFS and FAT drives.
Of course the operating system knows which drive (and which partition on that drive) contains any particular local file; otherwise, how could it read the data? (For remote files, the operating system doesn't know about the drives, but it does know which server to contact. Moves between different servers are implemented as copy-and-delete; moves on the same server are either copy-and-delete or are delegated to that server, depending on the protocol in use.)
This information is also available to applications. You can use the GetFileInformationByHandle() function to obtain the serial number of the volume containing a particular file.
The OS does have information about every file, and it's just not as easily accessible to your program. Not in any portable way, that is.
See it this way: Those files are owned by the system. The system allocates the space, manages the volume and indexes. It's not going to copy and delete the file if it ends up in the same physical volume, as it is more efficient to move the file. It will only copy and delete if it needs to.
In C or C++ for Windows I first try to MoveFileEx without MOVEFILE_COPY_ALLOWED set. It will fail if the file can not be moved by renaming. If rename fails I know that it may take some time and show some progress bar or the like.
There are no such rename AFAIK in .NET and that System::IO::File::Move of .NET does not fail if you move between different volumes.
First, regarding Does Windows delegate that to the drives somehow. No. The OS is more like a central nervous system. It keeps track of whats going on centrally, and for its distributed assets (or devices) such as a drive. (internal or external)
It follows that the OS, has information about every file residing on a drive for which it has successfully enumerated. The most relevant part of the OS with respect to file access is the File System. There are several types. Knowledge of the following topics will help to understand issues surrounding file access:
1) File attribute settings
2) User Access Controls
3) File location (pdf) (related to User Access Controls)
4) Current state of file (i.e. is the file in use currently)
5) Access Control Lists
Regarding will be near instantaneous. This obviously is only a perception. No matter how fast, or seemingly simultaneous, file handling via standard programming libraries can be done in such a way as to be aware of file related errors, such as:
ENOMEM - insufficient memory.
EMFILE - FOPEN_MAX files open already.
EINVAL - filename is NULL or contains only whitespace.
EINVAL - invalid mode.
(these in relation to fopen) can be used to mitigate OS/file run-time issues. This being said, applications should always be written to comply with good programming methods to avoid bumping into OS related file access issues, thread safety included.
I want to use some mutex on files, so any process won't touch certain files before other stop using them. How can I do it in .NET 3.5? Here are some details:
I have some service, which checks every period of time if there are any files/directories in certain folder and if there are, service's doing something with it.
My other process is responsible for moving files (and directories) into certain folder and everything works just fine.
But I'm worrying because there can be situation, when my copying process will copy the files to certain folder and in the same time (in the same milisecond) my service will check if there are some files, and will do something with them (but not with all of them, because it checked during the copying).
So my idea is to put some mutex in there (maybe one extra file can be used as a mutex?), so service won't check anything until copying is done.
How can I achieve something like that in possibly easy way?
Thanks for any help.
The canonical way to achieve this is the filename:
Process A copies the files to e.g. "somefile.ext.noprocess" (this is non-atomic)
Process B ignores all files with the ".noprocess" suffix
After Process B has finished copying, it renames the file to "somefile.ext"
Next time Process B checks, it sees the file and starts processing.
If you have more than one file, that have to be processd together (or none), you need to adapt this scheme to an additional transaction file containing the file names for the transaction: Only if this file exists and has the correct name, must process B read it and process the files mentioned in it.
Your problem really is not of mutual exclusion, but of atomicity. Copying multiple files is not an atomic operation, and so it is possible to observe the files in a half-copied state which you'd like to prevent.
To solve your problem, you could hinge your entire operation on a single atomic file system operation, for example renaming (or moving) of a folder. That way no one can observe an intermediate state. You can do it as follows:
Copy the files to a folder outside the monitored folder, but on the same drive.
When the copying operation is complete, move the folder inside the monitored folder. To any outside process, all the files would appear at once, and it would have no chance to see only part of the files.
I have multiple Windows programs (running on Windows 2000, XP and 7), which handle text files of different formats (csv, tsv, ini and xml). It is very important not to corrupt the content of these files during file IO. Every file should be safely accessible by multiple programs concurrently, and should be resistant to system crashes. This SO answer suggests using an in-process database, so I'm considering to use the Microsoft Jet Database Engine, which is able to handle delimited text files (csv, tsv), and supports transactions. I used Jet before, but I don't know whether Jet transactions really tolerate unexpected crashes or shutdowns in the commit phase, and I don't know what to do with non-delimited text files (ini, xml). I don't think it's a good idea to try to implement fully ACIDic file IO by hand.
What is the best way to implement transactional handling of text files on Windows? I have to be able to do this in both Delphi and C#.
Thank you for your help in advance.
EDIT
Let's see an example based on #SirRufo's idea. Forget about concurrency for a second, and let's concentrate on crash tolerance.
I read the contents of a file into a data structure in order to modify some fields. When I'm in the process of writing the modified data back into the file, the system can crash.
File corruption can be avoided if I never write the data back into the original file. This can be easily achieved by creating a new file, with a timestamp in the filename every time a modification is saved. But this is not enough: the original file will stay intact, but the newly written one may be corrupt.
I can solve this by putting a "0" character after the timestamp, which would mean that the file hasn't been validated. I would end the writing process by a validation step: I would read the new file, compare its contents to the in-memory structure I'm trying to save, and if they are the same, then change the flag to "1". Each time the program has to read the file, it chooses the newest version by comparing the timestamps in the filename. Only the latest version must be kept, older versions can be deleted.
Concurrency could be handled by waiting on a named mutex before reading or writing the file. When a program gains access to the file, it must start with checking the list of filenames. If it wants to read the file, it will read the newest version. On the other hand, writing can be started only if there is no version newer than the one read last time.
This is a rough, oversimplified, and inefficient approach, but it shows what I'm thinking about. Writing files is unsafe, but maybe there are simple tricks like the one above which can help to avoid file corruption.
UPDATE
Open-source solutions, written in Java:
Atomic File Transactions: article-1, article-2, source code
Java Atomic File Transaction (JAFT): project home
XADisk: tutorial, source code
AtomicFile: description, source code
How about using NTFS file streams? Write multiple named(numbered/timestamped) streams to the same Filename. Every version could be stored in a different stream but is actually stored in the same "file" or bunch of files, preserving the data and providing a roll-back mechanism...
when you reach a point of certainty delete some of the previous streams.
Introduced in NT 4? It covers all versions. Should be crash proof you will always have the previous version/stream plus the original to recover / roll-back to.
Just a late night thought.
http://msdn.microsoft.com/en-gb/library/windows/desktop/aa364404%28v=vs.85%29.aspx
What you are asking for is transactionality, which is not possible without developing yourself the mechanism of a RDBMS database according to your requirements:
"It is very important not to corrupt the content of these files during file IO"
Pickup a DBMS.
See a related post Accessing a single file with multiple threads
However my opinion is to use a database like Raven DB for these kind of transactions, Raven DB supports concurrent access to same file as well as supporting batching on multiple operations into a single request. However everything is persisted as JSON documents, not text files. It does support .NET/C# very well, including Javascript and HTML but not Delphi.
First of all this question has nothing to do with C# or Delphi. You have to simulate your file structure as if it is a database.
Assumptions;
Moving of files is a cheap process and Op System guarantees that the files are not corrupted during move.
You have a single directory of files that need to be processed. (d:\filesDB*.*)
A Controller application is a must.
Simplified Worker Process;
-initialization
Gets a processID from the Operating system.
Creates directories in d:\filesDB
d:\filesDB\<processID>
d:\filesDB\<processID>\inBox
d:\filesDB\<processID>\outBox
-process for each file
Select file to process.
Move it to the "inBox" Directory (ensures single access to file)
Open file
Create new file in "outBox" and close it properly
Delete file in "inBox" Directory.
Move newly created file located in "OutBox" back to d:\filesDB
-finallization
remove the created directories.
Controller Application
Runs only on startup of the system, and initializes applications that will do the work.
Scan d:\filesDB directory for subdirectories,
For each subDirectory
2.1 if File exists in "inBox", move it to d:\filesDB and skip "outBox".
2.2 if File exists in "outBox", move it to d:\filesDB
2.3 delete the whole subDirectory.
Start each worker process that need to be started.
I hope that this will solve your problem.
You are creating a nightmare for yourself trying to handle these transactions and states in your own code across multiple systems. This is why Larry Ellison (Oracle CEO) is a billionaire and most of us are not. If you absolutely must use files, then setup an Oracle or other database that supports LOB and CLOB objects. I store very large SVG files in such a table for my company so that we can add and render large maps to our systems without any code changes. The files can be pulled from the table and passed to your users in a buffer then returned to the database when they are done. Setup the appropriate security and record locking and your problem is solved.
Ok, you are dead - unless you can drop XP. Simple like that.
Since POST-XP Windows supports Transactional NTFS - though it is not exposed to .NET (natively - you can still use it). This allows one to roll back or commit changes on a NTFS file system, with a DTC even in coordination with a database. Pretty nice. XP, though - no way, not there.
Start at Any real-world, enterprise-grade experience with Transactional NTFS (TxF)? as a starter. The question there lists a lot of ressources to get you started on how to do it.
Note that this DOES have a performance overhead - obviously. It is not that bad, though, unless you need a SECOND transactional resource, as there is a very thin kernel level transaction coordinator there, transactions only get promoted to full DTC when a second ressource is added.
For a direct link - http://msdn.microsoft.com/en-us/magazine/cc163388.aspx has some nice information.
I need to uniquely identify a file on Windows so I can always have a reference for that file even if it's moved or renamed. I did some research and found the question Unique file identifier in windows with a way that uses the method GetFileInformationByHandle with C++, but apparently that only works for NTFS partitions, but not for the FAT ones.
I need to program a behavior like the one on DropBox: if you close it on your computer, rename a file and open it again it detects that change and syncs correctly. I wonder whats the technique and maybe how DropBox does if you guys know.
FileSystemWatcher for example would work, but If the program using it is closed, no changes can be detected.
I will be using C#.
Thanks,
The next best method (but one that involves reading every file completely, which I'd avoid when it can be helped) would be to compare file size and a hash (e.g. SHA-256) of the file contents. The probability that both collide is fairly slim, especially under normal circumstances.
I'd use the GetFileInformationByHandle way on NTFS and fall back to hashing on FAT volumes.
In Dropbox' case I think though, that there is a service or process running in background observing file system changes. It's the most reliable way, even if it ceases to work if you stop said service/process.
What the user was looking for was most likely Windows Change Journals. Those track changes like renames of files persistently, no need to have a watcher observing file system events running all the time. Instead, one simply needs to maintain when last looked at the log and continue looking again beginning at that point. At some point a file with an already known ID would have an event of type RENAME and whoever is interested in that event could do the same for its own version of that file. The important thing is to keep track of the used IDs for files of course.
An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journals
I've a program (deployed a copy to each users computer) for user to store files on a centralized file server with compression (CAB file).
When adding a file, user need to extract the file onto his own disk, add the file, and compress it back onto the server. So if two users process the same compressed file at the same time, the later uploaded one will replace the one earlier and cause data loss.
My strategy to prevent this is before user extract the compressed file, the program will check if there is a specified temp file exist on the server. If not, the program will create such temp file to prevent other user's interfere, and will delete the temp file after uploading; If yes, the program will wait until the temp file is deleted.
Is there better way of doing this? And will frequently creating and deleting empty files damage the disk?
And will frequently creating and
deleting empty files damage the disk?
No. If you're using a solid-state disk, there's a theoretical limit on the number of writes that can be performed (which is an inherit limitation of FLASH). However, you're incredibly unlikely to ever reach that limit.
Is there better way of doing this
Well, I would go about this differently:
Write a Windows Service that handles all disk access, and have your client apps talk to the service. So, when a client needs to retrieve a file, it would open a socket connection to your service and request the file and either keep it in memory or save it to their local disk. Perform any modifications on the client's local copy of the file (decompress, add/remove/update files, recompress, etc), and, when the operation is complete and you're ready to save (or commit in source-control lingo) your changes, open another socket connection to your service app (running on the server), and send it the new file contents as a binary stream.
The service app would then handle loading and saving the files to disk. This gives you a lot of additional capabilities, as well - the server can keep track of past versions (perhaps even committing each version to svn or another source control system), provide metadata such as what the latest version is, etc.
Now that I'm thinking about it, you may be better off just integrating an svn interface into your app. SharpSVN is a good library for this.
Creating temporary files to flag the lock is a viable and widely used option (and no, this won't damage the disk). Another option is to open the compressed file exclusively (or let other processes only read the file but not write it) and keep the file opened while the user works with the contents of the file.
Is there better way of doing this?
Yes. From what you've written here, it sounds like you are well on your way towards re-inventing revision control.
Perhaps you could use some off-the-shelf version control system?
Or perhaps at least re-use some code from such systems?
Or perhaps you could at least learn a little about the problems those systems faced, how fixing the obvious problems led to non-obvious problems, and attempt to make a system that works at least as well?
My understanding is that version control systems went through several stages (see
"Edit Conflict Resolution" on the original wiki, the Portland Pattern Repository).
In roughly chronological order:
The master version is stored on the server. Last-to-save wins, leading to mysterious data loss with no warning.
The master version is stored on the server. When I pull a copy to my machine, the system creates a lock file on the server. When I push my changes to the server (or cancel), the system deletes that lock file. No one can change those files on the server, so we've fixed the "mysterious data loss" problem, but we have endless frustration when I need to edit some file that someone else checked out just before leaving on a long vacation.
The master version is stored on the server. First-to-save wins ("optimistic locking"). When I pull the latest version from the server, it includes some kind of version-number. When I later push my edits to the server, if the version-number I pulled doesn't match the current version on the server, someone else has cut in first and changed things ahead of me, and the system gives some sort of polite message telling me about it. Ideally I pull the latest version from the server and carefully merge it with my version, and then push the merged version to the server, and everything is wonderful. Alas, all too often, an impatient person pulls the latest version, overwrites it with "his" version, and pushes "his" version, leading to data loss.
Every version is stored on the server, in an unbroken chain. (Centralized version control like TortoiseSVN is like this).
Every version is stored in every local working directory; sometimes the chain forks into 2 chains; sometimes two chains merge back into one chain. (Distributed version control tools like TortoiseHg are like this).
So it sounds like you're doing what everyone else did when they moved from stage 1 to stage 2. I suppose you could slowly work your way through every stage.
Or maybe you could jump to stage 4 or 5 and save everyone time?
Take a look at the FileStream.Lock method. Quoting from MSDN:
Prevents other processes from reading from or writing to the FileStream.
...
Locking a range of a file stream gives the threads of the locking process exclusive access to that range of the file stream.