How to lock a file in a multi-user file management system - C#

I have a program (a copy is deployed to each user's computer) that lets users store files on a centralized file server with compression (CAB files).
To add a file, a user needs to extract the archive to his own disk, add the file, and compress the archive back onto the server. So if two users process the same compressed file at the same time, the later upload will replace the earlier one and cause data loss.
My strategy to prevent this: before a user extracts the compressed file, the program checks whether a specific temp file exists on the server. If it doesn't, the program creates that temp file to keep other users out, and deletes it after uploading; if it does, the program waits until the temp file is deleted.
Is there a better way of doing this? And will frequently creating and deleting empty files damage the disk?
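A minimal sketch of this strategy in C# (the paths are hypothetical): opening the marker with FileMode.CreateNew makes the "check whether the temp file exists" and "create it" steps a single atomic operation, so two clients can't both slip past the check at the same moment.

```csharp
using System.IO;
using System.Threading;

string lockPath = @"\\server\share\archive.cab.lock";   // hypothetical marker file

FileStream AcquireLock()
{
    while (true)
    {
        try
        {
            // Succeeds only if the marker does not exist yet (atomic check-and-create).
            return new FileStream(lockPath, FileMode.CreateNew,
                                  FileAccess.Write, FileShare.None);
        }
        catch (IOException)
        {
            Thread.Sleep(1000);   // someone else holds the lock; wait and retry
        }
    }
}

using (FileStream lockFile = AcquireLock())
{
    // extract the archive, add the file, recompress, upload...
}
File.Delete(lockPath);   // release the lock for the next user
```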

And will frequently creating and deleting empty files damage the disk?
No. If you're using a solid-state disk, there's a theoretical limit on the number of writes that can be performed (which is an inherent limitation of flash memory). However, you're incredibly unlikely to ever reach that limit.
Is there a better way of doing this?
Well, I would go about this differently:
Write a Windows Service that handles all disk access, and have your client apps talk to the service. So, when a client needs to retrieve a file, it would open a socket connection to your service and request the file and either keep it in memory or save it to their local disk. Perform any modifications on the client's local copy of the file (decompress, add/remove/update files, recompress, etc), and, when the operation is complete and you're ready to save (or commit in source-control lingo) your changes, open another socket connection to your service app (running on the server), and send it the new file contents as a binary stream.
The service app would then handle loading and saving the files to disk. This gives you a lot of additional capabilities, as well - the server can keep track of past versions (perhaps even committing each version to svn or another source control system), provide metadata such as what the latest version is, etc.
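For illustration, here is a minimal sketch of the receive-and-commit side of such a service; the port, storage folder, and the simple name/length framing are all assumptions, and a client would write the same framing through a BinaryWriter over its own TcpClient.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;

class FileCommitService
{
    static void Main()
    {
        var listener = new TcpListener(IPAddress.Any, 9000);   // hypothetical port
        listener.Start();

        while (true)
        {
            using TcpClient client = listener.AcceptTcpClient();
            using var reader = new BinaryReader(client.GetStream());

            // Framing: length-prefixed file name, then payload length, then payload bytes.
            string fileName = reader.ReadString();
            long length = reader.ReadInt64();

            // Write to a temp file first, then move it into place, so readers
            // never observe a half-written archive.
            string target = Path.Combine(@"D:\Archives", Path.GetFileName(fileName));
            string temp = target + ".tmp";

            using (FileStream file = File.Create(temp))
            {
                var buffer = new byte[81920];
                long remaining = length;
                while (remaining > 0)
                {
                    int read = reader.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                    if (read == 0) throw new IOException("Client disconnected mid-upload.");
                    file.Write(buffer, 0, read);
                    remaining -= read;
                }
            }

            File.Delete(target);      // replace any previous version
            File.Move(temp, target);  // publish the new version (same-volume rename)
        }
    }
}
```

A real service would add authentication, per-file locking, and versioning on top of this, but the shape stays the same: all writes funnel through one process that owns the files.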
Now that I'm thinking about it, you may be better off just integrating an svn interface into your app. SharpSVN is a good library for this.

Creating temporary files to flag the lock is a viable and widely used option (and no, this won't damage the disk). Another option is to open the compressed file exclusively (or allow other processes to read the file but not write it) and keep it open while the user works with its contents.
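A minimal sketch of the exclusive-open variant (the UNC path is hypothetical): while one client holds the handle with FileShare.None, any other client's attempt to open the archive throws an IOException, which can be turned into a "file is in use" message or a retry loop.

```csharp
using System;
using System.IO;

string archivePath = @"\\server\share\documents.cab";   // hypothetical path

try
{
    // No other process can read or write the archive while this handle stays open.
    using FileStream archive = new FileStream(
        archivePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);

    // extract, let the user work, recompress, and write back through `archive`...
}
catch (IOException)
{
    Console.WriteLine("Another user is currently working on this archive.");
}
```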

Is there better way of doing this?
Yes. From what you've written here, it sounds like you are well on your way towards re-inventing revision control.
Perhaps you could use some off-the-shelf version control system?
Or perhaps at least re-use some code from such systems?
Or perhaps you could at least learn a little about the problems those systems faced, how fixing the obvious problems led to non-obvious problems, and attempt to make a system that works at least as well?
My understanding is that version control systems went through several stages (see "Edit Conflict Resolution" on the original wiki, the Portland Pattern Repository).
In roughly chronological order:
1. The master version is stored on the server. Last-to-save wins, leading to mysterious data loss with no warning.
2. The master version is stored on the server. When I pull a copy to my machine, the system creates a lock file on the server. When I push my changes to the server (or cancel), the system deletes that lock file. No one can change those files on the server, so we've fixed the "mysterious data loss" problem, but we have endless frustration when I need to edit some file that someone else checked out just before leaving on a long vacation.
3. The master version is stored on the server. First-to-save wins ("optimistic locking"). When I pull the latest version from the server, it includes some kind of version number. When I later push my edits to the server, if the version number I pulled doesn't match the current version on the server, someone else has cut in first and changed things ahead of me, and the system gives some sort of polite message telling me about it. Ideally I pull the latest version from the server and carefully merge it with my version, and then push the merged version to the server, and everything is wonderful. Alas, all too often, an impatient person pulls the latest version, overwrites it with "his" version, and pushes "his" version, leading to data loss.
4. Every version is stored on the server, in an unbroken chain. (Centralized version control like TortoiseSVN is like this.)
5. Every version is stored in every local working directory; sometimes the chain forks into 2 chains; sometimes two chains merge back into one chain. (Distributed version control tools like TortoiseHg are like this.)
So it sounds like you're doing what everyone else did when they moved from stage 1 to stage 2. I suppose you could slowly work your way through every stage.
Or maybe you could jump to stage 4 or 5 and save everyone time?

Take a look at the FileStream.Lock method. Quoting from MSDN:
Prevents other processes from reading from or writing to the FileStream.
...
Locking a range of a file stream gives the threads of the locking process exclusive access to that range of the file stream.
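As a rough sketch (the path is hypothetical), locking the whole file for the duration of an update could look like this; the length is captured once so that Lock and Unlock cover exactly the same byte range.

```csharp
using System.IO;

using FileStream fs = new FileStream(
    @"\\server\share\documents.cab",               // hypothetical path
    FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);

long lockedLength = fs.Length;
fs.Lock(0, lockedLength);   // other handles stay open but cannot touch this range
try
{
    // read / modify / rewrite the archive here
}
finally
{
    fs.Unlock(0, lockedLength);
}
```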

Related

File move - How does the OS know whether to update a master file table or copy and delete?

After having read questions dealing with how to tell whether two files are on the same physical volume or not, and seeing that it's (almost) impossible (e.g. here), I'm wondering how the OS knows whether a file move operation should update a master file table (or its equivalent) or whether to copy and delete.
Does Windows delegate that to the drives somehow? (Or perhaps the OS does have information about every file, and it's just not accessible by programs? Unlikely.)
Or does Windows only know about certain types of drives (and copy and delete in all other cases)? In that case our programs could make the same assumption, which would mean, for example, allowing a file move without using a background thread (because it would be near instantaneous).
I'm trying to better understand this subject. If I'm making some basic incorrect assumption - please, correcting that in itself would be an answer.
If needed to limit the scope, let's concentrate on Windows 7 and up, and NTFS and FAT drives.
Of course the operating system knows which drive (and which partition on that drive) contains any particular local file; otherwise, how could it read the data? (For remote files, the operating system doesn't know about the drives, but it does know which server to contact. Moves between different servers are implemented as copy-and-delete; moves on the same server are either copy-and-delete or are delegated to that server, depending on the protocol in use.)
This information is also available to applications. You can use the GetFileInformationByHandle() function to obtain the serial number of the volume containing a particular file.
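For example, a small P/Invoke wrapper around GetFileInformationByHandle (sketch below; error handling kept minimal) exposes that serial number, and two paths whose volume serial numbers match are on the same volume, which is exactly the case where a move can be a pure rename.

```csharp
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class VolumeInfo
{
    [StructLayout(LayoutKind.Sequential)]
    struct BY_HANDLE_FILE_INFORMATION
    {
        public uint FileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME CreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME LastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME LastWriteTime;
        public uint VolumeSerialNumber;
        public uint FileSizeHigh;
        public uint FileSizeLow;
        public uint NumberOfLinks;
        public uint FileIndexHigh;
        public uint FileIndexLow;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetFileInformationByHandle(
        SafeFileHandle hFile, out BY_HANDLE_FILE_INFORMATION lpFileInformation);

    // Returns the serial number of the volume that holds the given file.
    public static uint GetVolumeSerial(string path)
    {
        using FileStream fs = File.OpenRead(path);
        if (!GetFileInformationByHandle(fs.SafeFileHandle, out BY_HANDLE_FILE_INFORMATION info))
            throw new Win32Exception(Marshal.GetLastWin32Error());
        return info.VolumeSerialNumber;
    }
}
```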
The OS does have information about every file, and it's just not as easily accessible to your program. Not in any portable way, that is.
See it this way: Those files are owned by the system. The system allocates the space, manages the volume and indexes. It's not going to copy and delete the file if it ends up in the same physical volume, as it is more efficient to move the file. It will only copy and delete if it needs to.
In C or C++ for Windows, I first try MoveFileEx without MOVEFILE_COPY_ALLOWED set. It will fail if the file cannot be moved by renaming alone. If the rename fails, I know the move may take some time, and I show a progress bar or the like.
AFAIK there is no such rename-only call in .NET, and System::IO::File::Move does not fail if you move between different volumes.
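From C#, the same check can be done with a small P/Invoke wrapper (a sketch; the helper name is made up). MoveFileEx without MOVEFILE_COPY_ALLOWED fails with ERROR_NOT_SAME_DEVICE when a rename alone cannot do the job, which is the signal to fall back to copy-and-delete with a progress bar.

```csharp
using System.ComponentModel;
using System.Runtime.InteropServices;

static class FastMove
{
    const uint MOVEFILE_REPLACE_EXISTING = 0x1;
    // MOVEFILE_COPY_ALLOWED (0x2) is deliberately NOT passed.

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool MoveFileEx(string lpExistingFileName, string lpNewFileName, uint dwFlags);

    // True  => the move was a cheap same-volume rename.
    // False => a cross-volume move; caller should copy + delete and show progress.
    public static bool TryRename(string source, string destination)
    {
        if (MoveFileEx(source, destination, MOVEFILE_REPLACE_EXISTING))
            return true;

        const int ERROR_NOT_SAME_DEVICE = 17;
        int error = Marshal.GetLastWin32Error();
        if (error == ERROR_NOT_SAME_DEVICE)
            return false;

        throw new Win32Exception(error);
    }
}
```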
First, regarding "Does Windows delegate that to the drives somehow?": no. The OS is more like a central nervous system. It keeps track of what's going on centrally, including for its distributed assets (devices) such as drives (internal or external).
It follows that the OS has information about every file residing on a drive that it has successfully enumerated. The most relevant part of the OS with respect to file access is the file system, of which there are several types. Knowledge of the following topics will help in understanding the issues surrounding file access:
1) File attribute settings
2) User Access Controls
3) File location (related to User Access Controls)
4) Current state of file (i.e. is the file in use currently)
5) Access Control Lists
Regarding "will be near instantaneous": this is only a perception. No matter how fast or seemingly simultaneous the operation is, file handling via standard programming libraries can be done in such a way as to be aware of file-related errors, such as:
ENOMEM - insufficient memory.
EMFILE - FOPEN_MAX files open already.
EINVAL - filename is NULL or contains only whitespace.
EINVAL - invalid mode.
(these are in relation to fopen), and checking for them can be used to mitigate OS/file run-time issues. That said, applications should always be written following good programming practice to avoid bumping into OS-related file access issues, thread safety included.

Application architecture with data on a shared network, without a database on the server

I'm currently working on a C# project for an application we'd like to develop. We're brainstorming over the question of sharing the data between users. We'd like to be able to specify a folder where all the files of the application are going to be saved, and we'd like to be able to save them on a shared folder (server, different PC or Mac, NAS, etc.).
The deployment would be like so:
Installation on the first PC: we choose a network drive, share, whatever, and create all the files for the application in this location.
On the second PC we install the application and choose the same location (on the network); the application doesn't create anything, it sees that the files already exist and uses them as the application's data.
Same thing on the other clients.
The application's files are going to be documents (most likely XML formatted documents), and when opening the application we want to show all the existing documents. The thing is, we don't only want to have the list of documents and be able to edit their content, we would also like to be able to edit the documents' properties, so in a way we'd like a file (SQLite, XML, whatever) representing the list of all the documents and their attributes. Same thing for a list of addresses.
I know all that looks exactly like a client/server-with-database solution, but that solution is out of the question. I was first looking at SQLite for my data files, but I know concurrency can be a real problem and file locking doesn't work well. The thing is, I would have the same problem with simple XML files (refreshing the content when several users are working, accessing locked files).
So I guess my final question is: is it feasible? Is there an alternative I didn't see which would allow us to do this more easily?
EDIT:
OK, I'm not responding to every post or comment because I'm currently testing concurrency with SQLite. What I did, and please correct me if the way I test this is wrong, is launch X BackgroundWorkers which all insert records into a sample database (recreated every time I start the application). I tried running 100 INSERT iterations against the database via these BackgroundWorkers.
Of course concurrency works with one application running: it simply waits for the last BackgroundWorker to do its job and then writes the next record. I also tried inserting at (almost) the same time, meaning I put a loop in every BackgroundWorker that waits for a timestamp divisible by 5 (every 5 seconds, every BackgroundWorker runs). Again, it waits for the previous INSERT query to end before doing the next, and everything works fine. I even tried it with 500 BackgroundWorkers and it worked fine.
I then tried launching my app several times and running the instances simultaneously. When doing this I did have some issues. With two instances of my app it still worked fine, but with 4-5 instances it got really buggy and I got two types of error: 1. database is locked, 2. disk I/O failure. But mostly locked databases.
What I did was pretty intensive; in the scenario of my application it will never come to 5 processes trying to simultaneously insert 500 rows at the same time (maybe I'll get a concurrency of two or three connections). But what really bugged me, and what makes me think my testing method is not a good one, is that I got these errors trying to work on a database on a shared network, on a NAS, AND on my own HDD. Every time, it worked for maybe 30-40 queries and then threw a "database is locked" error.
Am I testing it wrong? Maybe I shouldn't be trying so hard to make this work, but I'm still not convinced that SQLite is not a good alternative to what I'm trying to do, since the concurrency is going to be really small.
With your optimistic/pessimistic locking, you are ultimately trying to build a database. Also, you WILL have issues with consistency while trying to keep multiple files in sync with each other. Think about what happens if you update the "metadata" file and the write fails half-way through because of a network blip. File corruption will ensue, and you will be left trying to reconstruct things from backups.
I would suggest a couple of likely solutions:
1) Host the content yourselves, and let them be pure clients (cloud based deployments are ideal for this). Most network/firewall issues can be circumvented by using HTTP as your transport (web services).
2) Have one of the workstations be the "server", which keeps its data files on the NFS. This will give you transactional integrity, incremental backups, etc. There are lots of good embedded database management systems to help you manage this complexity. MS SQL Server even has some great options for this.
You're right: SQLite uses file locks on the database file, so storing all the data files in one database would create a write-starvation problem when editing your documents.
Maybe it's a better choice to implement simple optimistic/pessimistic locking yourself, at the level of individual files? For example, with a pessimistic lock you simply don't allow anyone to edit a particular file if somebody else is already in the process of editing it. In this case you hold a lock on just one file, not on the entire database. If the possibility of conflict (editing a particular file at the same time) is pretty low, it is better to go with optimistic locking.
Simple optimistic locking implementation:
When a user gets a file for reading, it's OK, no problem there. If a user gets a file for editing, you could calculate a hash of the file (or take the file's last-modified timestamp), and then, when the user tries to save the edited file, compare the current (at the moment of saving) hash/timestamp to make sure the file has not been changed by somebody else. If the file has not been changed, it's OK to save it. If the file has been changed, the current user is out of luck and you need to inform him about it. This optimistic scenario is nice when the possibility of being "out of luck" is pretty low. Otherwise it's better to stick with pessimistic locking, where you don't allow a user to even start editing a file if somebody else is doing so.
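A minimal sketch of that check in C# (the names are made up, and SHA-256 over the whole file is just one possible change detector): remember the hash when the user opens the file, and compare it again right before saving. Note there is still a small window between the comparison and the write; if that matters, wrap the check-and-save in a short exclusive open of the target file.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class OptimisticFileSave
{
    public static string HashOf(string path)
    {
        using var sha = SHA256.Create();
        using FileStream stream = File.OpenRead(path);
        return Convert.ToBase64String(sha.ComputeHash(stream));
    }

    // Returns false if somebody else saved the file since we read it.
    public static bool TrySave(string path, string hashWhenOpened, byte[] newContent)
    {
        if (HashOf(path) != hashWhenOpened)
            return false;                    // "out of luck": tell the user to re-open

        File.WriteAllBytes(path, newContent);
        return true;
    }
}

// usage: string opened = OptimisticFileSave.HashOf(path);  ... user edits ...
//        bool saved = OptimisticFileSave.TrySave(path, opened, editedBytes);
```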

Transactional handling of text files on Windows

I have multiple Windows programs (running on Windows 2000, XP and 7) which handle text files of different formats (CSV, TSV, INI and XML). It is very important not to corrupt the content of these files during file I/O. Every file should be safely accessible by multiple programs concurrently and should be resistant to system crashes. This SO answer suggests using an in-process database, so I'm considering the Microsoft Jet Database Engine, which can handle delimited text files (CSV, TSV) and supports transactions. I have used Jet before, but I don't know whether Jet transactions really tolerate unexpected crashes or shutdowns in the commit phase, and I don't know what to do with non-delimited text files (INI, XML). I don't think it's a good idea to try to implement fully ACIDic file I/O by hand.
What is the best way to implement transactional handling of text files on Windows? I have to be able to do this in both Delphi and C#.
Thank you for your help in advance.
EDIT
Let's see an example based on #SirRufo's idea. Forget about concurrency for a second, and let's concentrate on crash tolerance.
I read the contents of a file into a data structure in order to modify some fields. When I'm in the process of writing the modified data back into the file, the system can crash.
File corruption can be avoided if I never write the data back into the original file. This can be easily achieved by creating a new file, with a timestamp in the filename every time a modification is saved. But this is not enough: the original file will stay intact, but the newly written one may be corrupt.
I can solve this by putting a "0" character after the timestamp, which would mean that the file hasn't been validated. I would end the writing process by a validation step: I would read the new file, compare its contents to the in-memory structure I'm trying to save, and if they are the same, then change the flag to "1". Each time the program has to read the file, it chooses the newest version by comparing the timestamps in the filename. Only the latest version must be kept, older versions can be deleted.
Concurrency could be handled by waiting on a named mutex before reading or writing the file. When a program gains access to the file, it must start with checking the list of filenames. If it wants to read the file, it will read the newest version. On the other hand, writing can be started only if there is no version newer than the one read last time.
This is a rough, oversimplified, and inefficient approach, but it shows what I'm thinking about. Writing files is unsafe, but maybe there are simple tricks like the one above which can help to avoid file corruption.
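As a sketch of the named-mutex part (the mutex name is made up): a named mutex serializes the readers and writers across processes, but only on a single machine, so it does not help if the programs run on different computers against a network share.

```csharp
using System.Threading;

// Serialize access to one shared file across processes on this machine.
using var mutex = new Mutex(initiallyOwned: false, name: @"Global\MyApp_SettingsFile");

mutex.WaitOne();
try
{
    // pick the newest validated version, or write a new timestamped version here
}
finally
{
    mutex.ReleaseMutex();
}
```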
UPDATE
Open-source solutions, written in Java:
Atomic File Transactions: article-1, article-2, source code
Java Atomic File Transaction (JAFT): project home
XADisk: tutorial, source code
AtomicFile: description, source code
How about using NTFS file streams? Write multiple named (numbered/timestamped) streams to the same filename. Every version could be stored in a different stream but is actually stored in the same "file" or bunch of files, preserving the data and providing a roll-back mechanism...
When you reach a point of certainty, delete some of the previous streams.
Introduced in NT 4? It covers all your Windows versions. It should be crash-proof: you will always have the previous version/stream plus the original to recover or roll back to.
Just a late night thought.
http://msdn.microsoft.com/en-gb/library/windows/desktop/aa364404%28v=vs.85%29.aspx
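In C#, writing a version into an alternate data stream needs a P/Invoke CreateFile call on the .NET Framework, because its path validation rejects the colon in "file.ext:streamname"; a sketch (the stream naming scheme is up to you):

```csharp
using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class AlternateStreams
{
    const uint GENERIC_WRITE = 0x40000000;
    const uint FILE_SHARE_READ = 0x00000001;
    const uint CREATE_ALWAYS = 2;

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string lpFileName, uint dwDesiredAccess, uint dwShareMode,
        IntPtr lpSecurityAttributes, uint dwCreationDisposition,
        uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    // Writes `content` into an alternate data stream of `file`, e.g. "config.xml:v20140101".
    public static void WriteVersionStream(string file, string streamName, byte[] content)
    {
        SafeFileHandle handle = CreateFile(
            file + ":" + streamName, GENERIC_WRITE, FILE_SHARE_READ,
            IntPtr.Zero, CREATE_ALWAYS, 0, IntPtr.Zero);

        if (handle.IsInvalid)
            throw new Win32Exception(Marshal.GetLastWin32Error());

        using var stream = new FileStream(handle, FileAccess.Write);
        stream.Write(content, 0, content.Length);
    }
}
```

Keep in mind that alternate data streams only exist on NTFS; copying the file to a FAT volume or sending it over most transfer mechanisms silently drops them.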
What you are asking for is transactionality, which is not possible without developing the mechanisms of an RDBMS yourself, given your requirement:
"It is very important not to corrupt the content of these files during file IO"
Pick up a DBMS.
See a related post Accessing a single file with multiple threads
However, my opinion is to use a database like RavenDB for this kind of transaction. RavenDB supports concurrent access to the same file as well as batching multiple operations into a single request. However, everything is persisted as JSON documents, not text files. It supports .NET/C# very well (including JavaScript and HTML), but not Delphi.
First of all, this question has nothing to do with C# or Delphi. You have to treat your file structure as if it were a database.
Assumptions:
Moving files is a cheap operation, and the operating system guarantees that files are not corrupted during a move.
You have a single directory of files that need to be processed (d:\filesDB\*.*).
A controller application is a must.
Simplified worker process:
- Initialization
Get a process ID from the operating system.
Create directories in d:\filesDB:
d:\filesDB\<processID>
d:\filesDB\<processID>\inBox
d:\filesDB\<processID>\outBox
- Processing, for each file
Select a file to process.
Move it to the "inBox" directory (this ensures single access to the file).
Open the file.
Create the new file in "outBox" and close it properly.
Delete the file in the "inBox" directory.
Move the newly created file from "outBox" back to d:\filesDB.
- Finalization
Remove the created directories.
Controller Application
Runs only on startup of the system, and initializes the applications that will do the work:
1. Scan the d:\filesDB directory for subdirectories.
2. For each subdirectory:
2.1 If a file exists in "inBox", move it back to d:\filesDB and skip "outBox".
2.2 If a file exists in "outBox", move it to d:\filesDB.
2.3 Delete the whole subdirectory.
3. Start each worker process that needs to be started.
I hope that this will solve your problem.
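A minimal C# sketch of the worker side of that protocol (the root path, the transform callback, and the class name are assumptions): the File.Move into the worker's own inBox is what guarantees exclusive access, because only one process's move of a given file can succeed.

```csharp
using System;
using System.IO;

class FileDbWorker
{
    readonly string root = @"d:\filesDB";   // hypothetical shared directory
    readonly string inBox, outBox;

    public FileDbWorker(int processId)
    {
        string workDir = Path.Combine(root, processId.ToString());
        inBox = Path.Combine(workDir, "inBox");
        outBox = Path.Combine(workDir, "outBox");
        Directory.CreateDirectory(inBox);
        Directory.CreateDirectory(outBox);
    }

    // Returns false if another worker claimed the file first.
    public bool TryProcess(string fileName, Func<string, string> transform)
    {
        string source = Path.Combine(root, fileName);
        string claimed = Path.Combine(inBox, fileName);
        try
        {
            File.Move(source, claimed);      // only one process wins this move
        }
        catch (IOException)
        {
            return false;                    // somebody else got there first
        }

        string result = Path.Combine(outBox, fileName);
        File.WriteAllText(result, transform(File.ReadAllText(claimed)));

        File.Delete(claimed);
        File.Move(result, source);           // publish the finished file
        return true;
    }
}
```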
You are creating a nightmare for yourself trying to handle these transactions and states in your own code across multiple systems. This is why Larry Ellison (Oracle's CEO) is a billionaire and most of us are not. If you absolutely must use files, then set up Oracle or another database that supports LOB and CLOB objects. I store very large SVG files in such a table for my company so that we can add and render large maps in our systems without any code changes. The files can be pulled from the table and passed to your users in a buffer, then returned to the database when they are done. Set up the appropriate security and record locking and your problem is solved.
OK, you are dead in the water here - unless you can drop XP. It's as simple as that.
Post-XP versions of Windows support Transactional NTFS, though it is not exposed to .NET natively (you can still use it via P/Invoke). This allows you to roll back or commit changes on an NTFS file system, even in coordination with a database through a DTC. Pretty nice. On XP, though - no way, it's not there.
Start at Any real-world, enterprise-grade experience with Transactional NTFS (TxF)? as a starter. The question there lists a lot of resources to get you started on how to do it.
Note that this DOES have a performance overhead - obviously. It is not that bad, though, unless you need a SECOND transactional resource: there is a very thin kernel-level transaction coordinator, and transactions only get promoted to a full DTC when a second resource is added.
For a direct link - http://msdn.microsoft.com/en-us/magazine/cc163388.aspx has some nice information.

Using the bittorrent protocol to distribute nightly and CI builds

This question continues from what I learned from my question yesterday titled "using git to distribute nightly builds".
In the answers to that question it became clear that git would not suit my needs, and I was encouraged to re-examine BitTorrent.
Short Version
I need to distribute nightly builds to 70+ people each morning and would like to use BitTorrent to load-balance the transfer.
Long Version
NB: You can skip the paragraph below if you have read my previous question.
Each morning we need to distribute our nightly build to a studio of 70+ people (artists, testers, programmers, production, etc.). Up until now we have copied the build to a server and written a sync program that fetches it (using Robocopy underneath); even with mirrors set up, the transfer speed is unacceptably slow, taking up to an hour or longer to sync at peak times (off-peak is roughly 15 minutes), which points to a hardware I/O bottleneck and possibly network bandwidth.
What I know so far
What I have found so far:
I have found the excellent Wikipedia entry on the BitTorrent protocol, which was an interesting read (I had previously known only the basics of how torrents work). I also found this StackOverflow answer on the BITFIELD exchange that happens after the client-server handshake.
I have also found the MonoTorrent C# Library (GitHub Source) that I can use to write our own tracker and client. We cannot use off-the-shelf trackers or clients (e.g. uTorrent).
Questions
In my initial design, I have our build system creating a .torrent file and adding it to the tracker. I would super-seed the torrent using our existing mirrors of the build.
Using this design, would I need to create a new .torrent file for each new build? In other words, would it be possible to create a "rolling" .torrent where, if the content of the build has only changed by 20%, that is all that needs to be downloaded to get the latest build?
... Actually, in writing the above question, I think that I would need to create a new file; however, I would be able to download to the same location on the user's machine, and the hash check would automatically determine what I already have. Is this correct?
In response to comments
For a completely fresh sync, the entire build (including the game, source code, localized data, and disc images for PS3 and X360) is ~37,000 files, coming in at just under 50GB. This is going to increase as production continues. This sync took 29 minutes to complete at a time when there were only 2 other syncs happening, which is low-peak if you consider that at 9am we would have 50+ people wanting to get the latest build.
We have investigated the disk I/O and network bandwidth with the IT dept; the conclusion was that the network storage was being saturated. We are also recording sync statistics to a database, and these records show that even with a handful of users we are getting unacceptable transfer rates.
In regard to not using off-the-shelf clients, there is a legal concern with having an application like uTorrent installed on users' machines, given that other items can easily be downloaded with that program. We also want a custom workflow for determining which build you want to get (e.g. only PS3 or X360 depending on what DEVKIT you have on your desk) and to have notifications of new builds available, etc. Creating a client using MonoTorrent is not the part I'm concerned about.
To the question whether or not you need to create a new .torrent, the answer is: yes.
However, depending a bit on the layout of your data, you may be able to do some simple semi-delta-updates.
If the data you distribute is a large collection of individual files, where each build may change only some of them, you can simply create a new .torrent file and have all clients download it to the same location as the old one (just as you suggest). The clients would first hash-check the files that already exist on disk, update the ones that have changed, and download the new files. The main drawback is that removed files would not actually be deleted on the clients.
If you're writing your own client anyway, deleting files on the filesystem that aren't in the .torrent file is a fairly simple step that can be done separately.
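A sketch of that clean-up step, assuming you can obtain the list of relative paths from the .torrent you just loaded (Path.GetRelativePath needs .NET Core / .NET 5+; on older frameworks you would compute the relative path yourself):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class BuildPruner
{
    // `filesInTorrent` holds the relative paths listed in the new .torrent;
    // anything on disk under buildRoot that is not listed gets removed.
    public static void Prune(string buildRoot, IEnumerable<string> filesInTorrent)
    {
        var keep = new HashSet<string>(filesInTorrent, StringComparer.OrdinalIgnoreCase);

        foreach (string path in Directory.EnumerateFiles(buildRoot, "*", SearchOption.AllDirectories))
        {
            string relative = Path.GetRelativePath(buildRoot, path);
            if (!keep.Contains(relative))
                File.Delete(path);
        }
    }
}
```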
This semi-delta approach does not work if you distribute an image file, since the bits that stayed the same across versions may have moved, yielding different piece hashes.
I would not necessarily recommend using super-seeding. Depending on how strict the super seeding implementation you use is, it may actually harm transfer rates. Keep in mind that the purpose of super seeding is to minimize the number of bytes sent from the seed, not to maximize the transfer rate. If all your clients are behaving properly (i.e. using rarest first), the piece distribution shouldn't be a problem anyway.
Also, creating a torrent and hash-checking a 50 GiB torrent puts a lot of load on the drive, so you may want to benchmark the BitTorrent implementation you use for this to make sure it's performant enough. At 50 GiB, the difference between implementations may be significant.
Just wanted to add a few non-BitTorrent suggestions for your perusal:
If the delta between nightly builds is not significant, you may be able to use rsync to reduce your network traffic and decrease the time it takes to copy the build. At a previous company we used rsync to submit builds to our publisher, as we found our disc images didn't change much build-to-build.
Have you considered simply staggering the copy operations so that clients aren't slowing down the transfer for each other? We've been using a simple Python script internally when we do milestone branches: the script goes to sleep until a random time in a specified range, wakes up, downloads and checks-out the required repositories and runs a build. The user runs the script when leaving work for the day, when they return they have a fresh copy of everything ready to go.
You could use BitTorrent Sync, which is something of an alternative to Dropbox but without a server in the cloud. It allows you to synchronize any number of folders and files of any size with several people, and it uses the same algorithms as the BitTorrent protocol. You can create a read-only folder and share the key with others. This method removes the need to create a new .torrent file for each build.
Just to throw another option into the mix, have you considered BITS? I haven't used it myself, but from reading the documentation it supports a distributed peer-caching model, which sounds like it will achieve what you want.
The downside is that it is a background service, so it will give up network bandwidth in favour of user-initiated activity - nice for your users, but possibly not what you want if you need data on a machine in a hurry.
Still, it's another option.

Transactional file writing in C# and Windows?

I have a data file, and from time to time I need to write a change to it. The change consists of changing information in more than one place, for example changing some data near the end of the file and also changing some information near the start. I want the two separate writes to either both succeed or both fail; otherwise the file is left in an uncertain state and is effectively corrupted. Is there any built-in support for this scenario in .NET, or in general?
If not, then how do others solve this issue? How do databases on Windows solve it?
UPDATE: I do not want to use the Transactional NTFS capability because it is not available on older versions of Windows such as XP, and it is slow in the file-overwrite scenario described above.
Databases basically use a journal concept (at least the ones I'm aware of). The idea is that a write operation is recorded in the journal until the writer commits the transaction. (Sure, that's just a basic description; it's simplified.)
In your case, it could be a copy of your file into which you write the data; if everything finishes successfully, you substitute the original file with its copy.
Substitution means: rename the original file to an "old" name, then rename the new copy to the original name.
If the substitution fails, that is a critical error which the application should handle via fault-tolerance strategies: for example, inform the user about the failed save operation and try to recover. Note that at any moment you have both copies of your file: the one from before the write operation started and the one from after it finished.
We used this technique with pretty good success on past projects: VS-IDE-like systems for industrial control.
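In C#, the copy-and-substitute step maps nicely onto File.Replace, which performs the swap and keeps the previous version as a backup; a minimal sketch:

```csharp
using System.IO;

static class SafeSave
{
    // Write to a side file first, then swap it in. A crash leaves either the
    // old file or the new file intact, never a half-written original.
    public static void Save(string path, byte[] content)
    {
        string temp = path + ".new";
        string backup = path + ".bak";

        File.WriteAllBytes(temp, content);

        if (File.Exists(path))
            File.Replace(temp, path, backup);   // swap and keep the previous version
        else
            File.Move(temp, path);
    }
}
```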
If you are using Windows 6 or later (Vista/7/2008/2008R2) the NTFS filesystem supports transactions (including within a distributed transaction): but you will need to use P/Invoke to call Win32 APIs (see this question).
If you need to run on older versions of Windows, or on non-NTFS partitions, you would need to perform the transactions yourself. This is decidedly non-trivial: getting full ACID functionality while handling multiple processes (including remote access via shares) across process and system crashes is hard, even under the assumption that only your access methods will be used (some other process using normal Win32 APIs would of course break things).
In this case a database will almost certainly be easier: there are a number of in-process databases (SQL Server Compact Edition, SQLite, ...), so a database doesn't require a server process.
