I'm writing a back up solution (of sorts). Simply it copies a file from location C:\ and pastes it to location Z:\
To ensure the speed is fast, before copying and pasting it checks to see if the original file exists. If it does, it performs a few 'calculations' to work out if the copy should continue or if the backup file is up to date. It is these calculations I'm finding difficult.
Originally, I compared the file size but this is not good enough because it would be very possible to change a file and it to be the same size (for example saving the character C in notepad is the same size as if I saved the Character T).
So, I need to find out if the modified date differs. At the moment, I get the file info using the FileInfo class but after reviewing all the fields there is nothing which appears to be suitable.
How can I check to ensure that I'm copying files which have been modified?
EDIT
I have seen suggestions on SO to use MD5 checksums, but I'm concerned this may be a problem as some of the files I'm comparing will be up to 10GB
Going by modified date will be unreliable - the computer clock can go backwards when it synchronizes, or when manually adjusted. Some programs might not behave well when modifying or copying files in terms of managing the modified date.
Going by the archive bit might work in a controlled environment but what happens if another piece of software is running that uses the archive bit as well?
The Windows archive bit is evil and must be stopped
If you want (almost) complete reliability then what you should do is store a hash value of the last backed up version using a good hashing function like SHA1, and if the hash value changes then you upload the new copy.
Here is the SHA1 class along with a code sample on the bottom:
http://msdn.microsoft.com/en-us/library/system.security.cryptography.sha1.aspx
Just run the file bytes through it and store the hash value. Pass a FileStream to it instead of loading your file into memory with a byte array to reduce memory usage, especially for large files.
You can combine this with modified date in various ways to tweak your program as needed for speed and reliability. For example, you can check modified dates for most backups and periodically run a hash checker that runs while the system is idle to make sure nothing got missed. Sometimes the modified date will change but the file contents are still the same (i.e. got overwritten with the same data), in which case you can avoid resending the whole file after you recompute the hash and realize it is still the same.
Most version control systems use some kind of combined approach with hashes and modified dates.
Your approach will generally involve some kind of risk management with a compromise between performance and reliability if you don't want to do a full backup and send all the data over each time. It's important to do "full backups" once in a while for this reason.
You can compare files by their hashes:
private byte[] GetFileHash(string fileName)
{
HashAlgorithm sha1 = HashAlgorithm.Create();
using(FileStream stream = new FileStream(fileName,FileMode.Open,FileAccess.Read))
return sha1.ComputeHash(stream);
}
If content was changed, hashes will be different.
You may like to check out the FileSystemWatcher class.
"This class lets you monitor a directory for changes and will fire an
event when something is modified."
Your code can then handle the event and process the file.
Code source - MSDN:
// Create a new FileSystemWatcher and set its properties.
FileSystemWatcher watcher = new FileSystemWatcher();
watcher.Path = args[1];
/* Watch for changes in LastAccess and LastWrite times, and
the renaming of files or directories. */
watcher.NotifyFilter = NotifyFilters.LastAccess | NotifyFilters.LastWrite
| NotifyFilters.FileName | NotifyFilters.DirectoryName;
// Only watch text files.
watcher.Filter = "*.txt";
// Add event handlers.
watcher.Changed += new FileSystemEventHandler(OnChanged);
watcher.Created += new FileSystemEventHandler(OnChanged);
watcher.Deleted += new FileSystemEventHandler(OnChanged);
watcher.Renamed += new RenamedEventHandler(OnRenamed);
Generally speaking, you'd let the OS take care of tracking whether a file has changed or not.
If you use:
File.GetAttributes
And check for the archive flag, this will tell you if the file has changed since it was last archived. I believe XCOPY and similar reset this flag once it has done the copy, but you may need to take care of this yourself.
You can easily test the flag in DOS using:
dir /aa yourfilename
Or just add the attributes column in windows explorer.
The file archive flag is normally used by backup programs to check whether a file needs backing up. When Windows modifies or creates a file, it sets the archive flag (see here). Check whether the archive flag is set to decide whether the file needs backing up:
if ((File.GetAttributes(fileName) & FileAttributes.Archive) == FileAttributes.Archive)
{
// Archive file.
}
After backing up the file, clear the archive flag:
File.SetAttributes(fileName, File.GetAttributes(fileName) & ~FileAttributes.Archive);
This assumes no other programs (e.g., system backup software) are clearing the archive flag.
From this article get the Crc32 class
Calculating CRC-32 in C# and .NET
Pass your file path to this function...
It returns a CRC value... compare it to your file that already exists... if the CRC's are different then the file is changed.
internal Int32 GetCRC(string filepath)
{
Int32 ret = 0;
StringBuilder hash = new StringBuilder();
try
{
Crc32 crc32 = new Crc32();
using (System.IO.FileStream fs = File.Open(filepath, FileMode.Open, FileAccess.Read, FileShare.None))
foreach (byte b in crc32.ComputeHash(fs)) hash.Append(b.ToString("x2").ToLower());
ret = Int32.Parse(hash.ToString(), System.Globalization.NumberStyles.HexNumber);
}
catch (Exception ex)
{
string msg = (ex.InnerException == null) ? ex.Message : ex.InnerException.Message;
Console.WriteLine($"FILE ERROR: {msg}");
ret = 0;
}
finally
{
hash.Clear();
hash = null;
}
return ret;
}
Related
Within a tool copying big files between disks, I replaced the
System.IO.FileInfo.CopyTo method by System.IO.Stream.CopyToAsync.
This allow a faster copy and a better control during the copy, e.g. I can stop the copy.
But this create even more fragmentation of the copied files. It is especially annoying when I copy file of many hundreds megabytes.
How can I avoid disk fragmentation during copy?
With the xcopy command, the /j switch copies files without buffering. And it is recommended for very large file in TechNet
It seems indeed to avoid file fragmentation (while a simple file copy within windows 10 explorer DOES fragment my file!)
A copy without buffering seems to be the opposite way than this async copy. Or it there any way to do async copy without buffering?
Here it my current code for aync copy. I let the default buffersize of 81920 bytes, i.e. 10*1024*size(int64).
I am working with NTFS file systems, thus 4096 bytes clusters.
EDIT: I updated the code with SetLength as suggested, added the FileOptions Async while creating the destinationStream and fix setting the attributes AFTER setting the time (otherwise, exception is thrown for ReadOnly files)
int bufferSize = 81920;
try
{
using (FileStream sourceStream = source.OpenRead())
{
// Remove existing file first
if (File.Exists(destinationFullPath))
File.Delete(destinationFullPath);
using (FileStream destinationStream = File.Create(destinationFullPath, bufferSize, FileOptions.Asynchronous))
{
try
{
destinationStream.SetLength(sourceStream.Length); // avoid file fragmentation!
await sourceStream.CopyToAsync(destinationStream, bufferSize, cancellationToken);
}
catch (OperationCanceledException)
{
operationCanceled = true;
}
} // properly disposed after the catch
}
}
catch (IOException e)
{
actionOnException(e, "error copying " + source.FullName);
}
if (operationCanceled)
{
// Remove the partially written file
if (File.Exists(destinationFullPath))
File.Delete(destinationFullPath);
}
else
{
// Copy meta data (attributes and time) from source once the copy is finished
File.SetCreationTimeUtc(destinationFullPath, source.CreationTimeUtc);
File.SetLastWriteTimeUtc(destinationFullPath, source.LastWriteTimeUtc);
File.SetAttributes(destinationFullPath, source.Attributes); // after set time if ReadOnly!
}
I fear also that the File.SetAttributes and Time at the end on my code could increase file fragmentation.
Is there a proper way to create a 1:1 asynchronous file copy without any file fragmentation, i.e. asking the HDD that the file steam get only contiguous sectors?
Other topics regarding file fragmentation like How can I limit file fragmentation while working with .NET suggests incrementing the file size in larger chunks, but it does not seem to be a direct answer to my question.
but the SetLength method does the job
It does not do the job. It only updates the file size in the directory entry, it does not allocate any clusters. The easiest way to see this for yourself is by doing this on a very large file, say 100 gigabytes. Note how the call completes instantly. Only way it can be instant is when the file system does not also do the job of allocating and writing the clusters. Reading from the file is actually possible, even though the file contains no actual data, the file system simply returns binary zeros.
This will also mislead any utility that reports fragmentation. Since the file has no clusters, there can be no fragmentation. So it only looks like you solved your problem.
The only thing you can do to force the clusters to be allocated is to actually write to the file. It is in fact possible to allocate 100 gigabytes worth of clusters with a single write. You must use Seek() to position to Length-1, then write a single byte with Write(). This will take a while on a very large file, it is in effect no longer async.
The odds that it will reduce fragmentation are not great. You merely reduced the risk somewhat that the writes will be interleaved by writes from other processes. Somewhat, actual writing is done lazily by the file system cache. Core issue is that the volume was fragmented before you began writing, it will never be less fragmented after you're done.
Best thing to do is to just not fret about it. Defragging is automatic on Windows these days, has been since Vista. Maybe you want to play with the scheduling, maybe you want to ask more about it at superuser.com
I think, FileStream.SetLength is what you need.
Considering Hans Passant answer,
in my code above, an alternative to
destinationStream.SetLength(sourceStream.Length);
would be, if I understood it properly:
byte[] writeOneZero = {0};
destinationStream.Seek(sourceStream.Length - 1, SeekOrigin.Begin);
destinationStream.Write(writeOneZero, 0, 1);
destinationStream.Seek(0, SeekOrigin.Begin);
It seems indeed to consolidate the copy.
But a look at the source code of FileStream.SetLengthCore seems it does almost the same, seeking at the end but without writing one byte:
private void SetLengthCore(long value)
{
Contract.Assert(value >= 0, "value >= 0");
long origPos = _pos;
if (_exposedHandle)
VerifyOSHandlePosition();
if (_pos != value)
SeekCore(value, SeekOrigin.Begin);
if (!Win32Native.SetEndOfFile(_handle)) {
int hr = Marshal.GetLastWin32Error();
if (hr==__Error.ERROR_INVALID_PARAMETER)
throw new ArgumentOutOfRangeException("value", Environment.GetResourceString("ArgumentOutOfRange_FileLengthTooBig"));
__Error.WinIOError(hr, String.Empty);
}
// Return file pointer to where it was before setting length
if (origPos != value) {
if (origPos < value)
SeekCore(origPos, SeekOrigin.Begin);
else
SeekCore(0, SeekOrigin.End);
}
}
Anyway, not sure that theses method guarantee no fragmentation, but at least avoid it for most of the cases. Thus the auto defragment tool will finish the job at a low performance expense.
My initial code without this Seek calls created hundred of thousands of fragments for 1 GB file, slowing down my machine when the defragment tool went active.
An external Windows service I work with maintains a single text-based log file that it continuously appends to. This log file grows unbounded over time. I'd like to prune this log file periodically to maintain, say the most recent 5mb of log entries. How can I efficiently implement the file I/O code in C# .NET 4.0 to prune the file to say 5mb?
Updated:
The way service dependencies are set up, my service always starts before the external service. This means I get exclusive access to the log file to truncate it, if required. Once the external service starts up, I will not access the log file. I can gain exclusive access to the file on desktop startup. The problem is - the log file may a few gigabytes in size and I'm looking for an efficient way to truncate it.
It's going to take the amount of memory that you want to store to process the "new" log file but if you only want 5Mb then it should be fine. If you are talking about Gb+ then you probably have other problems; however, it could still be accomplished using a temp file and some locking.
As noted before, you may experience a race condition but that's not the case if this is the only thread writing to this file. This would replace your current writing to the file.
const int MAX_FILE_SIZE_IN_BYTES = 5 * 1024 * 1024; //5Mb;
const string LOG_FILE_PATH = #"ThisFolder\log.txt";
string newLogMessage = "Hey this happened";
#region Use one or the other, I mean you could use both below if you really want to.
//Use this one to save an extra character
if (!newLogMessage.StartsWith(Environment.NewLine))
newLogMessage = Environment.NewLine + newLogMessage;
//Use this one to imitate a write line
if (!newLogMessage.EndsWith(Environment.NewLine))
newLogMessage = newLogMessage + Environment.NewLine;
#endregion
int newMessageSize = newLogMessage.Length*sizeof (char);
byte[] logMessage = new byte[MAX_FILE_SIZE_IN_BYTES];
//Append new log to end of "file"
System.Buffer.BlockCopy(newLogMessage.ToCharArray(), 0, logMessage, MAX_FILE_SIZE_IN_BYTES - newMessageSize, logMessage.Length);
FileStream logFile = File.Open(LOG_FILE_PATH, FileMode.Open, FileAccess.ReadWrite);
int sizeOfRetainedLog = (int)Math.Min(MAX_FILE_SIZE_IN_BYTES - newMessageSize, logFile.Length);
//Set start position/offset of the file
logFile.Position = logFile.Length - sizeOfRetainedLog;
//Read remaining portion of file to beginning of buffer
logFile.Read(logMessage, logMessage.Length, sizeOfRetainedLog);
//Clear the file
logFile.SetLength(0);
logFile.Flush();
//Write the file
logFile.Write(logMessage, 0, logMessage.Length);
I wrote this really quick, I apologize if I'm off by 1 somewhere.
depending on how often it is written to I'd say you might be facing a race condition to modify the file without damaging the log. You could always try writing a service to monitor the file size, and once it reaches a certain point lock the file, dupe and clear the whole thing and close it. Then store the data in another file that the service controls the size of easily. Alternatively you could see if the external service has an option for logging to a database, which would make it pretty simple to roll out the oldest data.
You could use a file observer to monitor the file:
FileSystemWatcher logWatcher = new FileSystemWatcher();
logWatcher.Path = #"c:\example.log"
logWatcher.Changed += logWatcher_Changed;
Then when the event is raised you can use a StreamReader to read the file
private void logWatcher_Changed(object sender, FileSystemEventArgs e)
{
using (StreamReader readFile = new StreamReader(path))
{
string line;
string[] row;
while ((line = readFile.ReadLine()) != null)
{
// Here you delete the lines you want or move it to another file, so that your log keeps small. Then save the file.
}
}
}
It´s an option.
I'm developing a small C# application that scans a log file for lines containing certain keywords and alerts the user when one of the keywords is found. This log is potentially extremely large (several gigabytes, in worst case scenario) but the only lines on the log that are relevant to me, are the ones added to the log while my application is running.
Is there a way I can capture each text line being appended to the file, without having to worry about the file content that was already present?
I already found out about the FileSystemWatcher class while searching for a solution, and while that seems great for notifying when I have new content to fetch from the log, it doesn't seem to help for telling me what was added to it.
If you keep a FileStream open in Read mode (allowing writers, of course), you should be able to initially scan through the whole file and wait at the end until the FSW notifies you that the file has been modified.
Just be careful to reset your reading thread somehow if the file is deleted, for example if the log file that you are tailing gets rolled.
Here, I knocked together an example- run this, and while it is running, edit C:\Temp\Temp.txt in notepad and save it:
public static void Main()
{
var lockMe = new object();
using (var latch = new ManualResetEvent(true))
using (var fs = new FileStream(#"C:\Temp\Temp.txt", FileMode.OpenOrCreate, FileAccess.Read, FileShare.ReadWrite))
using (var fsw = new FileSystemWatcher(#"C:\Temp\"))
{
fsw.Changed += (s, e) =>
{
lock (lockMe)
{
if (e.FullPath != #"C:\Temp\Temp.txt") return;
latch.Set();
}
};
using (var sr = new StreamReader(fs))
while (true)
{
latch.WaitOne();
lock (lockMe)
{
String line;
while ((line = sr.ReadLine()) != null)
Console.Out.WriteLine(line);
latch.Set();
}
}
}
}
The most efficient solution (if your application needs it), is to write a file hook driver to capture all write access to to the file. That driver might tell you what bytes were changed. If you don't want to write the driver in C/C++, perhaps you can use EasyHook. EasyHook is great because, if you know the exact application that's writing to the log file, you can write a very simple user-mode hook (check his examples on CodePlex). If you don't know the name of the applications, you might have to write a kernel-hook (which is still easier with EasyHook).
Instead of reading the text from the file (what I assume you are doing), read the bytes of the file. If you can assume that writes to the file will always be appended, and you know the text encoding of the file, then you can just read in the bytes starting at the file size of the original file. Then convert the bytes to text using the proper encoding.
In a similar way to this question, but you'll need to have the old file size recorded. Then instead of seeking back 10 newlines, just seek back the size difference. You'll have to be careful about encodings though.
I have a program that runs as a Windows Service which is processing files in a specific folder.
Since it's a service, it constantly monitors a folder for new files that have been added. Part of the program's job is to perform comparisons of files in the target folder and flag non-matching files.
What I would like to do is to detect a running copy operation and when it is completed, so that a file is not getting prematurely flagged if it's matching file has not been copied over to the target folder yet.
What I was thinking of doing was using the FileSystemWatcher to watch the target folder and see if a copy operation is occurring. If there is, I put my program's main thread to sleep until the copy operation has completed, then proceed to perform the operation on the folder like normal.
I just wanted to get some insight on this approach and see if it is valid. If anyone else has any other unique approaches to this problem, it would be greatly appreciated.
UPDATE:
I apologize for the confusion, when I say target directory, I mean the source folder containing all the files I want to process. A part of the function of my program is to copy the directory structure of the source directory to a destination directory and copy all valid files to that destination directory, preserving the directory structure of the original source directory, i.e. a user may copy folders containing files to the source directory. I want to prevent errors by ensuring that if a new set of folders containing more subfolders and files is copied to the source directory for processing, my program will not start operating on the target directory until the copy process has completed.
Yup, use a FileSystemWatcher but instead of watching for the created event, watch for the changed event. After every trigger, try to open the file. Something like this:
var watcher = new FileSystemWatcher(path, filter);
watcher.Changed += (sender, e) => {
FileStream file = null;
try {
Thread.Sleep(100); // hack for timing issues
file = File.Open(
e.FullPath,
FileMode.Open,
FileAccess.Read,
FileShare.Read
);
}
catch(IOException) {
// we couldn't open the file
// this is probably because the copy operation is not done
// just swallow the exception
return;
}
// now we have a handle to the file
};
This is about the best that you can do, unfortunately. There is no clean way to know that the file is ready for you to use.
What you are looking for is a typical producer/consumer scenario. What you need to do is outlined in 'Producer/consumer queue' section on this page. This will allow you to use multi threading (maybe span a backgroundworker) to copy files so you don't block the main service thread from listening to system events & you can perform more meaningful tasks there - like checking for new files & updating the queue. So on main thread do check for new files on background threads perform the actual coping task. From personal experience (have implemented this tasks) there is not too much performance gain from this approach unless you are running on multiple CPU machine but the process is very clean & smooth + the code is logically separated nicely.
In short, what you have to do is have an object like the following:
public class File
{
public string FullPath {get; internal set;}
public bool CopyInProgress {get; set;} // property to make sure
// .. other properties if desired
}
Then following the tutorial posted above issue a lock on the File object & the queue to update it & copy it. Using this approach you can use this type approaches instead of constantly monitoring for file copy completion.
The important point to realize here is that your service has only one instance of File object per actual physical file - just make sure you (1)lock your queue when adding & removing & (2) lock the actual File object when initializing an update.
EDIT: Above where I say "there is not too much performance gain from this approach unless" I refere to if you do this approach in a single thread compare to #Jason's suggesting this approach must be noticeably faster due to #Jason's solution performing very expensive IO operations which will fail on most cases. This I haven't tested but I'm quite sure as my approach does not require IO operations open(once only), stream(once only) & close file(once only). #Jason approach suggests multiple open,open,open,open operations which will all fail except the last one.
One approach is to attempt to open the file and see if you get an error. The file will be locked if it is being copied. This will open the file in shared mode so it will conflict with an already open write lock on the file:
using(System.IO.File.Open("file", FileMode.Open,FileAccess.Read, FileShare.Read)) {}
Another is to check the file size. It would change over time if the file is being copied to.
It is also possible to get a list of all applications that has opened a certain file, but I don't know the API for this.
I know this is an old question, but here's an answer I spun up after searching for an answer to just this problem. This had to be tweaked a lot to remove some of the proprietary-ness from what I was working on, so this may not compile directly, but it'll give you an idea. This is working great for me:
void BlockingFileCopySync(FileInfo original, FileInfo copyPath)
{
bool ready = false;
FileSystemWatcher watcher = new FileSystemWatcher();
watcher.NotifyFilter = NotifyFilters.LastWrite;
watcher.Path = copyPath.Directory.FullName;
watcher.Filter = "*" + copyPath.Extension;
watcher.EnableRaisingEvents = true;
bool fileReady = false;
bool firsttime = true;
DateTime previousLastWriteTime = new DateTime();
// modify this as you think you need to...
int waitTimeMs = 100;
watcher.Changed += (sender, e) =>
{
// Get the time the file was modified
// Check it again in 100 ms
// When it has gone a while without modification, it's done.
while (!fileReady)
{
// We need to initialize for the "first time",
// ie. when the file was just created.
// (Really, this could probably be initialized off the
// time of the copy now that I'm thinking of it.)
if (firsttime)
{
previousLastWriteTime = System.IO.File.GetLastWriteTime(copyPath.FullName);
firsttime = false;
System.Threading.Thread.Sleep(waitTimeMs);
continue;
}
DateTime currentLastWriteTime = System.IO.File.GetLastWriteTime(copyPath.FullName);
bool fileModified = (currentLastWriteTime != previousLastWriteTime);
if (fileModified)
{
previousLastWriteTime = currentLastWriteTime;
System.Threading.Thread.Sleep(waitTimeMs);
continue;
}
else
{
fileReady = true;
break;
}
}
};
System.IO.File.Copy(original.FullName, copyPath.FullName, true);
// This guy here chills out until the filesystemwatcher
// tells him the file isn't being writen to anymore.
while (!fileReady)
{
System.Threading.Thread.Sleep(waitTimeMs);
}
}
Ok, so to explain; I am developing for a system that can suffer a power failure at any point in time, one point that I am testing is directly after I have written a file out using a StreamWriter. The code below:
// Write the updated file back out to the Shell directory.
using (StreamWriter shellConfigWriter =
new StreamWriter(#"D:\xxx\Shell\Config\Game.cfg.bak"))
{
for (int i = 0; i < configContents.Count; i++)
{
shellConfigWriter.WriteLine(configContents[i]);
}
shellConfigWriter.Close();
}
FileInfo gameCfgBackup = new FileInfo(#"D:\xxx\Shell\Config\Game.cfg.bak");
gameCfgBackup.CopyTo(#"D:\xxx\Shell\Config\Game.cfg", true);
Writes the contents of shellConfigWriter (a List of strings) out to a file used as a temporary store, then it is copied over the original. Now after this code has finished executing the power is lost, upon starting back up again the file Game.cfg exists and is the correct size, but is completely blank. At first I thought that this was due to Write-Caching being enabled on the hard drive, but even with it off it still occurs (albeit less often).
Any ideas would be very welcome!
Update: Ok, so after removing the .Close() statements and calling .Flush() after every write operation the files still end up blank. I could go one step further and create a backup of the original file first, before creating the new one, and then I have enough backups to do a integrity check, but I don't think it'll help to solve the underlying issue (that when I tell it to write to, flush and close a file... It doesn't!).
Keep the OS from buffering the output using the FileOptions parameter of the FileStream object's constructor:
using (Stream fs = new FileStream(#"D:\xxx\Shell\Config\Game.cfg.bak", FileMode.Create, FileAccess.Write, FileShare.None, 0x1000, FileOptions.WriteThrough))
using (StreamWriter shellConfigWriter = new StreamWriter(fs))
{
for (int i = 0; i < configContents.Count; i++)
{
shellConfigWriter.WriteLine(configContents[i]);
}
shellConfigWriter.Flush();
shellConfigWriter.BaseStream.Flush();
}
First of all, you don't have to call shellConfigWriter.Close() there. The using statement will take care of it. What you might want to do instead to guard against power failure is call shellConfigWriter.Flush().
Update
Something else you might want to consider is that if a power failure can really happen at any time, it could happen in the middle of a write, such that only some of the bytes make it to a file. There's really no way to stop that.
To protect against these scenarios, a common procedure is to use state/condition flag files. You use the existence or non-existence on the file system of a zero-byte file with a particular name to tell your program where to pick up again when it resumes. Then you don't create or destroy the files that trigger a particular state until you are sure you've reached that state and completed the previous.
The downside here is that it might mean throwing a lot of work away now and then. But the benefit is that it means the functional part of your code looks like normal: there's very little extra work to do to make the system sufficiently robust.
You want to set AutoFlush = true;