What is the best way to delete files with a condition - C#

I want to build a Windows service (no UI) in C# whose only job is to run over a list of directories and delete files that are larger than X KB.
I want the best performance; what is the best way to do this?
There is no truly asynchronous API for deleting a file, so if I want to use async/await
I can wrap the call like this:
public static class FileExtensions
{
    public static Task DeleteAsync(this FileInfo fi)
    {
        return Task.Factory.StartNew(() => fi.Delete());
    }
}
and call it like this:
FileInfo fi = new FileInfo(fileName);
await fi.DeleteAsync();
I am thinking of running something like:
foreach (file in ListOfDirectories)
{
    if (file.Length > 1000)
        await file.DeleteAsync();
}
But with this option the files will be deleted one by one (and every DeleteAsync will use a thread from the thread pool), so I gain nothing from the async; I might as well delete them one by one synchronously.
Maybe I should collect X files into a list and then delete them with AsParallel.
Please help me find the best way to do this.

You can use Directory.GetFiles("DirectoryPath").Where(x => new FileInfo(x).Length < 1000) to get a list of the files that are under 1 KB in size.
Then use Parallel.ForEach to iterate over that collection, like this:
var collectionOfFiles = Directory.GetFiles("DirectoryPath")
                                 .Where(x => new FileInfo(x).Length < 1000);

Parallel.ForEach(collectionOfFiles, File.Delete);
It could be argued that you should use:
Parallel.ForEach(collectionOfFiles, currentFile =>
{
    File.Delete(currentFile);
});
to improve the readability of the code.
MSDN has a simple example on how to use Parallel.ForEach()
If you are wondering about the FileInfo object, here is the documentation
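Note that the original question asks for files larger than a threshold, while the snippet above filters for smaller ones. As a rough sketch (an editorial addition, with a placeholder path and threshold, not part of the original answer), a variation that enumerates lazily with DirectoryInfo.EnumerateFiles, keeps files above the limit, and tolerates per-file I/O errors might look like this:

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class CleanupExample
{
    static void Main()
    {
        // Placeholder path and size threshold; adjust to your own configuration.
        var directory = new DirectoryInfo(@"C:\SomeDirectory");

        // EnumerateFiles streams results instead of building the whole array up front.
        var oversizedFiles = directory
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Where(f => f.Length > 1000);

        Parallel.ForEach(oversizedFiles, file =>
        {
            try
            {
                file.Delete();
            }
            catch (IOException ex)
            {
                // A file may be locked or already gone; log and continue.
                Console.WriteLine($"Could not delete {file.FullName}: {ex.Message}");
            }
        });
    }
}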

Maybe this can help you.
public static class FileExtensions
{
    public static Task<int> DeleteAsync(this IEnumerable<FileInfo> files)
    {
        var count = files.Count();

        // Note: Parallel.ForEach runs synchronously here; the returned task is already completed.
        Parallel.ForEach(files, f =>
        {
            f.Delete();
        });

        return Task.FromResult(count);
    }

    public static async Task<int> DeleteAsync(this DirectoryInfo directory, Func<FileInfo, bool> predicate)
    {
        return await directory.EnumerateFiles().Where(predicate).DeleteAsync();
    }

    public static async Task<int> DeleteAsync(this IEnumerable<FileInfo> files, Func<FileInfo, bool> predicate)
    {
        return await files.Where(predicate).DeleteAsync();
    }
}
var _byte = 1;
var _kb = _byte * 1000;
var _mb = _kb * 1000;
var _gb = _mb * 1000;

DirectoryInfo d = new DirectoryInfo(@"C:\testDirectory");

var deletedFileCount = await d.DeleteAsync(f => f.Length > _mb * 1);
Debug.WriteLine("{0} Files larger than 1 megabyte deleted", deletedFileCount);
// => 7 Files larger than 1 megabyte deleted

deletedFileCount = await d.GetFiles("*.*", SearchOption.AllDirectories)
                          .Where(f => f.Length > _kb * 10).DeleteAsync();
Debug.WriteLine("{0} Files larger than 10 kilobytes deleted", deletedFileCount);
// => 11 Files larger than 10 kilobytes deleted
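One caveat worth noting (an editorial aside, not part of the original answer): the first DeleteAsync overload does all of its work synchronously inside Parallel.ForEach and only returns an already-completed task. If the goal is to keep the calling thread free, a variant along these lines (a sketch, assumed to live in the same FileExtensions class) offloads the whole loop to the thread pool:

// Hypothetical variant: offload the parallel delete to the thread pool so the
// caller's thread is not blocked while the files are removed.
public static Task<int> DeleteInBackgroundAsync(this IEnumerable<FileInfo> files)
{
    // Materialize first so enumeration happens once, outside the background work.
    var snapshot = files.ToList();

    return Task.Run(() =>
    {
        Parallel.ForEach(snapshot, f => f.Delete());
        return snapshot.Count;
    });
}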

Related

How to process directory files in Task parallel library?

I have a scenario in which I have to process multiple files (e.g. 30) in parallel based on the number of processor cores. I have to assign these files to separate tasks based on the number of cores, but I don't know how to set the start and end limits for each task, i.e. how each task knows how many files it has to process.
private void ProcessFiles(object e)
{
    try
    {
        var diectoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;
        var FilePaths = Directory.EnumerateFiles(diectoryPath);
        int numCores = System.Environment.ProcessorCount;
        int NoOfTasks = FilePaths.Count() > numCores ? (FilePaths.Count() / numCores) : FilePaths.Count();
        for (int i = 0; i < NoOfTasks; i++)
        {
            Task.Factory.StartNew(
                () =>
                {
                    int startIndex = 0, endIndex = 0;
                    for (int Count = startIndex; Count < endIndex; Count++)
                    {
                        this.ProcessFile(FilePaths);
                    }
                });
        }
    }
    catch (Exception ex)
    {
        throw;
    }
}
For problems such as yours, there are concurrent data structures available in C#. You want to use BlockingCollection and store all the file names in it.
Your idea of splitting the files across tasks based on the number of cores available on the machine is not very good. Why? Because ProcessFile() may not take the same amount of time for each file. It is better to start as many tasks as you have cores, then let each task take file names one by one from the BlockingCollection and process them, until the BlockingCollection is empty.
try
{
    var directoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;
    var filePaths = CreateBlockingCollection(directoryPath);

    // Start the same number of tasks as cores (assuming that #files > #cores)
    int taskCount = System.Environment.ProcessorCount;
    for (int i = 0; i < taskCount; i++)
    {
        Task.Factory.StartNew(
            () =>
            {
                string fileName;
                while (!filePaths.IsCompleted)
                {
                    if (!filePaths.TryTake(out fileName)) continue;
                    this.ProcessFile(fileName);
                }
            });
    }
}
And the CreateBlockingCollection() would be as follows:
private BlockingCollection<string> CreateBlockingCollection(string path)
{
    // Materialize the enumeration so its count can be used as the bounded capacity.
    var allFiles = Directory.EnumerateFiles(path).ToList();
    var filePaths = new BlockingCollection<string>(allFiles.Count);
    foreach (var fileName in allFiles)
    {
        filePaths.Add(fileName);
    }
    filePaths.CompleteAdding();
    return filePaths;
}
You will have to modify your ProcessFile() to receive a file name now instead of taking all the file paths and processing its chunk.
The advantage of this approach is that your CPU won't be over- or under-subscribed, and the load will be evenly balanced too.
I haven't run the code myself, so there might be some syntax error in my code. Feel free to correct the error, if you come across any.
Based on my admittedly limited understanding of the TPL, I think your code could be rewritten as such:
private void ProcessFiles(object e)
{
    try
    {
        var diectoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;
        var FilePaths = Directory.EnumerateFiles(diectoryPath);
        Parallel.ForEach(FilePaths, path => this.ProcessFile(path));
    }
    catch (Exception ex)
    {
        throw;
    }
}
regards

How to pass different ranges to Parallel.For?

I need to process a single file in parallel by sending skip/take counts like 1-1000, 1001-2000, 2001-3000, etc.
Code for the parallel processing:
var line = File.ReadAllLines("D:\\OUTPUT.CSV").Length;
Parallel.For(1, line, new ParallelOptions { MaxDegreeOfParallelism = 10 }, x =>
{
    DoSomething(skip, take);
});
Function
public static void DoSomething(int skip, int take)
{
    //code here
}
How can I send the skip and take counts to the parallel processing as per my requirement?
You can do this rather easily with PLINQ. If you want batches of 1000, you can do:
const int BatchSize = 1000;

var pageAmount = (int)Math.Ceiling((float)lines / BatchSize);

// ForAll runs DoSomething for each page; Select would not work here because DoSomething returns void.
Enumerable.Range(0, pageAmount)
    .AsParallel()
    .ForAll(page => DoSomething(page));

public void DoSomething(int page)
{
    var currentLines = source.Skip(page * BatchSize).Take(BatchSize);
    // do something with the selected lines
}
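For context, here is one way the pieces above might be wired together (an editorial sketch with assumed names: source is taken to be the array returned by File.ReadAllLines, and the CSV path is a placeholder):

using System;
using System.IO;
using System.Linq;

class BatchProcessor
{
    private const int BatchSize = 1000;
    private readonly string[] source;   // all lines of the file, read once

    public BatchProcessor(string path)
    {
        source = File.ReadAllLines(path);
    }

    public void Run()
    {
        var pageAmount = (int)Math.Ceiling((float)source.Length / BatchSize);

        // Each "page" handles one contiguous slice of the file in parallel.
        Enumerable.Range(0, pageAmount)
            .AsParallel()
            .WithDegreeOfParallelism(10)
            .ForAll(DoSomething);
    }

    private void DoSomething(int page)
    {
        var currentLines = source.Skip(page * BatchSize).Take(BatchSize);
        foreach (var line in currentLines)
        {
            // process each line in this batch
        }
    }
}

// Usage (path is a placeholder):
// new BatchProcessor(@"D:\OUTPUT.CSV").Run();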

C# WPF Speedup (Thread) Total FileInfo.Length from Multiple Files

I'm trying to speed up the calculation of the total size of all files in all folders, recursively, under a given path.
Let's say I choose "E:\" as the folder.
I get the entire recursive file list via "SafeFileEnumerator" into an IEnumerable in milliseconds (works like a charm).
Now I would like to sum up the bytes of all files in this enumerable.
Right now I loop over them with foreach and read new FileInfo(oFileInfo.FullName).Length for each file.
This works, but it is slow - it takes about 30 seconds. If I look up the space consumption via right-click > Properties on the same folders in Windows Explorer, I get the result in about 6 seconds (~1600 files, 26 gigabytes of data, on an SSD).
So my first thought was to speed up the gathering by using threads, but I don't get any speedup here.
The code without threads is below:
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
    long FolderSize = 0;
    IEnumerable<FileSystemInfo> aFiles = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories);
    foreach (FileSystemInfo oFileInfo in aFiles)
    {
        // check if we will cancel now
        if (oCancelToken.Token.IsCancellationRequested)
        {
            throw new OperationCanceledException();
        }
        try
        {
            FolderSize += new FileInfo(oFileInfo.FullName).Length;
        }
        catch (Exception oException)
        {
            Debug.WriteLine(oException.Message);
        }
    }
    return FolderSize;
}
The multithreaded code is below:
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
    long FolderSize = 0;
    int iCountTasks = 0;
    IEnumerable<FileSystemInfo> aFiles = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories);
    foreach (FileSystemInfo oFileInfo in aFiles)
    {
        // check if we will cancel now
        if (oCancelToken.Token.IsCancellationRequested)
        {
            throw new OperationCanceledException();
        }
        if (iCountTasks < 10)
        {
            iCountTasks++;
            Thread oThread = new Thread(delegate()
            {
                try
                {
                    FolderSize += new FileInfo(oFileInfo.FullName).Length;
                }
                catch (Exception oException)
                {
                    Debug.WriteLine(oException.Message);
                }
                iCountTasks--;
            });
            oThread.Start();
            continue;
        }
        try
        {
            FolderSize += new FileInfo(oFileInfo.FullName).Length;
        }
        catch (Exception oException)
        {
            Debug.WriteLine(oException.Message);
        }
    }
    return FolderSize;
}
Could someone please give me advice on how I could speed up the folder-size calculation?
Kind regards
Edit 1 (Parallel.ForEach suggestion - see comments)
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
    long FolderSize = 0;
    ParallelOptions oParallelOptions = new ParallelOptions();
    oParallelOptions.CancellationToken = oCancelToken.Token;
    oParallelOptions.MaxDegreeOfParallelism = System.Environment.ProcessorCount;
    IEnumerable<FileSystemInfo> aFiles = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories).ToArray();
    Parallel.ForEach(aFiles, oParallelOptions, oFileInfo =>
    {
        try
        {
            FolderSize += new FileInfo(oFileInfo.FullName).Length;
        }
        catch (Exception oException)
        {
            Debug.WriteLine(oException.Message);
        }
    });
    return FolderSize;
}
Side-note about SafeFileEnumerator performance:
Once you get an IEnumerable, it doesn't mean you have the entire collection, because it is a lazy proxy. Try the snippet below - I'm sure you'll see the performance difference (sorry if it doesn't compile - it's just to illustrate the idea):
var tmp = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories).ToArray(); // fetch all records explicitly to populate the array
IEnumerable<FileSystemInfo> aFiles = tmp;
Now, about the actual result you want to achieve:
If you need just the file sizes, it's better to ask the OS filesystem APIs rather than querying the files one by one. I'd start with the DirectoryInfo class (see for instance http://www.tutorialspoint.com/csharp/csharp_windows_file_system.htm).
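As a rough illustration of that suggestion (an editorial sketch, not from the original answer; the path is a placeholder), enumerating FileInfo objects directly avoids building a second FileInfo per file just to read its Length:

using System;
using System.IO;
using System.Linq;

class FolderSizeExample
{
    static void Main()
    {
        var root = new DirectoryInfo(@"E:\");

        // EnumerateFiles already yields FileInfo objects, so Length is available
        // without creating another FileInfo from the full path.
        // Note: unlike SafeFileEnumerator, this throws if an inaccessible directory is hit.
        long totalBytes = root
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Sum(f => f.Length);

        Console.WriteLine($"{totalBytes} bytes");
    }
}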
If you need to calculate a checksum for each file, it will definitely be a slow task, because you have to load each of the files first (a lot of memory transfers). Threads are not a booster here, because they'll be limited by OS filesystem throughput, not your CPU power.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.IO;

namespace ConsoleApplication3
{
    class Program
    {
        static void Main(string[] args)
        {
            long size = fetchFolderSize(@"C:\Test", new CancellationTokenSource());
        }

        public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
        {
            ParallelOptions po = new ParallelOptions();
            po.CancellationToken = oCancelToken.Token;
            po.MaxDegreeOfParallelism = System.Environment.ProcessorCount;

            long folderSize = 0;

            string[] files = Directory.GetFiles(Folder);
            Parallel.ForEach<string, long>(files,
                po,
                () => 0,
                (fileName, loop, fileSize) =>
                {
                    // Accumulate into the thread-local running total.
                    fileSize += new FileInfo(fileName).Length;
                    po.CancellationToken.ThrowIfCancellationRequested();
                    return fileSize;
                },
                (finalResult) => Interlocked.Add(ref folderSize, finalResult)
            );

            string[] subdirEntries = Directory.GetDirectories(Folder);
            Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
            {
                if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) !=
                    FileAttributes.ReparsePoint)
                {
                    subtotal += fetchFolderSize(subdirEntries[i], oCancelToken);
                }
                // Return the running subtotal either way so earlier results are not discarded.
                return subtotal;
            },
            (finalResult) => Interlocked.Add(ref folderSize, finalResult)
            );

            return folderSize;
        }
    }
}

Limited number of concurrent threads C# [duplicate]

Let's say I have 100 tasks that each do something that takes 10 seconds.
Now I want to run only 10 at a time; when one of those 10 finishes, another task gets executed, until all are finished.
I have always used ThreadPool.QueueUserWorkItem() for such tasks, but I've read that this is bad practice and that I should use Tasks instead.
My problem is that I could not find a good example for my scenario anywhere, so could you get me started on how to achieve this with Tasks?
SemaphoreSlim maxThread = new SemaphoreSlim(10);

for (int i = 0; i < 115; i++)
{
    maxThread.Wait();
    Task.Factory.StartNew(() =>
        {
            //Your Works
        }, TaskCreationOptions.LongRunning)
        .ContinueWith((task) => maxThread.Release());
}
TPL Dataflow is great for doing things like this. You can create a 100% async version of Parallel.Invoke pretty easily:
async Task ProcessTenAtOnce<T>(IEnumerable<T> items, Func<T, Task> func)
{
    ExecutionDataflowBlockOptions edfbo = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 10
    };

    ActionBlock<T> ab = new ActionBlock<T>(func, edfbo);

    foreach (T item in items)
    {
        await ab.SendAsync(item);
    }

    ab.Complete();
    await ab.Completion;
}
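A brief usage sketch (an editorial addition; the numbers and the Task.Delay are placeholders for real async work):

// Hypothetical usage: throttle 100 simulated operations to ten at a time.
var items = Enumerable.Range(1, 100);

await ProcessTenAtOnce(items, async i =>
{
    Console.WriteLine($"Starting item {i}");
    await Task.Delay(500);   // stand-in for real async I/O
    Console.WriteLine($"Finished item {i}");
});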
You have several options. You can use Parallel.Invoke for starters:
public void DoWork(IEnumerable<Action> actions)
{
    Parallel.Invoke(new ParallelOptions() { MaxDegreeOfParallelism = 10 },
        actions.ToArray());
}
Here is an alternate option that will work much harder to have exactly 10 tasks running (although the number of threads in the thread pool processing those tasks may be different) and that returns a Task indicating when it finishes, rather than blocking until done.
public Task DoWork(IList<Action> actions)
{
    List<Task> tasks = new List<Task>();
    int numWorkers = 10;
    int batchSize = (int)Math.Ceiling(actions.Count / (double)numWorkers);

    foreach (var batch in actions.Batch(batchSize))
    {
        tasks.Add(Task.Factory.StartNew(() =>
        {
            foreach (var action in batch)
            {
                action();
            }
        }));
    }

    return Task.WhenAll(tasks);
}
If you don't have MoreLinq, for the Batch function, here's my simpler implementation:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int batchSize)
{
    List<T> buffer = new List<T>(batchSize);

    foreach (T item in source)
    {
        buffer.Add(item);
        if (buffer.Count >= batchSize)
        {
            yield return buffer;
            buffer = new List<T>();
        }
    }

    // Only yield the final buffer if it actually contains leftover items.
    if (buffer.Count > 0)
    {
        yield return buffer;
    }
}
You can create a method like this:
public static async Task RunLimitedNumberAtATime<T>(int numberOfTasksConcurrent,
    IEnumerable<T> inputList, Func<T, Task> asyncFunc)
{
    Queue<T> inputQueue = new Queue<T>(inputList);
    List<Task> runningTasks = new List<Task>(numberOfTasksConcurrent);

    for (int i = 0; i < numberOfTasksConcurrent && inputQueue.Count > 0; i++)
        runningTasks.Add(asyncFunc(inputQueue.Dequeue()));

    while (inputQueue.Count > 0)
    {
        Task task = await Task.WhenAny(runningTasks);
        runningTasks.Remove(task);
        runningTasks.Add(asyncFunc(inputQueue.Dequeue()));
    }

    await Task.WhenAll(runningTasks);
}
And then you can call any async method n times with a limit like this:
Task task = RunLimitedNumberAtATime(10,
    Enumerable.Range(1, 100),
    async x =>
    {
        Console.WriteLine($"Starting task {x}");
        await Task.Delay(100);
        Console.WriteLine($"Finishing task {x}");
    });
Or if you want to run long-running non-async methods, you can do it this way:
Task task = RunLimitedNumberAtATime(10,
    Enumerable.Range(1, 100),
    x => Task.Factory.StartNew(() =>
    {
        Console.WriteLine($"Starting task {x}");
        System.Threading.Thread.Sleep(100);
        Console.WriteLine($"Finishing task {x}");
    }, TaskCreationOptions.LongRunning));
Maybe there is a similar method somewhere in the framework, but I didn't find it yet.
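For what it's worth (an editorial aside, not part of the original answer): on newer runtimes, .NET 6 and later, the framework does ship something close to this in Parallel.ForEachAsync, which honors MaxDegreeOfParallelism for async work:

// Sketch for .NET 6+ only; throttles the async work to ten concurrent operations.
await Parallel.ForEachAsync(
    Enumerable.Range(1, 100),
    new ParallelOptions { MaxDegreeOfParallelism = 10 },
    async (x, cancellationToken) =>
    {
        Console.WriteLine($"Starting task {x}");
        await Task.Delay(100, cancellationToken);
        Console.WriteLine($"Finishing task {x}");
    });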
I would use the simplest solution I can think of, which I think is using the TPL:
string[] urls = { };

Parallel.ForEach(urls, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, url =>
{
    //Download the content or do whatever you want with each URL
});

Selecting entries according to running total

I would like to select from a list of files only so many files that their total size does not exceed a threshold (i.e. the amount of free space on the target drive).
I understand that I could do this by adding up file sizes in a loop until I hit the threshold and then use that number to select files from the list. However, is it possible to do that with a LINQ-query instead?
This could work (files is a List<FileInfo>):
var availableSpace = DriveInfo.GetDrives()
    .First(d => d.Name == @"C:\").AvailableFreeSpace;

long usedSpace = 0;
var availableFiles = files
    .TakeWhile(f => (usedSpace += f.Length) < availableSpace);

foreach (FileInfo file in availableFiles)
{
    Console.WriteLine(file.Name);
}
You can achieve that by using a closure:
var directory = new DirectoryInfo(@"c:\temp");
var files = directory.GetFiles();
long maxTotalSize = 2000000;
long aggregatedSize = 0;

var result = files.TakeWhile(fileInfo =>
{
    aggregatedSize += fileInfo.Length;
    return aggregatedSize <= maxTotalSize;
});
There's a caveat though, because the variable aggregatedSize may get modified after you have left the scope where it has been defined.
You could wrap that in an extension method though - that would eliminate the closure:
public static IEnumerable<FileInfo> GetWithMaxAggregatedSize(this IEnumerable<FileInfo> files, long maxTotalSize)
{
    long aggregatedSize = 0;
    return files.TakeWhile(fileInfo =>
    {
        aggregatedSize += fileInfo.Length;
        return aggregatedSize <= maxTotalSize;
    });
}
You finally use the method like this:
var directory = new DirectoryInfo(@"c:\temp");
var files = directory.GetFiles().GetWithMaxAggregatedSize(2000000);
EDIT: I replaced the Where-method with the TakeWhile-method. The TakeWhile-extension will stop once the threshold has been reached, while the Where-extension will continue. Credits for bringing up the TakeWhile-extension go to Tim Schmelter.
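To illustrate that difference with plain numbers (an editorial sketch, not part of the original answers): with a running total, Where keeps evaluating the predicate for every remaining element, while TakeWhile stops at the first element that fails. With FileInfo, each predicate evaluation touches the file system, so the extra evaluations are wasted work:

// Sizes 500, 900, 800, 100, 700 against a 1500-byte budget.
var sizes = new long[] { 500, 900, 800, 100, 700 };
long budget = 1500;

long sumA = 0;
int evaluatedA = 0;
var withWhere = sizes.Where(s => { evaluatedA++; return (sumA += s) <= budget; }).ToList();

long sumB = 0;
int evaluatedB = 0;
var withTakeWhile = sizes.TakeWhile(s => { evaluatedB++; return (sumB += s) <= budget; }).ToList();

Console.WriteLine($"Where:     {withWhere.Count} items selected, {evaluatedA} predicates evaluated");    // 2 items, 5 evaluations
Console.WriteLine($"TakeWhile: {withTakeWhile.Count} items selected, {evaluatedB} predicates evaluated"); // 2 items, 3 evaluations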
