I'm trying to speed up the calculation of the total size of all files in all folders, recursively, for a given path.
Let's say I choose "E:\" as the folder.
I can fetch the entire recursive file list via "SafeFileEnumerator" into an IEnumerable in milliseconds (works like a charm).
Now I would like to gather the sum of all bytes of all files in this enumerable.
Right now I loop over them via foreach and read new FileInfo(oFileInfo.FullName).Length for each file.
This works, but it is slow - it takes about 30 seconds. If I look up the space consumption via right-click > Properties of the selected folders in Windows Explorer, I get it in about 6 seconds (~1600 files in 26 gigabytes of data on an SSD).
So my first thought was to speed up the gathering by using threads, but I don't get any speedup there.
The code without threads is below:
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
long FolderSize = 0;
IEnumerable<FileSystemInfo> aFiles = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories);
foreach (FileSystemInfo oFileInfo in aFiles)
{
// check if we will cancel now
if (oCancelToken.Token.IsCancellationRequested)
{
throw new OperationCanceledException();
}
try
{
FolderSize += new FileInfo(oFileInfo.FullName).Length;
}
catch (Exception oException)
{
Debug.WriteLine(oException.Message);
}
}
return FolderSize;
}
The multithreading code is below:
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
long FolderSize = 0;
int iCountTasks = 0;
IEnumerable<FileSystemInfo> aFiles = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories);
foreach (FileSystemInfo oFileInfo in aFiles)
{
// check if we will cancel now
if (oCancelToken.Token.IsCancellationRequested)
{
throw new OperationCanceledException();
}
if (iCountTasks < 10)
{
iCountTasks++;
Thread oThread = new Thread(delegate()
{
try
{
FolderSize += new FileInfo(oFileInfo.FullName).Length;
}
catch (Exception oException)
{
Debug.WriteLine(oException.Message);
}
iCountTasks--;
});
oThread.Start();
continue;
}
try
{
FolderSize += new FileInfo(oFileInfo.FullName).Length;
}
catch (Exception oException)
{
Debug.WriteLine(oException.Message);
}
}
return FolderSize;
}
Could someone please give me advice on how to speed up the folder size calculation?
Kind regards
Edit 1 (Parallel.ForEach suggestion - see comments):
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
long FolderSize = 0;
ParallelOptions oParallelOptions = new ParallelOptions();
oParallelOptions.CancellationToken = oCancelToken.Token;
oParallelOptions.MaxDegreeOfParallelism = System.Environment.ProcessorCount;
IEnumerable<FileSystemInfo> aFiles = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories).ToArray();
Parallel.ForEach(aFiles, oParallelOptions, oFileInfo =>
{
try
{
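// note: this unsynchronized "+=" on the shared FolderSize can lose updates when iterations run in parallel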
FolderSize += new FileInfo(oFileInfo.FullName).Length;
}
catch (Exception oException)
{
Debug.WriteLine(oException.Message);
}
});
return FolderSize;
}
Side-note about SafeFileEnumerator performance:
Once you get an IEnumerable, it doesn't mean you have the entire collection, because it is a lazy proxy. Try the snippet below - I'm sure you'll see the performance difference (sorry if it doesn't compile - it's just to illustrate the idea):
var tmp = new SafeFileEnumerator(Folder, "*", SearchOption.AllDirectories).ToArray(); // fetch all records explicitly to populate the array
IEnumerable<FileSystemInfo> aFiles = tmp;
Now, about the actual result you want to achieve:
If you need just file sizes, it's better to ask the OS about the filesystem instead of querying files one by one. I'd start with the DirectoryInfo class (see for instance http://www.tutorialspoint.com/csharp/csharp_windows_file_system.htm).
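For illustration, here is a minimal sketch of that idea (the "E:\" path is just the example from the question; note that unlike SafeFileEnumerator, this plain enumeration will stop at the first access-denied folder):

using System;
using System.IO;
using System.Linq;

public static class FolderSizeSketch
{
    public static long GetFolderSize(string folder)
    {
        // EnumerateFiles hands back FileInfo objects, so Length is read as part of
        // the enumeration instead of constructing a new FileInfo per path.
        return new DirectoryInfo(folder)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Sum(f => f.Length);
    }
}

// usage: long total = FolderSizeSketch.GetFolderSize(@"E:\");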
If you need to calculate a checksum for each file, it will definitely be slow, because you have to load each file first (a lot of memory transfers). Threads won't be a booster here, because they'll be limited by OS filesystem throughput, not your CPU power.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.IO;
namespace ConsoleApplication3
{
class Program
{
static void Main(string[] args)
{
long size = fetchFolderSize(@"C:\Test", new CancellationTokenSource());
}
public static long fetchFolderSize(string Folder, CancellationTokenSource oCancelToken)
{
ParallelOptions po = new ParallelOptions();
po.CancellationToken = oCancelToken.Token;
po.MaxDegreeOfParallelism = System.Environment.ProcessorCount;
long folderSize = 0;
string[] files = Directory.GetFiles(Folder);
Parallel.ForEach<string,long>(files,
po,
() => 0,
(fileName, loop, fileSize) =>
{
fileSize += new FileInfo(fileName).Length; // accumulate into this partition's thread-local subtotal
po.CancellationToken.ThrowIfCancellationRequested();
return fileSize;
},
(finalResult) => Interlocked.Add(ref folderSize, finalResult)
);
string[] subdirEntries = Directory.GetDirectories(Folder);
Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
{
if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) !=
FileAttributes.ReparsePoint)
{
subtotal += fetchFolderSize(subdirEntries[i], oCancelToken);
return subtotal;
}
return 0;
},
(finalResult) => Interlocked.Add(ref folderSize, finalResult)
);
return folderSize ;
}
}
}
Related
I want to build a Windows service (no UI) in C# that does just one thing: go over a list of directories and delete files that are larger than X KB.
I want the best performance.
What is the best way to do this?
There is no pure async function for deleting a file, so if I want to use async/await,
I can wrap the function like this:
public static class FileExtensions {
public static Task DeleteAsync(this FileInfo fi) {
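// this simply offloads the synchronous Delete() to a thread-pool thread; it is not true asynchronous I/O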
return Task.Factory.StartNew(() => fi.Delete() );
}
}
and call it like this:
FileInfo fi = new FileInfo(fileName);
await fi.DeleteAsync();
I'm thinking of running something like:
foreach file on ListOfDirectories
{
if(file.Length>1000)
await file.DeleteAsync
}
But with this option the files will be deleted one by one (and every DeleteAsync will use a thread from the thread pool),
so I gain nothing from async; I could just do it one by one.
Maybe I should collect X files in a list and then delete them with AsParallel.
Please help me find the best way.
You can use Directory.GetFiles("DirectoryPath").Where(x => new FileInfo(x).Length > 1000) to get the files that are larger than 1 KB.
Then use Parallel.ForEach to iterate over that collection like this:
var collectionOfFiles = Directory.GetFiles("DirectoryPath")
.Where(x => new FileInfo(x).Length > 1000);
Parallel.ForEach(collectionOfFiles, File.Delete);
It could be argued that you should use:
Parallel.ForEach(collectionOfFiles, currentFile =>
{
File.Delete(currentFile);
});
to improve the readability of the code.
MSDN has a simple example on how to use Parallel.ForEach()
If you are wondering about the FileInfo object, here is the documentation
This may help you:
public static class FileExtensions
{
public static Task<int> DeleteAsync(this IEnumerable<FileInfo> files)
{
var count = files.Count();
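// note: Parallel.ForEach below blocks the calling thread, so the Task returned afterwards is already completed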
Parallel.ForEach(files, (f) =>
{
f.Delete();
});
return Task.FromResult(count);
}
public static async Task<int> DeleteAsync(this DirectoryInfo directory, Func<FileInfo, bool> predicate)
{
return await directory.EnumerateFiles().Where(predicate).DeleteAsync();
}
public static async Task<int> DeleteAsync(this IEnumerable<FileInfo> files, Func<FileInfo, bool> predicate)
{
return await files.Where(predicate).DeleteAsync();
}
}
var _byte = 1;
var _kb = _byte * 1000;
var _mb = _kb * 1000;
var _gb = _mb * 1000;
DirectoryInfo d = new DirectoryInfo(@"C:\testDirectory");
var deletedFileCount = await d.DeleteAsync(f => f.Length > _mb * 1);
Debug.WriteLine("{0} Files larger than 1 megabyte deleted", deletedFileCount);
// => 7 Files larger than 1 megabyte deleted
deletedFileCount = await d.GetFiles("*.*",SearchOption.AllDirectories)
.Where(f => f.Length > _kb * 10).DeleteAsync();
Debug.WriteLine("{0} Files larger than 10 kilobytes deleted", deletedFileCount);
// => 11 Files larger than 10 kilobytes deleted
I'm processing a list of items (200k - 300k); each item takes between 2 and 8 seconds to process. To save time, I can process this list in parallel. As I'm in an async context, I use something like this:
public async Task<List<Keyword>> DoWord(List<string> keyword)
{
ConcurrentBag<Keyword> keywordResults = new ConcurrentBag<Keyword>();
if (keyword.Count > 0)
{
try
{
var tasks = keyword.Select(async kw =>
{
return await Work(kw).ConfigureAwait(false);
});
keywordResults = new ConcurrentBag<Keyword>(await Task.WhenAll(tasks).ConfigureAwait(false));
}
catch (AggregateException ae)
{
foreach (Exception innerEx in ae.InnerExceptions)
{
log.ErrorFormat("Core threads exception: {0}", innerEx);
}
}
}
return keywordResults.ToList();
}
The keyword list always contains 8 elements (coming from upstream), so I process my list 8 by 8. But in this case, I guess that if 7 keywords are processed in 3 seconds and the 8th takes 10 seconds, the total time for the batch of 8 will be 10 seconds (correct me if I'm wrong).
How can I get the Parallel.ForEach behavior then? I mean: launch 8 keywords, and as soon as 1 of them is done, launch 1 more, so that I always have 8 workers busy. Any idea?
Another, easier way to do this is to use the AsyncEnumerator NuGet package:
using System.Collections.Async;
public async Task<List<Keyword>> DoWord(List<string> keywords)
{
var keywordResults = new ConcurrentBag<Keyword>();
await keywords.ParallelForEachAsync(async keyword =>
{
try
{
var result = await Work(keyword);
keywordResults.Add(result);
}
catch (AggregateException ae)
{
foreach (Exception innerEx in ae.InnerExceptions)
{
log.ErrorFormat("Core threads exception: {0}", innerEx);
}
}
}, maxDegreeOfParallelism: 8);
return keywordResults.ToList();
}
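If pulling in a package is not an option, roughly the same throttling can be sketched with SemaphoreSlim. This is only a sketch reusing the Work() and Keyword names from the question, with the per-keyword error handling left out:

// requires System.Linq, System.Threading, System.Threading.Tasks
public async Task<List<Keyword>> DoWord(List<string> keywords)
{
    // allow at most 8 Work() calls in flight at any time
    var throttler = new SemaphoreSlim(8);
    var tasks = keywords.Select(async kw =>
    {
        await throttler.WaitAsync().ConfigureAwait(false);
        try
        {
            return await Work(kw).ConfigureAwait(false);
        }
        finally
        {
            // as soon as one keyword finishes, the next waiting one starts
            throttler.Release();
        }
    }).ToList();
    return (await Task.WhenAll(tasks).ConfigureAwait(false)).ToList();
}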
Here's some sample code showing how you could approach this using TPL Dataflow.
Note that in order to compile this, you will need to add TPL Dataflow to your project via NuGet.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
namespace Demo
{
class Keyword // Dummy test class.
{
public string Name;
}
class Program
{
static void Main()
{
// Dummy test data.
var keywords = Enumerable.Range(1, 100).Select(n => n.ToString()).ToList();
var result = DoWork(keywords).Result;
Console.WriteLine("---------------------------------");
foreach (var item in result)
Console.WriteLine(item.Name);
}
public static async Task<List<Keyword>> DoWork(List<string> keywords)
{
var input = new TransformBlock<string, Keyword>
(
async s => await Work(s),
// This is where you specify the max number of threads to use.
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 }
);
var result = new List<Keyword>();
var output = new ActionBlock<Keyword>
(
item => result.Add(item), // Output only 1 item at a time, because 'result.Add()' is not threadsafe.
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 }
);
input.LinkTo(output, new DataflowLinkOptions { PropagateCompletion = true });
foreach (string s in keywords)
await input.SendAsync(s);
input.Complete();
await output.Completion;
return result;
}
public static async Task<Keyword> Work(string s) // Stubbed test method.
{
Console.WriteLine("Processing " + s);
int delay;
lock (rng) { delay = rng.Next(10, 1000); }
await Task.Delay(delay); // Simulate load.
Console.WriteLine("Completed " + s);
return await Task.Run( () => new Keyword { Name = s });
}
static Random rng = new Random();
}
}
I have a scenario in which I have to process multiple files (e.g. 30) in parallel, based on the number of processor cores. I have to assign these files to separate tasks according to the number of cores, but I don't know how to work out the start and end limits for each task, i.e. how each task knows which files it has to process.
private void ProcessFiles(object e)
{
try
{
var diectoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;
var FilePaths = Directory.EnumerateFiles(diectoryPath);
int numCores = System.Environment.ProcessorCount;
int NoOfTasks = FilePaths.Count() > numCores ? (FilePaths.Count()/ numCores) : FilePaths.Count();
for (int i = 0; i < NoOfTasks; i++)
{
Task.Factory.StartNew(
() =>
{
int startIndex = 0, endIndex = 0;
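// startIndex and endIndex are never assigned beyond 0, so this inner loop currently does not execute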
for (int Count = startIndex; Count < endIndex; Count++)
{
this.ProcessFile(FilePaths);
}
});
}
}
catch (Exception ex)
{
throw;
}
}
For problems such as yours, there are concurrent data structures available in C#. You want to use a BlockingCollection and store all the file names in it.
Your idea of calculating the number of tasks from the number of cores on the machine is not a good one. Why? Because ProcessFile() may not take the same time for each file. So it is better to start exactly as many tasks as there are cores, and then let each task take file names one by one from the BlockingCollection and process them until the collection is empty.
try
{
var directoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;
var filePaths = CreateBlockingCollection(directoryPath);
//Start the same #tasks as the #cores (Assuming that #files > #cores)
int taskCount = System.Environment.ProcessorCount;
for (int i = 0; i < taskCount; i++)
{
Task.Factory.StartNew(
() =>
{
string fileName;
while (!filePaths.IsCompleted)
{
if (!filePaths.TryTake(out fileName)) continue;
this.ProcessFile(fileName);
}
});
}
}
catch (Exception)
{
// mirror the original method's behaviour and let the exception bubble up
throw;
}
And the CreateBlockingCollection() would be as follows:
private BlockingCollection<string> CreateBlockingCollection(string path)
{
var allFiles = Directory.EnumerateFiles(path);
var filePaths = new BlockingCollection<string>(allFiles.Count());
foreach(var fileName in allFiles)
{
filePaths.Add(fileName);
}
filePaths.CompleteAdding();
return filePaths;
}
You will have to modify your ProcessFile() to receive a file name now, instead of taking all the file paths and processing a chunk of them.
The advantage of this approach is that your CPU won't be over- or under-subscribed, and the load will be evenly balanced.
I haven't run the code myself, so there might be some syntax errors in it. Feel free to correct them if you come across any.
Based on my admittedly limited understanding of the TPL, I think your code could be rewritten as such:
private void ProcessFiles(object e)
{
try
{
var diectoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;
var FilePaths = Directory.EnumerateFiles(diectoryPath);
Parallel.ForEach(FilePaths, path => this.ProcessFile(path));
}
catch (Exception ex)
{
throw;
}
}
Regards
Suppose there are an arbitrary number of threads in my C# program. Each thread needs to look up the changeset IDs for a particular path by querying its history. The method looks like this:
public List<int> GetIdsFromHistory(string path, VersionControlServer tfsClient)
{
IEnumerable submissions = tfsClient.QueryHistory(
path,
VersionSpec.Latest,
0,
RecursionType.None, // Assume that the path is to a file, not a directory
null,
null,
null,
Int32.MaxValue,
false,
false);
List<int> ids = new List<int>();
foreach(Changeset cs in submissions)
{
ids.Add(cs.ChangesetId);
}
return ids;
}
My question is: does each thread need its own VersionControlServer instance, or will one suffice? My intuition tells me that each thread needs its own instance, since the TFS SDK uses web services and I should probably have more than one connection open if I really want parallel behavior. If I use only one connection, my intuition tells me that I'll get serial behavior even though I have multiple threads.
If I need as many instances as there are threads, I'm thinking of using an object-pool pattern, but will the connections time out and close if they aren't used for a long period? The docs seem sparse in this regard.
It would appear that threads using the SAME client is the fastest option.
Here's the output from a test program that runs 4 tests 5 times each and returns the average result in milliseconds. Clearly using the same client across multiple threads is the fastest execution:
Parallel Pre-Alloc: Execution Time Average (ms): 1921.26044
Parallel AllocOnDemand: Execution Time Average (ms): 1391.665
Parallel-SameClient: Execution Time Average (ms): 400.5484
Serial: Execution Time Average (ms): 1472.76138
For reference, here's the test program itself (also on GitHub):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.TeamFoundation;
using Microsoft.TeamFoundation.Client;
using Microsoft.TeamFoundation.VersionControl.Client;
using System.Collections;
using System.Threading.Tasks;
using System.Diagnostics;
namespace QueryHistoryPerformanceTesting
{
class Program
{
static string TFS_COLLECTION = /* TFS COLLECTION URL */
static VersionControlServer GetTfsClient()
{
var projectCollectionUri = new Uri(TFS_COLLECTION);
var projectCollection = TfsTeamProjectCollectionFactory.GetTeamProjectCollection(projectCollectionUri, new UICredentialsProvider());
projectCollection.EnsureAuthenticated();
return projectCollection.GetService<VersionControlServer>();
}
struct ThrArg
{
public VersionControlServer tfc { get; set; }
public string path { get; set; }
}
static List<string> PATHS = new List<string> {
// ASSUME 21 FILE PATHS
};
static int NUM_RUNS = 5;
static void Main(string[] args)
{
var results = new List<TimeSpan>();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestParallelPreAlloc());
}
Console.WriteLine("Parallel Pre-Alloc: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
results.Clear();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestParallelAllocOnDemand());
}
Console.WriteLine("Parallel AllocOnDemand: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
results.Clear();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestParallelSameClient());
}
Console.WriteLine("Parallel-SameClient: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
results.Clear();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestSerial());
}
Console.WriteLine("Serial: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
}
static TimeSpan RunTestParallelPreAlloc()
{
var paths = new List<ThrArg>();
paths.AddRange( PATHS.Select( x => new ThrArg { path = x, tfc = GetTfsClient() } ) );
return RunTestParallel(paths);
}
static TimeSpan RunTestParallelAllocOnDemand()
{
var paths = new List<ThrArg>();
paths.AddRange(PATHS.Select(x => new ThrArg { path = x, tfc = null }));
return RunTestParallel(paths);
}
static TimeSpan RunTestParallelSameClient()
{
var paths = new List<ThrArg>();
var _tfc = GetTfsClient();
paths.AddRange(PATHS.Select(x => new ThrArg { path = x, tfc = _tfc }));
return RunTestParallel(paths);
}
static TimeSpan RunTestParallel(List<ThrArg> args)
{
var allIds = new List<int>();
var stopWatch = new Stopwatch();
stopWatch.Start();
Parallel.ForEach(args, s =>
{
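// note: calling AddRange on a shared List<int> from parallel iterations is not thread-safe; a ConcurrentBag would be safer here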
allIds.AddRange(GetIdsFromHistory(s.path, s.tfc));
}
);
stopWatch.Stop();
return stopWatch.Elapsed;
}
static TimeSpan RunTestSerial()
{
var allIds = new List<int>();
VersionControlServer tfsc = GetTfsClient();
var stopWatch = new Stopwatch();
stopWatch.Start();
foreach (string s in PATHS)
{
allIds.AddRange(GetIdsFromHistory(s, tfsc));
}
stopWatch.Stop();
return stopWatch.Elapsed;
}
static List<int> GetIdsFromHistory(string path, VersionControlServer tfsClient)
{
if (tfsClient == null)
{
tfsClient = GetTfsClient();
}
IEnumerable submissions = tfsClient.QueryHistory(
path,
VersionSpec.Latest,
0,
RecursionType.None, // Assume that the path is to a file, not a directory
null,
null,
null,
Int32.MaxValue,
false,
false);
List<int> ids = new List<int>();
foreach(Changeset cs in submissions)
{
ids.Add(cs.ChangesetId);
}
return ids;
}
}
}
I have a fairly large XML file (around 1-2 GB).
The requirement is to persist the XML data into a database.
Currently this is achieved in 3 steps:
Read the large file with as small a memory footprint as possible.
Create entities from the XML data.
Store the data from the created entities into the database using SqlBulkCopy.
To achieve better performance, I want to create a producer-consumer model where the producer creates a set of entities, say a batch of 10K, and adds it to a queue, and the consumer takes a batch of entities from the queue and persists it to the database using SqlBulkCopy.
Thanks,
Gokul
void Main()
{
int iCount = 0;
string fileName = @"C:\Data\CatalogIndex.xml";
DateTime startTime = DateTime.Now;
Console.WriteLine("Start Time: {0}", startTime);
FileInfo fi = new FileInfo(fileName);
Console.WriteLine("File Size:{0} MB", fi.Length / 1048576.0);
/* I want to change this loop into a producer-consumer pattern here, to process the data in parallel */
foreach (var element in StreamElements(fileName,"title"))
{
iCount++;
}
Console.WriteLine("Count: {0}", iCount);
Console.WriteLine("End Time: {0}, Time Taken:{1}", DateTime.Now, DateTime.Now - startTime);
}
private static IEnumerable<XElement> StreamElements(string fileName, string elementName)
{
using (var rdr = XmlReader.Create(fileName))
{
rdr.MoveToContent();
while (!rdr.EOF)
{
if ((rdr.NodeType == XmlNodeType.Element) && (rdr.Name == elementName))
{
var e = XElement.ReadFrom(rdr) as XElement;
yield return e;
}
else
{
rdr.Read();
}
}
rdr.Close();
}
}
Is this what you are trying to do?
void Main()
{
const int inputCollectionBufferSize = 1024;
const int bulkInsertBufferCapacity = 100;
const int bulkInsertConcurrency = 4;
BlockingCollection<object> inputCollection = new BlockingCollection<object>(inputCollectionBufferSize);
Task loadTask = Task.Factory.StartNew(() =>
{
foreach (object nextItem in ReadAllElements(...))
{
// this will potentially block if there are already enough items
inputCollection.Add(nextItem);
}
// mark this collection as done
inputCollection.CompleteAdding();
});
Action parseAction = () =>
{
List<object> bulkInsertBuffer = new List<object>(bulkInsertBufferCapacity);
foreach (object nextItem in inputCollection.GetConsumingEnumerable())
{
if (bulkInsertBuffer.Count == bulkInsertBufferCapacity)
{
CommitBuffer(bulkInsertBuffer);
bulkInsertBuffer.Clear();
}
bulkInsertBuffer.Add(nextItem);
}
// commit whatever is left over once the producer has completed adding
if (bulkInsertBuffer.Count > 0)
{
CommitBuffer(bulkInsertBuffer);
}
};
List<Task> parseTasks = new List<Task>(bulkInsertConcurrency);
for (int i = 0; i < bulkInsertConcurrency; i++)
{
parseTasks.Add(Task.Factory.StartNew(parseAction));
}
// wait before exiting
loadTask.Wait();
Task.WaitAll(parseTasks.ToArray());
}
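The CommitBuffer(...) call above is left abstract. A rough sketch of how such a batch could be pushed with SqlBulkCopy is below - the connection string, destination table and single "Title" column are placeholders, not taken from the question:

// requires: using System.Collections.Generic; using System.Data; using System.Data.SqlClient;
static void CommitBuffer(List<object> bulkInsertBuffer)
{
    // shape the batch as a DataTable whose columns match the destination table
    var table = new DataTable();
    table.Columns.Add("Title", typeof(string)); // hypothetical column

    foreach (object item in bulkInsertBuffer)
    {
        table.Rows.Add(item.ToString());
    }

    using (var connection = new SqlConnection("<your connection string>"))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.CatalogTitles"; // hypothetical table
            bulkCopy.WriteToServer(table);
        }
    }
}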