I'm trying to upload multiple files to S3 using the TransferUtility class from the Amazon SDK for .NET.
My thinking was that since the SDK doesn't allow uploading multiple files from different folders at once, I'd create multiple threads and upload from there. But it looks like the Amazon SDK has some kind of check against this, since I don't see any parallel execution of my upload method.
Here is the pseudo-code I'm using:
int totalUploaded = 0;
foreach (var dItem in Detection.Items.AsParallel())
{
UploadFile(dItem);
totalUploaded++;
Action a = () => { lblStatus.Text = $"Uploaded {totalUploaded} from {Detection.ItemsCount}"; };
BeginInvoke(a);
}
I'm using .AsParallel() to spawn multiple threads. My CPU (i7-5930K) has 6 cores and supports hyper-threading, so AsParallel should be able to spawn as many threads as needed.
And here is the upload method:
private void UploadFile(Detection.Item item)
{
Debug.WriteLine("Enter " + item.FileInfo);
Interlocked.Increment(ref _threadsCount);
...
using (var client = AmazonS3Client)
{
....
// if we are here we need to upload
TransferUtilityUploadRequest request = new TransferUtilityUploadRequest
{
Key = s3Key,
BucketName = settings.BucketName,
CannedACL = S3CannedACL.PublicRead,
FilePath = item.FileInfo.FullName,
ContentType = "image/png",
};
TransferUtility utility = new TransferUtility(client);
utility.Upload(request);
}
}
I can't see what could be wrong here. Any ideas are highly appreciated.
Thanks
The problem in your code is that you are only constructing the ParallelEnumerable, but then treating it like a simple IEnumerable:
foreach (var dItem in Detection.Items.AsParallel())
This part of the code simply iterates over the collection sequentially. If you want parallel execution, you have to use the extension method ForAll():
Detection.Items.AsParallel().ForAll(dItem =>
{
//Do parallel stuff here
});
You can also simply use the Parallel class:
Parallel.ForEach(Detection.Items, dItem =>
{
//Do parallel stuff here
});
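For illustration, here is a minimal sketch of what the asker's loop could look like with Parallel.ForEach, keeping the progress counter and the UI update from the question (Detection.Items, UploadFile, lblStatus and Detection.ItemsCount are the question's own members); the counter is incremented with Interlocked because the loop body now runs on several threads at once:
int totalUploaded = 0;
Parallel.ForEach(Detection.Items, dItem =>
{
    UploadFile(dItem);
    // the counter is shared between threads, so increment it atomically
    int done = Interlocked.Increment(ref totalUploaded);
    // marshal the label update back onto the UI thread
    BeginInvoke((Action)(() => lblStatus.Text = $"Uploaded {done} from {Detection.ItemsCount}"));
});
Since S3 uploads are I/O-bound rather than CPU-bound, it may also be worth capping the concurrency explicitly with ParallelOptions.MaxDegreeOfParallelism instead of letting it scale with the core count.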
I am wondering if I should be using Parallel.ForEach() for my case. For a bit of context: I am developing a small music player using the NAudio library. I want to use Parallel.ForEach() in a factory method to quickly access .mp3 files and create TrackModel objects to represent them (about 400). The code looks like this:
public static List<TrackModel> CreateTracks(string[] files)
{
// Guard Clause
if (files == null || files.Length == 0) throw new ArgumentException();
var output = new List<TrackModel>();
Parallel.ForEach(files, file =>
{
    TrackModel track; // declared inside the lambda so each parallel iteration gets its own variable
using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
{
track = new TrackModel()
{
FilePath = file,
Title = File.Create(file).Tag.Title,
Artist = File.Create(file).Tag.FirstPerformer,
TrackLength = musicPlayer.GetLengthInSeconds(),
};
}
lock (output)
{
output.Add(track);
}
});
return output;
}
Note: I use lock to prevent multiple Threads from adding elements to the list at the same time.
My question is the following: Should I be using Parallel.ForEach() in this situation or am I better off writing a normal foreach loop? Is this the right approach to achieve better performance and should I be using multithreading in combination with file access in the first place?
You're better off avoiding both a foreach and Parallel.ForEach. In this case AsParallel() is your friend.
Try this:
public static List<TrackModel> CreateTracks(string[] files)
{
if (files == null || files.Length == 0) throw new ArgumentException();
return
files
.AsParallel()
.AsOrdered()
.WithDegreeOfParallelism(2)
.Select(file =>
{
using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
{
return new TrackModel()
{
FilePath = file,
Title = File.Create(file).Tag.Title,
Artist = File.Create(file).Tag.FirstPerformer,
TrackLength = musicPlayer.GetLengthInSeconds(),
};
}
})
.ToList();
}
This handles all the parallel logic and the locking behind the scenes for you.
Combining the suggestions from the comments and answers and adapting them to my code, I was able to solve my issue with the following code:
public List<TrackModel> CreateTracks(string[] files)
{
var output = files
.AsParallel()
.Select(file =>
{
using MusicPlayer musicPlayer = new MusicPlayer(file, 0f);
using File musicFile = File.Create(file);
return new TrackModel()
{
FilePath = file,
Title = musicFile.Tag.Title,
Artist = musicFile.Tag.FirstPerformer,
Length = musicPlayer.GetLengthInSeconds(),
};
})
.ToList();
return output;
}
Using AsParallel() helped significantly decrease the loading time which is what I was looking for. I will mark Enigmativity's answer as correct because of the clever idea. Initially, it threw a weird AggregateException, but I was able to solve it by saving the output in a variable and then returning it.
Credit to marsze as well, whose suggestion helped me fix a memory leak in the application and shave off 16MB of memory (!).
I wanted to ask about the best approach for a console application which can also be used as a Windows service in a .NET Core environment. The problem is not creating such an application, but rather the code it executes.
Let me try to explain what exactly my problem is.
When the Windows service is started, a loop is initiated which does several things:
accessing Amazon's AWS SQS
accessing a CSV file via an HTTP request => that data is used and partially stored in a DB
accessing tables of an Oracle DB via EF (insert, update and delete)
So far so good. Everything is working out as I want. I'm using dependency injection (scoped) in my loop to access the methods I have written to get all of this done.
The tricky part is that the memory usage of this application is rather ... not optimal. It does what it does, but while downloading and using the data from the CSV files, the application uses too much memory and doesn't free it up properly. Do you have any suggestions or knowledge-base articles on how to handle such scenarios (a loop in a Windows service)?
I have tried to free up everything I can, such as clearing lists and setting them to null, disabling any tracking in EF while querying data (also disabling the change tracker for extra inserts/updates) and using "using" statements (/ disposing elements).
Also, I would like to add that I am using the latest SDK of Amazon AWS (Core and SQS) and EF Core 2.2.6 with Oracle Provider.
Any chance you might have a hint?
If you need more information, just tell me. I will then provide more data as needed.
Kind regards
Regarding the comment about reading the CSV file:
Reading the file from the URL:
using (var response = await request.GetResponseAsync())
{
await using (var receiveStream = response.GetResponseStream())
{
using (var readStream = new StreamReader(receiveStream, Encoding.UTF8))
{
var content = readStream.ReadToEnd();
result.Content = content.Split('\n').ToList();
result.IsSuccess = true;
}
}
}
and after reading it, I convert it to my target class
public static async Task<List<Curve>> ReturnCurveData(List<string> content)
{
var checkVar = -1;
var list = new List<Curve>();
foreach (var entry in content)
{
if (string.IsNullOrEmpty(entry)) continue;
var entrySplitted = entry.Split('|');
if (entrySplitted.Length < 3) continue;
else if(!int.TryParse(entrySplitted[0], out checkVar)) continue;
var item = new Curve();
item.Property1 = Convert.ToInt32(entrySplitted[0]);
item.Property2 = (entrySplitted.Length ==3) ? DateTime.Now : Convert.ToDateTime(entrySplitted[1]);
item.Property3 = (entrySplitted.Length ==3) ? Convert.ToDateTime(entrySplitted[1]) : Convert.ToDateTime(entrySplitted[2]);
item.Value = (entrySplitted.Length ==3) ? Convert.ToDecimal(entrySplitted[2], new CultureInfo("en-US")): Convert.ToDecimal(entrySplitted[3], new CultureInfo("en-US"));
list.Add(item);
}
return await Task.FromResult(list);
}
Regarding the definition of scope
var hostBuilder = new HostBuilder()
.UseContentRoot(Directory.GetCurrentDirectory())
.ConfigureAppConfiguration((hostingContext, config) =>
{
...
})
.ConfigureServices((hostContext, services) =>
{
services.AddScoped<Data.Queries.Database.Db>();
services.AddScoped<Data.Queries.External.Aws>();
services.AddScoped<Data.Queries.External.Web>();
services.RegisterEfCoreOracle<DbContext>(AppDomain.CurrentDomain.BaseDirectory,
"cfg_db.json");
services.AddScoped<IExecute, Execute>();
services.AddHostedService<ExecuteHost>();
})
.ConfigureLogging((hostingContext, logging) =>
{
...
});
public static void RegisterEfCoreOracle<T>(this IServiceCollection services, string configurationDirectory, string configurationFile, ServiceLifetime lifetime = ServiceLifetime.Scoped) where T : DbContext
{
//Adding configuration file
IConfiguration configuration = new ConfigurationBuilder()
.SetBasePath(configurationDirectory)
.AddJsonFile(configurationFile, optional: false)
.Build();
services.Configure<OracleConfiguration<T>>(configuration);
var oraConfig = services.ReturnServiceProvider().GetService<IOptions<OracleConfiguration<T>>>();
if (oraConfig.Value.Compatibility != null)
{
services.AddDbContext<T>(o => o
.UseOracle(oraConfig.Value.ConnectionString(), b => b
.UseOracleSQLCompatibility(oraConfig.Value.Compatibility)), lifetime);
}
else
{
services.AddDbContext<T>(o => o
.UseOracle(oraConfig.Value.ConnectionString()), lifetime);
}
}
Well since you posted your code, we can analyze the problem pretty easily:
var content = readStream.ReadToEnd();
As I said, never read the whole file into memory; process it line by line, for example using a StreamReader or StringReader, or any of the million CSV parsers on NuGet.
result.Content = content.Split('\n').ToList();
Not only are you reading the entire file into memory, you're then splitting it into values (so in addition to the entire file contents, you're allocating an array with one entry per line, and for each line allocating a string for each delimited element, for a total of lines*elements strings), and on top of all that allocating yet another list and copying the contents of the array into it.
Edit: you're splitting it into lines here and into values in the second part. The analysis is correct, but the problem is spread across the two snippets.
Needless to say, this is ludicrous at best. Stop using Split, don't call ToList needlessly, and don't read all of it at once. This is a long-running service; all of this could theoretically run on every loop iteration, which can easily add up to dozens of passes.
I won't go over the second part, but it's even worse. At a glance I see even more lists allocated and even more Splits. And the line return await Task.FromResult(list); shows you don't understand async functions at all. Not only is what you have not async at all, but if you insist on making it async for the fun of it, return the list directly, not as an awaited task.
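To make the line-by-line suggestion concrete, here is a rough sketch that streams the response and builds the Curve items directly, without materializing the whole file or any intermediate string lists. It reuses the request, Curve type and column layout from the question's own code and needs System.Net, System.IO, System.Text and System.Globalization:
public static async Task<List<Curve>> ReadCurvesAsync(WebRequest request)
{
    var list = new List<Curve>();
    using (var response = await request.GetResponseAsync())
    using (var receiveStream = response.GetResponseStream())
    using (var reader = new StreamReader(receiveStream, Encoding.UTF8))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            if (string.IsNullOrEmpty(line)) continue;
            var parts = line.Split('|');
            if (parts.Length < 3) continue;
            if (!int.TryParse(parts[0], out var id)) continue;
            list.Add(new Curve
            {
                Property1 = id,
                Property2 = parts.Length == 3 ? DateTime.Now : Convert.ToDateTime(parts[1]),
                Property3 = parts.Length == 3 ? Convert.ToDateTime(parts[1]) : Convert.ToDateTime(parts[2]),
                Value = parts.Length == 3
                    ? Convert.ToDecimal(parts[2], new CultureInfo("en-US"))
                    : Convert.ToDecimal(parts[3], new CultureInfo("en-US"))
            });
        }
    }
    return list;
}
Each iteration still allocates one small array for the split, but the peak memory is now one line at a time rather than the whole file plus a copy of every value.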
I have a folder with many CSV files in it, which are around 3MB each in size.
example of content of one CSV:
afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...
Now I have a console app written in C# that searches each CSV file for a certain string.
And those strings to search for are in a txt file.
example of search txt file:
-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd
Then I call the method to search for each search string in all files in the given folder:
private static object _lockObject = new object();
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
var result = new List<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file);
if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
{
lock (_lockObject)
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
{
result.Add(searchFound);
}
}
}
});
return result;
}
The question now is: can I improve the performance of this operation in any way?
I have around 100 GB of files to search through.
It takes approximately 1 hour to search all ~30,000 files with around 25 search strings, on an SSD disk and a good i7 CPU.
Would it make a difference to have larger or smaller CSV files? I just want this search to be as fast as possible.
UPDATE
I have tried every suggestion that you wrote, and this is what performed best for me (removing ToLower from the LINQ yielded the biggest performance boost; search time went from 1 hour down to 16 minutes!):
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
var result = new BlockingCollection<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());
if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
{
result.Add(searchFound);
}
}
});
return result;
}
Probably something like Lucene could give you a performance boost: why don't you index your data so you can search it easily?
Take a look at Lucene.NET.
You'll avoid searching the data sequentially. In addition, you can build several indexes over the same data to get to certain results at light speed.
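To make the indexing idea concrete, here is a rough sketch of what it could look like with Lucene.NET 4.8 (the "value" and "line" field names are invented for the example, searchFolder and searchList come from the question, and the exact API may differ between Lucene.NET versions; it uses the Lucene.Net.Store, .Index, .Documents, .Search, .Analysis.Standard and .Util namespaces):
// rough sketch only - index once, then query instead of rescanning 100 GB of CSVs
var version = LuceneVersion.LUCENE_48;
using var dir = FSDirectory.Open("search-index");
using var analyzer = new StandardAnalyzer(version);
using var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer));
foreach (var file in System.IO.Directory.EnumerateFiles(searchFolder))
{
    foreach (var line in System.IO.File.ReadLines(file))
    {
        var columns = line.Split(',');
        if (columns.Length < 2) continue;
        writer.AddDocument(new Document
        {
            new StringField("value", columns[1].TrimEnd(';'), Field.Store.YES), // exact-match field
            new StoredField("line", line)                                       // original line for output
        });
    }
}
writer.Commit();
// querying: one cheap lookup per search string
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
foreach (var term in searchList)
{
    var hits = searcher.Search(new TermQuery(new Term("value", term)), 1000);
    foreach (var hit in hits.ScoreDocs)
        Console.WriteLine(searcher.Doc(hit.Doc).Get("line"));
}
Building the index costs one full pass over the data, but after that each search string becomes an index lookup rather than another scan of all ~30,000 files.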
Try to:
Do .ToLower once per line instead of calling .ToLower for each element in searchList.
Do one scan of the file instead of two passes (Any and Where). Collect the matches and then add them under the lock only if any were found. In your sample you waste time on the two passes and block all threads while searching and adding.
If you know the position to look at (in your sample you do), you can scan from that position instead of searching the whole string.
Use the producer-consumer pattern, for example with BlockingCollection<T>, so there is no need for a lock.
If you need to strictly match a field, build a HashSet from searchList and do searchHash.Contains(fieldValue); this will speed things up dramatically (a small sketch of this follows after the sample below).
So here is a sample (not tested):
using(var searcher = new FilesSearcher(
searchFolder: "path",
searchList: toLookFor))
{
searcher.SearchContentListInFiles();
}
here is the searcher:
public class FilesSearcher : IDisposable
{
private readonly BlockingCollection<string[]> filesInMemory;
private readonly string searchFolder;
private readonly string[] searchList;
public FilesSearcher(string searchFolder, string[] searchList)
{
// reader thread stores lines here
this.filesInMemory = new BlockingCollection<string[]>(
// limit the number of files held in memory; if the processing threads are not fast enough, the reader will pause and wait
boundedCapacity: 100);
this.searchFolder = searchFolder;
this.searchList = searchList;
}
public IEnumerable<string> SearchContentListInFiles()
{
// start read,
// we don't need many threads here; probably 1 thread per storage device is the optimum
var filesReaderTask = Task.Factory.StartNew(ReadFiles, TaskCreationOptions.LongRunning);
// at least one processing thread, because the reader thread is IO-bound
var taskCount = Math.Max(1, Environment.ProcessorCount - 1);
// start search threads
var tasks = Enumerable
.Range(0, taskCount)
.Select(x => Task<string[]>.Factory.StartNew(Search, TaskCreationOptions.LongRunning))
.ToArray();
// await for results
Task.WaitAll(tasks);
// combine results
return tasks
.SelectMany(t => t.Result)
.ToArray();
}
private string[] Search()
{
// if you always get unique results use list
var results = new List<string>();
//var results = new HashSet<string>();
foreach (var content in this.filesInMemory.GetConsumingEnumerable())
{
// one pass by a file
var currentFileMatches = content
.Where(sourceLine =>
{
// ToLower once per line; we don't need to make a lowered copy of the whole file
var lower = sourceLine.ToLower();
return this.searchList.Any(lower.Contains);
});
// store current file matches
foreach (var currentMatch in currentFileMatches)
{
results.Add(currentMatch);
}
}
return results.ToArray();
}
private void ReadFiles()
{
var files = Directory.EnumerateFiles(this.searchFolder);
try
{
foreach (var file in files)
{
var fileContent = File.ReadLines(file);
// add file, or wait if filesInMemory are full
this.filesInMemory.Add(fileContent.ToArray());
}
}
finally
{
this.filesInMemory.CompleteAdding();
}
}
public void Dispose()
{
if (filesInMemory != null)
filesInMemory.Dispose();
}
}
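For tip 5, here is a small sketch of the exact-field variant: since every line in the sample CSVs looks like <hash>,<value>;, the value can be cut out at a known position and checked against a HashSet instead of running Contains once per search string (the line format is taken from the question's example data):
// exact-field lookup: one O(1) hash check per line instead of one Contains call per search string
var searchHash = new HashSet<string>(searchList);
bool LineMatches(string line)
{
    // lines look like "<hash>,<value>;" in the sample data
    int comma = line.IndexOf(',');
    if (comma < 0) return false;
    var fieldValue = line.Substring(comma + 1).TrimEnd(';');
    return searchHash.Contains(fieldValue);
}
This only applies if the search strings really are complete field values rather than substrings.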
This operation is first and foremost disk-bound. Disk-bound operations do not benefit from multithreading. Indeed, all you will do is swamp the disk controller with a ton of conflicting requests at the same time, which a feature like NCQ then has to straighten out again.
If you had loaded all the files into memory first, your operation would be memory-bound. And memory-bound operations do not benefit from multithreading either (usually; that gets into the details of CPU and memory architecture).
While a certain amount of multitasking is mandatory in programming, true multithreading only helps with CPU-bound operations. Nothing here looks remotely CPU-bound, so multithreading the search (one thread per file) will not make it faster, and will indeed likely make it slower due to all the thread switching and synchronization overhead.
I have code for Tesseract that runs in a single instance. How can I parallelize the code so that it can run on quad-core or 8-core processor systems? Here is my code block. Thanks in advance.
using (TesseractEngine engine = new TesseractEngine(#"./tessdata", "tel+tel1", EngineMode.Default))
{
foreach (string ab in files)
{
using (var pages = Pix.LoadFromFile(ab))
{
using (Tesseract.Page page = engine.Process(pages,Tesseract.PageSegMode.SingleBlock))
{
string text = page.GetText();
OCRedText.Append(text);
}
}
}
}
This has worked for me:
static IEnumerable<string> Ocr(string directory, string sep)
=> Directory.GetFiles(directory, sep)
.AsParallel()
.Select(x =>
{
using var engine = new TesseractEngine(tessdata, "eng", EngineMode.Default);
using var img = Pix.LoadFromFile(x);
using var page = engine.Process(img);
return page.GetText();
}).ToList();
I am no expert on the matter of parallelization, but this function OCRs 8 TIFFs in 12 seconds.
However, it creates an engine for every TIFF. I have not been able to call engine.Process concurrently.
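If creating an engine per image turns out to be too expensive, one alternative sketch (assuming, as above, that a single TesseractEngine should not be shared across threads) keeps one engine per worker thread via ThreadLocal<T>; tessdata stands in for the same path variable used in the snippet above:
static IEnumerable<string> Ocr(string directory, string sep)
{
    // sketch: one TesseractEngine per worker thread instead of one per file
    using var engines = new ThreadLocal<TesseractEngine>(
        () => new TesseractEngine(tessdata, "eng", EngineMode.Default),
        trackAllValues: true);
    try
    {
        return Directory.GetFiles(directory, sep)
            .AsParallel()
            .Select(x =>
            {
                using var img = Pix.LoadFromFile(x);
                using var page = engines.Value.Process(img); // reuse this thread's engine
                return page.GetText();
            })
            .ToList();
    }
    finally
    {
        // dispose every engine that was actually created
        foreach (var engine in engines.Values)
            engine.Dispose();
    }
}
Whether this is faster than an engine per file depends on how expensive engine start-up is relative to the OCR work itself, so it is worth measuring both variants.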
The simplest way to run this code in parallel is to use PLINQ. Calling AsParallel() on an enumeration will automatically run the query that follows it (.Select(...)) simultaneously on all available CPU cores.
It is crucial to run only thread-safe code in parallel. Assuming TesseractEngine is thread-safe (as you suggest in a comment; I didn't verify it myself), as well as Pix.LoadFromFile(), the only problematic part could be OCRedText.Append(). It is not clear from the code what OCRedText is, so I assume it is a StringBuilder or a List and therefore not thread-safe. So I removed this part from the code that runs in parallel and process it later on a single thread - since the .Append() method is likely to run fast, this shouldn't have a significant adverse effect on overall performance.
using (TesseractEngine engine = new TesseractEngine(#"./tessdata", "tel+tel1", EngineMode.Default))
{
var texts = files.AsParallel().Select(ab =>
{
using (var pages = Pix.LoadFromFile(ab))
{
using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
{
return page.GetText();
}
}
});
foreach (string text in texts)
{
OCRedText.Append(text);
}
}
I have a class "Image" with three properties: Url, Id, Content.
I have a list of 10 such images.
This is a Silverlight app.
I want to create a method:
IObservable<Image> DownloadImages(List<Image> imagesToDownload)
{
//start downloading all images in imagesToDownload
//OnImageDownloaded:
image.Content = webResponse.Content
yield image
}
This method starts downloading all 10 images in parallel.
Then, when each download completes, it sets the Image.Content to the WebResponse.Content of that download.
The result should be an IObservable stream with each downloaded image.
I'm a beginner in Rx, and I think what I want can be achieved with ForkJoin, but that's in an experimental release of the Reactive Extensions DLL which I don't want to use.
Also, I really don't like counting completed downloads in callbacks to detect that all images have been downloaded and only then calling OnCompleted().
That doesn't seem to be in the Rx spirit to me.
I'm also posting the solution I've coded so far, though I don't like it because it's long/ugly and uses counters.
return Observable.Create((IObserver<Attachment> observer) =>
{
int downloadCount = attachmentsToBeDownloaded.Count;
foreach (var attachment in attachmentsToBeDownloaded)
{
Action<Attachment> action = attachmentDDD =>
this.BeginDownloadAttachment2(attachment).Subscribe(imageDownloadWebResponse =>
{
try
{
using (Stream stream = imageDownloadWebResponse.GetResponseStream())
{
attachment.FileContent = stream.ReadToEnd();
}
observer.OnNext(attachmentDDD);
lock (downloadCountLocker)
{
downloadCount--;
if (downloadCount == 0)
{
observer.OnCompleted();
}
}
} catch (Exception ex)
{
observer.OnError(ex);
}
});
action.Invoke(attachment);
}
return () => { }; //do nothing when subscriber disposes subscription
});
}
OK, I did manage to make it work in the end, based on Jim's answer.
var obs = from image in attachmentsToBeDownloaded.ToObservable()
from webResponse in this.BeginDownloadAttachment2(image).ObserveOn(Scheduler.ThreadPool)
from responseStream in Observable.Using(webResponse.GetResponseStream, Observable.Return)
let newImage = setAttachmentValue(image, responseStream.ReadToEnd())
select newImage;
where setAttachmentValue just does `image.Content = bytes; return image;`
BeginDownloadAttachment2 code:
private IObservable<WebResponse> BeginDownloadAttachment2(Attachment attachment)
{
Uri requestUri = new Uri(this.DownloadLinkBaseUrl + attachment.Id.ToString());
WebRequest imageDownloadWebRequest = HttpWebRequest.Create(requestUri);
IObservable<WebResponse> imageDownloadObservable = Observable.FromAsyncPattern<WebResponse>(imageDownloadWebRequest.BeginGetResponse, imageDownloadWebRequest.EndGetResponse)();
return imageDownloadObservable;
}
How about we simplify this a bit. Take your image list and convert it to an observable. Next, consider using the Observable.FromAsyncPattern to manage the service requests. Finally use SelectMany to coordinate the request with the response. I'm making some assumptions on how you are getting the file streams here. Essentially if you can pass in the BeginInvoke/EndInvoke delegates into FromAsyncPattern for your service request you are good.
var svcObs = Observable.FromAsyncPattern<Stream>(this.BeginDownloadAttachment2, this.EndDownloadAttachment2);
var obs = from image in imagesToDownload.ToObservable()
from responseStream in svcObs(image)
.ObserveOnDispatcher()
.Do(response => image.FileContent = response.ReadToEnd())
select image;
return obs;