Reading files from S3 concurrently - c#

I am calling the method below to get files from S3. I have a list of file paths in a collection, and I call the method inside a Parallel.ForEach loop. The AddRecords() method of Container is thread safe and ensures that only one thread enters the critical section. The code skips some of the files, and the returned result has fewer records than expected.
Parallel.ForEach(fileListAll, new ParallelOptions { MaxDegreeOfParallelism = 20 }, (currentFile) =>
{
    string RecordsRaw = _AmazonS3.GetFileFromS3(currentFile);
    Container.AddRecords(RecordsRaw);
});
public string GetFileFromS3(string filePath)
{
    try
    {
        string ResponseData = null;
        GetObjectRequest Request = new GetObjectRequest
        {
            BucketName = _AWSConfiguration.S3BucketName,
            Key = filePath
        };
        using (GetObjectResponse response = _AmazonS3Client.GetObject(Request))
        using (Stream responseStream = response.ResponseStream)
        using (StreamReader reader = new StreamReader(responseStream))
        {
            ResponseData = reader.ReadToEnd();
        }
        if (!string.IsNullOrEmpty(ResponseData))
        {
            return ResponseData;
        }
    }
    catch (Exception ex)
    {
        //Removed the code for brevity
    }
    return null;
}
I am not sure how I can change the GetFileFromS3() method so that it can be called safely by multiple threads while still maintaining concurrency.
Please bear with me, as this might be a basic question, but I have been struggling for 2 days to get this working correctly.
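One detail worth noting from the code itself: the catch block swallows every exception and the method returns null, so any file whose download fails is silently dropped, which matches the "fewer records than expected" symptom. For illustration, a sketch of an async variant that surfaces failures instead of hiding them (names like _AmazonS3Client and _AWSConfiguration are carried over from the question; the 20-slot throttle mirrors the MaxDegreeOfParallelism above and is an assumption):

```csharp
// Sketch: bounded-concurrency async S3 reads; a failed file faults its task
// instead of silently shrinking the result set.
private static readonly SemaphoreSlim _s3Throttle = new SemaphoreSlim(20);

public async Task<string> GetFileFromS3Async(string filePath)
{
    await _s3Throttle.WaitAsync();
    try
    {
        var request = new GetObjectRequest
        {
            BucketName = _AWSConfiguration.S3BucketName,
            Key = filePath
        };
        using (GetObjectResponse response = await _AmazonS3Client.GetObjectAsync(request))
        using (var reader = new StreamReader(response.ResponseStream))
        {
            return await reader.ReadToEndAsync();
        }
    }
    finally
    {
        _s3Throttle.Release();
    }
}
```

A caller could then do `string[] all = await Task.WhenAll(fileListAll.Select(GetFileFromS3Async));` and call Container.AddRecords on a single thread afterwards, removing the need for a lock entirely.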

Working with SemaphoreSlim and the ThreadPool for sending HttpWebRequests

I need to understand how SemaphoreSlim works with the ThreadPool.
I need to read a file from the local disk of a computer, read each line, and then send that line as an HttpWebRequest to 2 different servers, and correspondingly get 2 responses back.
So let's say the file has 100 requests (this number can be in the thousands or even more in a real-world scenario); when all these requests have been sent, I should get back 200 responses (as I mentioned, each request goes to 2 different servers and fetches 2 responses from them). Here is my code:
static void Main(string[] args)
{
    try
    {
        SendEntriesInFile(someFileOnTheLocaldisk);
        Console.WriteLine();
    }
    catch (Exception e)
    {
        Console.WriteLine("Regression Tool Error: Major Unspecified Error:\n" + e);
    }
}
public class MyParameters
{
    // Members elided in the question; it carries the entry, total number of
    // entries, server addresses, request type, and file name used by Send().
}
// A SemaphoreSlim cannot be declared as a static local; it must be a field.
private static readonly SemaphoreSlim threadSemaphore = new SemaphoreSlim(5, 10);

private void SendEntriesInFile(FileInfo file)
{
    using (StreamReader reader = file.OpenText())
    {
        string entry = reader.ReadLine();
        while (!String.IsNullOrEmpty(entry))
        {
            MyParameters myParams = new MyParameters(entry, totalNumberOfEntries, serverAddresses, requestType, fileName);
            threadSemaphore.Wait();
            ThreadPool.QueueUserWorkItem(new WaitCallback(Send), myParams);
            entry = reader.ReadLine();
        }
    }
}
private void Send(object state)
{
    MyParameters myParams = (MyParameters)state;
    for (int i = 0; i < myParams.ServerAddresses.Count; i++)
    {
        byte[] bytesArray = Encoding.UTF8.GetBytes(myParams.Request);
        HttpWebRequest webRequest = null;
        if (myParams.TypeOfRequest == "tlc")
        {
            webRequest = (HttpWebRequest)WebRequest.Create(string.Format("http://{0}:{1}{2}", myParams.ServerAddresses[i], port, "/SomeMethod1"));
        }
        else
        {
            webRequest = (HttpWebRequest)WebRequest.Create(string.Format("http://{0}:{1}{2}", myParams.ServerAddresses[i], port, "/SomeMethod2"));
        }
        if (webRequest != null)
        {
            webRequest.Method = "POST";
            webRequest.ContentType = "application/x-www-form-urlencoded";
            webRequest.ContentLength = bytesArray.Length;
            webRequest.Timeout = responseTimeout;
            webRequest.ReadWriteTimeout = transmissionTimeout;
            webRequest.ServicePoint.ConnectionLimit = maxConnections;
            webRequest.ServicePoint.ConnectionLeaseTimeout = connectionLifeDuration;
            webRequest.ServicePoint.MaxIdleTime = maxConnectionIdleTimeout;
            webRequest.ServicePoint.UseNagleAlgorithm = nagleAlgorithm;
            webRequest.ServicePoint.Expect100Continue = oneHundredContinue;
            using (Stream requestStream = webRequest.GetRequestStream())
            {
                //Write the request through the request stream
                requestStream.Write(bytesArray, 0, bytesArray.Length);
                requestStream.Flush();
            }
            string response = "";
            using (HttpWebResponse httpWebResponse = (HttpWebResponse)webRequest.GetResponse())
            using (Stream responseStream = httpWebResponse.GetResponseStream())
            using (StreamReader stmReader = new StreamReader(responseStream))
            {
                response = stmReader.ReadToEnd();
                string fileName = "";
                if (i == 0)
                {
                    fileName = Name is generated through some logic here;
                }
                else
                {
                    fileName = Name is generated through some logic here;
                }
                using (StreamWriter writer = new StreamWriter(fileName))
                {
                    writer.WriteLine(response);
                }
            }
        }
    }
    Console.WriteLine(" Release semaphore: ---- " + threadSemaphore.Release());
}
The only thing I'm confused by is that when I do something like the above, my semaphore allows 5 threads to execute the Send() method concurrently while other threads wait in a queue for their turn. Since my file contains 100 requests, I should get back 200 responses, but every time I end up getting only 107 to 109 responses back. Why don't I get all 200 responses back? With different code (not discussed here), when I send, say, 10 requests in parallel on 10 on-demand created threads (not using the ThreadPool), I get back all the responses.
Lots of very helpful comments. To add to them, I would recommend using pure async I/O for this, and applying the semaphore in a DelegatingHandler.
You want to register this class:
public class LimitedConcurrentHttpHandler : DelegatingHandler
{
    private readonly SemaphoreSlim _concurrencyLimit = new(8);

    public LimitedConcurrentHttpHandler() { }

    public LimitedConcurrentHttpHandler(HttpMessageHandler inner) : base(inner) { }

    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // WaitAsync, not Wait: the handler must never block a thread.
        await _concurrencyLimit.WaitAsync(cancellationToken);
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            _concurrencyLimit.Release();
        }
    }
}
This way, you can concentrate on your actual business logic within your actual code.
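For completeness, a sketch of how such a handler might be wired up manually outside of dependency injection (the inner HttpClientHandler and the client's usage here are assumptions, not part of the answer):

```csharp
// Hypothetical wiring: the handler wraps a standard HttpClientHandler, so
// every request sent through this client must first acquire the semaphore.
var client = new HttpClient(new LimitedConcurrentHttpHandler(new HttpClientHandler()));

// All of these requests share the 8-slot concurrency limit; the 9th and
// later calls simply wait inside SendAsync until a slot frees up.
var tasks = urls.Select(u => client.GetStringAsync(u));
string[] bodies = await Task.WhenAll(tasks);
```

The nice property of this design is that throttling is invisible at the call site: the business code just awaits ordinary HttpClient calls.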

Web API MultipartAsync Task no result

The goal is to upload a single file to my webserver and then store it in an MSSQL database using MultipartContent. While the file is uploading, a progress bar should be displayed in the client (a WPF application). The following code sample shows only the upload to a MemoryStream (no database connection).
The connection from client to server works, the upload to the MemoryStream on the server side works, and receiving the percentage on the client side works (the ContinueWith section in my sample code). The problem is that the client never receives the final CreateResponse result - like a timeout or lost connection; I am not sure, because I don't get an error/exception. The client never receives the task's final result.
WebApi:
public class AttachmentsController : ApiController
{
    [HttpPost]
    [Route("api/Attachments/Upload")]
    public async Task<HttpResponseMessage> Upload()
    {
        if (!Request.Content.IsMimeMultipartContent())
            throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
        try
        {
            AttachmentUpload _resultAttachmentUpload = null;
            var _provider = new InMemoryMultipartFormDataStreamProvider();
            await Request.Content.ReadAsMultipartAsync(_provider)
                .ContinueWith(t =>
                {
                    if (t.IsFaulted || t.IsCanceled)
                    {
                        throw new HttpResponseException(HttpStatusCode.InternalServerError);
                    }
                    return new AttachmentUpload()
                    {
                        FileName = _provider.Files[0].Headers.ContentDisposition
                            .FileName.Trim('\"'),
                        Size = _provider.Files[0].ReadAsStringAsync().Result
                            .Length / 1024
                    };
                });
            return Request.CreateResponse<AttachmentUpload>(HttpStatusCode.Accepted,
                _resultAttachmentUpload);
        }
        catch (Exception e)
        {
            return Request.CreateErrorResponse(HttpStatusCode.InternalServerError,
                e.ToString());
        }
    }
}
WPF Client UploadService.cs:
private async Task<Attachment> UploadAttachment(AttachmentUpload uploadData, string filePath)
{
    try
    {
        var _encoding = Encoding.UTF8;
        MultipartFormDataContent _multipartFormDataContent = new MultipartFormDataContent();
        _multipartFormDataContent.Add(new StreamContent(new MemoryStream(
            File.ReadAllBytes(filePath))), uploadData.FileName, uploadData.FileName);
        _multipartFormDataContent.Add(new StringContent(uploadData.Id.ToString()),
            "AttachmentId", "AttachmentId");
        _multipartFormDataContent.Add(new StringContent(
            uploadData.Upload.ToString(CultureInfo.CurrentCulture)), "AttachmentUpload",
            "AttachmentUpload");
        _multipartFormDataContent.Add(new StringContent(
            uploadData.DocumentId.ToString()), "DocumentId", "DocumentId");
        _multipartFormDataContent.Add(new StringContent(
            uploadData.User, _encoding), "User", "User");
        //ProgressMessageHandler is instantiated in the ctor to show the progress bar
        var _client = new HttpClient(ProgressMessageHandler);
        _client.DefaultRequestHeaders.Accept.Add(
            new MediaTypeWithQualityHeaderValue("multipart/form-data"));
        _client.Timeout = TimeSpan.FromMinutes(5);
        var _requestUri = new Uri(BaseAddress + "api/Attachments/Upload");
        var _httpRequestTask = await _client.PostAsync(_requestUri, _multipartFormDataContent)
            .ContinueWith<AttachmentUpload>(request =>
            {
                var _httpResponse = request.Result;
                if (!_httpResponse.IsSuccessStatusCode)
                {
                    throw new Exception();
                }
                var _response = _httpResponse.Content.ReadAsStringAsync();
                AttachmentUpload _upload =
                    JsonConvert.DeserializeObject<AttachmentUpload>(_response.Result);
                return _upload;
            });
        var _resultAttachment = _httpRequestTask;
        return new Attachment()
        {
            Id = _resultAttachment.Id,
            FileName = _resultAttachment.FileName,
            Comment = _resultAttachment.Comment,
            Upload = _resultAttachment.Upload,
        };
    }
    catch (Exception e)
    {
        //Handle exceptions
        //file not found, access denied, no internet connection etc etc
        var tmp = e;
        throw e;
    }
}
The program hangs at var _httpRequestTask = await _client.PostAsync(...).
The debugger never reaches the line var _resultAttachment = _httpRequestTask;.
Thank you very much for your help.
First of all, don't mix await and ContinueWith; the introduction of async/await effectively renders ContinueWith obsolete.
The reason that var _resultAttachment = _httpRequestTask; is never hit is because you have created a deadlock.
WPF has a synchronisation context, that ensures continuations resume on the UI thread.
In the line AttachmentUpload _upload = JsonConvert.DeserializeObject<AttachmentUpload>(_response.Result);, _response.Result is a blocking call; it blocks the current thread, until the Task referenced by _response has completed.
The ReadAsStringAsync method, which generated the Task, will attempt to resume once the asynchronous work has completed, and the WPF synchronisation context will force it to use the UI thread, which has been blocked by _response.Result, hence the deadlock.
To remedy this, use the await keyword for every asynchronous call:
var _httpResponse = await _client.PostAsync(_requestUri, _multipartFormDataContent);
if (!_httpResponse.IsSuccessStatusCode)
{
throw new Exception();
}
var _response = await _httpResponse.Content.ReadAsStringAsync();
var _resultAttachment = JsonConvert.DeserializeObject<AttachmentUpload>(_response);
You should also notice that the code is much more readable without the ContinueWith.
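The same advice applies on the server side, where the result of the ContinueWith is currently discarded, so _resultAttachmentUpload stays null even on success. A sketch of the awaited version of that controller body (types and names carried over from the question):

```csharp
// Sketch: await ReadAsMultipartAsync directly, then build the result -
// no ContinueWith, no .Result, and the upload info actually gets returned.
var _provider = new InMemoryMultipartFormDataStreamProvider();
await Request.Content.ReadAsMultipartAsync(_provider);
var _resultAttachmentUpload = new AttachmentUpload()
{
    FileName = _provider.Files[0].Headers.ContentDisposition.FileName.Trim('\"'),
    Size = (await _provider.Files[0].ReadAsStringAsync()).Length / 1024
};
return Request.CreateResponse<AttachmentUpload>(HttpStatusCode.Accepted,
    _resultAttachmentUpload);
```

If reading the multipart content faults, the exception now propagates naturally to the existing catch block, which already converts it into a 500 response.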

No of bytes copied from a file to a stream is null

I am trying to copy a file from Amazon S3 to a MemoryStream so that I can pass it along as a file to be downloaded.
I am getting the file from Amazon S3; however, when I try to copy the stream, I get a null value.
I have checked whether the stream is getting closed before copying, and it is not.
var ms = new MemoryStream();
try
{
    GetObjectRequest getObjectRequest = new GetObjectRequest();
    getObjectRequest.BucketName = Bucketname;
    getObjectRequest.Key = Keyname;
    var getObjectResponse = client.GetObjectAsync(getObjectRequest);
    getObjectResponse.Wait();
    getObjectResponse.Result.ResponseStream.CopyToAsync(ms);
    var len = ms.Length;
    return File(ms.ToArray(), System.Net.Mime.MediaTypeNames.Application.Pdf, "filename");
}
Also,
var len = ms.Length;
gives a value of 18.
So why is no content being read?
Please do point me in the right direction.
Your problem is right here:
getObjectResponse.Result.ResponseStream.CopyToAsync(ms);
Who is waiting for that Task to finish? No one, so you return in the middle of the operation.
Overall, the code should be:
try
{
    GetObjectRequest getObjectRequest = new GetObjectRequest();
    getObjectRequest.BucketName = Bucketname;
    getObjectRequest.Key = Keyname;
    var getObjectResponse = await client.GetObjectAsync(getObjectRequest);
    await getObjectResponse.ResponseStream.CopyToAsync(ms);
    var len = ms.Length;
    return File(ms.ToArray(), System.Net.Mime.MediaTypeNames.Application.Pdf, "filename");
}
Note the proper use of await for asynchronous operations, which you should be using, especially in a web framework like ASP.NET Core.
Use await for your GetObjectAsync method as below, and also for CopyToAsync as in #Camilo's answer (note that once GetObjectAsync is awaited, the result is the response itself, so there is no task left to Wait() on):
try
{
    GetObjectRequest getObjectRequest = new GetObjectRequest();
    getObjectRequest.BucketName = Bucketname;
    getObjectRequest.Key = Keyname;
    var getObjectResponse = await client.GetObjectAsync(getObjectRequest);
    await getObjectResponse.ResponseStream.CopyToAsync(ms);
    var len = ms.Length;
    return File(ms.ToArray(), System.Net.Mime.MediaTypeNames.Application.Pdf, "filename");
}
or you can do something like:
static async Task ReadStream()
{
    try
    {
        GetObjectRequest request = new GetObjectRequest
        {
            BucketName = bucketName,
            Key = keyName
        };
        using (GetObjectResponse response = await client.GetObjectAsync(request))
        using (Stream responseStream = response.ResponseStream)
        using (MemoryStream reader = new MemoryStream())
        {
            // MemoryStream has no Stream-taking constructor, so copy into it
            await responseStream.CopyToAsync(reader);
            //your code
        }
    }
    catch (AmazonS3Exception e)
    {
        //Handle it
    }
    catch (Exception e)
    {
        //Handle it
    }
}
The responseStream is your stream content, and consuming it looks something like:
ReadStream().Wait();
You need to wait for your async call to return. If your method is not marked as async and you don't want to mark it as such, you can do the following:
Task.Run(() => getObjectResponse.Result.ResponseStream.CopyToAsync(ms)).Wait();
This will execute the async copy and wait for it to finish, without requiring you to mark your function as async.

GetFileFromApplicationUriAsync seem to hang forever

I am trying to read data from a file, and GetFileFromApplicationUriAsync, most of the time, just hangs forever. Any ideas why that happens? I am debugging the app in the simulator.
A couple of notes:
BaseLocalFolder = "ms-appdata:///local/";
Uri baseUri = new Uri(BaseLocalFolder);
Uri cardsUri = new Uri(baseUri, Path.Combine(set, "cards.ccc"));
StorageFile file = await StorageFile.GetFileFromApplicationUriAsync(cardsUri);
using (IInputStream stream = await file.OpenSequentialReadAsync())
using (StreamReader rdr = new StreamReader(stream.AsStreamForRead()))
{
    var cardArray = Windows.Data.Json.JsonArray.Parse(rdr.ReadToEnd());
    foreach (Tables.Card card in cardArray.ToCards())
    {
        _cards.Add(card.ToCardViewModel());
    }
}
UPDATES:
After I changed the code according to the suggestions, I started having the same issue with ReadTextAsync(): once it hits that line it stays there forever, and I never get the string.
StorageFile file = await Windows.Storage.ApplicationData.Current.LocalFolder.GetFileAsync(Path.Combine(set, "cards.ccc"));
try
{
    string allCards = await Windows.Storage.FileIO.ReadTextAsync(file);
    var cardArray = Windows.Data.Json.JsonArray.Parse(allCards);
    foreach (Tables.Card card in cardArray.ToCards())
    {
        _cards.Add(card.ToCardViewModel());
    }
}
catch (FileNotFoundException err)
{
    string s = "";
}

a faster way to download multiple files

I need to download about 2 million files from the SEC website. Each file has a unique URL and is on average 10 kB. This is my current implementation:
List<string> urls = new List<string>();
// ... initialize urls ...
WebBrowser browser = new WebBrowser();
foreach (string url in urls)
{
    browser.Navigate(url);
    while (browser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
    StreamReader sr = new StreamReader(browser.DocumentStream);
    StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/')));
    sw.Write(sr.ReadToEnd());
    sr.Close();
    sw.Close();
}
The projected time is about 12 days... is there a faster way?
Edit: btw, the local file handling takes only 7% of the time.
Edit: this is my final implementation:
void Main(void)
{
    ServicePointManager.DefaultConnectionLimit = 10000;
    List<string> urls = new List<string>();
    // ... initialize urls ...
    int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
}

public int downloadFile(string url)
{
    int retries = 0;
retry:
    try
    {
        HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
        webrequest.Timeout = 10000;
        webrequest.ReadWriteTimeout = 10000;
        webrequest.Proxy = null;
        webrequest.KeepAlive = false;
        using (HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse())
        using (Stream sr = webresponse.GetResponseStream())
        using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/'))))
        {
            sr.CopyTo(sw);
        }
    }
    catch (Exception ee)
    {
        if (ee.Message != "The remote server returned an error: (404) Not Found." && ee.Message != "The remote server returned an error: (403) Forbidden.")
        {
            if (ee.Message.StartsWith("The operation has timed out") || ee.Message == "Unable to connect to the remote server" || ee.Message.StartsWith("The request was aborted: ") || ee.Message.StartsWith("Unable to read data from the transport connection: ") || ee.Message == "The remote server returned an error: (408) Request Timeout.") retries++;
            else MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
            goto retry;
        }
    }
    return retries;
}
Execute the downloads concurrently instead of sequentially, and set a sensible MaxDegreeOfParallelism; otherwise you will make too many simultaneous requests, which will look like a DoS attack:
public static void Main(string[] args)
{
    var urls = new List<string>();
    Parallel.ForEach(
        urls,
        new ParallelOptions { MaxDegreeOfParallelism = 10 },
        DownloadFile);
}

public static void DownloadFile(string url)
{
    using (var sr = new StreamReader(WebRequest.Create(url)
        .GetResponse().GetResponseStream()))
    using (var sw = new StreamWriter(url.Substring(url.LastIndexOf('/'))))
    {
        sw.Write(sr.ReadToEnd());
    }
}
Download the files in several threads; the number of threads depends on your throughput. Also, look at the WebClient and HttpWebRequest classes. A simple sample:
var list = new[]
{
    "http://google.com",
    "http://yahoo.com",
    "http://stackoverflow.com"
};
// Parallel.ForEach returns a ParallelLoopResult, not tasks
Parallel.ForEach(list,
    s =>
    {
        using (var client = new WebClient())
        {
            Console.WriteLine($"starting to download {s}");
            string result = client.DownloadString(s);
            Console.WriteLine($"finished downloading {s}");
        }
    });
I'd use several threads in parallel, with a WebClient. I recommend setting the max degree of parallelism to the number of threads you want, since an unspecified degree of parallelism doesn't work well for long-running tasks. I've used 50 parallel downloads in one of my projects without a problem, but depending on the speed of an individual download, a much lower number might be sufficient.
If you download multiple files in parallel from the same server, you're by default limited to a small number (2 or 4) of parallel downloads. While the HTTP standard specifies such a low limit, many servers don't enforce it. Use ServicePointManager.DefaultConnectionLimit = 10000; to increase the limit.
I think the code from o17t H1H' S'k seems right, but for I/O-bound work an async method should be used, like this:
public static async Task DownloadFileAsync(HttpClient httpClient, string url, string fileToWriteTo)
{
    using HttpResponseMessage response = await httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
    using Stream streamToReadFrom = await response.Content.ReadAsStreamAsync();
    using Stream streamToWriteTo = File.Open(fileToWriteTo, FileMode.Create);
    await streamToReadFrom.CopyToAsync(streamToWriteTo);
}
Parallel.ForEach is also available as Parallel.ForEachAsync. Parallel.ForEach has a lot of features that the async version doesn't have, but most of them are deprecated anyway. You can implement a producer/consumer system with Channel or BlockingCollection to handle the 2 million files, but that is only necessary if you don't know all the URLs at the start.
private static async Task StartDownload()
{
    (string, string)[] urls = new ValueTuple<string, string>[]{
        new ("https://dotnet.microsoft.com", "C:/YoureFile.html"),
        new ("https://www.microsoft.com", "C:/YoureFile1.html"),
        new ("https://stackoverflow.com", "C:/YoureFile2.html")};
    var client = new HttpClient();
    ParallelOptions options = new() { MaxDegreeOfParallelism = 2 };
    await Parallel.ForEachAsync(urls, options, async (url, token) =>
    {
        await DownloadFileAsync(client, url.Item1, url.Item2);
    });
}
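To illustrate the producer/consumer idea mentioned above, here is a minimal sketch using System.Threading.Channels (the DiscoverUrls source, the consumer count of 8, and the file-naming scheme are all hypothetical placeholders):

```csharp
using System.Threading.Channels;

// Unbounded channel: one producer discovers URLs, N consumers download them.
var channel = Channel.CreateUnbounded<string>();

// Producer: push URLs as they become known, then signal completion.
var producer = Task.Run(async () =>
{
    foreach (var url in DiscoverUrls()) // hypothetical source of the 2M URLs
        await channel.Writer.WriteAsync(url);
    channel.Writer.Complete();
});

// Consumers: a fixed pool drains the channel concurrently.
var httpClient = new HttpClient();
var consumers = Enumerable.Range(0, 8).Select(_ => Task.Run(async () =>
{
    await foreach (var url in channel.Reader.ReadAllAsync())
    {
        var fileName = url.Substring(url.LastIndexOf('/') + 1);
        await DownloadFileAsync(httpClient, url, fileName);
    }
})).ToArray();

await Task.WhenAll(consumers.Append(producer));
```

The channel decouples discovery from download, so the consumers start working as soon as the first URL arrives instead of waiting for the full list.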
Also look into this NuGet package. The GitHub wiki gives examples of how to use it. For downloading 2 million files this is a good library, and it also has a retry function. To download a file you only have to create an instance of LoadRequest, and it downloads the file, under the file's own name, into the Downloads directory.
private static void StartDownload()
{
    string[] urls = new string[]{
        "https://dotnet.microsoft.com",
        "https://www.microsoft.com",
        "https://stackoverflow.com"};
    foreach (string url in urls)
        new LoadRequest(url).Start();
}
I hope this helps to improve the code.