Can this async code complete out of order? - C#

I have the following C# code in an AspNet WebApi controller:
private static async Task<string> SaveDocumentAsync(HttpContent content)
{
    var path = "something";
    using (var file = File.OpenWrite(path))
    {
        await content.CopyToAsync(file);
    }
    return path;
}

public async Task<IHttpActionResult> Put()
{
    var path = await SaveDocumentAsync(Request.Content);
    await SaveDbRecordAsync(path); // writes something to the database using System.Data and awaited async methods
    return Ok();
}
I am sometimes seeing the database record visible before the document has finished being written. Is this a possible execution sequence? (It is also possible my file system isn't giving me the semantics I want).
To clarify how I'm observing this: another application is reading the path out of the database, then trying to read the file and finding it isn't there. The file does appear shortly afterwards.
This doesn't happen every time; normally the file comes first. Maybe 1 in 1,000 times it happens the wrong way.
This turned out to be down to file system semantics. I thought I'd excluded my replicated file system, but I'd done it wrong. The code is behaving as expected.

Since you await SaveDocumentAsync before calling SaveDbRecordAsync, SaveDbRecordAsync only executes after SaveDocumentAsync has completed.
If you were to fire the tasks in parallel then await them:
var saveTask = SaveDocumentAsync(Request.Content);
var dbTask = SaveDbRecordAsync("a/path.ext");
await saveTask;
await dbTask;
then you wouldn't be able to guarantee the completion order.
@Neiston touches on a good point: it might be that the app you're using to view the results updates with a delay, making you think the order is switched.

As you are writing to two different destinations (one a file, one a database), the OS is perfectly within its remit to perform the writes in whatever order is 'best' for the storage medium.
In the old days of spinning storage, the two requests would sit in the write queue, and if the r/w heads were currently nearer to the tracks for the database than to those for the file, then the OS (or maybe the HDD controller) would write the database data first, followed by the file data.
This assumes that both your file and your database server live on the same physical machine. If you are writing to a shared folder, and/or the DB server is on a different machine, then who knows what order they will finish in.
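If you need the bytes to be on disk before the database row becomes visible to readers, one option is to flush the stream through to the device before writing the DB record. A minimal sketch, reusing the SaveDocumentAsync shape from the question (FileStream.Flush(true) asks the OS to write through to the storage device, at some cost in throughput):

private static async Task<string> SaveDocumentAsync(HttpContent content)
{
    var path = "something";
    using (var file = File.OpenWrite(path))
    {
        await content.CopyToAsync(file);
        // push buffered bytes down and ask the OS to write through to the device
        file.Flush(flushToDisk: true);
    }
    return path; // only now write the DB record pointing at the file
}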

Related

Last batch never uploads to Solr when uploading batches of data from json file stream

This might be a long shot but I might as well try here. There is a block of c# code that is rebuilding a solr core. The steps are as follows:
Delete all the existing documents
Get the core entities
Split the entities into batches of 1000
Spin off threads to perform the next set of processes:
Serialize each batch to json and write the json to a file on the server hosting the core
Send a command to the core to upload that file using System.Net.WebClient: solrurl/corename/update/json?stream.file=myfile.json&stream.contentType=application/json;charset=utf-8
Delete the file. I've also tried deleting the files after all the batches are done, as well as not deleting the files at all
After all batches are done it commits. I've also tried committing
after each batch is done.
My problem is the last batch will not upload if it's much less than the batch size. It flows through like the command was called but nothing happens. It throws no exceptions and I see no errors in the solr logs. My questions are Why? and How can I ensure the last batch always gets uploaded? We think it's a timing issue, but we've added Thread.Sleep(30000) in many parts of the code to test that theory and it still happens.
The only time it doesn't happen is:
if the batch is full or almost full
we don't run it in multiple threads
we put a break point at the File.Delete line on the last batch and wait for 30 seconds or so, then continue
Here is the code for writing the file and calling the update command. This is called for each batch.
private const string
    FileUpdateCommand = "{1}/update/json?stream.file={0}&stream.contentType=application/json;charset=utf-8",
    SolrFilesDir = @"\\MYSERVER\SolrFiles",
    SolrFileNameFormat = SolrFilesDir + @"\{0}-{1}.json",
    _solrUrl = "http://MYSERVER:8983/solr/",
    CoreName = "MyCore";
public void UpdateCoreByFile(List<CoreModel> items)
{
    if (items.Count == 0)
        return;
    var settings = new JsonSerializerSettings { DateTimeZoneHandling = DateTimeZoneHandling.Utc };
    var dir = new DirectoryInfo(SolrFilesDir);
    if (!dir.Exists)
        dir.Create();
    var filename = string.Format(SolrFileNameFormat, Guid.NewGuid(), CoreName);
    using (var sw = new StreamWriter(filename))
    {
        sw.Write(JsonConvert.SerializeObject(items, settings));
    }
    var file = HttpUtility.UrlEncode(filename);
    var command = string.Format(FileUpdateCommand, file, CoreName);
    using (var client = _clientFactory.GetClient()) // System.Net.WebClient
    {
        client.DownloadData(new Uri(_solrUrl + command));
    }
    //Thread.Sleep(30000); // doesn't work if I add this
    File.Delete(filename); // works here if I add a breakpoint and wait 30 sec or so
}
I'm just trying to figure out why this is happening and how to address it. I hope this makes sense, and I have provided enough information and code. Thanks for any help.
Since changing the size of the data set and adding a breakpoint "fixes" it, this is most certainly a race condition. Since you haven't added the code that actually indexes the content, it's impossible to say what the issue really is, but my guess is that the last commit happens before all the threads have finished, and only works when all threads are done (if you sleep the threads, the issue will still be there, since all threads sleep for the same time).
The easy fix: use commitWithin instead, and never issue explicit commits. The commitWithin parameter makes sure that the documents become available in the index within the given time frame (given in milliseconds). To make sure that the documents you submit become available within ten seconds, append &commitWithin=10000 to your URL.
If there are already documents pending a commit, the newly added documents will be committed before the ten seconds have elapsed; but even if the last batch consists of just one document, it will never take more than ten seconds before it becomes visible (.. and no documents will be left forever in a non-committed limbo).
That way you won't have to keep your threads synchronized or issue a final commit, as long as you wait until all threads have finished before exiting your application (if it's an application that actually terminates).
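For example, a minimal tweak to the FileUpdateCommand constant from the question; the only change is the extra commitWithin parameter (ten seconds here is an arbitrary choice):

private const string FileUpdateCommand =
    "{1}/update/json?stream.file={0}"
    + "&stream.contentType=application/json;charset=utf-8"
    + "&commitWithin=10000"; // let Solr commit within 10 s; no explicit commit call needed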

Bulk upload via REST api

I have the goal of uploading a Products CSV of ~3000 records to my e-commerce site. I want to utilise the REST API that my e-comm platform provides so I have something I can re-use and build upon for future sites that I may create.
My main issue that I am having trouble working through is:
- System.Threading.ThreadAbortException
Which I can only attribute to how long it takes to process through all 3K records via a POST request. My code:
public ActionResult WriteProductsFromFile()
{
    string fileNameIN = "19107.txt";
    string fileNameOUT = "19107_output.txt";
    string jsonUrl = $"/api/products";
    List<string> ls = new List<string>();
    var engine = new FileHelperAsyncEngine<Prod1>();
    using (engine.BeginReadFile(fileNameIN))
    {
        foreach (Prod1 prod in engine)
        {
            outputProduct output = new outputProduct();
            if (!string.IsNullOrEmpty(prod.name))
            {
                output.product.name = prod.name;
                string productJson = JsonConvert.SerializeObject(output);
                ls.Add(productJson);
            }
        }
    }
    foreach (String s in ls)
        nopApiClient.Post(jsonUrl, s);
    return RedirectToAction("GetProducts");
}
Since I'm new to web-coding, am I going about this the wrong way? Is there a preferred way to bulk-upload that I haven't come across?
I've attempted to use the TaskCreationOptions.LongRunning flag, which helps the cause slightly but doesn't get me anywhere near my goal.
Web and API controller actions are not meant to do long-running tasks - besides tying up the request thread, you will be introducing a series of opportunities for failure that you will have little recourse in recovering from.
But it's not all bad: you have a lot of options here, and there is a lot of literature on async/cloud architecture that explains how to deal with files and these sorts of scenarios.
What you want to do is disconnect the processing of your file from the API request (in your application, not the 3rd party).
It will take a little more work but will ultimately create a more reliable application.
Step 1:
Drop the file immediately to disk - I see you already have the file on disk; I'm not sure how it gets there, but either way it will work out the same.
Step 2:
Use a process running as
- a console app (easiest)
- a service (requires some sort of install/uninstall of the service)
- or even a thread in your web app (but you will struggle to know when it fails)
Whichever way you choose, the process will watch a directory for file changes; when there is a change, it will kick off your method to process the file as you like.
Check out FileSystemWatcher; here is a basic example: https://www.dotnetperls.com/filesystemwatcher
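A minimal console-app sketch of that idea; the drop-folder path, filter, and ProcessFile are hypothetical placeholders for your own import logic:

using System;
using System.IO;

class DropFolderWatcher
{
    static void Main()
    {
        // watch a drop folder for newly created files (path and filter are hypothetical)
        using (var watcher = new FileSystemWatcher(@"C:\drop", "*.txt"))
        {
            watcher.Created += (sender, e) =>
            {
                Console.WriteLine("Picked up " + e.FullPath);
                // ProcessFile(e.FullPath); // the processing that used to run in the controller
            };
            watcher.EnableRaisingEvents = true;
            Console.WriteLine("Watching... press Enter to exit.");
            Console.ReadLine();
        }
    }
}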
Additionally:
If you are interested in running a thread in your Api/Web app, take a look at https://www.hanselman.com/blog/HowToRunBackgroundTasksInASPNET.aspx for some options.
You don't have to use a FileSystemWatcher, of course; you could trigger via a flag in a DB that is checked periodically, or via a system event.

Using Entity framework in conjunction with Task Parallel Library

I have an application that we are developing using .NET 4.0 and EF 6.0. The premise of the program is quite simple: watch a particular folder on the file system. As a new file gets dropped into this folder, look up information about this file in the SQL Server database (using EF), and then, based on what is found, move the file to another folder on the file system. Once the file move is complete, go back to the DB and update the information about this file (register the file move).
These are large media files so it might take a while for each of them to move to the target location. Also, we might start this service with hundreds of these media files sitting in the source folder already that will need to be dispatched to the target location(s).
So to speed things up, I started out with using Task parallel library (async/await not available as this is .NET 4.0). For each file in the source folder, I look up info about it in the DB, determine which target folder it needs to move to, and then start a new task that begins to move the file…
void LookupFileinfoinDB(string filename)
{
    // use the EF DB context to look up the file in the DB
}

// start a new task to begin the file move
var moveFileTask = Task<bool>.Factory.StartNew(
    () =>
    {
        var success = false;
        try
        {
            // the code that actually moves the file goes here…
            // .......
        }
        catch
        {
            // handle/log the failure
        }
        return success;
    });
Now, once this task completes, I have to go back to the DB and update the info about the file. And that is where I am running into problems. (Keep in mind that I might have several of these 'move file tasks' running in parallel, and they will finish at different times.) Currently, I am using task continuations to register the file move in the DB:
filemoveTask.ContinueWith(
    t =>
    {
        if (t.IsCompleted && t.Result)
        {
            RegisterFileMoveinDB();
        }
    });
The problem is that I am using the same DB context for looking up the file info in the main task as well as inside the RegisterFileMoveinDB() method, which executes later on the nested task. I was getting all kinds of weird exceptions thrown at me (mostly about the SQL Server data reader etc.) when moving several files together. An online search for the answer revealed that sharing a DB context among several tasks like I am doing here is a big no-no, as EF is not thread safe.
I would rather not create a new DB context for each file move as there could be dozens or even hundreds of them going at the same time. What would be a good alternative approach? Is there a way to 'signal' the main task when a nested task completes and finish the File move registration in the main task? Or am I approaching this problem in a wrong way all together and there is a better way to go about this?
Your best bet is to scope your DbContext to each thread. Parallel.ForEach has overloads that are useful for this (the overloads with a Func<TLocal> localInit parameter):
Parallel.ForEach(
    fileNames,                 // the IEnumerable<string> of file names to be processed
    () => new YourDbContext(), // Func<TLocal> localInit
    (fileName, parallelLoopState, dbContext) => // body
    {
        // your logic goes here
        // LookUpFileInfoInDB( dbContext, fileName )
        // MoveFile( ... )
        // RegisterFileMoveInDB( dbContext, ... )
        // pass dbContext along to the next iteration
        return dbContext;
    },
    (dbContext) => // Action<TLocal> localFinally
    {
        dbContext.SaveChanges(); // single SaveChanges call for each thread
        dbContext.Dispose();
    });
You can call SaveChanges() within the body expression/RegisterFileMoveInDB if you prefer to have the DB updated ASAP. I would suggest tying the file system operations in with the DB transaction so that if the DB update fails, the file system operations are rolled back.
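A rough sketch of that idea, assuming EF6's Database.BeginTransaction; RegisterFileMoveInDB, sourcePath, and targetPath are hypothetical names standing in for your own code:

using (var tx = dbContext.Database.BeginTransaction())
{
    try
    {
        File.Move(sourcePath, targetPath);         // the file system operation
        RegisterFileMoveInDB(dbContext, fileName); // stage the DB update
        dbContext.SaveChanges();
        tx.Commit();
    }
    catch
    {
        tx.Rollback();
        // undo the file system change so the file and the DB stay consistent
        if (File.Exists(targetPath))
            File.Move(targetPath, sourcePath);
        throw;
    }
}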
You could also pass the ExclusiveScheduler of a ConcurrentExclusiveSchedulerPair instance as a parameter of ContinueWith. This way the continuations will run sequentially instead of concurrently with respect to each other.
TaskScheduler exclusiveScheduler
    = new ConcurrentExclusiveSchedulerPair().ExclusiveScheduler;
//...
filemoveTask.ContinueWith(t =>
{
    if (t.Result)
    {
        RegisterFileMoveinDB();
    }
}, exclusiveScheduler);
In response to @Moho's comment:
The threads used by the built-in async I/O operations are taken from the .NET CLR thread pool, so it is a very efficient mechanism. If you create threads yourself, you are doing it the old, inefficient way, especially for I/O operations.
When you call an async method you don't have to await it immediately. Postpone awaiting until it's necessary.
Best Regards.

Task.Run occasionally fails silently when launched from MVC Controller

I am attempting to generate PDF copies of specific forms within my MVC application. As this is time consuming, and the client does not need to wait for this generation to happen, I'm trying to trigger this as a series of Fire and Forget Tasks.
One hang-up of note is that I need to have the HttpContext established, or some underlying pieces of the code that I can't alter won't work. I believe I have dealt with this problem, but I wanted to call it out in case it matters.
Here is the function I am calling...
private void AsyncPDFFormGeneration(string htmlOutput, string serverRelativePath, string serverURL, string signature, ScannedDocument document, HttpContext httpContext)
{
    try
    {
        System.Web.HttpContext.Current = httpContext;
        using (StreamWriter stw = new StreamWriter(Server.MapPath(serverRelativePath), false, System.Text.Encoding.Default))
        {
            stw.Write(htmlOutput);
        }
        Doc ABCDoc = new Doc();
        ABCDoc.HtmlOptions.Engine = EngineType.Gecko;
        int DocID = 0;
        DocID = ABCDoc.AddImageUrl(serverURL + serverRelativePath + "?dumb=" + DateTime.Now.Hour.ToString() + DateTime.Now.Minute.ToString() + DateTime.Now.Second + DateTime.Now.Millisecond);
        while (true)
        {
            ABCDoc.FrameRect();
            if (!ABCDoc.Chainable(DocID))
                break;
            ABCDoc.TextStyle.LeftMargin = 100;
            ABCDoc.Page = ABCDoc.AddPage();
            DocID = ABCDoc.AddImageToChain(DocID);
        } // end while (true...
        for (int i = 1; i <= ABCDoc.PageCount; i++)
        {
            ABCDoc.PageNumber = i;
            ABCDoc.Flatten();
        }
        ScannedDocuments.AddScannedDocument(document, ABCDoc.GetData());
        System.IO.File.Delete(Server.MapPath(serverRelativePath));
    }
    catch (Exception e)
    {
        // Exception is logged to the database, and if that fails, to the Event Log
    }
}
Within, I am writing the String output of the HTML contents of the MVC Form in question to an html file, handing the path to that file to the PDF writer, generating the PDF, and then deleting the html file.
I'm calling it inside of a Controller POST method, like so:
Task.Run(() => AsyncPDFFormGeneration(htmlOutput, serverRelativePath,
serverURL, signature, document, HttpContext.ApplicationInstance.Context));
This command is called as part of a foreach loop that constructs the forms, loads them into string format, and then passes them into a task. I've also tried this with
Task.Factory.StartNew
just in case something weird was going on with Task.Run, but that didn't produce a different result.
The problem I am having is that not all of the Tasks execute every time. If I run in Visual Studio and step my way through debugging, it works properly every time. However, when attempting to generate 11 forms sequentially, sometimes it generates all of them, sometimes it generates 3 or 4, sometimes it generates all but 1.
I have error logging set up to be as extensive as possible, but no exceptions are being thrown that I can find, and no generated html files are left lying around in my file structure on account of a thread aborted part-way.
There seems to be a slight correlation between how quickly the page comes back from the post, and how many of the forms are generated. A longer load time generally correlates to more of the forms being generated...but I was under the impression that shouldn't matter. I'm spinning these off to separate threads with their own copy of the HttpContext to take with them and carry around. Once launched, I did not think that the original thread should impact them.
Any ideas on why I'm only getting 3 successful Tasks on some attempts, all 11 on another attempt, and no exceptions?
Task.Run(() => AsyncPDFFormGeneration(htmlOutput, serverRelativePath,
serverURL, signature, document, HttpContext.ApplicationInstance.Context));
You have a subtle race condition on this line. The problem is with the HttpContext.ApplicationInstance.Context property. It will be evaluated when the task starts. If it happens before the end of the request, this is fine. But if for some reason the task takes a bit of time to start, then the request will complete first, and the HttpContext will be null. Therefore, you will have a null-reference exception, giving you the impression that the task didn't start (when, in fact, it did but crashed immediately outside of your try/catch).
To avoid that, just store the context in a local variable, and use it for Task.Run:
var context = HttpContext; // Or HttpContext.ApplicationInstance.Context, but I don't really see the point
Task.Run(() => AsyncPDFFormGeneration(htmlOutput, serverRelativePath, serverURL, signature, document, context));
That said, I don't know what API you are using that requires System.Web.HttpContext.Current to be set, but it seems a very bad choice for a fire-and-forget task. Even if you locally save the HttpContext, it'll still have been cleaned up, so I'm not sure it'll behave as expected.
Also, as was mentioned in the comments, launching fire-and-forget tasks on ASP.NET is dangerous. You should use HostingEnvironment.QueueBackgroundWorkItem instead.
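For example, a sketch using QueueBackgroundWorkItem (available from ASP.NET 4.5.2 onward); unlike Task.Run, the runtime delays app-domain shutdown while registered work is still running:

var context = HttpContext.ApplicationInstance.Context; // capture before the request ends
System.Web.Hosting.HostingEnvironment.QueueBackgroundWorkItem(cancellationToken =>
    AsyncPDFFormGeneration(htmlOutput, serverRelativePath,
        serverURL, signature, document, context));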
I would try using await Task.WhenAll(task1, task2, task3, etc) as your application may be closing before all tasks have completed.

That async-ing feeling - httpclient and mvc thread blocking

Dilemma, dilemma...
I've been working up a solution to a problem that uses async calls to the HttpClient library (GetAsync => ConfigureAwait(false), etc). In a console app, my dll is very responsive, and the mixture of async/await calls and Parallel.ForEach(=>) really makes me glow.
Now for the issue. After moving from this test harness to the target app, things have become problematic. I'm using asp.net mvc 4 and have hit a few issues. The main issue really is that calling my process on a controller action actually blocks the main thread until the async actions are complete. I've tried using an async controller pattern, I've tried using Task.Factory, I've tried using new Threads. You name it, I've tried all the flavours - and then some!
Now, I appreciate that the nature of http is not designed to facilitate long processes like this, and there are a number of articles here on SO that say don't do it. However, there are mitigating reasons why I NEED to use this approach. The main reason that I need to run this in mvc is that I actually update the live data cache (on the mvc app) in realtime via raising an event in my dll's code. This means that fragments of the 50-60 data feeds can be pushed out live before the entire async action is complete. Therefore, client apps can receive partial updates within seconds of the async action being instigated. If I were to delegate the process out to a console app that ran the entire process in the background, I'd no longer be able to harness those partial fragment updates, and this is the raison d'etre behind the entire choice of this architecture.
Can anyone shed light on a solution that would allow me to mitigate the blocking of the thread, whilst at the same time allowing each async fragment to be consumed by my object model and fed out to the client apps (I'm using signalr to make these client updates)? A kind of nirvana would be a scenario where an out-of-process cache object could be shared between numerous processes - the cache update could then be triggered and consumed by my mvc process (aka - http://devproconnections.com/aspnet-mvc/out-process-caching-aspnet). And so back to reality...
I have also considered using a secondary webservice to achieve this, but would welcome other options before once again over engineering my solution (there are already many moving parts and a multitude of async Actions going on).
Sorry not to have added any code; I'm hoping for practical philosophy/insights rather than code help on this, though I would of course welcome coded examples that illustrate a solution to my problem.
I'll update the question as we move in time, as my thinking process is still maturing on this.
[edit] - for the sake of clarity, the snippet below is my Brothers Grimm code collision (extracted from a larger body of work):
Parallel.ForEach(scrapeDataBases, new ParallelOptions()
{
    MaxDegreeOfParallelism = Environment.ProcessorCount * 15
},
async dataBase =>
{
    await dataBase.ScrapeUrlAsync().ConfigureAwait(false);
    await UpdateData(dataType, (DataCheckerScrape)dataBase);
});
async and Parallel.ForEach do not mix naturally, so I'm not sure what your console solution looks like. Furthermore, Parallel should almost never be used on ASP.NET at all.
It sounds like what you would want is to just use Task.WhenAll.
On a side note, I think your reasoning around background processing on ASP.NET is incorrect. It is perfectly possible to have a separate process that updates the clients via SignalR.
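A sketch of what that could look like, using the names from the snippet in the question; Task.WhenAll replaces the Parallel.ForEach entirely:

// start all scrapes concurrently, then await them together
var scrapeTasks = scrapeDataBases.Select(async dataBase =>
{
    await dataBase.ScrapeUrlAsync().ConfigureAwait(false);
    await UpdateData(dataType, (DataCheckerScrape)dataBase);
});
await Task.WhenAll(scrapeTasks);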
Since your question is pretty high level without a lot of code, you could try Reactive Extensions.
Something like
private IEnumerable<Task<Scraper>> ScrappedUrls()
{
    // Return the 50 to 60 tasks, one for each website, here.
    // I assume they all return the same type.
    // return .ScrapeUrlAsync().ConfigureAwait(false);
    throw new NotImplementedException();
}

public async Task<IEnumerable<ScrapeOdds>> GetOdds()
{
    var results = new Collection<ScrapeOdds>();
    var urlRequest = ScrappedUrls();
    var observerableUrls = urlRequest.Select(u => u.ToObservable()).Merge();
    var publisher = observerableUrls.Publish();
    var hubContext = GlobalHost.ConnectionManager.GetHubContext<OddsHub>();
    publisher.Subscribe(scraper =>
    {
        // Whatever you do to convert to the result set
        var scrapedOdds = scraper.GetOdds();
        results.Add(scrapedOdds);
        // update anything else you want when it arrives.
        // Update SignalR here
        hubContext.Clients.All.UpdatedOdds(scrapedOdds);
    });
    // Will fire off subscriptions and not continue until they are done.
    await publisher;
    return results;
}
The Merge operator will process the results as they come in. You can then update the SignalR hubs, plus whatever else you need, as each result arrives. The controller action will have to wait for them all to come in; that's why there is an await on the publisher.
I don't really know if HttpClient is going to like having 50-60 web calls in flight all at once. If it doesn't, you can turn the IEnumerable into an array and break it down into smaller chunks. There should also be some error checking in there. With Rx you can also tell it to SubscribeOn and ObserveOn different threads, but since everything is pretty much async already, that shouldn't be necessary.
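If throttling does turn out to be needed, one alternative to chunking the array is a SemaphoreSlim gate. This is a sketch only; the limit of 10 concurrent calls and the urls/httpClient names are assumptions:

var throttle = new SemaphoreSlim(10); // at most 10 requests in flight at once
var tasks = urls.Select(async url =>
{
    await throttle.WaitAsync();
    try
    {
        return await httpClient.GetAsync(url).ConfigureAwait(false);
    }
    finally
    {
        throttle.Release();
    }
});
var responses = await Task.WhenAll(tasks);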
