Last batch never uploads to Solr when uploading batches of data from json file stream - c#

This might be a long shot but I might as well try here. There is a block of c# code that is rebuilding a solr core. The steps are as follows:
Delete all the existing documents
Get the core entities
Split the entities into batches of 1000
Spin off threads to perform the next set of processes:
Serialize each batch to JSON and write the JSON to a file on the server hosting the core
Send a command to the core to upload that file using System.Net.WebClient: solrurl/corename/update/json?stream.file=myfile.json&stream.contentType=application/json;charset=utf-8
Delete the file. I've also tried deleting the files after all the batches are done, as well as not deleting the files at all
After all batches are done it commits. I've also tried committing after each batch is done.
My problem is the last batch will not upload if it's much less than the batch size. It flows through like the command was called but nothing happens. It throws no exceptions and I see no errors in the solr logs. My questions are Why? and How can I ensure the last batch always gets uploaded? We think it's a timing issue, but we've added Thread.Sleep(30000) in many parts of the code to test that theory and it still happens.
The only time it doesn't happen is:
if the batch is full or almost full
we don't run multiple threads
we put a break point at the File.Delete line on the last batch and wait for 30 seconds or so, then continue
Here is the code for writing the file and calling the update command. This is called for each batch.
private const string
    FileUpdateCommand = "{1}/update/json?stream.file={0}&stream.contentType=application/json;charset=utf-8",
    SolrFilesDir = @"\\MYSERVER\SolrFiles",
    SolrFileNameFormat = SolrFilesDir + @"\{0}-{1}.json",
    _solrUrl = "http://MYSERVER:8983/solr/",
    CoreName = "MyCore";
public void UpdateCoreByFile(List<CoreModel> items)
{
    if (items.Count == 0)
        return;
    var settings = new JsonSerializerSettings { DateTimeZoneHandling = DateTimeZoneHandling.Utc };
    var dir = new DirectoryInfo(SolrFilesDir);
    if (!dir.Exists)
        dir.Create();
    var filename = string.Format(SolrFileNameFormat, Guid.NewGuid(), CoreName);
    using (var sw = new StreamWriter(filename))
    {
        sw.Write(JsonConvert.SerializeObject(items, settings));
    }
    var file = HttpUtility.UrlEncode(filename);
    var command = string.Format(FileUpdateCommand, file, CoreName);
    using (var client = _clientFactory.GetClient())//System.Net.WebClient
    {
        client.DownloadData(new Uri(_solrUrl + command));
    }
    //Thread.Sleep(30000);//doesn't work if I add this
    File.Delete(filename);//works here if add breakpoint and wait 30 sec or so
}
I'm just trying to figure out why this is happening and how to address it. I hope this makes sense, and I have provided enough information and code. Thanks for any help.

Since changing the size of the data set and adding a breakpoint "fixes" it, this is most certainly a race condition. Since you haven't added the code that actually indexes the content, it's impossible to say what the issue really is, but my guess is that the last commit happens before all the threads have finished, and only works when all threads are done (if you sleep the threads, the issue will still be there, since all threads sleep for the same time).
The easy fix: use commitWithin instead, and never issue explicit commits. The commitWithin parameter makes sure that the documents become available in the index within the given time frame (given in milliseconds). To make sure that the documents you submit become available within ten seconds, append &commitWithin=10000 to your URL.
If there are already documents pending a commit, the documents added will be committed before the ten seconds have elapsed, but even if there's just one last document being submitted as the last batch, it'll never be more than ten seconds before it becomes visible (and no documents will be left forever in a non-committed limbo).
That way you won't have to keep your threads synchronized or issue a final commit, as long as you wait until all threads have finished before exiting your application (if it's an application that actually terminates).
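As a rough sketch of the commitWithin suggestion, the update URL from the question could be built like this. The SolrUpdateUrl class and its Build method are hypothetical names introduced for illustration; the format string is the question's own with &commitWithin appended, and WebUtility.UrlEncode stands in for HttpUtility.UrlEncode only to avoid a System.Web reference.

```csharp
using System;
using System.Net;

public static class SolrUpdateUrl
{
    // The question's format string with a commitWithin placeholder appended.
    // {0} = file name, {1} = core name, {2} = commitWithin in milliseconds.
    private const string FileUpdateCommand =
        "{1}/update/json?stream.file={0}&stream.contentType=application/json;charset=utf-8&commitWithin={2}";

    // Hypothetical helper: builds the full update URI for one batch file.
    public static Uri Build(string solrUrl, string coreName, string fileName, int commitWithinMs)
    {
        var file = WebUtility.UrlEncode(fileName);
        var command = string.Format(FileUpdateCommand, file, coreName, commitWithinMs);
        return new Uri(solrUrl + command);
    }
}
```

The caller would pass the resulting Uri to WebClient.DownloadData exactly as before and drop the explicit commit at the end.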

Related

Can this async code complete out of order?

I have the following C# code in an AspNet WebApi controller:
private static async Task<string> SaveDocumentAsync(HttpContent content) {
    var path = "something";
    using (var file = File.OpenWrite(path)) {
        await content.CopyToAsync(file);
    }
    return path;
}
public async Task<IHttpActionResult> Put() {
    var path = await SaveDocumentAsync(Request.Content);
    await SaveDbRecordAsync(path); // writes something to the database using System.Data and awaiting Async methods
    return Ok();
}
I am sometimes seeing the database record visible before the document has finished being written. Is this a possible execution sequence? (It is also possible my file system isn't giving me the semantics I want).
To clarify how I'm observing this. It is an application that is reading the path out of the database and then trying to read the file and finding it isn't there. The file does appear shortly afterwards.
This doesn't happen every time, normally the file comes first. Maybe 1 in 1000 it happens the wrong way.
This turned out to be down to file system semantics. I thought I'd excluded my replicated file system, but I'd done it wrong. The code is behaving as expected.
Since you're awaiting the SaveDocumentAsync call before you call SaveDbRecordAsync, the database write only starts after SaveDocumentAsync has completed.
If you were to fire the tasks in parallel then await them:
var saveTask = SaveDocumentAsync(Request.Content);
var dbTask = SaveDbRecordAsync("a/path.ext");
await saveTask;
await dbTask;
then you wouldn't be able to guarantee the completion order.
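The difference between the two patterns can be seen in a small self-contained sketch. AwaitOrderDemo and StepAsync are hypothetical stand-ins (Task.Delay simulates the I/O); they are not the asker's SaveDocumentAsync or SaveDbRecordAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class AwaitOrderDemo
{
    // Hypothetical stand-in for an I/O operation: waits, then records its name.
    private static async Task StepAsync(string name, int delayMs, List<string> log)
    {
        await Task.Delay(delayMs);
        lock (log) { log.Add(name); }
    }

    // Each call is awaited before the next one starts, so "file" always
    // completes before "db" begins, regardless of how long each one takes.
    public static async Task<List<string>> SequentialAsync()
    {
        var log = new List<string>();
        await StepAsync("file", 50, log);
        await StepAsync("db", 1, log);
        return log;
    }

    // Both operations are started before either is awaited, so they run
    // concurrently and the completion order is not guaranteed.
    public static async Task<List<string>> ConcurrentAsync()
    {
        var log = new List<string>();
        var fileTask = StepAsync("file", 50, log);
        var dbTask = StepAsync("db", 1, log);
        await Task.WhenAll(fileTask, dbTask);
        return log;
    }
}
```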
@Neiston touches on a good point: it might be that the app you're using to view the results is updating with a delay, causing you to think the order is switched.
As you are writing to 2 different targets (one file, one database), the OS is perfectly within its remit to perform the writes in whatever order is 'best' for the storage medium.
In the old days of spinning storage, the 2 requests would be in the write queue, and if the r/w heads were currently nearer to the tracks for the database than to those for the file, then the OS (or maybe the HDD controller) would write the database data first, followed by the file data.
This assumes that both your file and your database server are running on the same physical machine. If you are writing to a shared folder, and/or the DB server is also on a different machine, then who knows what order they will finish in.

System.IO.Compression.ZipArchive keeps file locked after dispose?

I have a class that takes data from several sources and writes them to a ZIP file. I've benchmarked the class to check if using CompressionLevel.Optimal would be much slower than CompressionLevel.Fastest. But the benchmark throws an exception on different iterations and in different CompressionLevel values each time I run the benchmark.
I started removing the methods that add the file-content step by step until I ended up with the code below (inside the for-loop) which does basically nothing besides creating an empty zip-file and deleting it.
Simplified code:
var o = @"e:\test.zip";
var result = new FileInfo(o);
for (var i = 0; i < 1_000_000; i++)
{
    // Alternate approach
    // using(var archive = ZipFile.Open(o, ZipArchiveMode.Create))
    using (var archive = new ZipArchive(result.OpenWrite(), ZipArchiveMode.Create, false, Encoding.UTF8))
    {
    }
    result.Delete();
}
The loop runs about 100 to 15k iterations on my PC and then throws an IOException when trying to delete the file saying that the file (result) is locked.
So... did I miss something about how to use System.IO.Compression.ZipArchive? There is no close method for ZipArchive and using should dispose/close the archive... I've tried different .NET versions 4.6, 4.6.1, 4.7 and 4.7.2.
EDIT 1:
The result.Delete() is not part of the code that is benchmarked
EDIT 2:
Also tried playing around with Thread.Sleep(5/10/20) after the using block (i.e. before the result.Delete(), to check whether the lock persists), but even at 20 ms the file is still locked at some point. Didn't try values higher than 20 ms.
EDIT 3:
Can't reproduce the problem at home. Tried a dozen times at work and the loop never hit 20k iterations. Tried once here and it completed.
EDIT 4:
jdweng (see comments) was right. Thanks! It's somehow related to my "e:" partition on a local HDD. The same code runs fine on my "c:" partition on a local SSD and also on a network share.
In my experience, files may not be consistently unlocked when the dispose method for the stream returns. My best guess is that this is due to the file system doing some operation asynchronously. The best solution I have found is to retry the delete operation multiple times, i.e. something like this:
// Extension method: must be declared inside a static class.
public static void DeleteRetrying(this FileInfo self, int delayMs = 100, int numberOfAttempts = 3)
{
    for (int i = 0; i < numberOfAttempts - 1; i++)
    {
        try
        {
            self.Delete();
            return; // deleted successfully, no further attempts needed
        }
        catch (IOException)
        {
            // Consider making the method async and
            // replace this with Task.Delay
            Thread.Sleep(delayMs);
        }
    }
    // Final attempt, let the exception propagate
    self.Delete();
}
This is not an ideal solution, and I would love if someone could provide a better solution. But it might be good enough for testing where the impact of a non deleted file would be manageable.
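The async variant hinted at in the code comment could look roughly like the sketch below. FileInfoExtensions and DeleteRetryingAsync are names chosen here for illustration, and awaiting inside a catch block assumes C# 6 or later.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class FileInfoExtensions
{
    // Tries to delete the file, waiting delayMs between failed attempts
    // without blocking the calling thread.
    public static async Task DeleteRetryingAsync(this FileInfo self, int delayMs = 100, int numberOfAttempts = 3)
    {
        for (int i = 0; i < numberOfAttempts - 1; i++)
        {
            try
            {
                self.Delete();
                return; // deleted successfully
            }
            catch (IOException)
            {
                // The file is probably still locked; wait and try again.
                await Task.Delay(delayMs);
            }
        }
        // Final attempt, let the exception propagate to the caller.
        self.Delete();
    }
}
```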

C# Multithreading and pooling

Hello fellow developers,
I have a question about implementing multi-threading on my .NET (Framework 4.0) Windows Service.
Basically, what the service should be doing is the following:
Scans the filesystem (a specific directory) to see if there are files to process
If there are files that need to be processed, it should be using a thread pooling mechanism to issue threads up to a predetermined amount.
Each thread will perform an upload operation of a single file
As soon as one thread completes, the filesystem is scanned again to see if there are other files to process (I want to avoid having two threads perform the operation on the same file)
I am struggling to find a way that will allow me to do just that last step.
Right now, I have a function, run in the main thread, that retrieves the maximum number of concurrent threads:
int maximumNumberOfConcurrentThreads = getMaxThreads(databaseConnection);
Then, still in the main thread, I have a function that scans the directory and returns a list with the files to process
List<FileToUploadInfo> filesToUpload = getFilesToUploadFromFS(directory);
After this, I call the following function:
generateThreads(maximumNumberOfConcurrentThreads, filesToUpload);
Each thread should be calling the below function (returns void):
uploadFile(fileToUpload, databaseConnection, currentThread);
Right now, the way the program is structured, if maximum number of threads is set, say, to 5, I am grabbing 5 elements from the list and uploading them.
As soon as all 5 are done, I grab 5 more and do the same until I don't have any left, as per code below.
for (int index = 0; index < filesToUpload.Count; index = index + maximumNumberOfConcurrentThreads) {
    try {
        Parallel.For(0, maximumNumberOfConcurrentThreads, iteration => {
            if (index + iteration < filesToUpload.Count) {
                uploadFile(filesToUpload[index + iteration], databaseConnection, iteration);
            }
        });
    }
    catch (System.ArgumentOutOfRangeException outOfRange) {
        debug("Exception in Parallel.For [" + outOfRange.Message + "]");
    }
}
However, if 4 files are small and the upload of each one takes 5 seconds, while the remaining one is big and takes 30 minutes, I will have, after the 4 files have been uploaded, only one file uploading, and I need to wait for it to finish before starting to upload others in the list.
After finishing uploading all the files in the list, my service goes to sleep, and then, when it wakes up again, it scans the file system again.
What is the strategy that best fits my needs? Is it advisable to go this route or will it create concurrency nightmares? I need to avoid uploading any file twice.
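One possible direction, offered only as a sketch and not as a definitive answer, is to put the scanned files into a shared queue and let a fixed number of workers pull from it. A worker that finishes a small file immediately takes the next one, and ConcurrentQueue<T>.TryDequeue hands each item to exactly one worker, so no file is uploaded twice. UploadPool and Run are hypothetical names; the upload delegate stands in for the question's uploadFile. Everything used here is available in .NET Framework 4.0.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class UploadPool
{
    // Runs `upload` once per item using at most `maxWorkers` concurrent workers.
    // Blocks until every queued item has been processed.
    public static void Run<T>(IEnumerable<T> items, int maxWorkers, Action<T, int> upload)
    {
        var queue = new ConcurrentQueue<T>(items);
        var workers = new Task[maxWorkers];
        for (int w = 0; w < maxWorkers; w++)
        {
            int workerId = w; // copy so each closure captures its own value
            workers[w] = Task.Factory.StartNew(() =>
            {
                T item;
                // Each successful TryDequeue removes the item from the queue,
                // so two workers can never receive the same item.
                while (queue.TryDequeue(out item))
                {
                    upload(item, workerId);
                }
            }, TaskCreationOptions.LongRunning);
        }
        Task.WaitAll(workers);
    }
}
```

With the question's names, a call might look like UploadPool.Run(filesToUpload, maximumNumberOfConcurrentThreads, (file, id) => uploadFile(file, databaseConnection, id)), assuming databaseConnection is safe to share between threads.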

Check If File Is In Use By Other Instances of Executable Run

Before I go into too much detail, my program is written in Visual Studio 2010 using C# .NET 4.0.
I wrote a program that generates a separate log file for each run. The log file is named after the time, accurate to the millisecond (for example, 20130726103042375.log). The program also generates a master log file for the day if one does not already exist (for example, 20130726_Master.log).
At the end of each run, I want to append the log file to the master log file. Is there a way to check whether I can append successfully, and retry after sleeping for a second or so?
Basically, I have 1 executable, and multiple users (let's say there are 5 users).
All 5 users will access and run this executable at the same time. Since it's nearly impossible for all users to start at the exact same time (down to the millisecond), there will be no problem generating individual log files.
However, the issue comes in when I attempt to merge those log files into the master log file. Though it is unlikely, I think the program will crash if multiple users append to the same master log file at the same time.
The method I use is
File.AppendAllText(masterLogFile, File.ReadAllText(individualLogFile));
I have looked into the lock statement, but I don't think it works in my case, as there are multiple instances running rather than multiple threads in one instance.
Another approach I looked into is try/catch, something like this
try
{
stream = file.Open(FileMode.Open, FileAccess.ReadWrite, FileShare.None);
}
catch {}
But I don't think this solves the problem, because the status of the masterLogFile can change in that brief millisecond.
So my overall question is: Is there a way to append to masterLogFile if it's not in use, and retry after a short timeout if it is? Or is there an alternative way to create the masterLogFile?
Thank you in advance, and sorry for the long message. I want to make sure I get my message across and explain what I've tried or looked into so we are not wasting anyone's time.
Please let me know if there's any more information I can provide to help you help me.
Your try/catch is the way to do things. If the call to File.Open succeeds, then you can write to the file. The idea is to keep the file open. I would suggest something like:
bool openSuccessful = false;
while (!openSuccessful)
{
    try
    {
        using (var writer = new StreamWriter(masterLogFile, true)) // append
        {
            // successfully opened file
            openSuccessful = true;
            try
            {
                foreach (var line in File.ReadLines(individualLogFile))
                {
                    writer.WriteLine(line);
                }
            }
            catch (IOException)
            {
                // something unexpected happened while reading or writing.
                // handle the error and exit the loop.
                break;
            }
        }
    }
    catch (IOException)
    {
        // couldn't open the file, most likely because
        // it's opened in another process: delay and retry.
        // Other exception types propagate and exit the loop.
        Thread.Sleep(1000);
    }
}
if (!openSuccessful)
{
    // notify of error
}
So if you fail to open the file, you sleep and try again.
See my blog post, File.Exists is only a snapshot, for a little more detail.
I would do something along the lines of this, as I think it incurs the least overhead. Try/catch is going to generate a stack trace (which could take a whole second) if an exception is thrown. There still has to be a better way to do this atomically. If I find one I'll post it.

ASP.NET: Firing batch jobs

My application could receive up to roughly 100 requests for a batch job within a few milliseconds, but in actuality these job requests are being masked as one job request.
Fixing this so that only one job request is sent is just not feasible at the moment.
A workaround I have thought of is to program my application to fulfill only 1 batch job every x milliseconds (I was thinking of 200 milliseconds) and to ignore any other batch job requests that come in within those 200 milliseconds or before the batch job has completed. After those 200 milliseconds are up or when the batch job is completed, my application will wait and accept 1 job request from that time on, and it will not process any requests that were ignored before. Once my application accepts another job request, it will repeat the cycle above.
What's the best way of doing this using .NET 4.0? Is there any boilerplate code that I can simply follow as a guide?
Update
Sorry for being unclear. I have added more details about my scenario. Also I just realized that my proposed workaround above will not work. Sorry guys, lol. Here's some background information.
I have an application that builds an index using files in a specified directory. When a file is added, deleted or modified in this directory, my application listens for these events using a FileSystemWatcher and re-indexes these files. The problem is that around 100 files can be added, deleted or modified by an external process, and the changes occur very quickly, i.e. within a few milliseconds. My end goal is to re-index these files after the last file change made by the external process has occurred. The best solution is to modify the external process to signal my application when it has finished modifying the files I'm listening to, but that's not feasible at the moment. Thus, I have to create a workaround.
A workaround that may solve my problem is to wait for the first file change. When the first file change has occurred, wait 200 milliseconds for any subsequent file changes. Why 200 milliseconds? Because I'm fairly confident that the external process can perform its file changes within 200 milliseconds. Once my application has waited for 200 milliseconds, I would like it to start a task that will re-index the files and then go through another cycle of listening for a file change.
What's the best way of doing this?
Again, sorry for the confusion.
This question is a bit too high level to guess at.
My guess is your application is run as a service, and your requests come into your application and arrive in a queue to be processed. Every 200 ms, you wake the queue and pop an item off for processing.
I'm confused about the "masked as one job request". Since you mentioned you will "ignore any other batch job", my guess is you haven't arranged your code to accept the incoming requests in a queue.
Regardless, you will generally always have one application process running (your service) and if you choose you could spawn a new thread for each item you process in the queue. You can monitor how much cpu/memory utilization this required and adjust the firing time (200ms) accordingly.
I may not be accurately understanding the problem, but my recommendation is to use the singleton pattern to work around this issue.
With the singleton approach, you can implement a lock on an object (the access method could potentially be something along the lines of BatchProcessor::GetBatchResults) that would then lock all requests to the batch job results object. Once the batch has finished, the lock will be released, and the underlying object, will have the results of the batch job available.
Please keep in mind that this is a "work around". There may be a better solution that involves looking into and changing the underlying business logic that causes multiple requests to come in for a job that's processing on demand.
Update:
Here is a link for information regarding Singleton (includes code examples): http://msdn.microsoft.com/en-us/library/ff650316.aspx
It is my understanding that the poster has some sort of an application that sits and waits for incoming requests to perform a batch job. The problem is that he is receiving multiple requests within a short period of time that should actually have come in as just a single request. And, unfortunately, he is not able to solve this problem at the source.
So, his solution is to assume that all requests received within a 200 ms timespan are the same, and to only process these once. My concern with this would be whether this assumption is correct or not? This entirely depends on the sending systems and the environment in which this is being used. The general idea to be able to do this would be to update a lastReceived date/time when a request is processed. Then when a new request comes in, compare the current date/time to the lastReceived date/time and only process it if the difference is greater than 200 ms.
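A minimal sketch of that lastReceived comparison is shown below. RequestThrottle and ShouldProcess are hypothetical names; the current time is passed in as a parameter only to keep the logic easy to test, and the lock is an added assumption that requests may arrive on several threads.

```csharp
using System;

public sealed class RequestThrottle
{
    private readonly TimeSpan _window;
    private readonly object _gate = new object();
    private DateTime _lastReceived = DateTime.MinValue;

    public RequestThrottle(TimeSpan window)
    {
        _window = window;
    }

    // Returns true if the request should be processed, i.e. if more than
    // the configured window has passed since the last processed request.
    public bool ShouldProcess(DateTime now)
    {
        lock (_gate)
        {
            if (now - _lastReceived <= _window)
            {
                return false; // too soon after the last processed request
            }
            _lastReceived = now; // only updated when a request is processed
            return true;
        }
    }
}
```

A caller would create one instance with TimeSpan.FromMilliseconds(200) and call ShouldProcess(DateTime.UtcNow) for every incoming request.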
Other possible solutions:
You said you could not modify the sending application so only one job request was sent, but could you add additional information to it, for instance a unique identifier?
Could you store the parameters from the last job request and compare it with the next job request and only process them if they are different?
Based on your Update
Here is an example how you could wait 200ms using a Timer:
static Timer timer; // System.Timers.Timer
static int waitTime = 200; //in ms
static void Main(string[] args)
{
    FileSystemWatcher fsw = new FileSystemWatcher();
    fsw.Path = @"C:\temp\";
    fsw.Created += new FileSystemEventHandler(fsw_Created);
    fsw.EnableRaisingEvents = true;
    Console.ReadLine();
}
static void fsw_Created(object sender, FileSystemEventArgs e)
{
    DateTime currTime = DateTime.Now;
    if (timer == null)
    {
        Console.WriteLine("Started @ " + currTime);
        timer = new Timer();
        timer.Interval = waitTime;
        timer.AutoReset = false; // fire once per cycle instead of repeating
        timer.Elapsed += new ElapsedEventHandler(timer_Elapsed);
        timer.Start();
    }
    else
    {
        Console.WriteLine("Ignored @ " + currTime);
    }
}
static void timer_Elapsed(object sender, ElapsedEventArgs e)
{
    //Start task here
    Console.WriteLine("Elapsed @ " + DateTime.Now);
    timer = null;
}
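Since the stated end goal is to re-index after the last file change, a variation worth considering is to restart the wait on every event, so the action runs only once the directory has been quiet for 200 ms. The sketch below is an assumption-laden illustration, not part of the answer above: Debouncer and Signal are hypothetical names, and it uses System.Threading.Timer rather than System.Timers.Timer.

```csharp
using System;
using System.Threading;

public sealed class Debouncer : IDisposable
{
    private readonly Timer _timer;
    private readonly int _quietMs;

    // `action` runs once after `quietMs` milliseconds without a Signal() call.
    public Debouncer(int quietMs, Action action)
    {
        _quietMs = quietMs;
        // Created in a stopped state; Signal() arms it.
        _timer = new Timer(_ => action(), null, Timeout.Infinite, Timeout.Infinite);
    }

    // Call this from each FileSystemWatcher event handler.
    // Every call restarts the countdown, so a burst of events
    // results in a single callback after the burst ends.
    public void Signal()
    {
        _timer.Change(_quietMs, Timeout.Infinite);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

In the example above, fsw_Created would simply call Signal() on a shared Debouncer instance whose action starts the re-indexing task.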
