C# parallel writes to Azure Data Lake File

In our Azure Data Lake, we have daily files recording events and coordinates for those events. We need to take these coordinates and look up which State, County, Township, and Section they fall into. I've attempted several versions of this code.
I attempted to do this in U-SQL. I even uploaded a custom assembly that implemented Microsoft.SqlServer.Types.SqlGeography methods, only to find ADLA isn't set up to perform row-by-row operations like geocoding.
I pulled all the rows into SQL Server, converted the coordinates into a SqlGeography and built T-SQL code that would perform the State, County, etc. lookups. After much optimization, I got this process down to ~700ms / row (with 133M rows in the backlog and ~16k rows added daily, we're looking at nearly 3 years to catch up). So I parallelized the T-SQL; things got better, but not enough.
I took the T-SQL code and built the process as a console application, since the SqlGeography library is actually a .NET library, not a native SQL Server product. I was able to get single-threaded processing down to ~500ms. Adding in .NET's parallelism (Parallel.ForEach) and throwing 10 of the 20 cores of my machine at it does a lot, but still isn't enough.
I attempted to rewrite this code as an Azure Function, processing files in the data lake file-by-file. Most of the files timed out, since they took longer than 10 minutes to process. So I've updated the code to read in the files and shred the rows into Azure Queue storage. Then I have a second Azure Function that fires for each row in the queue. The idea is that Azure Functions can scale out far more than any single machine can.
And this is where I'm stuck. I can't reliably write rows to files in ADLS. Here is the code as I have it now.
public static void WriteGeocodedOutput(string Contents, String outputFileName, ILogger log) {
    AdlsClient client = AdlsClient.CreateClient(ADlSAccountName, adlCreds);
    //if the file doesn't exist write the header first
    try {
        if (!client.CheckExists(outputFileName)) {
            using (var stream = client.CreateFile(outputFileName, IfExists.Fail)) {
                byte[] headerByteArray = Encoding.UTF8.GetBytes("EventDate, Longitude, Latitude, RadarSiteID, CellID, RangeNauticalMiles, Azimuth, SevereProbability, Probability, MaxSizeinInchesInUS, StateCode, CountyCode, TownshipCode, RangeCode\r\n");
                //stream.Write(headerByteArray, 0, headerByteArray.Length);
                client.ConcurrentAppend(outputFileName, true, headerByteArray, 0, headerByteArray.Length);
            }
        }
    } catch (Exception e) {
        log.LogInformation("multiple attempts to create the file. Ignoring this error, since the file was created.");
    }
    //then write the data
    byte[] textByteArray = Encoding.UTF8.GetBytes(Contents);
    for (int attempt = 0; attempt < 5; attempt++) {
        try {
            log.LogInformation("prior to write, the outputfile size is: " + client.GetDirectoryEntry(outputFileName).Length);
            var offset = client.GetDirectoryEntry(outputFileName).Length;
            client.ConcurrentAppend(outputFileName, false, textByteArray, 0, textByteArray.Length);
            log.LogInformation("AFTER write, the outputfile size is: " + client.GetDirectoryEntry(outputFileName).Length);
            //if successful, stop trying to write this row
            attempt = 6;
        }
        catch (Exception e){
            log.LogInformation($"exception on adls write: {e}");
            Random rnd = new Random();
            Thread.Sleep(rnd.Next(attempt * 60));
        }
    }
}
The file will be created when it needs to be, but I do get several messages in my log that several threads tried to create it. I'm not always getting the header row written.
I also no longer get any data rows written, only this error:
"BadRequest ( IllegalArgumentException concurrentappend failed with error 0xffffffff83090a6f
(Bad request. The target file does not support this particular type of append operation.
If the concurrent append operation has been used with this file in the past, you need to append to this file using the concurrent append operation.
If the append operation with offset has been used in the past, you need to append to this file using the append operation with offset.
On the same file, it is not possible to use both of these operations.). []
I feel like I'm missing some fundamental design idea here. The code should try to write a row into a file. If the file doesn't yet exist, create it and put the header row in. Then, put in the row.
What's the best-practice way to accomplish this kind of write scenario?
Any other suggestions of how to handle this kind of parallel-write workload in ADLS?

I am a bit late to this but I guess one of the problems could be due to the use of "Create" and "ConcurrentAppend" on the same file stream?
ADLS documentation mentions that they can't be used on the same file. Maybe, try changing the "Create" command to "ConcurrentAppend" as the latter can be used to create a file if it doesn't exist.
Also, if you found a better way to do it, please do post your solution here.


Acquiring waveform of LeCroy oscilloscope from C#/.NET

I am trying to load a waveform from a Teledyne Lecroy Wavesurfer 3054 scope using NI-VISA / IVI library. I can connect to the scope and read and set control variables but I can't figure out how to get the trace data back from the scope into my code. I am using USBTMC and can run the sample code in the Lecroy Automation manual but it does not give an example for getting the waveform array data, just control variables. They do not have a driver for IVI.NET. Here is a distilled version of the code:
// Open session to scope
var session = (IMessageBasedSession)GlobalResourceManager.Open
session.TimeoutMilliseconds = 5000;
// Don't return command header with query result
session.FormattedIO.WriteLine("COMM_HEADER OFF");
// { other setup stuff that works OK }
// ...
// ...
// Attempt to query the Channel 1 waveform data
session.FormattedIO.WriteLine("vbs? 'return = app.Acquisition.C1.Out.Result.DataArray'");
So the last line above (which seems to be what the manual suggests) causes a beep and there is no data that can be read. I've tried all the read functions and they all time out with no data returned. If I query the number of data points I get 100002 which seems correct and I know the data must be there. Is there a better VBS query to use? Is there a read function that I can use to read the data into a byte array that I have overlooked? Do I need to read the data in blocks due to a buffer size limitation, etc.? I am hoping that someone has solved this problem before. Thanks so much!
Here is the first effort I got at making it work:
var session = (IMessageBasedSession)GlobalResourceManager.Open("USB0::0x05FF::0x1023::LCRY3702N14729::INSTR");
session.TimeoutMilliseconds = 5000;
// Don't return command header with query result
session.FormattedIO.WriteLine("COMM_HEADER OFF");
// .. a bunch of setup code...
session.FormattedIO.WriteLine("C1:WF?"); // Query waveform data for Channel 1
buff = session.RawIO.Read(MAX_BUFF_SIZE); // buff has .TRC-like contents of waveform data
The buff[] byte buffer will end up with the same file formatted data as the .TRC files that the scope saves to disk, so it has to be parsed. But at least the waveform data is there! If there is a better way, I may find it and post, or someone else feel free to post it.
The way I achieved this is by saving the screenshot to a local drive. Map the local drive to the current system & simply use File.Copy() to copy image file from the mapped drive to the local computer. It saves time to parse data & re-plot it if one uses TRC-like contents.

C# I/O async (copyAsync): how to avoid file fragmentation?

Within a tool copying big files between disks, I replaced the System.IO.FileInfo.CopyTo method with System.IO.Stream.CopyToAsync.
This allows a faster copy and better control during the copy, e.g. I can stop the copy.
But this creates even more fragmentation of the copied files. It is especially annoying when I copy files of many hundreds of megabytes.
How can I avoid disk fragmentation during copy?
With the xcopy command, the /j switch copies files without buffering. It is recommended for very large files on TechNet.
It does indeed seem to avoid file fragmentation (while a simple file copy within Windows 10 Explorer DOES fragment my file!)
A copy without buffering seems to be the opposite of this async copy. Or is there any way to do an async copy without buffering?
Here is my current code for the async copy. I kept the default buffer size of 81920 bytes, i.e. 10*1024*sizeof(Int64).
I am working with NTFS file systems, thus 4096-byte clusters.
EDIT: I updated the code with SetLength as suggested, added FileOptions.Asynchronous when creating the destinationStream, and fixed the order so the attributes are set AFTER the times (otherwise an exception is thrown for ReadOnly files).
int bufferSize = 81920;
try
{
    using (FileStream sourceStream = source.OpenRead())
    {
        // Remove existing file first
        if (File.Exists(destinationFullPath)) File.Delete(destinationFullPath);
        using (FileStream destinationStream = File.Create(destinationFullPath, bufferSize, FileOptions.Asynchronous))
        {
            try
            {
                destinationStream.SetLength(sourceStream.Length); // avoid file fragmentation!
                await sourceStream.CopyToAsync(destinationStream, bufferSize, cancellationToken);
            }
            catch (OperationCanceledException)
            {
                operationCanceled = true;
            }
        } // properly disposed after the catch
    }
}
catch (IOException e)
{
    actionOnException(e, "error copying " + source.FullName);
}
if (operationCanceled)
{
    // Remove the partially written file
    if (File.Exists(destinationFullPath)) File.Delete(destinationFullPath);
}
else
{
    // Copy meta data (attributes and time) from source once the copy is finished
    File.SetCreationTimeUtc(destinationFullPath, source.CreationTimeUtc);
    File.SetLastWriteTimeUtc(destinationFullPath, source.LastWriteTimeUtc);
    File.SetAttributes(destinationFullPath, source.Attributes); // after set time if ReadOnly!
}
I also fear that the File.SetAttributes and time calls at the end of my code could increase file fragmentation.
Is there a proper way to create a 1:1 asynchronous file copy without any file fragmentation, i.e. asking the HDD to give the file stream only contiguous sectors?
Other topics regarding file fragmentation like How can I limit file fragmentation while working with .NET suggests incrementing the file size in larger chunks, but it does not seem to be a direct answer to my question.
but the SetLength method does the job
It does not do the job. It only updates the file size in the directory entry, it does not allocate any clusters. The easiest way to see this for yourself is by doing this on a very large file, say 100 gigabytes. Note how the call completes instantly. Only way it can be instant is when the file system does not also do the job of allocating and writing the clusters. Reading from the file is actually possible, even though the file contains no actual data, the file system simply returns binary zeros.
This will also mislead any utility that reports fragmentation. Since the file has no clusters, there can be no fragmentation. So it only looks like you solved your problem.
The only thing you can do to force the clusters to be allocated is to actually write to the file. It is in fact possible to allocate 100 gigabytes worth of clusters with a single write. You must use Seek() to position to Length-1, then write a single byte with Write(). This will take a while on a very large file, it is in effect no longer async.
The odds that it will reduce fragmentation are not great. You merely reduced the risk somewhat that the writes will be interleaved by writes from other processes. Somewhat, actual writing is done lazily by the file system cache. Core issue is that the volume was fragmented before you began writing, it will never be less fragmented after you're done.
Best thing to do is to just not fret about it. Defragging is automatic on Windows these days, has been since Vista. Maybe you want to play with the scheduling, maybe you want to ask more about it at superuser.com
I think, FileStream.SetLength is what you need.
Considering Hans Passant's answer, in my code above, an alternative to
destinationStream.SetLength(sourceStream.Length);
would be, if I understood it properly:
byte[] writeOneZero = {0};
destinationStream.Seek(sourceStream.Length - 1, SeekOrigin.Begin);
destinationStream.Write(writeOneZero, 0, 1);
destinationStream.Seek(0, SeekOrigin.Begin);
It seems indeed to consolidate the copy.
But a look at the source code of FileStream.SetLengthCore seems it does almost the same, seeking at the end but without writing one byte:
private void SetLengthCore(long value)
{
    Contract.Assert(value >= 0, "value >= 0");
    long origPos = _pos;
    if (_exposedHandle)
        VerifyOSHandlePosition();
    if (_pos != value)
        SeekCore(value, SeekOrigin.Begin);
    if (!Win32Native.SetEndOfFile(_handle)) {
        int hr = Marshal.GetLastWin32Error();
        if (hr == __Error.ERROR_INVALID_PARAMETER)
            throw new ArgumentOutOfRangeException("value", Environment.GetResourceString("ArgumentOutOfRange_FileLengthTooBig"));
        __Error.WinIOError(hr, String.Empty);
    }
    // Return file pointer to where it was before setting length
    if (origPos != value) {
        if (origPos < value)
            SeekCore(origPos, SeekOrigin.Begin);
        else
            SeekCore(0, SeekOrigin.End);
    }
}
Anyway, I'm not sure that these methods guarantee no fragmentation, but they at least avoid it in most cases. Thus the automatic defragmentation tool will finish the job at a low performance cost.
My initial code without these Seek calls created hundreds of thousands of fragments for a 1 GB file, slowing down my machine when the defragmentation tool became active.

Writing to file, memory used steadily increasing

I have an application where I need to write binary to a file constantly. The bits of data are small, about 1K each. The computers this is running on aren't great and are running XP. I've run into the problem that when I turn on the logging the computers just get totally hosed and I watch the Task Manager and just see the memory usage going up and up until it crashes.
A coworker suggested that I just keep the packets in memory until a certain amount of time has passed and then write it all at once instead of writing each one separately - tried that, same issue.
This is the code (loggingBuffer is the List<byte[]> I'm storing the packets in while the interval passes):
if ((DateTime.Now - lastStoreTime).TotalSeconds > 10)
{
    string fileName = @"C:\Storage\file";
    FileMode fm = File.Exists(fileName) ? FileMode.Append : FileMode.Create;
    using (BinaryWriter w = new BinaryWriter(File.Open(fileName, fm), Encoding.ASCII))
    {
        foreach (byte[] packetData in loggingBuffer)
            w.Write(packetData);
    }
    lastStoreTime = DateTime.Now;
}
Is there anything different I should be doing to accomplish this?
Seems to me that, since you're only writing every 10 seconds, you could close the file in between and clean up everything related to file writing. Perhaps that would solve your problem.
Secondly, I'd suggest creating the BinaryWriter outside the function where you actually write the data. It'll keep things clearer. In your current code you're checking each time whether to append data or to create a new file and then write to it. If you do this outside the function and call it just once, perhaps this will save memory too. All untested by me, that is :)
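A minimal sketch of that suggestion, under stated assumptions: the class name PacketLog and its members are hypothetical, and it assumes the memory growth comes from buffered packets never being released (the question does not show whether loggingBuffer is ever cleared). The file is opened once, and the pending list is emptied after every flush.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not from the question): opens the log file once,
// buffers packets in memory and writes them out in batches.
public sealed class PacketLog : IDisposable
{
    private readonly FileStream _stream;
    private readonly List<byte[]> _pending = new List<byte[]>();

    public PacketLog(string fileName)
    {
        // FileMode.Append creates the file if it does not exist yet,
        // so no File.Exists check is needed.
        _stream = new FileStream(fileName, FileMode.Append, FileAccess.Write, FileShare.Read);
    }

    public int PendingCount { get { return _pending.Count; } }

    public void Add(byte[] packet)
    {
        _pending.Add(packet);
    }

    public void Flush()
    {
        foreach (byte[] packet in _pending)
            _stream.Write(packet, 0, packet.Length);
        _stream.Flush();
        // Release the packets; without this the list keeps growing
        // and every flush rewrites all earlier packets again.
        _pending.Clear();
    }

    public void Dispose()
    {
        Flush();
        _stream.Dispose();
    }
}
```

Call Add for every packet, call Flush from the existing 10-second check, and dispose the object when logging is turned off.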

Add Files Into Existing Zip - performance issue

I have a WCF webservice that saves files to a folder(about 200,000 small files).
After that, I need to move them to another server.
The solution I've found was to zip them then move them.
When I adopted this solution, I've made the test with (20,000 files), zipping 20,000 files took only about 2 minutes and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder :
using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes) {
    using (var zip = ZipFile.Read(nameOfExistingZip)) {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
}
The first file to add to the zip takes only 5 ms, but the 10,000 th file to add takes 800 ms.
Is there a way to optimize this ? Or if you have other suggestions ?
The example shown above is only for test, in the WCF webservice, i'll have different request sending files that I need to Add to the Zip file.
As WCF is statless, I will have a new instance of my class with each call, so how can I keep the Zip file open to add more files ?
I've looked at your code and immediately spotted problems. The problem with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work; therefore I would suggest you first read up on how they work and attempt to break down what happens under the hood.
Now that we're all on the same page on how they work, let's start the reasoning by breaking down how this works using your source code; we'll continue from there on forward:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));
foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);
        // (3)
        zip.Save();
    }
    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
}
(1) opens a ZIP file. You're doing this for every file you attempt to add
(2) Adds a single file to the ZIP file
(3) Saves the complete ZIP file
On my computer this takes about an hour.
Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.
Skimming over the file format specification, you'll notice that compression is based on Deflate which doesn't require information on the other files that are compressed. Moving on, we'll notice how the 'file table' is stored in the ZIP file:
You'll notice here that there's a 'central directory' which stores the files in the ZIP file. It's basically stored as a 'list'. So, using this information we can reason on what the trivial way is to update that when implementing steps (1-3) in this order:
Open the zip file, read the central directory
Append data for the (new) compressed file, store the pointer along with the filename in the new central directory.
Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items). And so on. This basically means that your performance will go down the drain as you add more files. You've already observed this, now you know why.
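To put a number on that reasoning: if the central directory is rewritten after every added file, the total number of directory entries written is 1 + 2 + ... + n = n(n+1)/2. The helper below is only an illustration of that arithmetic; the class and method names are made up for this sketch.

```csharp
using System;

// Illustrative only: counts central directory entries written when the
// directory is rewritten in full after every single added file.
public static class CentralDirectoryCost
{
    // 1 + 2 + ... + fileCount
    public static long EntriesWritten(long fileCount)
    {
        return fileCount * (fileCount + 1) / 2;
    }
}
```

EntriesWritten(20000) is 200,010,000 and EntriesWritten(200000) is 20,000,100,000, so ten times as many files cost about a hundred times as many directory writes.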
A possible solution
In the previous solution I have added all files at once. That might not work in your use case. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all files available when you start the compression process.
Basically the algorithm then becomes:
Add a few (say, 16, files). You can toy with this number. Store this in -say- 'file16.zip'.
Add more files. When you hit 16 files, you have to merge the two files of 16 items into a single file of 32 items.
Merge files until you cannot merge anymore. Basically every time you have two files of N items, you create a new file of 2*N items.
Goto (2).
Again, we can reason about it. The first 16 files aren't a problem, we've already established that.
We can also reason what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that the merged files hold 32 entries, then 64, 128, 256... hey, wait, we know this sequence, it's 2^N. Again, reasoning about it, we'll find that going from 16 to 200,000 entries takes about log2(200,000 / 16) ≈ 14 doublings, so each entry is rewritten roughly 14 times -- which is much better than the up to 200,000 rewrites per entry that we started with.
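The merge scheme can be simulated without touching any ZIP file, which makes it easy to check the estimate for a given batch size. This is a sketch under assumptions: it only tracks entry counts, it merges the two newest archives whenever they hold the same number of entries, and the names MergeCostSimulation and Simulate are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative simulation: tracks only how many entries each archive holds.
public static class MergeCostSimulation
{
    // Collects batchSize files into a small archive; whenever the two newest
    // archives hold the same number of entries, they are merged into one.
    // Reports the number of merges and the total entries copied by merges.
    // Archives of different sizes left over at the end are not merged.
    public static void Simulate(int totalFiles, int batchSize, out long merges, out long entriesCopied)
    {
        merges = 0;
        entriesCopied = 0;
        var archives = new Stack<long>(); // entry counts, newest on top

        for (int added = 0; added + batchSize <= totalFiles; added += batchSize)
        {
            archives.Push(batchSize);
            while (archives.Count >= 2)
            {
                long newest = archives.Pop();
                if (archives.Peek() != newest)
                {
                    archives.Push(newest); // sizes differ: nothing to merge yet
                    break;
                }
                long merged = newest + archives.Pop();
                archives.Push(merged);
                merges++;
                entriesCopied += merged; // a merge rewrites every entry of both inputs
            }
        }
    }
}
```

Dividing entriesCopied by totalFiles gives the average number of times an entry is rewritten, which is the figure to compare against the one-file-at-a-time approach.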
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries to add. However, this probably requires you to hack into the ZIP code and create your own ZIP file writer. The idea is that you basically overallocate the central directory to a 200K entries before you get started, so that you can simply append in-place.
Again, we can reason about it: adding file now means: adding a file and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it'll probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
What we haven't discussed so far is the easiest possible solution: one approach that comes to mind is to simply add all files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use ionic) as-is:
static void Main()
{
    try { File.Delete(@"c:\tmp\test.zip"); }
    catch { }
    var sw = Stopwatch.StartNew();
    using (var zip = new ZipFile(@"c:\tmp\test.zip"))
    {
        zip.UseZip64WhenSaving = Zip64Option.Always;
        for (int i = 0; i < 200000; ++i)
        {
            string filename = "foo" + i.ToString() + ".txt";
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            zip.AddEntry(filename, contents);
        }
        zip.Save(); // write all entries and the central directory once
    }
    Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
}
Whop; that finishes in 4,5 seconds. Much better.
I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with other file formats than Zip, as it may not be the fastest. Tar (tape archive) comes to mind (with natural speed advantages due to its simplicity), it even has an append mode which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).
If you have control over your remote server, you could implement your own file format, the simplest I can think of would be to zip each new file separately (to store the file metadata such as name, date, etc. in the file content itself), and then to append each such zipped file to a single raw bytes file. You would just need to store the byte offsets (separated by columns in another txt file) to allow the remote server to split the huge file into the 200,000 zipped files, and then unzip each of them to get the meta data. I guess this is also roughly what tar does behind the scene :).
Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
You are opening the file repeatedly. Why not loop through the files, add them all to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));
using (var zip = ZipFile.Read(nameOfExistingZip))
{
    foreach (var additionFile in listAes)
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
    }
    zip.Save(); // save once, after all files have been added
}
If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files, but you only have received 10 so far, don't open the zip, add one, then close it. Wait for a few more to come in and add them in batches.
If you are OK with the performance of 10 * 20,000 files, can't you simply partition your large ZIP into 10 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a time-stamp in the name.
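A rough sketch of the time-stamped partitioning idea. It swaps in the framework's System.IO.Compression (.NET 4.5 or later) instead of the Ionic library used elsewhere in this thread; the class name, method name and "batch-" file naming are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper: writes one batch of files into a new, time-stamped ZIP.
public static class TimestampedZipBatcher
{
    public static string WriteBatch(string outputDirectory, IEnumerable<string> files, DateTime utcNow)
    {
        string zipPath = Path.Combine(
            outputDirectory, "batch-" + utcNow.ToString("yyyyMMdd-HHmmss") + ".zip");

        // CreateNew fails if a batch with the same time-stamp already exists,
        // so an earlier batch is never silently overwritten.
        using (var stream = new FileStream(zipPath, FileMode.CreateNew))
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Create))
        {
            foreach (string file in files)
            {
                // Stored without compression, like CompressionLevel.None above.
                // Entry names use only the file name, so files with the same
                // name from different folders would need distinct entry names.
                archive.CreateEntryFromFile(file, Path.GetFileName(file), CompressionLevel.NoCompression);
            }
        }
        return zipPath;
    }
}
```

Each archive is written once and never reopened, so the cost of a batch depends only on the files in that batch.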
You can zip all the files using .Net TPL (Task Parallel Library) like this:
while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal)); // Line 1
    eventSignal.WaitOne(-1);           // Line 2
    taskCounter++;                     // Line 3
    bufferRead = new byte[sliceBytes]; // Line 4
}
Task.WaitAll(tasks);                   // Line 6
There is a compiled library and source code here:

Prune simple text log file using C# .NET 4.0

An external Windows service I work with maintains a single text-based log file that it continuously appends to. This log file grows unbounded over time. I'd like to prune this log file periodically to keep, say, the most recent 5 MB of log entries. How can I efficiently implement the file I/O code in C# .NET 4.0 to prune the file to, say, 5 MB?
The way service dependencies are set up, my service always starts before the external service. This means I get exclusive access to the log file to truncate it, if required. Once the external service starts up, I will not access the log file. I can gain exclusive access to the file on desktop startup. The problem is that the log file may be a few gigabytes in size and I'm looking for an efficient way to truncate it.
It's going to take the amount of memory that you want to store to process the "new" log file but if you only want 5Mb then it should be fine. If you are talking about Gb+ then you probably have other problems; however, it could still be accomplished using a temp file and some locking.
As noted before, you may experience a race condition but that's not the case if this is the only thread writing to this file. This would replace your current writing to the file.
const int MAX_FILE_SIZE_IN_BYTES = 5 * 1024 * 1024; // 5 MB
const string LOG_FILE_PATH = @"ThisFolder\log.txt";
string newLogMessage = "Hey this happened";
#region Use one or the other, I mean you could use both below if you really want to.
//Use this one to save an extra character
if (!newLogMessage.StartsWith(Environment.NewLine))
    newLogMessage = Environment.NewLine + newLogMessage;
//Use this one to imitate a write line
if (!newLogMessage.EndsWith(Environment.NewLine))
    newLogMessage = newLogMessage + Environment.NewLine;
#endregion
// Assumes the log is UTF-8 text; use whatever encoding the log file really has
byte[] newMessageBytes = Encoding.UTF8.GetBytes(newLogMessage);
using (FileStream logFile = File.Open(LOG_FILE_PATH, FileMode.Open, FileAccess.ReadWrite))
{
    int sizeOfRetainedLog = (int)Math.Min(MAX_FILE_SIZE_IN_BYTES - newMessageBytes.Length, logFile.Length);
    byte[] logMessage = new byte[sizeOfRetainedLog + newMessageBytes.Length];
    //Set start position/offset of the file
    logFile.Position = logFile.Length - sizeOfRetainedLog;
    //Read remaining portion of file to beginning of buffer
    int read = 0;
    while (read < sizeOfRetainedLog)
    {
        int n = logFile.Read(logMessage, read, sizeOfRetainedLog - read);
        if (n == 0) break;
        read += n;
    }
    //Append new log to end of "file"
    System.Buffer.BlockCopy(newMessageBytes, 0, logMessage, read, newMessageBytes.Length);
    //Clear the file
    logFile.SetLength(0);
    //Write the file
    logFile.Write(logMessage, 0, read + newMessageBytes.Length);
}
I wrote this really quick, I apologize if I'm off by 1 somewhere.
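For the pure pruning case from the question (exclusive access at startup, nothing to append), the temp-file variant mentioned above could look roughly like the sketch below. It only uses APIs available in .NET 4.0. The names LogPruner and KeepTail are hypothetical, and it assumes lines end in '\n' and that the encoding is ASCII-compatible (e.g. ASCII or UTF-8), so a partial first line can be skipped.

```csharp
using System;
using System.IO;

// Hypothetical helper: keeps roughly the last bytesToKeep bytes of a log file.
public static class LogPruner
{
    public static void KeepTail(string path, long bytesToKeep)
    {
        string tempPath = path + ".tmp";
        using (var source = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
        using (var target = new FileStream(tempPath, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            long start = Math.Max(0, source.Length - bytesToKeep);
            if (start > 0)
            {
                // Step back one byte and skip forward to just past the next
                // newline, so the kept part starts at the beginning of a line.
                source.Position = start - 1;
                int b;
                do { b = source.ReadByte(); } while (b != -1 && b != '\n');
            }
            // Streams the tail in small chunks; the file is never loaded whole.
            source.CopyTo(target);
        }
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Only the retained tail is read and written, so the cost depends on the amount kept rather than on the size of the original file.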
Depending on how often it is written to, I'd say you might be facing a race condition when trying to modify the file without damaging the log. You could always try writing a service to monitor the file size and, once it reaches a certain point, lock the file, copy and clear the whole thing, and close it. Then store the data in another file whose size the service can easily control. Alternatively, you could see if the external service has an option for logging to a database, which would make it pretty simple to roll out the oldest data.
You could use a file observer to monitor the file:
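FileSystemWatcher logWatcher = new FileSystemWatcher();
logWatcher.Path = @"c:\";           // Path must be a directory
logWatcher.Filter = "example.log";  // the file to watch inside that directory
logWatcher.Changed += logWatcher_Changed;
logWatcher.EnableRaisingEvents = true;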
Then when the event is raised you can use a StreamReader to read the file
private void logWatcher_Changed(object sender, FileSystemEventArgs e)
{
    using (StreamReader readFile = new StreamReader(e.FullPath))
    {
        string line;
        string[] row;
        while ((line = readFile.ReadLine()) != null)
        {
            // Here you delete the lines you want or move them to another file, so that your log stays small. Then save the file.
        }
    }
}
It's an option.
