I have a class that needs to keep an instance of a BinaryWriter open over several write function calls (data is packet based). It also has to create a new file once it has written a certain amount of data/packets.
Normally I would just close the BinaryWriter and reinstantiate it with a new file path, but the overhead associated with that operation is too great for my application. I tried closing the writer in a separate thread, but that interferes with the new instance I create later.
My last-ditch attempt was not to close the writer (and stream) at all, and simply create a new instance of it every time I'd written the required packets. This seems to work, and doesn't cause any memory leaks, but I'd really like to know what goes on if you do this.
Here is my (simplified) code to illustrate:
class Writer
{
    BinaryWriter binWriter;
    readonly string filepath;
    long bytesWritten;
    int filesWritten;
    const long maxFileSize = 10000000000; // 10E9 bytes

    public Writer(string filepath)
    {
        this.filepath = filepath;
        binWriter = new BinaryWriter(File.Open(filepath, FileMode.Create));
        bytesWritten = 0;
        filesWritten = 0;
    }

    public void WritePacket(byte[] packet)
    {
        if (bytesWritten < maxFileSize)
        {
            binWriter.Write(packet);
            bytesWritten += packet.Length;
        }
        else
        {
            // this is where I'd normally call Dispose(), but the overhead
            // is too high, and disposing the stream in a separate thread
            // interferes with the new one
            // what actually happens here? it's the only thing I've found to
            // work...
            filesWritten++;
            binWriter = new BinaryWriter(File.Open(filepath + filesWritten, FileMode.Create));
            // start the new file with the current packet
            binWriter.Write(packet);
            bytesWritten = packet.Length;
        }
    }
}
It feels bad, but this is the only solution that works so far. Any insight would be great!
I am working with two C# stream APIs, one of which is a data source and the other of which is a data sink.
Neither API actually exposes a stream object; both expect you to pass a stream into them and they handle writing/reading from the stream.
Is there a way to link these APIs together such that the output of the source is streamed into the sink without having to buffer the entire source in a MemoryStream? This is a very RAM-sensitive application.
Here's an example that uses the MemoryStream approach that I'm trying to avoid, since it buffers the entire stream in RAM before writing it out to S3:
using (var buffer = new MemoryStream())
using (var transferUtil = new TransferUtility(s3client))
{
// This destructor finishes the file and transferUtil closes
// the stream, so we need this weird using nesting to keep everyone happy.
using (var parquetWriter = new ParquetWriter(schema, buffer))
using (var rowGroupWriter = parquetWriter.CreateRowGroup())
{
rowGroupWriter.WriteColumn(...);
...
}
transferUtil.Upload(buffer, _bucketName, _key.Replace(".gz", "") + ".parquet");
}
You are looking for a stream that can be passed to both the data source and sink and that can 'transfer' the data between the two asynchronously. There are a number of possible solutions and I might have considered a producer-consumer pattern around a BlockingCollection.
More recently, the addition of System.IO.Pipelines and the Span and Memory types has put a real focus on high-performance IO, and I think it would be a good fit here. The Pipe class, with its associated Reader and Writer, can automatically handle the flow control, back pressure and IO between the two ends whilst utilising all the new Span and Memory related types.
I have uploaded a Gist at PipeStream that will give you a custom stream with an internal Pipe implementation that you can pass to both your API classes. Whatever is written to the WriteAsync (or Write) method will be made available to the ReadAsync (or Read) method without requiring any further byte[] or MemoryStream allocations.
In your case you would simply substitute this new class for the MemoryStream and it should work out of the box. I haven't got a full S3 test working, but reading directly from the Parquet stream and dumping it to the console window shows that it works asynchronously.
// Create some very badly 'mocked' data
var idColumn = new DataColumn(
new DataField<int>("id"),
Enumerable.Range(0, 10000).Select(i => i).ToArray());
var cityColumn = new DataColumn(
new DataField<string>("city"),
Enumerable.Range(0, 10000).Select(i => i % 2 == 0 ? "London" : "Grimsby").ToArray());
var schema = new Schema(idColumn.Field, cityColumn.Field);
using (var pipeStream = new PipeStream())
{
var buffer = new byte[4096];
int read = 0;
var readTask = Task.Run(async () =>
{
//transferUtil.Upload(readStream, "bucketName", "key"); // Execute this in a Task / Thread
while ((read = await pipeStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
var incoming = Encoding.ASCII.GetString(buffer, 0, read);
Console.WriteLine(incoming);
// await Task.Delay(5000); uncomment this to simulate very slow consumer
}
});
using (var parquetWriter = new ParquetWriter(schema, pipeStream)) // This destructor finishes the file and transferUtil closes the stream, so we need this weird using nesting to keep everyone happy.
using (var rowGroupWriter = parquetWriter.CreateRowGroup())
{
rowGroupWriter.WriteColumn(idColumn); // Step through both these statements to see data read before the parquetWriter completes
rowGroupWriter.WriteColumn(cityColumn);
}
}
The implementation is not completely finished but I think it shows a nice approach. In the console 'readTask' you can un-comment the Task.Delay to simulate a slow read (transferUtil) and you should see the pipe automatically throttles the write task.
You need to be using C# 7.2 or later (VS 2017 -> Project Properties -> Build -> Advanced -> Language Version) for one of the Span extension methods but it should be compatible with any .Net Framework. You may need the Nuget Package
The stream is readable and writable (obviously!) but not seekable which should work for you in this scenario but wouldn't work reading from the Parquet SDK which requires seekable streams.
Hope it helps
Using System.IO.Pipelines it would look something like this:
var pipe = new System.IO.Pipelines.Pipe();
using (var buffer = pipe.Writer.AsStream())
using (var transferUtil = new TransferUtility(s3client))
{
// we can start the consumer first because it will just block
// on the stream until data is available
Task consumer = transferUtil.UploadAsync(pipe.Reader.AsStream(), _bucketName, _key.Replace(".gz", "") + ".parquet");
// start a task to produce data
Task producer = WriteParquetAsync(buffer, ..);
// start pumping data; we can wait here because the producer will
// necessarily finish before the consumer does
await producer;
// this is key; disposing of the buffer early here causes the consumer stream
// to terminate, else it will just hang waiting on the stream to finish.
// see the documentation for Writer.AsStream(bool leaveOpen = false)
buffer.Dispose();
// wait for the upload to finish
await consumer;
}
I am serializing a class with a BinaryFormatter and compressing the data with a DeflateStream. The save function is as follows and is called from a BackgroundWorker:
public static void save(System system, String filePath)
{
//Make filestream
FileStream fs = new FileStream(filePath, FileMode.Create);
try
{
//Serialize offerte
BinaryFormatter bf = new BinaryFormatter();
DeflateStream cs = new DeflateStream(fs, CompressionMode.Compress);
bf.Serialize(cs, system);
//Push through
fs.Flush();
cs.Flush();
cs.Close();
}
catch (Exception e)
{
var mess = e.Message;
}
finally
{
//Close
fs.Close();
}
}
The class has a number of 'users'. With 100 users it takes 10 seconds and the file is 2MB. With 1000 users it gives an out-of-memory exception (the estimated size is 16MB). Can anyone see a problem here, or give suggestions how to solve this?
(At first I thought the time spent on the background thread was causing this, because it takes too long. But I have other background threads that can run longer.)
You aren't disposing of your streams, which may be part of the problem. I suggest:
public static void save(System system, String filePath)
{
    //Make filestream
    using (FileStream fs = new FileStream(filePath, FileMode.Create))
    {
        //Serialize offerte
        BinaryFormatter bf = new BinaryFormatter();
        using (DeflateStream cs = new DeflateStream(fs, CompressionMode.Compress))
        {
            bf.Serialize(cs, system);
            //No Flush() or Close() needed: leaving each using block
            //flushes and closes cs first, then fs
        }
    }
}
This also removes your exception swallowing, which would probably be a good thing.
You use several objects of classes that implement System.IDisposable.
When the designer of a class implements IDisposable, that tells you the class may hold scarce resources. You might run out of those resources before the garbage collector gets around to collecting the garbage.
In other words: whenever you use a class that implements System.IDisposable, you should call Dispose() as soon as you no longer need the object. This is especially important if you need the resources it holds for something else.
You use two Stream classes: FileStream and DeflateStream. They both implement IDisposable. If you don't call Dispose(), the underlying handles are only released when the garbage collector eventually finalizes them; in the meantime the resources these streams use are not available to anyone else, and data still buffered in them may never be written.
The easiest way to make sure that a disposable object is disposed is the using statement:
using (var myStream = new FileStream(...))
{
... // use myStream
}
When the closing brace is reached, myStream.Dispose() is called, releasing all the scarce resources it uses.
This happens however the {...} block is left, including via break, return, and even exceptions.
Therefore using is a very safe construct: Dispose() will always be called.
By the way: Dispose() also takes care that the streams are flushed and closed, so at the end of the using statement you don't have to call Flush() and Close().
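The point about Dispose() flushing can be checked with a small round trip. This is only a minimal sketch: the class name DeflateRoundTrip and its two methods are made up for illustration, and it compresses into a MemoryStream rather than a file to keep the example self-contained. Neither method calls Flush() or Close(); leaving the using block is what pushes the final compressed block out.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of the original question or answers.
public static class DeflateRoundTrip
{
    public static byte[] Compress(byte[] data)
    {
        using (var target = new MemoryStream())
        {
            using (var deflate = new DeflateStream(target, CompressionMode.Compress))
            {
                deflate.Write(data, 0, data.Length);
            } // Dispose() runs here and writes the remaining compressed data

            // ToArray() still works after the MemoryStream has been closed
            return target.ToArray();
        }
    }

    public static byte[] Decompress(byte[] compressed)
    {
        using (var source = new MemoryStream(compressed))
        using (var inflate = new DeflateStream(source, CompressionMode.Decompress))
        using (var result = new MemoryStream())
        {
            inflate.CopyTo(result);
            return result.ToArray();
        }
    }
}
```

If Compress read the MemoryStream before the DeflateStream was disposed, the output would typically be truncated, which is the same failure you risk when a stream is left for the garbage collector.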
Writing StringBuilder to file asynchronously. This code takes control of a file, writes a stream to it and releases it. It deals with requests from asynchronous operations, which may come in at any time.
The FilePath is set per class instance (so the lock Object is per instance), but there is potential for conflict since these classes may share FilePaths. That sort of conflict, as well as all other types from outside the class instance, would be dealt with by retries.
Is this code suitable for its purpose? Is there a better way to handle this that means less (or no) reliance on the catch and retry mechanic?
Also, how do I avoid catching exceptions that have occurred for other reasons?
public string Filepath { get; set; }
private Object locker = new Object();
public async Task WriteToFile(StringBuilder text)
{
int timeOut = 100;
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
while (true)
{
try
{
//Wait for resource to be free
lock (locker)
{
using (FileStream file = new FileStream(Filepath, FileMode.Append, FileAccess.Write, FileShare.Read))
using (StreamWriter writer = new StreamWriter(file, Encoding.Unicode))
{
writer.Write(text.ToString());
}
}
break;
}
catch
{
//File not available, conflict with other class instances or application
}
if (stopwatch.ElapsedMilliseconds > timeOut)
{
//Give up.
break;
}
//Wait and Retry
await Task.Delay(5);
}
stopwatch.Stop();
}
How you approach this is going to depend a lot on how frequently you're writing. If you're writing a relatively small amount of text fairly infrequently, then just use a static lock and be done with it. That might be your best bet in any case because the disk drive can only satisfy one request at a time. Assuming that all of your output files are on the same drive (perhaps not a fair assumption, but bear with me), there's not going to be much difference between locking at the application level and the lock that's done at the OS level.
So if you declare locker as:
static object locker = new object();
You'll be assured that there are no conflicts with other threads in your program.
If you want this thing to be bulletproof (or at least reasonably so), you can't get away from catching exceptions. Bad things can happen. You must handle exceptions in some way. What you do in the face of error is something else entirely. You'll probably want to retry a few times if the file is locked. If you get a bad path or filename error or disk full or any of a number of other errors, you probably want to kill the program. Again, that's up to you. But you can't avoid exception handling unless you're okay with the program crashing on error.
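One way to retry on a locked file while letting other errors surface is to catch by exception type. The sketch below is an illustration under assumptions, not a complete policy: RetryingAppender, TryAppend and its parameters are hypothetical names. It retries on IOException, which is what a sharing violation raises, and rethrows the path-related subclasses immediately. Note that a full disk also surfaces as a plain IOException, so type alone cannot separate that case; it will be retried and then reported as a failure.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper illustrating "retry only on what is worth retrying".
public static class RetryingAppender
{
    public static bool TryAppend(string path, string text, int maxAttempts, int delayMs)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                File.AppendAllText(path, text);
                return true;
            }
            // Bad paths derive from IOException but will never succeed on retry.
            catch (DirectoryNotFoundException) { throw; }
            catch (PathTooLongException) { throw; }
            catch (IOException)
            {
                // Possibly a sharing violation: wait briefly and try again.
                if (attempt < maxAttempts) Thread.Sleep(delayMs);
            }
            // Anything else (UnauthorizedAccessException, ArgumentException, ...)
            // is not caught here and propagates to the caller.
        }
        return false; // gave up after maxAttempts
    }
}
```

A caller would use something like `RetryingAppender.TryAppend(Filepath, text.ToString(), 5, 20)` and decide what to do when it returns false.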
By the way, you can replace all of this code:
using (FileStream file = new FileStream(Filepath, FileMode.Append, FileAccess.Write, FileShare.Read))
using (StreamWriter writer = new StreamWriter(file, Encoding.Unicode))
{
writer.Write(text.ToString());
}
With a single call:
// Pass the encoding explicitly; the two-argument overload writes UTF-8
File.AppendAllText(Filepath, text.ToString(), Encoding.Unicode);
Assuming you're using .NET 4.0 or later. See File.AppendAllText.
One other way you could handle this is to have the threads write their messages to a queue, and have a dedicated thread that services that queue. You'd have a BlockingCollection of messages and associated file paths. For example:
class LogMessage
{
public string Filepath { get; set; }
public string Text { get; set; }
}
BlockingCollection<LogMessage> _logMessages = new BlockingCollection<LogMessage>();
Your threads write data to that queue:
_logMessages.Add(new LogMessage { Filepath = "foo.log", Text = "this is a test" });
You start a long-running background task that does nothing but service that queue:
foreach (var msg in _logMessages.GetConsumingEnumerable())
{
// of course you'll want your exception handling in here
File.AppendAllText(msg.Filepath, msg.Text);
}
Your potential risk here is that threads create messages too fast, causing the queue to grow without bound because the consumer can't keep up. Whether that's a real risk in your application is something only you can say. If you think it might be a risk, you can put a maximum size (number of entries) on the queue so that if the queue size exceeds that value, producers will wait until there is room in the queue before they can add.
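The maximum size mentioned above is the bounded-capacity constructor argument of BlockingCollection. Here is a minimal sketch of that behaviour; the class BoundedQueueDemo and its Run method are hypothetical, and a list stands in for the file writes so the example has no side effects.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical demo class, not from the original answer.
public static class BoundedQueueDemo
{
    public static List<string> Run(int messageCount, int capacity)
    {
        var consumed = new List<string>();

        // With a bounded capacity, Add() blocks while the queue is full.
        using (var queue = new BlockingCollection<string>(capacity))
        {
            // Single consumer: drains the queue until CompleteAdding() is called.
            var consumer = Task.Run(() =>
            {
                foreach (var msg in queue.GetConsumingEnumerable())
                {
                    consumed.Add(msg); // stands in for File.AppendAllText
                }
            });

            for (int i = 0; i < messageCount; i++)
            {
                queue.Add("message " + i); // waits here if the consumer falls behind
            }

            queue.CompleteAdding(); // lets GetConsumingEnumerable() finish
            consumer.Wait();
        }

        return consumed;
    }
}
```

With one producer and one consumer, messages come out in the order they were added, because the default backing store is a ConcurrentQueue.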
You could also use ReaderWriterLock, which is considered a more 'appropriate' way to control thread safety when dealing with read and write operations.
To debug my web apps (when remote debugging fails) I use the following ('debug.txt' ends up in the \bin folder on the server):
public static class LoggingExtensions
{
    static ReaderWriterLock locker = new ReaderWriterLock();
    public static void WriteDebug(string text)
    {
        // Acquire outside the try block, so the lock is only released if it was actually taken
        locker.AcquireWriterLock(int.MaxValue);
        try
        {
            System.IO.File.AppendAllLines(Path.Combine(Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase).Replace("file:\\", ""), "debug.txt"), new[] { text });
        }
        finally
        {
            locker.ReleaseWriterLock();
        }
    }
}
Hope this saves you some time.
This is a continuation of part 3:
Write file need to optimised for heavy traffic part 3
As my code has changed somewhat, I think it is better to open a new thread.
public class memoryStreamClass
{
static MemoryStream ms1 = new MemoryStream();
static MemoryStream ms2 = new MemoryStream();
static int c = 1;
public void fillBuffer(string outputString)
{
byte[] outputByte = Encoding.ASCII.GetBytes(outputString);
if (c == 1)
{
ms1.Write(outputByte, 0, outputByte.Length);
if (ms1.Length > 8100)
{
c = 2;
Thread thread1 = new Thread(() => emptyBuffer(ref ms1));
thread1.Start();
}
}
else
{
ms2.Write(outputByte, 0, outputByte.Length);
if (ms2.Length > 8100)
{
c = 1;
Thread thread2 = new Thread(() => emptyBuffer(ref ms2));
thread2.Start();
}
}
}
    void emptyBuffer(ref MemoryStream ms)
    {
        FileStream outStream = new FileStream("c:\\output.txt", FileMode.Append);
        ms.WriteTo(outStream);
        outStream.Flush();
        outStream.Close();
        ms.SetLength(0);
        ms.Position = 0;
        Console.WriteLine(ms.Position);
    }
}
There are two things I have changed from the code in part 3.
The class and method are now non-static; the variables are still static, though.
I have moved the MemoryStream length reset into the emptyBuffer method, and I use a ref parameter to pass the reference to the method instead of a copy.
This code compiles fine and runs OK. However, I ran it side by side with my single-threaded program on two computers on the same network, one running the single-threaded version and one running the multithreaded version, for around 5 minutes. The single-threaded version collected 8333KB of data, while the multithreaded version collected only 8222KB (98.6% of the single-threaded version).
This is the first time I have done any performance comparison between the two versions. Maybe I should run more tests to confirm it, but from looking at the code, can anyone point out a problem?
I haven't added any locking or thread pooling at the moment. Maybe I should, but if the code runs fine, I don't want to change it and break it. The only thing I will change is the buffer size, to eliminate any chance of one buffer filling up before the other is emptied.
Any comments on my code?
The problem is still static state. You're clearing buffers that could have data that wasn't written to disk.
I imagine this scenario is happening 1.4% of the time.
ms1 fills up, empty buffer1 thread started, switch to ms2
empty buffer1 is writing to disk
ms2 fills up, empty buffer2 thread started, switch to ms1
empty buffer1 to disk finishes
ms1 is cleared while it is the active stream
When doing multi-threaded programming, static classes are fine but static state is not. Ideally you have no shared memory between threads, yet your code depends on it entirely.
Think of it this way -- if you're expecting a value to consistently change, it's not exactly static is it?
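One way to avoid clearing a buffer that is still in use is to hand a filled buffer off as a private copy and start a fresh one under a lock, so no two threads ever touch the same MemoryStream. The following is a rough sketch under assumptions, not a drop-in replacement: HandOffBuffer, its threshold parameter and the sink delegate are all hypothetical names, and the sink stands in for whatever writes the bytes to disk (it could equally queue them for a background thread).

```csharp
using System;
using System.IO;

// Hypothetical sketch: instance state only, no static fields.
public class HandOffBuffer
{
    private readonly object gate = new object();
    private readonly int threshold;
    private readonly Action<byte[]> sink; // stands in for the disk write
    private MemoryStream active = new MemoryStream();

    public HandOffBuffer(int threshold, Action<byte[]> sink)
    {
        this.threshold = threshold;
        this.sink = sink;
    }

    public void Write(byte[] data)
    {
        byte[] full = null;
        lock (gate)
        {
            active.Write(data, 0, data.Length);
            if (active.Length > threshold)
            {
                // Take a private copy and replace the buffer; nothing is
                // cleared while another thread might still be reading it.
                full = active.ToArray();
                active = new MemoryStream();
            }
        }
        if (full != null) sink(full); // called outside the lock
    }

    // Hands off whatever is left, for example at shutdown.
    public void Flush()
    {
        byte[] rest;
        lock (gate)
        {
            rest = active.ToArray();
            active = new MemoryStream();
        }
        if (rest.Length > 0) sink(rest);
    }
}
```

Because each filled buffer becomes an independent byte array, the slow disk write can no longer race with writes into the active buffer.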
I have an issue from time to time. I have a few StreamReaders and StreamWriters in my program that read info and write it. They work about 99% of the time, but once in a while I end up with a StreamWriter that won't close, in a piece of code I've run multiple times.
This tends to happen if I spam a function, but I am trying to find a safe way to guarantee a stream is disposed. Anyone know how?
Try a using statement (see MSDN):
using (StreamWriter stream = new StreamWriter(Initialization)){
//your code
}
This can be useful:
Closing Stream Read and Stream Writer when the form exits
You could also use a try block:
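//Declare your StreamWriter outside the try block so that finally can see it
StreamWriter sw = null;
try
{
    sw = new StreamWriter(Initialization);
}
catch
{
    //Handle the errors
}
finally
{
    //sw is still null if the constructor threw
    if (sw != null)
    {
        sw.Dispose();
    }
}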
If the stream's scope is local, always use the following construct:
using (var stream = new FileStream(...)) // Stream itself is abstract, so use a concrete type
{
...do stream work here...
}
If on the other hand you are using the stream as a class field then implement the IDisposable pattern and dispose your stream objects when disposing your class: IDisposable
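For the class-field case, a minimal sketch might look like the following. The LogFile class and its members are hypothetical names chosen for illustration. It is sealed and owns only a managed StreamWriter, so a plain Dispose() is enough here; the fuller pattern with a protected Dispose(bool) is for classes meant to be inherited or that hold unmanaged resources directly.

```csharp
using System;
using System.IO;

// Hypothetical example of a class that owns a stream as a field.
public sealed class LogFile : IDisposable
{
    private readonly StreamWriter writer;
    private bool disposed;

    public LogFile(string path)
    {
        // Opens the file for appending; the writer lives as long as this object.
        writer = new StreamWriter(path, true);
    }

    public void WriteLine(string line)
    {
        if (disposed) throw new ObjectDisposedException("LogFile");
        writer.WriteLine(line);
    }

    public void Dispose()
    {
        if (disposed) return; // safe to call more than once
        writer.Dispose();     // flushes and closes the underlying file
        disposed = true;
    }
}
```

Callers can then wrap the owning object in a using statement, for example `using (var log = new LogFile(path)) { log.WriteLine("started"); }`, and the writer is disposed along with it.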
Wrapping the StreamWriter in a using statement is how I usually ensure it is disposed of.
using (var writer = new StreamWriter(@"C:\AFile.txt"))
{
//do some stuff with writer
}
An alternative would be to use a finally block.