I'm creating files that have a certain structure to them. They begin with a Header, then contain a block of DataElements. (The exact details don't matter to this question.)
I have a DataFileWriter connected to a FileStream for output. The problem is, the service that's consuming the files I'm building will reject any data file whose size is larger than the arbitrary value TOOBIG.
Given these constraints:
Every file must start with a Header
Every file must contain one or more DataElements, which must be written out completely; a file ending with an incomplete DataElement is invalid
It is perfectly valid to stop at the end of the current element and begin writing a new file with a new header, as long as each DataElement is written exactly once
The DataFileWriter doesn't and shouldn't know that it's writing to a FileStream as opposed to some other type of stream; all it knows is that it has a Stream, and in other situations that could be a completely different setup.
DataElement does not have a fixed size, but it's reasonable to assume any given element won't exceed 4 KB in size.
What's the best way to set up a system that will ensure, assuming that no massive DataElements come through, that no file exceeding a size of TOOBIG will be created? Basic architecture is given below; how would I need to modify it?
public class DataFileWriter : IDisposable
{
    private readonly Stream _output;
    private readonly IEnumerable<DataElement> _input;
    private const long TOOBIG = 4L * 1024 * 1024 * 1024; // 4 GB

    public DataFileWriter(IEnumerable<DataElement> input, Stream output)
    {
        _input = input;
        _output = output;
    }

    public void Write()
    {
        WriteHeader(); // writes the header to _output

        foreach (var element in _input)
        {
            WriteElement(element); // serializes the record to _output
        }
    }

    public void Dispose()
    {
        _output.Dispose();
    }
}
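One possible shape for this (a sketch only, not taken from the original post; the Func<Stream> factory and the Serialize/SerializeHeader helpers are assumptions made for illustration): give the writer a factory for output streams instead of a single Stream, track how many bytes the current file has received, and start a new file with a fresh header whenever the next element would push it past TOOBIG.

public class RollingDataFileWriter : IDisposable
{
    private const long TOOBIG = 4L * 1024 * 1024 * 1024; // 4 GB

    private readonly Func<Stream> _createOutput;   // e.g. () => File.Create(GetNextFileName())
    private readonly IEnumerable<DataElement> _input;
    private Stream _output;
    private long _written;

    public RollingDataFileWriter(IEnumerable<DataElement> input, Func<Stream> createOutput)
    {
        _input = input;
        _createOutput = createOutput;
    }

    public void Write()
    {
        StartNewFile();
        foreach (var element in _input)
        {
            byte[] serialized = Serialize(element);    // assumed helper: element -> bytes
            if (_written + serialized.Length > TOOBIG)
                StartNewFile();                        // finish this file, begin the next one
            _output.Write(serialized, 0, serialized.Length);
            _written += serialized.Length;
        }
    }

    private void StartNewFile()
    {
        _output?.Dispose();
        _output = _createOutput();
        byte[] header = SerializeHeader();             // assumed helper: header -> bytes
        _output.Write(header, 0, header.Length);
        _written = header.Length;
    }

    public void Dispose()
    {
        _output?.Dispose();
    }

    private static byte[] Serialize(DataElement element) { /* as in WriteElement */ return new byte[0]; }
    private static byte[] SerializeHeader() { /* as in WriteHeader */ return new byte[0]; }
}

The caller decides what a "new output" means (for example a fresh FileStream per file), so the writer itself still knows nothing about FileStream specifically.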
I have an issue with the following code. I create a memory stream in the GetDb function and the return value is used in a using block. For some unknown reason if I dump my objects I see that the MemoryStream is still around at the end of the Main method. This causes a massive leak. Any idea how I can clean up this buffer?
I have actually checked that the Dispose method has been called on the MemoryStream, but the object seems to stay around; I have used the diagnostic tools of Visual Studio 2017 for this task.
class Program
{
    static void Main(string[] args)
    {
        List<CsvProduct> products;
        using (var s = GetDb())
        {
            products = Utf8Json.JsonSerializer.Deserialize<List<CsvProduct>>(s).ToList();
        }
    }

    public static Stream GetDb()
    {
        var filepath = Path.Combine("c:/users/tom/Downloads", "productdb.zip");
        using (var archive = ZipFile.OpenRead(filepath))
        {
            var data = archive.Entries.Single(e => e.FullName == "productdb.json");
            using (var s = data.Open())
            {
                var ms = new MemoryStream();
                s.CopyTo(ms);
                ms.Seek(0, SeekOrigin.Begin);
                return (Stream)ms;
            }
        }
    }
}
For some unknown reason if I dump my objects I see that the MemoryStream is still around at the end of the Main method.
That isn't particularly abnormal; GC happens separately.
This causes a massive leak.
That isn't a leak, it is just memory usage.
Any idea how I can clean up this buffer?
I would probably just not use a MemoryStream, instead returning something that wraps the live uncompressing stream (from s = data.Open()). The problem here, though, is that you can't just return s, as archive would still be disposed upon leaving the method. So if I needed to solve this, I would create a custom Stream that wraps an inner stream and disposes a second object when disposed, i.e.
class MyStream : Stream
{
    private readonly Stream _source;
    private readonly IDisposable _parent;

    public MyStream(Stream source, IDisposable parent) { _source = source; _parent = parent; }

    // not shown: implement all Stream members by proxying to _source

    protected override void Dispose(bool disposing)
    {
        if (disposing) { _source.Dispose(); _parent.Dispose(); }
        base.Dispose(disposing);
    }
}
then have:
public static Stream GetDb()
{
    var filepath = Path.Combine("c:/users/tom/Downloads", "productdb.zip");
    var archive = ZipFile.OpenRead(filepath);
    var data = archive.Entries.Single(e => e.FullName == "productdb.json");
    var s = data.Open();
    return new MyStream(s, archive);
}
(could be improved slightly to make sure that archive is disposed if an exception happens before we return with success)
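For instance, a sketch of that improvement (same method as above, just with the failure path handled):

public static Stream GetDb()
{
    var filepath = Path.Combine("c:/users/tom/Downloads", "productdb.zip");
    var archive = ZipFile.OpenRead(filepath);
    try
    {
        var data = archive.Entries.Single(e => e.FullName == "productdb.json");
        return new MyStream(data.Open(), archive);
    }
    catch
    {
        // if anything fails before ownership passes to MyStream, release the archive
        archive.Dispose();
        throw;
    }
}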
I'm building a class library for various document types. One such type is an image, which contains our custom business logic for dealing with images, including converting to PDF. I'm running into the problem described in many posts -- e.g. here and here -- where the System.Drawing.Image.Save method is throwing a System.Runtime.InteropServices.ExternalException with "A generic error occurred in GDI+".
The answers I've seen say that the input stream needs to be kept open throughout the lifetime of the Image. I get that. The issue I have is that my class library doesn't control the input stream or even whether an input stream is used since I have two constructors. Here is some code:
public sealed class MyImage
{
    private System.Drawing.Image _wrappedImage;

    public MyImage(System.IO.Stream input)
    {
        _wrappedImage = System.Drawing.Image.FromStream(input);
    }

    public MyImage(System.Drawing.Image input)
    {
        _wrappedImage = input;
    }

    public MyPdf ConvertToPdf()
    {
        //no 'using' block because ms needs to be kept open due
        // to third-party PDF conversion technology.
        var ms = new System.IO.MemoryStream();

        //System.Runtime.InteropServices.ExternalException occurs here:
        //"A generic error occurred in GDI+"
        _wrappedImage.Save(ms, System.Drawing.Imaging.ImageFormat.Bmp);

        return MyPdf.CreateFromImage(ms);
    }
}

public sealed class MyPdf
{
    internal static MyPdf CreateFromImage(System.IO.Stream input)
    {
        //implementation details not important.
        return null;
    }
}
My question is this: should I keep a copy of the input stream just to avoid the possibility that the client closes the stream before my image is saved? I.e., I could add this to my class:
private System.IO.Stream _streamCopy = new System.IO.MemoryStream();
and change the constructor to this:
public MyImage(System.IO.Stream input)
{
    input.CopyTo(_streamCopy);
    _wrappedImage = System.Drawing.Image.FromStream(_streamCopy);
}
This would of course add the overhead of copying the stream which is not ideal. Is there a better way to do it?
You could create another Bitmap instance:
public MyImage(System.IO.Stream input)
{
    var image = System.Drawing.Image.FromStream(input);
    _wrappedImage = new System.Drawing.Bitmap(image);
    // input stream may now be closed
}
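A usage sketch (file name and calling code are illustrative) of why this helps: the caller's stream can be closed as soon as the constructor returns, because the Bitmap constructor copies the pixel data.

MyImage img;
using (var fs = System.IO.File.OpenRead("photo.jpg"))  // hypothetical input file
{
    img = new MyImage(fs);
}                                                       // the stream is closed here
var pdf = img.ConvertToPdf();                           // still works: the image no longer depends on the stream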
I am using a create method that wraps the constructor of a converter.
public void loadData()
{
    byte[] data = new byte[] { ...... }; // some byte data in here
    var converter = GetDataConverter(data);
}
Now inside GetDataConverter I need to create a MemoryStream from the binary data (the converter is 3rd party and takes a stream).
If I write GetDataConverter like this, I get an error telling me I didn't dispose, which I understand: I created a MemoryStream and I need to dispose of it.
public MyDataConverter GetDataConverter(byte[] data)
{
    return new MyDataConverter(new MemoryStream(data));
}
So my solution would be this:
public MyDataConverter GetDataConverter(byte[] data)
{
    using (var ms = new MemoryStream(data))
    {
        return new MyDataConverter(ms);
    }
}
The question is: is my solution correct? Should I be using a 'using' here? Isn't the 'using' going to destroy my memory stream once it's out of scope, so the converter will have nothing to work on?
I need an answer AND an explanation please; I'm a bit vague on the whole 'using' thing here.
Thanks
If you have no access to the code of MyDataConverter and the type doesn't implement IDisposable, you can do:
public void loadData()
{
    byte[] data = new byte[] { 0 }; // some byte data in here
    using (var stream = new MemoryStream(data))
    {
        var converter = new MyDataConverter(stream);
        // do work here...
    }
}
If you have to use this many times and want to reuse your code you can do something like this:
public void loadData()
{
    byte[] data = new byte[] { 0 }; // some byte data in here
    UsingConverter(data, x =>
    {
        // do work here...
    });
}

void UsingConverter(byte[] data, Action<MyDataConverter> action)
{
    using (var stream = new MemoryStream(data))
    {
        var converter = new MyDataConverter(stream);
        action(converter);
    }
}
It really depends on the implementation of MyDataConverter. If the MemoryStream is only used inside the constructor to retrieve some data from it, then your solution with using is OK.
If, on the other hand, MyDataConverter keeps a reference to the MemoryStream to access it later, you must not dispose it here.
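In that situation one option (a sketch, assuming MyDataConverter is not itself IDisposable and does not dispose the stream) is a small wrapper that owns both objects and disposes the stream only when the caller is finished:

public sealed class ConverterHandle : IDisposable
{
    private readonly MemoryStream _stream;

    public MyDataConverter Converter { get; }

    public ConverterHandle(byte[] data)
    {
        _stream = new MemoryStream(data);
        Converter = new MyDataConverter(_stream);
    }

    public void Dispose()
    {
        _stream.Dispose();
    }
}

// usage:
// using (var handle = new ConverterHandle(data))
// {
//     var converter = handle.Converter;
//     // do work here...
// }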
I have an extremely large 2D byte array in memory,
byte[][] MyBA = new byte[int.MaxValue][]; // each MyBA[i] is a 10-byte row
Is there any way (probably unsafe) that I can fool C# into thinking this is one huge contiguous byte array? I want to do this so that I can pass it to a MemoryStream and then a BinaryReader.
MyReader = new BinaryReader(MemoryStream(*MyBA)) //Syntax obviously made-up here
I do not believe .NET provides this, but it should be fairly easy to write your own implementation of System.IO.Stream that seamlessly switches between backing arrays. Here are the (untested) basics:
public class MultiArrayMemoryStream : System.IO.Stream
{
    byte[][] _arrays;
    long _position;
    int _arrayNumber;
    int _posInArray;

    public MultiArrayMemoryStream(byte[][] arrays)
    {
        _arrays = arrays;
        _position = 0;
        _arrayNumber = 0;
        _posInArray = 0;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = 0;
        while (read < count)
        {
            if (_arrayNumber >= _arrays.Length)
            {
                // no more data: return however many bytes we managed to copy
                return read;
            }
            if (count - read <= _arrays[_arrayNumber].Length - _posInArray)
            {
                // the rest of the request fits in the current array
                Buffer.BlockCopy(_arrays[_arrayNumber], _posInArray, buffer, offset + read, count - read);
                _posInArray += count - read;
                _position += count - read;
                read = count;
            }
            else
            {
                // copy what is left of the current array, then move on to the next one
                Buffer.BlockCopy(_arrays[_arrayNumber], _posInArray, buffer, offset + read, _arrays[_arrayNumber].Length - _posInArray);
                read += _arrays[_arrayNumber].Length - _posInArray;
                _position += _arrays[_arrayNumber].Length - _posInArray;
                _arrayNumber++;
                _posInArray = 0;
            }
        }
        return count;
    }

    public override long Length
    {
        get
        {
            long res = 0;
            for (int i = 0; i < _arrays.Length; i++)
            {
                res += _arrays[i].Length;
            }
            return res;
        }
    }

    public override long Position
    {
        get { return _position; }
        set { throw new NotSupportedException(); }
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }

    public override void Flush() { }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }
}
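A usage sketch (names follow the question; this stream is forward-only, so read it sequentially):

var stream = new MultiArrayMemoryStream(MyBA);
using (var reader = new BinaryReader(stream))
{
    byte first = reader.ReadByte();
    // ... continue reading sequentially; Seek and setting Position are not supported
}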
Another way to work around the size limitation of 2^31 bytes is UnmanagedMemoryStream, which implements System.IO.Stream on top of an unmanaged memory buffer (which might be as large as the OS supports). Something like this might work (untested):
var fileStream = new FileStream("data",
    FileMode.Open,
    FileAccess.Read,
    FileShare.Read,
    16 * 1024,
    FileOptions.SequentialScan);

long length = fileStream.Length;
IntPtr buffer = Marshal.AllocHGlobal(new IntPtr(length));

// note: the byte* cast requires an unsafe context
var memoryStream = new UnmanagedMemoryStream((byte*)buffer.ToPointer(), length, length, FileAccess.ReadWrite);

fileStream.CopyTo(memoryStream);
memoryStream.Seek(0, SeekOrigin.Begin);

// work with the UnmanagedMemoryStream

Marshal.FreeHGlobal(buffer);
Agreed. In any case, you are limited by the maximum size of an array itself.
If you really need to operate on huge arrays through a stream, write your own custom memory stream class.
I think you can use a linear structure instead of a 2D structure, using the following approach.
Instead of having byte[int.MaxValue][10] you can have byte[int.MaxValue * 10]. You would address the item at [4, 5] as 10 * (4 - 1) + (5 - 1) (the general formula being (i - 1) * numberOfColumns + (j - 1), with 10 columns here).
Of course you could use the other convention.
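A tiny sketch of that index arithmetic with 0-based indices (the flat array and helper are illustrative only):

const int columns = 10;

// element [i, j] of the conceptual 2D table lives at index i * columns + j in the flat array
long FlatIndex(long i, int j) => i * columns + j;

// e.g. MyBA[4][5] corresponds to flat[FlatIndex(4, 5)], i.e. index 45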
If I understand your question correctly, you've got a massive file that you want to read into memory and then process. But you can't do this because the amount of data in the file exceeds the capacity of any single-dimensional array.
You mentioned that speed is important, and that you have multiple threads running in parallel to process the data as quickly as possible. If you're going to have to partition the data for each thread anyway, why not base the number of threads on the number of byte[int.MaxValue] buffers required to cover everything?
You can create a MemoryStream and then write the array into it row by row using the Write method.
EDIT:
The limit of a MemoryStream is certainly the amount of memory available to your application. Maybe there is a limit beneath that, but if you need more memory then you should consider modifying your overall architecture. E.g. you could process your data in chunks, or you could implement a swapping mechanism to a file.
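A sketch of the row-by-row copy suggested above (note that a single MemoryStream is still limited to a 2 GB backing buffer):

var ms = new MemoryStream();
foreach (byte[] row in MyBA)            // copy each 10-byte row into one stream
{
    ms.Write(row, 0, row.Length);
}
ms.Seek(0, SeekOrigin.Begin);

using (var reader = new BinaryReader(ms))
{
    // read the data back sequentially
}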
If you are using Framework 4.0, you have the option of working with a MemoryMappedFile. Memory mapped files can be backed by a physical file, or by the Windows swap file. Memory mapped files act like an in-memory stream, transparently swapping data to/from the backing storage if and when required.
If you are not using Framework 4.0, you can still use this option, but you will need to either write your own or find an existing wrapper. I expect there are plenty on The Code Project.
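A minimal sketch of the Framework 4.0 route (map name and capacity are illustrative; requires System.IO.MemoryMappedFiles and a 64-bit process for a buffer this large):

long capacity = (long)int.MaxValue * 10;  // total size of the conceptual 2D array

// backed by the system page file; use MemoryMappedFile.CreateFromFile to back it with a physical file instead
using (var mmf = MemoryMappedFile.CreateNew("hugeBuffer", capacity))
using (var stream = mmf.CreateViewStream())
using (var writer = new BinaryWriter(stream))
{
    // write the rows into the view, then rewind and read them back with a BinaryReader
}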