Reading parts of large files from drive

I'm working with large files in C# (can be up to 20%-40% of available memory) and I will only need small parts of the files to be loaded into memory at a time (like 1-2% of the file). I was thinking that using a FileStream would be the best option, but idk. I will need to give a starting point (in bytes) and a length (in bytes) and copy that region into a byte[]. Access to the file might need to be shared between threads and will be at random spots in the file (non-linear access). I also need it to be fast.
The project already has unsafe methods, so feel free to suggest things from the more dangerous side of C#

A FileStream will allow you to seek to the portion of the file you want, no problem. It's the recommended way to do it in C#, and it's fast.
Sharing between threads: You will need to create a lock to prevent other threads from changing the FileStream position while you're trying to read from it. The simplest way to do this:
// This really needs to be a member-level variable so every reader shares the same lock.
private static readonly object fsLock = new object();
// Instantiate this in a static constructor or initialize() method
private static FileStream fs = new FileStream("myFile.txt", FileMode.Open, FileAccess.Read);
public byte[] ReadFile(long fileOffset, int length) {
    byte[] buffer = new byte[length];
    int arrayOffset = 0;
    lock (fsLock) {
        fs.Seek(fileOffset, SeekOrigin.Begin);
        // Read may return fewer bytes than requested, so loop, reading blocks at a time
        while (arrayOffset < length) {
            int numBytesRead = fs.Read(buffer, arrayOffset, length - arrayOffset);
            if (numBytesRead == 0) break; // end of file reached
            arrayOffset += numBytesRead;
        }
    }
    // Do what you want to the byte array and return it
    return buffer;
}
Add try..catch statements and other code as necessary. Everywhere you access this FileStream, put a lock on the member-level variable fsLock... this will keep other methods from reading/manipulating the file pointer while you're trying to read.
Speed-wise, I think you'll find you're limited by disk access speeds, not code.
You'll have to think through all the issues about multi-threaded file access... who initializes/opens the file, who closes it, etc. There's a lot of ground to cover.
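If you can target .NET 6 or later (an assumption; the question does not state a runtime version), a lock-free variant of the same idea is possible, because RandomAccess.Read takes the file offset as an argument instead of using a shared stream position. The sketch below is not the approach from the answer above but a swapped-in alternative; the class name RegionReader is hypothetical.

```csharp
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;

// Hypothetical helper: one shared handle, no shared position, so no lock is needed.
public sealed class RegionReader : IDisposable
{
    private readonly SafeFileHandle _handle;

    public RegionReader(string path)
    {
        _handle = File.OpenHandle(path, FileMode.Open, FileAccess.Read, FileShare.Read);
    }

    // Copies up to `length` bytes starting at `offset` into a new array.
    // The result is shorter than `length` only if the end of the file is reached.
    public byte[] Read(long offset, int length)
    {
        var buffer = new byte[length];
        int total = 0;
        while (total < length)
        {
            int n = RandomAccess.Read(_handle, buffer.AsSpan(total), offset + total);
            if (n == 0) break; // end of file
            total += n;
        }
        if (total < length) Array.Resize(ref buffer, total);
        return buffer;
    }

    public void Dispose()
    {
        _handle.Dispose();
    }
}
```

Several threads can call Read on the same RegionReader concurrently. Someone still has to own the instance and dispose it once all readers are done.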

I know nothing about the structure of these files, but reading a portion of a file with FileStream or similar sounds like the best and fastest way to do it.
You will not need to copy the byte[] since FileStream can read directly into a byte array.
It sounds like you might know more about the structure of the file, which could bring up additional techniques as well. But if you need to read only a portion of the file, then this would probably be the way to do it.

If you are using .Net 4 look into using memory mapped files in the System.IO.MemoryMappedFiles namespace.
They are perfect for reading small chunks out of large files. There are samples in the MSDN documentation.
You can also do this in earlier versions of .Net, but then you need to wrap the Win32 API (or use http://winterdom.com/dev/net).
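As a rough illustration of the memory-mapped approach, here is a minimal sketch that copies one region of a file into a byte[]. The class and method names (MappedRegionReader, ReadRegion) are made up for this example, and it assumes the requested region lies entirely inside the file; otherwise creating the view throws.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedRegionReader
{
    // Copies `length` bytes starting at `offset` from the file at `path` into a new byte[].
    public static byte[] ReadRegion(string path, long offset, int length)
    {
        // Capacity 0 means "map the file at its current size".
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        // The view covers only the requested region, not the whole file.
        using (var view = mmf.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[length];
            view.ReadArray(0, buffer, 0, length);
            return buffer;
        }
    }
}
```

For repeated reads you would probably keep the MemoryMappedFile open and create only the view per request. Each thread can create its own view, so no shared file position is involved.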

Related

Why do most serializers use a stream instead of a byte array?

I am currently working on a socket server and I was wondering
Why do serializers like
XmlSerializer
BinaryFormatter
Protobuf-net
DataContractSerializer
all require a Stream instead of a byte array?

It means you can stream to arbitrary destinations rather than just to memory.
If you want to write something to a file, why would you want to create a complete copy in memory first? In some cases that could cause you to use a lot of extra memory, possibly causing a failure.
If you want to create a byte array, just use a MemoryStream:
var memoryStream = new MemoryStream();
serializer.Write(foo, memoryStream); // Or whatever you're using
var bytes = memoryStream.ToArray();
So with an abstraction of "you use streams" you can easily work with memory - but if the abstraction is "you use a byte array" you are forced to work with memory even if you don't want to.
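To make that concrete with one of the serializers from the question, here is a small sketch using XmlSerializer. The Message type and the MessageIo helper are hypothetical names invented for the illustration; the point is only that the byte[] version is a thin wrapper over the stream version, never the other way round.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical payload type, used only for this illustration.
public class Message
{
    public int Id { get; set; }
    public string Text { get; set; }
}

public static class MessageIo
{
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(Message));

    // The one real implementation: writes to whatever stream the caller supplies
    // (a FileStream, a NetworkStream, a MemoryStream, ...).
    public static void Write(Message message, Stream destination)
    {
        Serializer.Serialize(destination, message);
    }

    // The byte[] convenience goes through a MemoryStream.
    public static byte[] ToBytes(Message message)
    {
        using (var memoryStream = new MemoryStream())
        {
            Write(message, memoryStream);
            return memoryStream.ToArray();
        }
    }

    public static Message FromBytes(byte[] bytes)
    {
        using (var memoryStream = new MemoryStream(bytes))
        {
            return (Message)Serializer.Deserialize(memoryStream);
        }
    }
}
```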

You can easily make a stream over a byte array...but a byte array is inherently size-constrained, where a stream is open-ended...big as you need. Some serialization can be pretty enormous.
Edit: Also, if I need to implement some kind of serialization, I want to do it for the most basic abstraction, and avoid having to do it over multiple abstractions. Stream would be my choice, as there are stream implementations over lots of things: memory, disk, network and so forth. As an implementer, I get those for "free".

If you use a byte array or buffer, you are working in memory and you are limited in size.
A stream, on the other hand, lets you store things on disk or send them to other computers over the internet, a serial port, and so on. Streams often use buffers to optimize transmission speed.
So streaming is useful if you are dealing with a large file.

@JonSkeet's answer is the correct one, but as an addendum, if the issue you're having with making a temporary stream is "I don't like it because it's effort" then consider writing an extension method:
namespace Project.Extensions
{
public static class XmlSerialiserExtensions
{
public static void Serialise(this XmlSerializer serialiser, byte[] bytes, object obj)
{
using(var temp = new MemoryStream(bytes))
serialiser.Serialize(temp, obj);
}
public static object Deserialise(this XmlSerializer serialiser, byte[] bytes)
{
using(var temp = new MemoryStream(bytes))
return serialiser.Deserialize(temp);
}
}
}
So you can go ahead and do
serialiser.Serialise(buffer, obj);
socket.Write(buffer);
Or
socket.Read(buffer);
var obj = serialiser.Deserialise(buffer);

Byte arrays were used more often when manipulating ASCII (i.e. 1-byte) strings of characters often in machine dependent applications, such as buffers. They lend themselves more to low-level applications, whereas "streams" is a more generalized way of dealing with data, which enables a wider range of applications. Also, streams are a more abstract way of looking at data, which allows considerations such as character type (UTF-8, UTF-16, ASCII, etc.) to be handled by code that is invisible to the user of the data stream.

.Net streams: Returning vs Providing

I have always wondered what the best practice for using a Stream class in C# .Net is. Is it better to provide a stream that has been written to, or be provided one?
i.e:
public Stream DoStuff(...)
{
var retStream = new MemoryStream();
//Write to retStream
return retStream;
}
as opposed to;
public void DoStuff(Stream myStream, ...)
{
//write to myStream directly
}
I have always used the former example for the sake of lifecycle control, but I have a feeling that it is a poor way of "streaming" with Streams, for lack of a better word.

I would prefer "the second way" (operate on a provided stream) since it has a few distinct advantages:
You can have polymorphism (assuming as evidenced by your signature you can do your operations on any type of Stream provided).
It's easily abstracted into a Stream extension method now or later.
You clearly divide responsibilities. This method should not care on how to construct a stream, only on how to apply a certain operation to it.
Also, if you're returning a new stream (option 1), it would feel a bit strange that you would have to Seek again first in order to be able to read from it (unless you do that in the method itself, which is suboptimal once more since it might not always be required - the stream might not be read from afterwards in all cases). Having to Seek after passing an already existing stream to a method that clearly writes to the stream does not seem so awkward.

I see the benefit of Streams is that you don't need to know what you're streaming to.
In the second example, your code could be writing to memory, it could be writing directly to file, or to some network buffer. From the function's perspective, the actual output destination can be decided by the caller.
For this reason, I would prefer the second option.
The first function is just writing to memory. In my opinion, it would be clearer if it did not return a stream, but the actual memory buffer. The caller can then attach a Memory Stream if he/she wishes.
public byte[] DoStuff(...)
{
var retStream = new MemoryStream();
//Write to retStream
return retStream.ToArray();
}

100% the second one. You don't want to make assumptions about what kind of stream they want. Do they want to stream to the network or to disk? Do they want it to be buffered? Leave these up to them.
They may also want to reuse the stream to avoid creating new buffers over and over. Or they may want to stream multiple things end-to-end on the same stream.
If they provide the stream, they have control over its type as well as its lifetime. Otherwise, you might as well just return something like a string or array. The stream isn't really giving you any benefit over these.
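A minimal sketch of the "be provided a stream" style follows. ReportWriter and WriteLines are hypothetical names, and the StreamWriter overload with leaveOpen assumes .NET 4.5 or later. The method writes to whatever it is handed and leaves the stream open, so the caller decides the stream's type and lifetime.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReportWriter
{
    // Writes each line, followed by '\n', to the caller's stream.
    public static void WriteLines(Stream output, IEnumerable<string> lines)
    {
        // UTF8Encoding(false) suppresses the byte-order mark.
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 1024, leaveOpen: true))
        {
            foreach (var line in lines)
            {
                writer.Write(line);
                writer.Write('\n');
            }
        } // disposing the writer flushes it but does not close `output`
    }
}
```

Because the stream stays open, a caller can invoke WriteLines twice on the same MemoryStream or FileStream to write several things end to end, and dispose the stream afterwards.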

c# MemoryStream vs Byte Array

I have a function which generates and returns a MemoryStream. After generation the size of the MemoryStream is fixed; I don't need to write to it anymore, only to output it, for example by writing it to a MailAttachment or to a database.
What is the best way to hand the object around? MemoryStream or Byte Array? If I use MemoryStream I have to reset the position after read.

If you have to hold all the data in memory, then in many ways the choice is arbitrary. If you have existing code that operates on Stream, then MemoryStream may be more convenient, but if you return a byte[] you can always just wrap that in a new MemoryStream(blob) anyway.
It might also depend on how big it is and how long you are holding it for; MemoryStream can be oversized, which has advantages and disadvantages. Forcing it to a byte[] may be useful if you are holding the data for a while, since it will trim off any excess; however, if you are only keeping it briefly, it may be counter-productive, since it will force you to duplicate most (at an absolute minimum: half) of the data while you create the new copy.
So; it depends a lot on context, usage and intent. In most scenarios, "whichever works, and is clear and simple" may suffice. If the data is particularly large or held for a prolonged period, you may want to deliberately tweak it a bit.
One additional advantage of the byte[] approach: if needed, multiple threads can access it safely at once (as long as they are reading) - this is not true of MemoryStream. However, that may be a false advantage: most code won't need to access the byte[] from multiple threads.
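The two conversions mentioned above can be sketched as follows. BlobHandling and its method names are hypothetical; the calls themselves (ToArray and the MemoryStream(byte[], bool) constructor) are standard.

```csharp
using System;
using System.IO;

public static class BlobHandling
{
    // Copies exactly the bytes written so far into a new array.
    // Any spare capacity in the stream's internal buffer is left behind.
    public static byte[] Trim(MemoryStream stream)
    {
        return stream.ToArray();
    }

    // Wraps an existing array without copying it. The stream is read-only
    // and starts at position 0, so no position reset is needed before reading.
    public static MemoryStream AsReadOnlyStream(byte[] blob)
    {
        return new MemoryStream(blob, writable: false);
    }
}
```

Handing out the byte[] and letting each consumer wrap it in its own read-only MemoryStream also avoids the position-reset problem from the question, since every wrapper has its own position.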

The MemoryStream class is designed for appending data to a stream.
It keeps a position pointer, much like a file pointer, and simulates random access by seeking. It is therefore not designed for reaching an arbitrary element at any time.
A byte array allows random access to any element at any time for as long as it is referenced.
Like a byte[], a MemoryStream lives in memory (as the name of the class suggests). Its capacity is an int, so the maximum size is about 2 GB.
Finally, use a byte[] if you need to access the data by index. Otherwise, a MemoryStream is meant for working with code that requires a stream as input when all you have is in-memory data such as a string.

Use a byte[] because it's a fixed-size object, which makes memory allocation and cleanup easier, and it carries relatively little overhead, especially since you don't need the functions of the MemoryStream. Further, you want to get that stream disposed of ASAP so it can release any unmanaged resources it may be using.

Performance: use a BinaryReader on a MemoryStream to read a byte array, or read directly?

I would like to know whether using a BinaryReader on a MemoryStream created from a byte array (byte[]) would reduce performance significantly.
There is binary data I want to read, and I get that data as an array of bytes. I am currently deciding between two approaches to read the data, and have to implement many reading methods accordingly. After each reading action, I need the position right after the read data, and therefore I am considering using a BinaryReader. The first, non-BinaryReader approach:
object Read(byte[] data, ref int offset);
The second approach:
object Read(BinaryReader reader);
Such Read() methods will be called very often, in succession on the same data until all data has been read.
So, using a BinaryReader feels more natural, but has it much impact on the performance?

You'll create a fair amount of garbage for each call to Read(byte[]). There will be 40 bytes for the MemoryStream, I stopped counting at 64 bytes for the BinaryReader. Dispose is also normally used, although it doesn't do anything. Whether that overhead matters is impossible to tell from your question.
I'd personally prefer the Read(BinaryReader) overload, not just because it is more efficient. That also gives the flexibility of changing the source of the data. It doesn't have to be in a byte[] anymore, you could feed it from, say, a FileStream or NetworkStream.
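For comparison, here are the two signatures from the question side by side, reading a single Int32. ReadStyles is a hypothetical name. One caveat that applies here: BitConverter uses the machine's byte order while BinaryReader always reads little-endian, so the two agree only on little-endian platforms.

```csharp
using System;
using System.IO;

public static class ReadStyles
{
    // Offset-based: the caller's offset is advanced past the value that was read.
    public static int ReadInt32(byte[] data, ref int offset)
    {
        int value = BitConverter.ToInt32(data, offset);
        offset += sizeof(int);
        return value;
    }

    // Reader-based: the underlying stream's position tracks progress.
    public static int ReadInt32(BinaryReader reader)
    {
        return reader.ReadInt32();
    }
}
```

With the reader-based version, creating one MemoryStream and one BinaryReader per byte[] and passing the reader to every Read method keeps the allocation cost to a single pair of objects, rather than one pair per call.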

If using a BinaryReader feels more natural, do that. I highly doubt there's any noticeable performance hit vs reading from an array.

Understanding Memory

In C#, does the following save any memory?
private List<byte[]> _stream;
public object Stream
{
get
{
if (_stream == null)
{
_stream = new List<byte[]>();
}
return _stream;
}
}
Edit: sorry, I guess I should have been more specific.
Specifically using "object" instead of List... I thought that would kinda clue itself in because it's a weird thing to do.

It saves a very small amount of memory. An empty List<byte[]> takes up only a handful of bytes.
The reason why is that your reference variable _stream only needs to allocate enough memory to hold a reference to an object. Once an object is allocated, it will take up a certain amount of memory which may grow or shrink over time, such as when new byte[]s are added to the List. However the memory taken up by the reference to that object will remain the same size.
This is simpler and less prone to corner cases that cause you headaches:
private List<byte[]> _stream = new List<byte[]>();
public object Stream
{
get
{
return _stream;
}
}
Although, in most cases it's not really optimal to be returning references to private members when they are collections/arrays, etc. Better to return _stream.AsReadOnly().
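On List<T> the read-only wrapper is obtained with AsReadOnly(), which returns a ReadOnlyCollection<T>. A small sketch, with Holder and Append as hypothetical names:

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

public class Holder
{
    private readonly List<byte[]> _stream = new List<byte[]>();

    // Read-only wrapper over the same list: callers cannot add or remove
    // entries through it, but they do see later changes made via Append.
    public ReadOnlyCollection<byte[]> Stream
    {
        get { return _stream.AsReadOnly(); }
    }

    public void Append(byte[] chunk)
    {
        _stream.Add(chunk);
    }
}
```

Note that the wrapper protects only the list itself; the individual byte[] entries can still be modified by whoever holds them.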

Save memory compared to what?
byte[][] _stream;
maybe? Then no, a List<T> will take up more memory since it is an array at its heart (which isn't necessarily exactly the size of its contents, but usually larger) and some statekeeping needs to be done too.

That is lazy loading. The stream (in your case, a list) is created only when someone first requests it.
One might say that it saves some memory, because nothing is allocated for the stream until it is actually used.

If your edit indicates that you are asking whether the use of the object keyword instead of List<byte[]> as the type of the property saves memory, no, it doesn't. And your if block only saves a negligible amount of memory (and cpu at instantiation) until the first time the property is called. And it does make the first call to that property slightly slower. Consider returning a null instead if it makes sense for the property. And, like another answerer suggested, it may be better to keep the property read-only unless you'd like other classes to be altering it. In general, I'd say attempts at optimization like this are mostly misguided and make your code less maintainable.

Are you sure a Stream wouldn't be just a byte[] or a List of byte? Or even better, a MemoryStream? :) I think you are somewhat confused, so a bigger example and some scenario details will help a lot.

What are objects really
I'd suggest thinking of objects as structs, and of object references as pointers to those structs.
When you instantiate an object, you reserve memory for a "struct" holding all its fields (plus a reference to the class it implements), along with any memory reserved by the constructor (other objects, arrays, etc.).
For a List, you reserve memory for its state keeping (I don't know how it's implemented in C#) and for the initial internal array, maybe of ten references. So if you add it up, you get something like this (assuming a 32-bit runtime; I'm not a .NET specialist):
pointer to class: 4 bytes
pointer to array: 4 bytes
array of initialCapacity references: 40 bytes
So in my estimation it's about 48 bytes. But it depends on the implementation.
As SoloBold says: most of the time it's not worth it.
