Fast writing byte array to file - C#

I use _FileStream.Write(_ByteArray, 0, _ByteArray.Length); to write a byte array to a file, and I've noticed it's very slow.
I read a line from a text file, convert it to a byte array, and then need to write it to a new, large (> 500 MB) file. Any advice on speeding up the write process would be appreciated.

FileStream.Write is basically what there is. It's possible that using a BufferedStream would help, but unlikely.
If you're really reading a single line of text which, when encoded, is 500MB then I wouldn't be surprised to find that most of the time is being spent performing encoding. You should be able to test that by doing the encoding and then throwing away the result.
Assuming the "encoding" you're performing is just Encoding.GetBytes(string), you might want to try using a StreamWriter to wrap the FileStream - it may work better by tricks like repeatedly encoding into the same array before writing to the file.
If you're actually reading a line at a time and appending that to the file, then:
Obviously it's best if you keep both the input stream and output stream open throughout the operation. Don't repeatedly read and then write.
You may get better performance using multiple threads or possibly asynchronous IO. This will partly depend on whether you're reading from and writing to the same drive.
Using a StreamWriter is probably still a good idea.
Additionally when creating the file, you may want to look at using a constructor which accepts a FileOptions. Experiment with the available options, but I suspect you'll want SequentialScan and possibly WriteThrough.
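Putting those suggestions together, a sketch might look like this (file names, buffer size, and options are illustrative, not prescriptive):

```csharp
using System.IO;

// Create the output file with a sequential-access hint, and wrap it in a
// StreamWriter so encoding reuses internal buffers instead of allocating
// a fresh byte array per line.
using (var fs = new FileStream("output.dat", FileMode.Create, FileAccess.Write,
                               FileShare.None, 64 * 1024,
                               FileOptions.SequentialScan))
using (var writer = new StreamWriter(fs))
{
    foreach (string line in File.ReadLines("input.txt"))
    {
        writer.WriteLine(line); // encoded and buffered internally
    }
}
```

Keeping both streams open for the whole run, as advised above, falls out naturally from the two using blocks.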

If you're writing nothing but byte arrays, have you tried using BinaryWriter's Write method? Writing in bulk would probably also help with the speed. Perhaps you could read each line, convert the string to its bytes, store those bytes for a future write operation (e.g. in a List or something), and every so often (after reading x lines) write a chunk to the disk.
BinaryWriter: http://msdn.microsoft.com/en-us/library/ms143302.aspx
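A rough sketch of that batched approach (the file names and the 1000-line chunk size are arbitrary choices for illustration):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Accumulate encoded lines and flush them in chunks via BinaryWriter.
var pending = new List<byte[]>();
using (var writer = new BinaryWriter(File.Open("output.bin", FileMode.Create)))
{
    foreach (string line in File.ReadLines("input.txt"))
    {
        pending.Add(Encoding.UTF8.GetBytes(line));
        if (pending.Count >= 1000)           // write every 1000 lines
        {
            foreach (byte[] chunk in pending) writer.Write(chunk);
            pending.Clear();
        }
    }
    foreach (byte[] chunk in pending) writer.Write(chunk); // remainder
}
```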

Related

Writing continuously to file vs saving data in an array and writing to file all at once

I am working in C#, and my program currently opens a file stream at launch and writes data to it in CSV format about every second. I am wondering: would it be more efficient to store this data in an ArrayList and write it all at once at the end, or to keep the file stream open and just write the data every second?
If the amount of data is "reasonably manageable" in memory, then write the data at the end.
If this is continuous, I wonder if an option could be to use something like NLog to write your CSV (create a specific log format), as it manages writes pretty efficiently. You would also need to set it to raise exceptions if there was an error.
You should consider using a BufferedStream instead. Write to the stream and allow the framework to flush to file as necessary. Just make sure to flush the stream before closing it.
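A minimal sketch of that suggestion (paths and content are placeholders):

```csharp
using System.IO;
using System.Text;

// Wrap the FileStream in a BufferedStream; small once-per-second writes
// are coalesced in memory and pushed to disk in larger blocks.
using (var file = new FileStream("log.csv", FileMode.Append))
using (var buffered = new BufferedStream(file))
{
    byte[] row = Encoding.UTF8.GetBytes("timestamp,value\n");
    buffered.Write(row, 0, row.Length);
    buffered.Flush(); // flush explicitly before closing
}
```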
From what I learned in Operating Systems, writing to a file is a lot more expensive than writing to memory. However, your stream is most likely going to be cached, which means that under the hood, all that file writing you are doing is happening in memory. The operating system handles the actual writing to the file asynchronously when the time is right. Depending on your application, there may be no need to worry about such micro-optimizations.
You can read more about why most languages take this approach under the hood here https://unix.stackexchange.com/questions/224415/whats-the-philosophy-behind-delaying-writing-data-to-disk
This kind of depends on your specific case. If you're writing data about once per second it seems likely that you're not going to see much of an impact from writing directly.
In general writing to a FileStream in small pieces is quite performant because the .NET Framework and the OS handle buffering for you. You won't see the file itself being updated until the buffer fills up or you explicitly flush the stream.
Buffering in memory isn't a terrible idea for smallish data and shortish periods. Of course if your program throws an exception or someone kills it before it writes to the disk then you lose all that information, which is probably not your favourite thing.
If you're worried about performance then use a logging thread. Post objects to it through a ConcurrentQueue<> or similar and have it do all the writes on a separate thread. Obviously threaded logging is more complex. It's not something I'd advise unless you really, really need the extra performance.
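For illustration, a bare-bones logging thread might look like this; it uses a BlockingCollection rather than a raw ConcurrentQueue<> so the worker can wait for items without spinning (the class name and its API are made up for this sketch):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Producers post strings to a thread-safe queue; a single background
// task performs all file writes.
class AsyncLogger : IDisposable
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly Task _worker;

    public AsyncLogger(string path)
    {
        _worker = Task.Run(() =>
        {
            using (var writer = new StreamWriter(path, append: true))
                foreach (string line in _queue.GetConsumingEnumerable())
                    writer.WriteLine(line);
        });
    }

    public void Log(string line) => _queue.Add(line);

    public void Dispose()
    {
        _queue.CompleteAdding(); // let the worker drain what's left
        _worker.Wait();
    }
}
```

Callers just call Log(); disposing the logger at shutdown drains anything still queued.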
For quick-and-dirty logging I generally just use File.AppendAllText() or File.AppendAllLines() to push the data out. It takes a bit longer, but it's pretty reliable. And I can read the output while the program is still running, which is often useful.

Read file faster by multiple readers

I have a large file with ~2 million lines. Reading the file is a bottleneck in my code. Any suggested ways and expert opinions on reading the file faster are welcome. The order in which lines are read from the text file is unimportant. All lines are pipe ('|') separated fixed-length records.
What have I tried? I started parallel StreamReaders and made sure the resource was locked properly, but this approach failed: I now had multiple threads fighting to get hold of the single StreamReader, wasting more time in locking and slowing the code down further.
One intuitive approach is to split the file and then read it, but I wish to leave the file intact and still somehow be able to read it faster.
I would try maximizing the buffer size. The default is 1024; increasing it should improve performance. I would suggest experimenting with different buffer sizes.
StreamReader(Stream, Encoding, Boolean, Int32) initializes a new instance of the StreamReader class for the specified stream, with the specified character encoding, byte order mark detection option, and buffer size.
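A sketch of the larger-buffer suggestion (the 64 KB figure is just an example; measure to find what suits your hardware):

```csharp
using System.IO;
using System.Text;

// Open the reader with a buffer much larger than the 1024-byte default.
using (var stream = File.OpenRead("records.txt"))
using (var reader = new StreamReader(stream, Encoding.UTF8,
                                     detectEncodingFromByteOrderMarks: true,
                                     bufferSize: 64 * 1024))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process the pipe-separated record
    }
}
```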
I understand now that my problem is not related to software; it is a 'mechanical' problem. Unless the hardware changes, there is no way to improve the reading performance. Why is that? A spinning disk has only one head to read with, so even if I try to read the file from both ends, for example, that same head now has to move even more to serve the two threads. Hence it is wiser to let the reader read sequentially; that is the maximum performance achievable.
Thank you all for the explanations. That helped me understand this concept. It may be a very basic and straightforward point for most people here on Stack Overflow, but from this question I really learned something about file reading and hardware performance, and finally understood the things taught to me in college.

Writing constantly Data into a File

My program receives data very frequently, up to 2-4 times per second. My goal is to take this data and write it into a file.
My question: is it smart to keep a file pointer constantly open? Might it be better to cache the data first and then write it into a file?
How is the performance?
Are there design patterns which are good for this? Any tips are welcome.
Actually, buffering is already implemented in the standard System.IO.FileStream: http://msdn.microsoft.com/en-us/library/system.io.filestream.aspx
Instead of writing constantly, all changes accumulate in a buffer and are flushed to disk as the buffer fills up. Just remember to specify the buffer size in the constructor and to call Flush when you finish.
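For example (the buffer size and path are illustrative):

```csharp
using System.IO;
using System.Text;

// Specify the buffer size in the constructor; writes accumulate in the
// buffer and only hit the disk when it fills.
using (var stream = new FileStream("data.log", FileMode.Append,
                                   FileAccess.Write, FileShare.Read,
                                   bufferSize: 32 * 1024))
{
    byte[] sample = Encoding.UTF8.GetBytes("sample entry\n");
    stream.Write(sample, 0, sample.Length); // buffered, not yet on disk
    stream.Flush(); // call when you finish, or rely on Dispose
}
```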

difference between memory stream and filestream

During serialization we can use either a memory stream or a file stream.
What is the basic difference between these two? What does a memory stream mean?
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

namespace Serialization
{
    [Serializable]
    class Person // a serializable type to give Serialize something to work with
    {
        public string Name { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var person = new Person { Name = "Jane" };

            using (var aStream = new MemoryStream())
            {
                var aBinaryFormat = new BinaryFormatter();
                aBinaryFormat.Serialize(aStream, person);
            }
        }
    }
}
Stream is a representation of bytes. Both of these classes derive from the Stream class, which is abstract by definition.
As the name suggests, a FileStream reads and writes to a file whereas a MemoryStream reads and writes to the memory. So it relates to where the stream is stored.
Now it depends how you plan to use both of these. For example, let us assume you want to read binary data from the database; you would go for a MemoryStream. However, if you want to read a file on your system, you would go for a FileStream.
One quick advantage of a MemoryStream is that there is no need to create temporary buffers and files in an application.
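A small sketch showing that the two classes share the same Stream API and differ only in where the bytes end up:

```csharp
using System.IO;
using System.Text;

byte[] payload = Encoding.UTF8.GetBytes("hello streams");

// Same Write call, different backing store:
using (var memory = new MemoryStream())          // bytes live in RAM
{
    memory.Write(payload, 0, payload.Length);
    byte[] snapshot = memory.ToArray();          // no temp file needed
}

using (var file = new FileStream("hello.bin", FileMode.Create))
{
    file.Write(payload, 0, payload.Length);      // bytes go to disk
}
```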
The other answers here are great, but I thought one that takes a really high-level look at what purpose streams serve might be useful. There's a bit of simplification going on in the explanation below, but hopefully it gets the idea across:
What is a stream?
A stream is effectively the flow of data between two places, it's the pipe rather than the contents of that pipe.
A bad analogy to start
Imagine a water desalination plant (something that takes seawater, removes the salt and outputs clean drinking water to the water network):
The desalination plant can't remove the salt from all of the sea at one time (and nor would we want it to… where would the saltwater fish live?), so instead we have:
A SeaStream that sucks a set amount of water at a time into the plant.
That SeaStream is connected to the DesalinationStream to remove the salt
And the output of the DesalinationStream is connected to the DrinkingWaterNetworkStream to output the now saltless water to the drinking water supply.
OK, so what's that got to do with computers?
Moving big files all at once can be problematic
Frequently in computing we want to move data between two locations, e.g. from an external hard drive to a binary field in a database (to use the example given in another answer). We can do that by copying all of the data from the file at location A into the computer's memory and from there to location B, but if the file is large, or the source or destination is potentially unreliable, then moving the whole file at once may be either unfeasible or unwise.
For example, say we want to move a large file on a USB stick to a field in a database. We could use a 'System.IO.File' object to retrieve that whole file into the computer's memory and then use a database connection to pass that file onto the database.
But, that's potentially problematic, what if the file is larger than the computer's available RAM? Now the file will potentially be cached to the hard drive, which is slow, and it might even slow the computer down too.
Likewise, what if the data source is unreliable, e.g. copying a file from a network drive with a slow and flaky WiFi connection? Trying to copy a large file in one go can be infuriating because you get half the file and then the connection drops out and you have to start all over again, only for it to potentially fail again.
It can be better to split the file and move it a piece at a time
So, rather than getting whole file at once, it would be better to retrieve the file a piece at a time and pass each piece on to the destination one at a time. This is what a Stream does and that's where the two different types of stream you mentioned come in:
We can use a FileStream to retrieve data from a file a piece at a time
and the database API may make available a MemoryStream endpoint we can write to a piece at a time.
We connect those two 'pipes' together to flow the file pieces from file to database.
Even if the file wasn't too big to be held in RAM, without streams we were still doing a number of read/write operations that we didn't need to. The stages we were carrying out were:
Retrieving the data from the disk (slow)
Writing to a File object in the computer's memory (a bit faster)
Reading from that File object in the computer's memory (faster again)
Writing to the database (probably slow as there's probably a spinning disk hard-drive at the end of that pipe)
Streams allow us to conceptually do away with the middle two stages: instead of dragging the whole file into computer memory at once, we take the output of the operation that retrieves the data and pipe it straight to the operation that passes the data on to the database.
Other benefits of streams
Separating the retrieval of the data from the writing of the data like this also allows us to perform actions between retrieving the data and passing it on. For example, we could add an encryption stage, or we could write the incoming data to more than one type of output stream (e.g. to a FileStream and a NetworkStream).
Streams also allow us to write code where we can resume the operation should the transfer fail part way through. By keeping track of the number of pieces we've moved, if the transfer fails (e.g. if the network connection drops out) we can restart the Stream from the point at which we received the last piece (this is the offset in the BeginRead method).
In simplest form, a MemoryStream writes data to memory, while a FileStream writes data to a file.
Typically, I use a MemoryStream if I need a stream, but I don't want anything to hit the disk, and I use a FileStream when writing a file to disk.
While a file stream reads from a file, a memory stream can be used to read data mapped in the computer's internal memory (RAM). You are basically reading/writing streams of bytes from memory.
Having had a bitter experience on the subject, here's what I've found out: if performance is required, you should copy the contents of a FileStream to a MemoryStream. I had to process the contents of 144 files, 528 KB each, and present the outcome to the user. It took approximately 250 seconds (!). When I just copied the contents of each FileStream to a MemoryStream (the CopyTo method) without changing anything else, the time dropped to approximately 32 seconds. Note that each time you copy one stream to another, the data is appended at the end of the destination stream, so you may need to 'rewind' it first. Hope it helps.
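A sketch of that copy-then-rewind pattern (the path is a placeholder):

```csharp
using System.IO;

// Copy a file's contents into memory once, then do all further reads
// against the MemoryStream. CopyTo appends at the current position,
// so rewind the destination before reading it back.
using (var file = File.OpenRead("input.dat"))
using (var memory = new MemoryStream())
{
    file.CopyTo(memory);
    memory.Position = 0; // 'rewind' before reading
    // ... process memory instead of repeatedly hitting the file ...
}
```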
In regards to the stream itself, in general it means that when you put content into the stream (memory), it will not put all the content of whatever data source you are working with (file, DB...) into memory. As opposed to, for example, arrays or buffers, where you feed everything into memory, a stream gets a chunk of, e.g., a file into memory. When you reach the end of the chunk, the stream gets the next chunk from the file into memory. It all happens in the low-level background while you are just iterating the stream. That's why it's called a stream.
A memory stream handles data via an in memory buffer. A filestream deals with files on disk.
Serializing objects in memory is hardly useful, in my opinion. You need to serialize an object when you want to save it to disk. Typically, serialization is done from the object (which is in memory) to the disk, while deserialization is done from the saved serialized object (on the disk) back to the object (in memory).
So, most of the time you want to serialize to disk, and thus you use a FileStream for serialization.

Larger File Streams using C#

There are some text files (records) which I need to access using C#.NET, but these files are larger than 1 GB (minimum size is 1 GB).
What do I need to do?
What factors do I need to concentrate on?
Can someone give me an idea of how to overcome this situation?
EDIT:
Thanks for the fast responses. Yes, they are fixed-length records. These text files come from a local company (their last month's transaction records).
Is it possible to access these files like normal text files (using a normal file stream)?
And how about memory management?
Expanding on CasperOne's answer
Simply put, there is no way to reliably put a 100 GB file into memory at one time. On a 32-bit machine there is simply not enough addressing space. On a 64-bit machine there is enough addressing space, but in the time it would take to actually get the file into memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream class is designed to process a variable (and possibly infinite) stream in distinct quantities. It has several Read methods that will only progress down a stream a specific number of bytes. You will need to use these methods in order to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details on your record delimiters or some sample lines from the file?
Update
If they are fixed length records then System.IO.Stream will work just fine. You can even use File.Open() to get access to the underlying Stream object. Stream.Read has an overload that requests the number of bytes to be read from the file. Since they are fixed length records this should work well for your scenario.
As long as you don't call ReadAllText() and instead use the Stream.Read() methods which take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (that is of course, unless you ask it to :) ).
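A sketch of chunked reading for fixed-length records (the 128-byte record size is an assumption for illustration); note the loop around Stream.Read, since a single call may return fewer bytes than requested:

```csharp
using System.IO;

const int RecordSize = 128;        // illustrative record length
byte[] record = new byte[RecordSize];

using (var stream = File.OpenRead("records.dat"))
{
    int filled;
    while ((filled = ReadRecord(stream, record)) == RecordSize)
    {
        // process one complete record; only RecordSize bytes
        // are ever held in memory at a time
    }
}

// Keep calling Read until the buffer is full or the file ends.
static int ReadRecord(Stream stream, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = stream.Read(buffer, total, buffer.Length - total);
        if (read == 0) break; // end of file
        total += read;
    }
    return total;
}
```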
You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.
What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
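The code sample from this answer didn't survive, but the iterator-block idea it refers to might be sketched like this:

```csharp
using System.Collections.Generic;
using System.IO;

// Yield one record at a time; only the current line is ever in memory.
static IEnumerable<string> ReadRecords(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }
}

foreach (string record in ReadRecords("large.txt"))
{
    // process each record as it streams past
}
```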
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?
Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed length strings etc) you can use the BinaryReader class. Easier than pulling out n bytes and then trying to interrogate that.
Also note that the Read method on System.IO.Stream does not guarantee a full read: if you ask for 100 bytes, it may return fewer than that without having reached the end of the file.
The BinaryReader.ReadBytes method will block until it reads the requested number of bytes, or End of file - which ever comes first.
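A small example of that contrast (the path is a placeholder):

```csharp
using System.IO;

using (var reader = new BinaryReader(File.OpenRead("data.bin")))
{
    // ReadBytes keeps reading until it has 100 bytes or hits end of
    // file, unlike a single Stream.Read call, which may return fewer
    // bytes even mid-file.
    byte[] chunk = reader.ReadBytes(100);
    if (chunk.Length < 100)
    {
        // end of file reached
    }
}
```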
Nice collaboration lads :)
Hey Guys, I realize that this post hasn't been touched in a while, but I just wanted to post a site that has the solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ
