Reading large file in chunks in C#

I want to read a very large file (around 4 GB) chunk by chunk.
I am currently trying to use a StreamReader and its Read() method. The signature is:
sr.Read(char[] buffer, int index, int count)
Because the index is an int, it will overflow in my case. What should I use instead?

The index is the starting index within buffer, not the position of the file pointer; usually it will be zero. On each Read call you read up to count characters. You do not read the whole file at once; instead you read a chunk at a time and work with that chunk.
Per the documentation, index is "the index of buffer at which to begin writing".
char[] c = new char[1024];
while (sr.Peek() >= 0)
{
    int charsRead = sr.Read(c, 0, c.Length);
    // Write only the characters actually read; Read may
    // fill less than the whole buffer.
    Console.Write(new string(c, 0, charsRead));
}
The above example reads up to 1024 characters per iteration and writes them to the console. You can use each chunk as it arrives, for instance sending it to another application over a TCP connection.
When using the Read method, it is more efficient to use a buffer that
is the same size as the internal buffer of the stream, where the
internal buffer is set to your desired block size, and to always read
less than the block size. If the size of the internal buffer was
unspecified when the stream was constructed, its default size is 4
kilobytes (4096 bytes). (From MSDN.)

You could try the simpler overload of Read, which doesn't read into a buffer but instead reads character by character. You would have to implement the chunking yourself, but it gives you more control, allowing you to track your position with a long instead.
http://msdn.microsoft.com/en-us/library/ath1fht8(v=vs.110).aspx
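As a sketch of that idea, the chunking loop can keep its running total in a long so a ~4 GB file never overflows an int; the file path here is hypothetical:

// Sketch: chunked read with a long running total, so sizes
// beyond int.MaxValue are not a problem. Path is made up.
using (var sr = new StreamReader(@"C:\data\bigfile.txt"))
{
    var buffer = new char[4096];
    long totalChars = 0;
    int charsRead;
    while ((charsRead = sr.Read(buffer, 0, buffer.Length)) > 0)
    {
        totalChars += charsRead; // long accumulator, no overflow
        // process buffer[0..charsRead) here
    }
}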


Offset and length were out of bounds for the array

My code
private static int readinput(byte[] buff, BinaryReader reader)
{
int size = reader.ReadInt32();
reader.Read(buff, 0, size);
return size;
}
The exception is thrown at reader.Read(buff, 0, size);
The message is: "Offset and length were out of bounds for the array or count is greater than the number of elements from index to the end of the source collection."
Take a step back and think about your code.
You've written a method that takes an array of bytes. We don't know how big this array is; it's controlled by the calling code. Let's assume it is 1000 bytes long.
Then you read an int from somewhere else; let's assume 2000 is read.
Then you attempt to read 2000 bytes into an array that can only hold 1000 bytes. You perform no check that your array is big enough, nor do you read in chunks and concatenate when it isn't.
That's why you get the error you're getting. As for what you should be coding, you need to think about that some more: perhaps size the buffer based on the size int you read, or read in chunks.
The buffer buff that you passed into your function is too small. buff.Length should be greater than or equal to your variable size.
Set a breakpoint on reader.Read(buff, 0, size); and hover over buff and size, and you'll see what I mean.
Make sure that when you call your function, the buff you pass in is of sufficient size. If you don't know the required buffer size ahead of time, change your function to look something like this:
private static byte[] ReadInput(BinaryReader reader)
{
int size = reader.ReadInt32();
return reader.ReadBytes(size);
}
Especially since you're just reading into the beginning of the provided buffer anyway.
To summarize what you're currently doing: you provided a function which takes a BinaryReader (at whatever position it's already at; position 0 if it's new). It reads a 32-bit integer (4 bytes) to determine the size of the data that follows, then reads that much data into the provided buffer buff. You need to be sure that, whatever size of data you're going to read, the buffer provided to the function is large enough. If you make the buffer too large, reader.Read(buff, 0, size) only fills the beginning of it. So if your intention was to read the data into a perfectly sized buffer, I suggest using the code above.
Just thought I'd explain it a bit more in case that helps you understand what's going on.
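For illustration, here is a hypothetical round trip that writes a length-prefixed record and reads it back with the ReadBytes-based version; the MemoryStream and payload are made up for the example:

// Hypothetical round trip of a length-prefixed record.
byte[] payload = { 1, 2, 3, 4, 5 };
using (var ms = new MemoryStream())
{
    using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
    {
        writer.Write(payload.Length); // 4-byte length prefix
        writer.Write(payload);        // the data itself
    }
    ms.Position = 0;
    using (var reader = new BinaryReader(ms))
    {
        int size = reader.ReadInt32();
        byte[] data = reader.ReadBytes(size); // buffer is sized for us
    }
}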

c# socket receive byte array length

I'm trying to learn to use sockets in C# and I have a question. I'm using code something like this:
byte[] data = new byte[64];
int length = 0;
length = sock.Receive(data);
//more code...
So the byte[] data is filled with the received data, and the remaining space in the array is filled with 0s. Is the byte[] allocated in memory completely (all 64 bytes)? If so, is there a way to make the byte[] the same size as the data actually sent?
You can check sock.Available to see what has already come in (so far):
byte[] data = new byte[sock.Available];
int length = sock.Receive(data);
//more code...
Note: Since you may or may not know what is coming in next on the network, it usually makes more sense to read only a header (with size info) first, or to allocate more space than necessary and call .Receive() multiple times until the end of a record is reached.
Note: This code assumes you already know there is some data to receive and you've waited long enough for some useful amount of data to be ready.
If you do choose to use length headers, .Available can help you avoid reading a partial header and having to re-assemble it which is nice. (Only large messages may need manual reassembly in that case)
You simply need to use the return value of Receive to know how much data has arrived. You can shorten the buffer using Array.Resize if you want, but needing to would normally be a sign that something is wrong.
Also note that TCP is a stream of bytes and does not preserve message boundaries.
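A minimal sketch of that: copy only the number of bytes Receive reported into a right-sized array (sock is assumed to be a connected Socket):

byte[] data = new byte[64];
int length = sock.Receive(data);  // how many bytes actually arrived
byte[] exact = new byte[length];
Array.Copy(data, exact, length);  // exact holds only the received bytes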
As noted, Read may return fewer bytes than it was asked for. See the workaround function below, which ensures it reads exactly as many bytes as requested, i.e. the size of the passed buffer. The function is from here.
/// <summary>
/// Reads data into a complete array, throwing an EndOfStreamException
/// if the stream runs out of data first, or if an IOException
/// naturally occurs.
/// </summary>
/// <param name="stream">The stream to read data from</param>
/// <param name="data">The array to read bytes into. The array
/// will be completely filled from the stream, so an appropriate
/// size must be given.</param>
public static void ReadWholeArray(Stream stream, byte[] data)
{
    int offset = 0;
    int remaining = data.Length;
    while (remaining > 0)
    {
        int read = stream.Read(data, offset, remaining);
        if (read <= 0)
            throw new EndOfStreamException(
                String.Format("End of stream reached with {0} bytes left to read", remaining));
        remaining -= read;
        offset += read;
    }
}
You can use this method first to read, say, a 2-byte integer representing the number of bytes that will follow, and then read again, this time as many bytes as that integer specifies.
For this to work, the sender clearly has to send a 2-byte integer giving the length of the data that follows, and then the data itself.
So basically you call the function above on a byte array of size two first (to get the data length), and then on a byte array of the size indicated by that 2-byte integer (to get the data).
You can use this to read from NetworkStream. Some more reading on this topic.
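Putting the pieces together, here is a sketch of the length-prefixed read built on ReadWholeArray; the method name ReadMessage and the 2-byte little-endian header format are assumptions for the example:

// Sketch: read a 2-byte length header, then exactly that many payload bytes.
public static byte[] ReadMessage(Stream stream)
{
    byte[] header = new byte[2];
    ReadWholeArray(stream, header);                // always fills all 2 bytes
    int length = BitConverter.ToUInt16(header, 0); // little-endian length
    byte[] payload = new byte[length];
    ReadWholeArray(stream, payload);               // always fills the payload
    return payload;
}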

Limiting the size of data being read from a CSV file so it reads only full lines

I want to use C# to read a CSV file of about 10 GB. I can't read the file one line at a time, and I am limited to reading a maximum chunk of 32 MB at a time.
How can I limit the size of the data I'm reading but also make sure I'm reading only full lines? That is, if a full 32 MB comes to, say, 100.5 lines, I want to read only the 100 full lines and leave out the half line, even if it means reading less than 32 MB.
This is the skeleton code I was thinking about (the comments there hold more questions):
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
using (FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
    while ((bytesRead = fileStream.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        //should I somehow analyze here whether what I'm reading contains only full lines?
        //and if so, how can I know when I'm reading less than 32MB,
        //meaning bytesRead is less than that and I may read the rest of the line in the next iteration?
    }
}
You don't need to ensure you're reading full lines.
Read the file in chunks into a buffer.
Process each line from your buffer character by character until you reach a newline character. If you're mid-line when you reach the end of the buffer, keep that portion around, read the next chunk, and concatenate everything from the new read up to a newline with the leftovers from the previous read.
If the very last byte of the buffer is a newline, you have a whole line and can simply move on to the next chunk. If not, read the next chunk: either its first byte will be a newline, or there will be other characters before it. Either way, concatenate everything up to the newline (even if that means 0 characters) and start on the next line.
If you hit end of file right after a newline, you're done. If you hit end of file while processing non-newline characters, it's up to you whether to keep them as a valid line or discard them.
This is very similar to a circular buffer.
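A sketch of that carry-over approach; it assumes '\n' line endings, treats the 32 MB limit in characters for simplicity, and ProcessLine is a hypothetical per-line handler:

const int MAX_BUFFER = 33554432; // 32MB
string leftover = string.Empty;
using (var sr = new StreamReader(filePath))
{
    var buffer = new char[MAX_BUFFER];
    int read;
    while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
    {
        string chunk = leftover + new string(buffer, 0, read);
        int lastNewline = chunk.LastIndexOf('\n');
        if (lastNewline < 0)
        {
            leftover = chunk; // no complete line yet, keep everything
            continue;
        }
        foreach (string line in chunk.Substring(0, lastNewline).Split('\n'))
            ProcessLine(line.TrimEnd('\r')); // strip '\r' in case of CRLF
        leftover = chunk.Substring(lastNewline + 1); // partial last line
    }
    if (leftover.Length > 0)
        ProcessLine(leftover); // trailing line with no final newline
}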
Another solution might be to use a BufferedStream and specify the buffer size. Then just read byte by byte to each newline or EOF.

StreamReader Read method doesn't read number of chars specified

I have to parse a large file so instead of doing:
string unparsedFile = myStreamReader.ReadToEnd(); // takes 4 seconds
parse(unparsedFile); // takes another 4 seconds
I want to take advantage of the first 4 seconds and try to do both things at the same time by doing something like:
while (true)
{
char[] buffer = new char[1024];
var charsRead = sr.Read(buffer, 0, buffer.Length);
if (charsRead < 1)
break;
if (charsRead != 1024)
{
Console.Write("Here"); // debugger stops here several times - why?
}
addChunkToQueue(buffer);
}
Here is the image of the debugger (I added an int counter to show on which iteration we read fewer than 1024 chars):
Note that 643 chars were read, not 1024. On the next iteration I get:
I think I should read 1024 chars every time until the last iteration, where the remaining chars are fewer than 1024.
So my question is: why do I read a "random" number of chars as I iterate through the while loop?
Edit
I don't know what kind of stream I am dealing with. I execute a process like:
ProcessStartInfo psi = new ProcessStartInfo("someExe.exe")
{
RedirectStandardError = true,
RedirectStandardOutput = true,
UseShellExecute = false,
CreateNoWindow = true,
};
// execute command and return output of command
using (var proc = new Process())
{
proc.StartInfo = psi;
proc.Start();
var output = proc.StandardOutput; // <------------- this is where I get the strem
//if (string.IsNullOrEmpty(output))
//output = proc.StandardError.ReadToEnd();
return output;
}
For one thing, you're reading characters, not bytes. There's a huge difference.
As for why it doesn't necessarily read everything all at once: maybe there isn't that much data available, and StreamReader has decided to give you what it's got rather than blocking for an indeterminate amount of time to fill your buffer. It's entirely within its rights to do so.
Is this coming from a local file, or over the network? Normally local file operations are much more likely to fill the buffer than network downloads, but either way you simply shouldn't rely on the buffer being filled. If it's a "file" (i.e. read using FileStream) but it happens to be sitting on a network share... well, that's a grey area in my knowledge :) It's a stream - treat it that way.
It depends on the actual stream you are reading. If this is the file stream I guess it is rather unlikely to get "partial" data. However, if you read from a network stream, you have to expect the data to come in chunks of different length.
From the docs: http://msdn.microsoft.com/en-us/library/9kstw824
When using the Read method, it is more efficient to use a buffer that
is the same size as the internal buffer of the stream, where the
internal buffer is set to your desired block size, and to always read
less than the block size. If the size of the internal buffer was
unspecified when the stream was constructed, its default size is 4
kilobytes (4096 bytes). If you manipulate the position of the
underlying stream after reading data into the buffer, the position of
the underlying stream might not match the position of the internal
buffer. To reset the internal buffer, call the DiscardBufferedData
method; however, this method slows performance and should be called
only when absolutely necessary.
So for the return value, the docs say:
The number of characters that have been read, or 0 if at the end of
the stream and no data was read. The number will be less than or equal
to the count parameter, depending on whether the data is available
within the stream.
Or, to summarize - your buffer and the underlying buffer are not the same size, thus you get partial fill of your buffer, as the underlying one is not being filled up yet.
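If you really do need a full buffer per iteration, you can loop until the buffer is filled or the stream ends. This ReadFully helper is a sketch, not part of the framework:

// Sketch: keep calling Read until the buffer is full or the stream ends.
static int ReadFully(TextReader reader, char[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = reader.Read(buffer, total, buffer.Length - total);
        if (read == 0)
            break; // end of stream
        total += read;
    }
    return total; // less than buffer.Length only at end of stream
}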

Reading data from a particular location of a FileStream using .NET

I am trying to read data from a file stream as shown below:
fileStream.Read(byteArray, offset, length);
The problem is that my offset and length are unsigned ints (uint), and the function above accepts only ints. If I cast to int, I get a negative value for the offset, which is meaningless and not accepted by the function.
The offset and length are originally taken from another byte array as shown below:
BitConverter.ToUInt32(length, 0); // length is a 4-byte byte array
What is the right way to read from arbitrary locations of a file stream?
I am not sure if this is the best way to handle it, but you can change the position of the stream and then use a buffer offset of 0. Position is of type long, so large file offsets are fine.
fileStream.Position = (long)offset;
fileStream.Read(byteArray, 0, (int)length);
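The same idea with Seek, looping because Read may return fewer bytes than requested; the helper name ReadAt is made up for this sketch:

// Sketch: read 'length' bytes starting at byte 'offset' of the file.
static byte[] ReadAt(FileStream fileStream, uint offset, uint length)
{
    fileStream.Seek(offset, SeekOrigin.Begin); // Seek takes a long
    var result = new byte[length];
    int total = 0;
    while (total < result.Length)
    {
        int read = fileStream.Read(result, total, result.Length - total);
        if (read == 0)
            break; // end of file before 'length' bytes were available
        total += read;
    }
    return result;
}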
For such a file size you should read your file in small blocks, process each block, and read the next. int.MaxValue is about ~2 GB, uint.MaxValue ~4 GB. Files that size don't fit in most computers' RAM. ;)
If you are having problems with the conversion, something like this might help:
uint myUInt;
int i = (int)myUInt;
// or
int i = Convert.ToInt32(myUInt);
