I have to parse a large file so instead of doing:
string unparsedFile = myStreamReader.ReadToEnd(); // takes 4 seconds
parse(unparsedFile); // takes another 4 seconds
I want to take advantage of the first 4 seconds and try to do both things at the same time by doing something like:
while (true)
{
char[] buffer = new char[1024];
var charsRead = sr.Read(buffer, 0, buffer.Length);
if (charsRead < 1)
break;
if (charsRead != 1024)
{
Console.Write("Here"); // debuger stops here several times why?
}
addChunkToQueue(buffer);
}
here is the image of the debuger: (I added int counter to show on what iteration we read less than 1024 bytes)
Note that there where 643 chars read and not 1024. On the next iteration I get:
I think I should read 1024 bytes all the time until I get to the last iteration where the remeining bytes are less than 1024.
So my question is why will I read "random" number of chars as I iterate throw the while loop?
Edit
I don't know what kind of stream I am dealing with. I Execute a process like:
ProcessStartInfo psi = new ProcessStartInfo("someExe.exe")
{
RedirectStandardError = true,
RedirectStandardOutput = true,
UseShellExecute = false,
CreateNoWindow = true,
};
// execute command and return ouput of command
using (var proc = new Process())
{
proc.StartInfo = psi;
proc.Start();
var output = proc.StandardOutput; // <------------- this is where I get the strem
//if (string.IsNullOrEmpty(output))
//output = proc.StandardError.ReadToEnd();
return output;
}
}
For one thing, you're reading characters, not bytes. There's a huge difference.
As for why it doesn't necessarily read everything all at once: maybe there isn't that much data available, and StreamReader has decided to give you what it's got rather than blocking for an indeterminate amount of time to fill your buffer. It's entirely within its rights to do so.
Is this coming from a local file, or over the network? Normally local file operations are much more likely to fill the buffer than network downloads, but either way you simply shouldn't rely on the buffer being filled. If it's a "file" (i.e. read using FileStream) but it happens to be sitting on a network share... well, that's a grey area in my knowledge :) It's a stream - treat it that way.
It depends on the actual stream you are reading. If this is the file stream I guess it is rather unlikely to get "partial" data. However, if you read from a network stream, you have to expect the data to come in chunks of different length.
From the docs: http://msdn.microsoft.com/en-us/library/9kstw824
When using the Read method, it is more efficient to use a buffer that
is the same size as the internal buffer of the stream, where the
internal buffer is set to your desired block size, and to always read
less than the block size. If the size of the internal buffer was
unspecified when the stream was constructed, its default size is 4
kilobytes (4096 bytes). If you manipulate the position of the
underlying stream after reading data into the buffer, the position of
the underlying stream might not match the position of the internal
buffer. To reset the internal buffer, call the DiscardBufferedData
method; however, this method slows performance and should be called
only when absolutely necessary.
So for the return value, the docs says:
The number of characters that have been read, or 0 if at the end of
> the stream and no data was read. The number will be less than or equal
to the count parameter, depending on whether the data is available
within the stream.
Or, to summarize - your buffer and the underlying buffer are not the same size, thus you get partial fill of your buffer, as the underlying one is not being filled up yet.
Related
My current approach is to read the COM stream into a C# MemoryStream and then call .toArray. However, I believe toArray creates a redundant copy of the data. Is there a better way that has reduced memory usage as the priority?
var memStream = new MemoryStream(10000);
var chunk = new byte[1000];
while (true)
{
int bytesRead = comStream.read(ref chunk, chunk.Length);
if (bytesRead == 0)
break; // eos
memStream.Write(chunk, 0, bytesRead);
}
//fairly sure this creates a duplicate copy of the data
var array = memStream.ToArray();
//does this also dupe the data?
var array2 = memStream.GetBuffer();
If you know the length of the data before you start consuming it, then: you can allocate a simple byte[] and fill that in your read loop simply by incrementing an offset each read with the number of bytes read (and decrementing your "number of bytes you're allowed to touch). This does depend on having a read overload / API that accepts either an offset or a pointer, though.
If that isn't an option: GetBuffer() is your best bet - it doesn't duplicate the data; rather it hands you the current possibly oversized byte[]. Because it is oversized, you must consider it in combination with the current .Length, perhaps wrapping the length/data pair in either a ArraySegment<byte>, or a Span<byte>/Memory<byte>.
In the "the length is known" scenario, if you're happy to work with oversized buffers, you could also consider a leased array, via ArrayPool<byte>.Shared - rent one of at least that size, fill it, then constrain your segment/span to the populated part (and remember to return it to the pool when you're done).
I want to read very large file (4GBish) chunk by chunk.
I am currently trying to use a StreamReader and the Read() read method. The syntax is:
sr.Read(char[] buffer, int index, int count)
Because the index is an int it will overflow in my case. What should i use instead?
The index is the starting index of buffer not the index of file pointer, usually it would be zero. On each Read call you will read characters equal to the count parameter of Read method. You would not read all the file at once rather read in chunks and use that chunk.
The index of buffer at which to begin writing, reference.
char[] c = null;
while (sr.Peek() >= 0)
{
c = new char[1024];
sr.Read(c, 0, c.Length);
//The output will look odd, because
//only five characters are read at a time.
Console.WriteLine(c);
}
The above example will ready 1024 bytes and will write to console. You can use these bytes, for instance sending these bytes to other application using TCP connection.
When using the Read method, it is more efficient to use a buffer that
is the same size as the internal buffer of the stream, where the
internal buffer is set to your desired block size, and to always read
less than the block size. If the size of the internal buffer was
unspecified when the stream was constructed, its default size is 4
kilobytes (4096 bytes), MSDN.
You could try the simpler version of Read which doesn't chunk the stream but instead reads it character by character. You would have to implement the chunking your self, but it would give you more control allowing you to use a Long instead.
http://msdn.microsoft.com/en-us/library/ath1fht8(v=vs.110).aspx
I would like to be able to get the length of the data available from a TCP network stream in C# to set the size of the buffer before reading from the network stream. There is a NetworkStream.Length property but it isn't implemented yet, and I don't want to allocate an enormous size for the buffer as it would take up too much space. The only way I though of doing it would be to precede the data transfer with another telling the size, but this seems a little messy. What would be the best way for me to go about doing this.
When accessing Streams, you usually read and write data in small chunks (e.g. a kilobyte or so), or use a method like CopyTo that does that for you.
This is an example using CopyTo to copy the contents of a stream to another stream and return it as a byte[] from a method, using an automatically-sized buffer.
using (MemoryStream ms = new MemoryStream())
{
networkStream.CopyTo(ms);
return ms.ToArray();
}
This is code that reads data in the same way, but more manually, which might be better for you to work with, depending on what you're doing with the data:
byte[] buffer = new byte[2048]; // read in chunks of 2KB
int bytesRead;
while((bytesRead = networkStream.Read(buffer, 0, buffer.Length)) > 0)
{
//do something with data in buffer, up to the size indicated by bytesRead
}
(the basis for these code snippets came from Most efficient way of reading data from a stream)
There is no inherent length of a network stream. You will either have to send the length of the data to follow from the other end or read all of the incoming data into a different stream where you can access the length information.
The thing is, you can't really be sure all the data is read by the socket yet, more data might come in at any time. This is try even if you somehow do know how much data to expect, say if you have a package header that contains the length. the whole packet might not be received yet.
If you're reading arbitrary data (like a file perhaps) you should have a buffer of reasonable size (like 1k-10k or whatever you find to be optimal for your scenario) and then write the data to a file as its read from the stream.
var buffer = byte[1000];
var readBytes = 0;
using(var netstream = GetTheStreamSomhow()){
using(var fileStream = (GetFileStreamSomeHow())){
while(netstream.Socket.Connected) //determine if there is more data, here we read until the socket is closed
{
readBytes = netstream.Read(buffer,0,buffer.Length);
fileStrem.Write(buffer,0,buffer.Length);
}
}
}
Or just use CopyTo like Tim suggested :) Just make sure that all the data has indeed been read, including data that hasn't gotten across the network yet.
You could send the lenght of the incoming data first.
For example:
You have data = byte[16] you want to send. So at first you send the 16 and define on the server, that this length is always 2 (because 16 has two characters). Now you know that the incomingLength = 16. You can wait now for data of the lenght incomingLength.
Which one is better : MemoryStream.WriteTo(Stream destinationStream) or Stream.CopyTo(Stream destinationStream)??
I am talking about the comparison of these two methods without Buffer as I am doing like this :
Stream str = File.Open("SomeFile.file");
MemoryStream mstr = new MemoryStream(File.ReadAllBytes("SomeFile.file"));
using(var Ms = File.Create("NewFile.file", 8 * 1024))
{
str.CopyTo(Ms) or mstr.WriteTo(Ms);// Which one will be better??
}
Update
Here is what I want to Do :
Open File [ Say "X" Type File]
Parse the Contents
From here I get a Bunch of new Streams [ 3 ~ 4 Files ]
Parse One Stream
Extract Thousands of files [ The Stream is an Image File ]
Save the Other Streams To Files
Editing all the Files
Generate a New "X" Type File.
I have written every bit of code which is actually working correctly..
But Now I am optimizing the code to make the most efficient.
It is an historical accident that there are two ways to do the same thing. MemoryStream always had the WriteTo() method, Stream didn't acquire the CopyTo() method until .NET 4.
The MemoryStream.WriteTo() version looks like this:
public virtual void WriteTo(Stream stream)
{
// Exception throwing code elided...
stream.Write(this._buffer, this._origin, this._length - this._origin);
}
The Stream.CopyTo() implementation like this:
private void InternalCopyTo(Stream destination, int bufferSize)
{
int num;
byte[] buffer = new byte[bufferSize];
while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
{
destination.Write(buffer, 0, num);
}
}
Stream.CopyTo() is more universal, it works for any stream. And helps programmers that fumble copying data from, say, a NetworkStream. Forgetting to pay attention to the return value from Read() was a very common bug. But it of course copies the bytes twice and allocates that temporary buffer, MemoryStream doesn't need it since it can write directly from its own buffer. So you'd still prefer WriteTo(). Noticing the difference isn't very likely.
MemoryStream.WriteTo: Writes the entire contents of this memory stream to another stream.
Stream.CopyTo: Reads the bytes from the current stream and writes them to the destination stream. Copying begins at the current position in the current stream.
You'll need to seek back to 0, to get the whole source stream copied.
So I think MemoryStream.WriteTo better option for this situation
If you use Stream.CopyTo, you don't need to read all the bytes into memory to start with. However:
This code would be simpler if you just used File.Copy
If you are going to load all the data into memory, you can just use:
byte[] data = File.ReadAllBytes("input");
File.WriteAllBytes("output", data);
You should have a using statement for the input as well as the output stream
If you really need processing so can't use File.Copy, using Stream.CopyTo will cope with larger files than loading everything into memory. You may not need that, of course, or you may need to load the whole file into memory for other reasons.
If you have got a MemoryStream, I'd probably use MemoryStream.WriteTo rather than Stream.CopyTo, but it probably won't make much difference which you use, except that you need to make sure you're at the start of the stream when using CopyTo.
I think Hans Passant's claim of a bug in MemoryStream.WriteTo() is wrong; it does not "ignore the return value of Write()". Stream.Write() returns void, which implies to me that the entire count bytes are written, which implies that Stream.Write() will block as necessary to complete the operation to, e.g., a NetworkStream, or throw if it ultimately fails.
That is indeed different from the write() system call in ?nix, and its many emulations in libc and so forth, which can return a "short write". I suspect Hans leaped to the conclusion that Stream.Write() followed that, which I would have expected, too, but apparently it does not.
It is conceivable that Stream.Write() could perform a "short write", without returning any indication of that, requiring the caller to check that the Position property of the Stream has actually been advanced by count. That would be a very error-prone API, and I doubt that it does that, but I have not thoroughly tested it. (Testing it would be a bit tricky: I think you would need to hook up a TCP NetworkStream with a reader on the other end that blocked forever, and write enough to fill up the wire buffers. Or something like that...)
The comments for Stream.Write() are not quite unambiguous:
Summary:
When overridden in a derived class, writes a sequence of bytes to the current
stream and advances the current position within this stream by the number
of bytes written.
Parameters: buffer:
An array of bytes. This method copies count bytes from buffer to the current stream.
Compare that to the Linux man page for write(2):
write() writes up to count bytes from the buffer pointed buf to the file referred to by the file descriptor fd.
Note the crucial "up to". That sentence is followed by explanation of some of the conditions under which a "short write" might occur, making it very explicit that it can occur.
This is really a critical issue: we need to know how Stream.Write() behaves, beyond all doubt.
The CopyTo method creates a buffer, populates its with data from the original stream and then calls the Write method passing the created buffer as a parameter. The WriteTo uses the memoryStream's internal buffer to write. That is the difference. What is better - it is up to you to decide which method you prefer.
Creating a MemoryStream from a HttpInputStream in Vb.Net:
Dim filename As String = MyFile.PostedFile.FileName
Dim fileData As Byte() = Nothing
Using binaryReader = New BinaryReader(MyFile.PostedFile.InputStream)
binaryReader.BaseStream.Position = 0
fileData = binaryReader.ReadBytes(MyFile.PostedFile.ContentLength)
End Using
Dim memoryStream As MemoryStream = New MemoryStream(fileData)
I noticed that it will keep returning the same read characters over and over, but I was wondering if there was a more elegant way.
while(!streamReader.EndOfStream)
{
string line = streamReader.ReadLine();
Console.WriteLine(line);
}
Console.WriteLine("End of File");
Check StreamReader.EndOfStream. Stop your read loop when this is true.
Make sure your code correctly handles the returned value for "byte count just read" on ReadBlock calls as well. Sounds like you are seeing zero bytes read, and just assuming the unchanged buffer contents you see is another read of the same data.
Unfortunately I can't comment on answers yet, but to the answer by "The Moof"...
Your use of cur here is misplaced, as the index parameter is for the index in buffer where writing is to begin. So for your examples it should be replaced by 0 in the call to stream.ReadBlock.
When the returned read length is less than your requested read length, you're at the end. You also should be keeping track of the read length in case your stream size isn't a perfect match for your buffer size, so you need to account for the shorter length of the data in your buffer.
do{
len = stream.ReadBlock(buffer, 0, buffer.Length);
/* ... */
}while(len == buffer.Length);
You could also check the EndOfStream flag of the stream in your loop condition as well. I prefer this method since you won't do a read of '0' length (rare condition, but it can happen).
do{
len = stream.ReadBlock(buffer, 0, buffer.Length);
/* ... */
}while(!stream.EndOfStream);
From MSDN for ReadBlock():
Return Value Type: System.Int32 The
position of the underlying stream is
advanced by the number of characters
that were read into buffer. The number
of characters that have been read. The
number will be less than or equal to
count, depending on whether all input
characters have been read.
So I would assume it's EOF when it returns 0,