Limiting a stream size - C#

I'm working on a C# project and I want to read a single file from multiple threads using streams, in the following manner:
A file is logically divided into "chunks" of fixed size.
Each thread gets its own stream representing a "chunk".
The problem is that I want to use the Stream interface, and I want to limit the size of each chunk so that the corresponding stream "ends" when it reaches the chunk size.
Is there something available in the standard library, or is my only option to write my own implementation of Stream?

There is an overload of StreamReader.Read which allows you to limit the number of characters read. An example can be found here: http://msdn.microsoft.com/en-us/library/9kstw824.aspx
The line you are looking for is sr.Read(c, 0, c.Length); you simply set up a char array, and the third argument caps the number of characters that will be read.
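That covers text. For raw byte chunks, there is (as far as I know) no bounded "sub-stream" type in the base class library, so a thin wrapper is the usual approach: give each thread a wrapper that maps positions into its own chunk and reports end-of-stream at the chunk boundary. A minimal read-only sketch, assuming the underlying stream is seekable:

```csharp
using System;
using System.IO;

// A minimal read-only wrapper exposing a fixed window [offset, offset+length)
// of an underlying seekable stream. Sketch only: not thread-safe across a
// shared base stream, and no write support.
class SubStream : Stream
{
    readonly Stream inner;
    readonly long start;
    readonly long length;
    long position;

    public SubStream(Stream inner, long offset, long length)
    {
        this.inner = inner;
        this.start = offset;
        this.length = length;
    }

    public override bool CanRead => true;
    public override bool CanSeek => true;   // Read below relies on seeking the inner stream
    public override bool CanWrite => false;
    public override long Length => length;
    public override long Position { get => position; set => position = value; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        long remaining = length - position;
        if (remaining <= 0) return 0;              // chunk exhausted: report end of stream
        if (count > remaining) count = (int)remaining;
        inner.Position = start + position;         // map the chunk position into the file
        int read = inner.Read(buffer, offset, count);
        position += read;
        return read;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        switch (origin)
        {
            case SeekOrigin.Begin: position = offset; break;
            case SeekOrigin.Current: position += offset; break;
            case SeekOrigin.End: position = length + offset; break;
        }
        return position;
    }

    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

For the multi-threaded scenario, each thread should open its own FileStream to wrap (or the wrapper must lock around the seek-and-read pair), since each Read mutates the base stream's position.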

Related

ArrayPool create method giving error in C#

Basically, I want to read data from a source file to a target file in Azure Data Lake in parallel, using the ConcurrentAppend API.
I also don't want to read the data from the files all at once, but in chunks; I am using buffers for that. I want to create 5 buffers of 1 MB, 5 buffers of 2 MB, and 5 buffers of 4 MB. Whenever a source file arrives, it will use the appropriate buffer according to its size, and I will append to the target using that buffer. I don't want the buffers to exceed 5 in each case/configuration.
I was using the shared ArrayPool for renting buffers. But since I have the condition that allocation should not exceed 5 arrays in each case (1, 2 and 4 MB), I had to add extra conditions to enforce that limit.
I would rather use a custom pool, which I can create like:
ArrayPool<byte> pool = ArrayPool<byte>.Create(One_mb, 5)
This takes care that my allocations don't go beyond 5 arrays and the maximum array size is 1 MB. Similarly, I can create two more buffer pools for the 2 MB and 4 MB cases. This way I won't need those extra conditions to limit the count to 5.
Problem:
When I use these custom pools, I get corrupted data in my target file. Moreover, the target file size gets doubled: if the sum of the inputs is 10 MB, the target file shows 20 MB.
If I use the same code and rent from the single shared ArrayPool rather than these custom pools, I get the correct result.
What am I doing wrong?
My code :
https://github.com/ChahatKumar/ADLS/blob/master/CreatePool/Program.cs
FileStream.Read returns the number of bytes read. This will not necessarily be the size of your array and could very well be smaller (or zero if no bytes were read). The code in your GitHub example ignores the value returned by Read and makes the incorrect assumption that the buffer was filled, by telling the next method to use the entire buffer. Because your arrays are so large, it is possible (and perhaps likely) that a single call to Read will not fill them entirely (even if the files are actually that large, FileStream has its own internal buffer and buffer size).
Your method should likely look like the following. Note I pass the actual number of bytes read to ConcurrentAppend (which I assume to be well conforming in that it respects the length argument):
int read;
while ((read = file.Read(buffer1, 0, buffer1.Length)) > 0)
{
c.ConcurrentAppend(filename, true, buffer1, 0, read);
}
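Putting the pieces together, here is a self-contained sketch of that read/append loop with a rented buffer. A local file append stands in for the ConcurrentAppend call (so it runs without the Data Lake SDK), and the pool configuration mirrors the question's 1 MB case; the file paths are whatever the caller supplies:

```csharp
using System;
using System.Buffers;
using System.IO;

// Chunked copy that rents from a custom ArrayPool and appends only the bytes
// actually read on each iteration. The local file append stands in for
// ConcurrentAppend; the fix is the same: pass 'read', never buffer.Length.
void CopyChunked(string sourcePath, string targetPath, ArrayPool<byte> pool)
{
    byte[] buffer = pool.Rent(1024 * 1024); // note: Rent may return an array LARGER than requested
    try
    {
        using var source = File.OpenRead(sourcePath);
        using var target = new FileStream(targetPath, FileMode.Append, FileAccess.Write);
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            target.Write(buffer, 0, read); // write only 'read' bytes
        }
    }
    finally
    {
        pool.Return(buffer); // always hand the buffer back, even on failure
    }
}

// At most 5 arrays of up to 1 MB per bucket, mirroring the question's configuration.
var onePool = ArrayPool<byte>.Create(1024 * 1024, 5);
```

Note the comment on Rent: a pool is free to hand back an oversized array, which is exactly why code that appends buffer.Length bytes instead of the count returned by Read produces inflated, corrupted output.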

Is it possible to get the true read position of a StreamReader relative to its underlying stream?

Alternatively, is it possible to tell a StreamReader to reset the position of its underlying stream to the first unconsumed byte, after which BaseStream.Position would have the desired value?
This is important because there's a lot of stuff that only the stream reader knows about. That's not only the actual parsing position within a line, but also things like line endings of different lengths and byte order marks.
DiscardBufferedData() fails to reset the underlying stream to the correct position, so no dice there.
Background: StreamReader uses internal buffering, which means that BaseStream.Position is out of sync and only tells where the currently buffered block ends. For diagnostic purposes it would be useful to include the actual byte offset in a message, especially with files that are hundreds of megabytes in size. With a byte offset it would be possible to go directly to the indicated position, as opposed to having to scan the whole file in order to count line endings (when only a line number is given).
There are other potential uses for the byte offset, like handing off processing from the stream reader to code that does raw reads via the base stream. That would allow it to keep using the stream reader for all parts that are not performance-critical and to limit raw byte processing to those parts where it matters.
And matter it can, sometimes, with speedups to the tune of orders of magnitude. For example, in code challenges like SPOJ's INTEST - Enormous Input Test and many other tasks, the CLR's standard input/output can be orders of magnitude slower than the actual number-crunching algorithm, and hence can needlessly cause time limits to be exceeded.
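One workable pattern for the hand-off case is to do the decoding yourself on top of raw reads, so the byte offset is always exact. A minimal sketch (assumes UTF-8/ASCII input and '\n'-terminated lines; the class name is made up here):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Reads '\n'-terminated lines while tracking the exact byte offset, by decoding
// on top of raw reads instead of relying on StreamReader's hidden buffering.
// Sketch only: unbuffered ReadByte calls are slow; a real version would keep
// its own byte buffer and count consumed bytes itself.
class ByteTrackingLineReader
{
    readonly Stream stream;
    public long BytePosition { get; private set; } // offset of the first unconsumed byte

    public ByteTrackingLineReader(Stream stream) { this.stream = stream; }

    // Returns the next line without its terminator, or null at end of stream.
    public string ReadLine()
    {
        var bytes = new List<byte>();
        int b = -1;
        while ((b = stream.ReadByte()) != -1)
        {
            BytePosition++;
            if (b == '\n') break;
            bytes.Add((byte)b);
        }
        if (b == -1 && bytes.Count == 0) return null;
        if (bytes.Count > 0 && bytes[bytes.Count - 1] == '\r')
            bytes.RemoveAt(bytes.Count - 1); // tolerate Windows "\r\n" endings
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

After any ReadLine call, BytePosition is a true position in the base stream, so raw-read code can take over from there without a DiscardBufferedData-style guess.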

A stream type that supports partial viewing in .Net

I'm working on a file reader for a custom file format. Part of the format is like the following:
[HEADER]
...
[EMBEDDED_RESOURCE_1]
[EMBEDDED_RESOURCE_2]
[EMBEDDED_RESOURCE_3]
...
Now what I'm trying to do is to open a new stream whose boundaries span only one resource. For instance, EMBEDDED_RESOURCE_1's first byte is at the 100th byte and its length is 200 bytes, so its boundaries are 100 - 300. Is there any way to do so without using any buffers?
Thanks!
Alternatively - a MemoryStream.
Before reading the necessary number of bytes, set the initial position with the Position property.
But then it is necessary to read the entire file into the MemoryStream.
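A lighter variant of the same idea copies only the resource's window (bytes 100..300 from the question) into the MemoryStream, instead of the whole file. A sketch; the helper name is made up:

```csharp
using System;
using System.IO;

// Returns a read-only in-memory stream over the [offset, offset+length)
// window of a file, without loading the rest of the file.
MemoryStream OpenWindow(string path, long offset, int length)
{
    using var file = File.OpenRead(path);
    file.Position = offset;                  // e.g. 100 for EMBEDDED_RESOURCE_1
    var window = new byte[length];           // e.g. 200 bytes
    int total = 0, read;
    // Read can return fewer bytes than asked for, so loop until the window is full.
    while (total < length && (read = file.Read(window, total, length - total)) > 0)
        total += read;
    return new MemoryStream(window, 0, total, writable: false);
}
```

The returned stream behaves as a stream over that one embedded resource only; a truly buffer-free view would need a wrapper Stream that translates positions into the base stream instead.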

C# Reading, Modifying then writing binary data to file. Best convention?

I'm new to programming in general (My understanding of programming concepts is still growing.). So this question is about learning, so please provide enough info for me to learn but not so much that I can't, thank you.
(I would also like input on how to make the code reusable with in the project.)
The goal of the project I'm working on consists of:
Read binary file.
I have known offsets I need to read to find a particular chunk of data from within this file.
First offset is in the first 4 bytes (offset of the end of my chunk).
Second offset is 16 bytes from the end of the file; I read 4 bytes (gives the size of the chunk, in hex).
Third offset is the 4 bytes following the previous one, read for 4 bytes (offset of the start of the chunk, in hex).
Locate parts in the chunk to modify by searching ASCII text as well as offsets.
Now I have the start offset, end offset and size of my chunk.
This should allow me to read bytes from file into a byte array and know the size of the array ahead of time.
(Questions: 1. Is knowing the size important? Other than verification. 2. Is reading part of a file into a byte array in order to change bytes and overwrite that part of the file the best method?)
So far I have managed to read the offsets from the file using BinaryReader on a MemoryStream. I then locate the chunk of data I need and read that into a byte array.
I'm stuck in several ways:
What are the best practices for binary Reading / Writing?
What's the best storage convention for the data that is read?
When I need to modify bytes, how do I go about that?
Should I be using FileStream?
Since you want to both read and write, it makes sense to use the FileStream class directly (using FileMode.Open and FileAccess.ReadWrite). See FileStream on MSDN for a good overall example.
You do need to know the number of bytes that you are going to be reading from the stream. See the FileStream.Read documentation.
Fundamentally, you have to read the bytes into memory at some point if you're going to use and later modify their contents. So you will have to make an in-memory copy (using the Read method is the right way to go if you're reading a variable-length chunk at a time).
As for best practices, always dispose your streams when you're done; e.g.:
using (var stream = File.Open(FILE_NAME, FileMode.Open, FileAccess.ReadWrite))
{
//Do work with the FileStream here.
}
If you're going to do a large amount of work, you should be doing the work asynchronously. (Let us know if that's the case.)
And, of course, check the FileStream.Read documentation and also the FileStream.Write documentation before using those methods.
Reading bytes is best done by pre-allocating an in-memory array of bytes with the length that you're going to read, then reading those bytes. The following will read the chunk of bytes that you're interested in, let you do work on it, and then replace the original contents (assuming the length of the chunk hasn't changed):
EDIT: I've added a helper method to do work on the chunk, per the comments on variable scope.
using (var stream = File.Open(FILE_NAME, FileMode.Open, FileAccess.ReadWrite))
{
    var chunk = new byte[numOfBytesInChunk];
    var offsetOfChunkInFile = stream.Position; // It sounds like you've already calculated this.
    // Read can return fewer bytes than requested, so loop until the full chunk is in memory.
    var totalRead = 0;
    int read;
    while (totalRead < numOfBytesInChunk &&
           (read = stream.Read(chunk, totalRead, numOfBytesInChunk - totalRead)) > 0)
    {
        totalRead += read;
    }
    DoWorkOnChunk(ref chunk);
    stream.Seek(offsetOfChunkInFile, SeekOrigin.Begin);
    stream.Write(chunk, 0, numOfBytesInChunk);
}
private void DoWorkOnChunk(ref byte[] chunk)
{
//TODO: Any mutation done here to the data in 'chunk' will be written out to the stream.
}

Reading a MemoryStream which contains multiple files

If I have a single MemoryStream of which I know I sent multiple files (example 5 files) to this MemoryStream. Is it possible to read from this MemoryStream and be able to break apart file by file?
My gut is telling me no since when we Read, we are reading byte by byte... Any help and a possible snippet would be great. I haven't been able to find anything on google or here :(
You can't directly, not if you don't delimit the files in some way or know the exact size of each file as it was put into the buffer.
You can use a compressed file such as a zip file to transfer multiple files instead.
A stream is just a sequence of bytes. If you put the files next to each other in the stream, you need to know how to separate them. That means you must know the length of the files, or you should have used some separator. Some (most) file types have a kind of header, but looking for headers in an entire stream is not reliable either, since the header of one file could just as well be data inside another file.
So, if you need to write files to such a stream, it is wise to add some extra information. For instance, start with a version number, then, write the size of the first file, write the file itself and then write the size of the next file, etc....
By starting with a version number, you can make alterations to this format. In the future you may decide you need to store the file name as well. In that case, you can increase version number, make up a new format, and still be able to read streams that you created earlier.
This is of course especially useful if you store these streams too.
Since you're sending them, you'll have to send them into the stream in such a way that you'll know how to pull them out. The most common way of doing this is to use a length specification. For example, to write the files to the stream:
write an integer to the stream to indicate the number of files
Then for each file,
write an integer (or a long if the files are large) to indicate the number of bytes in the file
write the file
To read the files back,
read an integer (n) to determine the number of files in the stream
Then, iterating n times,
read an integer (or long if that's what you chose) to determine the number of bytes in the file
read the file
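The length-prefixed scheme above can be sketched with BinaryWriter/BinaryReader (the helper names are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Write a count, then for each file its byte length followed by its bytes.
MemoryStream PackFiles(IList<byte[]> files)
{
    var ms = new MemoryStream();
    var writer = new BinaryWriter(ms);
    writer.Write(files.Count);        // how many files follow
    foreach (var file in files)
    {
        writer.Write(file.Length);    // byte length of this file (use long for huge files)
        writer.Write(file);           // the file's bytes
    }
    writer.Flush();
    ms.Position = 0;                  // rewind so the caller can read it back
    return ms;
}

// Read the count, then each (length, bytes) pair back out.
List<byte[]> UnpackFiles(Stream stream)
{
    var reader = new BinaryReader(stream);
    int count = reader.ReadInt32();
    var files = new List<byte[]>(count);
    for (int i = 0; i < count; i++)
    {
        int length = reader.ReadInt32();
        files.Add(reader.ReadBytes(length));
    }
    return files;
}
```

Because each file is preceded by its exact length, no delimiter scanning is needed and the file contents can contain any byte values.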
You could use an IEnumerable<Stream> instead.
You need to implement this yourself. What you want to do is write some sort of 'delimiter' into the stream. As you're reading, look for that delimiter, and you'll know when you have hit a new file.
Here's a quick and dirty example (naive: it assumes the delimiter always lands on a read boundary and never occurs inside the file data):
byte[] delimiter = System.Text.Encoding.Default.GetBytes("++MyDelimiter++");
ms.Write(myFirstFile, 0, myFirstFile.Length);
ms.Write(delimiter, 0, delimiter.Length);
ms.Write(mySecondFile, 0, mySecondFile.Length);
....
var buffer = new byte[delimiter.Length];
int len;
do {
    len = ms.Read(buffer, 0, delimiter.Length);
    if (buffer.SequenceEqual(delimiter)) // requires System.Linq
    {
        // Close the current output file and open a new one
    }
    else
    {
        // Write 'len' bytes from buffer to the current output stream
    }
} while (len > 0);
