I'm trying to split large files (3 GB+) into 100 MB chunks, then send those chunks over HTTP.
For testing, I'm working with a 29 MB file (size: 30,380,892 bytes; size on disk: 30,384,128 bytes),
so the 100 MB limit condition never triggers at the moment.
This is my code:
List<byte[]> bufferList = new List<byte[]>();
byte[] buffer = new byte[4096];
FileInfo fileInfo = new FileInfo(file);
long length = fileInfo.Length;
int nameCount = 0;
long sum = 0;
long count = 0;
using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
    while (count < length)
    {
        sum = fs.Read(buffer, 0, buffer.Length);
        count += sum;
        bufferList.Add(buffer);
    }

    var output2 = new byte[bufferList.Sum(arr => arr.Length)];
    int writeIdx2 = 0;
    foreach (var byteArr in bufferList)
    {
        byteArr.CopyTo(output2, writeIdx2);
        writeIdx2 += byteArr.Length;
    }
    HttpUploadBytes(url, output2, ++nameCount + fileName, contentType, path);
}
In this testing code, I'm adding each buffer I read into a list, and when I've finished reading I'm combining the buffer arrays into one complete array.
The problem is that the result I get (the size of output2) is 30,384,128 (the size on disk), and the file received on the server is corrupted.
What am I doing wrong?
The problem is that you keep adding the same buffer of size 4KB to bufferList. That's why the size of the file you receive matches the size on disk (it happens to be rounded to the nearest 4KB in your case).
A bigger problem with your code is that the data you send is wrong, because you keep overwriting the data in the buffer. If, for example, you send 200 chunks, it means that you send 200 copies of the last content of buffer.
The fix is relatively simple - make a copy of the buffer before adding it to bufferList (note that sum is declared as long in your code, so it needs a cast for Take):
bufferList.Add(buffer.Take((int)sum).ToArray());
This would fix the size problem, too, because the last chunk would have a smaller size, as represented by sum from the last call. Most importantly, though, bufferList would contain copies of the buffer, rather than the references to the buffer itself.
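Putting it together, a minimal sketch of the corrected read loop (untested; it loops on Read's return value instead of tracking the byte count separately, and needs using System.Linq for Take):

List<byte[]> bufferList = new List<byte[]>();
byte[] buffer = new byte[4096];
using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Copy only the bytes actually read, into a fresh array each time,
        // so the list never aliases the shared read buffer.
        bufferList.Add(buffer.Take(bytesRead).ToArray());
    }
}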
The following code does three things:
Read all the bytes from an input file
Keep only part of the file in outbytes
Write the extracted bytes to outfile
byte[] outbytes = File.ReadAllBytes(sourcefile).Skip(offset).Take(size).ToArray();
File.WriteAllBytes(outfile, outbytes);
But there is a ~2 GB limit at each of those steps.
Edit: the extracted portion can itself be larger than 2 GB.
How could I handle big files? What is the best way to proceed, with good performance, regardless of size?
Thanks!
Example using FileStream to take the middle 3 GB out of a 5 GB file:
byte[] buffer = new byte[1024 * 1024];
using (var readFS = File.OpenRead(pathToBigFile))
using (var writeFS = File.OpenWrite(pathToNewFile))
{
    readFS.Seek(1024L * 1024 * 1024, SeekOrigin.Begin); // seek to 1 GB in
    for (int i = 0; i < 3 * 1024; i++) // 3072 reads of one megabyte = 3 GB
    {
        int bytesRead = readFS.Read(buffer, 0, buffer.Length);
        writeFS.Write(buffer, 0, bytesRead);
    }
}
It's not production-grade code: Read might not return a full megabyte each time, so you'd end up with less than 3 GB copied. It's more to demonstrate the concept of using two file streams, repeatedly reading from one and writing to the other. You can modify it to copy an exact number of bytes by keeping a running total of bytesRead in the loop and stopping once you have read enough.
It is better to stream the data from one file to the other, only loading small parts of it into memory:
public static void CopyFileSection(string inFile, string outFile, long startPosition, long size)
{
    // Open the files as streams
    using (var inStream = File.OpenRead(inFile))
    using (var outStream = File.OpenWrite(outFile))
    {
        // Seek to the start position
        inStream.Seek(startPosition, SeekOrigin.Begin);

        // Create a variable to track how much more to copy
        // and a buffer to temporarily store a section of the file
        long remaining = size;
        byte[] buffer = new byte[81920];
        do
        {
            // Read the smaller of the buffer size and the remaining bytes,
            // and break out of the loop if we've already reached the end of the file
            int bytesRead = inStream.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (bytesRead == 0) { break; }

            // Write the buffered bytes to the output file
            outStream.Write(buffer, 0, bytesRead);
            remaining -= bytesRead;
        }
        while (remaining > 0);
    }
}
Usage:
CopyFileSection(sourcefile, outfile, offset, size);
This should be functionally equivalent to your current method, but without the overhead of reading the entire file into memory, regardless of its size.
Note: If you're doing this in code that uses async/await, you should change CopyFileSection to be public static async Task CopyFileSection and change inStream.Read and outStream.Write to await inStream.ReadAsync and await outStream.WriteAsync respectively.
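Applied literally, that gives the following sketch of the async variant (same method, with exactly the substitutions the note describes; it needs System.Threading.Tasks in scope):

public static async Task CopyFileSection(string inFile, string outFile, long startPosition, long size)
{
    using (var inStream = File.OpenRead(inFile))
    using (var outStream = File.OpenWrite(outFile))
    {
        inStream.Seek(startPosition, SeekOrigin.Begin);

        long remaining = size;
        byte[] buffer = new byte[81920];
        do
        {
            // Same loop as above, but awaiting the asynchronous read/write calls
            int bytesRead = await inStream.ReadAsync(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (bytesRead == 0) { break; }

            await outStream.WriteAsync(buffer, 0, bytesRead);
            remaining -= bytesRead;
        }
        while (remaining > 0);
    }
}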
I have created a small application using UnrealEngine 4.10 (UE4). Within that application, I am grabbing the colorBuffer via ReadPixels. I am then compressing the colorBuffer to PNG. Finally, the compressed colorBuffer is sent via TCP to another application. The UE4 application is written in C++, not that that should matter. The compressed colorBuffer is being sent every "tick" of the application - essentially 30 times a second (thereabouts).
My client, the one receiving the streamed PNG, is written in C# and does the following:
Connect to the server
If connected:
get the stream (memory stream)
read the memory stream into a byte array
convert the byte array to an image
Client implementation:
private void Timer_Tick(object sender, EventArgs e)
{
    var connected = tcp.IsConnected();
    if (connected)
    {
        var stream = tcp.GetStream(); // simply returns client.GetStream();
        int BYTES_TO_READ = 16;
        var buffer = new byte[BYTES_TO_READ];
        var totalBytesRead = 0;
        int bytesRead;
        do
        {
            // You have to do this in a loop because there's no
            // guarantee that all the bytes you need will be ready when
            // you call.
            bytesRead = stream.Read(buffer, totalBytesRead,
                BYTES_TO_READ - totalBytesRead);
            totalBytesRead += bytesRead;
        } while (totalBytesRead < BYTES_TO_READ);
        Image x = byteArrayToImage(buffer);
    }
}
public Image byteArrayToImage(byte[] byteArrayIn)
{
    var converter = new ImageConverter();
    Image img = (Image)converter.ConvertFrom(byteArrayIn);
    return img;
}
The problem is that Image img = (Image)converter.ConvertFrom(byteArrayIn);
throws an ArgumentException telling me "Parameter is not valid".
The data being sent, and the contents of my byteArrayIn and buffer, were shown in screenshots (not reproduced here).
I have also tried both:
Image.FromStream(stream); and Image.FromStream(new MemoryStream(bytes));
Image.FromStream(stream); causes it to read forever... and Image.FromStream(new MemoryStream(bytes)); results in the same exception as mentioned above.
Some questions:
What size should I set BYTES_TO_READ to? I set it to 16 because, when I check the size of the byte array being sent in the UE4 application (dataSize in the first image), it says the length is 16... I'm not too sure what to set this to.
Is the process that I have followed correct?
What am I doing wrong?
UPDATE
@RonBeyer asked if I could verify that the data sent from the server matches what is received. I have tried to do that, and here is what I can say:
The data sent, as far as I can tell, looks like this (screenshot omitted; sorry for the formatting):
The data being received looks like this:
var stream = tcp.GetStream();
int BYTES_TO_READ = 512;
var buffer = new byte[BYTES_TO_READ];
Int32 bytes = stream.Read(buffer, 0, buffer.Length);
var responseData = System.Text.Encoding.ASCII.GetString(buffer, 0, bytes);

// responseData looks like this (has been formatted for easier reading):
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
//?PNG\r\n\u001a\n\0\0\0\rIHDR
If I try to take a single line from the responseData and put that into an image:
var stringdata = "?PNG\r\n\u001a\n\0\0\0\rIHDR";
var data = System.Text.Encoding.ASCII.GetBytes(stringdata);
var ms = new MemoryStream(data);
Image img = Image.FromStream(ms);
data has a length of 16... the same length as the dataSize variable on the server. However, I again get the exception "Parameter is not valid".
UPDATE 2
@Darcara has helped by suggesting that what I was actually receiving was the header of the PNG file, and that I needed to send the size of the image first. I have now done that and made progress:
for (TArray<class FSocket*>::TIterator ClientIt(Clients); ClientIt; ++ClientIt)
{
    FSocket *Client = *ClientIt;
    int32 SendSize = 2 * x * y;
    Client->SetNonBlocking(true);
    Client->SetSendBufferSize(SendSize, SendSize);
    Client->Send(data, SendSize, bytesSent);
}
With this, I am now getting the image on the first go, however, subsequent attempts fail with the same "Parameter is not valid". Upon inspection, I have noticed that the stream now appears to be missing the header... "?PNG\r\n\u001a\n\0\0\0\rIHDR". I came to this conclusion when I converted the buffer to a string using Encoding.ASCII.GetString(buffer, 0, bytes);
Any idea why the header is now only being sent to first time and never again? What can I do to fix it?
First of all, thank you to @Darcara and @RonBeyer for your help.
I now have a solution:
Server:
for (TArray<class FSocket*>::TIterator ClientIt(Clients); ClientIt; ++ClientIt)
{
    FSocket *Client = *ClientIt;
    int32 SendSize = x * y; // image is 512 x 512
    Client->SetSendBufferSize(SendSize, SendSize);
    Client->Send(data, SendSize, bytesSent);
}
The first issue was that the size of the image needed to be correct:
int32 SendSize = 2 * x * y;
The line above is wrong. The image is 512 by 512, so SendSize should be x * y, where x and y are both 512.
The other issue was how I was handling the stream client side.
Client:
var connected = tcp.IsConnected();
if (connected)
{
    var stream = tcp.GetStream();
    var BYTES_TO_READ = 512 * 512;
    var buffer = new byte[BYTES_TO_READ];
    var bytes = stream.Read(buffer, 0, BYTES_TO_READ);
    Image returnImage = Image.FromStream(new MemoryStream(buffer));
    // Apply the image to a picture box. Note, this is running on a separate thread.
    UpdateImageViewerBackgroundWorker.ReportProgress(0, returnImage);
}
BYTES_TO_READ is now the correct size: 512 * 512, matching the number of bytes the server sends. (The original (512 * 512)^2 only appeared to work because ^ is XOR in C#, not exponentiation, so it evaluated to 262146 rather than a squared value.)
I now have Unreal Engine 4 streaming its frames.
You are only reading the first 16 bytes of the stream. I'm guessing that is not intentional.
If the stream ends/connection closes after the image is transferred, use stream.CopyTo to copy it into a MemoryStream. Image.FromStream(stream) might also work.
If the stream does not end, you need to know the size of the transferred object beforehand, so you can copy it read-by-read into another array / memory stream or directly to disk. In that case a much higher read buffer should be used (default is 8192 I think). This is a lot more complicated though.
To manually read from the stream, you need to prepend your data with its size. A simple Int32 should suffice. Your client code might look something like this:
var stream = tcp.GetStream();

// this is our temporary read buffer
int BYTES_TO_READ = 8192;
var buffer = new byte[BYTES_TO_READ];
int bytesRead;

// read size of data object
stream.Read(buffer, 0, 4); // read 4 bytes into the beginning of the empty buffer
// TODO: check that we actually received 4 bytes.
var totalBytesExpected = BitConverter.ToInt32(buffer, 0);

// this will be the stream we will save our received bytes to
// could also be a file stream
var imageStream = new MemoryStream(totalBytesExpected);
var totalBytesRead = 0;
do
{
    // read as much as the buffer can hold or the remaining bytes
    bytesRead = stream.Read(buffer, 0, Math.Min(BYTES_TO_READ, totalBytesExpected - totalBytesRead));
    totalBytesRead += bytesRead;
    // write bytes to image stream
    imageStream.Write(buffer, 0, bytesRead);
} while (totalBytesRead < totalBytesExpected);
I glossed over a lot of error handling here, but that should give you the general idea.
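For completeness, the sending side of that framing could look something like this in C# (a sketch only - the server in this question is actually UE4 C++, and networkStream/imageBytes are hypothetical placeholders):

// Write a 4-byte Int32 length prefix, then the payload itself.
// The receiver's BitConverter.ToInt32 assumes the same endianness on both ends.
static void SendLengthPrefixed(Stream networkStream, byte[] imageBytes)
{
    byte[] lengthPrefix = BitConverter.GetBytes(imageBytes.Length);
    networkStream.Write(lengthPrefix, 0, lengthPrefix.Length);
    networkStream.Write(imageBytes, 0, imageBytes.Length);
}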
If you want to transfer more complex objects, look into proper protocols like Google Protocol Buffers or MessagePack.
Edit: Solution is at bottom of post
I am trying my luck with reading binary files. Since I don't want to rely on byte[] AllBytes = File.ReadAllBytes(myPath), because the binary file might be rather big, I want to read small portions of the same size (which fits nicely with the file format to read) in a loop, using what I would call a "buffer".
public void ReadStream(MemoryStream ContentStream)
{
    byte[] buffer = new byte[sizePerHour];
    for (int hours = 0; hours < NumberHours; hours++)
    {
        int t = ContentStream.Read(buffer, 0, sizePerHour);
        SecondsToAdd = BitConverter.ToUInt32(buffer, 0);
        // further processing of my byte[] buffer
    }
}
My stream contains all the bytes I want, which is a good thing. But when I enter the loop, several things cease to work.
My int t is 0, although I would presume that ContentStream.Read() would copy information from the stream into my byte array, but that isn't the case.
I tried buffer = ContentStream.GetBuffer(), but that results in my buffer containing all of the stream - exactly the behaviour I wanted to avoid by reading into a buffer.
Also, resetting the stream to position 0 before reading did not help, nor did specifying an offset for my Stream.Read(), which means I am lost.
Can anyone point me to reading small portions of a stream to a byte[]? Maybe with some code?
Thanks in advance
Edit:
What pointed me in the right direction was the answer that .Read() returns 0 when the end of the stream is reached. I modified my code to the following:
public void ReadStream(MemoryStream ContentStream)
{
    byte[] buffer = new byte[sizePerHour];
    ContentStream.Seek(0, SeekOrigin.Begin); // Added this line
    for (int hours = 0; hours < NumberHours; hours++)
    {
        int t = ContentStream.Read(buffer, 0, sizePerHour);
        SecondsToAdd = BitConverter.ToUInt32(buffer, 0);
        // further processing of my byte[] buffer
    }
}
And everything works like a charm. I had initially been resetting the stream to its origin on every iteration over hours, while also passing an offset. Moving the "seek to beginning" part outside my loop and leaving the offset at 0 did the trick.
Read returns zero if the end of the stream is reached. Are you sure that your memory stream has the content you expect? I've tried the following and it works as expected:
// Create the source of the memory stream.
UInt32[] source = { 42, 4711 };
List<byte> sourceBuffer = new List<byte>();
Array.ForEach(source, v => sourceBuffer.AddRange(BitConverter.GetBytes(v)));

// Read the stream.
using (MemoryStream contentStream = new MemoryStream(sourceBuffer.ToArray()))
{
    byte[] buffer = new byte[sizeof(UInt32)];
    int t;
    do
    {
        t = contentStream.Read(buffer, 0, buffer.Length);
        if (t > 0)
        {
            UInt32 value = BitConverter.ToUInt32(buffer, 0);
        }
    } while (t > 0);
}
I have C# code that reads a text file and prints it out, which looks like this:
StreamReader sr = new StreamReader(File.OpenRead(ofd.FileName));
byte[] buffer = new byte[100]; // is there a way to simply specify the length of this to be the number of bytes in the file?
sr.BaseStream.Read(buffer, 0, buffer.Length);
foreach (byte b in buffer)
{
    label1.Text += b.ToString("x") + " ";
}
Is there any way I can know how many bytes my file has?
I want to know the length of the byte[] buffer in advance, so that in the Read call I can simply pass buffer.Length as the third argument.
System.IO.FileInfo fi = new System.IO.FileInfo("myfile.exe");
long size = fi.Length;
In order to find the file size, the system has to read from the disk. However, the example above reads only the file's metadata, not the file's content.
It's not clear why you're using StreamReader at all if you're going to read binary data. Just use FileStream instead. You can use the Length property to find the length of the file.
Note, however, that that still doesn't mean you should just call Read and assume that a single call will read all the data. You should loop until you've read everything:
byte[] data;
using (var stream = File.OpenRead(...))
{
    data = new byte[(int)stream.Length];
    int offset = 0;
    while (offset < data.Length)
    {
        int chunk = stream.Read(data, offset, data.Length - offset);
        if (chunk == 0)
        {
            // Or handle this some other way
            throw new IOException("File has shrunk while reading");
        }
        offset += chunk;
    }
}
Note that this is assuming you do want to read the data. If you don't want to even open the stream, use FileInfo.Length as other answers have shown. Note that both FileStream.Length and FileInfo.Length have a type of long, whereas arrays are limited to 32-bit lengths. What do you want to happen with a file which is bigger than 2 gigs?
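If you do read everything into one array, a defensive sketch might reject oversized files up front (assuming rejection is what you want, rather than chunked processing; path is a placeholder):

using (var stream = File.OpenRead(path))
{
    // A byte[] cannot hold more than int.MaxValue bytes, so refuse
    // anything larger instead of letting the cast below overflow.
    if (stream.Length > int.MaxValue)
        throw new IOException("File is too large to read into a single array.");

    var data = new byte[(int)stream.Length];
    // ... fill 'data' using the read loop shown above ...
}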
You can use the FileInfo.Length property.
Take a look at the example given in the link.
I would imagine something in here should help.
I doubt you can preemptively guess the size of a file without reading it...
How do I use File.ReadAllBytes In chunks
If it is a large file, then reading it in chunks might help.
I am trying to empower users to upload large files. Before I upload a file, I want to chunk it up. Each chunk needs to be a C# object. The reason why is for logging purposes. It's a long story, but I need to create actual C# objects that represent each file chunk. Regardless, I'm trying the following approach:
public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
    List<FileChunk> chunks = new List<FileChunk>();
    if (fileBytes.Length > 0)
    {
        FileChunk chunk = new FileChunk();
        for (int i = 0; i < (fileBytes.Length / 512); i++)
        {
            chunk.Number = (i + 1);
            chunk.Offset = (i * 512);
            chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
            chunks.Add(chunk);
            chunk = new FileChunk();
        }
    }
    return chunks;
}
Unfortunately, this approach seems to be incredibly slow. Does anyone know how I can improve the performance while still creating objects for each chunk?
thank you
I suspect this is going to hurt a little:
chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
Try this instead:
byte[] buffer = new byte[512];
Buffer.BlockCopy(fileBytes, chunk.Offset, buffer, 0, 512);
chunk.Bytes = buffer;
(Code not tested)
And the reason why this code would likely be slow is because Skip doesn't do anything special for arrays (though it could). This means that every pass through your loop is iterating the first 512*n items in the array, which results in O(n^2) performance, where you should just be seeing O(n).
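Put together, a sketch of the original method rewritten around Buffer.BlockCopy (assuming the same FileChunk shape as the question, with settable Number/Offset/Bytes); it also keeps the final partial chunk, which the original integer division dropped:

public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
    const int ChunkSize = 512;
    var chunks = new List<FileChunk>();
    for (int offset = 0; offset < fileBytes.Length; offset += ChunkSize)
    {
        // The last chunk may be shorter than 512 bytes.
        int size = Math.Min(ChunkSize, fileBytes.Length - offset);
        var chunk = new FileChunk
        {
            Number = (offset / ChunkSize) + 1,
            Offset = offset,
            Bytes = new byte[size]
        };
        Buffer.BlockCopy(fileBytes, offset, chunk.Bytes, 0, size);
        chunks.Add(chunk);
    }
    return chunks;
}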
Try something like this (untested code):
public static List<FileChunk> GetAllForFile(string fileName)
{
    var chunks = new List<FileChunk>();
    using (FileStream stream = new FileStream(fileName, FileMode.Open))
    {
        int i = 0;
        while (stream.Position < stream.Length)
        {
            var chunk = new FileChunk();
            chunk.Number = i;
            chunk.Offset = (i * 512);
            chunk.Bytes = new byte[512];
            int bytesRead = stream.Read(chunk.Bytes, 0, 512);
            if (bytesRead < 512) { Array.Resize(ref chunk.Bytes, bytesRead); }
            chunks.Add(chunk);
            i++;
        }
    }
    return chunks;
}
The above code skips several steps in your process, preferring to read the bytes from the file directly.
Note that, if the file is not an even multiple of 512, the last chunk will contain less than 512 bytes.
Same as Robert Harvey's answer, but using a BinaryReader, that way I don't need to specify an offset. If you use a BinaryWriter on the other end to reassemble the file, you won't need the Offset member of FileChunk.
public static List<FileChunk> GetAllForFile(string fileName) {
    var chunks = new List<FileChunk>();
    using (FileStream stream = new FileStream(fileName, FileMode.Open)) {
        BinaryReader reader = new BinaryReader(stream);
        int i = 0;
        bool eof = false;
        while (!eof) {
            var chunk = new FileChunk();
            chunk.Number = i;
            chunk.Offset = (i * 512);
            chunk.Bytes = reader.ReadBytes(512);
            chunks.Add(chunk);
            i++;
            if (chunk.Bytes.Length < 512) { eof = true; }
        }
    }
    return chunks;
}
Have you thought about what you're going to do to compensate for packet loss and data corruption?
Since you mentioned that the load is taking a long time, I would use asynchronous file reading to speed up the loading process. The hard disk is the slowest component of a computer. Google does asynchronous reads and writes in Google Chrome to improve its load times. I had to do something like this in C# in a previous job.
The idea would be to spawn several asynchronous requests over different parts of the file. Then when a request comes in, take the byte array and create your FileChunk objects taking 512 bytes at a time. There are several benefits to this:
If you have this run in a separate thread, then you won't have the whole program waiting to load the large file you have.
You can process a byte array, creating FileChunk objects, while the hard disk is still fulfilling the read requests on other parts of the file.
You will save RAM if you limit the number of pending read requests. This causes fewer page faults to the hard disk and uses the RAM and CPU cache more efficiently, which speeds up processing further.
You would want to use the following methods in the FileStream class.
[HostProtectionAttribute(SecurityAction.LinkDemand, ExternalThreading = true)]
public virtual IAsyncResult BeginRead(
byte[] buffer,
int offset,
int count,
AsyncCallback callback,
Object state
)
public virtual int EndRead(
IAsyncResult asyncResult
)
Also this is what you will get in the asyncResult:
// Extract the FileStream (state) out of the IAsyncResult object
FileStream fs = (FileStream) ar.AsyncState;
// Get the result
Int32 bytesRead = fs.EndRead(ar);
Here is some reference material for you to read.
This is a code sample of working with Asynchronous File I/O Models.
This is a MS documentation reference for Asynchronous File I/O.
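On newer framework versions, the same idea can be expressed with the Task-based ReadAsync instead of BeginRead/EndRead. A sketch, again assuming the question's FileChunk shape with settable Number/Offset/Bytes:

public static async Task<List<FileChunk>> GetAllForFileAsync(string fileName)
{
    var chunks = new List<FileChunk>();
    // useAsync: true requests true asynchronous (overlapped) I/O where available.
    using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize: 4096, useAsync: true))
    {
        int number = 0;
        int offset = 0;
        while (true)
        {
            var buffer = new byte[512];
            int bytesRead = await stream.ReadAsync(buffer, 0, buffer.Length);
            if (bytesRead == 0) break; // end of file
            if (bytesRead < buffer.Length)
                Array.Resize(ref buffer, bytesRead); // trim the final partial chunk
            chunks.Add(new FileChunk { Number = number++, Offset = offset, Bytes = buffer });
            offset += bytesRead;
        }
    }
    return chunks;
}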