OK, I made a C# winform app, it's a File_Splitter_Joiner.
You just give it a file and it splits it for you to a number of pieces you specify.
The splitting is done in a separate thread.
Everything was working pretty fine until I sliced a 1Gig file!
In the task manager, I saw that my program started consuming 1Gigabyte of memory and my computer almost died!
not just that, when slicing finished, the consuming didn't change!
(dunno if this means that the garbage collector isn't working, although I'm pretty sure that I lost all references to what was holding the big data chumps, so it should work)
Here's the Splitter constructor (just to give you a better idea):
public FileSplitter(string FileToSplitPath, string PiecesFolder, int NumberOfPieces, int PieceSize, SplittingMethod Method)
{
FileToSplitInfo = new FileInfo(FileToSplitPath);
this.FileToSplitPath = FileToSplitPath;
this.PiecesFolder = PiecesFolder;
this.NumberOfPieces = NumberOfPieces;
this.PieceSize = PieceSize;
this.Method = Method;
SplitterThread = new Thread(Split);
}
And here is the method that did the actual splitting:
(I'm still a newbie, so what you're about to see 'may not' be done in the best way ever possible, I'm just learning here)
private void Split()
{
int remainingSize = 0;
int remainingPos = -1;
bool isNumberOfPiecesEqualInSize = true;
int fileSize = (int)FileToSplitInfo.Length; // FileToSplitInfo is a FileInfo object
if (fileSize % PieceSize != 0)
{
remainingSize = fileSize % PieceSize;
remainingPos = fileSize - remainingSize;
isNumberOfPiecesEqualInSize = false;
}
byte[] fileBytes = new byte[fileSize];
var _fs = File.Open(FileToSplitPath, FileMode.Open);
BinaryReader br = new BinaryReader(_fs);
br.Read(fileBytes, 0, fileSize);
br.Close();
_fs.Close();
for (int i = 0, index = 0; i < NumberOfPieces; i++, index += PieceSize)
{
var fs = File.Create(PiecesFolder + "\\" + Path.GetFileName(FileToSplitPath) + "." + (i+1).ToString());
var bw = new BinaryWriter(fs);
bw.Write(fileBytes, index, PieceSize);
if(i == NumberOfPieces-1 && !isNumberOfPiecesEqualInSize && Method == SplittingMethod.NumberOfPieces)
bw.Write(fileBytes, remainingPos, remainingSize);
bw.Close();
fs.Close();
}
MessageBox.Show("File has been splitted successfully!");
SplitterThread.Abort();
}
Now, instead of reading the bytes of the file via a BinaryReader, I was first reading it via the File.ReadAllBytes method, it was working fine with small file sizes, but, I got a "SystemOutOfMemory" exception when I dealt with our big guy, dunno why I didn't get that exception when I read the bytes via a BinaryReader.
(that was an in between question)
So, the main question is, how can I load big files (gigs speaking) in a way that doesn't consume so much memory ? I mean, how can I make my program not consume all that memory ?
and how I can I free the used memory after the splitting is done ?
(I actually used
bw.Dispose; fs.Dispose;
instead of
bw.Close(); fs.Close();
it was the same.
I know the Q might not make sense, cuz when we load something, it gets in our memory not somewhere else, but, the reason I asked it like that, is cuz I used another Splitting_Joining program (not written by me) just to see that if it had the same problem, I loaded the file, the program consumed about 5Migs of ram, when I started splitting, it used about 10Migs!!
Now that is a VERY big difference .. Probably that app was in C/C++ ..
So to sum up, who sucks ? is it my code, and if so how can I fix it ? or is it C# when it comes to performance ?
Thank you SOOO much for anything you could hook me up with :)
The following 2 lines will kil you:
int fileSize = (int)FileToSplitInfo.Length; // a FileInfo object
...
byte[] fileBytes = new byte[fileSize];
Your code will fail when the size is over Int32.MaxValue. Unnecessary, just use long fileSize = FileToSplitInfo.Length;
This corrected code will fail when there is not enough contiguous memory. Fragmentation (of the LOH) will bring you down sooner or later.
You allocate memory for the entire file but your only need PieceSize bytes at a time.
You don't even need to know the fileSize, just
byte[] pieceBuffer = new byte[PieceSize];
while (true)
{
int nBytes = br.Read(pieceBuffer, 0, pieceBuffer.Length);
if (nBytes == 0)
break;
// write this piece, the length is nBytes
}
There are different aspects that can be made better:
if you are working with big file, why first read all inside an array and after write into another file ? Just write into the new file while reading from the other.
use using to gurantee disposal of the streams, in any case: either there is an exception or not.
if you begin to work with really big file, like 1GB or even more, I would recommend to look on Memory Mapped Files. So you will laverage incredible memory consuption benefit with some increased performance cost.
Related
I am trying to write a simple program to stress test a hard drive, however, when I watch hard drive usage in resmon.exe and the Task Manager in Windows it shows no hard drive usage while the read is happening.
Currently I am writing 10GB of random bytes to a file as fast as possible to write test it. Then I am reading that file 512KB bytes at a time as quickly as I can. The goal is to try to max out the hard drives read capabilities.
The code I am using to read is below. My first thought is maybe it is just optimizing away the reads. So I was summing the values of the bytes in the buffer and taking the average. This makes the process take way longer, but I still see no usage in the Task Manager or in resmon.exe.
I'm sure I am missing something obvious here, but I am not sure what the problem is. I know there is a file system cache, but surely that won't affect reading through a 10GB file?
IProgress<double> progress = new Progress<double>( ( pct ) => Console.WriteLine( pct ) );
var fi = new FileInfo("E:\\temp\\out");
var size = fi.Length * 1;
using var stream = new FileStream( fi.FullName, FileMode.Open );
var curRead = 0;
var readBuffer = new byte[1024*512];
var j = 0;
long totalRead = 0;
var numReads = Math.Floor( size / (double) readBuffer.Length );
var reportWait = Math.Floor( numReads / 100 );
while ( totalRead < size )
{
curRead = stream.Read( readBuffer, 0, readBuffer.Length );
if ( curRead < readBuffer.Length )
{
stream.Position = 0;
}
totalRead += curRead;
if ( j % reportWait == 0 )
{
progress.Report( (totalRead / (double)size) * 100 );
}
j++;
}
As far as I can tell, your code works fine. The problem is you're severely underestimating how much caching is going on!
The first time I run your code, I get exactly the results you expect - steady 5 GiB/s reads. Repeating the run results in zero reads. The whole file is cached in memory, and will stay there until there's something else that needs that cache - which is unlikely while you keep running your benchmark over and over.
10 GiB just isn't a large enough file nowadays. I can't find disk cache size in my drive specs, but even just the caching done by Windows 10 on my machine is enough to give me 30 GiB of files permanently cached in memory.
Of course, there's a simple fix - tell Windows not to do that:
using var stream = new FileStream(
fi.FullName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8,
(FileOptions)0x20000000);
The FileOptions setting (not part of the enum in .NET) says "No buffering, please!" This probably isn't respected by the disk itself (disks lie to the OS all the time), but it does get rid of the Windows-level file caching, and you'll get far more repeatable results.
I have a code that is SSIS script task to zip file written in C#.
I have problem when zipping 1gb (approxymately) file.
I try to implement this code and still get error 'System.OutOfMemoryException'
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at ST_4cb59661fb81431abcf503766697a1db.ScriptMain.AddFileToZipUsingStream(String sZipFile, String sFilePath, String sFileName, String sBackupFolder, String sPrefixFolder) in c:\Users\dtmp857\AppData\Local\Temp\vsta\84bef43d323b439ba25df47c365b5a29\ScriptMain.cs:line 333
at ST_4cb59661fb81431abcf503766697a1db.ScriptMain.Main() in c:\Users\dtmp857\AppData\Local\Temp\vsta\84bef43d323b439ba25df47c365b5a29\ScriptMain.cs:line 131
This is the snippet of code when zipping file:
protected bool AddFileToZipUsingStream(string sZipFile, string sFilePath, string sFileName, string sBackupFolder, string sPrefixFolder)
{
bool bIsSuccess = false;
try
{
if (File.Exists(sZipFile))
{
using (ZipArchive addFile = ZipFile.Open(sZipFile, ZipArchiveMode.Update))
{
addFile.CreateEntryFromFile(sFilePath, sFileName);
//Move File after zipping it
BackupFile(sFilePath, sBackupFolder, sPrefixFolder);
}
}
else
{
//from https://stackoverflow.com/questions/28360775/adding-large-files-to-io-compression-ziparchiveentry-throws-outofmemoryexception
using (var zipFile = ZipFile.Open(sZipFile, ZipArchiveMode.Update))
{
var zipEntry = zipFile.CreateEntry(sFileName);
using (var writer = new BinaryWriter(zipEntry.Open()))
using (FileStream fs = File.Open(sFilePath, FileMode.Open))
{
var buffer = new byte[16 * 1024];
using (var data = new BinaryReader(fs))
{
int read;
while ((read = data.Read(buffer, 0, buffer.Length)) > 0)
writer.Write(buffer, 0, read);
}
}
}
//Move File after zipping it
BackupFile(sFilePath, sBackupFolder, sPrefixFolder);
}
bIsSuccess = true;
}
catch (Exception ex)
{
throw ex;
}
return bIsSuccess;
}
What I am missing, please give me suggestion maybe tutorial or best practice handling this problem.
I know this is an old post but what can I say, it helped me sort out some stuff and still comes up as a top hit on Google.
So there is definitely something wrong with the System.IO.Compression library!
First and Foremost...
You must make sure to turn off 32-Preferred. Having this set (in my case with a build for "AnyCPU") causes so many inconsistent issues.
Now with that said, I took some demo files (several less than 500MB, one at 500MB, and one at 1GB), and created a sample program with 3 buttons that made use of the 3 methods.
Button 1 - ZipArchive.CreateFromDirectory(AbsolutePath, TargetFile);
Button 2 - ZipArchive.CreateEntryFromFile(AbsolutePath, RelativePath);
Button 3 - Using the [16 * 1024] Byte Buffer method from above
Now here is where it gets interesting. (Assuming that the program is built as "AnyCPU" and with NO 32 Preferred check)... all 3 Methods worked on a Windows 64-Bit OS, regardless of how much memory it had.
However, as soon as I ran the same test on a 32-Bit OS, regardless of how much memory it had, ONLY method 1 worked!
Method 2 and 3 blew up with the outofmemory error AND to add salt to it, method 3 (the preferred method of chunking) actually corrupted more files than method #2!
By messed up, I mean that of my files, the 500MB and the 1GB file ended up in the zipped archive but at a size less than the original (it was basically truncated).
So I dunno... since there are not many 32-bit OS around anymore, I guess maybe it is a moot point.
But seems like there are some bugs in the System.IO.Compression Framework!
I am trying to empower users to upload large files. Before I upload a file, I want to chunk it up. Each chunk needs to be a C# object. The reason why is for logging purposes. Its a long story, but I need to create actual C# objects that represent each file chunk. Regardless, I'm trying the following approach:
public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
List<FileChunk> chunks = new List<FileChunk>();
if (fileBytes.Length > 0)
{
FileChunk chunk = new FileChunk();
for (int i = 0; i < (fileBytes.Length / 512); i++)
{
chunk.Number = (i + 1);
chunk.Offset = (i * 512);
chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
chunks.Add(chunk);
chunk = new FileChunk();
}
}
return chunks;
}
Unfortunately, this approach seems to be incredibly slow. Does anyone know how I can improve the performance while still creating objects for each chunk?
thank you
I suspect this is going to hurt a little:
chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
Try this instead:
byte buffer = new byte[512];
Buffer.BlockCopy(fileBytes, chunk.Offset, buffer, 0, 512);
chunk.Bytes = buffer;
(Code not tested)
And the reason why this code would likely be slow is because Skip doesn't do anything special for arrays (though it could). This means that every pass through your loop is iterating the first 512*n items in the array, which results in O(n^2) performance, where you should just be seeing O(n).
Try something like this (untested code):
public static List<FileChunk> GetAllForFile(string fileName, FileMode.Open)
{
var chunks = new List<FileChunk>();
using (FileStream stream = new FileStream(fileName))
{
int i = 0;
while (stream.Position <= stream.Length)
{
var chunk = new FileChunk();
chunk.Number = (i);
chunk.Offset = (i * 512);
Stream.Read(chunk.Bytes, 0, 512);
chunks.Add(chunk);
i++;
}
}
return chunks;
}
The above code skips several steps in your process, preferring to read the bytes from the file directly.
Note that, if the file is not an even multiple of 512, the last chunk will contain less than 512 bytes.
Same as Robert Harvey's answer, but using a BinaryReader, that way I don't need to specify an offset. If you use a BinaryWriter on the other end to reassemble the file, you won't need the Offset member of FileChunk.
public static List<FileChunk> GetAllForFile(string fileName) {
var chunks = new List<FileChunk>();
using (FileStream stream = new FileStream(fileName)) {
BinaryReader reader = new BinaryReader(stream);
int i = 0;
bool eof = false;
while (!eof) {
var chunk = new FileChunk();
chunk.Number = i;
chunk.Offset = (i * 512);
chunk.Bytes = reader.ReadBytes(512);
chunks.Add(chunk);
i++;
if (chunk.Bytes.Length < 512) { eof = true; }
}
}
return chunks;
}
Have you thought about what you're going to do to compensate for packet loss and data corruption?
Since you mentioned that the load is taking a long time then I would use asynchronous file reading in order to speed up the loading process. The hard disk is the slowest component of a computer. Google does asynchronous reads and writes on Google Chrome to improve their load times. I had to do something like this in C# in a previous job.
The idea would be to spawn several asynchronous requests over different parts of the file. Then when a request comes in, take the byte array and create your FileChunk objects taking 512 bytes at a time. There are several benefits to this:
If you have this run in a separate thread, then you won't have the whole program waiting to load the large file you have.
You can process a byte array, creating FileChunk objects, while the hard disk is still trying to for-fill read request on other parts of the file.
You will save on RAM space if you limit the amount of pending read requests you can have. This allows less page faulting to the hard disk and use the RAM and CPU cache more efficiently, which speeds up processing further.
You would want to use the following methods in the FileStream class.
[HostProtectionAttribute(SecurityAction.LinkDemand, ExternalThreading = true)]
public virtual IAsyncResult BeginRead(
byte[] buffer,
int offset,
int count,
AsyncCallback callback,
Object state
)
public virtual int EndRead(
IAsyncResult asyncResult
)
Also this is what you will get in the asyncResult:
// Extract the FileStream (state) out of the IAsyncResult object
FileStream fs = (FileStream) ar.AsyncState;
// Get the result
Int32 bytesRead = fs.EndRead(ar);
Here is some reference material for you to read.
This is a code sample of working with Asynchronous File I/O Models.
This is a MS documentation reference for Asynchronous File I/O.
I am trying to send a file over a NetworkStream and rebuild it on the client side. I can get the data over correctly (i think) but when I use either a BinaryWriter or a FileStream object to recreate the file, the file is cut off in the beginning at the same point no matter what methodology I use.
private void ReadandSaveFileFromServer(ref TcpClient clientATF,ref NetworkStream currentStream, string locationToSave)
{
int fileSize = 0;
string fileName = "";
fileName = ReadStringFromServer(ref clientATF,ref currentStream);
fileSize = ReadIntFromServer(ref clientATF,ref currentStream);
byte[] fileSent = new byte[fileSize];
if (currentStream.CanRead && clientATF.Connected)
{
currentStream.Read(fileSent, 0, fileSent.Length);
WriteToConsole("Log Recieved");
}
else
{
WriteToConsole("Log Transfer Failed");
}
FileStream fileToCreate = new FileStream(locationToSave + "\\" + fileName, FileMode.Create);
fileToCreate.Seek(0, SeekOrigin.Begin);
fileToCreate.Write(fileSent, 0, fileSent.Length);
fileToCreate.Close();
//binWriter = new BinaryWriter(File.Open(locationToSave + "\\" + fileName, FileMode.Create));
//binWriter.Write(fileSent);
//binWriter.Close();
}
When I step through and check fileName and fileSize, they are correct. The byte[] is also fully populated. Any clue as to what I can do next?
Thanks in advance...
Sean
EDIT!!!:
So I figured out what is happening. When I read a string and then the Int from stream, the byte array is 256 indices long. So my read for string is taking in the int, which then will clobber the other areas. Need to figure this out...
For one thing, you can use the convenience method File.WriteAllBytes to write the data more simply. But I doubt that that's the problem.
You're assuming you can read all the data in a single call to Read. You're ignoring the return value. Don't do that - instead, read multiple times until either you've read everything you expect to, or you've reached the end of the stream. See this article for more details. If you're using .NET 4, there's a new CopyTo method you may find useful.
(As an aside, your use of ref suggests that you don't understand what it really means. It's well worth making sure you understand how arguments are passed in C#.)
To add to Jon Skeet's answer, your reading code should be:
int bytesRead;
int readPos = 0;
do
{
bytesRead = currentStream.Read(fileSent, readPos, fileSent.Length);
readPos += bytesRead;
} while (bytesRead > 0);
If you are looking for a general solution for sending and receiving files over a network have you considered using a C# network library? It has probably solved most of the issues you will come across when trying to do this.
Disclaimer: I'm one of the developers of this library.
I have a huge file, where I have to insert certain characters at a specific location. What is the easiest way to do that in C# without rewriting the whole file again.
Filesystems do not support "inserting" data in the middle of a file. If you really have a need for a file that can be written to in a sorted kind of way, I suggest you look into using an embedded database.
You might want to take a look at SQLite or BerkeleyDB.
Then again, you might be working with a text file or a legacy binary file. In that case your only option is to rewrite the file, at least from the insertion point up to the end.
I would look at the FileStream class to do random I/O in C#.
You will probably need to rewrite the file from the point you insert the changes to the end. You might be best always writing to the end of the file and use tools such as sort and grep to get the data out in the desired order. I am assuming you are talking about a text file here, not a binary file.
There is no way to insert characters in to a file without rewriting them. With C# it can be done with any Stream classes. If the files are huge, I would recommend you to use GNU Core Utils inside C# code. They are the fastest. I used to handle very large text files with the core utils ( of sizes 4GB, 8GB or more etc ). Commands like head, tail, split, csplit, cat, shuf, shred, uniq really help a lot in text manipulation.
For example if you need to put some chars in a 2GB file, you can use split -b BYTECOUNT, put the ouptut in to a file, append the new text to it, and get the rest of the content and add to it. This should supposedly be faster than any other way.
Hope it works. Give it a try.
You can use random access to write to specific locations of a file, but you won't be able to do it in text format, you'll have to work with bytes directly.
If you know the specific location to which you want to write the new data, use the BinaryWriter class:
using (BinaryWriter bw = new BinaryWriter (File.Open (strFile, FileMode.Open)))
{
string strNewData = "this is some new data";
byte[] byteNewData = new byte[strNewData.Length];
// copy contents of string to byte array
for (var i = 0; i < strNewData.Length; i++)
{
byteNewData[i] = Convert.ToByte (strNewData[i]);
}
// write new data to file
bw.Seek (15, SeekOrigin.Begin); // seek to position 15
bw.Write (byteNewData, 0, byteNewData.Length);
}
You may take a look at this project:
Win Data Inspector
Basically, the code is the following:
// this.Stream is the stream in which you insert data
{
long position = this.Stream.Position;
long length = this.Stream.Length;
MemoryStream ms = new MemoryStream();
this.Stream.Position = 0;
DIUtils.CopyStream(this.Stream, ms, position, progressCallback);
ms.Write(data, 0, data.Length);
this.Stream.Position = position;
DIUtils.CopyStream(this.Stream, ms, this.Stream.Length - position, progressCallback);
this.Stream = ms;
}
#region Delegates
public delegate void ProgressCallback(long position, long total);
#endregion
DIUtils.cs
public static void CopyStream(Stream input, Stream output, long length, DataInspector.ProgressCallback callback)
{
long totalsize = input.Length;
long byteswritten = 0;
const int size = 32768;
byte[] buffer = new byte[size];
int read;
int readlen = length < size ? (int)length : size;
while (length > 0 && (read = input.Read(buffer, 0, readlen)) > 0)
{
output.Write(buffer, 0, read);
byteswritten += read;
length -= read;
readlen = length < size ? (int)length : size;
if (callback != null)
callback(byteswritten, totalsize);
}
}
Depending on the scope of your project, you may want to decide to insert each line of text with your file in a table datastructure. Sort of like a database table, that way you can insert to a specific location at any given moment, and not have to read-in, modify, and output the entire text file each time. This is given the fact that your data is "huge" as you put it. You would still recreate the file, but at least you create a scalable solution in this manner.
It may be "possible" depending on how the filesystem stores files to quickly insert (ie, add additional) bytes in the middle. If it is remotely possible it may only be feasible to do so a full block at a time, and only by either doing low level modification of the filesystem itself or by using a filesystem specific interface.
Filesystems are not generally designed for this operation. If you need to quickly do inserts you really need a more general database.
Depending on your application a middle ground would be to bunch your inserts together, so you only do one rewrite of the file rather than twenty.
You will always have to rewrite the remaining bytes from the insertion point. If this point is at 0, then you will rewrite the whole file. If it is 10 bytes before the last byte, then you will rewrite the last 10 bytes.
In any case there is no function to directly support "insert to file". But the following code can do it accurately.
var sw = new Stopwatch();
var ab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ";
// create
var fs = new FileStream(#"d:\test.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite, 262144, FileOptions.None);
sw.Restart();
fs.Seek(0, SeekOrigin.Begin);
for (var i = 0; i < 40000000; i++) fs.Write(ASCIIEncoding.ASCII.GetBytes(ab), 0, ab.Length);
sw.Stop();
Console.WriteLine("{0} ms", sw.Elapsed.TotalMilliseconds);
fs.Dispose();
// insert
fs = new FileStream(#"d:\test.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite, 262144, FileOptions.None);
sw.Restart();
byte[] b = new byte[262144];
long target = 10, offset = fs.Length - b.Length;
while (offset != 0)
{
if (offset < 0)
{
offset = b.Length - target;
b = new byte[offset];
}
fs.Position = offset; fs.Read(b, 0, b.Length);
fs.Position = offset + target; fs.Write(b, 0, b.Length);
offset -= b.Length;
}
fs.Position = target; fs.Write(ASCIIEncoding.ASCII.GetBytes(ab), 0, ab.Length);
sw.Stop();
Console.WriteLine("{0} ms", sw.Elapsed.TotalMilliseconds);
To gain better performance for file IO, play with "magic two powered numbers" like in the code above. The creation of the file uses a buffer of 262144 bytes (256KB) that does not help at all. The same buffer for the insertion does the "performance job" as you can see by the StopWatch results if you run the code. A draft test on my PC gave the following results:
13628.8 ms for creation and 3597.0971 ms for insertion.
Note that the target byte for insertion is 10, meaning that almost the whole file was rewritten.
Why don't you put a pointer to the end of the file (literally, four bytes above the current size of the file) and then, on the end of file write the length of inserted data, and finally the data you want to insert itself. For example, if you have a string in the middle of the file, and you want to insert few characters in the middle of the string, you can write a pointer to the end of file over some four characters in the string, and then write that four characters to the end together with the characters you firstly wanted to insert. It's all about ordering data. Of course, you can do this only if you are writing the whole file by yourself, I mean you are not using other codecs.