I am creating a file of a specified size - I don't care what data is in it, although random would be nice. Currently I am doing this:
var sizeInMB = 3; // up to many GB
using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
    using (BinaryWriter writer = new BinaryWriter(stream))
    {
        while (writer.BaseStream.Length <= sizeInMB * 1000000)
        {
            writer.Write("a"); // This could be random. Larger strings obviously improve throughput.
        }
        writer.Close(); // redundant; the using block disposes the writer
    }
}
This isn't efficient or even the right way to go about it. Any higher-performance solutions?
Thanks for all the answers.
Edit
Ran some tests on the following methods for a 2 GB file (times in ms):
Method 1: Jon Skeet
byte[] data = new byte[sizeInMb * 1024 * 1024];
Random rng = new Random();
rng.NextBytes(data);
File.WriteAllBytes(fileName, data);
N/A: OutOfMemoryException for a 2 GB file
Method 2: Jon Skeet
byte[] data = new byte[8192];
Random rng = new Random();
using (FileStream stream = File.OpenWrite(fileName))
{
    for (int i = 0; i < sizeInMB * 128; i++)
    {
        rng.NextBytes(data);
        stream.Write(data, 0, data.Length);
    }
}
Buffer size 1 KB: 45,868, 23,283, 23,346
Buffer size 8 KB: 30,426, 22,936, 22,936
Buffer size 128 KB: 24,877, 20,585, 20,716
Method 3: Hans Passant (super fast, but the data isn't random)
using (var fs = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
{
    // Cast to long so sizes of 2 GB and above don't overflow Int32.
    fs.SetLength((long)sizeInMB * 1024 * 1024);
}
257, 287, 3, 3, 2, 3 etc.
Well, a very simple solution:
byte[] data = new byte[sizeInMb * 1024 * 1024];
Random rng = new Random();
rng.NextBytes(data);
File.WriteAllBytes(fileName, data);
A slightly more memory-efficient version :)
// Note: block size must be a factor of 1MB to avoid rounding errors :)
const int blockSize = 1024 * 8;
const int blocksPerMb = (1024 * 1024) / blockSize;
byte[] data = new byte[blockSize];
Random rng = new Random();
using (FileStream stream = File.OpenWrite(fileName))
{
    for (int i = 0; i < sizeInMb * blocksPerMb; i++)
    {
        rng.NextBytes(data);
        stream.Write(data, 0, data.Length);
    }
}
However, if you do this several times in very quick succession, creating a new instance of Random each time, you may get duplicate data, because Random is seeded from the system clock. See my article on randomness for more information. You could avoid this by using System.Security.Cryptography.RandomNumberGenerator, or by reusing the same instance of Random multiple times, with the caveat that it's not thread-safe.
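As a rough sketch of that first alternative (assuming the same fileName and sizeInMb variables as above), the loop stays the same and only the generator changes:
// Sketch: same block-writing loop, but filled from the cryptographic RNG
// so rapid successive runs can't repeat a clock-seeded sequence.
const int blockSize = 1024 * 8;
const int blocksPerMb = (1024 * 1024) / blockSize;
byte[] data = new byte[blockSize];
using (var rng = new System.Security.Cryptography.RNGCryptoServiceProvider())
using (FileStream stream = File.OpenWrite(fileName))
{
    for (int i = 0; i < sizeInMb * blocksPerMb; i++)
    {
        rng.GetBytes(data);
        stream.Write(data, 0, data.Length);
    }
}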
There's no faster way than taking advantage of the sparse file support built into NTFS, the default Windows file system. This code creates a one-gigabyte file in a fraction of a second:
using System;
using System.IO;

class Program {
    static void Main(string[] args) {
        using (var fs = new FileStream(@"c:\temp\onegigabyte.bin", FileMode.Create, FileAccess.Write, FileShare.None)) {
            fs.SetLength(1024 * 1024 * 1024);
        }
    }
}
When read, the file contains only zeros.
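One caveat: SetLength by itself still reserves the full size on disk; the file reads as zeros but isn't actually sparse. To make it genuinely sparse on NTFS, so it occupies almost no disk space, you can mark it with FSCTL_SET_SPARSE before extending it. A sketch using P/Invoke (the control code 0x000900C4 comes from the Windows SDK):
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class SparseProgram {
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
        IntPtr inBuffer, uint inBufferSize, IntPtr outBuffer, uint outBufferSize,
        out uint bytesReturned, IntPtr overlapped);

    const uint FSCTL_SET_SPARSE = 0x000900C4;

    static void Main() {
        using (var fs = new FileStream(@"c:\temp\onegigabyte.bin", FileMode.Create, FileAccess.Write, FileShare.None)) {
            // Mark the file sparse so the zero region consumes no clusters.
            uint bytesReturned;
            DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_SPARSE,
                IntPtr.Zero, 0, IntPtr.Zero, 0, out bytesReturned, IntPtr.Zero);
            fs.SetLength(1024L * 1024 * 1024);
        }
    }
}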
You can use the following class, which I created, to generate random strings:
using System;
using System.Text;

public class RandomStringGenerator
{
    readonly Random random;

    public RandomStringGenerator()
    {
        random = new Random();
    }

    public string Generate(int length)
    {
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        var stringBuilder = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            // Note: Next's upper bound is exclusive, so this yields chars 0..254.
            char ch = (char)random.Next(0, 255);
            stringBuilder.Append(ch);
        }
        return stringBuilder.ToString();
    }
}
Usage:
int length = 10;
string randomString = randomStringGenerator.Generate(length);
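If the goal is still a file of a given size, one way (just a sketch, with an assumed target size and path) is to keep writing generated strings until the underlying stream is long enough:
// Sketch: fill a file to roughly sizeInMB megabytes with random strings.
// The chunk size (8192) and the path are arbitrary choices.
var generator = new RandomStringGenerator();
long targetBytes = (long)sizeInMB * 1024 * 1024;
using (var writer = new StreamWriter(@"C:\temp\random.txt"))
{
    while (writer.BaseStream.Length < targetBytes)
    {
        writer.Write(generator.Generate(8192));
        writer.Flush(); // flush so BaseStream.Length reflects what's been written
    }
}
Note the result is only approximately the target size, since characters above 127 encode as more than one UTF-8 byte.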
An efficient way to create a large file:
FileStream fs = new FileStream(@"C:\temp\out.dat", FileMode.Create);
fs.Seek(1024 * 6, SeekOrigin.Begin);
System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
fs.Write(encoding.GetBytes("test"), 0, 4);
fs.Close();
However, this file will be empty (except for the "test" at the end). It's not clear exactly what you are trying to do: a large file with data, or just a large file. You can modify this to sparsely write some data in the file too, without filling it up completely.
If you do want the entire file filled with random data, then the only way I can think of is using the random-bytes approach from Jon above.
An improvement would be to fill a buffer of the desired size with the data and flush it all at once.
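Generalizing the seek trick, you can create a file of exactly N bytes by seeking to N - 1 and writing a single byte; the skipped region reads back as zeros (a sketch with an assumed size and path):
// Sketch: create a file of exactly targetSize bytes without writing its contents.
long targetSize = 3L * 1024 * 1024; // assumed size
using (var fs = new FileStream(@"C:\temp\out.dat", FileMode.Create))
{
    fs.Seek(targetSize - 1, SeekOrigin.Begin);
    fs.WriteByte(0);
}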
Related
I have a class Value. The output of Value is used as an input to other classes and eventually in Main. In Main, logic is performed and output is produced for the first 512 bits. I want my program to return to Value() to start on the next 512 bits of file.txt. How can I do that?
public static byte[] Value()
{
    byte[] numbers = new byte[9999];
    using (FileStream fs = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            for (int i = 0; i < 512; i++)
            {
                numbers[i] = Byte.Parse(line[i].ToString());
            }
        }
    }
    return numbers;
}
What you can do is pass Value() an offset and a length parameter.
But there is a problem with your method: each line of the file overwrites the first bytes of the array, which I don't think is what you want to do. So I corrected this to make sure you return only length bytes.
Using System.Linq's Skip and Take methods, you may find things easier as well:
public static byte[] Value(int startOffset, int length)
{
    // Skip/Take need a using System.Linq; directive; ToArray materializes the result.
    byte[] allBytes = File.ReadAllBytes(@"C:\Users\file.txt");
    return allBytes.Skip(startOffset).Take(length).ToArray();
}
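A hypothetical driver for it, walking the file 512 bits (64 bytes) at a time; ProcessChunk stands in for whatever logic Main performs:
// Hypothetical usage: advance the offset by 64 bytes (512 bits) per pass.
for (int offset = 0; ; offset += 64)
{
    byte[] chunk = Value(offset, 64);
    if (chunk.Length == 0)
        break; // ran past the end of the file
    ProcessChunk(chunk); // stand-in for the logic performed in Main
}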
It seems like what you are trying to do is make a recursive call on Value(). This is based on your comment, but it is not clear, so I am going to make that assumption.
There is a problem I see: in your scenario you're returning a byte[], so I modified your code a little to keep it as close to yours as possible.
/// <summary>
/// This method calls your Value method and returns the bytes; it is the entry point for the logic.
/// </summary>
/// <returns></returns>
public static byte[] ByteValueCaller()
{
    byte[] numbers = new byte[9999];
    Value(0, numbers);
    return numbers;
}

public static void Value(int startingByte, byte[] numbers)
{
    using (FileStream fs = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BinaryReader br = new BinaryReader(fs))
    {
        // 64 bytes == 512 bits.
        // Determine whether the last position to use is inside the stream,
        // or whether this chunk runs to the end of the stream.
        int bytesToRead = startingByte + 64 > br.BaseStream.Length ? (int)br.BaseStream.Length - startingByte : 64;
        // Move the stream to the given position.
        br.BaseStream.Seek(startingByte, SeekOrigin.Begin);
        // Populate the buffer with the given bytes.
        byte[] dataBuffer = br.ReadBytes(bytesToRead);
        // Copy from the temporary buffer into the numbers array.
        TransformBufferArrayToNumbers(startingByte, dataBuffer, numbers);
        // Recursive call for the next chunk.
        if (startingByte + bytesToRead < fs.Length)
            Value(startingByte + bytesToRead, numbers);
    }

    static void TransformBufferArrayToNumbers(int startingByte, byte[] dataBuffer, byte[] numbers)
    {
        for (var i = 0; i < dataBuffer.Length; i++)
        {
            numbers[startingByte + i] = dataBuffer[i];
        }
    }
}
Also, be careful with the byte[9999], as it limits how much of the file you can read. If that's a hard limit, I would fold it into the if that guards the recursive call as well.
@TiGreX
public static List<byte> ByteValueCaller()
{
    List<byte> numbers = new List<byte>();
    GetValue(0, numbers);
    return numbers;
}

public static void GetValue(int startingByte, List<byte> numbers)
{
    using (FileStream fs = File.Open(@"C:\Users\file1.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BinaryReader br = new BinaryReader(fs))
    {
        // 64 bytes == 512 bits.
        // Determine whether the last position to use is inside the stream,
        // or whether this chunk runs to the end of the stream.
        int bytesToRead = startingByte + 64 > br.BaseStream.Length ? (int)br.BaseStream.Length - startingByte : 64;
        // Move the stream to the given position.
        br.BaseStream.Seek(startingByte, SeekOrigin.Begin);
        // Populate the buffer with the given bytes.
        byte[] dataBuffer = br.ReadBytes(bytesToRead);
        numbers.AddRange(dataBuffer);
        // Recursive call for the next chunk.
        if (startingByte + bytesToRead < fs.Length)
            GetValue(startingByte + bytesToRead, numbers);
    }
}
I would like to get an estimate of the number of lines in a csv/text file so that I can use that number for a progress bar. The file could be extremely large, so getting the exact number of lines would take too long for this purpose.
What I have come up with is below (read in a portion of the file, count the number of lines, and use the file size to estimate the total number of lines):
public static int GetLineCountEstimate(string file)
{
    double count = 0;
    using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
    {
        long byteCount = fs.Length;
        int maxByteCount = 524288;
        if (byteCount > maxByteCount)
        {
            var buf = new byte[maxByteCount];
            fs.Read(buf, 0, maxByteCount);
            string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
            count = s.Split('\n').Length * byteCount / maxByteCount;
        }
        else
        {
            var buf = new byte[byteCount];
            fs.Read(buf, 0, (int)byteCount);
            string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
            count = s.Split('\n').Length;
        }
    }
    return Convert.ToInt32(count);
}
This seems to work OK, but I have some concerns:
1) I would like my parameter to simply be a Stream (as opposed to a filename), since I may also be reading from the clipboard (MemoryStream). However, Stream doesn't seem to be able to read n bytes at once into a buffer or get the total length of the stream in bytes, like FileStream can. Stream is the parent class of both MemoryStream and FileStream.
2) I don't want to assume an encoding such as UTF-8.
3) I don't want to assume an end-of-line character (it should work for CR, CRLF, and LF).
I would appreciate any help to make this function more robust.
Here is what I came up with as a more robust solution for estimating line count.
public static int EstimateLineCount(string file)
{
    using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
    {
        return EstimateLineCount(fs);
    }
}

public static int EstimateLineCount(Stream s)
{
    // If the stream is larger than 10 MB, estimate the line count; otherwise get the exact count.
    const int maxBytes = 10485760; // 10 MB = 1024 * 1024 * 10 bytes
    s.Position = 0;
    using (var sr = new StreamReader(s, Encoding.UTF8))
    {
        int lineCount = 0;
        if (s.Length > maxBytes)
        {
            while (s.Position < maxBytes && sr.ReadLine() != null)
                lineCount++;
            return Convert.ToInt32((double)lineCount * s.Length / s.Position);
        }
        while (sr.ReadLine() != null)
            lineCount++;
        return lineCount;
    }
}
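Because the second overload takes any seekable Stream, the same method covers the clipboard case through a MemoryStream; a small sketch (the literal text stands in for Clipboard.GetText()):
// Sketch: estimate lines for in-memory text, e.g. clipboard contents.
string text = "line1\nline2\nline3"; // stand-in for Clipboard.GetText()
using (var ms = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(text)))
{
    int estimate = EstimateLineCount(ms);
    Console.WriteLine(estimate); // prints 3
}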
var lineCount = File.ReadLines(@"C:\file.txt").Count();
Another way:
var lineCount = 0;
using (var reader = File.OpenText(@"C:\file.txt"))
{
    while (reader.ReadLine() != null)
    {
        lineCount++;
    }
}
You're cheating! You're asking more than one question... I'll try to help you anyway :P
No, you can't use Stream, but you can use StreamReader. This should provide the flexibility you need.
Test for the encoding, since I deduce you'll be working with various encodings. Keep in mind however that it's usually hard to cater for ALL scenarios, so pick a few important ones first, and extend your program later.
Don't - let me show you how:
First, consider your source. Whether it's a file or a memory stream, you should have an idea of its size. I've done the file bit because I'm lazy and it's easy, so you'll have to figure out the memory-stream bit yourself. What I've done is much simpler but less accurate: read the first line of the file and use its length as a percentage of the size of the file. Note that I multiplied the length of the string by 2, as that is the delta, in other words the number of extra bytes used per extra character in a string. Obviously this isn't very accurate, so you can extend it to x number of lines; just keep in mind that you'll have to change the formula as well.
static void Main(string[] args)
{
    FileInfo fileInfo = new FileInfo(@"C:\Muckabout\StringCounter\test.txt");
    using (var stream = new StreamReader(fileInfo.FullName))
    {
        var firstLine = stream.ReadLine(); // Read the first line.
        Console.WriteLine("First line read. This is roughly " + (firstLine.Length * 2.0) / fileInfo.Length * 100 + " per cent of the file.");
    }
    Console.ReadKey();
}
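To extend it to x lines as suggested, a rough sketch: sample the first N lines, average their byte length, and divide the file size by that average (N and the +2 per-line terminator are assumptions):
// Rough sketch: estimate total lines from the average byte length of a sample.
static long EstimateLineCountFromSample(string path, int sampleLines)
{
    var fileInfo = new FileInfo(path);
    long sampleBytes = 0;
    int linesRead = 0;
    using (var reader = new StreamReader(fileInfo.FullName))
    {
        string line;
        while (linesRead < sampleLines && (line = reader.ReadLine()) != null)
        {
            sampleBytes += line.Length + 2; // assume single-byte chars plus CRLF
            linesRead++;
        }
    }
    return linesRead == 0 ? 0 : fileInfo.Length * linesRead / sampleBytes;
}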
I'm working with large files, starting at 10 GB. I'm loading parts of the file into memory for processing. The following code works fine for smaller files (700 MB):
byte[] byteArr = new byte[layerPixelCount];
using (FileStream fs = File.OpenRead(recFileName))
{
    using (BinaryReader br = new BinaryReader(fs))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        for (int i = 0; i < byteArr.Length; i++)
        {
            byteArr[i] = (byte)(br.ReadUInt16() / 256);
        }
    }
}
After opening a 10 GB file, the first run of this function is OK. But the second Seek() throws an IO exception:
An attempt was made to move the file pointer before the beginning of the file.
The numbers are:
fs.Length = 11998628352
offset = 4252580352
byteArr.Length = 7746048
I assumed that GC didn't collect the closed fs reference before the second call and tried
GC.Collect();
GC.WaitForPendingFinalizers();
but no luck.
Any help is appreciated.
I'm guessing it's because either your signed integer indexer or offset is rolling over to negative values. Try declaring offset and i as long.
// offset is now a long
long offset = 4252580352;
byte[] byteArr = new byte[layerPixelCount];
using (FileStream fs = File.OpenRead(recFileName))
{
    using (BinaryReader br = new BinaryReader(fs))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        for (long i = 0; i < byteArr.Length; i++)
        {
            byteArr[i] = (byte)(br.ReadUInt16() / 256);
        }
    }
}
The following code logic works correctly with large files beyond 4 GB. The key issue to notice is the long data type used with the Seek method: a long can address offsets beyond the 2 GB boundary of an int. The code first processes the file in whole 1 GB chunks; after those are processed, the leftover (< 1 GB) bytes are processed. I use this code to calculate the CRC of files beyond 4 GB in size (using https://crc32c.machinezoo.com/ for the CRC32C calculation in this example).
private uint Crc32CAlgorithmBigCrc(string fileName)
{
    uint hash = 0;
    byte[] buffer = null;
    FileInfo fileInfo = new FileInfo(fileName);
    long fileLength = fileInfo.Length;
    int blockSize = 1024000000;
    int blocks = (int)(fileLength / blockSize);
    // Cast to long so the multiplication doesn't overflow Int32 for files over ~2 GB.
    int restBytes = (int)(fileLength - ((long)blocks * blockSize));
    long offsetFile = 0;
    bool firstBlock = true;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        using (BinaryReader br = new BinaryReader(fs))
        {
            while (blocks > 0)
            {
                blocks -= 1;
                fs.Seek(offsetFile, SeekOrigin.Begin);
                buffer = br.ReadBytes(blockSize);
                if (firstBlock)
                {
                    firstBlock = false;
                    hash = Crc32CAlgorithm.Compute(buffer);
                }
                else
                {
                    // Append to the running hash, not to the first block's hash.
                    hash = Crc32CAlgorithm.Append(hash, buffer);
                }
                offsetFile += blockSize;
            }
            if (restBytes > 0)
            {
                fs.Seek(offsetFile, SeekOrigin.Begin);
                buffer = br.ReadBytes(restBytes);
                hash = firstBlock ? Crc32CAlgorithm.Compute(buffer) : Crc32CAlgorithm.Append(hash, buffer);
            }
            buffer = null;
        }
    }
    //MessageBox.Show(hash.ToString());
    //MessageBox.Show(hash.ToString("X"));
    return hash;
}
I want to create a file with a cryptographically strong sequence of random values. This is the code:
int bufferLength = 719585280;
byte[] random = new byte[bufferLength];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(random);
File.WriteAllBytes("crypto.bin",random);
The problem is it throws an OutOfMemoryException at rng.GetBytes(random);. I need a file of exactly that size (no more, no less). How can I solve this? Thanks.
Simply do it in chunks:
byte[] buffer = new byte[16 * 1024];
int bytesToWrite = 719585280;
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
using (Stream output = File.Create("crypto.bin"))
{
    while (bytesToWrite > 0)
    {
        rng.GetBytes(buffer);
        int bytesThisTime = Math.Min(bytesToWrite, buffer.Length);
        output.Write(buffer, 0, bytesThisTime);
        bytesToWrite -= bytesThisTime;
    }
}
There's no reason to generate the whole thing in memory in one go, basically.
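On newer runtimes (.NET Core 2.1+ / .NET 5+), a sketch of the same chunked loop using the static RandomNumberGenerator.Fill, with no provider instance to manage:
// Sketch for newer runtimes: chunked writing via RandomNumberGenerator.Fill.
byte[] buffer = new byte[16 * 1024];
long bytesToWrite = 719585280;
using (Stream output = File.Create("crypto.bin"))
{
    while (bytesToWrite > 0)
    {
        int bytesThisTime = (int)Math.Min(bytesToWrite, buffer.Length);
        System.Security.Cryptography.RandomNumberGenerator.Fill(buffer.AsSpan(0, bytesThisTime));
        output.Write(buffer, 0, bytesThisTime);
        bytesToWrite -= bytesThisTime;
    }
}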
int fileSize = 719585280;
var bufLength = 4096;
byte[] random = new byte[bufLength];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
var bytesRemaining = fileSize;
using (var fs = File.Create(@"c:\crypto.bin"))
{
    while (bytesRemaining > 0)
    {
        rng.GetBytes(random);
        var bytesToWrite = Math.Min(bufLength, bytesRemaining);
        fs.Write(random, 0, bytesToWrite);
        bytesRemaining -= bytesToWrite;
    }
}
Try generating it in parts and streaming them together into the file.
I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.
I know that if the CRC is different, then the files are definitely different. However, if the CRC is the same, I can't be sure the files are. So, I'm looking for a nice efficient way of comparing the two streams: one from the posted file and one from the file system.
I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here... if they are equal...
    using (var file1 = new FileStream(fileName1, FileMode.Open))
    using (var file2 = new FileStream(fileName2, FileMode.Open))
        return FileStreamEquals(file1, file2);
}

// Note: Take/SequenceEqual below require a using System.Linq; directive.
static bool FileStreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize];
    byte[] buffer2 = new byte[bufferSize];
    while (true)
    {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);
        if (count1 != count2)
            return false;
        if (count1 == 0)
            return true;
        // You might replace the following with an efficient "memcmp".
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}
I sped up the "memcmp" by using an Int64 comparison in a loop over the stream chunks read. This reduced the time to about a quarter.
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048 * 2;
    var buffer1 = new byte[bufferSize];
    var buffer2 = new byte[bufferSize];
    while (true)
    {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);
        if (count1 != count2)
        {
            return false;
        }
        if (count1 == 0)
        {
            return true;
        }
        int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
        for (int i = 0; i < iterations; i++)
        {
            if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
            {
                return false;
            }
        }
    }
}
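On newer runtimes (.NET Core 2.1+, or .NET Framework with the System.Memory package), the manual Int64 loop can be replaced with a span comparison, which is vectorized internally; a sketch of just the chunk comparison:
// Sketch: MemoryExtensions.SequenceEqual compares spans with vectorized code.
// Requires using System; (spans live in the System namespace).
private static bool ChunksAreEqual(byte[] buffer1, int count1, byte[] buffer2, int count2)
{
    return count1 == count2 && buffer1.AsSpan(0, count1).SequenceEqual(buffer2.AsSpan(0, count2));
}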
This is how I would do it if you didn't want to rely on CRC:
/// <summary>
/// Binary comparison of two files
/// </summary>
/// <param name="fileName1">the file to compare</param>
/// <param name="fileName2">the other file to compare</param>
/// <returns>a value indicating whether the files are identical</returns>
public static bool CompareFiles(string fileName1, string fileName2)
{
    FileInfo info1 = new FileInfo(fileName1);
    FileInfo info2 = new FileInfo(fileName2);
    bool same = info1.Length == info2.Length;
    if (same)
    {
        using (FileStream fs1 = info1.OpenRead())
        using (FileStream fs2 = info2.OpenRead())
        using (BufferedStream bs1 = new BufferedStream(fs1))
        using (BufferedStream bs2 = new BufferedStream(fs2))
        {
            for (long i = 0; i < info1.Length; i++)
            {
                if (bs1.ReadByte() != bs2.ReadByte())
                {
                    same = false;
                    break;
                }
            }
        }
    }
    return same;
}
The accepted answer had an error that was pointed out but never corrected: stream Read calls are not guaranteed to return all the bytes requested.
BinaryReader.ReadBytes calls, by contrast, are guaranteed to return as many bytes as requested unless the end of the stream is reached first.
The following code takes advantage of BinaryReader to do the comparison:
static private bool FileEquals(string file1, string file2)
{
    using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (BinaryReader b1 = new BinaryReader(s1))
    using (BinaryReader b2 = new BinaryReader(s2))
    {
        while (true)
        {
            byte[] data1 = b1.ReadBytes(64 * 1024);
            byte[] data2 = b2.ReadBytes(64 * 1024);
            if (data1.Length != data2.Length)
                return false;
            if (data1.Length == 0)
                return true;
            if (!data1.SequenceEqual(data2))
                return false;
        }
    }
}
If you change that CRC to a SHA-1 signature, the chances of the files being different but having the same signature are astronomically small.
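A sketch of that idea (using System.Security.Cryptography and System.Linq; hashing reads both files in full, so it pays off mainly when one hash is precomputed and stored, as in the question):
// Sketch: compare two files by SHA-1 digest.
static bool FilesMatchBySha1(string fileName1, string fileName2)
{
    byte[] hash1, hash2;
    using (var sha1 = System.Security.Cryptography.SHA1.Create())
    {
        using (var f1 = File.OpenRead(fileName1))
            hash1 = sha1.ComputeHash(f1);
        using (var f2 = File.OpenRead(fileName2))
            hash2 = sha1.ComputeHash(f2); // ComputeHash reinitializes the instance, so reuse is fine
    }
    return hash1.SequenceEqual(hash2);
}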
You can check the lengths and dates of the two files even before checking the CRC, to possibly avoid the CRC check.
But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32-bit PC, read 4 bytes at a time and compare them as Int32s. On a 64-bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You would also probably want to use an unsafe code block so that you could use pointers instead of doing a bunch of bit shifting and ORing to get the bytes into the native int sizes.
You can use IntPtr.Size to determine the ideal size for the current processor architecture.
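A minimal sketch of that stride trick for two equal-length buffers, assuming a 64-bit process and compiling with /unsafe (IntPtr.Size would let you pick the stride at runtime):
// Sketch: compare 8 bytes per stride via pointers, then finish byte by byte.
static unsafe bool BuffersEqual(byte[] a, byte[] b, int count)
{
    fixed (byte* pa = a, pb = b)
    {
        long* qa = (long*)pa, qb = (long*)pb;
        int strides = count / sizeof(long);
        for (int i = 0; i < strides; i++)
            if (qa[i] != qb[i])
                return false;
        for (int i = strides * sizeof(long); i < count; i++)
            if (pa[i] != pb[i])
                return false;
        return true;
    }
}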