I am using fs.Length, where fs is a FileStream.
Is this an O(1) operation? I would think this would just read from the properties of the file, as opposed to going through the file to find when the seek position has reached the end. The file I am trying to find the length of could easily range from 1 MB to 4-5 GB.
However I noticed that there is a FileInfo class, which also has a Length property.
Do both of these Length properties theoretically take the same amount of time? Or does is fs.Length slower because it must open the FileStream first?
The natural way to get the file size in .NET is the FileInfo.Length property you mentioned.
I am not sure Stream.Length is slower (it won't read the whole file anyway), but it's definitely more natural to use FileInfo instead of a FileStream if you do not plan to read the file.
Here's a small benchmark that will provide some numeric values:
private static void Main(string[] args)
{
string filePath = ...; // Path to 2.5 GB file here
Stopwatch z1 = new Stopwatch();
Stopwatch z2 = new Stopwatch();
int count = 10000;
z1.Start();
for (int i = 0; i < count; i++)
{
long length;
using (Stream stream = new FileStream(filePath, FileMode.Open))
{
length = stream.Length;
}
}
z1.Stop();
z2.Start();
for (int i = 0; i < count; i++)
{
long length = new FileInfo(filePath).Length;
}
z2.Stop();
Console.WriteLine(string.Format("Stream: {0}", z1.ElapsedMilliseconds));
Console.WriteLine(string.Format("FileInfo: {0}", z2.ElapsedMilliseconds));
Console.ReadKey();
}
Results:
Stream: 886
FileInfo: 727
Both will access the file system metadata rather than reading the whole file. I don't know which is more efficient necessarily, as a rule of thumb I'd say that if you only want to know the length (and other metadata), use FileInfo - whereas if you're opening the file as a stream anyway, use FileStream.Length.
Related
I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.
This is what I'm trying so far:
var filestream = new System.IO.FileStream(textFilePath,
System.IO.FileMode.Open,
System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);
while ((lineOfText = file.ReadLine()) != null)
{
//Do something with the lineOfText
}
To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer but you cannot expect that my results apply to your environment.
Using StreamReader.ReadLine
This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.
const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
String line;
while ((line = streamReader.ReadLine()) != null)
{
// Process line
}
}
The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
Using File.ReadLines
This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.
var lines = File.ReadLines(fileName);
foreach (var line in lines)
// Process line
Using File.ReadAllLines
This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String> allowing you to randomly access the lines.
var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
var line = lines[i];
// Process line
}
Using String.Split
This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.
using (var streamReader = File.OpenText(fileName)) {
var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
// Process line
}
My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.
If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).
While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.
If you're just wanting to read lines in a file without doing much, according to these benchmarks, the fastest way to read a file is the age old method of:
using (StreamReader sr = File.OpenText(fileName))
{
string s = String.Empty;
while ((s = sr.ReadLine()) != null)
{
//do minimal amount of work here
}
}
However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read) :
AllLines = new string[MAX]; //only allocate memory here
using (StreamReader sr = File.OpenText(fileName))
{
int x = 0;
while (!sr.EndOfStream)
{
AllLines[x] = sr.ReadLine();
x += 1;
}
} //Finished. Close the file
//Now parallel process each line in the file
Parallel.For(0, AllLines.Length, x =>
{
DoYourStuff(AllLines[x]); //do your work here
});
Use the following code:
foreach (string line in File.ReadAllLines(fileName))
This was a HUGE difference in reading performance.
It comes at the cost of memory consumption, but totally worth it!
If the file size is not big, then it is faster to read the entire file and split it afterwards
var filestreams = sr.ReadToEnd().Split(Environment.NewLine,
StringSplitOptions.RemoveEmptyEntries);
There's a good topic about this in Stack Overflow question Is 'yield return' slower than "old school" return?.
It says:
ReadAllLines loads all of the lines into memory and returns a
string[]. All well and good if the file is small. If the file is
larger than will fit in memory, you'll run out of memory.
ReadLines, on the other hand, uses yield return to return one line at
a time. With it, you can read any size file. It doesn't load the whole
file into memory.
Say you wanted to find the first line that contains the word "foo",
and then exit. Using ReadAllLines, you'd have to read the entire file
into memory, even if "foo" occurs on the first line. With ReadLines,
you only read one line. Which one would be faster?
If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.
You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually find each new line in the read buffer would probably be faster.
When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw Out of Memory exception, this was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, like below.
The class:
//can return empty lines sometimes
class LinePortionTextReader
{
private const int BUFFER_SIZE = 100000000; //100M characters
StreamReader sr = null;
string remainder = "";
public LinePortionTextReader(string filePath)
{
if (File.Exists(filePath))
{
sr = new StreamReader(filePath);
remainder = "";
}
}
~LinePortionTextReader()
{
if(null != sr) { sr.Close(); }
}
public string[] ReadBlock()
{
if(null==sr) { return new string[] { }; }
char[] buffer = new char[BUFFER_SIZE];
int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
if (charactersRead < 1) { return new string[] { }; }
bool lastPart = (charactersRead < BUFFER_SIZE);
if (lastPart)
{
char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
buffer = buffer2;
}
string s = new string(buffer);
string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
sresult[0] = remainder + sresult[0];
if (!lastPart)
{
remainder = sresult[sresult.Length - 1];
sresult[sresult.Length - 1] = "";
}
return sresult;
}
public bool EOS
{
get
{
return (null == sr) ? true: sr.EndOfStream;
}
}
}
Example of use:
class Program
{
static void Main(string[] args)
{
if (args.Length < 3)
{
Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
return;
}
if (!File.Exists(args[0]))
{
Console.WriteLine("source file not found");
return;
}
if (!File.Exists(args[1]))
{
Console.WriteLine("reference file not found");
return;
}
TextWriter tw = new StreamWriter(args[2], false);
string[] refLines = File.ReadAllLines(args[1]);
LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
int blockCounter = 0;
while (!lptr.EOS)
{
string[] srcLines = lptr.ReadBlock();
for (int i = 0; i < srcLines.Length; i += 1)
{
string theLine = srcLines[i];
if (!string.IsNullOrEmpty(theLine)) //can return empty lines sometimes
{
for (int j = 0; j < refLines.Length; j += 1)
{
if (theLine.Contains(refLines[j]))
{
tw.WriteLine(theLine);
break;
}
}
}
}
blockCounter += 1;
Console.WriteLine(String.Format("100 Mb blocks processed: {0}", blockCounter));
}
tw.Close();
}
}
I believe splitting strings and array handling can be significantly improved,
yet the goal here was to minimize number of disk reads.
I need to combine multiple files (video & music & text) into single file with a custom file type (for example: *.abcd) and custom file data structure , and the file should only be read by my program and my program should be able to separate parts of this file. How do this in .net & c#?
https://www.imageupload.co.uk/image/ZWnh
Like M463 rightly pointed out, you could use System.IO.Compression to compress those files together and encrypt them...although encryption is a completely different art and another headache.
A better option would be to have some metadata in, say, the first few bytes of the file and then store the files' bytes as raw bytes. This would avoid any person from figuring the contents by just looking at it in a text editor. Again, if you want to really protect your data, encryption is unavoidable. But a simple algorithm to begin with would be this:
using System;
using System.IO;
using System.Collections.Generic;
namespace Compression
{
public class ClassName
{
public static void Compress(string[] fileNames, string resultantFileName)
{
List<byte> bytesToWrite = new List<byte>();
//add metadata about the number of files
int filesLength = fileNames.Length;
bytesToWrite.AddRange(BitConverter.GetBytes(filesLength));
List<byte[]> files = new List<byte[]>();
foreach(string fileName in fileNames)
{
byte[] bytes = File.ReadAllBytes(fileName);
//add metadata about the size of each file
bytesToWrite.AddRange(BitConverter.GetBytes(bytes.Length));
files.Add(bytes);
}
foreach(byte[] bytes in files)
{
//write the actual files itself
bytesToWrite.AddRange(bytes);
}
File.WriteAllBytes(resultantFileName, bytesToWrite.ToArray());
}
public static void Decompress(string fileName)
{
List<byte> bytes = new List<byte>(File.ReadAllBytes(fileName));
//this int represents the number of files in the byte array
int filesLength = BitConverter.ToInt32(bytes.ToArray(), 0);
List<int> sizes = new List<int>();
//get the size of each file
for(int i = 0; i < filesLength; i++)
{
//the first 2 bytes represent the number of files
//then each succeding int represents the size of each file
int size = BitConverter.ToInt32(bytes.ToArray(), 2 + i * 2);
sizes.Add(size);
}
//now read all the files
for(int i = 0; i < filesLength; i++)
{
int lastByteTillNow = 0;
for(int j = 0; j < i; j++)
lastByteTillNow += sizes[j];
File.WriteAllBytes("file " + i, bytes.GetRange(2 + 2 * filesLength + lastByteTillNow, sizes[i]).ToArray());
}
}
}
}
Now obviously this is not the best algorithm you've come across nor is it the most optimized snippet of code. After all, it is just what I could come up with in 10-15 mins. So I haven't even tested it yet. However, the point is, it gives you the idea doesn't it? I have limited the size of each file to the maximum length of an Int32 (however, changing it to Int64 a.k.a long wouldnt be much of a trouble). But it gives you an idea. You can even modify the snippet to load and write to and from the RAM via MemoryStreams (Systtem.IO.MemorySteam I think). But whatever, this should give you a start!
How about an encrypted .zip-container containing your files? Handling of .zip-files is already available in the .NET-Framework, take a look at the System.IO.Compression namespace. Or you could use some third party library.
You could even force a different file extension by just renaming the file, if you really want to...
I would like to set a small buffer so the buffer can be written to file more frequently. But it seems this does not work. I wrote the following code and check the text file from time to time and find that text is written to file when i = 840 and the file size is exactly 4K, which is the default buffer size. How come?
using (StreamWriter sw = new StreamWriter("u:\\log.txt", true, Encoding.UTF8, 1))
{
for (int i = 0; i < 300000; i++)
{
sw.WriteLine(i);
Console.Write(i);
Console.ReadLine();
}
}
StreamWriter uses an underlying FileStream, and based on the source code, it looks like the buffer size isn't passed to the file stream.
You can google: .net source code streamwriter
I am trying to empower users to upload large files. Before I upload a file, I want to chunk it up. Each chunk needs to be a C# object. The reason why is for logging purposes. Its a long story, but I need to create actual C# objects that represent each file chunk. Regardless, I'm trying the following approach:
public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
List<FileChunk> chunks = new List<FileChunk>();
if (fileBytes.Length > 0)
{
FileChunk chunk = new FileChunk();
for (int i = 0; i < (fileBytes.Length / 512); i++)
{
chunk.Number = (i + 1);
chunk.Offset = (i * 512);
chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
chunks.Add(chunk);
chunk = new FileChunk();
}
}
return chunks;
}
Unfortunately, this approach seems to be incredibly slow. Does anyone know how I can improve the performance while still creating objects for each chunk?
thank you
I suspect this is going to hurt a little:
chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
Try this instead:
byte buffer = new byte[512];
Buffer.BlockCopy(fileBytes, chunk.Offset, buffer, 0, 512);
chunk.Bytes = buffer;
(Code not tested)
And the reason why this code would likely be slow is because Skip doesn't do anything special for arrays (though it could). This means that every pass through your loop is iterating the first 512*n items in the array, which results in O(n^2) performance, where you should just be seeing O(n).
Try something like this (untested code):
public static List<FileChunk> GetAllForFile(string fileName, FileMode.Open)
{
var chunks = new List<FileChunk>();
using (FileStream stream = new FileStream(fileName))
{
int i = 0;
while (stream.Position <= stream.Length)
{
var chunk = new FileChunk();
chunk.Number = (i);
chunk.Offset = (i * 512);
Stream.Read(chunk.Bytes, 0, 512);
chunks.Add(chunk);
i++;
}
}
return chunks;
}
The above code skips several steps in your process, preferring to read the bytes from the file directly.
Note that, if the file is not an even multiple of 512, the last chunk will contain less than 512 bytes.
Same as Robert Harvey's answer, but using a BinaryReader, that way I don't need to specify an offset. If you use a BinaryWriter on the other end to reassemble the file, you won't need the Offset member of FileChunk.
public static List<FileChunk> GetAllForFile(string fileName) {
var chunks = new List<FileChunk>();
using (FileStream stream = new FileStream(fileName)) {
BinaryReader reader = new BinaryReader(stream);
int i = 0;
bool eof = false;
while (!eof) {
var chunk = new FileChunk();
chunk.Number = i;
chunk.Offset = (i * 512);
chunk.Bytes = reader.ReadBytes(512);
chunks.Add(chunk);
i++;
if (chunk.Bytes.Length < 512) { eof = true; }
}
}
return chunks;
}
Have you thought about what you're going to do to compensate for packet loss and data corruption?
Since you mentioned that the load is taking a long time then I would use asynchronous file reading in order to speed up the loading process. The hard disk is the slowest component of a computer. Google does asynchronous reads and writes on Google Chrome to improve their load times. I had to do something like this in C# in a previous job.
The idea would be to spawn several asynchronous requests over different parts of the file. Then when a request comes in, take the byte array and create your FileChunk objects taking 512 bytes at a time. There are several benefits to this:
If you have this run in a separate thread, then you won't have the whole program waiting to load the large file you have.
You can process a byte array, creating FileChunk objects, while the hard disk is still trying to for-fill read request on other parts of the file.
You will save on RAM space if you limit the amount of pending read requests you can have. This allows less page faulting to the hard disk and use the RAM and CPU cache more efficiently, which speeds up processing further.
You would want to use the following methods in the FileStream class.
[HostProtectionAttribute(SecurityAction.LinkDemand, ExternalThreading = true)]
public virtual IAsyncResult BeginRead(
byte[] buffer,
int offset,
int count,
AsyncCallback callback,
Object state
)
public virtual int EndRead(
IAsyncResult asyncResult
)
Also this is what you will get in the asyncResult:
// Extract the FileStream (state) out of the IAsyncResult object
FileStream fs = (FileStream) ar.AsyncState;
// Get the result
Int32 bytesRead = fs.EndRead(ar);
Here is some reference material for you to read.
This is a code sample of working with Asynchronous File I/O Models.
This is a MS documentation reference for Asynchronous File I/O.
I have a huge file, where I have to insert certain characters at a specific location. What is the easiest way to do that in C# without rewriting the whole file again.
Filesystems do not support "inserting" data in the middle of a file. If you really have a need for a file that can be written to in a sorted kind of way, I suggest you look into using an embedded database.
You might want to take a look at SQLite or BerkeleyDB.
Then again, you might be working with a text file or a legacy binary file. In that case your only option is to rewrite the file, at least from the insertion point up to the end.
I would look at the FileStream class to do random I/O in C#.
You will probably need to rewrite the file from the point you insert the changes to the end. You might be best always writing to the end of the file and use tools such as sort and grep to get the data out in the desired order. I am assuming you are talking about a text file here, not a binary file.
There is no way to insert characters in to a file without rewriting them. With C# it can be done with any Stream classes. If the files are huge, I would recommend you to use GNU Core Utils inside C# code. They are the fastest. I used to handle very large text files with the core utils ( of sizes 4GB, 8GB or more etc ). Commands like head, tail, split, csplit, cat, shuf, shred, uniq really help a lot in text manipulation.
For example if you need to put some chars in a 2GB file, you can use split -b BYTECOUNT, put the ouptut in to a file, append the new text to it, and get the rest of the content and add to it. This should supposedly be faster than any other way.
Hope it works. Give it a try.
You can use random access to write to specific locations of a file, but you won't be able to do it in text format, you'll have to work with bytes directly.
If you know the specific location to which you want to write the new data, use the BinaryWriter class:
using (BinaryWriter bw = new BinaryWriter (File.Open (strFile, FileMode.Open)))
{
string strNewData = "this is some new data";
byte[] byteNewData = new byte[strNewData.Length];
// copy contents of string to byte array
for (var i = 0; i < strNewData.Length; i++)
{
byteNewData[i] = Convert.ToByte (strNewData[i]);
}
// write new data to file
bw.Seek (15, SeekOrigin.Begin); // seek to position 15
bw.Write (byteNewData, 0, byteNewData.Length);
}
You may take a look at this project:
Win Data Inspector
Basically, the code is the following:
// this.Stream is the stream in which you insert data
{
long position = this.Stream.Position;
long length = this.Stream.Length;
MemoryStream ms = new MemoryStream();
this.Stream.Position = 0;
DIUtils.CopyStream(this.Stream, ms, position, progressCallback);
ms.Write(data, 0, data.Length);
this.Stream.Position = position;
DIUtils.CopyStream(this.Stream, ms, this.Stream.Length - position, progressCallback);
this.Stream = ms;
}
#region Delegates
public delegate void ProgressCallback(long position, long total);
#endregion
DIUtils.cs
public static void CopyStream(Stream input, Stream output, long length, DataInspector.ProgressCallback callback)
{
long totalsize = input.Length;
long byteswritten = 0;
const int size = 32768;
byte[] buffer = new byte[size];
int read;
int readlen = length < size ? (int)length : size;
while (length > 0 && (read = input.Read(buffer, 0, readlen)) > 0)
{
output.Write(buffer, 0, read);
byteswritten += read;
length -= read;
readlen = length < size ? (int)length : size;
if (callback != null)
callback(byteswritten, totalsize);
}
}
Depending on the scope of your project, you may want to decide to insert each line of text with your file in a table datastructure. Sort of like a database table, that way you can insert to a specific location at any given moment, and not have to read-in, modify, and output the entire text file each time. This is given the fact that your data is "huge" as you put it. You would still recreate the file, but at least you create a scalable solution in this manner.
It may be "possible" depending on how the filesystem stores files to quickly insert (ie, add additional) bytes in the middle. If it is remotely possible it may only be feasible to do so a full block at a time, and only by either doing low level modification of the filesystem itself or by using a filesystem specific interface.
Filesystems are not generally designed for this operation. If you need to quickly do inserts you really need a more general database.
Depending on your application a middle ground would be to bunch your inserts together, so you only do one rewrite of the file rather than twenty.
You will always have to rewrite the remaining bytes from the insertion point. If this point is at 0, then you will rewrite the whole file. If it is 10 bytes before the last byte, then you will rewrite the last 10 bytes.
In any case there is no function to directly support "insert to file". But the following code can do it accurately.
var sw = new Stopwatch();
var ab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ";
// create
var fs = new FileStream(#"d:\test.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite, 262144, FileOptions.None);
sw.Restart();
fs.Seek(0, SeekOrigin.Begin);
for (var i = 0; i < 40000000; i++) fs.Write(ASCIIEncoding.ASCII.GetBytes(ab), 0, ab.Length);
sw.Stop();
Console.WriteLine("{0} ms", sw.Elapsed.TotalMilliseconds);
fs.Dispose();
// insert
fs = new FileStream(#"d:\test.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.ReadWrite, 262144, FileOptions.None);
sw.Restart();
byte[] b = new byte[262144];
long target = 10, offset = fs.Length - b.Length;
while (offset != 0)
{
if (offset < 0)
{
offset = b.Length - target;
b = new byte[offset];
}
fs.Position = offset; fs.Read(b, 0, b.Length);
fs.Position = offset + target; fs.Write(b, 0, b.Length);
offset -= b.Length;
}
fs.Position = target; fs.Write(ASCIIEncoding.ASCII.GetBytes(ab), 0, ab.Length);
sw.Stop();
Console.WriteLine("{0} ms", sw.Elapsed.TotalMilliseconds);
To gain better performance for file IO, play with "magic two powered numbers" like in the code above. The creation of the file uses a buffer of 262144 bytes (256KB) that does not help at all. The same buffer for the insertion does the "performance job" as you can see by the StopWatch results if you run the code. A draft test on my PC gave the following results:
13628.8 ms for creation and 3597.0971 ms for insertion.
Note that the target byte for insertion is 10, meaning that almost the whole file was rewritten.
Why don't you put a pointer to the end of the file (literally, four bytes above the current size of the file) and then, on the end of file write the length of inserted data, and finally the data you want to insert itself. For example, if you have a string in the middle of the file, and you want to insert few characters in the middle of the string, you can write a pointer to the end of file over some four characters in the string, and then write that four characters to the end together with the characters you firstly wanted to insert. It's all about ordering data. Of course, you can do this only if you are writing the whole file by yourself, I mean you are not using other codecs.