I am trying to read a text file using a for loop that runs 100 times.
StreamReader reader = new StreamReader("client.txt");
for (int i = 0; i < 100; i++)
{
    reader.ReadLine();
}
Now this works fine if the text file has 100 lines, but not if it has, let's say, 700. So I want the loop to still run 100 times but read "1%" of the file in each run. How would I do that?
If the file size is not too large, you can:
string[] lines = File.ReadAllLines("client.txt");
or
string text = File.ReadAllText("client.txt");
Reading 1% at a time is a bit tricky; I'd go with the approach of reading line by line:
var filename = "client.txt";
var info = new FileInfo(filename);
var text = new StringBuilder();
using (var stream = new FileStream(filename, FileMode.Open))
using (var reader = new StreamReader(stream))
{
while (!reader.EndOfStream)
{
text.AppendLine(reader.ReadLine());
var progress = Convert.ToDouble(stream.Position) * 100 / info.Length;
Console.WriteLine(progress);
}
}
var result = text.ToString();
But please note that the progress will not be very accurate, because StreamReader.ReadLine (and equivalently ReadLineAsync) will often read more than just a single line - it basically reads into a buffer and then interprets that buffer. That's much more efficient than reading a single byte at a time, but it does mean that the stream will have advanced further than it strictly needs to.
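If you need a smoother figure, one workaround is to estimate the bytes consumed from the decoded lines instead of the stream position. This is only a sketch: it assumes a UTF-8 file and counts a single byte per line terminator, so it will still drift slightly for "\r\n" endings.
var info = new FileInfo("client.txt");
long bytesSeen = 0;
using (var reader = new StreamReader("client.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Bytes of the decoded line plus one byte for the assumed '\n' terminator.
        bytesSeen += Encoding.UTF8.GetByteCount(line) + 1;
        var progress = bytesSeen * 100.0 / info.Length;
        Console.WriteLine(progress);
    }
}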
Related
I'm trying to parse a crg-file in C#. The file is mixed with plain text and binary data. The first section of the file contains plain text, while the rest of the file is binary (lots of floats). Here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
var f = reader.ReadSingle();
Debug.WriteLine(f);
}
When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for CR/LF/CRLF bytes as sentinels, although it sounds like in your case you could just look for the $$$$ using binary APIs, then decode the entire block before it, and scan forwards). When you think you have an entire line, then you can use Encoding to parse each line - the most convenient API being Encoding.GetString(). When you've finished looking through the text data as binary, then you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over the more direct API. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem: BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use byte[] for the data; either via File.ReadAllBytes, or by renting an oversized byte[] from the array-pool and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at the data, it has gone - so you need to maintain your own back-buffers. The pipelines API is ideal for this when dealing with large data, but is an advanced topic.
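A minimal sketch of that approach, assuming the whole file fits comfortably in memory; the file name is taken from the question, and the marker search and float layout are assumptions you would adjust to the real format:
byte[] data = File.ReadAllBytes(@"crg_sample.crg");
byte[] marker = Encoding.ASCII.GetBytes("$$$$");

// Scan for the "$$$$" marker with a plain binary comparison.
int markerIndex = -1;
for (int i = 0; i <= data.Length - marker.Length && markerIndex < 0; i++)
{
    bool match = true;
    for (int j = 0; j < marker.Length; j++)
    {
        if (data[i + j] != marker[j]) { match = false; break; }
    }
    if (match) markerIndex = i;
}
if (markerIndex < 0) throw new InvalidDataException("Marker not found.");

// The binary section starts after the line feed that ends the marker line.
int binaryStart = Array.IndexOf(data, (byte)'\n', markerIndex) + 1;

// Decode the header as text, then read the floats from the remainder.
string header = Encoding.ASCII.GetString(data, 0, binaryStart);
for (int offset = binaryStart; offset + sizeof(float) <= data.Length; offset += sizeof(float))
{
    float f = BitConverter.ToSingle(data, offset);
    // ... use f ...
}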
UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
{
var line = sr.ReadLine();
while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
{
line = sr.ReadLine();
}
}
using (BinaryReader reader = new BinaryReader(fs))
{
// TODO: Start reading the binary data
}
}
Solution
I know this is far from the most optimized solution, but in my case it did the trick, and since the plain-text section of the file was known to be fairly small, this didn't cause any noticeable performance issues. Here's the code:
using var fileStream = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);
var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;
while (!foundNewLine)
{
var c = reader.ReadChar();
if (!foundMarker)
{
currentString += c;
if (currentString.Length > markerString.Length)
currentString = currentString.Substring(1);
if (currentString == markerString)
foundMarker = true;
}
else
{
if (c == newLine)
foundNewLine = true;
}
}
if (foundNewLine)
{
// Read binary
}
Note: If you're dealing with larger or more complex files, you should probably take a look at Marc Gravell's answer and the comment sections.
I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the scope of .NET and C#.
This is what I'm trying so far:
var filestream = new System.IO.FileStream(textFilePath,
System.IO.FileMode.Open,
System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);
string lineOfText;
while ((lineOfText = file.ReadLine()) != null)
{
//Do something with the lineOfText
}
To find the fastest way to read a file line by line, you will have to do some benchmarking. I have done some small tests on my computer, but you cannot expect my results to apply to your environment.
Using StreamReader.ReadLine
This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.
const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
String line;
while ((line = streamReader.ReadLine()) != null)
{
// Process line
}
}
The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
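A sketch of the same loop with that hint passed to the FileStream constructor; whether it actually helps is something only a benchmark can tell:
using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
    FileShare.Read, BufferSize, FileOptions.SequentialScan))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Process line
    }
}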
Using File.ReadLines
This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.
var lines = File.ReadLines(fileName);
foreach (var line in lines)
// Process line
Using File.ReadAllLines
This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String> allowing you to randomly access the lines.
var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
var line = lines[i];
// Process line
}
Using String.Split
This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.
using (var streamReader = File.OpenText(fileName)) {
var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
// Process line
}
My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.
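For example, a sketch of that last case, keeping the question's FileShare.ReadWrite but with a 4 KB buffer:
using (var fileStream = new FileStream(textFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, 4096))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Process line
    }
}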
If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).
While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.
If you just want to read the lines in a file without doing much, then according to these benchmarks, the fastest way to read a file is the age-old method of:
using (StreamReader sr = File.OpenText(fileName))
{
string s = String.Empty;
while ((s = sr.ReadLine()) != null)
{
//do minimal amount of work here
}
}
However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read):
string[] AllLines = new string[MAX]; // only allocate memory here; MAX is the number of lines you expect
using (StreamReader sr = File.OpenText(fileName))
{
int x = 0;
while (!sr.EndOfStream)
{
AllLines[x] = sr.ReadLine();
x += 1;
}
} //Finished. Close the file
//Now parallel process each line in the file
Parallel.For(0, AllLines.Length, x =>
{
DoYourStuff(AllLines[x]); //do your work here
});
Use the following code:
foreach (string line in File.ReadAllLines(fileName))
This made a HUGE difference in reading performance.
It comes at the cost of memory consumption, but totally worth it!
If the file size is not big, then it is faster to read the entire file and split it afterwards:
using (var sr = new StreamReader(fileName))
{
    var lines = sr.ReadToEnd().Split(new[] { Environment.NewLine },
        StringSplitOptions.RemoveEmptyEntries);
}
There's a good topic about this in Stack Overflow question Is 'yield return' slower than "old school" return?.
It says:
ReadAllLines loads all of the lines into memory and returns a
string[]. All well and good if the file is small. If the file is
larger than will fit in memory, you'll run out of memory.
ReadLines, on the other hand, uses yield return to return one line at
a time. With it, you can read any size file. It doesn't load the whole
file into memory.
Say you wanted to find the first line that contains the word "foo",
and then exit. Using ReadAllLines, you'd have to read the entire file
into memory, even if "foo" occurs on the first line. With ReadLines,
you only read one line. Which one would be faster?
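For instance, the "foo" search from the quote can be written so that reading stops at the first match; a short sketch using LINQ over File.ReadLines:
// Stops enumerating - and therefore stops reading the file - at the first match.
var firstFoo = File.ReadLines(fileName).FirstOrDefault(line => line.Contains("foo"));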
If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.
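A sketch of that idea, assuming the file fits in memory:
// Read the whole file into memory once, then read lines from the in-memory copy.
byte[] allBytes = File.ReadAllBytes(fileName);
using (var memoryStream = new MemoryStream(allBytes, false))
using (var reader = new StreamReader(memoryStream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Process line
    }
}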
You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually find each new line in the read buffer would probably be faster.
When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw an OutOfMemoryException; this was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, like below.
The class:
//can return empty lines sometimes
class LinePortionTextReader
{
private const int BUFFER_SIZE = 100000000; //100M characters
StreamReader sr = null;
string remainder = "";
public LinePortionTextReader(string filePath)
{
if (File.Exists(filePath))
{
sr = new StreamReader(filePath);
remainder = "";
}
}
~LinePortionTextReader()
{
if(null != sr) { sr.Close(); }
}
public string[] ReadBlock()
{
if(null==sr) { return new string[] { }; }
char[] buffer = new char[BUFFER_SIZE];
int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
if (charactersRead < 1) { return new string[] { }; }
bool lastPart = (charactersRead < BUFFER_SIZE);
if (lastPart)
{
char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
buffer = buffer2;
}
string s = new string(buffer);
string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
sresult[0] = remainder + sresult[0];
if (!lastPart)
{
remainder = sresult[sresult.Length - 1];
sresult[sresult.Length - 1] = "";
}
return sresult;
}
public bool EOS
{
get
{
return (null == sr) ? true: sr.EndOfStream;
}
}
}
Example of use:
class Program
{
static void Main(string[] args)
{
if (args.Length < 3)
{
Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
return;
}
if (!File.Exists(args[0]))
{
Console.WriteLine("source file not found");
return;
}
if (!File.Exists(args[1]))
{
Console.WriteLine("reference file not found");
return;
}
TextWriter tw = new StreamWriter(args[2], false);
string[] refLines = File.ReadAllLines(args[1]);
LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
int blockCounter = 0;
while (!lptr.EOS)
{
string[] srcLines = lptr.ReadBlock();
for (int i = 0; i < srcLines.Length; i += 1)
{
string theLine = srcLines[i];
if (!string.IsNullOrEmpty(theLine)) //can return empty lines sometimes
{
for (int j = 0; j < refLines.Length; j += 1)
{
if (theLine.Contains(refLines[j]))
{
tw.WriteLine(theLine);
break;
}
}
}
}
blockCounter += 1;
Console.WriteLine(String.Format("100 Mb blocks processed: {0}", blockCounter));
}
tw.Close();
}
}
I believe the string splitting and array handling can be significantly improved, yet the goal here was to minimize the number of disk reads.
I have a .csv file with 100,000 records with five columns in it. I am reading it line by line and storing it in a remote database.
Previously, I was following a performance-oriented approach. I was reading the .csv file line by line, and in the same transaction I was opening the connection to the database and closing it. This was causing a serious performance overhead.
Writing just 10,000 lines took one hour.
using (FileStream reader = File.OpenRead(@"C:\Data.csv"))
using (TextFieldParser parser = new TextFieldParser(reader))
{
parser.TrimWhiteSpace = true; // if you want
parser.Delimiters = new[] { " " };
parser.HasFieldsEnclosedInQuotes = true;
while (!parser.EndOfData)
{
//Open a connection to a database
//Write the data from the .csv file line by line
//Close the connection
}
}
Now I have changed the approach. For testing purposes, I have taken a .csv file with 10,000 lines, and after reading all 10,000 lines, I make one connection to the database and write the data there.
Now, the only issue is:
I want to read the first 10,000 lines and write them, then read the next 10,000 lines and write them, and so on,
using (FileStream reader = File.OpenRead(@"C:\Data.csv"))
using (TextFieldParser parser = new TextFieldParser(reader))
but the above two lines will read the entire file. I don't want to read it completely.
Is there any way to read the .csv file in chunks of 10,000 lines each?
Try the code below; it reads data from the CSV chunk by chunk:
IEnumerable<DataTable> GetFileData(string sourceFileFullName)
{
    int chunkRowCount = 0;
    DataTable chunkDataTable = null;
    using (var sr = new StreamReader(sourceFileFullName))
    {
        string line = null;
        // Read lines from the file until the end of the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            if (chunkDataTable == null)
            {
                chunkDataTable = new DataTable();
                chunkDataTable.Columns.Add("Line"); // or whatever columns you need
            }
            chunkDataTable.Rows.Add(line); // fill the DataTable however you need
            chunkRowCount++;
            if (chunkRowCount == 10000)
            {
                chunkRowCount = 0;
                yield return chunkDataTable;
                chunkDataTable = null;
            }
        }
    }
    // Return the last set of data, which holds fewer rows than the chunk size.
    if (chunkDataTable != null)
        yield return chunkDataTable;
}
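Each returned chunk can then be written to the database over a single connection, for example:
foreach (DataTable chunk in GetFileData(@"C:\Data.csv"))
{
    // Open one connection, write the whole 10,000-row chunk, close the connection.
}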
Suppose the following lines are in the text file that I have to read:
INFO 2014-03-31 00:26:57,829 332024549ms Service1 startmethod - FillPropertyColor end
INFO 2014-03-31 00:26:57,829 332024549ms Service1 getReports_Dataset - getReports_Dataset started
INFO 2014-03-31 00:26:57,829 332024549ms Service1 cheduledGeneration - SwitchScheduledGeneration start
INFO 2014-03-31 00:26:57,829 332024549ms Service1 cheduledGeneration - SwitchScheduledGeneration limitId, subscriptionId, limitPeriod, dtNextScheduledDate,shoplimittype0, 0, , 3/31/2014 12:26:57 AM,0
I use the FileStream method to read the text file because the file is over 1 GB in size. I have to read the file in chunks: initially, in the first run of the program, it should read two lines, i.e. up to "getReports_Dataset started" on the second line. In the next run it should read from the 3rd line. I wrote the code but am unable to get the desired output. The first problem is that my code doesn't give me the exact position from which to start reading in the next run. The second problem is that the lines read are not complete, i.e. some part of the lines is missing. Here's the code:
readPosition = getLastReadPosition();
using (FileStream fStream = new FileStream(logFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (System.IO.StreamReader rdr = new System.IO.StreamReader(fStream))
{
rdr.BaseStream.Seek(readPosition, SeekOrigin.Begin);
while (numCharCount > 0)
{
int numChars = rdr.ReadBlock(block, 0, block.Length);
string blockString = new string(block);
lines = blockString.Split(Convert.ToChar('\r'));
lines[0] = fragment + lines[0];
fragment = lines[lines.Length - 1];
foreach (string line in lines)
{
lstTextLog.Add(line);
if (lstTextLog.Contains(fragment))
{
lstTextLog.Remove(fragment);
}
numProcessedChar++;
}
numCharCount--;
}
SetLastPosition(numProcessedChar, logFilePath);
}
If you want to read a file line-by-line, do this:
foreach (string line in File.ReadLines("filename"))
{
// process line here
}
If you really must read a line and save the position, you need to save the last line number read, rather than the stream position. For example:
int lastLineRead = getLastLineRead();
string nextLine = File.ReadLines("filename").Skip(lastLineRead).FirstOrDefault();
if (nextLine != null)
{
lastLineRead++;
SetLastPosition(lastLineRead, logFilePath);
}
The reason you can't do it by saving the base stream position is because StreamReader reads a large buffer full of data from the base stream, which moves the file pointer forward by the buffer size. StreamReader then satisfies read requests from that buffer until it has to read the next buffer full. For example, say you open a StreamReader and ask for a single character. Assuming that it has a buffer size of 4 kilobytes, StreamReader does essentially this:
if (buffer is empty)
{
read buffer (4,096 bytes) from base stream
buffer_position = 0;
}
char c = buffer[buffer_position];
buffer_position++; // increment position for next read
return c;
Now, if you ask for the base stream's position, it's going to report that the position is at 4096, even though you've only read one character from the StreamReader.
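A small sketch that shows the effect; the exact number printed depends on the reader's internal buffer size and the file's length:
using (var fs = File.OpenRead("filename"))
using (var reader = new StreamReader(fs))
{
    reader.Read();                  // read a single character through the StreamReader
    Console.WriteLine(fs.Position); // reports however many bytes were buffered, not 1
}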
I have a text file that I want to read line by line and record the position in the text file as I go. After reading any line of the file the program can exit, and I need to resume reading the file at the next line when it resumes.
Here is some sample code:
using (FileStream fileStream = new FileStream("Sample.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
fileStream.Seek(GetLastPositionInFile(), SeekOrigin.Begin);
using (StreamReader streamReader = new StreamReader(fileStream))
{
while (!streamReader.EndOfStream)
{
string line = streamReader.ReadLine();
DoSomethingInteresting(line);
SaveLastPositionInFile(fileStream.Position);
if (CheckSomeCondition())
{
break;
}
}
}
}
When I run this code, the value of fileStream.Position does not change after reading each line; it only advances after reading a couple of lines. When it does change, it increases in multiples of 1024. Now I assume that there is some buffering going on under the covers, but how can I record the exact position in the file?
It's not FileStream that's responsible - it's StreamReader. It's reading 1K at a time for efficiency.
Keeping track of the effective position of the stream as far as the StreamReader is concerned is tricky... particularly as ReadLine will discard the line ending, so you can't accurately reconstruct the original data (it could have ended with "\n" or "\r\n"). It would be nice if StreamReader exposed something to make this easier (I'm pretty sure it could do so without too much difficulty) but I don't think there's anything in the current API to help you :(
By the way, I would suggest that instead of using EndOfStream, you keep reading until ReadLine returns null. It just feels simpler to me:
string line;
while ((line = reader.ReadLine()) != null)
{
// Process the line
}
I would agree with Stefan M.; it is probably the buffering which is causing the Position to be incorrect. If it is just the number of characters that you have read that you want to track, then I suggest you do it yourself, as in:
using(FileStream fileStream = new FileStream("Sample.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
fileStream.Seek(GetLastPositionInFile(), SeekOrigin.Begin);
Int32 position = 0;
using(StreamReader streamReader = new StreamReader(fileStream))
{
while(!streamReader.EndOfStream)
{
string line = streamReader.ReadLine();
position += line.Length;
DoSomethingInteresting(line);
SaveLastPositionInFile(position);
if(CheckSomeCondition())
{
break;
}
}
}
}
Provided that your file is not too big, why not read the whole thing in big chunks and then manipulate the string - probably faster than the stop-and-go I/O.
For example,
//load entire file
StringBuilder sbFileContents = new StringBuilder();
char[] acBuffer = new char[32768];
using (StreamReader srFile = new StreamReader(strFileName))
{
    int charsRead;
    while ((charsRead = srFile.ReadBlock(acBuffer, 0, acBuffer.Length)) > 0)
    {
        // Append only the characters actually read, so the final, partially
        // filled buffer does not add stale characters to the result.
        sbFileContents.Append(acBuffer, 0, charsRead);
    }
}