I have a .csv file with 100 000 records and five columns. I am reading it line by line and storing it in a remote database.
Previously, my approach was not performance-oriented: I was reading the .csv file line by line, and for each line, in the same transaction, I opened a connection to the database, wrote the data, and closed the connection. That carried a serious performance overhead.
Writing just 10 000 lines took one hour.
using (FileStream reader = File.OpenRead(@"C:\Data.csv"))
using (TextFieldParser parser = new TextFieldParser(reader)) // TextFieldParser lives in Microsoft.VisualBasic.FileIO
{
    parser.TrimWhiteSpace = true; // if you want
    parser.Delimiters = new[] { " " };
    parser.HasFieldsEnclosedInQuotes = true;

    while (!parser.EndOfData)
    {
        // Open a connection to the database
        // Write the data from the .csv file line by line
        // Close the connection
    }
}
Now I have changed the approach. For testing purposes I took a .csv file with 10 000 lines, and after reading all 10 000 lines I make one connection to the database and write them there.
Now, the only issue is:
I want to read the first 10 000 lines and write them, then read the next 10 000 lines and write them, and so on,
using (FileStream reader = File.OpenRead(@"C:\Data.csv"))
using (TextFieldParser parser = new TextFieldParser(reader))
but the above two lines will read the entire file. I don't want to read it all in one go.
Is there any way to read the .csv file chunk by chunk, 10 000 lines at a time?
Try the code below; it reads the data from the csv chunk by chunk:
IEnumerable<DataTable> GetFileData(string sourceFileFullName)
{
    int chunkRowCount = 0;
    DataTable chunkDataTable = null; // create and fill this DataTable however you need

    using (var sr = new StreamReader(sourceFileFullName))
    {
        string line = null;
        // Read lines from the file until the end of the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            chunkRowCount++;
            // Code for creating the DataTable (if null) and adding this line as a row goes here.
            if (chunkRowCount == 10000)
            {
                chunkRowCount = 0;
                yield return chunkDataTable;
                chunkDataTable = null;
            }
        }
    }

    // Return the last set of data, which is smaller than the chunk size.
    if (chunkDataTable != null)
        yield return chunkDataTable;
}
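For illustration, here is a minimal usage sketch of how each chunk might be written to the database in a single round trip. It assumes SQL Server and a destination table named YourTable; neither is part of the original question, so adapt the connection string, table name and column mappings to your own schema.
// Usage sketch only: requires System.Data (DataTable) and System.Data.SqlClient (SqlConnection, SqlBulkCopy).
void WriteChunksToDatabase(string csvPath, string connectionString)
{
    foreach (DataTable chunk in GetFileData(csvPath))
    {
        using (var connection = new SqlConnection(connectionString))
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            connection.Open();
            bulkCopy.DestinationTableName = "YourTable"; // assumed table name
            bulkCopy.WriteToServer(chunk);               // one bulk insert per 10 000-row chunk
        }
    }
}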
I have a program where I need to read in multiple .csv files from a directory, take some information from each one, and then create one large .csv file. However, I'm having problems reading them in and I'm not sure why. I have this piece of code in my main method:
string sourceDirectory = @"sourceDirectory/test";
var csvFiles = Directory.EnumerateFiles(sourceDirectory, "*.csv", SearchOption.AllDirectories);
foreach (string currentFile in csvFiles)
{
    readFile(currentFile);
}
And then the following in my readFile method:
public static void readFile(string currentFile)
{
    StreamWriter writer = new StreamWriter(@"destinationFile.csv");
    StreamReader reader = new StreamReader(currentFile);
    while (**)
    {
        object[] array;
        array = new object[11];
        array[0] = info1;
        array[1] = info2;
        array[2] = info3;
        //........
        writer.WriteLine(string.Join(", ", array));
    }
    reader.DiscardBufferedData();
    writer.Close();
    reader.Close();
}
Without the while loop it only reads in one line of the file, understandably. I can't seem to understand what the while loop should contain, or even whether it should be there. If it were a .txt file I would simply put while ((line = reader.ReadLine()) != null). My code also never seems to read in more than one .csv file from the directory, even though there are 6 .csv files in there.
The only data I really need from them is a count of certain lines between dates (one of the .csv columns).
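As a hedged aside (not the asker's final code): a .csv is just a text file, so the same ReadLine loop applies. The sketch below also opens the output writer once in append mode, because new StreamWriter(path) truncates the file on every call, which would leave only the last input file's rows in the output. The field handling shown is a naive illustration.
// Sketch only: read each .csv line by line and append it to the combined output.
public static void readFile(string currentFile)
{
    using (StreamWriter writer = new StreamWriter(@"destinationFile.csv", append: true))
    using (StreamReader reader = new StreamReader(currentFile))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Naive split; fields containing quoted commas need a real CSV parser.
            string[] fields = line.Split(',');
            // ... pick out the columns you need, e.g. the date column to filter on ...
            writer.WriteLine(string.Join(", ", fields));
        }
    }
}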
I am trying to read a text file using a for loop that runs 100 times.
StreamReader reader = new StreamReader("client.txt");
for (int i = 0; i < 100; i++)
{
    reader.ReadLine();
}
Now this works fine if the text file has 100 lines, but not if it has, let's say, 700. So I want the loop to run 100 times but read "1%" of the file in each run. How would I do that?
If the file size is not too large, you can read the whole file at once:
string[] lines = File.ReadAllLines("client.txt");
or
string text = File.ReadAllText("client.txt");
Reading 1% at a time is a bit tricky; I'd go with reading line by line and reporting progress from the stream position:
var filename = "client.txt";
var info = new FileInfo(filename);
var text = new StringBuilder();
using (var stream = new FileStream(filename, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    while (!reader.EndOfStream)
    {
        text.AppendLine(reader.ReadLine());
        var progress = Convert.ToDouble(stream.Position) * 100 / info.Length;
        Console.WriteLine(progress);
    }
}
var result = text.ToString();
Note, though, that the progress will not be very accurate, because StreamReader.ReadLine (and equivalently ReadLineAsync) will often read more than just a single line: it reads into a buffer and then interprets that buffer. That is much more efficient than reading a single byte at a time, but it does mean that the stream will have advanced further than it strictly needs to.
I wrote a WinForms application that reads each line of a text file, does a search and replace on the line using RegEx, and then writes it back out to a new file. I chose the line-by-line method because some of the files are just too large to load into memory.
I am using the BackgroundWorker object so the UI can be updated with the progress of the job. Below is the code (with parts omitted for brevity) that handles the reading and then outputting of the lines in the file.
public void bgWorker_DoWork(object sender, DoWorkEventArgs e)
{
    // Details of obtaining file paths omitted for brevity
    int totalLineCount = File.ReadLines(inputFilePath).Count();

    using (StreamReader sr = new StreamReader(inputFilePath))
    {
        int currentLine = 0;
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            currentLine++;
            // Match and replace contents of the line
            // omitted for brevity

            if (currentLine % 100 == 0)
            {
                int percentComplete = (currentLine * 100 / totalLineCount);
                bgWorker.ReportProgress(percentComplete);
            }

            using (FileStream fs = new FileStream(outputFilePath, FileMode.Append, FileAccess.Write))
            using (StreamWriter sw = new StreamWriter(fs))
            {
                sw.WriteLine(line);
            }
        }
    }
}
Some of the files I am processing are very large (8 GB with 132 million rows). The process takes a very long time (a 2 GB file took about 9 hours to complete). It looks to be working at around 58 KB/sec. Is this expected or should the process be going faster?
Don't close and re-open the output file on every loop iteration; open the writer once, outside the read loop. This should improve performance, because the writer no longer has to open the file and seek to its end on every single iteration.
Also, File.ReadLines(inputFilePath).Count() makes you read your input file twice and could account for a big chunk of the time. Instead of a percentage based on line counts, calculate the percentage from the stream position.
public void bgWorker_DoWork(object sender, DoWorkEventArgs e)
{
    // Details of obtaining file paths omitted for brevity
    using (StreamWriter sw = new StreamWriter(outputFilePath, true)) // this constructor appends, the same operation as the FileStream version
    using (StreamReader sr = new StreamReader(inputFilePath))
    {
        int lastPercentage = 0;
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            // Match and replace contents of the line
            // omitted for brevity

            // Position and Length are longs, not ints, so we cast at the end.
            int currentPercentage = (int)(sr.BaseStream.Position * 100L / sr.BaseStream.Length);
            if (lastPercentage != currentPercentage)
            {
                bgWorker.ReportProgress(currentPercentage);
                lastPercentage = currentPercentage;
            }
            sw.WriteLine(line);
        }
    }
}
Other than that, you will need to show what the "Match and replace contents of the line" part you omitted for brevity does, as I would guess that is where your slowness comes from. Run a profiler on your code, see where it spends the most time, and focus your efforts there.
Follow this process:
Instantiate the reader and writer
Loop through the lines, doing the next two steps
In the loop, change the line
In the loop, write the changed line
Dispose of the reader and writer
This should be a LOT faster than instantiating the writer inside the line loop, as you currently do.
I was going to append a code sample, but it looks like someone else beat me to the punch - see Scott Chamberlain's answer.
Also remove the File.ReadLines(...).Count() call at the top, as it reads through the whole file just to get the number of lines.
Suppose the text file I have to read contains the following lines:
INFO 2014-03-31 00:26:57,829 332024549ms Service1 startmethod - FillPropertyColor end
INFO 2014-03-31 00:26:57,829 332024549ms Service1 getReports_Dataset - getReports_Dataset started
INFO 2014-03-31 00:26:57,829 332024549ms Service1 cheduledGeneration - SwitchScheduledGeneration start
INFO 2014-03-31 00:26:57,829 332024549ms Service1 cheduledGeneration - SwitchScheduledGeneration limitId, subscriptionId, limitPeriod, dtNextScheduledDate,shoplimittype0, 0, , 3/31/2014 12:26:57 AM,0
I use a FileStream to read the text file because the file is over 1 GB in size. I have to read the file in chunks: on the first run of the program it should read two lines, i.e. up to "getReports_Dataset started" at the end of the second line, and on the next run it should start from the 3rd line. I wrote the code but can't get the desired output. The first problem is that my code doesn't give me the exact position from which to start reading on the next run. The second problem is that the lines it reads are not complete, i.e. some part of each line is missing. Here is the code:
readPosition = getLastReadPosition();
using (FileStream fStream = new FileStream(logFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (System.IO.StreamReader rdr = new System.IO.StreamReader(fStream))
{
    rdr.BaseStream.Seek(readPosition, SeekOrigin.Begin);
    while (numCharCount > 0)
    {
        int numChars = rdr.ReadBlock(block, 0, block.Length);
        string blockString = new string(block);
        lines = blockString.Split(Convert.ToChar('\r'));
        lines[0] = fragment + lines[0];
        fragment = lines[lines.Length - 1];
        foreach (string line in lines)
        {
            lstTextLog.Add(line);
            if (lstTextLog.Contains(fragment))
            {
                lstTextLog.Remove(fragment);
            }
            numProcessedChar++;
        }
        numCharCount--;
    }
    SetLastPosition(numProcessedChar, logFilePath);
}
If you want to read a file line-by-line, do this:
foreach (string line in File.ReadLines("filename"))
{
    // process line here
}
If you really must read a line and save the position, you need to save the last line number read, rather than the stream position. For example:
int lastLineRead = getLastLineRead();
string nextLine = File.ReadLines("filename").Skip(lastLineRead).FirstOrDefault();
if (nextLine != null)
{
    lastLineRead++;
    SetLastPosition(lastLineRead, logFilePath);
}
The reason you can't do it by saving the base stream position is that StreamReader reads a large buffer of data from the base stream, which moves the file pointer forward by the buffer size. StreamReader then satisfies read requests from that buffer until it has to read the next buffer-full. For example, say you open a StreamReader and ask for a single character. Assuming it has a buffer size of 4 kilobytes, StreamReader does essentially this:
if (buffer is empty)
{
    read buffer (4,096 bytes) from base stream
    buffer_position = 0;
}
char c = buffer[buffer_position];
buffer_position++; // increment position for next read
return c;
Now, if you ask for the base stream's position, it's going to report that the position is at 4096, even though you've only read one character from the StreamReader.
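To see the effect concretely, here is a small stand-alone sketch (the file name sample.txt is assumed): after a single one-character read, the underlying stream position has already jumped ahead by roughly the reader's buffer size rather than by one byte.
using System;
using System.IO;

class BufferDemo
{
    static void Main()
    {
        using (var fs = File.OpenRead("sample.txt"))
        using (var reader = new StreamReader(fs))
        {
            int c = reader.Read();          // read just one character
            Console.WriteLine((char)c);
            Console.WriteLine(fs.Position); // typically the buffer size (e.g. 1024 or 4096), not 1
        }
    }
}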
I need help figuring out the fastest way to read through about 80 files, each with over 500,000 lines, and write them into one master file in which each input file's line becomes a column. The master file must be plain text that can be opened in a text editor like Notepad, not a Microsoft Office product, because those can't handle the number of lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array to hold every file's contents, and then, once all lines in all files have been read, write the master file. The issue with this solution is that Windows throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number; once all threads finish reading a line, combine those values, write them to the file, and repeat for every line in all files. The issue with this solution is that it is very, very slow.
Does anybody have a better solution for reading so many large files in a fast way?
The best way is to open each input file with its own StreamReader and the output file with a StreamWriter. Then loop through the readers, read a single line from each, and write it to the master file. This way you only load one line per file at a time, so there should be minimal memory pressure. I was able to combine 80 files of ~500,000 lines each in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class MainClass
{
    static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();

    public static void Main(string[] args)
    {
        var stopwatch = Stopwatch.StartNew();
        List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();
        try
        {
            using (StreamWriter writer = new StreamWriter("master.txt"))
            {
                string line = null;
                do
                {
                    for (int i = 0; i < readers.Count; i++)
                    {
                        if ((line = readers[i].ReadLine()) != null)
                        {
                            writer.Write(line);
                        }
                        if (i < readers.Count - 1)
                            writer.Write(",");
                    }
                    writer.WriteLine();
                } while (line != null);
            }
        }
        finally
        {
            foreach (var reader in readers)
            {
                reader.Close();
            }
        }
        Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
    }
}
I've assumed that all the input files have the same number of lines, but you should add logic to keep reading as long as at least one file has given you data.
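A hedged sketch of that extra logic, reusing the readers and writer from the example above: keep emitting rows until every reader is exhausted, with shorter files contributing empty fields.
// Sketch only: a replacement for the do/while loop above.
bool anyData;
do
{
    anyData = false;
    var fields = new List<string>();
    foreach (var reader in readers)
    {
        string line = reader.ReadLine();
        if (line != null)
            anyData = true;
        fields.Add(line ?? string.Empty);
    }
    if (anyData)
        writer.WriteLine(string.Join(",", fields));
} while (anyData);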
Memory-mapped files seem to be what is suitable for you: something that doesn't put pressure on your application's memory while still maintaining good performance for the IO operations.
Here is the complete documentation: Memory-Mapped Files
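As a minimal sketch of the idea (the input file name is assumed), each input file can be memory-mapped and read through a view stream, so the OS pages the data in on demand instead of the application buffering whole files:
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfSketch
{
    static void Main()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile("file1.txt", FileMode.Open))
        using (var viewStream = mmf.CreateViewStream())
        using (var reader = new StreamReader(viewStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Process the line, e.g. collect it as a column for the master file.
                // Note: the mapped view is rounded up to a page boundary, so trailing
                // '\0' characters can appear past the real end of the file.
            }
        }
    }
}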
If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];

Parallel.Invoke(
    () =>
    {
        ReadMyFile(file1, file1lines);
    },
    () =>
    {
        ReadMyFile(file2, file2lines);
    },
    () =>
    {
        ReadMyFile(file3, file3lines);
    }
);
Each ReadMyFile method should just use the following sample code which, according to these benchmarks, is the fastest way to read a text file:
int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
    while ((file1lines[x] = sr.ReadLine()) != null)
    {
        x += 1;
    }
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents of each string[] to the output as you desire.
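A hedged sketch of that writing step, reusing the array names from above and assuming the three arrays were sized to the same line count (unused null slots would still need to be skipped):
// Sketch only: joins the pre-loaded arrays row by row into the master file.
using (StreamWriter writer = new StreamWriter("master.txt"))
{
    for (int i = 0; i < file1lines.Length; i++)
    {
        writer.WriteLine(string.Join(",", file1lines[i], file2lines[i], file3lines[i]));
    }
}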
Have an array of open file handles. Loop through this array and read a line from each file into a string array. Then write the combined array to the master file as one line, appending a newline at the end.
This differs from your second approach in that it is single-threaded and doesn't read a specific line number, but always the next one.
Of course you need to handle the case where some files have fewer lines than others.