Split text file, fastest method - c#

Morning,
I'm trying to split a large text file (15,000,000 rows) using StreamReader/StreamWriter. Is there a quicker way?
I tested it with 130,000 rows and it took 2min 40sec which implies 15,000,000 rows will take approx 5hrs which seems a bit excessive.
//Perform split.
public void SplitFiles(int[] newFiles, string filePath, int processorCount)
{
using (StreamReader Reader = new StreamReader(filePath))
{
for (int i = 0; i < newFiles.Length; i++)
{
string extension = System.IO.Path.GetExtension(filePath);
string temp = filePath.Substring(0, filePath.Length - extension.Length)
+ i.ToString();
string FilePath = temp + extension;
if (!File.Exists(FilePath))
{
for (int x = 0; x < newFiles[i]; x++)
{
DataWriter(Reader.ReadLine(), FilePath);
}
}
else
{
return;
}
}
}
}
public void DataWriter(string rowData, string filePath)
{
bool appendData = true;
using (StreamWriter sr = new StreamWriter(filePath, appendData))
{
{
sr.WriteLine(rowData);
}
}
}
Thanks for your help.

You haven't made it very clear, but I'm assuming that the value of each element of the newFiles array is the number of lines to copy from the original into that file. Note that currently you don't detect the situation where there's either extra data at the end of the input file, or it's shorter than expected. I suspect you want something like this:
public void SplitFiles(int[] newFiles, string inputFile)
{
string baseName = Path.GetFileNameWithoutExtension(inputFile);
string extension = Path.GetExtension(inputFile);
using (TextReader reader = File.OpenText(inputFile))
{
for (int i = 0; i < newFiles.Length; i++)
{
string outputFile = baseName + i + extension;
if (File.Exists(outputFile))
{
// Better than silently returning, I'd suggest...
throw new IOException("File already exists: " + outputFile);
}
int linesToCopy = newFiles[i];
using (TextWriter writer = File.CreateText(outputFile))
{
for (int j = 0; j < linesToCopy; j++)
{
string line = reader.ReadLine();
if (line == null)
{
return; // Premature end of input
}
writer.WriteLine(line);
}
}
}
}
}
Note that this still won't detect if there's any unconsumed input... it's not clear what you want to do in that situation.
One option for code clarity is to extract the middle of this into a separate method:
public void SplitFiles(int[] newFiles, string inputFile)
{
string baseName = Path.GetFileNameWithoutExtension(inputFile);
string extension = Path.GetExtension(inputFile);
using (TextReader reader = File.OpenText(inputFile))
{
for (int i = 0; i < newFiles.Length; i++)
{
string outputFile = baseName + i + extension;
// Could put this into the CopyLines method if you wanted
if (File.Exists(outputFile))
{
// Better than silently returning, I'd suggest...
throw new IOException("File already exists: " + outputFile);
}
CopyLines(reader, outputFile, newFiles[i]);
}
}
}
private static void CopyLines(TextReader reader, string outputFile, int count)
{
using (TextWriter writer = File.CreateText(outputFile))
{
for (int i = 0; i < count; i++)
{
string line = reader.ReadLine();
if (line == null)
{
return; // Premature end of input
}
writer.WriteLine(line);
}
}
}
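If you do want to detect both problem cases (input that runs out early, and input left over after the last file), one possible approach is sketched below. The class name SplitChecks, the CopyLines overload that takes a TextWriter and returns a count, and HasUnconsumedInput are all hypothetical names made up for this illustration, not part of the code above.

```csharp
using System;
using System.IO;

public static class SplitChecks
{
    // Copies up to 'count' lines from reader to writer.
    // Returns how many lines were actually copied, so the caller can
    // compare it with 'count' and spot a premature end of input.
    public static int CopyLines(TextReader reader, TextWriter writer, int count)
    {
        int copied = 0;
        string line;
        while (copied < count && (line = reader.ReadLine()) != null)
        {
            writer.WriteLine(line);
            copied++;
        }
        return copied;
    }

    // Call this after the last output file has been written.
    // Note that it consumes one line if any input is left.
    public static bool HasUnconsumedInput(TextReader reader)
    {
        return reader.ReadLine() != null;
    }
}
```

The caller could throw, log, or write the remainder to an extra file when either check fails; which of those is right depends on what the split is for.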

There are utilities for splitting files that may outperform your solution - e.g. search for "split file by line".
If they don't suit, there are solutions for loading all the source file into memory and then writing out the files but that probably isn't appropriate given the size of the source file.
In terms of improving your code, a minor improvement would be the generation of the destination file path (and clearing up the confusion between the source filePath and the destination files). You don't need to re-establish the source file extension on every iteration of your loop.
The second, and probably more significant, improvement (as highlighted by commenters) is in how you write out the destination files. As I understand it, each newFiles entry specifies how many source lines should go into the corresponding destination file. So for each entry I'd suggest reading all the source lines that belong to the next destination file and then writing that file once, rather than reopening the destination file for every line. You could gather the lines in a StringBuilder, List, etc.; alternatively, write them directly to the destination file, but open it only once.
public void SplitFiles(int[] newFiles, string sourceFilePath, int processorCount)
{
string sourceDirectory = System.IO.Path.GetDirectoryName(sourceFilePath);
string sourceFileName = System.IO.Path.GetFileNameWithoutExtension(sourceFilePath);
string extension = System.IO.Path.GetExtension(sourceFilePath);
using (StreamReader Reader = new StreamReader(sourceFilePath))
{
for (int i = 0; i < newFiles.Length; i++)
{
string destinationFileNameWithExtension = string.Format("{0}{1}{2}", sourceFileName, i, extension);
string destinationFilePath = System.IO.Path.Combine(sourceDirectory, destinationFileNameWithExtension);
if (!File.Exists(destinationFilePath))
{
// Read all the lines relevant to this destination file
// and temporarily store them in memory
StringBuilder destinationText = new StringBuilder();
for (int x = 0; x < newFiles[i]; x++)
{
// AppendLine keeps the line break that ReadLine strips off
destinationText.AppendLine(Reader.ReadLine());
}
DataWriter(destinationFilePath, destinationText.ToString());
}
else
{
return;
}
}
}
}
private static void DataWriter(string destinationFilePath, string content)
{
using (StreamWriter sr = new StreamWriter(destinationFilePath))
{
{
sr.Write(content);
}
}
}

I've recently had to do this for several hundred files under 2 GB each (up to 1.92 GB), and the fastest method I found (if you have the memory available) is StringBuilder. All the other methods I tried were painfully slow.
Please note that this is memory dependent. Adjust the "CurrentPosition == 130000" check accordingly.
string CurrentLine = String.Empty;
int CurrentPosition = 0;
int CurrentSplit = 0;
foreach (string file in Directory.GetFiles(@"C:\FilesToSplit"))
{
StringBuilder sb = new StringBuilder();
using (StreamReader sr = new StreamReader(file))
{
while ((CurrentLine = sr.ReadLine()) != null)
{
if (CurrentPosition == 130000) // Or whatever you want to split by.
{
// Path.GetExtension already includes the leading dot.
using (StreamWriter sw = new StreamWriter(@"C:\FilesToSplit\SplitFiles\" + Path.GetFileNameWithoutExtension(file) + "-" + CurrentSplit + Path.GetExtension(file)))
{
// Append this line too, so we don't lose it. AppendLine keeps the line break that ReadLine strips off.
sb.AppendLine(CurrentLine);
// Write the StringBuilder contents
sw.Write(sb.ToString());
// Clear the StringBuilder buffer, so it doesn't get too big. You can adjust this based on your computer's available memory.
sb.Clear();
// Increment the CurrentSplit number.
CurrentSplit++;
// Reset the current line position. We've found 130,001 lines of text.
CurrentPosition = 0;
}
}
else
{
sb.AppendLine(CurrentLine);
CurrentPosition++;
}
}
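}
// Write any lines left over after the last full chunk, otherwise the tail of the file is lost.
if (sb.Length > 0)
{
File.WriteAllText(@"C:\FilesToSplit\SplitFiles\" + Path.GetFileNameWithoutExtension(file) + "-" + CurrentSplit + Path.GetExtension(file), sb.ToString());
}
// Reset the integers at the end of each file check, otherwise it can quickly go out of order.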
CurrentPosition = 0;
CurrentSplit = 0;
}

Related

How to count lines

How do I count the lines in each log file and create a new log file from the count?
Below is my log file :
DDD.CGLOG
ID|AFP|DATE|FOLDER
1|DDD|20181204|B
2|DDD|20181104|B
3|DDD|20181004|B
FFF.CGLOG
ID|AFP|DATE|FOLDER
1|FFF|20181204|B
2|FFF|20181104|B
WWW.CGLOG
ID|AFP|DATE|FOLDER
1|WWW|20181204|B
I want to count the lines and create a new log file for each, as below:
DDD_QTY.Log
AFP|QTY
DDD|3
FFF_QTY.Log
AFP|QTY
FFF|2
WWW_QTY.Log
AFP|QTY
WWW|1
Below is what I have tried. I have managed to get the count from each log file inside the folder; now I just need to write the count into a new log file named after the existing log file.
string[] ori_Files = Directory.GetFiles(@"F:\Work\FLP Code\test", "*.CGLOG*", SearchOption.TopDirectoryOnly);
foreach (var file in ori_Files)
{
using (StreamReader file1 = new StreamReader(file))
{
string line;
int count = 0;
while ((line = file1.ReadLine()) != null)
{
Console.WriteLine(line);
count++;
}
Console.WriteLine(count);
}
}
Console.ReadLine();
Since you only want to count lines, you can keep it simple, assuming the file name dictates the AFP value:
static long CountLinesInFile(string fileName,string outputfile)
{
var afp = Path.GetFileNameWithoutExtension(fileName);
var lineCount = File.ReadAllLines(fileName).Length;
File.WriteAllText(outputfile,$"AFP|QTY{Environment.NewLine}{afp}|{lineCount -1}");
return lineCount-1;
}
Please note that this counts one line fewer than the file contains (the header is not counted, as in your example). In case the file name differs from the AFP term, you can use a regex to parse the AFP term from any line other than the header. Example regex for parsing the AFP term:
new Regex(@"^[0-9]+\|(?<AFP>[a-zA-Z]+)\|[0-9]+\|[a-zA-Z]+$")
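As a rough illustration of how that pattern could be applied to a single row, here is a minimal sketch. AfpParser and TryGetAfp are hypothetical names chosen for this example; only the pattern itself comes from the answer.

```csharp
using System.Text.RegularExpressions;

public static class AfpParser
{
    // Pattern from the answer: ID | AFP | DATE | FOLDER, with the AFP term captured by name.
    private static readonly Regex RowPattern =
        new Regex(@"^[0-9]+\|(?<AFP>[a-zA-Z]+)\|[0-9]+\|[a-zA-Z]+$");

    // Returns the AFP term of a data row, or null for the header
    // (which does not start with a digit) or any other non-matching line.
    public static string TryGetAfp(string line)
    {
        Match match = RowPattern.Match(line);
        return match.Success ? match.Groups["AFP"].Value : null;
    }
}
```

Calling AfpParser.TryGetAfp on the first data row of a file would then give the value to write into the AFP column of the output.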
Update
In case your file is pretty large (say 15-20 GB, considering it is a log file), a better approach would be to count the newline bytes while streaming the file:
static long CountLinesInFile(string fileName,string outputFileName)
{
var afp = Path.GetFileNameWithoutExtension(fileName);
uint count = 0;
int query = (int)Convert.ToByte('\n');
using (var stream = File.OpenRead(fileName))
{
int current;
do
{
current = stream.ReadByte();
if (current == query)
{
count++;
continue;
}
} while (current!= -1);
}
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outputFileName, true))
{
file.WriteLine($"AFP|QTY{Environment.NewLine}{afp}|{count}");
}
return count;
}
Update 2
To invoke the method for all files in a given folder, you can make use of DirectoryInfo.GetFiles, for example:
DirectoryInfo d = new DirectoryInfo(@"E:\TestFolder");
FileInfo[] Files = d.GetFiles("*.txt");
foreach(FileInfo file in Files )
{
CountLinesInFile(file.FullName,$"{file.FullName}.processed");
}
A simple two-liner (needs using System.Linq for Count()):
static void CountLines(string path, string outfile)
{
var count = File.ReadLines(path).Count();
File.WriteAllText(outfile, $"AFP|QTY{Environment.NewLine}DDD|{count}");
}

create lines with strings for each blank line until it reaches line 100

I'm new to C# and need help with a file that currently has 7 lines of text. I need to write "Line PlaceHolder" after those 7 lines until the file reaches line 100. This is what I have so far, and I know it's a failed attempt. EDIT: It's good, but the only issue is that an exception is thrown saying a process is already using the text file. How do I solve this so I can read and write the file at the same time?
public void ReadFile()
{
if (File.Exists(AccountsFile))
{
using (StreamReader Reader = new StreamReader(AccountsFile))
using (StreamWriter Writer = new StreamWriter((AccountsFile)))
{
for (int i = 0; i < 100; i++)
{
string line;
if ((line = Reader.ReadLine()) == null)
{
Writer.WriteLine("Line Placeholder");
}
}
}
}
else
{
File.Create(AccountsFile);
}
}
You could first read the file contents into an array using File.ReadAllLines, get the array .Length (representing the number of lines in the file), and subtract that number from 100 to see how many lines you need to write. If the number is greater than zero, then create a List<string> with that many empty lines and write those lines to the end of the file using File.AppendAllLines:
// See how many lines we need to add
var newLinesNeeded = 100 - File.ReadAllLines(AccountsFile).Length;
// Add them if needed
if (newLinesNeeded > 0)
{
// Create a list of empty lines
var blankLines = new List<string>();
for(int i = 0; i < newLinesNeeded; i++)
{
blankLines.Add("");
}
// Append them to our file
File.AppendAllLines(AccountsFile, blankLines);
}
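The same idea can be written more compactly with Enumerable.Repeat, using the placeholder text from the question instead of empty lines. This is only a sketch: FilePadder and PadToLineCount are hypothetical names, and it assumes the existing file ends with a line break (otherwise the first placeholder is appended to the last existing line).

```csharp
using System.IO;
using System.Linq;

public static class FilePadder
{
    // Appends 'placeholder' lines until the file has at least 'targetLineCount' lines.
    // Returns the number of lines that were added.
    public static int PadToLineCount(string path, int targetLineCount, string placeholder)
    {
        // File.ReadLines streams the file, and the reader is closed
        // once Count() has finished enumerating.
        int existing = File.Exists(path) ? File.ReadLines(path).Count() : 0;
        int needed = targetLineCount - existing;
        if (needed <= 0)
        {
            return 0;
        }
        // AppendAllLines creates the file if it does not exist yet.
        File.AppendAllLines(path, Enumerable.Repeat(placeholder, needed));
        return needed;
    }
}
```

Because the file is read and closed before it is opened for appending, this avoids the "file is being used by another process" exception from the question. A call might look like FilePadder.PadToLineCount(AccountsFile, 100, "Line Placeholder").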
Looks like you are just missing an else:
public void ReadFile()
{
if (File.Exists(AccountsFile))
{
using (StreamReader Reader = new StreamReader(AccountsFile))
using (StreamWriter Writer = new StreamWriter((AccountsFile)))
{
for (int i = 0; i < 100; i++)
{
string line;
if ((line = Reader.ReadLine()) == null)
{
Writer.WriteLine("Line Placeholder");
}
else
Writer.WriteLine(line);
}
}
}
else
{
File.Create(AccountsFile);
}
}
This may work if you do not mind opening the file for both reading and writing:
using (FileStream fileStream = File.Open(AccountsFile, FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
var streamWriter = new StreamWriter(fileStream);
var streamReader = new StreamReader(fileStream);
var i = 0;
// read and count the lines
while (streamReader.ReadLine() != null){
i++;
}
// if any more lines are needed write them
while (i++ < 100)
{
streamWriter.WriteLine("Line Placeholder");
}
streamWriter.Flush();
}

Is there a more efficient way of reading and writing a text file at the same time?

I'm back at it again with another question, this time with regard to editing text files. My homework is as follows:
Write a program that reads the contents of a text file and inserts the line numbers at the beginning of each line, then rewrites the file contents.
This is what I have so far, though I am not so sure if this is the most efficient way of doing it. I've only started learning on handling text files at the moment.
static void Main(string[] args)
{
string fileName = @"C:\Users\Nate\Documents\Visual Studio 2015\Projects\Chapter 15\Chapter 15 Question 3\Chapter 15 Question 3\TextFile1.txt";
StreamReader reader = new StreamReader(fileName);
int lineCounter = 0;
List<string> list = new List<string>();
using (reader)
{
string line = reader.ReadLine();
while (line != null)
{
list.Add("line " + (lineCounter + 1) + ": " + line);
line = reader.ReadLine();
lineCounter++;
}
}
StreamWriter writer = new StreamWriter(fileName);
using (writer)
{
foreach (string line in list)
{
writer.WriteLine(line);
}
}
}
your help would be appreciated!
thanks once again. :]
This should be enough (in case the file is relatively small):
using System.IO;
(...)
static void Main(string[] args)
{
string fileName = @"C:\Users\Nate\Documents\Visual Studio 2015\Projects\Chapter 15\Chapter 15 Question 3\Chapter 15 Question 3\TextFile1.txt";
string[] lines = File.ReadAllLines(fileName);
for (int i = 0; i< lines.Length; i++)
{
lines[i] = string.Format("{0} {1}", i + 1, lines[i]);
}
File.WriteAllLines(fileName, lines);
}
I suggest using LINQ, with File.ReadLines to read the content.
// Read all lines and apply format
var formattedLines = File
.ReadLines("filepath") // read lines lazily
.Select((line, i) => string.Format("line {0}: {1}", i + 1, line)); // format each line
// Write the formatted lines to a new file. Writing to the input path here would fail,
// because ReadLines still has the input file open while the lines are being written.
File.WriteAllLines("outputfilepath", formattedLines);
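If the numbered lines have to end up in the original file, one way to keep the streaming behaviour is to write to a temporary file first and then copy it over the original. The sketch below assumes a temporary file name of path + ".tmp", which is an arbitrary choice for illustration; LineNumberer and NumberLinesInPlace are hypothetical names as well.

```csharp
using System.IO;
using System.Linq;

public static class LineNumberer
{
    // Prefixes every line of the file with "line N: " without holding the whole file in memory.
    public static void NumberLinesInPlace(string path)
    {
        string tempPath = path + ".tmp"; // assumed temp location next to the original

        // File.ReadLines is lazy: lines are read one at a time while WriteAllLines consumes them.
        var numbered = File.ReadLines(path)
            .Select((line, i) => string.Format("line {0}: {1}", i + 1, line));
        File.WriteAllLines(tempPath, numbered);

        // The reader is closed once enumeration has finished, so the original can be overwritten.
        File.Copy(tempPath, path, true);
        File.Delete(tempPath);
    }
}
```

For a small homework file, File.ReadAllLines followed by File.WriteAllLines is simpler; the temporary file only pays off when the input is too large to load comfortably.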
Just one loop here. I think it will be efficient.
class Program
{
public static void Main()
{
string path = Path.Combine(Directory.GetCurrentDirectory(), "MyText.txt");
string s;
int counter = 1;
StringBuilder sb = new StringBuilder();
using (StreamReader sr1 = File.OpenText(path))
{
while ((s = sr1.ReadLine()) != null)
{
var lineOutput = counter++ + " " + s;
Console.WriteLine(lineOutput);
// AppendLine keeps each numbered line on its own line
sb.AppendLine(lineOutput);
}
}
Console.WriteLine();
// Overwrite the file; File.AppendText would add the numbered copy after the original text
File.WriteAllText(path, sb.ToString());
}
}

C# Taking a listbox with many values, and dividing it up into multiple text files

I am having the hardest time figuring out how to do this.
I have a listbox with a lot of data in it. I want to take this listbox and then have a button to save it.
The button will choose the directory to put the files in. Afterwards, the program should start saving these values into a text file with the naming schema Seed1.txt, Seed2.txt, etc.
The thing is, I would like to put only 100 items into each text file that is generated until the list is done.
For saving the path I have:
Stream s;
string folderPath = string.Empty;
using (FolderBrowserDialog fdb = new FolderBrowserDialog())
{
if (fdb.ShowDialog() == DialogResult.OK)
{
folderPath = fdb.SelectedPath;
MessageBox.Show(folderPath);
}
For saving everything in one shot, I believe this will work:
int total = list_failed.Items.Count;
for (int i = 0; i < list_failed.Items.Count; i++)
{
StreamWriter text = new StreamWriter(s);
text.Write(list_failed.Items[i]);
s.Close();
I'm not sure about the rest though. Something like this for the filenames perhaps
string filename;
int i = 0;
do
{
filename = "Seed" + ++i + ".txt";
} while (files.Contains(filename));
Here's a working example that you can use.
string pathname = Server.MapPath("/");
int counter = 1;
string file = String.Empty;
List<string> list = new List<string>();
//Add the list items
for (int i = 0; i <= 1234; i++)
{
list.Add(String.Format("item {0}", i));
}
//write to file
for (int i = 0; i < list.Count; i++)
{
//items 0-99 go to Seed1.txt, items 100-199 to Seed2.txt, and so on
counter = i / 100 + 1;
//generate a dynamic filename with path
file = String.Format("{0}Seed{1}.txt", pathname, counter);
//the using statement closes the streamwriter when it completes the process
using (StreamWriter text = new StreamWriter(file, true))
{
//write the item on its own line
text.WriteLine(list[i]);
}
}
string folderPath;
const int ITEMS_PER_FILE=100;
void AskUserForFolder()
{
folderPath = string.Empty;
using (FolderBrowserDialog fdb = new FolderBrowserDialog())
{
if (fdb.ShowDialog() == DialogResult.OK)
{
folderPath = fdb.SelectedPath;
// MessageBox.Show(folderPath);
}
}
}
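void SaveItems(ListBox listBox, int seed)
{
int total = listBox.Items.Count;
// Round up so that a final, partially filled file is written too
int fileTotal = (total + ITEMS_PER_FILE - 1) / ITEMS_PER_FILE;
for (int fileCount = 0; fileCount < fileTotal; ++fileCount)
{
// GetFilePath already returns the full path, so folderPath is not prepended again
using (StreamWriter sw = new StreamWriter(GetFilePath(folderPath, "Seed", "txt", ref seed)))
{
int start = ITEMS_PER_FILE * fileCount;
int end = Math.Min(start + ITEMS_PER_FILE, total);
for (int i = start; i < end; i++)
{
sw.WriteLine(listBox.Items[i]);
}
}
}
}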
//I'm not sure about the rest though. Something like this for the filenames perhaps
/// <summary>
/// Gets a filename that has not been used before by incrementing a number at the end of the filename
/// </summary>
/// <param name="seed">seed is passed in by reference and acts as a starting point to iterate through the list.
/// By passing it in as a reference we can save ourselves from having to iterate unnecessarily from the start each time
/// </param>
/// <returns>the path of the file</returns>
string GetFilePath(string folderpath, string fileName,string extension,ref int seed)
{
FileInfo fi = new FileInfo(string.Format("{0}\\{1}{2}.{3}", folderPath, fileName, seed,extension));
while (fi.Exists)
{
fi = new FileInfo(string.Format("{0}\\{1}{2}.{3}", folderPath, fileName, ++seed,extension));
}
return fi.FullName;
}
Try this to iterate over ListBox items and put them in files with up to 100 items:
private void writeItemsToFile(ListBox lb)
{
string path = @"c:\test\";
string filename = "seed";
int itemCounter = 0;
int fileCounter = 1;
StreamWriter sw = new StreamWriter(File.OpenWrite(System.IO.Path.Combine(path,string.Format(filename+"{0}.txt",fileCounter))));
foreach (var s in lb.Items)
{
if (itemCounter >= 100) // start a new file once 100 items have been written
{
fileCounter++;
itemCounter = 0;
sw.Flush();
sw.Close();
sw.Dispose();
sw = null;
sw = new StreamWriter(File.OpenWrite(System.IO.Path.Combine(path,string.Format(filename+"{0}.txt",fileCounter))));
}
sw.WriteLine(s.ToString());
itemCounter++;
}
if (sw != null)
{
sw.Flush();
sw.Dispose();
}
}
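For comparison, the chunking can also be expressed without keeping a writer open across iterations. The following is a sketch under a few assumptions: the items have already been copied out of the ListBox into a list of strings, the files are named Seed1.txt, Seed2.txt and so on as in the question, and any existing files with those names may be overwritten. ChunkWriter and WriteInChunks are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ChunkWriter
{
    // Writes 'items' into Seed1.txt, Seed2.txt, ... inside 'folder',
    // with at most 'itemsPerFile' lines per file. Returns the number of files written.
    public static int WriteInChunks(IList<string> items, string folder, int itemsPerFile)
    {
        int fileCount = 0;
        for (int start = 0; start < items.Count; start += itemsPerFile)
        {
            fileCount++;
            string path = Path.Combine(folder, "Seed" + fileCount + ".txt");
            // The last chunk may be shorter than itemsPerFile.
            int length = Math.Min(itemsPerFile, items.Count - start);
            File.WriteAllLines(path, items.Skip(start).Take(length));
        }
        return fileCount;
    }
}
```

In a WinForms handler the list could be built with something like list_failed.Items.Cast<object>().Select(o => o.ToString()).ToList() and passed in together with the folder chosen in the FolderBrowserDialog.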

Reading a Line from text file and return back

I am developing a C# application in which I need to read a line from a text file and then return to the start of that line.
As the files may be very large, I can't copy them into an array.
I tried this code:
StreamReader str1 = new StreamReader(@"c:\file1.txt");
StreamReader str2 = new StreamReader(@"c:\file2.txt");
int a, b;
long pos1, pos2;
while (!str1.EndOfStream && !str2.EndOfStream)
{
pos1 = str1.BaseStream.Position;
pos2 = str2.BaseStream.Position;
a = Int32.Parse(str1.ReadLine());
b = Int32.Parse(str2.ReadLine());
if (a <= b)
{
Console.WriteLine("File1 ---> " + a.ToString());
str2.BaseStream.Seek(pos2, SeekOrigin.Begin);
}
else
{
Console.WriteLine("File2 ---> " + b.ToString());
str1.BaseStream.Seek(pos1, SeekOrigin.Begin);
}
}
When I debugged the program I found that str1.BaseStream.Position and str2.BaseStream.Position are the same on every iteration, so nothing changes.
Is there a better way?
Thanks
You can use File.ReadLines for large files; it uses deferred execution and does not load the whole file into memory, so you can work with the lines as an IEnumerable<string>:
var lines = File.ReadLines("path");
If you are on an older .NET version that lacks File.ReadLines, below is how to build it yourself:
public IEnumerable<string> ReadLine(string path)
{
using (var streamReader = new StreamReader(path))
{
string line;
while((line = streamReader.ReadLine()) != null)
{
yield return line;
}
}
}
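With lazily read lines there is no need to seek back at all: keep the current line of each file and advance only the file whose value was just printed. Seeking on BaseStream does not help here because StreamReader reads ahead into its own buffer, so BaseStream.Position does not track the line you are logically on. The sketch below assumes both inputs hold one integer per line in ascending order; SortedMerge and Merge are hypothetical names, and like the loop in the question it stops as soon as either input runs out.

```csharp
using System.Collections.Generic;

public static class SortedMerge
{
    // Walks two sequences of integer lines in step, always emitting the smaller current value.
    public static IEnumerable<string> Merge(IEnumerable<string> first, IEnumerable<string> second)
    {
        using (IEnumerator<string> e1 = first.GetEnumerator())
        using (IEnumerator<string> e2 = second.GetEnumerator())
        {
            bool has1 = e1.MoveNext();
            bool has2 = e2.MoveNext();
            while (has1 && has2)
            {
                int a = int.Parse(e1.Current);
                int b = int.Parse(e2.Current);
                if (a <= b)
                {
                    yield return "File1 ---> " + a;
                    has1 = e1.MoveNext(); // only the first input advances
                }
                else
                {
                    yield return "File2 ---> " + b;
                    has2 = e2.MoveNext(); // only the second input advances
                }
            }
        }
    }
}
```

It could be driven with SortedMerge.Merge(File.ReadLines(@"c:\file1.txt"), File.ReadLines(@"c:\file2.txt")) and a foreach that prints each result.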
Another way, which I prefer to use:
Create a function like this:
string ReadLine( Stream sr,bool goToNext)
{
if (sr.Position >= sr.Length)
return string.Empty;
char readKey;
StringBuilder strb = new StringBuilder();
long position = sr.Position;
do
{
readKey = (char)sr.ReadByte();
strb.Append(readKey);
}
while (readKey != (char)ConsoleKey.Enter && sr.Position<sr.Length);
if(!goToNext)
sr.Position = position;
return strb.ToString();
}
Then create a stream from the file to pass as its argument:
Stream stream = File.Open("C:\\1.txt", FileMode.Open);
