How to efficiently cross reference 2 text files? | Improve my code - c#

Below is an outline of what my code does:
Read TextFileA which has 150k lines.
Read TextFileB which has 150k lines and is a cross reference list for TextFileA.
.Split both text files and match specified elements.
Finally, output a 3rd text file which will contain values from both TextFileA and TextFileB.
The code below runs well until about 13,000 lines in, and then the program becomes exceedingly slow.
Could someone explain why the program becomes exponentially slower and how I could improve on this code? Thanks.
private void BT_Xref_Click(object sender, EventArgs e)
{
//grabs file path from text box
string ManifestPath = TB_Manifest.Text;
//grabs parent directory from file path
string directoryName = Path.GetDirectoryName(ManifestPath);
//creates a new folder for the final output text file
string pathString = Path.Combine(directoryName, "Final Index");
Directory.CreateDirectory(pathString);
//list for matching text lines which will eventually be output to the final text file
List<string> NewData = new List<string>();
//initializes StreamReader for the first text file
StreamReader ManifestReader = new StreamReader(ManifestPath);
String[] ManifestArray = File.ReadAllLines(ManifestPath);
List<string> RemoveManifest = new List<string>(ManifestArray);
//initializes StreamReader for the second text file
StreamReader OutputReader = new StreamReader(TB_Complete.Text);
String[] OutputArray = File.ReadAllLines(TB_Complete.Text);
List<string> RemoveOutput = new List<string>(OutputArray);
//initializes a count which decides at what point a text file should be created
int shortcount = 0;
//.ReadLine is initialized to ignore the first line in both text files
string ManifestLine = ManifestReader.ReadLine();
string OutputLine = OutputReader.ReadLine();
foreach (string mfile in ManifestArray)
{
ManifestLine = ManifestReader.ReadLine();
string ManifestElement = ManifestLine.Split(',')[6];
string ManifestElement2 = ManifestLine.Split(',')[5];
//value to be retreived and output to final text file
string ManifestElementDate = ManifestElement2.Replace("/", "-");
//value to be compared with the other text file
string ManifestNoExt = Regex.Replace(ManifestElement, ("(\\.\\w+$)"),"");
//resets OutpuReader reader to ensure no lines are being skipped
OutputReader.BaseStream.Position = 0;
//counting the mfile position in the ManifestArray
//int removeIndex = Array.IndexOf(ManifestArray, mfile);
//remove by resising the array
//Array.Resize(ref ManifestArray, ManifestArray.Length - 1);
foreach (string ofile in OutputArray)
{
OutputLine = OutputReader.ReadLine();
//value to be comapred with other text file
string OutputElement = OutputLine.Split('|')[2];
//if values equal then add the specified line of text to the list.
if (ManifestNoExt.Equals(OutputElement))
{
NewData.Add(OutputLine + "|" + ManifestElementDate);
RemoveManifest.RemoveAll(item => item == ManifestLine);
if (NewData.Count == 1000)
{
//if youve reached the count then output files into a new text file
shortcount = shortcount + 1;
File.WriteAllLines(pathString + "\\test" + shortcount + ".txt", NewData);
NewData.Clear();
}
break;
}
}
}
//once all line of text have been searched combine all text files in directory
shortcount = shortcount + 1;
File.WriteAllLines(pathString + "\\test" + shortcount + ".txt", NewData);
String[] SplitTextFiles = Directory.GetFiles(pathString, "*.*", SearchOption.AllDirectories);
using (var FinalIndexFile = File.Create(pathString + "\\FinalIndex.txt"))
{
foreach (var file in SplitTextFiles)
{
using (var input = File.OpenRead(file))
{
input.CopyTo(FinalIndexFile);
}
File.Delete(file);
}
}
//File.WriteAllLines("\\test.txt", Directory.EnumerateFiles(pathString, @"*.txt").SelectMany(file => File.ReadLines(file)));
}

You have an O(nm) algorithm here, and assuming that n and m are the same, it's actually O(n^2). That's not so good and is why it's slowing to a crawl (for 150k rows in each file, you are looking at up to 22,500,000,000 iterations of the inner loop). I'm not entirely certain what your code is trying to do, but based on the condition if (ManifestNoExt.Equals(OutputElement)), I think you can reduce the complexity drastically as follows:
Read in TextFileA and store its values in a Dictionary with ManifestNoExt as the key and mfile as the value.
Next read in TextFileB and iterate over all rows in B and do a lookup in the dictionary that was constructed.
This will give you an algorithm that is O(n) + O(m), which will be fast.
Also, I am not sure why you are reading in the entire files and then reading them again inside the loops (the contents of ManifestArray and OutputArray are the same as the files). That is certainly a cause of the slowdown as well, since you end up hammering the file system.
A completely untested version of this idea:
private void BT_Xref_Click(object sender, EventArgs e)
{
//grabs file path from text box
string ManifestPath = TB_Manifest.Text;
//grabs parent directory from file path
string directoryName = Path.GetDirectoryName(ManifestPath);
//creates a new folder for the final output text file
string pathString = Path.Combine(directoryName, "Final Index");
Directory.CreateDirectory(pathString);
//list for matching text lines which will eventually be output to the final text file
List<string> NewData = new List<string>();
String[] ManifestArray = File.ReadAllLines(ManifestPath);
List<string> RemoveManifest = new List<string>(ManifestArray);
String[] OutputArray = File.ReadAllLines(TB_Complete.Text);
List<string> RemoveOutput = new List<string>(OutputArray);
//initializes a count which decides at what point a text file should be created
int shortcount = 0;
//maps the extension-less file name to (date, original manifest line)
Dictionary<string, Tuple<string, string>> ManifestMap = new Dictionary<string, Tuple<string, string>>();
//Skip(1) ignores the first (header) line
foreach (string ManifestLine in ManifestArray.Skip(1))
{
string[] ManifestFields = ManifestLine.Split(',');
string ManifestElement = ManifestFields[6];
string ManifestElement2 = ManifestFields[5];
//value to be retrieved and output to final text file
string ManifestElementDate = ManifestElement2.Replace("/", "-");
//value to be compared with the other text file
string ManifestNoExt = Regex.Replace(ManifestElement, "(\\.\\w+$)", "");
//the indexer does not throw on a duplicate key (the last line wins); assumes keys are effectively unique
ManifestMap[ManifestNoExt] = Tuple.Create(ManifestElementDate, ManifestLine);
}
//manifest lines that found a match, removed from RemoveManifest in a single pass at the end
HashSet<string> MatchedManifestLines = new HashSet<string>();
foreach (string OutputLine in OutputArray.Skip(1))
{
//value to be compared with other text file
string OutputElement = OutputLine.Split('|')[2];
//if the key exists then add the specified line of text to the list.
Tuple<string, string> match;
if (ManifestMap.TryGetValue(OutputElement, out match))
{
NewData.Add(OutputLine + "|" + match.Item1);
MatchedManifestLines.Add(match.Item2);
if (NewData.Count == 1000)
{
//if you've reached the count then output files into a new text file
shortcount = shortcount + 1;
File.WriteAllLines(pathString + "\\test" + shortcount + ".txt", NewData);
NewData.Clear();
}
}
}
RemoveManifest.RemoveAll(item => MatchedManifestLines.Contains(item));
//once all line of text have been searched combine all text files in directory
shortcount = shortcount + 1;
File.WriteAllLines(pathString + "\\test" + shortcount + ".txt", NewData);
String[] SplitTextFiles = Directory.GetFiles(pathString, "*.*", SearchOption.AllDirectories);
using (var FinalIndexFile = File.Create(pathString + "\\FinalIndex.txt"))
{
foreach (var file in SplitTextFiles)
{
using (var input = File.OpenRead(file))
{
input.CopyTo(FinalIndexFile);
}
File.Delete(file);
}
}
//File.WriteAllLines("\\test.txt", Directory.EnumerateFiles(pathString, @"*.txt").SelectMany(file => File.ReadLines(file)));
}
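To show the dictionary idea on its own, here is a minimal sketch that works on in-memory lines instead of files. The class name CrossReference and the two key selector parameters are made up for illustration; it also assumes each key appears once in the first list (with duplicates, the last line wins).

```csharp
using System;
using System.Collections.Generic;

static class CrossReference
{
    // keyOfA / keyOfB are hypothetical selectors that extract the value to match on.
    public static List<string> Join(
        IEnumerable<string> linesA, Func<string, string> keyOfA,
        IEnumerable<string> linesB, Func<string, string> keyOfB)
    {
        // One pass over A: key -> line. The indexer keeps the last line per key.
        var lookup = new Dictionary<string, string>();
        foreach (string a in linesA)
        {
            lookup[keyOfA(a)] = a;
        }

        // One pass over B: each dictionary lookup is O(1) on average.
        var result = new List<string>();
        foreach (string b in linesB)
        {
            string a;
            if (lookup.TryGetValue(keyOfB(b), out a))
            {
                result.Add(b + "|" + a);
            }
        }
        return result;
    }
}
```

For example, with linesA = { "x,2020/01/01", "y,2020/02/02" } keyed on the first comma field and linesB = { "1|y", "2|z" } keyed on the second pipe field, Join returns the single line "1|y|y,2020/02/02".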

Related

Delete a specific line from text files whose names I don't know

I need to find and delete all lines which contain the word "recto".
I searched the Stack Overflow forum, but everything I found deletes the line using a known path (directory and file name).
In my case I want to delete the lines containing "recto" in all files with a specific extension (*.txt) in the directory.
Thanks for your help.
Here is my code so far:
string sourceDir = @"C:\SRCE\";
string destinDir = @"C:\DIST\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
{
using (StreamReader sr_ = new StreamReader
(sourceDir + Path.GetFileName(file)))
{
string line = sr_.ReadLine();
if (line.Contains("recto"))
{
File.Copy(file, destinDir + Path.GetFileName(file));
string holdName = sourceDir + Path.GetFileName(file);
}
sr_.DiscardBufferedData();
sr_.Close();
}
}
}
You can try something like this. You were only identifying the files that contain the word, without making any attempt to remove it. In the end, you were just copying the files that included the word "recto".
string sourceDir = @"C:\SRCE\";
string destinDir = @"C:\DIST\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
{
using (StreamReader sr_ = new StreamReader
(sourceDir + Path.GetFileName(file)))
{
string res = string.Empty;
while(!sr_.EndOfStream)
{
var l = sr_.ReadLine();
if (l.Contains("recto"))
{
continue;
}
res += l + Environment.NewLine;
}
var streamWriter = File.CreateText(destinDir + Path.GetFileName(file));
streamWriter.Write(res);
streamWriter.Flush();
streamWriter.Close();
}
}
If the files are not really big, you can simplify your code a lot by reading all lines into memory, processing them with LINQ, and then writing the result to the destination files:
string sourceDir = @"C:\SRCE\";
string destinDir = @"C:\DIST\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
{
var lines = File.ReadLines(file);
var result = lines.Where(x => !x.Contains("recto")).ToArray(); // drop every line that contains the word, not only lines equal to it
File.WriteAllLines(Path.Combine(destinDir, Path.GetFileName(file)), result);
}
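If the files are big, the same filter can run without holding a whole file in memory. The following is only a sketch: the class and method names are made up, the case-insensitive match is my assumption (the question does not say whether "Recto" should also be removed), and the source and destination must be different files because the source is still being read while the destination is written.

```csharp
using System;
using System.IO;
using System.Linq;

static class LineFilter
{
    // Copies sourceFile to destinationFile, skipping every line that contains word.
    public static void CopyWithoutWord(string sourceFile, string destinationFile, string word)
    {
        // File.ReadLines enumerates lazily, so lines are streamed one at a time.
        var kept = File.ReadLines(sourceFile)
            .Where(line => line.IndexOf(word, StringComparison.OrdinalIgnoreCase) < 0);

        // WriteAllLines pulls from the lazy sequence while writing.
        File.WriteAllLines(destinationFile, kept);
    }
}
```

In the loop above it would be called as LineFilter.CopyWithoutWord(file, Path.Combine(destinDir, Path.GetFileName(file)), "recto").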

How can I filter out lines in a richtextbox?

I have code that shows the name of all .dll files within a directory on a rich textbox. How would I be able to filter/hide all filenames which I specify as unimportant and keep the rest?
Example:
Actual directory contains: 1.dll, 2.dll, 3.dll
Rich Textbox shows: 1.dll, 3.dll because 2.dll is specified as unimportant in the code.
Code I currently have that displays all files.
DirectoryInfo r = new DirectoryInfo(@"E:\SteamLibrary\steamapps\common\Grand Theft Auto V");
FileInfo[] rFiles = r.GetFiles("*.dll");
string rstr = "";
foreach (FileInfo rfile in rFiles)
{
rstr = rstr + rfile.Name + " ";
}
string strfinalR;
strfinalR = richTextBox3.Text + rstr;
richTextBox3.Text = (strfinalR);
Just make a blacklist :
string[] blacklist = new string[] { "2", "1337" };
and filter the filenames within your foreach
foreach(FileInfo rFile in rFiles)
{
if(blacklist.Any(bl => rFile.Name.Contains(bl)))
continue;
// your code
}
or when you retrieve files from r
r.GetFiles("*.dll").Where(file => !blacklist.Any( type => file.Name.Contains(type))).ToArray();
One approach is to make a list of files to ignore and then fetch all files except those in the ignore list. Something like this:
//ignore list
List<string> filesToIgnore = new List<string>() { "2.dll", "some_other.dll" };
//get files and filter them
DirectoryInfo r = new DirectoryInfo(@"E:\SteamLibrary\steamapps\common\Grand Theft Auto V");
List<FileInfo> rFiles = r.GetFiles("*.dll").Where(x => !filesToIgnore.Contains(x.Name)).ToList();
//your existing code follows
string rstr = "";
foreach (FileInfo rfile in rFiles)
{
rstr = rstr + rfile.Name + " ";
}
string strfinalR;
strfinalR = richTextBox3.Text + rstr;
richTextBox3.Text = (strfinalR);
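A small variation on the ignore list, as a sketch only: Windows file names are case-insensitive, so a HashSet with a case-insensitive comparer avoids missing "2.DLL". The class and method names are made up for illustration, and the filter works on plain name strings so it can be tried without a real directory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FileNameFilter
{
    // Returns the names that are not in the ignore list, ignoring case.
    public static List<string> ExcludeIgnored(IEnumerable<string> fileNames, IEnumerable<string> ignored)
    {
        var ignoreSet = new HashSet<string>(ignored, StringComparer.OrdinalIgnoreCase);
        return fileNames.Where(name => !ignoreSet.Contains(name)).ToList();
    }
}
```

With the names "1.dll", "2.DLL", "3.dll" and the ignore list { "2.dll" }, the result is "1.dll" and "3.dll". The text for the rich textbox can then be built with String.Join(" ", result) instead of concatenating in a loop.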

Open txt file and replace text in the first two lines (C#)

I am writing a program that generates txt files with an incremental number; however, I want each file's serial number to be written inside the txt file itself.
For example:
I generated 3 files, Mytext-000001.txt, Mytext-000002.txt, Mytext-000003.txt. The first line of each file contains "Hello 000000" and the second line contains "My number is 000000". Now I want to change each txt file to contain "Hello " + the incremental number that it is named with.
So the output of each file will be:
Mytext-000001.txt,
Hello 000001
My number is 000001
Mytext-000002.txt,
Hello 000002
My number is 000002
Mytext-000003.txt,
Hello 000003
My number is 000003
My Code
string path = System.IO.Path.GetDirectoryName(Application.ExecutablePath) + @"\path.txt";
string pth_input = null;
string pth_output = null;
using (StreamReader sx = File.OpenText(path))
{
pth_input = sx.ReadLine();
pth_output = sx.ReadLine();
}
Console.WriteLine("Number of Files?");
string number_of_files = Console.ReadLine();
int get_number_of_files = Int32.Parse(number_of_files) + 1;
string PathFiletoCopy = pth_input;
string Extension = System.IO.Path.GetExtension(PathFiletoCopy);
string PartialNewPathFile = System.IO.Path.Combine(System.IO.Path.GetDirectoryName(PathFiletoCopy), System.IO.Path.GetFileNameWithoutExtension(PathFiletoCopy) + "-");
for (int i = 1; i < get_number_of_files; i++)
{
System.IO.File.Copy(PathFiletoCopy, PartialNewPathFile + i.ToString("D6") + Extension);
}
string[] txtfiles = Directory.GetFiles(pth_output, "*.txt");
foreach (var file in txtfiles)
{
string get_file_counter = System.IO.Path.GetDirectoryName(file.Substring(7,6));
FileStream fs = new FileStream(file, FileMode.Append, FileAccess.Write);
FileStream fi = new FileStream(file, FileMode.Open, FileAccess.Read);
using (StreamReader reader = new StreamReader(fi))
{
using (StreamWriter writer = new StreamWriter(fs))
{
string line = null;
while ((line = reader.ReadLine()) != null)
{
string replace_line_one = line.Replace("Hello 000001","Hello"+ "["+get_file_counter+"]");
string replace_line_two = line.Replace("My number is 000001", "My number is" + "[" + get_file_counter + "]");
}
writer.Close();
} reader.Close();
}
}
Console.Read();
I hope you can help
Appreciate your help guys
string[] files = Directory.GetFiles(directoryPath, "*.txt");
Regex regex = new Regex("\\d+(?=\\.txt)");
foreach (var file in files)
{
string[] lines = File.ReadAllLines(file);
string number = regex.Match(Path.GetFileName(file)).Value;
lines[0] = "Hello " + number;
lines[1] = "My number is " + number;
File.WriteAllLines(file, lines);
}
The regex extracts the digits immediately before ".txt", so this solution does not depend on the rest of the file name.
This might do the trick for you
DirectoryInfo myDir = new DirectoryInfo(pth_output);
int count = myDir.GetFiles().Length + 1;
string thenumber = String.Format("{0:000000}", count);
string filename = "Mytext-" + thenumber + ".txt";
string filetext = "Hello " + thenumber + Environment.NewLine + "My number is " + thenumber;
File.WriteAllText(Path.Combine(myDir.FullName, filename), filetext);
In myDir I am expecting that you can supply the path of the folder which contains all the txt files.
myDir.GetFiles().Length gives you the count of the files that exist in the folder; as we want only txt files, you can use Directory.GetFiles(path, "*.txt", SearchOption.AllDirectories).Length instead.
String.Format("{0:000000}", count) gives you the number in your format with leading zeros.
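As a quick illustration of the zero padding, here is a minimal sketch (the class and method names are made up). Note that the format item needs the curly braces: String.Format("0:000000", count) contains no placeholder and just returns the literal text, while "{0:000000}" and ToString("D6") both pad to six digits.

```csharp
using System;

static class SerialNumber
{
    // Pads the number with leading zeros to six digits.
    public static string Pad(int number)
    {
        return number.ToString("D6");
    }

    // Builds the two lines the question wants inside each file.
    public static string[] BuildLines(int number)
    {
        string padded = Pad(number);
        return new[] { "Hello " + padded, "My number is " + padded };
    }
}
```

SerialNumber.Pad(3) gives "000003", the same as String.Format("{0:000000}", 3).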

Copying CSV file while reordering/adding empty columns

Copying CSV file while reordering/adding empty columns.
For example, if every line of the incoming file has values for 3 out of 10 columns, in an order different from the output (except the first line, which is the header with column names):
col2,col6,col4 // first line - column names
2, 5, 8 // subsequent lines - values for 3 columns
and output expected to have
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
then the output should be "" for col0, col1, col3, col5, col7, col8, col9, and the values for col2, col4, col6 from the input file. So for the second line shown (2,5,8) the expected output is ",,2,,8,,5,,,"
Below is the code I've tried; it is slower than I want.
I have two lists.
The first list filecolumnnames is created by splitting a delimited string (line) and this list gets recreated for every line in the file.
The second list list has the order in which the first list needs to be rearranged and re concatenated.
This works
string fileName = "F:\\temp.csv";
//file data has first row col3,col2,col1,col0;
//second row: 4,3,2,1
//so on
string fileName_recreated = "F:\\temp_1.csv";
int count = 0;
const Int32 BufferSize = 1028;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
String line;
List<int> list = new List<int>();
string orderedcolumns = "\"\"";
string tableheader = "col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10";
List<string> tablecolumnnames = new List<string>();
List<string> filecolumnnames = new List<string>();
while ((line = streamReader.ReadLine()) != null)
{
count = count + 1;
StringBuilder sb = new StringBuilder("");
tablecolumnnames = tableheader.Split(',').ToList();
if (count == 1)
{
string fileheader = line;
//fileheader=""col2,col1,col0"
filecolumnnames = fileheader.Split(',').ToList();
foreach (string col in tablecolumnnames)
{
int index = filecolumnnames.IndexOf(col);
if (index == -1)
{
sb.Append(",");
// orderedcolumns=orderedcolumns+"+\",\"";
list.Add(-1);
}
else
{
sb.Append(filecolumnnames[index] + ",");
//orderedcolumns = orderedcolumns+ "+filecolumnnames["+index+"]" + "+\",\"";
list.Add(index);
}
// MessageBox.Show(orderedcolumns);
}
}
else
{
filecolumnnames = line.Split(',').ToList();
foreach (int items in list)
{
//MessageBox.Show(items.ToString());
if (items == -1)
{
sb.Append(",");
}
else
{
sb.Append(filecolumnnames[items] + ",");
}
}
//expected format sb.Append(filecolumnnames[3] + "," + filecolumnnames[2] + "," + filecolumnnames[2] + ",");
//sb.Append(orderedcolumns);
var result = String.Join (", ", list.Select(index => filecolumnnames[index]));
}
using (FileStream fs = new FileStream(fileName_recreated, FileMode.Append, FileAccess.Write))
using (StreamWriter sw = new StreamWriter(fs))
{
sw.WriteLine(sb.ToString());
}
}
I am trying to make it faster by constructing a string orderedcolumns and remove the second for each loop which happens for every row and replace it with constructed string.
so if you uncomment the orderedcolumns string construction orderedcolumns = orderedcolumns+ "+filecolumnnames["+index+"]" + "+\",\""; and uncomment the append sb.Append(orderedcolumns); I am expecting the value inside the constructed string but when I append the orderedcolumns it is appending the text i.e.
""+","+filecolumnnames[3]+","+filecolumnnames[2]+","+filecolumnnames[1]+","+filecolumnnames[0]+","+","+","+","+","+","+","
i.e. I instead want it to take the value inside the filecolumnnames[3] list and not the filecolumnnames[3] name itself.
Expected value: if that line has 1,2,3,4
I want the output to be 4,3,2,1 as filecolumnnames[3] will have 4, filecolumnnames[2] will have 3..
String.Join is the way to construct comma/space delimited strings from sequence.
var result = String.Join(", ", list.Select(index => filecolumnnames[index]));
Since you are reading only a subset of the columns, and the orders in input and output don't match, I'd use a dictionary to hold each row of input.
// filecolumnnames must hold the header names of the input file here, not the values of the current line
var row = filecolumnnames
.Zip(line.Split(','), (Name, Value) => new { Name, Value })
.ToDictionary(x => x.Name, x => x.Value);
For output I'd fill the sequence from defaults or the input row:
var outputLine = String.Join(",",
tablecolumnnames
.Select(name => row.ContainsKey(name) ? row[name] : ""));
Note code is typed in and not compiled.
orderedcolumns = orderedcolumns+ "+filecolumnnames["+index+"]" + "+\",\"";
should be
orderedcolumns = orderedcolumns+ filecolumnnames[index] + ",";
However, you should use String.Join as others have pointed out. Or, since AppendFormat is a StringBuilder method, use it on your StringBuilder sb:
sb.AppendFormat("{0},", filecolumnnames[index]);
You will have to deal with the extra ',' at the end.
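Putting the suggestions together, one possible sketch is to compute the column index map once from the header and then build every output line with String.Join. The class and method names are made up, and the code assumes a simple CSV without quoted fields or embedded commas. Separately, the question's code reopens the output file for every line; opening one StreamWriter before the loop is likely to help at least as much as the string handling.

```csharp
using System;
using System.Linq;

static class CsvReorder
{
    // For every output column, finds its index in the input header, or -1 if it is absent.
    public static int[] BuildIndexMap(string[] outputColumns, string[] inputColumns)
    {
        return outputColumns.Select(col => Array.IndexOf(inputColumns, col)).ToArray();
    }

    // Rearranges one input line according to the index map; missing columns become "".
    public static string ReorderLine(string line, int[] indexMap)
    {
        string[] values = line.Split(',');
        return String.Join(",",
            indexMap.Select(i => i >= 0 && i < values.Length ? values[i] : ""));
    }
}
```

With the output header col0..col9 and the input header col2,col6,col4, the map is computed once, and ReorderLine("2,5,8", map) returns ",,2,,8,,5,,," with no trailing comma to clean up.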

C# Edit string in file - delete a character (000)

I am a rookie in C#, but I need to solve one problem.
I have several text files in a folder and each text file has this structure:
IdNr 000000100
Name Name
Lastname Lastname
Sex M
.... etc...
Loading all the files from the folder is no problem, but I need to delete the leading zeros in IdNr, i.e. delete 000000 and leave 100, and then save the file. Each file has a different IdNr, which makes it harder :(
Yes, it is possible to edit each file manually, but with 3000 files that is not practical :)
Is there an algorithm in C# which could delete this 000000 and leave only the number 100?
Thank you All.
Vaclav
So, thank you ALL !
But in the End I have this Code :-) :
using System.IO;
namespace name
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Browse_Click(object sender, EventArgs e)
{
DialogResult dialog = folderBrowserDialog1.ShowDialog();
if (dialog == DialogResult.OK)
TP_zdroj.Text = folderBrowserDialog1.SelectedPath;
}
private void start_Click(object sender, EventArgs e)
{
try
{
foreach (string file in Directory.GetFiles(TP_zdroj.Text, "*.txt"))
{
string text = File.ReadAllText(file, Encoding.Default);
text = System.Text.RegularExpressions.Regex.Replace(text, "IdNr 000*", "IdNr ");
File.WriteAllText(file, text, Encoding.Default);
}
}
catch
{
MessageBox.Show("Warning...!");
return;
}
{
MessageBox.Show("Done");
}
}
}
}
Thank you ALL ! ;)
You can use int.Parse:
int number = int.Parse("000000100");
String withoutzeros = number.ToString();
Regarding your read/save file issue: do the files contain more than one record? Is that the header, or is each record a list of key/value pairs like "IdNr 000000100"? It's difficult to answer without this information.
Edit: Here's a simple but efficient approach which should work if the format is strict:
var files = Directory.EnumerateFiles(path, "*.txt", SearchOption.TopDirectoryOnly);
foreach (var fPath in files)
{
String[] oldLines = File.ReadAllLines(fPath); // load into memory is faster when the files are not really huge
String key = "IdNr ";
if (oldLines.Length != 0)
{
IList<String> newLines = new List<String>();
foreach (String line in oldLines)
{
String newLine = line;
if (line.Contains(key))
{
int numberRangeStart = line.IndexOf(key) + key.Length;
int numberRangeEnd = line.IndexOf(" ", numberRangeStart);
// no space after the number: it runs to the end of the line
if (numberRangeEnd < 0) numberRangeEnd = line.Length;
String numberStr = line.Substring(numberRangeStart, numberRangeEnd - numberRangeStart);
int number = int.Parse(numberStr);
String withoutZeros = number.ToString();
newLine = line.Replace(key + numberStr, key + withoutZeros);
}
newLines.Add(newLine);
}
File.WriteAllLines(fPath, newLines);
}
}
Use TrimStart
var trimmedText = number.TrimStart('0');
This should do it. It assumes your files have a .txt extension, and it removes all occurrences of "000000" from each file.
foreach (string fileName in Directory.GetFiles(".", "*.txt")) // "." is the current directory; replace it with your folder path
{
File.WriteAllText(fileName, File.ReadAllText(fileName).Replace("000000", ""));
}
These are the steps you would want to take:
Loop each file
Read file line by line
for each line split on " " and remove leading zeros from 2nd element
write the new line back to a temp file
after all lines processed, delete original file and rename temp file
do next file
(you can avoid the temp file part by reading each file in full into memory, but depending on your file sizes this may not be practical)
You can remove the leading zeros with something like this:
string s = "000000100";
s = s.TrimStart('0');
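The steps above could be sketched roughly as follows. The class and method names are made up, and the code assumes the line starts with "IdNr " followed by the number, as in the question; it also uses the default UTF-8 encoding, so pass an explicit Encoding if your files use another one. Only the IdNr line is touched, so zeros elsewhere in the file stay as they are.

```csharp
using System;
using System.IO;
using System.Linq;

static class IdNrCleaner
{
    // Removes leading zeros from the number on an "IdNr " line; other lines are returned unchanged.
    public static string StripLeadingZeros(string line)
    {
        const string key = "IdNr ";
        if (!line.StartsWith(key, StringComparison.Ordinal))
        {
            return line;
        }
        string digits = line.Substring(key.Length).Trim().TrimStart('0');
        // Keep a single zero when the number consists only of zeros.
        return key + (digits.Length == 0 ? "0" : digits);
    }

    // Writes the cleaned lines to a temp file, then replaces the original with it.
    public static void CleanFile(string path)
    {
        string tempPath = path + ".tmp";
        File.WriteAllLines(tempPath, File.ReadLines(path).Select(line => StripLeadingZeros(line)));
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

StripLeadingZeros("IdNr 000000100") returns "IdNr 100", and a line such as "Sex M" comes back unchanged. For the whole folder, call CleanFile for each path returned by Directory.GetFiles(folder, "*.txt").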
Simply, read every token from the file and use this method:
var token = "000000100";
var result = token.TrimStart('0');
You can write a function similar to this one:
static IEnumerable<string> ModifiedLines(string file) {
string line;
using(var reader = File.OpenText(file)) {
while((line = reader.ReadLine()) != null) {
string[] tokens = line.Split(new char[] { ' ' });
line = string.Empty;
foreach (var token in tokens)
{
line += token.TrimStart('0') + " ";
}
yield return line;
}
}
}
Usage:
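// ToList() reads the whole file first, so the reader is closed before the same file is overwritten (needs using System.Linq)
File.WriteAllLines(file, ModifiedLines(file).ToList());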
