Remove Duplicate Lines From Text File?

Remove Duplicate Lines From Text File? - c#

Given an input file of text lines, I want duplicate lines to be identified and removed. Please show a simple snippet of C# that accomplishes this.

For small files:
string[] lines = File.ReadAllLines("filename.txt");
File.WriteAllLines("filename.txt", lines.Distinct().ToArray());

This should do (and will copy with large files).
Note that it only removes duplicate consecutive lines, i.e.
a
b
b
c
b
d
will end up as
a
b
c
b
d
If you want no duplicates anywhere, you'll need to keep a set of lines you've already seen.
using System;
using System.IO;
class DeDuper
{
static void Main(string[] args)
{
if (args.Length != 2)
{
Console.WriteLine("Usage: DeDuper <input file> <output file>");
return;
}
using (TextReader reader = File.OpenText(args[0]))
using (TextWriter writer = File.CreateText(args[1]))
{
string currentLine;
string lastLine = null;
while ((currentLine = reader.ReadLine()) != null)
{
if (currentLine != lastLine)
{
writer.WriteLine(currentLine);
lastLine = currentLine;
}
}
}
}
}
Note that this assumes Encoding.UTF8, and that you want to use files. It's easy to generalize as a method though:
static void CopyLinesRemovingConsecutiveDupes
(TextReader reader, TextWriter writer)
{
string currentLine;
string lastLine = null;
while ((currentLine = reader.ReadLine()) != null)
{
if (currentLine != lastLine)
{
writer.WriteLine(currentLine);
lastLine = currentLine;
}
}
}
(Note that that doesn't close anything - the caller should do that.)
Here's a version that will remove all duplicates, rather than just consecutive ones:
static void CopyLinesRemovingAllDupes(TextReader reader, TextWriter writer)
{
string currentLine;
HashSet<string> previousLines = new HashSet<string>();
while ((currentLine = reader.ReadLine()) != null)
{
// Add returns true if it was actually added,
// false if it was already there
if (previousLines.Add(currentLine))
{
writer.WriteLine(currentLine);
}
}
}

For a long file (and non consecutive duplications) I'd copy the files line by line building a hash // position lookup table as I went.
As each line is copied check for the hashed value, if there is a collision double check that the line is the same and move to the next. (
Only worth it for fairly large files though.

Here's a streaming approach that should incur less overhead than reading all unique strings into memory.
var sr = new StreamReader(File.OpenRead(#"C:\Temp\in.txt"));
var sw = new StreamWriter(File.OpenWrite(#"C:\Temp\out.txt"));
var lines = new HashSet<int>();
while (!sr.EndOfStream)
{
string line = sr.ReadLine();
int hc = line.GetHashCode();
if(lines.Contains(hc))
continue;
lines.Add(hc);
sw.WriteLine(line);
}
sw.Flush();
sw.Close();
sr.Close();

I am new to .net & have written something more simpler,may not be very efficient.Please fill free to share your thoughts.
class Program
{
static void Main(string[] args)
{
string[] emp_names = File.ReadAllLines("D:\\Employee Names.txt");
List<string> newemp1 = new List<string>();
for (int i = 0; i < emp_names.Length; i++)
{
newemp1.Add(emp_names[i]); //passing data to newemp1 from emp_names
}
for (int i = 0; i < emp_names.Length; i++)
{
List<string> temp = new List<string>();
int duplicate_count = 0;
for (int j = newemp1.Count - 1; j >= 0; j--)
{
if (emp_names[i] != newemp1[j]) //checking for duplicate records
temp.Add(newemp1[j]);
else
{
duplicate_count++;
if (duplicate_count == 1)
temp.Add(emp_names[i]);
}
}
newemp1 = temp;
}
string[] newemp = newemp1.ToArray(); //assigning into a string array
Array.Sort(newemp);
File.WriteAllLines("D:\\Employee Names.txt", newemp); //now writing the data to a text file
Console.ReadLine();
}
}

Related

How to find average from multiple lines of a text file

I need to take a number from every line of a text file and find the average. I'm using stream reader to take the number out of a text file, I don't know where to go from here. Here's what I've done so far
using (StreamReader sr = new StreamReader("pupilSkiTimes.txt"))
{
string line = "";
while ((line = sr.ReadLine()) != null)
{
string[] components = line.Split("~".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
skiTime.Add(components[7]);
}
sr.Close();
}
How do I get this to read from every line of the text file, and once that's done, how do I get the average.
In case you need to know, the data I'm trying to read is doubles, e.g "23.43"

Here is how I will do it, as you mentioned in comments components[7] are double data that you read from the file.
We need to parse it to double, sum it up and divide it by the counting time we are able to parse the number in the file. If the number is not parsed and you want the total average of all lines then move the count out of the if statement.
using (StreamReader sr = new StreamReader("pupilSkiTimes.txt"))
{
string line;
double sum = 0;
int count = 0;
while ((line = sr.ReadLine()) != null)
{
string[] components = line.Split("~".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries);
if (double.TryParse(components[7], out var result))
{
count++;
sum += result;
}
}
sr.Close();
var average = sum / count;
Console.WriteLine(average);
}

I think the handy method for this situation like this is useful hopefully you use it, I am using a similar this in my codes.
I am passing FilePath, Separator, and index value
static double getAvgAtIndex(string fPath, char seperator, int index)
{
double sum = 0;
int counter = 0;
using (StreamReader sr = new StreamReader(fPath))
{
string line = "";
while ((line = sr.ReadLine()) != null)
{
double rawData = 0;
string[] lineData = line.Split(seperator, StringSplitOptions.RemoveEmptyEntries);
double.TryParse(lineData[index], out rawData);
sum += rawData;
counter++;
}
sr.Close();
}
return sum / counter;
}
Usage of this,
static void Main(string[] args)
{
Console.WriteLine("The Avg is: {0}", getAvgAtIndex(#"..\myTextFile.txt", '~' ,1));
// The Avg is: 34.688
}

Here is how to use LINQ to clean up the code a bit
static class Program
{
static void Main(string[] args)
{
var data = File.ReadAllLines("pupilSkiTimes.txt")
.Select((line)=> line.Split("~".ToCharArray(), StringSplitOptions.RemoveEmptyEntries));
List<double> skiTime = new List<double>();
foreach (var parts in data)
{
if (double.TryParse(parts[7], out double x))
{
skiTime.Add(x);
}
}
double average = skiTime.Average();
}
}

Split one string and put it in two arrays

I'd like to split several strings of a text file into two strings each (example: car;driver). I do not know how to put the first word in array1 and the second word in array2. So I tried with an if query for a semicolon to put every single letter of word1 in array1 and the same with the second word to put them back together to the words later.
But I think it's too complicated what I've done and I am stuck now, lol.
Here I show a piece of my code:
private void BtnShow_Click(object sender, EventArgs e)
{
LibPasswords.Items.Clear();
string path = "passwords.txt";
int counter = 0;
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
using (StreamReader reader = new StreamReader(fs))
{
while (reader.ReadLine() != null)
{
counter++;
}
//for (int i = 0; i < counter; i++)
//{
// var Website = reader.ReadLine().Split(';').Select(x => new String[] { x });
// var Passwort = reader.ReadLine().Split(';').Select(y => new String[] { y });
// LibPasswords.Items.Add(String.Format(table, Website, Passwort));
//}
string[] firstWord = new string[counter];
string[] lastWord = new string[counter];
int i = 0;
int index = 0;
while (reader.Peek() >= 0)
{
string ch = reader.Read().ToString();
if (ch != ";")
{
firstWord[i] = ch;
i++;
}
else
{
index = 1;
}
while (reader.Peek() >= 0)
{
??????????????????????????????????
}
}
}
}
Sorry for my English, it's not my mother tongue.

As you don't know in advance how many lines there are, it is more convenient to use a List<string> instead of a string[]. A List will automatically increase its capacity as needed.
You can use the string.Split method to split the string at the ';' into an array. If the resulting array has the correct number of parts, you can add those parts to the Lists.
List<string> firstWord = new List<string>();
List<string> lastWord = new List<string>();
string fileName = #"C:\temp\SO61715409.txt";
foreach (string line in File.ReadLines(fileName))
{
string[] parts = line.Split(new char[] { ';' });
if (parts.Length == 2)
{
firstWord.Add(parts[0]);
lastWord.Add(parts[1]);
}
}

Parsing a huge text file(around 2GB) with custom delimiters

I have a huge text file around 2GB which I am trying to parse in C#.
The file has custom delimiters for rows and columns. I want to parse the file and extract the data and write to another file by inserting column header and replacing RowDelimiter by newline and ColumnDelimiter by tab so that I can get the data in tabular format.
sample data:
1'~'2'~'3#####11'~'12'~'13
RowDelimiter: #####
ColumnDelimiter: '~'
I keep on getting System.OutOfMemoryException on the following line
while ((line = rdr.ReadLine()) != null)
public void ParseFile(string inputfile,string outputfile,string header)
{
using (StreamReader rdr = new StreamReader(inputfile))
{
string line;
while ((line = rdr.ReadLine()) != null)
{
using (StreamWriter sw = new StreamWriter(outputfile))
{
//Write the Header row
sw.Write(header);
//parse the file
string[] rows = line.Split(new string[] { ParserConstants.RowSeparator },
StringSplitOptions.None);
foreach (string row in rows)
{
string[] columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
foreach (string column in columns)
{
sw.Write(column + "\\t");
}
sw.Write(ParserConstants.NewlineCharacter);
Console.WriteLine();
}
}
Console.WriteLine("File Parsing completed");
}
}
}

As mentioned already in the comments you won't be able to use ReadLine to handle this, you'll have to essentially process the data one byte - or character - at a time. The good news is that this is basically how ReadLine works anyway, so we're not losing a lot in this case.
Using a StreamReader we can read a series of characters from the source stream (in whatever encoding you need) into an array. Using that and a StringBuilder we can process the stream in chunks and check for separator sequences on the way.
Here's a method that will handle an arbitrary delimiter:
public static IEnumerable<string> ReadDelimitedRows(StreamReader reader, string delimiter)
{
char[] delimChars = delimiter.ToArray();
int matchCount = 0;
char[] buffer = new char[512];
int rc = 0;
StringBuilder sb = new StringBuilder();
while ((rc = reader.Read(buffer, 0, buffer.Length)) > 0)
{
for (int i = 0; i < rc; i++)
{
char c = buffer[i];
if (c == delimChars[matchCount])
{
if (++matchCount >= delimChars.Length)
{
// found full row delimiter
yield return sb.ToString();
sb.Clear();
matchCount = 0;
}
}
else
{
if (matchCount > 0)
{
// append previously matched portion of the delimiter
sb.Append(delimChars.Take(matchCount));
matchCount = 0;
}
sb.Append(c);
}
}
}
// return the last row if found
if (sb.Length > 0)
yield return sb.ToString();
}
This should handle any cases where part of your block delimiter can appear in the actual data.
In order to translate your file from the input format you describe to a simple tab-delimited format you could do something along these lines:
const string RowDelimiter = "#####";
const string ColumnDelimiter = "'~'";
using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(ouputFilename)))
{
foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
{
writer.Write(row.Replace(ColumnDelimiter, "\t"));
}
}
That should process fairly quickly without eating up too much memory. Some adjustments might be required for non-ASCII output.

Read the data into a buffer and then do your parsing.
using (StreamReader rdr = new StreamReader(inputfile))
using (StreamWriter sw = new StreamWriter(outputfile))
{
char[] buffer = new char[256];
int read;
//Write the Header row
sw.Write(header);
string remainder = string.Empty;
while ((read = rdr.Read(buffer, 0, 256)) > 0)
{
string bufferData = new string(buffer, 0, read);
//parse the file
string[] rows = bufferData.Split(
new string[] { ParserConstants.RowSeparator },
StringSplitOptions.None);
rows[0] = remainder + rows[0];
int completeRows = rows.Length - 1;
remainder = rows.Last();
foreach (string row in rows.Take(completeRows))
{
string[] columns = row.Split(
new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
foreach (string column in columns)
{
sw.Write(column + "\\t");
}
sw.Write(ParserConstants.NewlineCharacter);
Console.WriteLine();
}
}
if(reamainder.Length > 0)
{
string[] columns = remainder.Split(
new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
foreach (string column in columns)
{
sw.Write(column + "\\t");
}
sw.Write(ParserConstants.NewlineCharacter);
Console.WriteLine();
}
Console.WriteLine("File Parsing completed");
}

The problem you have is that you are eagerly consuming the whole file and placing it in memory. Attempting to split a 2GB file in memory is going to be problematic, as you now know.
Solution? Consume one lime a time. Because your file doesn't have a standard line separator you'll have to implement a custom parser that does this for you. The following code does just that (or I think it does, I haven't tested it). Its probably very improvable from a performance perspective but it should at least get you started in the right direction (c#7 syntax):
public static IEnumerable<string> GetRows(string path, string rowSeparator)
{
bool tryParseSeparator(StreamReader reader, char[] buffer)
{
var count = reader.Read(buffer, 0, buffer.Length);
if (count != buffer.Length)
return false;
return Enumerable.SequenceEqual(buffer, rowSeparator);
}
using (var reader = new StreamReader(path))
{
int peeked;
var rowBuffer = new StringBuilder();
var separatorBuffer = new char[rowSeparator.Length];
while ((peeked = reader.Peek()) > -1)
{
if ((char)peeked == rowSeparator[0])
{
if (tryParseSeparator(reader, separatorBuffer))
{
yield return rowBuffer.ToString();
rowBuffer.Clear();
}
else
{
rowBuffer.Append(separatorBuffer);
}
}
else
{
rowBuffer.Append((char)reader.Read());
}
}
if (rowBuffer.Length > 0)
yield return rowBuffer.ToString();
}
}
Now you have a lazy enumeration of rows from your file, and you can process it as you intended to:
foreach (var row in GetRows(inputFile, ParserConstants.RowSeparator))
{
var columns = line.Split(new string[] {ParserConstants.ColumnSeparator},
StringSplitOptions.None);
//etc.
}

I think this should do the trick...
public void ParseFile(string inputfile, string outputfile, string header)
{
int blockSize = 1024;
using (var file = File.OpenRead(inputfile))
{
using (StreamWriter sw = new StreamWriter(outputfile))
{
int bytes = 0;
int parsedBytes = 0;
var buffer = new byte[blockSize];
string lastRow = string.Empty;
while ((bytes = file.Read(buffer, 0, buffer.Length)) > 0)
{
// Because the buffer edge could split a RowDelimiter, we need to keep the
// last row from the prior split operation. Append the new buffer to the
// last row from the prior loop iteration.
lastRow += Encoding.Default.GetString(buffer,0, bytes);
//parse the file
string[] rows = lastRow.Split(new string[] { ParserConstants.RowSeparator }, StringSplitOptions.None);
// We cannot process the last row in this set because it may not be a complete
// row, and tokens could be clipped.
if (rows.Count() > 1)
{
for (int i = 0; i < rows.Count() - 1; i++)
{
sw.Write(new Regex(ParserConstants.ColumnSeparator).Replace(rows[i], "\t") + ParserConstants.NewlineCharacter);
}
}
lastRow = rows[rows.Count() - 1];
parsedBytes += bytes;
// The following statement is not quite true because we haven't parsed the lastRow.
Console.WriteLine($"Parsed {parsedBytes.ToString():N0} bytes");
}
// Now that there are no more bytes to read, we know that the lastrow is complete.
sw.Write(new Regex(ParserConstants.ColumnSeparator).Replace(lastRow, "\t"));
}
}
Console.WriteLine("File Parsing completed.");
}

Late to the party here, but in case anyone else want to know easy way to load such large CSV file with custom delimiters, Cinchoo ETL does the job for you.
using (var parser = new ChoCSVReader("CustomNewLine.csv")
.WithDelimiter("~")
.WithEOLDelimiter("#####")
)
{
foreach (dynamic x in parser)
Console.WriteLine(x.DumpAsJson());
}
Disclaimer: I'm the author of this library.

How to search to search a file for string, display the line containing the string and also the 6 lines preceding it

I am trying to search through a text file for a string, once I have found this string I need to display this line and then also display the 6 preceding lines i.e. which will contain the details about the error message in the string. I have been searching for similar code and have found the following code but it doesn’t meet my requirements, just wondering if it's possible to do this.
Thanks,
John.
private static void Main(string[] args)
{
string cacheline = "";
string line;
System.IO.StreamReader file = new
System.IO.StreamReader(#"D:\Temp\AccessOutlook.txt");
List<string> lines = new List<string>();
while ((line = file.ReadLine()) != null)
{
if (line.Contains("errors"))
{
lines.Add(cacheline);
}
cacheline = line;
}
file.Close();
foreach (var l in lines)
{
Console.WriteLine(l);
}
}
}

This is probably what you want:
static void Main(string[] args)
{
Queue<string> lines = new Queue<string>();
using (var reader = new StreamReader(args[0]))
{
string line;
while ((line = reader.ReadLine()) != null)
{
if (line.Contains("error"))
{
Console.WriteLine("----- ERROR -----");
foreach (var errLine in lines)
Console.WriteLine(errLine);
Console.WriteLine(line);
Console.WriteLine("-----------------");
}
lines.Enqueue(line);
while (lines.Count > 6)
lines.Dequeue();
}
}
}

You can keep caching the lines until you find the line you are looking for:
using(var file = new StreamReader(#"D:\Temp\AccessOutlook.txt"))
{
List<string> lines = new List<string>();
while ((line = file.ReadLine()) != null)
{
if (!line.Contains(myString))
{
lines.Add(line);
}
else
{
Console.WriteLine(string.Join(Environment.NewLine, lines.Concat(new[] { line })));
}
if(lines.Count > 6) lines.RemoveAt(0);
}
}

string filename = "filename"; // Put your own filename here.
string target = "target"; // Put your target string here.
int numLinesToShow = 7;
var lines = File.ReadAllLines(filename);
int index = Array.FindIndex(lines, element => element.Contains(target));
if (index >= 0)
{
int start = Math.Max(0, index - numLinesToShow + 1);
var result = lines.Skip(start).Take(numLinesToShow).ToList();
// Use result.
}

The code below will open the file, search for the line you want, and then write the 6 preceeding lines to the Console.
var lines = File.ReadAllLines(filePath);
int lineIndex;
for (lineIndex = 0; lineIndex < lines.Length - 1; lineIndex++)
{
if (lines[lineIndex] == textToFind)
{
break;
}
}
var startLine = Math.Max(0, lineIndex - 6);
for (int i = startLine; i < lineIndex; i++)
{
Console.WriteLine(lines[i]);
}

Split large string into smaller chunks in c#

I have a large string separated by newline character. This string contains 100 lines. I want to split these line into small chunks say chunk of 20 also based on newline character.
Let's say the string variable is like this,
Line1 This is line2 Line3 is here I am Line4
Now I want to split this large string variable into small chunks of 2. The result should be 2 strings as,
Line1 This is line2
Line3 is here I am Line4
Using Split function, I am not getting the expected results. Please help me in achieving this.
Thanks in advance,
Vijay

The simple approach (Split on Environment.NewLine, then loop and append):
public static List<string> GetStringSegments(string originalString, int linesPerSegment)
{
List<string> segments = new List<string>();
string[] allLines = originalString.Split(new string[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
StringBuilder sb = new StringBuilder();
int linesProcessed = 0;
for (int i = 0; i < allLines.Length; i++)
{
sb.AppendLine(allLines[i]);
linesProcessed++;
if (linesProcessed == linesPerSegment
|| i == allLines.Length-1)
{
segments.Add(sb.ToString());
sb.Clear();
inesProcessed = 0;
}
}
return segments;
}
The above approach is slightly inefficient since it requires splitting the string first into individual lines, which creates unnecessary strings. A string of 1000 lines will create an array of 1000 strings. We can improved this if we just scan the string and search for \n:
public static List<string> GetStringSegments(string original, int linesPerSegment)
{
List<string> segments = new List<string>();
int startIndex = 0;
int newLinesEncountered = 0;
for (int i = 0; i < original.Length; i++)
{
if (original[i] == '\n')
{
newLinesEncountered++;
}
if (newLinesEncountered == linesPerSegment
|| i == original.Length - 1)
{
segments.Add(original.Substring(startIndex, (i - startIndex + 1)));
startIndex = i + 1;
newLinesEncountered = 0;
}
}
return segments;
}

You can use something like the batch operator from http://www.make-awesome.com/2010/08/batch-or-partition-a-collection-with-linq
string s = "[YOUR DATA]";
var lines = s.Split(new[]{Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
foreach(var batch in lines.Batch(20))
{
foreach(batchLine in batch)
{
Console.Writeline(batchLine);
}
}
static class LinqEx
{
// from http://www.make-awesome.com/2010/08/batch-or-partition-a-collection-with-linq
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection,
int batchSize)
{
List<T> nextbatch = new List<T>(batchSize);
foreach (T item in collection)
{
nextbatch.Add(item);
if (nextbatch.Count == batchSize)
{
yield return nextbatch;
nextbatch = new List<T>(batchSize);
}
}
if (nextbatch.Count > 0)
yield return nextbatch;
}
}

As several people mentioned, using string.Split will split the whole string into memory, which might be an allocation-heavy operation. This is why we have the TextReader class and its descendants, which should provide better memory performance, and might also be clearer, logically:
using (var reader = new StringReader(myString))
{
do
{
StringBuilder newString = null;
StringWriter newStringWriter = null;
if (lineCounter % 20 == 0)
{
newString = new StringBuilder();
newStringWriter = new StringWriter(newString);
newStringCollection.Add(newString);
}
string line = reader.ReadLine();
if (!string.isNullOrEmpty(line))
{
newStringWriter.WriteLine(line);
lineCounter++;
}
}
while (line != null)
}
We're using the StringReader to read our big string, one line at a time. And the corresponding StringWriter writes those lines to the new string, one line a time. After every 20 lines, we start a new StringBuilder (and the appropriate StringWriter wrapper).

split the strings by newline.
Then merge/fetch the number of strings together while using the strings.

string s = "Line1\nThis is line2 \nLine3 is here\nI am Line4";
string [] str = s.split('\n');
List<String> str1 = new List<String>();
for(int i=0; i<str.Length; i+=2)
{
string ss = str[i];
if(i+1 <str.Length)
ss += '\n' + str[i+1];
str1.Add(ss);
}
str = str1.ToArray();
If condition has been checked inside loop because may be the length of str is odd

var strAray = myLongString.Split('\n').ToList();
var skip=0;
var take=20;
var chunk = strAray.Skip(skip).Take(take).ToList();
While(chunk.Count >0)
{
foreach(var line in chunk)
{
// use line string
}
skip++;
chunk = strAray.Skip(skip).Take(take).ToList()
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Remove Duplicate Lines From Text File? - c#

Given an input file of text lines, I want duplicate lines to be identified and removed. Please show a simple snippet of C# that accomplishes this.

For small files: string[] lines = File.ReadAllLines("filename.txt"); File.WriteAllLines("filename.txt", lines.Distinct().ToArray());

Related

How to find average from multiple lines of a text file

Split one string and put it in two arrays

Parsing a huge text file(around 2GB) with custom delimiters

How to search to search a file for string, display the line containing the string and also the 6 lines preceding it

Split large string into smaller chunks in c#

Categories

Resources