C# Merging Two or more Text Files side by side

using (StreamWriter writer = File.CreateText(FinishedFile))
{
    int lineNum = 0;
    while (lineNum < FilesLineCount.Min())
    {
        for (int i = 0; i <= FilesToMerge.Count() - 1; i++)
        {
            if (i != FilesToMerge.Count() - 1)
            {
                var CurrentFile = File.ReadLines(FilesToMerge[i]).Skip(lineNum).Take(1);
                string CurrentLine = string.Join("", CurrentFile);
                writer.Write(CurrentLine + ",");
            }
            else
            {
                var CurrentFile = File.ReadLines(FilesToMerge[i]).Skip(lineNum).Take(1);
                string CurrentLine = string.Join("", CurrentFile);
                writer.Write(CurrentLine + "\n");
            }
        }
        lineNum++;
    }
}
The way I am currently doing this is just too slow. I am merging files that are each 50k+ lines long, with varying amounts of data per line.
For example:
File 1
1
2
3
4
File 2
4
3
2
1
I need these to merge into a third file:
File 3
1,4
2,3
3,2
4,1
P.S. The user can pick as many files as they want, from any locations.
Thanks for the help.

Your approach is slow because of the Skip and Take in the loops: Skip(lineNum) re-reads each file from the beginning for every output line, so the merge is O(n²) in the number of lines.
You could use a dictionary to collect the lines at each line index:
string[] allFileLocationsToMerge = { "filepath1", "filepath2", "..." };
var mergedLists = new Dictionary<int, List<string>>();
foreach (string file in allFileLocationsToMerge)
{
    string[] allLines = File.ReadAllLines(file);
    for (int lineIndex = 0; lineIndex < allLines.Length; lineIndex++)
    {
        bool indexKnown = mergedLists.TryGetValue(lineIndex, out List<string> allLinesAtIndex);
        if (!indexKnown)
            allLinesAtIndex = new List<string>();
        allLinesAtIndex.Add(allLines[lineIndex]);
        mergedLists[lineIndex] = allLinesAtIndex;
    }
}
IEnumerable<string> mergeLines = mergedLists.Values.Select(list => string.Join(",", list));
File.WriteAllLines("targetPath", mergeLines);
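One caveat: Dictionary<TKey, TValue> does not formally guarantee enumeration order, even though sequentially added int keys usually come back in insertion order. To be safe, you can order by key explicitly before joining:
IEnumerable<string> mergedLines = mergedLists
    .OrderBy(pair => pair.Key)
    .Select(pair => string.Join(",", pair.Value));
File.WriteAllLines("targetPath", mergedLines);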

Here's another approach: this implementation holds in memory only one line from each file at a time, reducing memory pressure significantly (if that is an issue).
public static void MergeFiles(string output, params string[] inputs)
{
    var files = inputs.Select(File.ReadLines).Select(iter => iter.GetEnumerator()).ToArray();
    StringBuilder line = new StringBuilder();
    bool any;
    using (var outFile = File.CreateText(output))
    {
        do
        {
            line.Clear();
            any = false;
            foreach (var iter in files)
            {
                // Skip files that have run out of lines.
                if (!iter.MoveNext())
                    continue;
                if (line.Length != 0)
                    line.Append(", ");
                line.Append(iter.Current);
                any = true;
            }
            if (any)
                outFile.WriteLine(line.ToString());
        }
        while (any);
    }
    foreach (var iter in files)
    {
        iter.Dispose();
    }
}
This also handles files of different lengths (when a file runs out of lines, its column simply drops out of the remaining rows).
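Usage is a one-liner; for example, merging the two files from the question into a third (paths here are placeholders):
MergeFiles("file3.txt", "file1.txt", "file2.txt");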

Related

Removing carriage return from specific line in c#

I have this type of data in a text file (csv) :
column1|column2|column3|column4|column5 (\r\n)
column1|column2|column3|column4|column5 (\r\n)
column1|column2 (\r\n)
column2 (\r\n)
column2|column3|column4|column5 (\r\n)
I would like to delete the \r\n at the end of lines 3 and 4, to get:
column1|column2|column3|column4|column5 (\r\n)
column1|column2|column3|column4|column5 (\r\n)
column1|column2/column2/column2|column3|column4|column5 (\r\n)
My idea is: if a row doesn't have 4 column separators ("|"), delete its CRLF, and repeat the operation until only correct rows remain.
This is my code :
String path = "test.csv";
// Read file
string[] readText = File.ReadAllLines(path);
// Empty the file
File.WriteAllText(path, String.Empty);
int x = 0;
int countheaders = 0;
int countlines;
using (StreamWriter writer = new StreamWriter(path))
{
    foreach (string s in readText)
    {
        if (x == 0)
        {
            countheaders = s.Where(c => c == '|').Count();
            x = 1;
        }
        countlines = 0;
        countlines = s.Where(d => d == '|').Count();
        if (countlines == countheaders)
        {
            writer.WriteLine(s);
        }
        else
        {
            string s2 = s;
            s2 = s2.ToString().TrimEnd('\r', '\n');
            writer.Write(s2);
        }
    }
}
The problem is that I'm reading the file in one pass, so the line break on line 4 is removed and lines 4 and 5 end up joined together...
You could probably do the following (can't test it now, but it should work):
IEnumerable<string> batchValuesIn(
    IEnumerable<string> source,
    string separator,
    int size)
{
    var counter = 0;
    var buffer = new StringBuilder();
    foreach (var line in source)
    {
        // Note: Split(string) requires .NET Core 2.0+; on .NET Framework use
        // line.Split(new[] { separator }, StringSplitOptions.None).
        var values = line.Split(separator);
        if (line.Length != 0)
        {
            foreach (var value in values)
            {
                buffer.Append(value);
                counter++;
                if (counter % size == 0)
                {
                    yield return buffer.ToString();
                    buffer.Clear();
                }
                else
                    buffer.Append(separator);
            }
        }
    }
    if (buffer.Length != 0)
        yield return buffer.ToString();
}
And you'd use it like:
var newLines = batchValuesIn(File.ReadLines(path), "|", 5);
The good thing about this solution is that you never load the entire original source into memory; you simply build the lines on the fly.
DISCLAIMER: this may behave weirdly with malformed input strings.
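For a concrete sense of the behavior, here is a quick sketch feeding it the question's rows (assuming the method above is in scope):
string[] sample =
{
    "column1|column2|column3|column4|column5",
    "column1|column2|column3|column4|column5",
    "column1|column2",
    "column2",
    "column2|column3|column4|column5"
};
foreach (var row in batchValuesIn(sample, "|", 5))
    Console.WriteLine(row);
// Values from the short rows are regrouped five at a time rather than
// glued back into one field - the kind of quirk the disclaimer alludes to.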

Split one string and put it in two arrays

I'd like to split several strings of a text file into two strings each (example: car;driver). I do not know how to put the first word in array1 and the second word in array2, so I tried checking each character for a semicolon, putting every single letter of word1 into array1 (and the same for the second word) so I can put the words back together later.
But I think what I've done is too complicated, and I am stuck now, lol.
Here I show a piece of my code:
private void BtnShow_Click(object sender, EventArgs e)
{
    LibPasswords.Items.Clear();
    string path = "passwords.txt";
    int counter = 0;
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    using (StreamReader reader = new StreamReader(fs))
    {
        while (reader.ReadLine() != null)
        {
            counter++;
        }
        //for (int i = 0; i < counter; i++)
        //{
        //    var Website = reader.ReadLine().Split(';').Select(x => new String[] { x });
        //    var Passwort = reader.ReadLine().Split(';').Select(y => new String[] { y });
        //    LibPasswords.Items.Add(String.Format(table, Website, Passwort));
        //}
        string[] firstWord = new string[counter];
        string[] lastWord = new string[counter];
        int i = 0;
        int index = 0;
        while (reader.Peek() >= 0)
        {
            string ch = reader.Read().ToString();
            if (ch != ";")
            {
                firstWord[i] = ch;
                i++;
            }
            else
            {
                index = 1;
            }
            while (reader.Peek() >= 0)
            {
                ??????????????????????????????????
            }
        }
    }
}
Sorry for my English, it's not my mother tongue.
As you don't know in advance how many lines there are, it is more convenient to use a List<string> instead of a string[]. A List will automatically increase its capacity as needed.
You can use the string.Split method to split the string at the ';' into an array. If the resulting array has the correct number of parts, you can add those parts to the Lists.
List<string> firstWord = new List<string>();
List<string> lastWord = new List<string>();
string fileName = @"C:\temp\SO61715409.txt";
foreach (string line in File.ReadLines(fileName))
{
    string[] parts = line.Split(new char[] { ';' });
    if (parts.Length == 2)
    {
        firstWord.Add(parts[0]);
        lastWord.Add(parts[1]);
    }
}
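If the aim is then to show the pairs in the question's LibPasswords list box, something like this could follow (the tab-separated display format is an assumption):
for (int i = 0; i < firstWord.Count; i++)
{
    // Hypothetical display format: "website<TAB>password".
    LibPasswords.Items.Add(firstWord[i] + "\t" + lastWord[i]);
}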

Parsing a huge text file (around 2GB) with custom delimiters

I have a huge text file around 2GB which I am trying to parse in C#.
The file has custom delimiters for rows and columns. I want to parse the file and extract the data and write to another file by inserting column header and replacing RowDelimiter by newline and ColumnDelimiter by tab so that I can get the data in tabular format.
sample data:
1'~'2'~'3#####11'~'12'~'13
RowDelimiter: #####
ColumnDelimiter: '~'
I keep on getting System.OutOfMemoryException on the following line
while ((line = rdr.ReadLine()) != null)
public void ParseFile(string inputfile, string outputfile, string header)
{
    using (StreamReader rdr = new StreamReader(inputfile))
    {
        string line;
        while ((line = rdr.ReadLine()) != null)
        {
            using (StreamWriter sw = new StreamWriter(outputfile))
            {
                // Write the header row
                sw.Write(header);
                // Parse the file
                string[] rows = line.Split(new string[] { ParserConstants.RowSeparator },
                    StringSplitOptions.None);
                foreach (string row in rows)
                {
                    string[] columns = row.Split(new string[] { ParserConstants.ColumnSeparator },
                        StringSplitOptions.None);
                    foreach (string column in columns)
                    {
                        sw.Write(column + "\\t");
                    }
                    sw.Write(ParserConstants.NewlineCharacter);
                    Console.WriteLine();
                }
            }
            Console.WriteLine("File Parsing completed");
        }
    }
}
As mentioned already in the comments, you won't be able to use ReadLine to handle this; you'll have to essentially process the data one byte - or character - at a time. The good news is that this is basically how ReadLine works anyway, so we're not losing a lot in this case.
Using a StreamReader we can read a series of characters from the source stream (in whatever encoding you need) into an array. Using that and a StringBuilder we can process the stream in chunks and check for separator sequences on the way.
Here's a method that will handle an arbitrary delimiter:
public static IEnumerable<string> ReadDelimitedRows(StreamReader reader, string delimiter)
{
    char[] delimChars = delimiter.ToArray();
    int matchCount = 0;
    char[] buffer = new char[512];
    int rc = 0;
    StringBuilder sb = new StringBuilder();
    while ((rc = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < rc; i++)
        {
            char c = buffer[i];
            if (c == delimChars[matchCount])
            {
                if (++matchCount >= delimChars.Length)
                {
                    // Found a full row delimiter.
                    yield return sb.ToString();
                    sb.Clear();
                    matchCount = 0;
                }
            }
            else
            {
                if (matchCount > 0)
                {
                    // Append the previously matched portion of the delimiter.
                    // Use Append(char[], int, int): passing an IEnumerable<char>
                    // binds to Append(object) and writes the type name instead.
                    sb.Append(delimChars, 0, matchCount);
                    matchCount = 0;
                }
                // The failing character may itself start a new match.
                if (c == delimChars[0])
                    matchCount = 1;
                else
                    sb.Append(c);
            }
        }
    }
    // Flush any half-matched delimiter left over at end of stream...
    if (matchCount > 0)
        sb.Append(delimChars, 0, matchCount);
    // ...and return the last row if found.
    if (sb.Length > 0)
        yield return sb.ToString();
}
This handles cases where a prefix of your row delimiter appears in the actual data. (Delimiters with a repeating internal prefix would need a full KMP-style matcher, but an all-same-character delimiter like ##### is safe here.)
In order to translate your file from the input format you describe to a simple tab-delimited format you could do something along these lines:
const string RowDelimiter = "#####";
const string ColumnDelimiter = "'~'";
using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(ouputFilename)))
{
foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
{
writer.Write(row.Replace(ColumnDelimiter, "\t"));
}
}
That should process fairly quickly without eating up too much memory. Some adjustments might be required for non-ASCII output.
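For instance, to force UTF-8 output regardless of the system default, the writer can take an explicit encoding (a minimal sketch; requires using System.Text):
using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(outputFilename), Encoding.UTF8))
{
    foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
        writer.WriteLine(row.Replace(ColumnDelimiter, "\t"));
}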
Read the data into a buffer and then do your parsing.
using (StreamReader rdr = new StreamReader(inputfile))
using (StreamWriter sw = new StreamWriter(outputfile))
{
    char[] buffer = new char[256];
    int read;
    // Write the header row
    sw.Write(header);
    string remainder = string.Empty;
    while ((read = rdr.Read(buffer, 0, 256)) > 0)
    {
        string bufferData = new string(buffer, 0, read);
        // Parse this chunk of the file.
        string[] rows = bufferData.Split(
            new string[] { ParserConstants.RowSeparator },
            StringSplitOptions.None);
        // A row may straddle the chunk boundary, so prepend the leftover from
        // the previous chunk and hold back the (possibly incomplete) last row.
        rows[0] = remainder + rows[0];
        int completeRows = rows.Length - 1;
        remainder = rows.Last();
        foreach (string row in rows.Take(completeRows))
        {
            string[] columns = row.Split(
                new string[] { ParserConstants.ColumnSeparator },
                StringSplitOptions.None);
            foreach (string column in columns)
            {
                sw.Write(column + "\t"); // "\t", not "\\t", to emit a real tab
            }
            sw.Write(ParserConstants.NewlineCharacter);
        }
    }
    if (remainder.Length > 0)
    {
        string[] columns = remainder.Split(
            new string[] { ParserConstants.ColumnSeparator },
            StringSplitOptions.None);
        foreach (string column in columns)
        {
            sw.Write(column + "\t");
        }
        sw.Write(ParserConstants.NewlineCharacter);
    }
    Console.WriteLine("File Parsing completed");
}
The problem you have is that you are eagerly consuming the whole file and placing it in memory. Attempting to split a 2GB file in memory is going to be problematic, as you now know.
Solution? Consume one line at a time. Because your file doesn't have a standard line separator, you'll have to implement a custom parser that does this for you. The following code does just that (or I think it does, I haven't tested it). It's probably very improvable from a performance perspective, but it should at least get you started in the right direction (C# 7 syntax):
public static IEnumerable<string> GetRows(string path, string rowSeparator)
{
    // Reads rowSeparator.Length chars and checks for a full match. 'count'
    // reports how many chars were actually read, so a short read near EOF
    // doesn't leak stale buffer contents into the row.
    bool tryParseSeparator(StreamReader reader, char[] buffer, out int count)
    {
        count = reader.Read(buffer, 0, buffer.Length);
        if (count != buffer.Length)
            return false;
        return Enumerable.SequenceEqual(buffer, rowSeparator);
    }

    using (var reader = new StreamReader(path))
    {
        int peeked;
        var rowBuffer = new StringBuilder();
        var separatorBuffer = new char[rowSeparator.Length];
        while ((peeked = reader.Peek()) > -1)
        {
            if ((char)peeked == rowSeparator[0])
            {
                if (tryParseSeparator(reader, separatorBuffer, out int consumed))
                {
                    yield return rowBuffer.ToString();
                    rowBuffer.Clear();
                }
                else
                {
                    // Not a separator: keep only what was actually read.
                    // (A separator starting inside this consumed text would
                    // still be missed - the "improvable" caveat applies.)
                    rowBuffer.Append(separatorBuffer, 0, consumed);
                }
            }
            else
            {
                rowBuffer.Append((char)reader.Read());
            }
        }
        if (rowBuffer.Length > 0)
            yield return rowBuffer.ToString();
    }
}
Now you have a lazy enumeration of rows from your file, and you can process it as you intended to:
foreach (var row in GetRows(inputFile, ParserConstants.RowSeparator))
{
    var columns = row.Split(new string[] { ParserConstants.ColumnSeparator },
        StringSplitOptions.None);
    //etc.
}
I think this should do the trick...
public void ParseFile(string inputfile, string outputfile, string header)
{
    int blockSize = 1024;
    using (var file = File.OpenRead(inputfile))
    {
        using (StreamWriter sw = new StreamWriter(outputfile))
        {
            int bytes = 0;
            int parsedBytes = 0;
            var buffer = new byte[blockSize];
            string lastRow = string.Empty;
            // Write the header row (the original omitted this despite taking the parameter).
            sw.Write(header);
            while ((bytes = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Because the buffer edge could split a RowDelimiter, we need to keep the
                // last row from the prior split operation. Append the new buffer to the
                // last row from the prior loop iteration.
                // (Caution: decoding a raw block boundary with Encoding.Default.GetString
                // can also split a multi-byte character.)
                lastRow += Encoding.Default.GetString(buffer, 0, bytes);
                // Parse the file
                string[] rows = lastRow.Split(new string[] { ParserConstants.RowSeparator }, StringSplitOptions.None);
                // We cannot process the last row in this set because it may not be a complete
                // row, and tokens could be clipped.
                if (rows.Count() > 1)
                {
                    for (int i = 0; i < rows.Count() - 1; i++)
                    {
                        sw.Write(new Regex(ParserConstants.ColumnSeparator).Replace(rows[i], "\t") + ParserConstants.NewlineCharacter);
                    }
                }
                lastRow = rows[rows.Count() - 1];
                parsedBytes += bytes;
                // The count is approximate because the lastRow hasn't been parsed yet.
                Console.WriteLine($"Parsed {parsedBytes:N0} bytes");
            }
            // Now that there are no more bytes to read, we know that the lastRow is complete.
            sw.Write(new Regex(ParserConstants.ColumnSeparator).Replace(lastRow, "\t"));
        }
    }
    Console.WriteLine("File Parsing completed.");
}
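A call would look something like this (file names and header text are placeholders):
ParseFile("input.dat", "output.tsv", "Col1\tCol2\tCol3" + Environment.NewLine);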
Late to the party here, but in case anyone else wants an easy way to load such a large CSV file with custom delimiters, Cinchoo ETL does the job for you.
using (var parser = new ChoCSVReader("CustomNewLine.csv")
    .WithDelimiter("~")
    .WithEOLDelimiter("#####")
    )
{
    foreach (dynamic x in parser)
        Console.WriteLine(x.DumpAsJson());
}
Disclaimer: I'm the author of this library.

How to search a file for a string, display the line containing the string and also the 6 lines preceding it

I am trying to search through a text file for a string; once I have found this string, I need to display that line and also the 6 preceding lines (which contain the details about the error message). I have been searching for similar code and have found the following, but it doesn't meet my requirements. Just wondering if it's possible to do this.
Thanks,
John.
private static void Main(string[] args)
{
    string cacheline = "";
    string line;
    System.IO.StreamReader file = new
        System.IO.StreamReader(@"D:\Temp\AccessOutlook.txt");
    List<string> lines = new List<string>();
    while ((line = file.ReadLine()) != null)
    {
        if (line.Contains("errors"))
        {
            lines.Add(cacheline);
        }
        cacheline = line;
    }
    file.Close();
    foreach (var l in lines)
    {
        Console.WriteLine(l);
    }
}
This is probably what you want:
static void Main(string[] args)
{
    Queue<string> lines = new Queue<string>();
    using (var reader = new StreamReader(args[0]))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains("error"))
            {
                Console.WriteLine("----- ERROR -----");
                foreach (var errLine in lines)
                    Console.WriteLine(errLine);
                Console.WriteLine(line);
                Console.WriteLine("-----------------");
            }
            lines.Enqueue(line);
            // Keep only the 6 most recent lines in the queue.
            while (lines.Count > 6)
                lines.Dequeue();
        }
    }
}
You can keep caching the lines until you find the line you are looking for:
using (var file = new StreamReader(@"D:\Temp\AccessOutlook.txt"))
{
    List<string> lines = new List<string>();
    string line; // this declaration was missing in the original snippet
    while ((line = file.ReadLine()) != null)
    {
        if (!line.Contains(myString))
        {
            lines.Add(line);
        }
        else
        {
            Console.WriteLine(string.Join(Environment.NewLine, lines.Concat(new[] { line })));
        }
        if (lines.Count > 6) lines.RemoveAt(0);
    }
}
string filename = "filename"; // Put your own filename here.
string target = "target";     // Put your target string here.
int numLinesToShow = 7;       // The matching line plus the 6 before it.
var lines = File.ReadAllLines(filename);
int index = Array.FindIndex(lines, element => element.Contains(target));
if (index >= 0)
{
    int start = Math.Max(0, index - numLinesToShow + 1);
    var result = lines.Skip(start).Take(numLinesToShow).ToList();
    // Use result.
}
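Where the snippet says // Use result., printing the captured window is as simple as:
result.ForEach(Console.WriteLine);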
The code below will open the file, search for the line you want, and then write the 6 preceding lines to the Console.
var lines = File.ReadAllLines(filePath);
int lineIndex;
for (lineIndex = 0; lineIndex < lines.Length - 1; lineIndex++)
{
    if (lines[lineIndex] == textToFind)
    {
        break;
    }
}
var startLine = Math.Max(0, lineIndex - 6);
for (int i = startLine; i < lineIndex; i++)
{
    Console.WriteLine(lines[i]);
}

Split large string into smaller chunks in c#

I have a large string separated by newline characters; it contains 100 lines. I want to split these lines into smaller chunks, say chunks of 20, also based on the newline character.
Let's say the string variable is like this:
Line1
This is line2
Line3 is here
I am Line4
Now I want to split this large string variable into small chunks of 2 lines. The result should be 2 strings:
Line1
This is line2
and
Line3 is here
I am Line4
Using the Split function, I am not getting the expected results. Please help me achieve this.
Thanks in advance,
Vijay
The simple approach (Split on Environment.NewLine, then loop and append):
public static List<string> GetStringSegments(string originalString, int linesPerSegment)
{
    List<string> segments = new List<string>();
    string[] allLines = originalString.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
    StringBuilder sb = new StringBuilder();
    int linesProcessed = 0;
    for (int i = 0; i < allLines.Length; i++)
    {
        sb.AppendLine(allLines[i]);
        linesProcessed++;
        if (linesProcessed == linesPerSegment
            || i == allLines.Length - 1)
        {
            segments.Add(sb.ToString());
            sb.Clear();
            linesProcessed = 0; // typo fixed: was "inesProcessed"
        }
    }
    return segments;
}
The above approach is slightly inefficient, since it requires splitting the string first into individual lines, which creates unnecessary strings: a string of 1000 lines will create an array of 1000 strings. We can improve this if we just scan the string and search for '\n':
public static List<string> GetStringSegments(string original, int linesPerSegment)
{
    List<string> segments = new List<string>();
    int startIndex = 0;
    int newLinesEncountered = 0;
    for (int i = 0; i < original.Length; i++)
    {
        if (original[i] == '\n')
        {
            newLinesEncountered++;
        }
        if (newLinesEncountered == linesPerSegment
            || i == original.Length - 1)
        {
            segments.Add(original.Substring(startIndex, i - startIndex + 1));
            startIndex = i + 1;
            newLinesEncountered = 0;
        }
    }
    return segments;
}
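Either version can be exercised against the question's sample (note that the two overloads share a signature, so keep only one in a given class; the input literal here is an assumption about where the line breaks fall):
string nl = Environment.NewLine;
string input = "Line1" + nl + "This is line2" + nl + "Line3 is here" + nl + "I am Line4";
foreach (string segment in GetStringSegments(input, 2))
    Console.Write(segment);
// Two segments: lines 1-2 and lines 3-4.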
You can use something like the batch operator from http://www.make-awesome.com/2010/08/batch-or-partition-a-collection-with-linq
string s = "[YOUR DATA]";
var lines = s.Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
foreach (var batch in lines.Batch(20))
{
    foreach (var batchLine in batch) // "var" was missing in the original
    {
        Console.WriteLine(batchLine); // was Console.Writeline
    }
}
static class LinqEx
{
    // from http://www.make-awesome.com/2010/08/batch-or-partition-a-collection-with-linq
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection,
        int batchSize)
    {
        List<T> nextbatch = new List<T>(batchSize);
        foreach (T item in collection)
        {
            nextbatch.Add(item);
            if (nextbatch.Count == batchSize)
            {
                yield return nextbatch;
                nextbatch = new List<T>(batchSize);
            }
        }
        if (nextbatch.Count > 0)
            yield return nextbatch;
    }
}
As several people mentioned, using string.Split will split the whole string into memory, which might be an allocation-heavy operation. This is why we have the TextReader class and its descendants, which should provide better memory performance, and might also be clearer, logically:
var newStringCollection = new List<StringBuilder>();
int lineCounter = 0;
using (var reader = new StringReader(myString))
{
    StringWriter newStringWriter = null;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (string.IsNullOrEmpty(line))
            continue; // skip blank lines, as the original intended
        // Start a new builder (and writer wrapper) every 20 lines.
        if (lineCounter % 20 == 0)
        {
            var newString = new StringBuilder();
            newStringWriter = new StringWriter(newString);
            newStringCollection.Add(newString);
        }
        newStringWriter.WriteLine(line);
        lineCounter++;
    }
}
We're using the StringReader to read our big string one line at a time, and the corresponding StringWriter writes those lines to the new string, one line at a time. After every 20 lines, we start a new StringBuilder (and the appropriate StringWriter wrapper). Note that the original snippet declared its variables inside the loop (so the writer was null on most iterations and the line was out of scope in the loop condition); the version above hoists them out.
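To see the result, the collected builders can simply be printed (assuming the newStringCollection list from the fixed-up snippet above):
foreach (StringBuilder segment in newStringCollection)
    Console.WriteLine(segment.ToString());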
Split the string on newlines, then merge the desired number of lines back together as you consume them:
string s = "Line1\nThis is line2 \nLine3 is here\nI am Line4";
string[] str = s.Split('\n'); // was s.split - C# is case-sensitive
List<String> str1 = new List<String>();
for (int i = 0; i < str.Length; i += 2)
{
    string ss = str[i];
    if (i + 1 < str.Length)
        ss += '\n' + str[i + 1];
    str1.Add(ss);
}
str = str1.ToArray();
The if condition is checked inside the loop because the length of str may be odd.
var strArray = myLongString.Split('\n').ToList();
var skip = 0;
var take = 20;
var chunk = strArray.Skip(skip).Take(take).ToList();
while (chunk.Count > 0)
{
    foreach (var line in chunk)
    {
        // use line string
    }
    skip += take; // advance by a whole chunk (the original's skip++ moved one line at a time)
    chunk = strArray.Skip(skip).Take(take).ToList();
}
