Consolidating 300+ files into 5-8, OutOfMemory exception - c#

I have 369 files that need to be formatted and consolidated into 5-8 files before being submitted to the server. I can't submit the 369 files because that would overwhelm the metadata tables in our database (they can handle it, but it'd be 369 rows for what was essentially one file, which would make querying and utilizing those tables a nightmare) and I can't handle it as one file because the total of 3.6 GB is too much for SSIS to handle on our servers.
I wrote the following script to fix the issue:
static void PrepPAIDCLAIMSFiles()
{
const string HEADER = "some long header text, trimmed for SO question";
const string FOOTER = "some long footer text, trimmed for SO question";
//path is defined as a static member of the containing class
string[] files = Directory.GetFiles(path + @"split\");
int splitFileCount = 0, finalFileCount = 0;
List<string> newFileContents = new List<string>();
foreach(string file in files)
{
try
{
var contents = File.ReadAllLines(file).ToList();
var fs = File.OpenRead(file);
if (splitFileCount == 0)
{
//Grab everything except the header
contents = contents.GetRange(1, contents.Count - 1);
}
else if (splitFileCount == files.Length - 1)
{
//Grab everything except the footer
contents = contents.GetRange(0, contents.Count - 1);
}
if (!Directory.Exists(path + @"split\formatted"))
{
Directory.CreateDirectory(path + @"split\formatted");
}
newFileContents.AddRange(contents);
if (splitFileCount % 50 == 0 || splitFileCount >= files.Length)
{
Console.WriteLine($"{splitFileCount} {finalFileCount}");
var sb = new StringBuilder(HEADER);
foreach (var row in newFileContents)
{
sb.Append(row);
}
sb.Append(FOOTER);
newFileContents = new List<string>();
GC.Collect();
string fileName = file.Split('\\').Last();
string baseFileName = fileName.Split('.')[0];
DateTime currentTime = DateTime.Now;
baseFileName += "." + COMPANY_NAME_SetHHMMSS(currentTime, finalFileCount) + ".TXT";
File.WriteAllText(path + @"split\formatted\" + baseFileName, sb.ToString());
finalFileCount += 1;
}
splitFileCount += 1;
}
catch(OutOfMemoryException OOM)
{
Console.WriteLine(file);
Console.WriteLine(OOM.Message);
break;
}
}
}
The way this works is that it reads each split file and collects its rows; every time it gets to a multiple of 50 files, it writes the accumulated rows out to a new file via the string builder and starts over. The COMPANY_NAME_SetHHMMSS() method ensures each output file has a unique name, so it's not writing to the same file over and over (and I can verify this from the output: it writes two files before exploding).
It breaks when it gets to the 81st file with a System.OutOfMemoryException on var contents = File.ReadAllLines(file).ToList();. There's nothing special about the 81st file; it's exactly the same size as all the others (~10 MB). The files this function delivers are about ~500 MB. It also has no trouble reading and processing all the files up to, but not including, the 81st, so I don't think it's running out of memory reading that file, but rather running out of memory doing something else, and the 81st is just where memory runs out.
The newFileContents list should be getting emptied by overwriting it with a new list, right? It shouldn't be growing with every iteration of this function. GC.Collect() was sort of a last-ditch effort.
The original file that the 369 splits come from has been a headache for a few days now, causing UltraEdit to crash, SSIS to crash, C# to crash, etc. Splitting it via 7zip seemed to be the only option that worked, and splitting it to 369 files seemed to be the only option 7zip had that didn't also reformat or somehow compress the file in an undesirable way.
Is there something that I'm missing? Something in my code that keeps growing in memory? I know File.ReadAllLines() opens and closes the file, so it should be disposed after it's called, right? newFileContents gets overwritten every 50th file, as does the string builder. What else could I be doing?

One thing that jumps out at me is that you are opening a FileStream, never using it, and never disposing of it. With 300+ file streams this may be contributing to your issue.
var fs = File.OpenRead(file);
Another thing that perked my ear is that you said 3.6GB. Make sure you are compiling for 64 bit architecture.
Finally, stuffing gigabytes into a StringBuilder may cause you grief. Maybe create a staging file instead: every time you open a new input file, write its contents out to the staging file and close the input, rather than depending on stuffing everything into memory.
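A minimal sketch of that staging-file idea, assuming the path static member and the split folder from the question; the staging.txt name is made up for illustration, and the chunk-of-50 output logic from the question could be layered on top of this:
// Hedged sketch (not the poster's final code): append each input file to a single
// staging file as soon as it is read, so the app never holds more than one line
// of text in memory at a time. The first and last split files already carry the
// original header and footer, so nothing extra is added here.
string stagingFile = Path.Combine(path, "staging.txt"); // "staging.txt" is a made-up name
foreach (string inputFile in Directory.GetFiles(Path.Combine(path, "split")))
{
    // File.ReadLines enumerates lazily, so only one line is buffered at a time
    File.AppendAllLines(stagingFile, File.ReadLines(inputFile));
}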

You should just be looping over the rows in your source files and appending them to a new file. You're holding the contents of up to 50 10MB files in memory at once, plus anything else you're doing. This may be because you're compiling for x86 instead of x64, but there isn't any reason this should use anywhere near that memory. Something like the following:
var files = Directory.GetFiles(System.IO.Path.Combine(path, "split")).ToList();
//since you were skipping the first and last file
files.Remove(files.FirstOrDefault());
files.Remove(files.LastOrDefault());
string combined_file_path = "<whatever you want to call this>";
System.IO.StreamWriter combined_file_writer = null;
try
{
foreach(var file in files)
{
//if multiple of 50, write footer, dispose of stream, and make a new stream
if((files.IndexOf(file)) % 50 == 0)
{
combined_file_writer?.WriteLine(FOOTER);
combined_file_writer?.Dispose();
combined_file_writer = new System.IO.StreamWriter(combined_file_path + "_1"); //increment the name somehow
combined_file_writer.WriteLine(HEADER);
}
using(var file_reader = new System.IO.StreamReader(file))
{
while(!file_reader.EndOfStream)
{
combined_file_writer.WriteLine(file_reader.ReadLine());
}
}
}
//finish out the last file
combined_file_writer?.WriteLine(FOOTER);
}
finally
{
//dispose of last file
combined_file_writer?.Dispose();
}

Related

c# - splitting a large list into smaller sublists

Fairly new to C# - sitting here practicing. I have a single file with 10 million passwords in it that I downloaded to practice with.
I want to break the file down into lists of 99: stop at 99, do something, then start where it left off and repeat the "do something" with the next 99 until it reaches the last item in the file.
I can do the counting part well; it's stopping at 99 and continuing where I left off that I'm having trouble with. Anything I find online is not close to what I am trying to do, and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and will respond however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;
namespace lists01
{
class Program
{
static void Main(string[] args)
{
int count = 0;
var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";
{
var content = File.ReadAllLines(f1);
foreach (var v2 in content)
{
count++;
Console.WriteLine(v2 + "\t" + count);
}
}
}
}
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you have in your code. The trade-off is that you load the entire file into memory at once and then operate on it.
Using ReadAllLines in concert with LINQ's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
using (Stream stream = File.Open(fileName, FileMode.Open))
{
//Jump back to the previous position
stream.Seek(seekPosition, SeekOrigin.Begin);
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream && lines.Count < maxLines)
{
line = reader.ReadLine();
seekPosition += (line.Length + 2); //Tracks how much data has been read (assumes single-byte characters and CRLF line endings).
lines.Add(line);
}
fileLoaded = reader.EndOfStream;
}
}
DoSomethingWithLines(lines);
lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file. But then I used StreamReader because it has the ReadLine() method.
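One more option, offered as a sketch rather than part of the answer above: File.ReadLines enumerates the file lazily, so you can batch lines into groups of 99 without loading the whole file or tracking seek positions yourself. Unlike the seek-based version, it keeps the file open for the duration of the loop. fileName and DoSomethingWithLines are the same names used above.
// Sketch: lazily batch a large file into groups of 99 lines with File.ReadLines.
int maxLines = 99;
var batch = new List<string>(maxLines);
foreach (string line in File.ReadLines(fileName))
{
    batch.Add(line);
    if (batch.Count == maxLines)
    {
        DoSomethingWithLines(batch);
        batch.Clear();
    }
}
if (batch.Count > 0)
{
    DoSomethingWithLines(batch); // leftover lines that didn't fill a final full batch
}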

C# WebClient multiple execution

I have a problem with WebClient.
Simply put, I check for a missing file in one folder. If I don't have the file, I use WebClient to navigate to a web page and send a value that executes a query and stores the value in the database.
My problem:
I have a List of 1500 elements, for example.
But after the first element, the for loop stops (maybe) or doesn't navigate again.
My code:
List<string> fileneed = new List<string>();
In the Thread
//Distinct
fileneed = fileneed.Distinct().ToList<string>();
for (int i = 0; i < fileneed.Count; i++)
{
if (fileneed[i].Contains("."))
{
w = new WebClient();
w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]);
fileneed.RemoveAt(i);
}
}
After the thread executes, I go to phpMyAdmin and I see only one file.
The other files in the list don't show up; for some strange reason, my code seems to execute the loop only once.
There are a few things wrong with the example code:
1st: Because it is removing items from the fileneed list at the same time as it is reading from the list, it is going to skip files in the list. This is because when you remove an item, the index of all the following items is made one smaller. We can get around this by iterating over the list from the end to the start.
2nd: Though the code is reading a file from the server, it is not doing anything with the file to write it out to disk. As such the file will simply be lost. This can be fixed by opening a file stream and copying to it.
3rd: WebClient and the Stream returned from OpenRead need to be Disposed. Otherwise the resources they use will not be cleaned up and your program will become a memory/connection hog. This is fixed by using the using statement.
With these three fixes the resulting code looks like this:
fileneed = fileneed.Distinct().ToList<string>();
for (int i = fileneed.Count - 1; i >= 0; i--)
{
if (fileneed[i].Contains("."))
{
using (var w = new WebClient())
using (var webFile = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]))
using (var diskFile = File.OpenWrite(fileneed[i]))
{
webFile.CopyTo(diskFile);
}
fileneed.RemoveAt(i);
}
}
You are opening a 'connection' to that file, but you aren't reading it or storing it anywhere. You need to create a new file, and read from the remote stream and write to the local file stream:
using (var webStream = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]))
using (var myFile = File.OpenWrite(fileneed[i]))
{
webStream.CopyTo(myFile);
}
See this page for details
http://mywebsite.org/collab/files.php
I don't know exactly what this page does, but you should remove this line:
fileneed.RemoveAt(i);
On every iteration you are removing an element, so Count changes. If you want to remove processed items, you could store them in another list and use Except() against the original list, as sketched below.
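A sketch of that suggestion, using the same names as the code above: collect processed file names in a second list and subtract them from fileneed afterwards with Except().
// Sketch: track processed entries in a separate list instead of mutating
// fileneed while iterating over it, then subtract them at the end.
var processed = new List<string>();
foreach (string file in fileneed)
{
    if (file.Contains("."))
    {
        using (var w = new WebClient())
        using (var webFile = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + file))
        using (var diskFile = File.OpenWrite(file))
        {
            webFile.CopyTo(diskFile);
        }
        processed.Add(file);
    }
}
// remove the processed items from the original list once the loop is done
fileneed = fileneed.Except(processed).ToList();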

Best way to read multiple very large files

I need help figuring out the fastest way to read through about 80 files with over 500,000 lines in each file, and write to one master file with each input file's line as a column in the master. The master file must be written to a text editor like notepad and not a Microsoft product because they can't handle the number of lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array to hold each file's contents, and then, once all lines in all files have been read, write the master file. The issue with this solution is that Windows throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values, write them to the file, and repeat for each line in all files. The issue with this solution is that it is very, very slow.
Does anybody have a better solution for reading so many large files in a fast way?
The best way is going to be to open the input files with a StreamReader for each one and a StreamWriter for the output file. Then you loop through each reader and read a single line and write it to the master file. This way you are only loading one line at a time so there should be minimal memory pressure. I was able to copy 80 ~500,000 line files in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;
using System.Linq;
class MainClass
{
static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();
public static void Main(string[] args)
{
var stopwatch = Stopwatch.StartNew();
List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();
try
{
using (StreamWriter writer = new StreamWriter("master.txt"))
{
string line = null;
do
{
for(int i = 0; i < readers.Count; i++)
{
if ((line = readers[i].ReadLine()) != null)
{
writer.Write(line);
}
if (i < readers.Count - 1)
writer.Write(",");
}
writer.WriteLine();
} while (line != null);
}
}
finally
{
foreach(var reader in readers)
{
reader.Close();
}
}
Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
}
}
I've assumed that all the input files have the same number of lines, but you should add the logic to keep reading as long as at least one file has given you data.
Memory-mapped files seem like what would suit you here: something that doesn't put pressure on your app's memory while maintaining good performance in I/O operations.
Complete documentation here: Memory-Mapped Files
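A minimal sketch of what that could look like (my own illustration, not taken from the linked documentation); whether it actually outperforms a plain StreamReader for sequential text reads is workload-dependent:
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: enumerate the lines of one input file through a memory-mapped view,
// letting the OS page the data in on demand rather than loading the file whole.
static IEnumerable<string> ReadLinesMapped(string path)
{
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var view = mmf.CreateViewStream())
    using (var reader = new StreamReader(view))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}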
If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];
Parallel.Invoke(
() =>
{
ReadMyFile(file1,file1lines);
},
() =>
{
ReadMyFile(file2,file2lines);
},
() =>
{
ReadMyFile(file3,file3lines);
}
);
Each ReadMyFile method should just use the following sample code which, according to these benchmarks, is the fastest way to read a text file:
int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
while ((file1lines[x] = sr.ReadLine()) != null)
{
x += 1;
}
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents of each string[] to the output as you desire.
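For that write step, a sketch assuming the pre-allocated arrays above ended up the same length ("master.txt" is just an assumed output name):
// Sketch: zip the per-file line arrays into comma-separated rows of the master file.
using (var writer = new StreamWriter("master.txt"))
{
    for (int row = 0; row < file1lines.Length; row++)
    {
        writer.WriteLine(string.Join(",", file1lines[row], file2lines[row], file3lines[row]));
    }
}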
Have an array of open file handles. Loop through this array and read a line from each file into a string array. Then combine this array into the master file, append a newline at the end.
This differs from your second approach in that it is single-threaded and doesn't read a specific line but always the next one.
Of course you need to be error-proof if there are files with fewer lines than others; a sketch of that follows.
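A sketch of handling that, reusing the readers list and writer from the StreamReader-based example earlier: keep writing rows while at least one reader still returns data, and emit an empty field for files that have finished.
// Sketch: keep building master rows while at least one input file still has lines;
// files that have run out contribute an empty column instead of stopping the loop.
bool anyData = true;
while (anyData)
{
    anyData = false;
    var fields = new List<string>();
    foreach (var reader in readers)
    {
        string line = reader.ReadLine();
        if (line != null)
        {
            anyData = true;
        }
        fields.Add(line ?? string.Empty); // blank field for an exhausted file
    }
    if (anyData)
    {
        writer.WriteLine(string.Join(",", fields));
    }
}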

Append to file failure when executable not in same folder as data files

Problem is now solved. Mistake by me that I hadn't seen before.
I am pretty new to coding in general and am very new to C# so I am probably missing something simple. I wrote a program to pull data from a login website and save that data to files on the local hard drive. The data is power and energy data for solar modules and each module has its own file. On my main workstation I am running Windows Vista and the program works just fine. When I run the program on the machine running Server 2003, instead of the new data being appended to the files, it just overwrites the data originally in the file.
The data I am downloading is csv format text over a span of 7 days at a time. I run the program once a day to pull the new day's data and append it to the local file. Every time I run the program, the local file is a copy of the newly downloaded data with none of the old data. Since the data on the web site is only updated once a day, I have been testing by removing the last day's data in the local file and/or the first day's data in the local file. Any time I change the file and run the program, the file contains the downloaded data and nothing else.
I just tried something new to test why it wasn't working and think I have found the source of the error. When I ran on my local machine, the "filePath" variable was set to "". On the server and now on my local machine I have changed the "filePath" to @"C:\Solar Yard Data\" and on both machines it catches the file not found exception and creates a new file in the same directory which overwrites the original. Anyone have an idea as to why this happens?
The code below is the section that downloads each data set and appends any new data to the local file.
int i = 0;
string filePath = "C:/Solar Yard Data/";
string[] filenamesPower = new string[]
{
"inverter121201321745_power",
"inverter121201325108_power",
"inverter121201326383_power",
"inverter121201326218_power",
"inverter121201323111_power",
"inverter121201324916_power",
"inverter121201326328_power",
"inverter121201326031_power",
"inverter121201325003_power",
"inverter121201326714_power",
"inverter121201326351_power",
"inverter121201323205_power",
"inverter121201325349_power",
"inverter121201324856_power",
"inverter121201325047_power",
"inverter121201324954_power",
};
// download and save every module's power data
foreach (string url in modulesPower)
{
// create web request and download data
HttpWebRequest req_csv = (HttpWebRequest)HttpWebRequest.Create(String.Format(url, auth_token));
req_csv.CookieContainer = cookie_container;
HttpWebResponse res_csv = (HttpWebResponse)req_csv.GetResponse();
// save the data to files
using (StreamReader sr = new StreamReader(res_csv.GetResponseStream()))
{
string response = sr.ReadToEnd();
string fileName = filenamesPower[i] + ".csv";
// save the new data to file
try
{
int startIndex = 0; // start index for substring to append to file
int searchResultIndex = 0; // index returned when searching downloaded data for last entry of data on file
string lastEntry; // will hold the last entry in the current data
//open existing file and find last entry
using (StreamReader sr2 = new StreamReader(fileName))
{
//get last line of existing data
string fileContents = sr2.ReadToEnd();
string nl = System.Environment.NewLine; // newline string
int nllen = nl.Length; // length of a newline
if (fileContents.LastIndexOf(nl) == fileContents.Length - nllen)
{
lastEntry = fileContents.Substring(0, fileContents.Length - nllen).Substring(fileContents.Substring(0, fileContents.Length - nllen).LastIndexOf(nl) + nllen);
}
else
{
lastEntry = fileContents.Substring(fileContents.LastIndexOf(nl) + 2);
}
// search the new data for the last existing line
searchResultIndex = response.LastIndexOf(lastEntry);
}
// if the downloaded data contains the last record on file, append the new data
if (searchResultIndex != -1)
{
startIndex = searchResultIndex + lastEntry.Length;
File.AppendAllText(filePath + fileName, response.Substring(startIndex+1));
}
// else append all the data
else
{
Console.WriteLine("The last entry of the existing data was not found\nin the downloaded data. Appending all data.");
File.AppendAllText(filePath + fileName, response.Substring(109)); // the 109 index removes the file header from the new data
}
}
// if there is no file for this module, create the first one
catch (FileNotFoundException e)
{
// write data to file
Console.WriteLine("File does not exist, creating new data file.");
File.WriteAllText(filePath + fileName, response);
//Debug.WriteLine(response);
}
}
Console.WriteLine("Power file " + (i + 1) + " finished.");
//Debug.WriteLine("File " + (i + 1) + " finished.");
i++;
}
Console.WriteLine("\nPower data finished!\n");
A couple of suggestions which I think will probably resolve the issue.
First, change your filePath string:
string filePath = @"C:\Solar Yard Data\";
create a string with the full path:
String fullFilePath = filePath + fileName;
then check to see if it exists and create it if it doesn't:
if (!File.Exists(fullFilePath))
File.Create(fullFilePath).Dispose(); // dispose the returned FileStream so the new file isn't left open and locked
put the full path to the file in your StreamReader:
using (StreamReader sr2 = new StreamReader(fullFilePath))
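Putting those suggestions together, a rough sketch of the corrected flow, using fileName and response from the question; newData stands in for whatever substring of the response you decide to append and is purely a placeholder:
// Sketch: build the full path once and use it for every file operation, so the
// read, the existence check, and the append all point at the same file.
string filePath = @"C:\Solar Yard Data\";
string fullFilePath = Path.Combine(filePath, fileName);

if (!File.Exists(fullFilePath))
{
    // no local data yet: write the whole downloaded response, header included
    File.WriteAllText(fullFilePath, response);
}
else
{
    string existing;
    using (var sr2 = new StreamReader(fullFilePath))
    {
        existing = sr2.ReadToEnd();
    }
    // ...locate the last entry in 'existing' as in the question's code, then:
    File.AppendAllText(fullFilePath, newData); // newData is a hypothetical placeholder
}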

How can I split a big text file into smaller file?

I have a big file with some text, and I want to split it into smaller files.
In this example, what I do:
I open a text file, let's say with 10,000 lines in it.
I set a number, package = 300 here, which is the small-file limit; once a small file has 300 lines in it, close it and open a new file for writing, for example (package2).
Same as step 2.
You already know
Here is the code from my function that should do that. The idea (what I don't know) is how to close, and open, a new file once it has reached the 300 limit (in our case here).
Let me show you what I'm talking about:
int nr = 1;
package=textBox1.Text;//how many lines/file (small file)
string packnr = nr.ToString();
string filer=package+"Pack-"+packnr+"+_"+date2+".txt";//name of small file/s
int packtester = 0;
int package= 300;
StreamReader freader = new StreamReader("bigfile.txt");
StreamWriter pak = new StreamWriter(filer);
while ((line = freader.ReadLine()) != null)
{
if (packtester < package)
{
pak.WriteLine(line);//writing line to small file
packtester++;//increasing the lines of small file
}
else if (packtester == package)//in this example, checking if the lines
//written, got to 300
{
packtester = 0;
pak.Close();//closing the file
nr++;//nr++ -> just for file name to be Pack-2;
packnr = nr.ToString();
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
}
}
I get these errors:
Cannot use local variable 'pak' before it is declared
A local variable named 'pak' cannot be declared in this scope because it would give a different meaning to 'pak', which is already used in a 'parent or current' scope to denote something else
Try this:
public void SplitFile()
{
int nr = 1;
int package = 300;
DateTime date2 = DateTime.Now;
int packtester = 0;
using (var freader = new StreamReader("bigfile.txt"))
{
StreamWriter pak = null;
try
{
pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
string line;
while ((line = freader.ReadLine()) != null)
{
if (packtester < package)
{
pak.WriteLine(line); //writing line to small file
packtester++; //increasing the lines of small file
}
else
{
pak.Flush();
pak.Close(); //closing the file
packtester = 0;
nr++; //nr++ -> just for file name to be Pack-2;
pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
}
}
}
finally
{
if(pak != null)
{
pak.Dispose();
}
}
}
}
private string GetPackFilename(int package, int nr, DateTime date2)
{
return string.Format("{0}Pack-{1}+_{2}.txt", package, nr, date2);
}
Logrotate can do this automatically for you. Years have been put into it and it's what people trust to handle their sometimes very large webserver logs.
Note that the code, as written, will not compile because you define the variable pak more than once. It should otherwise function, though it has some room for improvement.
When working with files, my suggestion and the general norm is to wrap your code in a using block, which is basically syntactic sugar built on top of a finally clause:
using (var stream = File.Open(@"C:\hi.txt", FileMode.Open))
{
//write your code here. When this block is exited, stream will be disposed.
}
Is equivalent to:
var stream = File.Open(@"C:\hi.txt", FileMode.Open);
try
{
//write your code here
}
finally
{
stream.Dispose();
}
In addition, when working with files, always prefer opening file streams using very specific permissions and modes as opposed to using the more sparse constructors that assume some default options. For example:
var stream = new StreamWriter(File.Open(@"c:\hi.txt", FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Read));
This will guarantee, for example, that files should not be overwritten -- instead, we assume that the file we want to open doesn't exist yet.
Oh, and instead of using the check you perform, I suggest using the EndOfStream property of the StreamReader object.
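A tiny sketch of that, using the freader and pak from the question; only the loop condition changes, the pack-rollover logic stays as it was:
// Sketch: drive the read loop off EndOfStream instead of testing ReadLine() for null.
while (!freader.EndOfStream)
{
    string line = freader.ReadLine();
    pak.WriteLine(line); // write to the current pack file as before
}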
This code looks like it closes the stream and re-opens a new stream when you hit 300 lines. What exactly doesn't work in this code?
One thing you'll want to add is a final close (probably with a check so it doesn't try to close an already closed stream) in case you don't have an even multiple of 300 lines.
EDIT:
Due to your edit I see your problem. You don't need to redeclare pak in the last line of code; simply reinitialize it to another StreamWriter.
(I don't remember if that is disposable, but if it is you probably should dispose of it before making a new one.)
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
becomes
pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
