How can I split a big text file into smaller files? - c#

I have a big file with some text, and I want to split it into smaller files.
In this example, here is what I do:
1. I open a text file, let's say with 10,000 lines in it.
2. I set a package size of 300, which is the small-file limit; once a small file has 300 lines in it, I close it and open a new file for writing (for example, package2).
3. Same as step 2.
4. You already know.
Here is the code from my function that should do that. The idea (what I don't know) is how to close, and open, a new file once it has reached the 300-line limit (in our case here).
Let me show you what I'm talking about:
int nr = 1;
package = textBox1.Text; //how many lines/file (small file)
string packnr = nr.ToString();
string filer = package + "Pack-" + packnr + "+_" + date2 + ".txt"; //name of small file/s
int packtester = 0;
int package = 300;
StreamReader freader = new StreamReader("bigfile.txt");
StreamWriter pak = new StreamWriter(filer);
while ((line = freader.ReadLine()) != null)
{
    if (packtester < package)
    {
        pak.WriteLine(line); //writing line to small file
        packtester++; //increasing the lines of small file
    }
    else if (packtester == package) //in this example, checking if the lines written got to 300
    {
        packtester = 0;
        pak.Close(); //closing the file
        nr++; //nr++ -> just for file name to be Pack-2;
        packnr = nr.ToString();
        StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
    }
}
I get these errors:
Cannot use local variable 'pak' before it is declared
A local variable named 'pak' cannot be declared in this scope because it would give a different meaning to 'pak', which is already used in a 'parent or current' scope to denote something else

Try this:
public void SplitFile()
{
    int nr = 1;
    int package = 300;
    DateTime date2 = DateTime.Now;
    int packtester = 0;
    using (var freader = new StreamReader("bigfile.txt"))
    {
        StreamWriter pak = null;
        try
        {
            pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
            string line;
            while ((line = freader.ReadLine()) != null)
            {
                if (packtester < package)
                {
                    pak.WriteLine(line); //writing line to small file
                    packtester++; //increasing the lines of small file
                }
                else
                {
                    pak.Flush();
                    pak.Close(); //closing the file
                    nr++; //nr++ -> just for file name to be Pack-2;
                    pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
                    pak.WriteLine(line); //don't lose the line that triggered the rollover
                    packtester = 1; //the new file already holds one line
                }
            }
        }
        finally
        {
            if (pak != null)
            {
                pak.Dispose();
            }
        }
    }
}

private string GetPackFilename(int package, int nr, DateTime date2)
{
    // An explicit timestamp format keeps the filename free of invalid characters
    // (DateTime's default ToString() contains slashes and colons).
    return string.Format("{0}Pack-{1}+_{2:yyyyMMdd_HHmmss}.txt", package, nr, date2);
}

Logrotate can do this automatically for you. Years of work have gone into it, and it's what people trust to handle their sometimes very large webserver logs.

Note that the code, as written, will not compile because you define the variable pak more than once. It should otherwise function, though it has some room for improvement.
When working with files, my suggestion and the general norm is to wrap your code in a using block, which is basically syntactic sugar built on top of a finally clause:
using (var stream = File.Open(@"C:\hi.txt", FileMode.Open))
{
    //write your code here. When this block is exited, stream will be disposed.
}
is roughly equivalent to:
FileStream stream = null;
try
{
    stream = File.Open(@"C:\hi.txt", FileMode.Open);
    //write your code here.
}
finally
{
    if (stream != null)
    {
        stream.Dispose();
    }
}
(The stream has to be declared outside the try block; otherwise it would be out of scope in the finally clause.)
In addition, when working with files, always prefer opening file streams using very specific permissions and modes as opposed to using the more sparse constructors that assume some default options. For example:
var stream = new StreamWriter(File.Open(@"c:\hi.txt", FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Read));
This guarantees, for example, that existing files will not be overwritten -- instead, we assume that the file we want to open doesn't exist yet.
Oh, and instead of using the check you perform, I suggest using the EndOfStream property of the StreamReader object.
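For illustration, a minimal sketch of that check (the file name is assumed from your code):
using (var reader = new StreamReader("bigfile.txt"))
{
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        // process the line here
    }
}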

This code looks like it closes the stream and re-opens a new stream when you hit 300 lines. What exactly doesn't work in this code?
One thing you'll want to add is a final close (probably with a check so it doesn't try to close an already-closed stream) in case the line count isn't an even multiple of 300.
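For example, something like this after the loop (a sketch, using pak from your code):
if (pak != null)
{
    pak.Close();
}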
EDIT:
Due to your edit I see your problem. You don't need to redeclare pak in the last line of code, simply reinitialize it to another streamwriter.
(StreamWriter is disposable, so you should probably dispose of the old one before making a new one.)
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
becomes
pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");


C# Combine Archive Divided Into One File

Code:
public void mergeFiles(string dir)
{
    for (int i = 0; i < parts; i++)
    {
        if (!File.Exists(dir))
        {
            File.Create(dir).Close();
        }
        var output = File.Open(dir, FileMode.Open);
        var input = File.Open(dir + ".part" + (i + 1), FileMode.Open);
        input.CopyTo(output);
        output.Close();
        input.Close();
        File.Delete(dir + ".part" + (i + 1));
    }
}
The dir variable is, for example, /path/file.txt.gz.
I have a file packed into a .gz archive. This archive is divided into e.g. 8 parts, and I want to reassemble the file.
The problem is that I don't know how to combine these files "file.gz.part1..." to extract them later.
When I use the above function, the archive is corrupted.
I have been struggling with it for a week, looking on the Internet, but this is the best solution I have found and it does not work.
Anyone have any advice on how to combine archive parts into one file?
Your code has a few problems. If you look at the documentation for System.IO.Stream.Close you will see the following remark (emphasis mine):
Closes the current stream and releases any resources (such as sockets and file handles) associated with the current stream. Instead of calling this method, ensure that the stream is properly disposed.
So, per the docs, you want to dispose your streams rather than calling close directly (I'll come back to that in a second). Ignoring that, your main problem lies here:
var output = File.Open(dir, FileMode.Open);
You're using FileMode.Open for your output file. Again from the docs:
Specifies that the operating system should open an existing file. The ability to open the file is dependent on the value specified by the FileAccess enumeration. A FileNotFoundException exception is thrown if the file does not exist.
That's opening a stream at the beginning of the file. So, you're writing each partial file over the beginning of your output file repeatedly. I'm sure you noticed that your combined file size was only as large as the largest partial file. Take a look at FileMode.Append on the other hand:
Opens the file if it exists and seeks to the end of the file, or creates a new file. This requires Append permission. FileMode.Append can be used only in conjunction with FileAccess.Write. Trying to seek to a position before the end of the file throws an IOException exception, and any attempt to read fails and throws a NotSupportedException exception.
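In isolation, the fix for that bug is just to open the output for appending; a minimal sketch (the refactor below sidesteps it entirely by keeping a single output stream open):
var output = File.Open(dir, FileMode.Append, FileAccess.Write);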
OK - but backing up even a step further, this:
if (!File.Exists(dir))
{
    File.Create(dir).Close();
}
var output = File.Open(dir, FileMode.Open);
... is inefficient. Why would we check for the file existing n times, then open/close it n times? We can just create the file as the first step, and leave that output stream open until we have appended all of our data to it.
So, how would we refactor your code to use IDisposable while fixing your bug? Check out the using statement. Putting all of this together, your code might look like this:
public void mergeFiles(string dir)
{
    using (FileStream combinedFile = File.Create(dir))
    {
        for (int i = 0; i < parts; i++)
        {
            // Since this string is referenced more than once, capture it as a
            // variable to lower the risk of copy/paste errors.
            var splitFileName = dir + ".part" + (i + 1);
            using (FileStream filePart = File.Open(splitFileName, FileMode.Open))
            {
                filePart.CopyTo(combinedFile);
            }
            // Note that it's safe to delete the file now, because our filePart
            // stream has been disposed as it is out of scope.
            File.Delete(splitFileName);
        }
    }
}
Give that a try. And here's an entire working program, with a contrived example, that you can paste into a new console app and run:
using System.IO;
using System.Text;

namespace temp_test
{
    class Program
    {
        static int parts = 10;

        static void Main(string[] args)
        {
            // First we will generate some dummy files.
            generateFiles();
            // Next, open the files and combine them.
            combineFiles();
        }

        /// <summary>
        /// A contrived example to generate some files.
        /// </summary>
        static void generateFiles()
        {
            for (int i = 0; i < parts; i++)
            {
                using (FileStream newFile = File.Create("splitfile.part" + i))
                {
                    byte[] info = new UTF8Encoding(true).GetBytes($"This is File #{i}");
                    newFile.Write(info, 0, info.Length);
                }
            }
        }

        /// <summary>
        /// A contrived example to combine our files.
        /// </summary>
        static void combineFiles()
        {
            using (FileStream combinedFile = File.Create("combined"))
            {
                for (int i = 0; i < parts; i++)
                {
                    var splitFileName = "splitfile.part" + i;
                    using (FileStream filePart = File.Open(splitFileName, FileMode.Open))
                    {
                        filePart.CopyTo(combinedFile);
                    }
                    // Note that it's safe to delete the file now, because our filePart
                    // stream has been disposed as it is out of scope.
                    File.Delete(splitFileName);
                }
            }
        }
    }
}
Good luck and welcome to StackOverflow!

Consolidating 300+ files into 5-8, OutOfMemory exception

I have 369 files that need to be formatted and consolidated into 5-8 files before being submitted to the server. I can't submit the 369 files because that would overwhelm the metadata tables in our database (they can handle it, but it'd be 369 rows for what was essentially one file, which would make querying and utilizing those tables a nightmare) and I can't handle it as one file because the total of 3.6 GB is too much for SSIS to handle on our servers.
I wrote the following script to fix the issue:
static void PrepPAIDCLAIMSFiles()
{
    const string HEADER = "some long header text, trimmed for SO question";
    const string FOOTER = "some long footer text, trimmed for SO question";
    //path is defined as a static member of the containing class
    string[] files = Directory.GetFiles(path + @"split\");
    int splitFileCount = 0, finalFileCount = 0;
    List<string> newFileContents = new List<string>();
    foreach (string file in files)
    {
        try
        {
            var contents = File.ReadAllLines(file).ToList();
            var fs = File.OpenRead(file);
            if (splitFileCount == 0)
            {
                //Grab everything except the header
                contents = contents.GetRange(1, contents.Count - 1);
            }
            else if (splitFileCount == files.Length - 1)
            {
                //Grab everything except the footer
                contents = contents.GetRange(0, contents.Count - 1);
            }
            if (!Directory.Exists(path + @"split\formatted"))
            {
                Directory.CreateDirectory(path + @"split\formatted");
            }
            newFileContents.AddRange(contents);
            if (splitFileCount % 50 == 0 || splitFileCount >= files.Length)
            {
                Console.WriteLine($"{splitFileCount} {finalFileCount}");
                var sb = new StringBuilder(HEADER);
                foreach (var row in newFileContents)
                {
                    sb.Append(row);
                }
                sb.Append(FOOTER);
                newFileContents = new List<string>();
                GC.Collect();
                string fileName = file.Split('\\').Last();
                string baseFileName = fileName.Split('.')[0];
                DateTime currentTime = DateTime.Now;
                baseFileName += "." + COMPANY_NAME_SetHHMMSS(currentTime, finalFileCount) + ".TXT";
                File.WriteAllText(path + @"split\formatted\" + baseFileName, sb.ToString());
                finalFileCount += 1;
            }
            splitFileCount += 1;
        }
        catch (OutOfMemoryException OOM)
        {
            Console.WriteLine(file);
            Console.WriteLine(OOM.Message);
            break;
        }
    }
}
The way this works is: it reads a split file, puts its rows into a string builder, and every time it gets to a multiple of 50 files, it writes the string builder to a new file and starts over. The COMPANY_NAME_SetHHMMSS() method ensures the file has a unique name, so it's not writing to the same file over and over (and I can verify this by watching the output; it writes two files before exploding).
It breaks when it gets to the 81st file: System.OutOfMemoryException on var contents = File.ReadAllLines(file).ToList();. There's nothing special about the 81st file; it's the exact same size as all the others (~10MB). The files this function delivers are about ~500MB. It also has no trouble reading and processing all the files up to, and not including, the 81st, so I don't think it's running out of memory reading the file, but running out of memory doing something else, and it's at the 81st where memory runs out.
The newFileContents list should be getting emptied by overwriting it with a new list, right? It shouldn't be growing with every iteration of this function. GC.Collect() was sort of a last-ditch effort.
The original file that the 369 splits come from has been a headache for a few days now, causing UltraEdit to crash, SSIS to crash, C# to crash, etc. Splitting it via 7zip seemed to be the only option that worked, and splitting it into 369 files seemed to be the only option 7zip had that didn't also reformat or somehow compress the file in an undesirable way.
Is there something that I'm missing? Something in my code that keeps growing in memory? I know File.ReadAllLines() opens and closes the file, so it should be disposed after it's called, right? newFileContents gets overwritten every 50th file, as does the string builder. What else could I be doing?
One thing that jumps out at me is that you are opening a FileStream, never using it, and never disposing of it. With 300+ file streams, this may be contributing to your issue.
var fs = File.OpenRead(file);
Another thing that caught my eye is that you said 3.6 GB. Make sure you are compiling for 64-bit architecture.
Finally, stuffing gigabytes into a StringBuilder may cause you grief. Maybe create a staging file: every time you open a new input file, write its contents to the staging file, close the input, and don't depend on stuffing everything into memory.
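A minimal sketch of that staging-file idea (the staging file name is illustrative; path and the split folder are from your code):
using (var staging = new StreamWriter(Path.Combine(path, "staging.txt"), true))
{
    foreach (string file in Directory.GetFiles(Path.Combine(path, "split")))
    {
        // File.ReadLines streams one line at a time, so no whole file is ever in memory.
        foreach (string line in File.ReadLines(file))
        {
            staging.WriteLine(line);
        }
    }
}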
You should just be looping over the rows in your source files and appending them to a new file. You're holding the contents of up to 50 10 MB files in memory at once, plus anything else you're doing. This may be because you're compiling for x86 instead of x64, but there isn't any reason this should use anywhere near that much memory. Something like the following:
var files = Directory.GetFiles(System.IO.Path.Combine(path, "split")).ToList();
//since you were skipping the first and last file
files.Remove(files.FirstOrDefault());
files.Remove(files.LastOrDefault());
string combined_file_path = "<whatever you want to call this>";
System.IO.StreamWriter combined_file_writer = null;
try
{
    foreach (var file in files)
    {
        //if multiple of 50, write footer, dispose of stream, and make a new stream
        if ((files.IndexOf(file)) % 50 == 0)
        {
            combined_file_writer?.WriteLine(FOOTER);
            combined_file_writer?.Dispose();
            combined_file_writer = new System.IO.StreamWriter(combined_file_path + "_1"); //increment the name somehow
            combined_file_writer.WriteLine(HEADER);
        }
        using (var file_reader = new System.IO.StreamReader(file))
        {
            while (!file_reader.EndOfStream)
            {
                combined_file_writer.WriteLine(file_reader.ReadLine());
            }
        }
    }
    //finish out the last file
    combined_file_writer?.WriteLine(FOOTER);
}
finally
{
    //dispose of last file
    combined_file_writer?.Dispose();
}

c# - splitting a large list into smaller sublists

Fairly new to C# - sitting here practicing. I have a file with 10 million passwords listed in a single file that I downloaded to practice with.
I want to break the file down into lists of 99: stop at 99, do something, then start where it left off and repeat the "do something" with the next 99, until it reaches the last item in the file.
I can do the count part well; it's the stopping at 99 and continuing where I left off that I am having trouble with. Anything I find online is not close to what I am trying to do, and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and I will respond; however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;

namespace lists01
{
    class Program
    {
        static void Main(string[] args)
        {
            int count = 0;
            var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";
            var content = File.ReadAllLines(f1);
            foreach (var v2 in content)
            {
                count++;
                Console.WriteLine(v2 + "\t" + count);
            }
        }
    }
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you already have in your code. The trade-off is that you load the entire file into memory at once, and then operate on it.
Using ReadAllLines together with LINQ's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
    List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
    DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
    using (Stream stream = File.Open(fileName, FileMode.Open))
    {
        //Jump back to the previous position
        stream.Seek(seekPosition, SeekOrigin.Begin);
        using (StreamReader reader = new StreamReader(stream))
        {
            while (!reader.EndOfStream && lines.Count < maxLines)
            {
                line = reader.ReadLine();
                seekPosition += (line.Length + 2); //Tracks how much data has been read.
                lines.Add(line);
            }
            fileLoaded = reader.EndOfStream;
        }
    }
    DoSomethingWithLines(lines);
    lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file, and then wrapped it in a StreamReader because that has the ReadLine() method. Note that the seekPosition arithmetic assumes single-byte characters and two-byte (CRLF) line endings, so adjust it if your file differs.

C# - StreamWriter appending to file when append is set to false [duplicate]

This question already has answers here:
unable to overwrite file using streamwriter despite append= false, without closing file
(2 answers)
Closed 5 years ago.
I am working on a simple console application which writes into .txt files. I have a few StreamWriters with append set to false:
StreamWriter j1 = new StreamWriter(@"jmeno1.txt", false);
StreamWriter j2 = new StreamWriter(@"jmeno2.txt", false);
StreamWriter s1 = new StreamWriter(@"skore1.txt", false);
StreamWriter s2 = new StreamWriter(@"skore2.txt", false);
StreamWriter l1 = new StreamWriter(@"legy1.txt", false);
StreamWriter l2 = new StreamWriter(@"legy2.txt", false);
First I write down the default values:
string jmeno1;
string jmeno2;
int legy1 = 0;
int legy2 = 0;
int skore1 = 501;
int skore2 = 501;
Console.WriteLine("První jméno?");
jmeno1 = Console.ReadLine();
Console.WriteLine("Druhé jméno?");
jmeno2 = Console.ReadLine();
j1.WriteLine(jmeno1);
j2.WriteLine(jmeno2);
s1.WriteLine(skore1.ToString());
s2.WriteLine(skore2.ToString());
l1.WriteLine(legy1.ToString());
l2.WriteLine(legy2.ToString());
j1.Flush();
j2.Flush();
s1.Flush();
s2.Flush();
l1.Flush();
l2.Flush();
Then, after some user input, I want to overwrite these files with new strings (the same way as for the default values). But the files aren't being overwritten; the text is only being appended. I find this really strange, since append is set to false. I've never experienced this before.
Here's the part of the code where the writing to files happens (sorry for the foreign language):
Console.WriteLine("\n" + jmeno1 + " hodil/a?");
skore1 = skore1 - int.Parse(Console.ReadLine());
if (skore1 == 0)
{
    legy1++;
    // writing to file
    l1.WriteLine(legy1.ToString());
    l1.Flush();
    Console.WriteLine("\n" + jmeno1 + " zavřel/a!");
    skore1 = 501;
    skore2 = 501;
    // writing to file
    s1.WriteLine(skore1.ToString());
    s2.WriteLine(skore2.ToString());
    s1.Flush();
    s2.Flush();
    zacina = 2;
}
else
{
    // writing to file
    s1.WriteLine(skore1.ToString());
    s1.Flush();
    Console.WriteLine("\n" + jmeno2 + " hodil/a?");
    skore2 = skore2 - int.Parse(Console.ReadLine());
    if (skore2 == 0)
    {
        legy2++;
        // writing to file
        l2.WriteLine(legy2.ToString());
        l2.Flush();
        Console.WriteLine("\n" + jmeno2 + " zavřel/a!");
        skore1 = 501;
        skore2 = 501;
        // writing to file
        s1.WriteLine(skore1.ToString());
        s2.WriteLine(skore2.ToString());
        s1.Flush();
        s2.Flush();
        zacina = 2;
    }
    else
    {
        // writing to file
        s2.WriteLine(skore2.ToString());
        s2.Flush();
    }
}
The file with the score then looks like this.
Thanks for any help.
Your code (which is incomplete at the time of writing) shows one wrong assumption about what the append mode of StreamWriter means. Once the file is opened and the writer writes its first lines, followed by a Flush(), you have merely written text to the file, advanced the underlying FileStream's position by the byte length of the text written, and flushed the buffer to disk. The next call to WriteLine() writes the next line to the very same file, starting at the current FileStream.Position, without overwriting anything.
The append option only matters at the moment a StreamWriter opens the file: true means the next writer you create appends to the existing content, while false means it creates a new empty file, overwriting any content previously present. The MSDN documentation for the bool append constructor parameter says the following:
append
Type: System.Boolean
true to append data to the file; false to overwrite the file. If the specified file does not exist, this parameter has no effect, and the constructor creates a new file.
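A minimal sketch of the distinction, using one of the file names from your code:
var s1 = new StreamWriter("skore1.txt", false); // append: false truncates the file here, once
s1.WriteLine("501");
s1.Flush();
s1.WriteLine("481"); // later writes continue after the previous line; nothing is overwritten
s1.Flush();
s1.Dispose();
// To actually overwrite, construct a fresh writer (append: false) for each rewrite:
using (var rewrite = new StreamWriter("skore1.txt", false)) // truncates again here
{
    rewrite.WriteLine("481");
}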

How to close file that has been read

So I'm trying to close a file (transactions.txt) that I've used to read into a textbox, and now I want to save back to the file. The problem is that the debugger says the file is in use, so I need to find a way to close it. Can anyone help me with this? Thanks!
SearchID = textBox1.Text;
string ID = SearchID.ToString();
bool idFound = false;
int count = 0;
foreach (var line in File.ReadLines("transactions.txt"))
{
    //listView1.Items.Add(line);
    if (line.Contains(ID))
    {
        idFound = true;
    }
    //Displays Transactions if the variable SearchID is found.
    if (idFound && count < 8)
    {
        textBox2.Text += line + "\r\n";
        count++;
    }
}
}
private void SaveEditedTransaction()
{
    SearchID = textBox1.Text;
    string ID = SearchID.ToString();
    bool idFound = false;
    int count = 0;
    foreach (var lines in File.ReadLines("transactions.txt"))
    {
        //listView1.Items.Add(line);
        if (lines.Contains(ID))
        {
            idFound = true;
        }
        if (idFound)
        {
            string edited = File.ReadAllText("transactions.txt");
            edited = edited.Replace(lines, textBox2.Text);
            File.WriteAllText("Transactions.txt", edited);
        }
    }
}
The problem here is that File.ReadLines keeps the file open while you read it, and since you've put the call that writes new text inside the loop, the file is still open when you try to write.
Instead, I would simply break out of the loop when you find the id, and then put the if-statement that writes to the file outside the loop.
This, however, means that you will also need to keep track of which line to do the replacement in.
So actually, I would instead switch to using File.ReadAllLines. This reads the entire file into memory and closes it before the loop starts.
Now, pragmatic minds might argue that if you have a lot of text in that file, File.ReadLines (which you're currently using) will use a lot less memory than File.ReadAllLines (which I am suggesting), but if that's the case you should switch to a database, which would be much better suited to your purpose anyway. That is, however, a bit of overkill for a toy project with 5 lines in that file.
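A minimal sketch of that suggestion (ID and textBox2 are from your code; the replace-first-match logic is illustrative):
string[] allLines = File.ReadAllLines("transactions.txt"); // the file is closed once this returns
for (int i = 0; i < allLines.Length; i++)
{
    if (allLines[i].Contains(ID))
    {
        allLines[i] = textBox2.Text;
        break; // stop at the first match; the file is rewritten below
    }
}
File.WriteAllLines("transactions.txt", allLines);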
Use StreamReader directly with the using statement, for example:
var lines = new List<string>();
using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
    var line = reader.ReadLine();
    while (line != null)
    {
        lines.Add(line);
        line = reader.ReadLine();
    }
}
By using the using statement, the StreamReader instance is automatically disposed of when the block is exited.
You can try with this:
File.WriteAllLines(
    "transactions.txt",
    File.ReadAllLines("transactions.txt")
        .Select(x => x.Contains(ID) ? textBox2.Text : x));
It works fine, but if the file is big you may have to find another solution.
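One such alternative, sketched under the assumption that you replace matching lines as above: stream line by line through a temporary file instead of loading everything at once.
string temp = Path.GetTempFileName();
using (var reader = new StreamReader("transactions.txt"))
using (var writer = new StreamWriter(temp))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        writer.WriteLine(line.Contains(ID) ? textBox2.Text : line);
    }
}
// Both streams are closed here, so the original file can be swapped out.
File.Delete("transactions.txt");
File.Move(temp, "transactions.txt");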
You can use the StreamReader class instead of the methods of the File class. That way you hold the stream yourself and can call Close() and Dispose() on it explicitly.
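A minimal sketch of that approach (lineToReplace is a hypothetical placeholder for the matched line):
var reader = new StreamReader("transactions.txt");
string text = reader.ReadToEnd();
reader.Close(); // releases the file handle (Dispose() would do the same)
// The file is no longer in use, so it can be rewritten:
File.WriteAllText("transactions.txt", text.Replace(lineToReplace, textBox2.Text));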
