Reading 2 GB file with C# takes too much time - c#

I have an indented text file in the following pattern:
cl /FoD:\jnks\complire_flags /c legacy\roxapi\fjord\Module.c
Note: including file: d:\jnks\e\patchlevel.f
Note: including file: d:\3_4_2_patched4\release\include\pyconfig.f
Note: including file: C:\11.0\VC\INCLUDE\io.f
Using StreamReader, I am able to read the above file, which requires the following processing:
1) Every line starting with cl and ending with .c is a parent file.
2) Every line starting with Note and ending with .f is a child file.
3) A .f file is the parent of the .f file below it if the left indent increases (the spaces between "file:" and the drive name); hence pyconfig.f is the parent of io.f.
Using Entity Framework, I am writing the above data into two SQL Server tables:
a Parent table (for the .c files only) and a Child table (for the .f files only).
My big issue here is that it takes 6 hours to read the file (using StreamReader) and another 6 hours to write it to the database (using Entity Framework). I have tried reading the whole file first and then writing it, and I have also tried reading one parent .c file at a time and writing its information along with its child .f files.
The file size might grow to 5 GB in the future, so I would really appreciate help in achieving better performance.
Below is a part of my read logic:
while (!isEndOfFile)
{
    // Read the next line conditionally
    if (readNextLine)
    {
        if (inputFile.Peek() > -1)
        {
            line = inputFile.ReadLine();
        }
        else
        {
            isEndOfFile = true;
            continue;
        }
    }

    // Get the name of the CPP file - condition is that the line starts with cl
    if (isCPPFile(line))
    {
        // Regular expression match to extract the CPP file name
        Match match = cppFilePathRegex.Match(line);
        if (match.Success)
        {
            cppFileName = match.Value;
            addFileDetails = true;
        }
        readNextLine = true;
    }
    // Check if the line meets the header condition ("Note: including file:") and we have a parent CPP file
    else if (addFileDetails && isHeaderFile(line))
    {
        // do something
    }

1) Go and read Why GNU grep is fast. It gives a number of hints on how to process input text files quickly, specifically when looking for patterns.
2) Use SqlBulkCopy to transfer the data into SQL Server. EF is definitely not an appropriate tool for bulk imports; a minimal sketch follows below.
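Here is a minimal SqlBulkCopy sketch for that second point (the connection string, table name and column names are assumptions, not taken from your schema): buffer the parsed rows in a DataTable and push them to SQL Server in large batches instead of inserting entity by entity through EF.

using System.Data;
using Microsoft.Data.SqlClient;   // System.Data.SqlClient on classic .NET Framework

static class BulkWriter
{
    public static void BulkInsertChildren(DataTable childRows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.ChildFiles";  // hypothetical table name
                bulkCopy.BatchSize = 10000;

                // Map DataTable columns to destination columns (names assumed).
                bulkCopy.ColumnMappings.Add("ParentFile", "ParentFile");
                bulkCopy.ColumnMappings.Add("ChildFile", "ChildFile");

                bulkCopy.WriteToServer(childRows);
            }
        }
    }
}

Build the DataTable while you parse (one row per .f line, carrying its parent .c path) and flush it every few hundred thousand rows so memory stays bounded.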
But if I were you, I would do a del /q /s on my entire import solution and start from scratch using SQL Server Integration Services. SSIS is a dedicated solution for your task; it contains countless optimizations around file reads, record access, buffering, caching and, ultimately, database writes.

The time seems to have dropped drastically (to almost 2 hours) if I split the file before I process it.
The file has a tree structure, so it has to be processed line by line, but I can split it at the points where a certain marker signifies the start of a new tree.
If I read a block at each new marker instead of splitting, the 2 GB file still eats up a lot of memory.
I have split the file using the PowerShell script below, and I will later see how I can call both the PowerShell script and my C# application (for processing and DB insertion). I am still working on reducing the time, but please find my PowerShell below for reference.
# My PowerShell
$Path = "D:\Parser\Test\"                          # path of the input file
$PathSplit = "D:\Parser\Test\Cpp\"                 # path of the output files
$InputFile = (Join-Path $Path "input_file.txt")    # input file name
$Reader = New-Object System.IO.StreamReader($InputFile)
$N = 0
While(($Line = $Reader.ReadLine()) -ne $null)
{
    # A "cl ... /Fo ..." line marks the start of a new tree, so switch to a new output file
    If(($Line -match "^cl ") -and ($Line -match "/Fo"))
    {
        $N++
        $OutputFile = "$N.txt"
    }
    # Append the current line (the cl line or a Note line) to the current output file
    If($N -gt 0)
    {
        Add-Content (Join-Path $PathSplit $OutputFile) $Line
    }
}
$Reader.Close()
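For comparison, here is a rough single-pass C# sketch of the same idea (all names here are placeholders, not the actual parser): each cl tree is buffered in memory, handed off for parsing and database insertion as soon as the next cl line appears, and then discarded, so memory stays bounded by the size of a single tree and no intermediate files are needed.

using System;
using System.Collections.Generic;
using System.IO;

class TreeSplitter
{
    static void Main()
    {
        var currentTree = new List<string>();

        // File.ReadLines streams the file line by line; it never loads the whole 2 GB.
        foreach (string line in File.ReadLines(@"D:\Parser\Test\input_file.txt"))
        {
            bool startsNewTree = line.StartsWith("cl ") && line.Contains("/Fo");

            if (startsNewTree && currentTree.Count > 0)
            {
                ProcessTree(currentTree);   // parse parent/child rows, bulk-insert, etc.
                currentTree.Clear();
            }

            currentTree.Add(line);
        }

        if (currentTree.Count > 0)
            ProcessTree(currentTree);       // flush the last tree
    }

    static void ProcessTree(IReadOnlyList<string> lines)
    {
        // Hypothetical hook: build the parent/child records here and hand them to SqlBulkCopy.
    }
}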

Related

Files disappear after they fail to be moved

We have a process where people scan documents on photocopiers and drop them into a certain directory on our file server. An hourly service within a .NET Core app then scans the directory, grabs the files and moves them, according to their file names, to certain directories. Here come the problems.
The code looks something like this:
private string MoveFile(string file, string commNumber)
{
    var fileName = Path.GetFileName(file);
    var baseFileName = Path.GetFileNameWithoutExtension(fileName).Split("-v")[0];

    // 1. Check if the file already exists at the destination
    var existingFileList = luxWebSamContext.Documents
        .Where(x => EF.Functions.Like(x.DocumentName, "%" + Path.GetFileNameWithoutExtension(baseFileName) + "%"))
        .ToList();

    // If the file exists, check for the current version of the file
    if (existingFileList.Count > 0)
    {
        var nextVersion = existingFileList.Max(x => x.UploadVersion) + 1;
        var extension = Path.GetExtension(fileName);
        fileName = baseFileName + "-v" + nextVersion.ToString() + extension;
    }

    var from = file;
    var to = Path.Combine(destinationPath, commNumber, fileName);

    try
    {
        log.Info($"------ Moving File! ------ {fileName}");
        Directory.CreateDirectory(Path.Combine(destinationPath, commNumber));
        File.Move(from, to, true);
        return to;
    }
    catch (Exception ex)
    {
        log.Error($"----- Couldn't MOVE FILE: {file} ----- commission number: {commNumber}", ex);
The interesting part is in the try block, where the file move takes place. Sometimes we have the problem that the program throws the following exception:
2021-11-23 17:15:37,960 [60] ERROR App ----- Couldn't MOVE FILE:
\PATH\PATH\PATH\Filename_423489120.pdf ----- commission number:
05847894
System.IO.IOException: The process cannot access the file because it is being used by another process.
at System.IO.FileSystem.MoveFile(String sourceFullPath, String destFullPath, Boolean overwrite)
at System.IO.File.Move(String sourceFileName, String destFileName, Boolean overwrite)
So far so good. I would expect that if the file cannot be moved, it remains in the directory from which it was supposed to be moved. But that's not the case. We had this issue yesterday afternoon, and when I looked for the file, it was gone from the source directory.
Is this the normal behaviour of the File.Move() method?
First to your question:
Is this the normal behaviour of the File.Move() method?
No, that's not the expected behaviour. The documentation says:
Moving the file across disk volumes is equivalent to copying the file
and deleting it from the source if the copying was successful.
If you try to move a file across disk volumes and that file is in use,
the file is copied to the destination, but it is not deleted from the
source.
Your exception says that another process is using the file at that moment. So you should check whether other parts of your application may perform a delete, or whether someone (if this scenario is valid) is deleting files manually from the file system.
Typically, File.Move() only removes the source file once the destination file has been successfully put in place. So the answer to your question is no, it cannot purely be File.Move(). The interesting part is: why is this file locked? Probably because some file stream is still open and blocking access to the file. Also, do you have multiple instances of the copy service running? That could cause several services to access the file simultaneously, producing the exception you posted.
There must be a different cause making the files disappear, because File.Move() will certainly not remove the source file when the copy did not succeed.
For debugging purposes, you could try to open the file with an exclusive lock. This will fail when a different process has the file locked, giving you a little more information.
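A small sketch of that debugging idea (the helper name is made up): try to open the file with an exclusive share mode; a sharing violation tells you that some other process still has it open.

using System.IO;

static class FileLockProbe
{
    public static bool IsLocked(string path)
    {
        try
        {
            // FileShare.None requests exclusive access; this throws if anyone else has the file open.
            using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
            {
                return false;
            }
        }
        catch (IOException)
        {
            return true;   // sharing violation (or the file disappeared between checks)
        }
    }
}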

Check inside loop if *txt file has been created

My code is checking inside a loop whether a *.txt file has been created.
If the file has not been created after x amount of time, I will throw an exception.
Here is my code:
var AnswerFile = @"C:\myFile.txt";
for (int i = 0; i <= 30; i++)
{
    if (File.Exists(AnswerFile))
        break;

    await Task.Delay(100);
}

if (File.Exists(AnswerFile))
{
}
else
{
}
After the loop I check whether the file has been created or not. The loop will expire after 3 seconds (100 ms * 30 iterations).
My code is working; I am just asking about its performance and quality. Is there a better approach than mine? For example, should I use the FileInfo class instead, like this?
var fi1 = new FileInfo(AnswerFile);
if(fi1.Exists)
{
}
Or should I use the FileSystemWatcher class?
You should perhaps use a FileSystemWatcher for this and decouple the process of creating the file from the process of reacting to its presence. If the file must be generated in a certain time because it has some expiry time then you could make the expiry datetime part of the file name so that if it appears after that time you know it's expired. A note of caution with the FileSystemWatcher - it can sometimes miss something (the fine manual says that events can be missed if large numbers are generated in a short time)
In the past I've used this for watching for files being uploaded via ftp. As soon as the notification of file created appears I put the file into a list and check it periodically to see if it is still growing - you can either look at the filesystem watcher lastwritetime event for this or directly check the size of the file now vs some time ago etc - in either approach it's probably easiest to use a dictionary to track the file and the previous size/most recent lastwritedate event.
After a minute of no growth I consider the file uploaded completely and I process it. It might be wise for you to implement a similar delay if using a file system watcher and the files are arriving by some slow generating method
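If all you need is "wait up to three seconds for the file, then give up", FileSystemWatcher.WaitForChanged with a timeout does exactly that; a minimal sketch, assuming the same path and file name as in the question:

using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Bail out early if the file is already there; the watcher only sees future events.
        if (File.Exists(@"C:\myFile.txt"))
            return;

        var watcher = new FileSystemWatcher(@"C:\", "myFile.txt");

        // Block for up to 3 seconds waiting for the Created event.
        WaitForChangedResult result = watcher.WaitForChanged(WatcherChangeTypes.Created, 3000);

        if (result.TimedOut)
            throw new TimeoutException(@"C:\myFile.txt was not created within 3 seconds.");

        Console.WriteLine("File created: " + result.Name);
    }
}

Note that WaitForChanged blocks the calling thread, unlike the await Task.Delay polling loop in the question, so use it where a synchronous wait is acceptable.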
Why don't you retrieve a list of file names and then search that list? You can use Directory.GetFiles to get the list of files in a directory and then search within it.
This would be more flexible for you, since you create the list once and reuse it across the application instead of calling File.Exists for each file.
Example :
var path = @"C:\folder\"; // set the folder path which contains all the answer files
var ext = "*.txt";        // set the file extension.

// GET the filename list (bare names) and make them all lowercase.
var files = Directory.GetFiles(path, ext)
    .Select(x => x.Substring(path.Length, (x.Length - path.Length) - ext.Length + 1).Trim().ToLower())
    .ToList();

// Search for this filename
var search = "myFile";

// Check
if (files.Contains(search.ToLower()))
{
    Console.WriteLine($"File : {search} already exists.");
}
else
{
    Console.WriteLine($"File : {search} was not found.");
}

Add Files Into Existing Zip - performance issue

I have a WCF web service that saves files to a folder (about 200,000 small files).
After that, I need to move them to another server.
The solution I found was to zip them and then move them.
When I adopted this solution, I tested it with 20,000 files; zipping 20,000 files took only about 2 minutes, and moving the zip is really fast.
But in production, zipping 200,000 files takes more than 2 hours.
Here is my code to zip the folder:
using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);
    zipFile.Save(DestinationCurrentFileInfo.FullName);
}
I want to modify the WCF webservice, so that instead of saving to a folder, it saves to the zip.
I use the following code to test:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
        zip.Save();
    }

    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
The first file added to the zip takes only 5 ms, but the 10,000th file takes 800 ms.
Is there a way to optimize this? Or do you have other suggestions?
EDIT
The example shown above is only for testing; in the WCF web service, I'll have different requests sending files that I need to add to the zip file.
As WCF is stateless, I will have a new instance of my class with each call, so how can I keep the zip file open to add more files?
I've looked at your code and immediately spotted problems. The problem with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case, you don't seem to know how ZIP files work; therefore I would suggest you first read up on how they work and attempt to break down what happens under the hood.
Reasoning
Now that we're all on the same page about how ZIP files work, let's start the reasoning by breaking down your source code; we'll continue from there:
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);
        // (3)
        zip.Save();
    }

    file.WriteLine("Delay for adding a file : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
(1) opens a ZIP file. You're doing this for every file you attempt to add
(2) Adds a single file to the ZIP file
(3) Saves the complete ZIP file
On my computer this takes about an hour.
Now, not all of the file format details are relevant; we're looking for the part that gets increasingly worse as your program runs.
Skimming over the file format specification, you'll notice that compression is based on Deflate, which doesn't require any information about the other compressed files. Moving on, note how the 'file table' is stored in the ZIP file: there is a 'central directory' which lists the files contained in the archive. It's basically stored as a 'list'. Using this information, we can reason about the trivial way to update it when implementing steps (1-3) in this order:
Open the zip file, read the central directory
Append data for the (new) compressed file, store the pointer along with the filename in the new central directory.
Re-write the central directory.
Think about it for a moment: for file #1 you need 1 write operation; for file #2 you need to read (1 item), append (in memory) and write (2 items); for file #3 you need to read (2 items), append (in memory) and write (3 items); and so on. This basically means that your performance will go down the drain as you add more files. You've already observed this; now you know why.
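To put a rough number on it: rewriting the central directory on every save means writing 1 + 2 + ... + n = n(n+1)/2 directory entries in total, which for n = 200,000 is about 2 * 10^10 entry writes. That is why the per-file delay you measured keeps climbing and the total time grows quadratically.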
A possible solution
The easiest solution, described further below, is to add all the files at once; that might not work in your use case, though. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all the files available when you start the compression process.
Basically the algorithm then becomes:
Add a few (say, 16) files. You can toy with this number. Store the result in, say, 'file16.zip'.
Add more files. When you hit 16 files again, merge the two files of 16 items into a single file of 32 items.
Merge files until you cannot merge any more. Basically, every time you have two files of N items, you create a new file of 2*N items.
Go to (2).
Again, we can reason about it. The first 16 files aren't a problem; we've already established that.
We can also reason about what will happen in our program. Because we're merging 2 files into 1, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you get a file of 32 entries after 2 merges, 64 after 4 merges, 128 after 8 merges, 256 after 16 merges... hey, wait, we know this sequence: it's 2^N. Again, reasoning about it, we'll find that we need approximately 500 merges, which is much better than the 200,000 operations we started with.
Hacking in the ZIP file
Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and create your own ZIP file writer. The idea is that you overallocate the central directory to 200K entries before you get started, so that you can simply append in place.
Again, we can reason about it: adding a file now means adding the file data and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it will probably work fast enough.
I haven't worked this out, but it doesn't seem overly complicated to me.
The easiest solution is the most practical
The easiest possible solution, already hinted at above, is to simply add all files at once, which we can again reason about.
Implementation is quite easy, because now we don't have to do any fancy things; we can simply use the ZIP handler (I use Ionic) as-is:
static void Main()
{
    try { File.Delete(@"c:\tmp\test.zip"); }
    catch { }

    var sw = Stopwatch.StartNew();

    using (var zip = new ZipFile(@"c:\tmp\test.zip"))
    {
        zip.UseZip64WhenSaving = Zip64Option.Always;
        for (int i = 0; i < 200000; ++i)
        {
            string filename = "foo" + i.ToString() + ".txt";
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            zip.AddEntry(filename, contents);
        }
        zip.Save();
    }

    Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
    Console.ReadLine();
}
Whop; that finishes in 4.5 seconds. Much better.
I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive).
Two ideas to explore:
Experiment with file formats other than zip, as zip may not be the fastest. Tar (tape archive) comes to mind (it has natural speed advantages due to its simplicity), and it even has an append mode, which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others); see the sketch after this list.
If you have control over the remote server, you could implement your own file format: the simplest I can think of would be to zip each new file separately (to store file metadata such as name, date, etc. in the file content itself) and then append each such zipped file to a single raw bytes file. You would just need to store the byte offsets (comma-separated in another txt file) to allow the remote server to split the huge file back into the 200,000 zipped files and then unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :).
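As a concrete illustration of the first idea, here is a sketch using the built-in System.Formats.Tar writer (available in .NET 7+) rather than SharpCompress; paths and names are assumed. Each entry is written sequentially at the end of the stream, so the cost per file stays constant no matter how many files the archive already holds.

using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;

static class TarPacker
{
    public static void Pack(string tarPath, IEnumerable<string> files)
    {
        using (var output = File.Create(tarPath))
        using (var writer = new TarWriter(output))
        {
            foreach (var file in files)
            {
                // Writes the file as one tar entry; nothing already in the archive is re-read or rewritten.
                writer.WriteEntry(fileName: file, entryName: Path.GetFileName(file));
            }
        }
    }
}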
Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.
If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
You are opening the zip file repeatedly; why not loop through and add all the files to one zip, then save it?
var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));

using (var zip = ZipFile.Read(nameOfExistingZip))
{
    foreach (var additionFile in listAes)
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
    }
    zip.Save();
}
If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files but have only received 10 so far, don't open the zip, add one, then close it. Wait for a few more to come in and add them in batches.
If you are OK with the performance of 100 * 20,000 files, couldn't you simply partition your large ZIP into 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a timestamp in the name.
You can zip all the files using the .NET TPL (Task Parallel Library) like this:
while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal)); // Line 1
    eventSignal.WaitOne(-1);           // Line 2
    taskCounter++;                     // Line 3
    bufferRead = new byte[sliceBytes]; // Line 4
}

Task.WaitAll(tasks); // Line 6
There is a compiled library and source code here:
http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of

C# searching for string in large text file. If you search the same file goto line you last read and start search

If I am searching for strings in the same LOG file many times throughout the day, would it be faster to somehow go to the last line read on the previous search and begin reading line by line from there? Would there be any significant savings?
EXAMPLE FILE
process ID logic
11111 Run some silly logic on middle tier servers.
11111 Still running logic
22222 Run some silly logic on middle tier servers from another user.
11111 Oh look the first process is done.
22222 Still running logic on the second process.
There are times I want many lines of a file since the last time I loaded it. Currently I use UltraEdit to load the file once and then refresh it, but this still takes quite a bit of time.
In the example above, I want every line from the first process.
NOTE:
The file can be several hundred MB in size at times.
The example above is abbreviated; each process ID may contain hundreds of lines of logic.
I am accessing the log file across a network. I have found that with UE it is faster to load the file across the network and keep refreshing it than to copy it to my local PC and open it there.
I am hoping to have a C# console application that can be run from PowerShell and pipe the lines I want to the screen or to a file.
Another question I have is: what would make this process as efficient as possible?
1. with regard to the C# methods used for my file size?
2. with regard to the language used to write the utility? I have PowerShell, C#, C++ and Perl available.
This would be possible using Stream.Seek. You would just have to remember what the last position in the stream was, then move forward from there. If your log file only ever has lines appended to it, this will work just fine, and it will certainly be faster than reading and scanning the same lines over and over again.
If you post some of your existing code, I can even help you write the code to do it.
http://msdn.microsoft.com/en-us/library/system.io.stream.seek.aspx
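A minimal sketch of that idea for an append-only log (the file name and search string are placeholders): persist the offset you stopped at, seek back to it on the next search, and only scan the newly appended lines. Because StreamReader buffers internally, the reliable offset to remember is the one taken after reading all the way to the end of the stream.

using System;
using System.IO;

class TailSearcher
{
    private long _lastOffset;   // persist this between runs (file, registry, etc.) if needed

    public void SearchNewLines(string path, string needle)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(fs))
        {
            fs.Seek(_lastOffset, SeekOrigin.Begin);   // skip everything scanned last time

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Contains(needle))
                    Console.WriteLine(line);
            }

            _lastOffset = fs.Length;   // we read to the end, so the next search starts here
        }
    }
}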
I've wanted to implement something like this myself, so I took some time to give it a shot. Here's an extension method on FileStream (you'll have to put it in a static class) that I've come up with:
public static string ReadLineAndCountPosition(this FileStream fs, ref long position)
{
    // Nothing left to read at or past the end of the file.
    if (position >= fs.Length)
        return null;

    StringBuilder sb = new StringBuilder();
    fs.Seek(position, SeekOrigin.Begin);

    while (position < fs.Length)
    {
        var my_byte = fs.ReadByte();
        position++;

        // '\n' terminates the line (covers both "\r\n" and bare "\n" endings)
        if (my_byte == 10)
            return sb.ToString();

        // Skip '\r'; append everything else (assumes single-byte text such as ASCII)
        if (my_byte != 13)
            sb.Append((char)my_byte);
    }

    return sb.ToString(); // We've consumed the entire file.
}
To use it, simply call ReadLineAndCountPosition and pass in a long variable by reference; the current position is saved into it, and the method Seek()s back to that position on the next call.
static void Main(string[] args)
{
    long saved_position = 0;

    using (FileStream fs = new FileStream("testfile.txt", FileMode.Open))
    {
        // First search: scan from the start of the file.
        while (true)
        {
            string current_line = fs.ReadLineAndCountPosition(ref saved_position);
            if (current_line == null || current_line == "SomeSearchString")
                break;
        }

        // Some time later we want to search on from the saved position:
        while (true)
        {
            string current_line = fs.ReadLineAndCountPosition(ref saved_position);
            if (current_line == null || current_line == "SecondSearchString")
                break;
        }
    }
}
I ran a few tests myself, and it seems to have worked fine. Let me know if you have any troubles. Hopefully it helps you out.

append text to lines in a CSV file

This question seems to have been asked a million times around the web, but I cannot find an answer that works for me.
Basically, I have a CSV file with a number of columns (say two). The program goes through each row in the CSV file, takes the first column value, then asks the user for the value to be placed in the second column. This runs on a handheld device running Windows 6, and I am developing in C#.
It seems a simple thing to do, but I can't seem to add text to a line.
I can't use OleDb, as System.Data.OleDb isn't in the .NET version I am using. I could use another CSV file and, as the user completes each line, write it out to that second file. But the problems with that are: the file produced at the end needs to contain EVERY line (so what if they pull the batteries out half way through?), and if they come back to continue another time, how will the program know where to start from?
For every row, open the output file, append the new row to it, and then close the output file. To restart, count the number of rows in the existing output file from the previous run; that gives you your starting position in the input file (i.e., skip that number of rows in the input file).
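A minimal sketch of that append-and-resume approach (file names and the prompt are assumptions; on the Compact Framework you may have to swap File.ReadLines/AppendAllText for StreamReader/StreamWriter equivalents):

using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string inputFile = "input.csv";
        string outputFile = "output.csv";

        // Count the rows already written by a previous run so we can resume after them.
        int alreadyDone = File.Exists(outputFile) ? File.ReadLines(outputFile).Count() : 0;

        foreach (string line in File.ReadLines(inputFile).Skip(alreadyDone))
        {
            string firstColumn = line.Split(',')[0];

            Console.Write("Value for '" + firstColumn + "': ");
            string secondColumn = Console.ReadLine();

            // Append one finished row at a time, so a battery pull costs at most the current row.
            File.AppendAllText(outputFile, firstColumn + "," + secondColumn + Environment.NewLine);
        }
    }
}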
Edit: right at the start, use System.IO.File.Copy to copy the input file to the output file, so you have the whole file in case of failure. Now open the input file, read a line, convert it, use File.ReadAllLines to read ALL of the output file into an array, replace the line you have changed at the right index in the array, then use File.WriteAllLines to write out the new output file.
Something like this:
string inputFileName = "";  // Use a sensible file name.
string outputFileName = ""; // Use a sensible file name.

File.Copy(inputFileName, outputFileName, true);

using (StreamReader reader = new StreamReader(inputFileName))
{
    string line = null;
    int inputLinesIndex = 0;

    while ((line = reader.ReadLine()) != null)
    {
        string convertedLine = ConvertLine(line);
        string[] outputFileLines = File.ReadAllLines(outputFileName);

        if (inputLinesIndex < outputFileLines.Length)
        {
            outputFileLines[inputLinesIndex] = convertedLine;
            File.WriteAllLines(outputFileName, outputFileLines);
        }

        inputLinesIndex++;
    }
}
