Processing and updating a large file row by row - C#

So I am processing a 200 MB txt file: I have to read each row in the file, update one or two columns, and then save it back. What is the best way to achieve this?
I was thinking of loading it into a DataTable, but holding that big a file in memory is a big pain.
I realise I should do it in batches, but what is the best way to achieve that?
I don't think I want to load it into a DB first, because I can't do a mass update there anyway; I'd have to do a line-by-line read there too.
Just as an update: my files basically have columns in any order, and I need to update two or more columns every time.
Thanks.

Read a line, parse it, and write fields into a temp file. When all the lines are done, delete the original file and rename the temp file.
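A minimal sketch of that final swap (the file names are hypothetical):
using System.IO;

string sourcePath = @"C:\data.txt";
string tempPath = sourcePath + ".tmp";
// ... read sourcePath line by line, writing updated lines to tempPath ...
File.Delete(sourcePath);
File.Move(tempPath, sourcePath);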

To add to what Ants said...
You have options ...
Line by line:
StreamReader fileStream = new StreamReader( sourceFileName );
StreamWriter ansiWriter = new StreamWriter( destinationFileName,
    false, Encoding.GetEncoding( 20127 ) ); // 20127 = US-ASCII
string fileContent;
while ( ( fileContent = fileStream.ReadLine() ) != null )
{
    // strings are immutable, so assign the result back
    fileContent = YourReplaceMethod( fileContent );
    ansiWriter.WriteLine( fileContent );
}
fileStream.Close();
ansiWriter.Close();
Bulk (today's boxes should be able to handle 200 MB without problems):
byte[] bytes = File.ReadAllBytes( sourceFileName );
byte[] writeMeBytes = YourReplaceMethod( bytes );
File.WriteAllBytes( destinationFileName, writeMeBytes );
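A middle ground is File.ReadAllLines, which still holds the whole file in memory but lets you work line by line (YourReplaceMethod is the same placeholder as above):
string[] lines = File.ReadAllLines( sourceFileName );
for ( int i = 0; i < lines.Length; i++ )
{
    lines[i] = YourReplaceMethod( lines[i] );
}
File.WriteAllLines( destinationFileName, lines );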

Related

How can I get to the next column with FileStream?

I want to write data to a CSV file, but I can only get to the next row, not to the next column. I hope some people here know how to get to the next column.
String fileName = "C:\\Users\\hogen\\Desktop\\test.csv";
FileStream fileStream = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write);
string columnTitles = "test, test \n test \n test";
fileStream.Write(Encoding.ASCII.GetBytes(columnTitles), 0, columnTitles.Length);
fileStream.Close();
I get now:
test, test
test
test
in the CSV file. How do I get the second "test" of the first row into the next column?
If you know how, could you maybe give some good examples of reading columns with FileStream?
I think you're asking how to skip a column?
If you want a blank column, just add another comma: test,,test - make sure there are no spaces or other characters between the two commas.
No, it doesn't separate...
I also found this tutorial:
String fileName = "C:\\Users\\hogen\\Desktop\\test1.csv";
StringBuilder content = new StringBuilder();
content.AppendLine("name, age");
content.AppendLine("nick, 26");
File.AppendAllText(fileName, content.ToString());
Output:
https://i.stack.imgur.com/AWxMG.png
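For what it's worth, in a CSV a comma moves to the next column and a newline ends the row; a minimal sketch with StreamWriter (the path is hypothetical):
using System.IO;

using (var writer = new StreamWriter(@"C:\temp\test.csv"))
{
    writer.WriteLine("test,test"); // two columns in the first row
    writer.WriteLine("test");      // second row, one column
    writer.WriteLine("test");      // third row, one column
}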

Slow loading of .CSV files using EPPLUS

I have loads of .csv files I need to convert to .xlsx after applying some formatting.
A file containing approx 20,000 rows and 7 columns takes 12 minutes to convert.
If the file contains more than 100,000 rows, it runs for more than an hour.
This is unfortunately not acceptable for me.
Code snippet:
var format = new ExcelTextFormat();
format.Delimiter = ';';
format.Encoding = new UTF7Encoding();
format.Culture = new CultureInfo(System.Threading.Thread.CurrentThread.CurrentCulture.ToString());
format.Culture.DateTimeFormat.ShortDatePattern = "dd.MM.yyyy"; // "MM" = month; lowercase "mm" would mean minutes
using (ExcelPackage package = new ExcelPackage(new FileInfo(file.Name)))
{
    ExcelWorksheet worksheet = package.Workbook.Worksheets.Add(Path.GetFileNameWithoutExtension(file.Name));
    worksheet.Cells["A1"].LoadFromText(new FileInfo(file.FullName), format);
}
I have verified that it is the LoadFromText command that spends the time used.
Is there a way to speed things up?
I have tried without the "format" parameter, but the load time was the same.
What load times are you experiencing?
My suggestion here is to read the file by yourself and then use the library to create the file.
The code to read the CSV could be as simple as:
List<String> lines = new List<String>();
using (StreamReader reader = new StreamReader("file.csv"))
{
    String line;
    while ((line = reader.ReadLine()) != null)
    {
        lines.Add(line);
    }
}
// Now you have all the lines of your CSV.
// Create your file with EPPlus:
foreach (String line in lines)
{
    var values = line.Split(';');
    foreach (String value in values)
    {
        // use the EPPlus library to fill your file
    }
}
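A sketch of what that inner fill could look like, writing each value by row/column index (worksheet is assumed to be an EPPlus ExcelWorksheet; these names are not from the answer above):
int row = 1;
foreach (String line in lines)
{
    var values = line.Split(';');
    for (int col = 0; col < values.Length; col++)
    {
        // EPPlus cell indexes are 1-based
        worksheet.Cells[row, col + 1].Value = values[col];
    }
    row++;
}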
I ran into a very similar problem with LoadFromCollection. EPPlus has to account for all situations in its methods to load data generically like that, so there is a good deal of overhead. I ended up narrowing down the bottleneck to that method and just manually converting the data from the collection to Excel cell objects in EPPlus. That probably saved several minutes in my exports.
Plenty of examples on how to read CSV data:
C# Read a particular value from CSV file

While writing results into CSV, the number/text format is not getting set properly

I am writing data into a CSV file. While doing so, I have written the result below into the CSV (the size of the file is in bytes).
FNumber  Name      Size
1        Save.png  6.89766E+11
I have tried using string, long, and double for the size variable, but it does not keep the text or number format, so the written size looks like the above, or sometimes it appends 000000 at the end.
I want the whole value to look like a number.
Please let me know how to set the text/number format before writing into the CSV.
Thanks in advance.
The following code should work
static void WriteCsvFile()
{
    FileInfo file = new FileInfo(@"<Path of the file>");
    StreamWriter writer = new StreamWriter(@"D:\output.csv", false);
    writer.WriteLine("{0},{1},{2}", "FNumber", "Name", "Size");
    // file.Length is a long, so "{2:D}" always writes a plain integer
    writer.WriteLine("{0},{1},{2:D}", "1", "Save.png", file.Length);
    writer.Close();
}
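If the size starts out as a double (which is what can produce the E+11 form), a fixed-point format string keeps it from rendering in scientific notation; a small sketch with made-up values:
double size = 689766000000d;
// "F0" forces fixed-point with no decimal places: 689766000000
writer.WriteLine("{0},{1},{2:F0}", "1", "Save.png", size);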

How to read a csv file one line at a time and replace/edit certain lines as you go?

I have a 60 GB CSV file I need to make some modifications to. The customer wants some changes to the file's data, but I don't want to regenerate the data in that file because it took 4 days to produce.
How can I read the file, line by line (not loading it all into memory!), and make edits to those lines as I go, replacing certain values etc.?
The process would be something like this:
1. Open a StreamWriter to a temporary file.
2. Open a StreamReader to the target file.
3. For each line:
   3.1. Split the text into columns based on a delimiter.
   3.2. Check the columns for the values you want to replace, and replace them.
   3.3. Join the column values back together using your delimiter.
   3.4. Write the line to the temporary file.
4. When you are finished, delete the target file, and move the temporary file to the target file path.
Note regarding Steps 2 and 3.1: If you are confident in the structure of your file and it is simple enough, you can do all this out of the box as described (I'll include a sample in a moment). However, there are factors in a CSV file that may need attention (such as recognizing when a delimiter is being used literally in a column value). You can drudge through this yourself, or try an existing solution, as sketched below.
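One ready-made option (my suggestion, not part of the original answer) is the TextFieldParser class from Microsoft.VisualBasic.FileIO, which handles delimiters inside quoted values for you (it requires a reference to the Microsoft.VisualBasic assembly):
using Microsoft.VisualBasic.FileIO;

using (var parser = new TextFieldParser(@"C:\data.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true; // treats "a,b" as a single column
    while (!parser.EndOfData)
    {
        string[] columns = parser.ReadFields();
        // inspect/replace column values here
    }
}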
Basic example just using StreamReader and StreamWriter:
var sourcePath = @"C:\data.csv";
var delimiter = ",";
var firstLineContainsHeaders = true;

var tempPath = Path.GetTempFileName();
var lineNumber = 0;
var splitExpression = new Regex(@"(" + delimiter + @")(?=(?:[^""]|""[^""]*"")*$)");

using (var writer = new StreamWriter(tempPath))
using (var reader = new StreamReader(sourcePath))
{
    string line = null;
    string[] headers = null;
    if (firstLineContainsHeaders)
    {
        line = reader.ReadLine();
        lineNumber++;
        if (string.IsNullOrEmpty(line)) return; // file is empty
        headers = splitExpression.Split(line).Where(s => s != delimiter).ToArray();
        writer.WriteLine(line); // write the original header to the temp file
    }
    while ((line = reader.ReadLine()) != null)
    {
        lineNumber++;
        var columns = splitExpression.Split(line).Where(s => s != delimiter).ToArray();
        // if there are no headers, do a simple sanity check to make sure you always have the same number of columns in a line
        if (headers == null) headers = new string[columns.Length];
        if (columns.Length != headers.Length) throw new InvalidOperationException(string.Format("Line {0} is missing one or more columns.", lineNumber));
        // TODO: search and replace in columns
        // example: replace 'v' in the first column with '\/':
        // if (columns[0].Contains("v")) columns[0] = columns[0].Replace("v", @"\/");
        writer.WriteLine(string.Join(delimiter, columns));
    }
}

File.Delete(sourcePath);
File.Move(tempPath, sourcePath);
Memory-mapped files are a feature introduced in .NET Framework 4 that can be used to edit large files.
Read here: http://msdn.microsoft.com/en-us/library/dd997372.aspx
or google "memory-mapped files".
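A minimal sketch (the path and offset are made up); note that a memory-mapped view only lets you overwrite bytes in place, so it suits fixed-length edits rather than insertions:
using System.IO.MemoryMappedFiles;
using System.Text;

using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\data.csv"))
using (var accessor = mmf.CreateViewAccessor())
{
    // overwrite 4 bytes at offset 100 with a same-length replacement
    byte[] replacement = Encoding.ASCII.GetBytes("ABCD");
    accessor.WriteArray(100, replacement, 0, replacement.Length);
}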
Just read the file, line by line, with StreamReader, and then use regex! The most amazing tool in the world.
using (var sr = new StreamReader(new FileStream(@"C:\temp\file.csv", FileMode.Open)))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // do stuff
    }
}
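For instance, a hypothetical replacement inside that loop, swapping a value in the third comma-separated column (the column layout and values are made up; needs using System.Text.RegularExpressions):
// ((?:[^,]*,){2}) captures the first two columns so ${1} can write them back unchanged
line = Regex.Replace(line, @"^((?:[^,]*,){2})OldValue(?=,|$)", "${1}NewValue");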

Explaining code

Hi, can anyone explain these lines of code? I need to understand how they work in order to proceed with what I am doing.
if (e.Error == null)
{
    Stream responseStream = e.Result;
    StreamReader responseReader = new StreamReader(responseStream);
    string response = responseReader.ReadToEnd();
    string[] split1 = Regex.Split(response, "},{");
    List<string> pri1 = new List<string>(split1);
    pri1.RemoveAt(0);
    string last = pri1[pri1.Count() - 1];
    pri1.Remove(last);
}
// Check if there was no error
if (e.Error == null)
{
    // Streams are a way to read/write information from/to somewhere
    // without having to manage buffer allocation and such
    Stream responseStream = e.Result;
    // StreamReader is a class making it easier to read from a stream
    StreamReader responseReader = new StreamReader(responseStream);
    // read everything that was written to the stream and convert it to a string using
    // the character encoding that was specified for the stream/reader
    string response = responseReader.ReadToEnd();
    // create an array from the string by using "},{" as the delimiter;
    // string.Split would be more efficient and more straightforward
    string[] split1 = Regex.Split(response, "},{");
    // create a list from the array. Lists make it easier to work with arrays
    // since you do not have to move elements manually or take care of allocations
    List<string> pri1 = new List<string>(split1);
    // remove the first item
    pri1.RemoveAt(0);
    // get the last item in the list. It would be more efficient to use the Count
    // property instead of the LINQ Count() extension method
    string last = pri1[pri1.Count() - 1];
    // remove the last item
    pri1.Remove(last);
}
I would use a LinkedList instead of List if the only thing to do was to remove the first and last elements.
It's reading the response stream as a string, making the assumption that the string consists of sequences "{...}" separated by commas, e.g.:
{X},{Y},{Z}
then splits the string on "},{", giving an array of
{X
Y
Z}
then drops the first element of the list ( {X ) and the last element ( Z} ) entirely, leaving only the fully delimited middle values ( Y ).
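A tiny runnable demo of that behavior (the values are made up):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class SplitDemo
{
    static void Main()
    {
        string response = "{X},{Y},{Z}";
        // Regex.Split on "},{" yields: "{X", "Y", "Z}"
        List<string> pri1 = new List<string>(Regex.Split(response, "},{"));
        pri1.RemoveAt(0);                  // drops "{X"
        pri1.Remove(pri1[pri1.Count - 1]); // drops "Z}"
        Console.WriteLine(string.Join(", ", pri1)); // prints: Y
    }
}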
From what I can see, it is reading from a stream that could have come from TCP.
It reads the whole chunk of data, then separates the chunk using the delimiter },{.
So if you have something like abc},{dec, it will be placed into the split1 array as 2 values: split1[0] = abc, split1[1] = dec.
After that, it basically removes the first and the last elements.
It is processing an error output.
It receives a stream from e (I guess it relates to an exception), and reads it.
It looks something like:
{DDD},{I failed},{Because},{There was no signal},{ENDCODE}
It splits it into separate strings, and removes the first and last entries (DDD, ENDCODE).
