TextWriter/StreamWriter high memory usage - c#

I have a console app that reads in a large text file with 40k+ lines; each line is a key that I use in a search, and the results are written to an output file. The issue is that when I leave this console app running for a while, it eventually just closes, and I noticed the process memory usage was really high - it was sitting at about 1.6 GB when I last saw it crash.
I looked around and didn't find many answers. I did try gcAllowVeryLargeObjects, but that feels like I'm just dodging the problem.
Below is a snippet from my Main() showing where I write out to the file. I can't seem to understand why the memory usage gets so high. I flush the writer after every write (could it be because I'm keeping the file open for such a long period of time?).
TextWriter writer = new StreamWriter("output.csv", false);
foreach (var item in list)
{
    Console.WriteLine("{0}/{1}", count, numofitem);
    var result = TableServiceContext.Read(item.id);
    if (result != null)
    {
        writer.WriteLine(String.Join(",", result.id,
                                          result.code,
                                          result.hash));
    }
    count++;
    writer.Flush();
}
writer.Close();
Edit: I have 32 GB of RAM on my computer, so I am sure it's not simply running out of physical memory.
Edit2: changed the name of the repository as that was misleading.

If the average line length is 1 KB, then 40K lines is only about 40 MB, which is nothing. That's why I'm pretty sure the problem is in your repository class. If it is an EF repository, try re-creating the DbContext for each line.
If you want to tune the program further, you can use the following method: write timestamps to the console output (the Stopwatch class works well for this), and try re-creating your repository every 10, 100, or N lines. Then, looking at the timestamps, you can find the optimal N to use.
var timer = Stopwatch.StartNew();
...
Console.WriteLine(timer.ElapsedMilliseconds);
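Putting the two ideas together, a rough sketch might look like the following. The repository type, its parameterless constructor, and its Dispose are assumptions (your TableServiceContext may need a different re-creation pattern); the Read and WriteLine calls mirror your snippet.
// using System.Diagnostics; using System.IO;
// Sketch only: re-create the repository every N lines and log how long each batch took.
const int N = 100;                          // tune N using the timestamps below
var timer = Stopwatch.StartNew();
var context = new TableServiceContext();    // assumed: your repository can be re-created like this
int count = 0;

using (var writer = new StreamWriter("output.csv", false))
{
    foreach (var item in list)
    {
        if (count > 0 && count % N == 0)
        {
            Console.WriteLine("{0} items, {1} ms", count, timer.ElapsedMilliseconds);
            context.Dispose();              // assumed: disposing drops whatever the repository has cached
            context = new TableServiceContext();
        }

        var result = context.Read(item.id);
        if (result != null)
        {
            writer.WriteLine(string.Join(",", result.id, result.code, result.hash));
        }
        count++;
    }
}
context.Dispose();
If memory stays flat with a small N and climbs with a large one, the repository is holding on to the results it has read.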

From looking at the code, I think the problem isn't the StreamWriter but some memory leak in your repository. Suggestions to check:
Replace the repository with a dummy, e.g. a class with just the three properties id, code, hash (a sketch follows after this list).
Likewise, create a long "list", e.g. 40k small entries.
Run your program and see if it still consumes memory (I am pretty sure it will not).
Then, step by step, add back your original parts and see which step causes the memory leak.
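For the first step, the dummy could be as small as this (property names follow the write loop in the question; purely illustrative):
// Illustrative stand-in for the real repository, so the StreamWriter loop
// can be tested in isolation.
class DummyResult
{
    public string id { get; set; }
    public string code { get; set; }
    public string hash { get; set; }
}

class DummyRepository
{
    // Returns a small fixed object instead of hitting the real data store.
    public DummyResult Read(string key)
    {
        return new DummyResult { id = key, code = "some-code", hash = "some-hash" };
    }
}

// A fake 40k-entry key list to drive the loop (requires System.Linq):
var keys = Enumerable.Range(0, 40000).Select(i => i.ToString()).ToList();
If memory stays low with the dummy in place, the leak is in the real repository, not in the writer.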

Related

Understanding VS performance analysis

I have a C# assembly which processes retail promotions. It is able to process a promotion that has 1,288 qualifying products in just 7 seconds. However, when it is tasked with processing a promotion with a larger number of qualifying products, the time taken increases exponentially with the number of products. For example, a promo with 29,962 products takes 7 minutes 7 seconds, and a promo with 77,350 products takes 39 minutes 7 seconds.
I've been trying to identify whether there's code in the assembly that can be easily optimized. I set the assembly processing the largest of the promotions, then attached the performance analyzer to the containing process (a BizTalk host instance), which produced a profiling report.
The report suggests that the function taking the greatest amount of time is "GetDataPromoLines". This function contains simple string formatting. It is called from the following loop in the function "MapForFF":
foreach (var promoLine in promoLineChunk.PromoLines)
{
    outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber + 1);
    promoLineNumber++;
}
promoLineChunk.PromoLines is a List of a class which describes the promotion; it contains only private strings, one for each column of the database table from which the promotion details were selected. The content of the "GetDataPromoLines" function can be seen below:
private string GetDataPromoLines(VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
    StringBuilder sb = new StringBuilder();
    string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
    string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;

    sb.AppendLine(string.Format("PromoDiscount,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\"",
        uniqueNumber,
        promoLine.CIMS_PROMO_NAME,
        promoLine.TYPE,
        promoLine.DESCRIPTION_,
        promoLine.DISCOUNTLEVEL,
        promoLine.COUPONNUMBERMIN,
        promoLine.COUPONNUMBERMAX,
        promoLine.COUPONNUMBERLENGTH
        ));

    sb.AppendLine(string.Format("ItemReq,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
        "00001",
        promoLine.IDENTITYTYPE,
        promoLine.ITEMNUM,
        promoLine.DIVISIONNUM,
        promoLine.DEPARTMENTNUM,
        promoLine.DEPTGROUPNUM,
        promoLine.CLASSNUM,
        promoLine.ITEMGROUPNUM,
        promoLine.IR_QUANTITY
        ));

    sb.AppendLine(string.Format("TierDefinition,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
        "00001",
        promoLine.THRESHOLDTYPE,
        promoLine.THRESHOLDQTY,
        promoLine.THRESHOLDAMT,
        promoLine.DISCTYPE,
        promoLine.DISCPCT,
        promoLine.DISCAMT,
        promoLine.DISCAPPLIESTO,
        promoLine.DISCQTY,
        promoLine.ADDLINFO
        ));

    return sb.ToString();
}
Can anyone suggest what is causing the exponential increase in time to process? Is it something to do with CLR unboxing?
outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
Is that an attempt to build an entire output file by appending strings? There's your Schlemiel the Painter's algorithm.
For cases like this, you really want to use StringBuilder (or even better, output directly into a file stream using StreamWriter or something):
StringBuilder outputFile = new StringBuilder();
foreach (var promoLine in promoLineChunk.PromoLines)
{
    outputFile.Append(GetDataPromoLines(promoLine, promoLineNumber + 1));
    promoLineNumber++;
}
The problem with simple appends is that string is immutable in .NET - every time you modify it, it is copied over. For things like outputting huge text files, this is incredibly costly, of course - you spend most of your time copying the parts of the string that didn't change.
In the same way, don't do sb.AppendLine(string.Format(...)); simply use sb.AppendFormat (followed by AppendLine() for the line break). Ideally, pass the StringBuilder in as an argument, to avoid having to copy over the lines themselves - although that should be a relatively insignificant performance hit next to the outputFile += ....
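A hedged sketch of that last suggestion, with the helper renamed and most columns elided (the property names are copied from the question):
// Sketch: append straight into the caller's StringBuilder instead of
// returning a new string per promo line.
private static void AppendDataPromoLines(StringBuilder sb,
    VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
    string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
    string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;

    sb.AppendFormat("PromoDiscount,\"{0}\",\"{1}\",\"{2}\"",
        uniqueNumber, promoLine.CIMS_PROMO_NAME, promoLine.TYPE);
    sb.AppendLine();
    // ... ItemReq and TierDefinition lines exactly as in the original method ...
}

// Caller: one StringBuilder for the whole chunk, no string concatenation.
var output = new StringBuilder();
foreach (var promoLine in promoLineChunk.PromoLines)
{
    AppendDataPromoLines(output, promoLine, promoLineNumber + 1);
    promoLineNumber++;
}
Better still, hand the method a StreamWriter and skip the in-memory copy of the file entirely.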
As a side-note, be careful when interpreting the results of profiling - it's often subtly misleading. In your case, I'm pretty certain your problem is not in GetDataPromoLines itself (although even that could be improved, as seen above), but in the outputFile += .... It's not enough to just look at the function with the highest exclusive samples. It's also not enough to just look at the hot path, although that's already a huge step-up that usually leads you straight where your attention is needed. Also, understand the difference between sampling and instrumentation - sampling can often lead you to try optimizing a method that's not really a performance problem on its own - rather, it simply shouldn't be called as often as it is. Do not use profiler results as blindfolds - you still need to pay attention to what actually makes sense.

Reading very large text files, should I be incorporating async?

I have been challenged with producing a method that will read very large text files into a program; these files can range from 2 GB to 100 GB.
The idea so far has been to read, say, a couple of thousand lines of text at a time into the method.
At the moment the program is set up using a StreamReader, reading the file line by line and processing the necessary areas of data found on that line.
using (StreamReader reader = new StreamReader("FileName"))
{
    // rec, d, order, IDs, NewPack, xmldata and DataList are declared in the
    // surrounding code that was not posted.
    string nextline = reader.ReadLine();
    string textline = null;
    while (nextline != null)
    {
        textline = nextline;
        Row rw = new Row();
        var property = from matchID in xmldata
                       from matching in matchID.MyProperty
                       where matchID.ID == textline.Substring(0, 3).TrimEnd()
                       select matching;
        string IDD = textline.Substring(0, 3).TrimEnd();
        foreach (var field in property)
        {
            Field fl = new Field();
            fl.Name = field.name;
            fl.Data = textline.Substring(field.startByte - 1, field.length).TrimEnd();
            fl.Order = order;
            fl.Show = true;
            order++;
            rw.ID = IDD;
            rw.AddField(fl);
        }
        rec.Rows.Add(rw);
        nextline = reader.ReadLine();
        if ((nextline == null) || (NewPack == nextline.Substring(0, 3).TrimEnd()))
        {
            d.ID = IDs.ToString();
            d.Records.Add(rec);
            IDs++;
            DataList.Add(d.ID, d);
            rec = new Record();
            d = new Data();
        }
    }
}
The program goes on further and populates a class (I just decided not to post the rest).
I know that once the program is given an extremely large file, out-of-memory exceptions will occur.
So that is my current problem. So far I have been googling several approaches, with many people just answering "use a StreamReader and reader.ReadToEnd"; I know ReadToEnd won't work for me, as I will get those memory errors.
Finally, I have been looking into async as a way of creating a method that will read a certain number of lines and wait for a call before processing the next batch.
This brings me to my problem: I am struggling to understand async, and I can't seem to find any material that will help me learn, so I was hoping someone here could help me out with a way to understand it.
Of course, if anyone knows of a better way to solve this problem, I am all ears.
EDIT: Added the remainder of the code to put an end to any confusion.
Your problem isn't synchronous vs. asynchronous; it's that you're reading the entire file and storing parts of it in memory before you do something with that data.
If you were reading each line, processing it, and writing the result to another file/database, then StreamReader would let you process multi-GB (or TB) files.
There's only a problem if you're storing portions of the file until you finish reading it; then you can run into memory issues (but you'd be surprised how large you can let Lists and Dictionaries get before you run out of memory).
What you need to do is save your processed data as soon as you can, and not keep it in memory (or keep as little in memory as possible).
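As a rough sketch of that shape (the "processing" here is just a placeholder for the real field extraction):
// Sketch only: read a line, process it, write the result, keep nothing around.
using (var reader = new StreamReader("FileName"))
using (var writer = new StreamWriter("Output.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string id = line.Substring(0, 3).TrimEnd();   // placeholder processing
        writer.WriteLine(id);                         // persist immediately
    }
}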
With files that large you may need to keep your working set (your processing data) in a database - possibly something like SQL Server Express or SQLite would do (but again, it depends on how large your working set gets).
Hope this helps, don't hesitate to ask further questions in the comments, or edit your original question, I'll update this answer if I can help in any way.
Update - Paging/Chunking
You need to read the text file in chunks of one page and allow the user to scroll through the "pages" in the file. As the user scrolls, you read and present them with the next page.
Now, there are a couple of things you can do to help yourself. Always keep about 10 pages in memory; this allows your app to stay responsive if the user pages up/down a couple of pages very quickly. In the application's idle time (the Application.Idle event) you can read in the next few pages, throwing away pages that are more than five pages before or after the current page.
Paging backwards is a problem, because you don't know where each line begins or ends in the file, and therefore you don't know where each page begins or ends. So for paging backwards, as you read down through the file, keep a list of offsets to the start of each page (Stream.Position); then you can quickly Seek to a given position and read the page in from there.
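A minimal sketch of that bookkeeping, assuming UTF-8/ASCII text, '\n' line endings, and a fixed number of lines per page (scanning the raw bytes avoids the StreamReader's read-ahead buffering throwing the positions off):
// First pass: record the byte offset at which each page starts.
const int LinesPerPage = 1000;
var pageOffsets = new List<long> { 0 };          // page 0 starts at byte 0

using (var fs = File.OpenRead("FileName"))
{
    int b, lineCount = 0;
    while ((b = fs.ReadByte()) != -1)
    {
        if (b == '\n' && ++lineCount % LinesPerPage == 0)
            pageOffsets.Add(fs.Position);        // next page starts right after this newline
    }
}

// Paging backwards or forwards is then just a Seek to the recorded offset.
IEnumerable<string> ReadPage(string path, long offset, int lines)
{
    using (var fs = File.OpenRead(path))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        using (var reader = new StreamReader(fs))
        {
            string line;
            for (int i = 0; i < lines && (line = reader.ReadLine()) != null; i++)
                yield return line;
        }
    }
}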
If you need to allow the user to search through the file, then you pretty much read through the file line by line (remembering the page offsets as you go) looking for the text; when you find something, read in and present them with that page.
You can speed everything up by pre-processing the file into a database; there are grid controls that will work off a dynamic dataset (they will do the paging for you), and you get the benefit of built-in searches/filters.
So, from a certain point of view, this is reading the file asynchronously, but that's from the user's point of view. From a technical point of view, we tend to mean something else when we talk about doing something asynchronously in programming.

C# how to grep through a ~300 MB log file quickly

I'm trying to read in a log file in C# that's huge - approx 300 MB of raw text data. I've been testing my program on smaller files of approx 1 MB, storing all log messages in a string[] array and searching with Contains.
However, that is too slow and takes up too much memory; I will never be able to process the 300 MB log file that way. I need a way to grep the file - something that quickly filters through it, finds the useful data, and prints the line of log information corresponding to the search.
The big question is scale. I think 300 MB will be my max, but I need my program to handle it. What functions, data structures, and searching techniques can I use that will scale well in speed and efficiency for reading a log file that big?
File.ReadLines is probably your best bet as it gives you an IEnumerable of lines of the text file and reads them lazily as you iterate over the IEnumerable. You can then use whatever method for searching the line you'd like to use (Regex, Contains, etc) and do something with it. My example below spawns a thread to search the line and output it to the console, but you can do just about anything. Of course, TEST, TEST, TEST on large files to see your performance mileage. I imagine if each individual thread spawned below takes too long, you can run into a thread limit.
IEnumerable<string> lines = File.ReadLines("myLargeFile.txt");
foreach (string line in lines) {
    string lineInt = line;
    (new Thread(() => {
        if (lineInt.Contains(keyword)) {
            Console.WriteLine(lineInt);
        }
    })).Start();
}
EDIT: Through my own testing, this is obviously faster:
foreach (string lineInt in File.ReadLines("myLargeFile.txt").Where(lineInt => lineInt.Contains(keyword))) {
    Console.WriteLine(lineInt);
}
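If Contains isn't flexible enough, the same lazy enumeration works with a compiled Regex, and matches can go straight to an output file instead of the console; a rough sketch (the pattern and file names are placeholders):
// Sketch: stream the log lazily, filter with a compiled Regex, write matches out.
var pattern = new Regex(@"ERROR\s+\d+", RegexOptions.Compiled);   // placeholder pattern

using (var output = new StreamWriter("matches.txt"))
{
    foreach (string line in File.ReadLines("myLargeFile.txt"))
    {
        if (pattern.IsMatch(line))
            output.WriteLine(line);
    }
}
Because File.ReadLines never holds more than one line at a time, memory stays flat regardless of the log size.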

.net section running real slow

Update: The answers from Andrew and Conrad were both equally helpful. The easy fix for the timing issue fixed the problem, and caching the bigger object references instead of re-building them every time removed the source of the problem. Thanks for the input, guys.
I'm working with a c# .NET API and for some reason the following code executes what I feel is /extremely/ slowly.
This is the handler for a System.Timers.Timer that raises its Elapsed event every 5 seconds.
private static void TimerGo(object source, System.Timers.ElapsedEventArgs e)
{
    tagList = reader.GetData(); // This is a collection of 10 objects.
    storeData(tagList);         // This calls the 'storeData' method below
}
And the storeData method:
private static void storeData(List<obj> tagList)
{
    TimeSpan t = (DateTime.UtcNow - new DateTime(1970, 1, 1));
    long timestamp = (long)t.TotalSeconds;
    foreach (var tag in tagList)
    {
        string file = @"path\to\file" + tag.name + ".rrd";
        RRD dbase = RRD.load(file);
        // Update rrd with current time timestamp and data.
        dbase.update(timestamp, new object[1] { tag.data });
    }
}
Am I missing some glaring resource sink? The RRD stuff you see is from the NHawk C# wrapper for rrdtool; in this case I update 10 different files with it, but I see no reason why it should take so long.
When I say 'so long', I mean the timer was triggering a second time before the first update was done, so eventually "update 2" would happen before "update 1", which breaks things because "update 1" has a timestamp that's earlier than "update 2".
I increased the timer length to 10 seconds, and it ran for longer, but still eventually out-raced itself and tried to update a file with an earlier timestamp. What can I do differently to make this more efficient, because obviously I'm doing something drastically wrong...
This doesn't really answer your perf question, but if you want to fix the re-entrancy bit, set your timer's AutoReset property to false and then call Start() at the end of the method, e.g.:
private static void TimerGo(object source, System.Timers.ElapsedEventArgs e)
{
    tagList = reader.GetData(); // This is a collection of 10 objects.
    storeData(tagList);         // This calls the 'storeData' method below
    timer.Start();
}
Is there a different RRD file for each tag in your tagList? In your pseudo-code you open each file N number of times, once per timer tick. (You stated there are only 10 objects in the list, though.) Then you perform an update. I can only assume that you dispose of your RRD file after you have updated it; if you do not, you are keeping references to an open file.
If the RRD is the same, but you are just putting different types of plot data into a single file, then you only need to keep it open for as long as you want exclusive write access to it.
Without profiling the code you have a few options (I recommend profiling, by the way):
Keep the RRD files open
Cache the opened files to prevent having to open, write, and close every 5 seconds for each file. Just cache the 10 opened file references and write to them every 5 seconds - see the sketch below.
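Something along these lines, reusing the RRD.load/update calls from the question (whether NHawk's RRD handles can safely be held open between ticks is an assumption you'd need to verify):
// Sketch: load each RRD once and reuse the handle on every timer tick.
private static readonly Dictionary<string, RRD> rrdCache = new Dictionary<string, RRD>();

private static RRD GetRrd(string name)
{
    RRD dbase;
    if (!rrdCache.TryGetValue(name, out dbase))
    {
        dbase = RRD.load(@"path\to\file" + name + ".rrd");
        rrdCache[name] = dbase;
    }
    return dbase;
}

// In storeData: GetRrd(tag.name).update(timestamp, new object[1] { tag.data });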
Separate the data collection from data writing
It appears you are taking metric samples from some object every 5 seconds. If you do not have something 'tailing' your file, separate the collection from the writing: take your data sample and throw it into a queue to be processed. The processor will dequeue each tagList and write it as fast as it can, going back to the queue for more lists.
This way you can always be sure you are getting ~5 second samples, even if the writing mechanism slows down.
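One hedged way to wire that up is a BlockingCollection: the timer handler only enqueues the sample, and a single background thread drains the queue (List<obj> stands in for whatever your tagList type really is):
// Sketch: producer/consumer split between sampling and writing.
// Requires System.Collections.Concurrent and System.Threading.
private static readonly BlockingCollection<List<obj>> queue = new BlockingCollection<List<obj>>();

private static void TimerGo(object source, System.Timers.ElapsedEventArgs e)
{
    queue.Add(reader.GetData());   // sample and enqueue; never touch the files here
}

private static void StartWriter() // call once at startup
{
    new Thread(() =>
    {
        foreach (var tagList in queue.GetConsumingEnumerable())
            storeData(tagList);    // your existing method, running as fast as it can
    }) { IsBackground = true }.Start();
}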
Use a profiler. JetBrains is my personal recommendation. Run the profiler with your program and look for the threads / methods taking the longest time to run. This sounds very much like an IO or data issue, but that's not immediately obvious from your example code.

How to efficiently write a large text file in C#?

I am creating a method in C# which generates a text file for a Google Product Feed. The feed will contain upwards of 30,000 records and the text file currently weighs in at ~7 MB.
Here's the code I am currently using (some lines removed for brevity's sake).
public static void GenerateTextFile(string filePath) {
    var sb = new StringBuilder(1000);

    sb.Append("availability").Append("\t");
    sb.Append("condition").Append("\t");
    sb.Append("description").Append("\t");
    // repetitive code hidden for brevity ...
    sb.Append(Environment.NewLine);

    var items = inventoryRepo.GetItemsForSale();

    foreach (var p in items) {
        sb.Append("in stock").Append("\t");
        sb.Append("used").Append("\t");
        sb.Append(p.Description).Append("\t");
        // repetitive code hidden for brevity ...
        sb.AppendLine();
    }

    using (StreamWriter outfile = new StreamWriter(filePath)) {
        result.Append("Writing text file to disk.").AppendLine(); // 'result' is a logging StringBuilder defined elsewhere
        outfile.Write(sb.ToString());
    }
}
I am wondering if StringBuilder is the right tool for the job. Would there be performance gains if I used a TextWriter instead?
I don't know a ton about IO performance so any help or general improvements would be appreciated. Thanks.
File I/O operations are generally well optimized in modern operating systems. You shouldn't try to assemble the entire string for the file in memory ... just write it out piece by piece. The FileStream will take care of buffering and other performance considerations.
You can make this change easily by moving:
using (StreamWriter outfile = new StreamWriter(filePath)) {
to the top of the function, getting rid of the StringBuilder and writing directly to the file instead.
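Roughly like this, keeping the names from the question and leaving the hidden columns elided:
// Sketch: stream each record straight to disk instead of building one big string.
public static void GenerateTextFile(string filePath) {
    using (StreamWriter outfile = new StreamWriter(filePath)) {
        outfile.Write("availability\t");
        outfile.Write("condition\t");
        outfile.Write("description\t");
        // repetitive code hidden for brevity ...
        outfile.WriteLine();

        var items = inventoryRepo.GetItemsForSale();
        foreach (var p in items) {
            outfile.Write("in stock\t");
            outfile.Write("used\t");
            outfile.Write(p.Description);
            outfile.Write("\t");
            // repetitive code hidden for brevity ...
            outfile.WriteLine();
        }
    }
}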
There are several reasons why you should avoid building up large strings in memory:
It can actually perform worse, because the StringBuilder has to increase its capacity as you write to it, resulting in reallocation and copying of memory.
It may require more memory than you can physically allocate, which may result in the use of virtual memory (the swap file), which is much slower than RAM.
For truly large files (> 2 GB) you will run out of address space (on 32-bit platforms) and will never be able to complete.
To write the StringBuilder contents to a file you have to use ToString() which effectively doubles the memory consumption of the process since both copies must be in memory for a period of time. This operation may also fail if your address space is sufficiently fragmented, such that a single contiguous block of memory cannot be allocated.
Just move the using statement so it encompasses the whole of your code, and write directly to the file. I see no point in keeping it all in memory first.
Write one string at a time using StreamWriter.Write rather than caching everything in a StringBuilder.
This might be old, but I had a file to write with about 17 million lines,
so I ended up batching the writes every 10k lines, similar to the code below.
for (i6 = 1; i6 <= ball; i6++)
{
    // this is the middle of a 6-deep nest ..
    counter++;

    // modulus to get a value every so often (10k lines)
    divtrue = counter % 10000; // remainder operator % for 10k

    // build the string of fields with \n at the end
    lineout = lineout + whatever; // 'whatever' stands in for the real field string

    // the magic 10k block here
    if (divtrue.Equals(0))
    {
        using (StreamWriter outFile = new StreamWriter(filepath, true))
        {
            // write the 10k lines with .Write, NOT WriteLine..
            outFile.Write(lineout);
        }

        // reset the string so we don't do something silly like overflow memory
        lineout = "";
    }
}
In my case it was MUCH faster than writing one line at a time.
