Should I build a string first and then write to file? - c#

A program I am working on right now has to generate a file. Is it better for me to generate the file's contents as a string first and then write that string to the file, or should I just directly add the contents to the file?
Are there any advantages of one over the other?
The file will be about 0.5 - 1MB in size.

If you write to the file as you go, you get the benefit of not keeping everything in memory, provided the file is big enough to matter and you flush the stream regularly.
However, you'll be more likely to run into problems with a partially-written file, since you're doing your IO over a period of time instead of in a single shot.
Personally, I'd build it up using a StringBuilder, and then write it all to disk in a single shot.
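For reference, a minimal sketch of the write-as-you-go option with periodic flushing; the chunk count and BuildNextChunk helper are placeholders for illustration, not anything from the question:

// Sketch only: chunkCount and BuildNextChunk stand in for however you produce output.
using (var writer = new StreamWriter("output.txt"))
{
    for (int i = 0; i < chunkCount; i++)
    {
        writer.Write(BuildNextChunk(i)); // generate one piece of the file at a time
        if (i % 100 == 0)
            writer.Flush();              // push buffered text to disk as you go
    }
}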

I think it's a better idea, in general, to create a StreamWriter and just write to it. Why keep things in memory when you don't have to? And it's a whole lot easier. For example:
using (var writer = new StreamWriter("filename"))
{
    writer.WriteLine(header);
    // write all your data with Write and WriteLine,
    // taking advantage of composite formatting
}
If you want to build multiple lines with StringBuilder you have to write something like:
var sb = new StringBuilder();
sb.AppendLine(string.Format("{0:N0} blocks read", blocksRead));
// etc., etc.
// and finally write it to file
File.WriteAllText("filename", sb.ToString());
There are other options, of course. You could build the lines into a List<string> and then use File.WriteAllLines. Or you could write to a StringWriter and then write that to the file. But all of those approaches have you handling the data multiple times. Just open the StreamWriter and write.
The primary reasons I think it's a better idea in general to go directly to output:
You don't have to refactor your code when it turns out that your output data is too big to fit in memory.
The planned destination is the file anyway, so why fool with formatting it in memory before writing to the file?
The API for writing multiple lines to a text file is, in my opinion, cleaner than the API for adding lines to a StringBuilder.

I think it is better to use a string or StringBuilder to store your data; then you can write it to the file using the File.Write functions.

Related

Write as building table in HttpHandler?

I have to export very large files as an "excel export". Since .NET can't export Excel files, I went with simple HTML tables.
It works fine, but it's slow.
Is it possible to context.Response.Write each line as it's created, instead of building some super huge string and trying to export the whole thing once it's done?
I couldn't care less what function is used to do this, but I hope you know what I mean. I don't want to build a string in memory and then try to send it all at once. I'd rather export as I build the table.
Is this possible?
Thanks in advance!
Yes, using context.Response.Write on each line is just fine. If the reason for not wanting to build a large string is server memory use, then you'll need to turn off response buffering like so:
context.Response.BufferOutput = false;
Otherwise, .NET will just buffer your writes in memory until the end, anyway.
If the reason is execution time, then you may be experiencing performance hits from multiple string concatenations. In that case, you could use the StringBuilder class to construct the table instead.
For example...
StringBuilder sb = new StringBuilder();
// databaseRows and BuildTableRow stand in for your own data access and row formatting
foreach (var row in databaseRows)
{
    sb.AppendLine(BuildTableRow(row));
}
context.Response.Write(sb.ToString());
More info on response buffering:
http://msdn.microsoft.com/en-us/library/system.web.httpresponse.bufferoutput.aspx
More info on the StringBuilder class:
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder(v=vs.110).aspx

Reading a MemoryStream which contains multiple files

If I have a single MemoryStream to which I know I have sent multiple files (for example, 5 files), is it possible to read from this MemoryStream and break it apart file by file?
My gut is telling me no, since when we Read, we are reading byte by byte... Any help and a possible snippet would be great. I haven't been able to find anything on Google or here :(
You can't directly, not if you don't delimit the files in some way or know the exact size of each file as it was put into the buffer.
You can use a compressed file such as a zip file to transfer multiple files instead.
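If you're on .NET 4.5 or later, System.IO.Compression makes the zip suggestion straightforward; a minimal sketch (the PackFiles name is mine, not from the question):

using System.IO;
using System.IO.Compression;

static MemoryStream PackFiles(params string[] paths)
{
    var ms = new MemoryStream();
    using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
    {
        foreach (var path in paths)
        {
            var entry = zip.CreateEntry(Path.GetFileName(path));
            using (var entryStream = entry.Open())
            using (var file = File.OpenRead(path))
            {
                file.CopyTo(entryStream); // each file becomes its own named entry
            }
        }
    }
    ms.Position = 0; // rewind so the receiver can read the archive from the start
    return ms;
}

The receiver can then open the same stream with ZipArchiveMode.Read and pull the entries back out by name.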
A stream is just a sequence of bytes. If you put the files next to each other in the stream, you need to know how to separate them. That means you must know the length of each file, or you should have used some separator. Some (most) file types have a kind of header, but looking for headers in an entire stream may not be foolproof either, since the header of one file could just as well be data inside another file.
So, if you need to write files to such a stream, it is wise to add some extra information. For instance, start with a version number, then write the size of the first file, then the file itself, then the size of the next file, and so on.
By starting with a version number, you can make alterations to this format. In the future you may decide you need to store the file name as well. In that case, you can increase version number, make up a new format, and still be able to read streams that you created earlier.
This is of course especially useful if you store these streams too.
Since you're sending them, you'll have to send them into the stream in such a way that you'll know how to pull them out. The most common way of doing this is to use a length specification (a short sketch follows the steps below). For example, to write the files to the stream:
write an integer to the stream to indicate the number of files
Then for each file,
write an integer (or a long if the files are large) to indicate the number of bytes in the file
write the file
To read the files back,
read an integer (n) to determine the number of files in the stream
Then, iterating n times,
read an integer (or long if that's what you chose) to determine the number of bytes in the file
read the file
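A minimal sketch of that layout using BinaryWriter/BinaryReader; the method names are mine, and a real implementation might also prepend a version number, as the previous answer suggests:

using System.Collections.Generic;
using System.IO;

static void WriteFiles(Stream output, IList<byte[]> files)
{
    var writer = new BinaryWriter(output);   // not disposed, to leave the caller's stream open
    writer.Write(files.Count);               // number of files
    foreach (byte[] file in files)
    {
        writer.Write(file.Length);           // length prefix (use a long for very large files)
        writer.Write(file);                  // the file's bytes
    }
    writer.Flush();
}

static List<byte[]> ReadFiles(Stream input)
{
    var reader = new BinaryReader(input);
    int count = reader.ReadInt32();          // number of files
    var files = new List<byte[]>(count);
    for (int i = 0; i < count; i++)
    {
        int length = reader.ReadInt32();     // this file's length prefix
        files.Add(reader.ReadBytes(length)); // read exactly that many bytes
    }
    return files;
}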
You could use an IEnumerable<Stream> instead.
You need to implement this yourself. What you would want to do is write some sort of delimiter into the stream. As you're reading, look for that delimiter, and you'll know when you have hit a new file.
Here's a quick and dirty example:
byte[] delimiter = System.Text.Encoding.Default.GetBytes("++MyDelimiter++");
// Writing: put the delimiter between the files
ms.Write(myFirstFile, 0, myFirstFile.Length);
ms.Write(delimiter, 0, delimiter.Length);
ms.Write(mySecondFile, 0, mySecondFile.Length);
....
// Reading: scan for the delimiter to know where one file ends and the next begins
// (SequenceEqual needs System.Linq; a robust version would also handle a delimiter
// that straddles two reads)
byte[] buffer = new byte[delimiter.Length];
int len;
do {
    len = ms.Read(buffer, 0, buffer.Length);
    if (len == delimiter.Length && buffer.SequenceEqual(delimiter))
    {
        // Close the current output file and open a new one
    }
    else
    {
        // Write the bytes just read to the current output stream
    }
} while (len > 0);

Handling strings more than 2 GB

I have an application where an XLS file with lots of data entered by the user is opened and the data in it is converted to XML. I have already mapped the columns in the XLS file to XML Maps. When I try to use the ExportXml method in XMLMaps, I get a string with the proper XML representation of the XLS file. I parse this string a bit and upload it to my server.
The problem is, when my XLS file is really large, the string produced for the XML is over 2 GB and I get an OutOfMemoryException. I understand that the limit for CLR objects is 2 GB, but in my case I need to handle this scenario. Presently I just display a message asking the user to send less data.
Any ideas on how I can do this?
EDIT:
This is just the gist of the operations I need to perform on the generated XML.
Remove certain fields which are not needed for the server data.
Add something like ID numbers for each row of data.
Modify the values of certain elements.
Do validation on the data.
While streaming with an XmlReader is a good idea, I cannot perform these operations by that method alone. While data validation can be done by Excel itself, the other things cannot be done there.
Using an XmlTextReader and XmlTextWriter and creating a custom method for each of the steps is a solution I had thought of. But to cover the gist above, it would require the XML document to be gone through or processed four times, which is just not efficient.
If the XML is that large, then you might be able to use Export to a temporary file, rather than using ExportXML to a string - http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.xmlmap.export.aspx
If you then need to parse/handle the XML in C#, then for handling such large XML structures, you'll probably be better off implementing a custom XMLReader (or XMLWriter) which works at the stream level. See this question for some similar advice - What is the best way to parse large XML (size of 1GB) in C#?
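As a rough illustration of what that stream-level approach can look like, here is a hedged sketch of a single-pass XmlReader-to-XmlWriter copy that drops unwanted elements and stamps an ID on each row; the element names "Row" and "InternalField" are invented for the example, and attribute/namespace handling is omitted:

using System.Xml;

static void TransformXml(string inputPath, string outputPath)
{
    using (XmlReader reader = XmlReader.Create(inputPath))
    using (XmlWriter writer = XmlWriter.Create(outputPath))
    {
        int rowId = 0;
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "InternalField")
            {
                reader.Skip();   // drop this field and its subtree; Skip() already advances
                continue;
            }
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    bool isEmpty = reader.IsEmptyElement;
                    writer.WriteStartElement(reader.Name);
                    if (reader.Name == "Row")                 // add an ID number to each row
                        writer.WriteAttributeString("Id", (++rowId).ToString());
                    if (isEmpty)
                        writer.WriteEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);         // modify values here as needed
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteEndElement();
                    break;
            }
            reader.Read();
        }
    }
}

Because remove, renumber, modify, and validate all happen while copying, the document is only processed once instead of four times.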
I guess there is no other way than using a 64-bit OS and the 64-bit Framework if you really need to hold the whole thing in RAM, but processing the data some other way, as suggested by Stuart, is probably the better way to go...
What you need to do is to use "stream chaining", i.e. you open up an input stream which reads from your excel file and an output stream that writes to your xml file. Then your conversion class/method will take the two streams as input and read sufficient data from the input stream to be able to write to the output.
Edit: a very simple, minimal example.
Converting from a file containing:
123
1244125
345345345
4566
11
to
<List>
<ListItem>123</ListItem>
<ListItem>1244125</ListItem>
...
</List>
using
void Convert(Stream fromStream, Stream toStream)
{
    using (StreamReader from = new StreamReader(fromStream))
    using (StreamWriter to = new StreamWriter(toStream))
    {
        to.WriteLine("<List>");
        while (!from.EndOfStream)
        {
            string bulk = from.ReadLine(); // in this case, a single line is sufficient
            // some code to parse the bulk or clean it up, e.g. remove '\r\n'
            to.WriteLine(string.Format("<ListItem>{0}</ListItem>", bulk));
        }
        to.WriteLine("</List>");
    }
}

Convert(File.OpenRead("source.xls"), File.OpenWrite("source.xml"));
Of course you could do this in a much more elegant, abstract manner, but this is only to show my point.

When should I slurp a file, and when should I read it by-line?

Imagine that I have a C# application that edits text files. The technique employed for each file can be either:
1) Read the file at once into a string, make the changes, and write the string over the existing file:
string fileContents = File.ReadAllText(fileName);
// make changes to fileContents here...
using (StreamWriter writer = new StreamWriter(fileName))
{
    writer.Write(fileContents);
}
2) Read the file by line, writing the changes to a temp file, then deleting the source and renaming the temp file:
using (StreamReader reader = new StreamReader(fileName))
{
    string line;
    using (StreamWriter writer = new StreamWriter(fileName + ".tmp"))
    {
        while (!reader.EndOfStream)
        {
            line = reader.ReadLine();
            // make changes to line here
            writer.WriteLine(line);
        }
    }
}
File.Delete(fileName);
File.Move(fileName + ".tmp", fileName);
What are the performance considerations with these options?
It seems to me that whether reading by line or reading the entire file at once, the same quantity of data will be read, and disk times will dominate the memory alloc times. That said, once a file is in memory, the OS is free to page it back out, and when it does so the benefit of that large read has been lost. On the other hand, when working with a temporary file, once the handles are closed I need to delete the old file and rename the temp file, which incurs a cost.
Then there are questions around caching, and prefetching, and disk buffer sizes...
I am assuming that in some cases, slurping the file is better, and in others, operating by line is better. My question is, what are the conditions for these two cases?
in some cases, slurping the file is better, and in others, operating by line is better.
Very nearly; except that reading line-by-line is actually a much more specific case. The actual choices we want to distinguish between are ReadAll and using a buffer. ReadLine makes assumptions - the biggest one being that the file actually has lines, and they are a reasonable length! If we can't make this assumption about the file, we want to choose a specific buffer size and read into that, regardless of whether we've reached the end of a line or not.
So, deciding between reading it all at once and using a buffer: always go with the easiest and most naive approach to implement, until you run into a specific situation that does not work for you. Then, having a concrete case, you can make an educated decision based on the information you actually have, rather than speculating about hypothetical situations.
Simplest - read it all at once.
Is performance becoming a problem? Does this application run against uncontrolled files, so their size is not predictable? These are just a few examples of cases where you would want to chunk it.
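For completeness, a minimal sketch of the buffered alternative mentioned above, which makes no assumption that the file has lines at all (the 64K buffer size is just a placeholder to measure against):

using System.IO;

static void ProcessInChunks(string fileName)
{
    byte[] buffer = new byte[64 * 1024];   // fixed-size buffer, independent of line length
    using (FileStream stream = File.OpenRead(fileName))
    {
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // process buffer[0..read); note that a "line" may span two chunks
        }
    }
}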

What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that will suit my needs, and the custom libraries that people have written for this seem to have some flaws when it comes to gigantic input (a 4GB file with some field values very easily having several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the FileHelpers API. It's .NET and open source. It's extremely high-performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };

    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O completion ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don't use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
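If you stay in managed code, the usual way to get completion-port-backed file I/O without touching APCs is to open the FileStream with FileOptions.Asynchronous; a hedged sketch (assumes .NET 4.5+ for ReadAsync):

using System.IO;
using System.Threading.Tasks;

static async Task ReadInChunksAsync(string path)
{
    byte[] buffer = new byte[8192];
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 8192, FileOptions.Asynchronous))
    {
        int read;
        while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            // hand buffer[0..read) to your parser; no thread blocks while the read is pending
        }
    }
}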
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
You mention that some fields are very, very big; if you try to read them in their entirety into memory, you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
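To make the state-machine idea a bit more concrete, here is a rough sketch (not the EddReader above) that walks the input one byte at a time; it assumes 'þ' quotes every field, a single-byte encoding, and that holding one field in memory at a time is acceptable:

using System.Collections.Generic;
using System.IO;
using System.Text;

class FieldScanner
{
    const byte Quote = 0xFE;                 // 'þ' in Latin-1/Windows-1252; adjust for your encoding
    readonly Stream _stream;

    public FieldScanner(Stream stream) { _stream = stream; }

    // Yields one field value at a time; the FileStream underneath does the buffering.
    public IEnumerable<string> ReadFields()
    {
        var field = new StringBuilder();
        bool inField = false;
        int b;
        while ((b = _stream.ReadByte()) != -1)
        {
            if (b == Quote)
            {
                if (inField) { yield return field.ToString(); field.Clear(); }
                inField = !inField;          // toggle on opening/closing quote
            }
            else if (inField)
            {
                field.Append((char)b);       // single-byte encoding assumed
            }
            // bytes outside quotes (the field/record delimiters) are ignored here
        }
    }
}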
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.
