Write as building table in HttpHandler? - c#

I have to export very large files as an "excel export". Since .NET can't export excel files, I went with simple html tables.
It works fine, but it's slow.
Is it possible to context.response.write each line as they're being created instead of building some super huge string and trying to export the whole thing once it's done?
I couldn't care less which function is used to do this, but I hope you know what I mean. I don't want to build a string in memory and then try to send it all at once. I'd rather export as I build the table.
Is this possible?
Thanks in advance!

Yes, using context.Response.Write on each line is just fine. If the reason for not wanting to build a large string is server memory use, then you'll need to turn off response buffering like so:
context.Response.BufferOutput = false;
Otherwise, .NET will just buffer your writes in memory until the end, anyway.
If the reason is execution time, then you may be experiencing performance hits from multiple string concatenations. In that case, you could use the StringBuilder class to construct the table instead.
For example...
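// "rows" is a placeholder for whatever your database query returns,
// one ready-made HTML table row (as a string) per item.
var sb = new StringBuilder();
foreach (string row in rows)
{
    sb.AppendLine(row);
}
// Send the finished table in a single write.
context.Response.Write(sb.ToString());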
More info on response buffering:
http://msdn.microsoft.com/en-us/library/system.web.httpresponse.bufferoutput.aspx
More info on the StringBuilder class:
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder(v=vs.110).aspx
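To illustrate the first option (writing each row as it is produced), here is a minimal sketch. It is written against TextWriter so that it can be tried outside ASP.NET; the class name HtmlTableExport and the shape of the rows (one string array per row) are assumptions made for the example, not something from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net;

static class HtmlTableExport
{
    // Writes one <tr> per row straight to the output, so only the
    // current row is ever held as a string by this code.
    public static void WriteTable(TextWriter output, IEnumerable<string[]> rows)
    {
        output.WriteLine("<table>");
        foreach (string[] cells in rows)
        {
            output.Write("<tr>");
            foreach (string cell in cells)
            {
                output.Write("<td>");
                output.Write(WebUtility.HtmlEncode(cell)); // escape <, > and &
                output.Write("</td>");
            }
            output.WriteLine("</tr>");
        }
        output.WriteLine("</table>");
    }
}
```

Inside the handler this would presumably be called as HtmlTableExport.WriteTable(context.Response.Output, rows) after setting context.Response.BufferOutput = false, with rows coming lazily from a data reader rather than from a list that is already in memory.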

Related

Parse a large CSV and stream the resulting rows

I'm attempting to read huge CSV files (50M+ rows, ~30 columns, multiple gigabyte files).
This will be run on business desktop-spec machines, so loading the file into memory isn't going to cut it. Streaming rows as they're parsed seems to be the sanest option.
To make things slightly more interesting, I only need 2 of the columns in the file, but the ordering of fields is not guaranteed and has to be derived from column headings.
As such, an iterator that returns array-per-row or similar would be excellent.
I can't just split on line breaks, as some of the field values may span multiple lines. I'd prefer to avoid manually checking which fields are quoted, unescaping as appropriate, etc...
Is there anything in the framework that will do this for me? If not, can someone give me some hints on how best to approach this?
You can try Cinchoo ETL, an open source library for reading and writing CSV files:
using (var reader = new ChoCSVReader("test.csv").WithFirstLineHeader()
.WithField("Field1")
.WithField("Field2")
)
{
foreach (dynamic item in reader)
{
Console.WriteLine(item.Field1);
Console.WriteLine(item.Field2);
}
}
Please check out articles at CodeProject on how to use it.
Hope it helps your needs.
Disclaimer: I'm the author of this library
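As for something already in the framework: the TextFieldParser class in the Microsoft.VisualBasic assembly handles quoted fields and, to my knowledge, copes with quoted values that span several lines. The sketch below wraps it in an iterator that looks up the wanted columns by header name. The class name CsvColumns and the comma delimiter are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class CsvColumns
{
    // Lazily yields one array per data row, holding only the wanted columns
    // in the order they were asked for.
    public static IEnumerable<string[]> Read(TextReader input, params string[] wanted)
    {
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // The first record is the header; map column names to positions.
            string[] header = parser.ReadFields();
            if (header == null) yield break;
            var index = new int[wanted.Length];
            for (int i = 0; i < wanted.Length; i++)
            {
                index[i] = Array.IndexOf(header, wanted[i]);
                if (index[i] < 0)
                    throw new InvalidDataException("Missing column: " + wanted[i]);
            }

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                var row = new string[wanted.Length];
                for (int i = 0; i < wanted.Length; i++)
                    row[i] = fields[index[i]];
                yield return row;
            }
        }
    }
}
```

Because the method is an iterator, nothing is read until the caller starts a foreach, and only one record is materialised at a time. For a file on disk, pass a StreamReader opened on it.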

Read/Write array to a file

I need guidance, someone to point me in the right direction. As the title says, I need to save information to a file: Date, string, integer and an array of integers. And I also need to be able to access that information later, when a user wants to review it.
Optional: File is plain text and I can directly check it and it is understandable.
Bonus points if chosen method can be "easily" converted to working with a database in the future instead of individual files.
I'm pretty new to C# and what I've found so far is that I should turn the array into a string with separators.
So, what'd you guys suggest?
// JSON.Net
string json = JsonConvert.SerializeObject(objOrArray);
File.WriteAllText(path, json);
// (note: can also use File.Create etc if don't need the string in memory)
or...
using(var file = File.Create(path)) { // protobuf-net
Serializer.Serialize(file, objOrArray);
}
The first is readable; the second will be smaller. Both will cope fine with "Date, string, integer and an array of integers", or an array of such objects. Protobuf-net would require adding some attributes to help it, but really simple.
As for working with a database as columns... the array of integers is the glitch there, because most databases don't support "array of integers" as a column type. I'd say "separation of concerns" - have a separate model for DB persistence. If you are using the database purely to store documents, then: pretty much every DB will support CLOB and BLOB data, so either is usable. Many databases now have inbuilt JSON support (helper methods, etc), which might make JSON as a CLOB more tempting.
I would probably serialize this to json and save it somewhere. Json.Net is a very popular way.
The advantage of this is also creating a class that can be later used to work with an Object-Relational Mapper.
var userInfo = new UserInfoModel();
// write the data (overwrites)
using (var stream = new StreamWriter(@"path/to/your/file.json", append: false))
{
stream.Write(JsonConvert.SerializeObject(userInfo));
}
//read the data
using (var stream = new StreamReader(@"path/to/your/file.json"))
{
userInfo = JsonConvert.DeserializeObject<UserInfoModel>(stream.ReadToEnd());
}
public class UserInfoModel
{
public DateTime Date { get; set; }
// etc.
}
For the plain-text file, you're right.
Use one line for each entry:
Date
string
Integer
Array of Integer
When you read the file in your code, you can easily separate the entries by reading line by line.
Turn the array into a string with a specific separator:
[1,2,3] -> "1,2,3"
When you read that line, split the string on "," to get an array of strings, then parse each entry into an int array of the same length.
For how to read and write the file, have a look at "Easiest way to read from and write to files".
If you really want to switch to a database at some point, try a JSON format for your file. It is easy to handle and there are some good plugins to work with.
Regards,
Henne
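A minimal sketch of the one-line-per-value layout described above. The Entry class and its field names are made up for the example, and it assumes the string value itself contains no line breaks.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

class Entry
{
    public DateTime Date;
    public string Text;
    public int Number;
    public int[] Values;

    // Four lines per entry: date, string, integer, comma-separated array.
    public void Save(TextWriter writer)
    {
        writer.WriteLine(Date.ToString("o", CultureInfo.InvariantCulture));
        writer.WriteLine(Text);
        writer.WriteLine(Number.ToString(CultureInfo.InvariantCulture));
        writer.WriteLine(string.Join(",", Values));
    }

    // Reads the four lines back in the same order they were written.
    public static Entry Load(TextReader reader)
    {
        var entry = new Entry();
        entry.Date = DateTime.Parse(reader.ReadLine(),
            CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
        entry.Text = reader.ReadLine();
        entry.Number = int.Parse(reader.ReadLine(), CultureInfo.InvariantCulture);
        entry.Values = reader.ReadLine()
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => int.Parse(s, CultureInfo.InvariantCulture))
            .ToArray();
        return entry;
    }
}
```

With a StreamWriter and StreamReader on the same path this gives a file you can open and read in any text editor. The invariant culture and the round-trip "o" date format are there so the file reads back the same way on machines with different regional settings.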
The way I got started with C# is via the game Space Engineers from the Steam platform. The mods need to save a file locally (%AppData%\Local\Temp\SpaceEngineers\ or %AppData%\Roaming\SpaceEngineers\Storage\) for various settings, and their logging is similar to what @H. Sandberg mentioned (line by line, perhaps with a separator to parse with later). The upside is that it's easy to retrieve, easy to append and easy to overwrite. I'm pretty sure it's even possible to retrieve the file size, which, combined with file deletion and file creation, can prevent runaway file sizes: you set an upper limit to check against, which lets you run it on a server with minimal impact. It's probably best to include a minimum date filter (make sure X is at least Y days old before deleting it for being over Z bytes) to prevent loss of debugging data ("Why was it over that limit?").
As far as the actual code behind the idea goes, I'm at approximately the same skill level as the OP, which is to say a rookie. But I would advise looking at the code in the Space Engineers mods for some samples (plus it's not half bad for a beta game), as they are almost all written in C#. The Programmable Blocks compile C# as well, so you can use them both to learn C# and to reinforce what you already know. Certain C# commands aren't allowed there for security reasons, but the Mod API gives you more flexibility to do things such as creating and maintaining log files or retrieving and modifying object properties. You can even print text to various in-game text monitors.
I apologise if my Syntax needs some work, and I'm sorry I am not currently capable of just whipping up some Code to solve your issue, but I do know
using System;
Console.WriteLine("Hello World");
so at least it's not a total loss, but my example Code likely won't compile, since it's likely missing things like: an Output Location, perhaps an API reference or two, and probably a few other settings. Like I said, I'm New, but that is a valid C# Command, I know I got that part correct.
Edit: here's a better attempt:
using System;
class Test
{
static void Main()
{
string a = "Hello Hal, ";
string b = "Please open the Airlock Doors.";
string c = "I'm sorry Dave, ";
string d = "I'm afraid I can't do that.";
Console.WriteLine(a + b);
Console.WriteLine(c + d);
Console.Read();
}
}
This:
"Hello Hal, Please open the Airlock Doors."
"I'm sorry Dave, I'm afraid I can't do that."
Should be the result. (the "Quotation Marks" shouldn't appear in the readout {the last Code Block}, that's simply to improve readability)

Handling strings more than 2 GB

I have an application where an XLS file with lots of data entered by the user is opened and the data in it is converted to XML. I have already mapped the columns in the XLS file to XML Maps. When I try to use the ExportXml method in XMLMaps, I get a string with the proper XML representation of the XLS file. I parse this string a bit and upload it to my server.
The problem is, when my XLS file is really large, the string produced for the XML is over 2 GB and I get an OutOfMemoryException. I understand that the limit for a single CLR object is 2 GB. But in my case I need to handle this scenario. At present I just show a message asking the user to send less data.
Any ideas on how I can do this?
EDIT:
This is just the gist of the operations I need to do on the generated XML:
Remove certain fields which are not needed for the server data.
Add something like ID numbers for each row of data.
Modify the values of certain elements.
Do validation on the data.
While the XMLReader stream is a good idea, I cannot perform these operations by that method. While data validation can be done by Excel itself, the other things cannot be done here.
Using XmlTextReader and XmlTextWriter and creating a custom method for each of the steps is a solution I had thought of. But going through the gist above that way requires the XML document to be processed 4 times. This is just not efficient.
If the XML is that large, then you might be able to use Export to a temporary file, rather than using ExportXML to a string - http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.xmlmap.export.aspx
If you then need to parse/handle the XML in C#, then for handling such large XML structures, you'll probably be better off implementing a custom XMLReader (or XMLWriter) which works at the stream level. See this question for some similar advice - What is the best way to parse large XML (size of 1GB) in C#?
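The edits listed in the question do not necessarily need four passes. One possible sketch: stream the exported file with XmlReader, materialise one row element at a time with XNode.ReadFrom, apply every change to that row, and write it straight out with XmlWriter. The element and attribute names used here ("Internal", "Id") and the class name RowFilter are hypothetical, and the sketch assumes the rows are direct children of the root element and that no XML namespaces are involved.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class RowFilter
{
    // Copies the document from input to output in a single pass,
    // editing each row while only that one row is held in memory.
    public static void Transform(TextReader input, TextWriter output)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var reader = XmlReader.Create(input))
        using (var writer = XmlWriter.Create(output, settings))
        {
            reader.MoveToContent();                     // position on the root element
            writer.WriteStartElement(reader.LocalName);
            reader.ReadStartElement();                  // step inside the root

            int nextId = 1;
            while (reader.MoveToContent() == XmlNodeType.Element)
            {
                // Reads exactly one child element and advances the reader past it.
                var row = (XElement)XNode.ReadFrom(reader);

                row.Elements("Internal").Remove();      // drop fields the server does not need
                row.SetAttributeValue("Id", nextId++);  // add a running ID
                // Value changes and validation would go here as well.

                row.WriteTo(writer);
            }
            writer.WriteEndElement();
        }
    }
}
```

Memory use then depends on the size of the largest single row rather than on the size of the whole export.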
I guess there is no other way than using an x64 OS and framework if you really need to hold the whole thing in RAM, but processing the data some other way, as suggested by Stuart, may be the better way to go...
What you need to do is to use "stream chaining", i.e. you open up an input stream which reads from your excel file and an output stream that writes to your xml file. Then your conversion class/method will take the two streams as input and read sufficient data from the input stream to be able to write to the output.
Edit: very simple minimal Example
Converting from file:
123
1244125
345345345
4566
11
to
<List>
<ListItem>123</ListItem>
<ListItem>1244125</ListItem>
...
</List>
using
void Convert(Stream fromStream, Stream toStream)
{
using(StreamReader from= new StreamReader(fromStream))
using(StreamWriter to = new StreamWriter(toStream))
{
to.WriteLine("<List>");
while(!from.EndOfStream)
{
string bulk = from.ReadLine(); //in this case, a single line is sufficient
//some code to parse the bulk or clean it up, e.g. remove '\r\n'
to.WriteLine(string.Format("<ListItem>{0}</ListItem>", bulk));
}
to.WriteLine("</List>");
}
}
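// File.Create truncates an existing target; File.OpenWrite would leave old trailing bytes in place.
Convert(File.OpenRead("source.xls"), File.Create("source.xml"));
Of course you could do this in a much more elegant, abstract manner, but this is only to show my point.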

Should I build a string first and then write to file?

A program I am working on right now has to generate a file. Is it better for me to generate the file's contents as a string first and then write that string to the file, or should I just directly add the contents to the file?
Are there any advantages of one over the other?
The file will be about 0.5 - 1MB in size.
If you write to a file as-you-go, you'll have the benefit of not keeping everything in memory, if it's a big enough file and you constantly flush the stream.
However, you'll be more likely to run into problems with a partially-written file, since you're doing your IO over a period of time instead of in a single shot.
Personally, I'd build it up using a StringBuilder, and then write it all to disk in a single shot.
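If the partially-written file is the main worry, one common compromise is to stream the content into a temporary file and only move it into place once writing has finished. The sketch below shows that idea; the class name SafeFileWriter and the ".tmp" suffix are arbitrary choices for the example, and the delete-then-move step is not atomic.

```csharp
using System;
using System.IO;

static class SafeFileWriter
{
    // Streams content into "<path>.tmp" and swaps it in only after the
    // writer has been closed, so readers never see a half-written target.
    public static void Write(string path, Action<TextWriter> writeContent)
    {
        string temp = path + ".tmp";
        using (var writer = new StreamWriter(temp, append: false))
        {
            writeContent(writer);
        }
        if (File.Exists(path))
        {
            File.Delete(path);
        }
        File.Move(temp, path);
    }
}
```

A call would look like SafeFileWriter.Write("report.txt", w => w.WriteLine("header")). If writeContent throws, the target file is left untouched and only the temporary file is affected.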
I think it's a better idea, in general, to create a StreamWriter and just write to it. Why keep things in memory when you don't have to? And it's a whole lot easier. For example:
using (var writer = new StreamWriter("filename"))
{
writer.WriteLine(header);
// write all your data with Write and WriteLine,
// taking advantage of composite formatting
}
If you want to build multiple lines with StringBuilder you have to write something like:
var sb = new StringBuilder();
sb.AppendLine(string.Format("{0:N0} blocks read", blocksRead));
// etc., etc.
// and finally write it to file
File.WriteAllText("filename", sb.ToString());
There are other options, of course. You could build the lines into a List<string> and then use File.WriteAllLines. Or you could write to a StringWriter and then write its contents to the file. But all of those approaches have you handling the data multiple times. Just open the StreamWriter and write.
The primary reasons I think it's a better idea in general to go directly to output:
You don't have to refactor your code when it turns out that your output data is too big to fit in memory.
The planned destination is the file anyway, so why fool with formatting it in memory before writing to the file?
The API for writing multiple lines to a text file is, in my opinion, cleaner than the API for adding lines to a StringBuilder.
I think it is better to use a string or StringBuilder to store your data; then you can write it to the file with one of the File.Write* methods (for example File.WriteAllText).

What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't often show up in the data between the delimiters. After searching around, it seems that no solution built into .NET suits my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4 GB file with some field values easily running to several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
string line;
while ((line = input.ReadLine()) != null)
{
yield return line.Split('þ');
}
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
var qry = from l in CreateEnumerable(sr).Skip(1)
where l[3].Contains("something")
select new { Field1 = l[0], Field2 = l[1] };
foreach (var item in qry)
{
Console.WriteLine(item.Field1 + " , " + item.Field2);
}
}
Console.ReadLine();
This will skip the header line, then print out the first two field from the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means using I/O completion ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of Memory Mapped Files (msdn point to a .NET wrapper here) and a simple incremental parse, yielding back to an IEnumerable list of your record / text line (or whatever)
You mention that some fields are very very big, if you try to read them in their entirety to memory you may be getting yourself into trouble. I would read through the file in 8K (or small chunks), parse the current buffer, keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))) {
// Read a small field
string smallField = reader.ReadFieldAsText();
// Read a large field
Stream largeField = reader.ReadFieldAsStream();
}
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.
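For what it's worth, the character-at-a-time state machine suggested in the answers above can be quite small. The following is only a sketch under some assumptions: the field delimiter and quote character are passed in by the caller, a line break outside quotes ends a record, and a quote character never appears inside a quoted value (no doubled-quote escaping). The class name DelimitedReader is made up.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class DelimitedReader
{
    // Lazily yields one record (a list of field values) at a time.
    public static IEnumerable<List<string>> ReadRecords(TextReader input, char delimiter, char quote)
    {
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        int c;
        while ((c = input.Read()) != -1)
        {
            char ch = (char)c;
            if (inQuotes)
            {
                // Inside quotes everything is data, including delimiters and line breaks.
                if (ch == quote) inQuotes = false;
                else field.Append(ch);
            }
            else if (ch == quote)
            {
                inQuotes = true;
            }
            else if (ch == delimiter)
            {
                record.Add(field.ToString());
                field.Clear();
            }
            else if (ch == '\n')
            {
                record.Add(field.ToString());
                field.Clear();
                yield return record;
                record = new List<string>();
            }
            else if (ch != '\r')
            {
                field.Append(ch);
            }
        }
        // Emit a final record when the input does not end with a line break.
        if (field.Length > 0 || record.Count > 0)
        {
            record.Add(field.ToString());
            yield return record;
        }
    }
}
```

A call such as DelimitedReader.ReadRecords(new StreamReader(path), '|', 'þ') would then stream the records, where '|' is only a stand-in for the real field delimiter. Each field is still built up as a string here, so a multi-megabyte value is held in memory once; handing such fields out as streams, as in the EddReader example, would need more work.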
