I'm using StringTemplate to generate XML files from datasets. Sometimes the dataset enumerated by a loop in a template has more than 100,000 records, and rendering is very slow (15-20 seconds per operation), so performance is not acceptable for me.
This is an example of how I use ST to render a report:
using (var sw = new StringWriter())
{
st.Write(new StringTemplateWriter(sw));
return sw.ToString();
}
StringTemplateWriter is a simple writer class implementing IStringTemplateWriter, without indentation.
By the way, in the debug output I see a lot of messages like this:
"A first chance exception of type 'antlr.NoViableAltException' occurred in StringTemplate.DLL"
Digging deeper in the debugger, I found that ST parses my template recursively, and when something fails (I don't know exactly what), it throws a NoViableAltException to unwind from deep in the stack back to the surface. So I guess the problem is the use of too many try/catch/throws.
Google found nothing useful on this.
Main question: how can I reduce the number of exceptions (short of rewriting the ST code) and improve the performance of template rendering?
ST parses templates and groups with ANTLR. If you are getting syntax errors, your template(s) have errors. All bets are off for performance, as it throws an exception for each one. ANTLR/ST is not at fault here ;)
Terence
NoViableAltException sounds like a parser error. I am not sure why ANTLR is being used (other than that both tools come from the same author), but the only guess I can come up with is that the template language itself is parsed using ANTLR. Maybe the template contains errors? In any case, ANTLR's error handling is really slow (for one thing, it uses exceptions), so that's probably why your template expansion is slow.
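If the C# port mirrors the Java v3 API (an assumption; member names may differ between versions), one way to surface those syntax errors up front is to attach an error listener to the template group instead of letting exceptions fly repeatedly during rendering. A minimal sketch, with the group variable illustrative:
// Hedged sketch: IStringTemplateErrorListener and the ErrorListener property
// are assumed to match the Java v3 API; verify against your ST version.
public class ConsoleErrorListener : IStringTemplateErrorListener
{
    public void Error(string msg, Exception e)
    {
        Console.Error.WriteLine("ST error: " + msg);
    }
    public void Warning(string msg)
    {
        Console.Error.WriteLine("ST warning: " + msg);
    }
}
// Attach it when loading the group, so template errors are reported once,
// up front, rather than thrown over and over while rendering:
group.ErrorListener = new ConsoleErrorListener();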
I want to create a simple ternary expression similar to the following:
Convert.ToInt32(stringname.Substring(0,2)) != 99 ?
Convert.ToInt32(stringname.Substring(0,2)) : 15
I get an error about incompatibilities between int and bool. Is there a simple workaround?
Does it have to be a single line? Apparently you already know the logic, so why do you need to cram it into a single line of code? Cramming too much code into one line only makes reading and debugging unnecessarily hard.
Debugging will be hard because you never know which of those four function calls is throwing the exception. It is also unnecessarily slow, as you are performing the same two operations twice in a row. Just do it once and store the result.
My advice is to split up the code using temporary variables, as sketched below: one operation plus one assignment per line. This makes debugging and reading manageable. Do not worry about performance; between compiler optimizations and the JIT compiler, there is a decent chance any superfluous variables will be eliminated at runtime in a release build.
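A minimal sketch of that refactor, assuming stringname is at least two characters long:
// One operation plus one assignment per line: parse once, store, then branch.
string prefix = stringname.Substring(0, 2); // throws if stringname has fewer than 2 chars
int parsed = Convert.ToInt32(prefix);       // throws FormatException on non-numeric input
int result = parsed != 99 ? parsed : 15;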
I have written a C# application to parse very large (100MB+) XML files.
The way I accomplished it was that I traverse through the file using a System.Xml.XmlReader and then, once I get to the final nodes I need to collect values from, I convert each of those very small elements into a System.Xml.Linq.XElement and perform various XPath statements via XElement.XPathEvaluate to get the data I need.
This worked very well and efficiently, but I hit a snag where I was sometimes getting bad data due to the fact that XPathEvaluate only supports XPath 1.0 and my statements were XPath 2.0 (question posted here).
My code for originally doing this looks somewhat as follows:
void parseNode_Old(XmlReader rdr, List<string> xPathsToExtract)
{
// Enter the node:
rdr.Read();
// Load it as an XElement so as to be able to evaluate XPaths:
var nd = XElement.Load(rdr);
// Loop through the XPaths related to that node and evaluate them:
foreach (var xPath in xPathsToExtract)
{
var xPathVal = nd.XPathEvaluate(xPath);
// Do whatever with the extracted value(s)
}
}
Following the suggestions given in my previous question, I decided the best solution would be to move from System.Xml to Saxon.Api (which does support XPath 2.0) and my current updated code looks as follows:
void parseNode_Saxon(XmlReader rdr, List<string> xPathsToExtract)
{
// Set up the Saxon XPath processors:
Processor processor = new Processor(false);
XPathCompiler compiler = processor.NewXPathCompiler();
XdmNode nd = processor.NewDocumentBuilder().Build(rdr);
// Loop through the XPaths related to that node and evaluate them:
foreach (var xPath in xPathsToExtract)
{
var xPathVal = compiler.EvaluateSingle(xPath, nd);
// Do whatever with the extracted value(s)
}
}
This is working (with a few other changes to my XPaths), but it has become about 5-10 times slower.
This is my first time working with the Saxon.Api library, and this is what I came up with. I'm hoping there's a better way to accomplish this and make the code-execution speed comparable; or, if anyone has other ideas on how to evaluate XPath 2.0 statements without a substantial rewrite, I'd love to hear them!
Any help would be greatly appreciated!!
Thanks!!
UPDATE:
In trying to fix this myself, I moved the following 2 statements to my constructor:
Processor processor = new Processor(false);
XPathCompiler compiler = processor.NewXPathCompiler();
instead of re-creating them on every call to this method, which has helped substantially; but the process is still about 3 times slower than the native System.Xml.Linq version. Any other ideas or thoughts on ways to implement this parser?
This may be the best you can do with this set-up.
Saxon on .NET is often 3-5 times slower than Saxon on Java, for reasons we have never got to the bottom of. We're currently exploring the possibility of rebuilding it using Excelsior JET rather than IKVMC to see if this can speed things up.
Saxon is much slower on a third-party DOM implementation than on its own native tree representation, but it seems you have changed your code to use the native tree model.
Since you're parsing each XPath expression every time it is executed, your performance may be dominated by XPath compilation time (even if you're searching a large XML document). Until recently Saxon's compile-time performance received very little attention, since we reckoned that doing more work at compile time to save effort at run-time was always worthwhile; but in this kind of scenario this clearly isn't the case. It may be worth splitting the compile and run-time and measuring both separately, just to see if that gives any insights. It might suggest, for example, switching off some of the optimization options. Obviously if you can cache and reuse compiled XPath expressions that will help.
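For illustration, a minimal sketch of caching compiled expressions using the Saxon.Api types already in use here (the cache field and method are illustrative names):
using System.Collections.Generic;
using Saxon.Api;

// Hypothetical cache field in the parser class:
private readonly Dictionary<string, XPathExecutable> _xpathCache =
    new Dictionary<string, XPathExecutable>();

XdmItem EvaluateCached(XPathCompiler compiler, string xPath, XdmNode node)
{
    XPathExecutable exec;
    if (!_xpathCache.TryGetValue(xPath, out exec))
    {
        exec = compiler.Compile(xPath);    // compile once per distinct expression
        _xpathCache[xPath] = exec;
    }
    XPathSelector selector = exec.Load();  // selectors are cheap to create per evaluation
    selector.ContextItem = node;
    return selector.EvaluateSingle();
}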
I've run into a peculiar case where I get the following error when creating certain types of string:
Unexpected error writing debug information -- 'Error HRESULT E_FAIL has been returned from a call to a COM component.'
This error is not new to Stack Overflow (see this question and this question), but the problems presented have nothing to do with this one.
For me, this is happening when I create a const string of a certain length that includes a null-terminating character (\0) somewhere near the beginning.
To reproduce, first generate a string of appropriate length, e.g. using:
var s = new string('a', 3000);
Grab the resulting string at runtime (e.g. Immediate Window or by hovering over the variable and copying its value). Then, make a const out of it:
const string history = "aaaaaa...aaaaa";
Finally, put a \0 in there somewhere:
const string history = "aaaaaaaaaaaa\0aa...aaaaa";
Some things I noticed:
If you put the \0 near the end, the error doesn't happen.
Reproduced using .NET Framework 4.6.1 and 4.5.
Doesn't happen if the string is short.
Edit: even more useful info is available in the comments below.
Any idea why this is happening? Is it some kind of bug?
Edit: Bug filed, including info from comments. Thanks everybody.
I'll noodle about this issue a little bit. The issue occurs both in VS2015 and earlier versions, so it has nothing directly to do with the C# compiler itself; it goes wrong in the implementation of the ISymUnmanagedWriter2::DefineConstant2() method. ISymUnmanagedWriter2 is a COM interface, part of the .NET infrastructure that all compilers use, both Roslyn and the legacy C# compiler.
The comments in the Roslyn source code that uses the method (they actually date back to the CCI project) are illuminating enough; trouble with this method had been discovered before:
// EDMAURER If defining a string constant and it is too long (length limit is undocumented), this method throws
// an ArgumentException.
// (see EMITTER::EmitDebugLocalConst)
try
{
this.symWriter.DefineConstant2(name, value, constantSignatureToken);
}
catch (ArgumentException)
{
// writing the constant value into the PDB failed because the string value was most probably too long.
// We will report a warning for this issue and continue writing the PDB.
// The effect on the debug experience is that the symbol for the constant will not be shown in the local
// window of the debugger. Nor will the user be able to bind to it in expressions in the EE.
//The triage team has deemed this new warning undesirable. The effects are not significant. The warning
//is showing up in the DevDiv build more often than expected. We never warned on it before and nobody cared.
//The proposed warning is not actionable with no source location.
}
catch (Exception ex)
{
throw new PdbWritingException(ex);
}
Swallowing exceptions, tsk, tsk. It dies on the last catch clause in your case. They did dig a little deeper to reverse-engineer the string length problem:
internal const int PdbLengthLimit = 2046; // Empirical, based on when ISymUnmanagedWriter2 methods start throwing.
Which is fairly close to where the \0 starts throwing; I got 2034. There's not much that you or anybody else here can do about this, of course. All you can reasonably do is report the bug at connect.microsoft.com. But hopefully you see the writing on the wall: the odds that it will get fixed are rather small. This is code that nobody maintains anymore; it now has 'undocumented' status, and judging from other comments, this goes back long before .NET. Not Ed Maurer either :)
The workaround ought to be easy enough: glue this string together at runtime.
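A minimal sketch, with illustrative lengths and split point:
// Build the value at runtime instead of declaring it const, so nothing has
// to be written into the PDB. 'static readonly' keeps a single instance.
static readonly string History = new string('a', 12) + "\0" + new string('a', 2986);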
I was able to repro the issue as coded. Then I changed the declaration to:
const string history = @"aaa\0aaa...lots and lots of aaa...aaa";
Tried again and it compiles just fine. (Note that in a verbatim string, \0 is two literal characters, a backslash and a zero, rather than an embedded null character, which is why the compiler no longer chokes; the string's value changes accordingly.)
We have a circumstance where we basically want to generate a string representation of a code file by passing in the template contents and the information the template needs to build itself:
//*** PSEUDO CODE *** //
//loaded from an embedded resource file in a .dll. no physical file on file system
string templateContents = ...;
//has properties used by the template
object complexParameter = ...;
string generatedCode = generator.MakeCode(templateContents, complexParameter);
However, we're currently running into problems trying to get the T4 template generation to do what we want. The actual code we're using is:
var templatingEngine = new Engine();
//T4TextTemplateHost is our own class implementing ITextTemplatingEngineHost & IServiceProvider
var templateHost = new T4TextTemplateHost(references, imports)
{
Properties = parameters,
//this is supposed to be a file path? the generation bombs if this is left null
TemplateFile = "Dummy Value"
};
var templateContents = GetTemplateFileContents();
var retVal = templatingEngine.ProcessTemplate(templateContents, templateHost);
//if a CompilerError occurs, we get NO code, just a "ErrorGeneratingOutput" message
foreach (CompilerError error in templateHost.Errors)
//this information is pretty worthless: a compile error with line number for a
//non-existent code file
retVal += String.Format("{0}{2}Line: {1}{2}{2}", error.ErrorText,
error.Line, Environment.NewLine);
The problem is that the code generator seems to expect a physical file somewhere, and when things go wrong, we get no code back, just useless error messages. It is our strong preference NOT to have the code automatically compiled, especially when the generated code has an error (we want a full, broken file to examine when troubleshooting).
We also want the output as a string that we can take and do with whatever we wish.
Is there a way to make T4 code generation work more like the pseudo-code example? We're on the verge of abandoning the T4 tool in favor of something like CodeSmith, because T4 seems too limited and geared toward a very specific way of managing templates and processing output.
I don't think it is possible to get T4 to generate anything if there are errors in the template you pass in. T4 will try to convert your template into a CodeDOM graph with extra statements that write out to a StringWriter; the final StringWriter content is then returned as the result. If there are any errors in the template, the code will not compile, and thus there is nothing to return to you. The errors you get back should resolve to lines in the template you passed in; at least, that has been my experience.
I am not sure whether CodeSmith works differently, but depending on the complexity of what you are trying to render, you might have some luck with Nustache if it's simple enough. It's a .NET implementation of Mustache templates. It supports basic looping and if/then-style control blocks. I have successfully used it with embedded text files to generate simple templates for emails and reports.
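For illustration, a minimal Nustache sketch rendering a template held in a string (no physical file required; the property names are illustrative):
using Nustache.Core;

// Template text can come from an embedded resource; no file on disk needed.
string template = "Hello {{Name}}, you have {{Count}} new messages.";
string output = Render.StringToString(template, new { Name = "World", Count = 3 });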
I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are arbitrary ASCII characters that rarely show up in the data between the delimiters. After searching around, I seem to have found no solutions in .NET that suit my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4GB file where some field values easily run to several million characters).
While this seems a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this, and I have a feeling I probably didn't need to write one from scratch anyway.
Use the FileHelpers API. It's .NET and open source. It's extremely high-performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
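For illustration, a hedged sketch of a FileHelpers record mapping and streaming read for this kind of format (the delimiter value is illustrative, since the question's delimiter character didn't survive formatting; check the attribute and engine usage against your FileHelpers version):
using FileHelpers;

[DelimitedRecord("\x14")]   // illustrative: substitute your real field delimiter
public class EddRecord
{
    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field1;

    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field2;
}

// Stream records one at a time rather than loading the whole 4GB file:
var engine = new FileHelperAsyncEngine<EddRecord>();
using (engine.BeginReadFile(@"c:\test.file"))
{
    foreach (EddRecord rec in engine)
    {
        // process rec.Field1, rec.Field2, ...
    }
}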
If for some reason that doesn't do it for you, try just reading line by line and using String.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
string line;
while ((line = input.ReadLine()) != null)
{
yield return line.Split('þ');
}
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily loaded, so don't close or alter the StreamReader until you've iterated (or forced a full load with ToList/ToArray or the like; given your file size, however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
var qry = from l in CreateEnumerable(sr).Skip(1)
where l[3].Contains("something")
select new { Field1 = l[0], Field2 = l[1] };
foreach (var item in qry)
{
Console.WriteLine(item.Field1 + " , " + item.Field2);
}
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file wherever the fourth field contains the string "something". It does this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O completion ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:
18) Don't use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
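For illustration, a minimal sketch of IOCP-backed file reading in managed code: opening a FileStream with FileOptions.Asynchronous requests overlapped I/O, which the runtime services through an I/O completion port (the path variable is illustrative, and this must run inside an async method):
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 8192, FileOptions.Asynchronous))
{
    var buffer = new byte[8192];
    int read;
    while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        // hand this chunk off to the parser
    }
}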
As for parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records/text lines (or whatever).
You mention that some fields are very, very big; if you try to read them into memory in their entirety, you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parsing the current buffer and keeping track of state, as sketched below.
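A minimal sketch of that chunked approach; EddParser and its Feed method are hypothetical names for the incremental parser being described:
// Read fixed-size chunks and let the parser carry state across buffer boundaries.
using (var fs = File.OpenRead(fileName))
{
    var parser = new EddParser();   // hypothetical incremental parser
    var buffer = new byte[8192];
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        foreach (var record in parser.Feed(buffer, read))
        {
            // yield or process each record completed by this chunk
        }
    }
}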
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))) {
// Read a small field
string smallField = reader.ReadFieldAsText();
// Read a large field
Stream largeField = reader.ReadFieldAsStream();
}
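For completeness, a hedged sketch of the byte-at-a-time state machine such a reader might use internally (all member names are hypothetical; 'þ' is the quote character from the question, and the byte-to-char cast assumes a single-byte encoding):
private enum State { BetweenFields, InField }

private State _state = State.BetweenFields;
private readonly StringBuilder _field = new StringBuilder();

// Consume one byte; returns a completed field value, or null while mid-field.
private string Consume(byte b)
{
    const char Quote = 'þ';
    char c = (char)b;                       // assumes a single-byte encoding
    switch (_state)
    {
        case State.BetweenFields:
            if (c == Quote) _state = State.InField;   // opening quote
            return null;
        case State.InField:
            if (c == Quote)                            // closing quote ends the field
            {
                _state = State.BetweenFields;
                string value = _field.ToString();
                _field.Length = 0;
                return value;
            }
            _field.Append(c);
            return null;
        default:
            return null;
    }
}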
While this doesn't help address the large-input issue, a possible solution to the parsing issue might be a custom parser that uses the strategy pattern to supply a delimiter.