I have a program that I'm working on which uses a very large number (>100) of dynamically generated regular expressions. Each regex is used against a large number of strings (depending on the situation, this can be >2k), so I have them compiled and cache their use internally. However, the program gets run repeatedly (it's part of a build tool), and compiling those dynamically generated expressions takes a significant amount of time every time the program starts.

I already have an on-disk cache (no parsing is required if the cache is valid) and could store the compiled Regex expressions in it, but I can't seem to figure out a way to do this correctly. I first thought of using Regex.CompileToAssembly, but Mono doesn't support it, and the program needs to run on Mono as well as MS .NET. Because of that I can't figure out a good way of caching the expressions.

I only need the IsMatch(string) method from the compiled Regex, and I do have the option of modifying the Mono Regex implementation and including it in my program, but I have no idea where to start with that.
You could create another program that compiles the expressions in a build step, and then ship the already-compiled assemblies with your project. That would eliminate the problem of Regex.CompileToAssembly not being supported on Mono.
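A minimal sketch of such a build-step tool, assuming the .NET Framework's Regex.CompileToAssembly (the pattern, type name and assembly name here are made-up placeholders; whether the generated DLL then loads cleanly under Mono is something you'd need to verify):

using System.Reflection;
using System.Text.RegularExpressions;

class RegexPrecompiler
{
    static void Main()
    {
        // Describe each expression to pre-compile: pattern, options,
        // generated type name, namespace, and public visibility.
        var infos = new[]
        {
            new RegexCompilationInfo(@"^[0-9]{4,9}$", RegexOptions.None,
                                     "ClientIdRegex", "Precompiled", true)
        };

        // Emits Precompiled.dll next to the tool. The consuming code then
        // instantiates Precompiled.ClientIdRegex and calls IsMatch on it,
        // with no pattern parsing or IL emission at startup.
        Regex.CompileToAssembly(infos, new AssemblyName("Precompiled"));
    }
}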
It's not quite a full solution, but I was able to use binary serialization to produce a noticeable improvement in startup time with the cache versus without. I suspect most of this is actually just the time it saves parsing the regexes, and that it still has to do the actual compiling, but it's a good enough difference for what I need.
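For reference, the cache amounts to something like the sketch below (assuming the .NET Framework Regex, which is marked serializable there; the cache path is a placeholder):

using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text.RegularExpressions;

static class RegexCache
{
    const string CachePath = "regex.cache";   // placeholder location

    // Load the cached dictionary if present, otherwise rebuild and save it.
    public static Dictionary<string, Regex> Load(IEnumerable<string> patterns)
    {
        var formatter = new BinaryFormatter();
        if (File.Exists(CachePath))
            using (var stream = File.OpenRead(CachePath))
                return (Dictionary<string, Regex>)formatter.Deserialize(stream);

        var cache = new Dictionary<string, Regex>();
        foreach (var pattern in patterns)
            cache[pattern] = new Regex(pattern, RegexOptions.Compiled);

        using (var stream = File.Create(CachePath))
            formatter.Serialize(stream, cache);
        return cache;
    }
}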
I already have an existing application in which I use Antlr4 to declare a customized grammar, compile the .g4 files into our C# base parser and lexer, and implement the visitors for expression parsing.
The question is about finding a way to change the behavior from interpretation to compilation.
The way the app works today, we receive an expression from our users (in the customized grammar format) and pass it through the Antlr4 implementations to get our visitors running and executing the expression. This is a very repetitive process: the same expression gets evaluated over and over with just different arguments, and the implemented logic stays the same.
I'd like to ask whether I could compile the expression my users provide, save the compiled artifact, and then load it up and call it instead of parsing their expressions every time.
This is similar to what I do with C# programming, in that I produce a DLL file that gets loaded and executed later, without needing to be interpreted every time (not considering JIT in this context ;).
Hope I made myself clear enough about that.
It's not a problem to change the architecture for this implementation, since we do need a "facelift" on the project because of performance issues. Our customers tend to produce very large expressions, which take a lot of memory to parse and are causing some issues at runtime.
Thanks a lot.
After some extra time analysing the implementation and our usage of ANTLR, I could see that even if I found a way to compile the expression analysis, the output wouldn't be executable in our scenario, because of the many usages of local libraries.
The path I chose was to dynamically generate C# code while navigating the ANTLR expression tree, through visit method overrides, and later compile this C# code into an in-memory assembly, so I can locate the execution class, create an instance, and invoke its execute method. I'm using Roslyn to achieve this approach.
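The Roslyn part of that is roughly the sketch below (the type name Generated.ExpressionRunner and its Execute method are just stand-ins for whatever the visitor emits):

using System;
using System.IO;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class ExpressionCompiler
{
    public static object CompileAndRun(string generatedSource, object[] arguments)
    {
        var tree = CSharpSyntaxTree.ParseText(generatedSource);
        var compilation = CSharpCompilation.Create(
            "UserExpression",
            new[] { tree },
            new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) },
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

        using (var ms = new MemoryStream())
        {
            var result = compilation.Emit(ms);
            if (!result.Success)
                throw new InvalidOperationException("Compilation of the user expression failed.");

            // Load the in-memory assembly, locate the generated class and run it.
            var assembly = Assembly.Load(ms.ToArray());
            var type = assembly.GetType("Generated.ExpressionRunner");
            var instance = Activator.CreateInstance(type);
            return type.GetMethod("Execute").Invoke(instance, new object[] { arguments });
        }
    }
}

The emitted bytes could just as well be written to disk and cached, so the same user expression is only compiled once.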
I don't know if this is the right title for what I need. I need to run a program with the same input data a few times and ensure that every time the program takes exactly the same path and produces exactly the same output. I even need to make sure that some iterator processes elements in the same order.
Maybe there are some tools for that purpose? Or maybe there is some standard way to check that? I put C# in the tags because I need a solution specifically for that language (and I'm coding in VS2012, if that can be of any help).
Edit:
The input of my program consists of a list of integers and the output is a simple boolean. Even if I write tests, there can be a very big difference in the calculations and yet the same result. I especially need to check that the program code takes the same path every time.
You can use a test framework, use mocks with expectations, and assert the output.
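For instance, with MSTest (which ships with VS2012) you can compare both the result and a recorded trace of the steps taken; here MyProgram.Evaluate and its trace callback are hypothetical hooks you would have to add to your own code:

using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class DeterminismTests
{
    [TestMethod]
    public void SameInputTakesSamePathAndGivesSameResult()
    {
        var input = new List<int> { 3, 1, 4, 1, 5 };
        var trace1 = new List<string>();
        var trace2 = new List<string>();

        // Evaluate is assumed to accept a sink that records each step taken,
        // e.g. which branch was chosen or which element an iterator visited.
        bool first = MyProgram.Evaluate(input, trace1.Add);
        bool second = MyProgram.Evaluate(input, trace2.Add);

        Assert.AreEqual(first, second);                // same output
        CollectionAssert.AreEqual(trace1, trace2);     // same path, same element order
    }
}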
We have a Windows Service created in .Net 4.0; the service parses large text files made up of lines of comma-separated values (several million lines, with between 5-10 values each). No problem here: we can read the lines, split them into a Key/Value collection and process the values. To validate the values we are using Data Parallelism to pass the Values, which is basically an array of values in specific formats, to a method that performs RegEx validation on individual values.
Up until now we have used static Regular Expressions, not the static RegEx.IsMatch method but a static RegEx property with the RegexOption defined as RegexOptions.Compiled, as detailed below.
private static Regex clientIdentityRegEx = new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);
Using this method we had a pretty standard memory footprint, the memory increased marginally with the greater number of values in each line, the time taken was more or less linear to the total number of lines.
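For context, the validation path looks roughly like this (the line splitting, field order and the second validation are simplified placeholders):

using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class Validator
{
    private static Regex clientIdentityRegEx = new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);

    public static void ValidateLines(string[] lines)
    {
        Parallel.ForEach(lines, line =>
        {
            string[] values = line.Split(',');

            // The shared, compiled Regex instance is thread-safe for IsMatch,
            // so every worker reuses the same object rather than constructing its own.
            bool clientIdOk = clientIdentityRegEx.IsMatch(values[0]);
            // ... validate the remaining values against their own static Regexes
        });
    }
}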
To allow the Regular Expression to be used in other projects, of varying Framework versions, we recently moved the static RegEx properties to a common utilities project that is now compiled using the .Net 2.0 CLR (the actual Regular Expressions have not changed), the number of RegEx properties exposed has increased to about 60, from 25 or so. Since doing this we have started running into memory issues, an increase in memory 3 or more times that of the original project. When we profile the running service we can see the memory appears to be "leaking" from the RegEx.IsMatch, not any specific RegEx but various depending on which are called.
I found the following comment on an old MSDN blog post from one of the BCL team relating to the .Net 1.0/1.1 RegEx.
There are even more costs for compilation that should be mentioned, however. Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory that you'll ever get back. In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey. But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly.
I will add we have profiled "most" of the common RegEx calls and cannot replicate the issue individually.
Is this a known issue with the .Net 2.0 CLR?
In the article the writer states "But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly". What is likely to be the finite number of expressions used in this manner, and is this likely to be a cause?
Update: In line with the answer from @Henk Holterman, are there any best practices for benchmark testing Regular Expressions, specifically RegEx.IsMatch, other than using sheer brute force by volume and parameter format?
Answer: Henk's answer of "The scenario calls for a limited, fixed number of RegEx objects" was pretty much spot on. We added the static RegExes to the class until we isolated the expressions with a noticeable increase in memory usage; these were migrated to separate static classes, which seems to have solved some of the memory issues.
It appears, although I cannot confirm this, that there is a difference in compiled RegEx usage between the .Net 2.0 CLR and the .Net 4.0 CLR, as the memory issues do not occur when compiled solely for the .Net 4.0 framework. (Any confirmations?)
The scenario calls for a limited, fixed number of RegEx objects. That shouldn't leak. You should verify that in the new situation the RegEx objects are still being reused.
The other possibility is the increased number of expressions (60, up from 25). Could just one of them maybe be a little more complex, leading to excessive backtracking?
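As a concrete thing to check when verifying reuse, the difference looks like this (illustrative only, using the pattern from the question):

using System.Text.RegularExpressions;

static class ClientIdValidation
{
    // Reused: IL for the compiled matcher is emitted once, at type initialisation.
    private static readonly Regex ClientId = new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);

    public static bool IsValidClientId(string value)
    {
        return ClientId.IsMatch(value);
    }

    // Leaky pattern to look for: constructing a compiled Regex per call means
    // Reflection.Emit runs on every validation, which is exactly the per-expression
    // cost the quoted blog comment warns about.
    public static bool IsValidClientIdLeaky(string value)
    {
        return new Regex("^[0-9]{4,9}$", RegexOptions.Compiled).IsMatch(value);
    }
}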
I currently have sensor data being dumped into a database. This is raw data, and needs an equation applied to it in order for it to make any sense to the end users. The problem I have, is that I do not know most of the formulas yet, and would also like the program to be flexible enough that when a new sensor is added to the system, the user would be able to enter in the calibration equation that would be able to convert the raw data into something useful.
I have never worked with letting a user enter an equation to manipulate data. I would appreciate any input that might help. What direction should I be looking in: should I be trying out lambda expression trees, evaluating the equation and compiling it using CodeDom, or looking in another direction? I have never done much with either lambda expression trees or CodeDom, and, as always, I'm on a fairly tight schedule, so the learning curve does count. I will have the opportunity to go back and make it better at a later date; they just need it up and running for now.
Thanks for any input.
I highly recommend FLEE for expression parsing/evaluation. It has a custom IL compiler that emits fast IL that doesn't have the memory problems that CodeDOM has.
It also has the desirable attribute of being easy to code with and extend.
I think you need to see what works for you. I also thought of those two, only to find out you had already mentioned them. I think the other alternative is to allow for parameters of a few major formulae to be stored (i.e. cubic, quadratic, exponential, log, ...) and one selected as the one to be used.
I would personally use the expression trees because it is the cleanest. One problem with CodeDom is the memory leak caused by compiling code especially if the user changes the code and builds the formula multiple times. One solution would be to load the compiled code in a separate AppDomain and then unload the whole appdomain.
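A minimal expression-tree sketch for a linear calibration y = a*x + b is below; the part you would still have to write is parsing the user's equation text into these nodes (the coefficients and the usage comment are just examples):

using System;
using System.Linq.Expressions;

static class Calibration
{
    public static Func<double, double> BuildLinear(double a, double b)
    {
        // raw => a * raw + b
        ParameterExpression raw = Expression.Parameter(typeof(double), "raw");
        Expression body = Expression.Add(
            Expression.Multiply(Expression.Constant(a), raw),
            Expression.Constant(b));

        // Compile() emits a delegate; keep and reuse it for every reading
        // from that sensor instead of rebuilding it per value.
        return Expression.Lambda<Func<double, double>>(body, raw).Compile();
    }
}

// Usage: var toCelsius = Calibration.BuildLinear(0.125, -40.0);
//        double value = toCelsius(rawReading);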
OK, so I was wondering how one would go about creating a program that creates a second program (like how most compression programs can create self-extracting executables, but that's not what I need).
Say I have two programs, each one containing a class. The one program I would use to modify and fill the class with data. The second file would be a program that also has the class, but empty, and its only purpose is to access this data in a specific way. I don't know, I'm thinking of the specific class being serialized and then "injected" into the second file. But how would one be able to do that? I've found modifying files that were already compiled fascinating, though I've never been able to make changes that didn't cause errors.
That's just a thought. I don't know what the solution would be, that's just something that crossed my mind.
I'd prefer some information in, say, C or C++ that's cross-platform. The only other language I'd accept is C#.
also
I'm not looking for third-party libraries, or things such as Boost. If anything, a shove in the right direction could be all I need.
++also
I don't want to be using a compiler.
Jalf, actually read what I wrote.
That's exactly what I would like to know how to do. I think that's fairly obvious by what I asked above. I said nothing about compiling the files, or scripting.
QUOTE "I've found modifying files that were already compiled fascinating"
Please read and understand the question first before posting.
thanks.
Building an executable from scratch is hard. First, you'd need to generate machine code for what the program would do, and then you need to encapsulate such code in an executable file. That's overkill unless you want to write a compiler for a language.
These utilities that generate a self-extracting executable don't really make the executable from scratch. They have the executable pre-generated, and the data file is just appended to the end of it. The Windows executable format allows you to put data at the end of the file and only cares about the "real executable" part (the exe header tells how big it is; the rest is ignored).
For instance, try generating two self-extracting zips and do a binary diff on them. You'll see their first X KBytes are exactly the same; what changes is the rest, which is not an executable at all, it's just data. When the file is executed, it looks at what is found at the end of the file (the data) and unzips it.
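A rough C# illustration of that append-and-read-back trick (the stub path and payload are placeholders; real tools use a magic marker and more robust framing):

using System;
using System.IO;

class Packer
{
    // Copy the pre-built stub and append the payload plus its length.
    static void Pack(string stubPath, string outputPath, byte[] payload)
    {
        File.Copy(stubPath, outputPath, true);
        using (var stream = new FileStream(outputPath, FileMode.Append))
        {
            stream.Write(payload, 0, payload.Length);
            stream.Write(BitConverter.GetBytes((long)payload.Length), 0, 8);
        }
    }

    // Inside the stub: read the trailing length, then the payload just before it.
    static byte[] Unpack(string selfPath)
    {
        using (var stream = File.OpenRead(selfPath))
        {
            stream.Seek(-8, SeekOrigin.End);
            var lengthBytes = new byte[8];
            stream.Read(lengthBytes, 0, 8);
            long length = BitConverter.ToInt64(lengthBytes, 0);

            stream.Seek(-8 - length, SeekOrigin.End);
            var payload = new byte[length];
            stream.Read(payload, 0, (int)length);
            return payload;
        }
    }
}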
Take a look at the Wikipedia entry, and go to the external links section to dig deeper:
http://en.wikipedia.org/wiki/Portable_Executable
I only mentioned Windows here but the same principles apply to Linux. But don't expect to have cross-platform results, you'll have to re-implement it to each platform. I couldn't imagine something that's more platform-dependent than the executable file. Even if you use C# you'll have to generate the native stub, which is different if you're running on Windows (under .net) or Linux (under Mono).
Invoke a compiler with data generated by your program (write temp files to disk if necessary) and/or stored on disk?
Or is the question about the details of writing the local executable format?
Unfortunately, with compiled languages such as C, C++, Java, or C#, you won't be able to just "run" new code at runtime, like you can in interpreted languages such as PHP, Perl, and ECMAScript. The code has to be compiled first, and for that you will need a compiler. There's no getting around this.
If you need to duplicate the save/restore functionality between two separate EXEs, then your best bet is to create a static library shared between the two programs, or a DLL shared between the two programs. That way, you write that code once and it's able to be used by as many programs as you want.
On the other hand, if you're really running into a scenario like this, my main question is, What are you trying to accomplish with this? Even in languages that support things like eval(), self modifying code is usually some of the nastiest and bug-riddled stuff you're going to find. It's worse even than a program written completely with GOTOs. There are uses for self modifying code like this, but 99% of the time it's the wrong approach to take.
Hope that helps :)
I had the same problem and I think this solves it.
You can put whatever code in there, and if it is correct it will produce the second executable at runtime.
--ADD--
So, in short, you have some code which you can either hard-code and store in the code of your first exe file, or keep outside it. Then you run it and compile the aforementioned code. If everything is OK you will get a second, runtime-compiled executable. All this without any external lib!
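In C# that boils down to something like this sketch using the framework's CSharpCodeProvider (the source string and output name are placeholders):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

class RuntimeCompiler
{
    static void Main()
    {
        // The code of the second program, hard-coded here; it could equally
        // come from a file or be generated by the first program.
        string source = @"
            class SecondProgram
            {
                static void Main() { System.Console.WriteLine(""Hello from exe #2""); }
            }";

        using (var provider = new CSharpCodeProvider())
        {
            var options = new CompilerParameters
            {
                GenerateExecutable = true,
                OutputAssembly = "Second.exe"
            };
            options.ReferencedAssemblies.Add("System.dll");

            CompilerResults results = provider.CompileAssemblyFromSource(options, source);
            foreach (CompilerError error in results.Errors)
                Console.WriteLine(error.ErrorText);
        }
    }
}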
Ok, so I was wondering how one would go about creating a program, that creates a second program
You can look at CodeDom. Here is a tutorial
Have you considered embedding a scripting language such as Lua or Python into your app? This will give you the ability to dynamically generate and execute code at runtime.
From wikipedia:
Dynamic programming language is a term used broadly in computer science to describe a class of high-level programming languages that execute at runtime many common behaviors that other languages might perform during compilation, if at all. These behaviors could include extension of the program, by adding new code, by extending objects and definitions, or by modifying the type system, all during program execution. These behaviors can be emulated in nearly any language of sufficient complexity, but dynamic languages provide direct tools to make use of them.
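For example, with IronPython (a third-party package, so it may not fit the "no libraries" constraint) the DLR hosting API lets you run generated code in-process:

using IronPython.Hosting;           // IronPython + DLR hosting assemblies
using Microsoft.Scripting.Hosting;

class ScriptHost
{
    static void Main()
    {
        ScriptEngine engine = Python.CreateEngine();
        ScriptScope scope = engine.CreateScope();

        // Code built (or edited by a user) at runtime, executed immediately.
        scope.SetVariable("x", 21);
        engine.Execute("y = x * 2", scope);

        int y = scope.GetVariable<int>("y");
        System.Console.WriteLine(y);   // 42
    }
}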
Depending on what you call a program, Self-modifying code may do the trick.
Basically, you write code somewhere in memory as if it were plain data, and you call it.
Usually it's a bad idea, but it's quite fun.
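A tiny C# illustration of the idea on Windows x64 (the byte sequence encodes "mov eax, 42; ret"; this is deliberately minimal and entirely platform-specific):

using System;
using System.Runtime.InteropServices;

class CodeFromData
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr address, UIntPtr size, uint allocType, uint protect);

    delegate int NoArgsReturnsInt();

    static void Main()
    {
        // mov eax, 42 ; ret  -- machine code written as plain bytes.
        byte[] code = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        // MEM_COMMIT | MEM_RESERVE (0x3000), PAGE_EXECUTE_READWRITE (0x40)
        IntPtr buffer = VirtualAlloc(IntPtr.Zero, (UIntPtr)code.Length, 0x3000, 0x40);
        Marshal.Copy(code, 0, buffer, code.Length);

        // Treat the data we just wrote as a callable function.
        var fn = (NoArgsReturnsInt)Marshal.GetDelegateForFunctionPointer(buffer, typeof(NoArgsReturnsInt));
        Console.WriteLine(fn());   // prints 42
    }
}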