Parsing C Header Files in C#

I'm working with Visual Studio C#, and I need to parse C header files to extract information only about the function declarations contained within. For each function I need the name, return type, and its parameters. If possible, I'd like the parameters in the order in which they appear in the function declaration.
I've seen stuff online about using Visual Studio's tags, or Exuberant Ctags, etc. But from what I gathered, those aren't really options that let me perform the parse from my C# program with C# code (I may be mistaken?). I've also looked through all the other answers to related questions, but they don't really seem to apply to my situation (I may just be dumb).
If I could at least get all the lines of code that represent function declarations I'd have a good start and could hand-parse the rest myself.
Thanks in advance

To "parse" C (header) files in a deep sense and pick up the type information for function declarations, in practice you need:
a full preprocessor (including the peccadilloes added by the vendor; MS has some pretty odd stuff in their headers),
a full (syntax) parser/AST builder for the C dialect of interest (there's no such thing as "C"; there is what the vendor offers in this revision of the compiler)
a full symbol table construction (because typedefs are aliases for the actual types of interest)
Many people will suggest "write your own parser (for C)". Mostly those people haven't done this; it's a lot more work to do this and get it right than they understand. If you don't start with production-level machinery, you won't get through real C header files without fixing it all.
Just parsing plain C is hard; consider the problem of parsing the ambiguous phrase
T*X;
A classic parser cannot parse this without additional hackery.
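To make the ambiguity concrete: whether T*X; declares X as a pointer to T or multiplies T by X depends on whether T names a type at that point. Here is a minimal C# sketch, assuming a hypothetical set of typedef names of the kind a real front end would collect in its symbol table (the class and method names are illustrative only):

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a real C front end tracks typedefs per scope while parsing.
static class StarStatement
{
    // Classifies a statement of the exact shape "A*B;" using known typedef names.
    public static string Classify(string statement, ISet<string> typedefNames)
    {
        string body = statement.Trim().TrimEnd(';');
        string[] parts = body.Split('*');
        if (parts.Length != 2)
            return "not of the form A*B;";

        string left = parts[0].Trim();
        string right = parts[1].Trim();

        // The same token sequence means two different things depending on
        // whether the left identifier has been declared as a type.
        return typedefNames.Contains(left)
            ? "declaration: " + right + " is a pointer to " + left
            : "expression: " + left + " multiplied by " + right;
    }

    static void Main()
    {
        Console.WriteLine(Classify("T*X;", new HashSet<string> { "T" }));
        Console.WriteLine(Classify("T*X;", new HashSet<string>()));
    }
}
```

The grammar alone cannot make this decision; the parser has to consult declarations it has already seen, which is the "additional hackery" in question.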
You will also not be able to parse a C header file by itself, in general. You need to have the source code context (often including the compiler command line) in which it is included, or typedefs, preprocessor conditionals and macros in a specific header file will be undefined and therefore unexpandable into the valid C that the compiler normally sees.
You are better off getting pre-existing pre-tested machinery that will do this for you. Clang comes to mind as an option, although I'm not sure it handles the MS header files. GCC is kind of an option, but it really, really wants to be a compiler, not your local friendly C source code analysis tool, and again I'm unsure of its support for MS dialects of C. Our DMS Software Reengineering Toolkit has all of the above for various MS dialects of C.
Having chosen a tool that can actually parse such headers, you'll likely want to do something with the collected header information. You are vague about what you want to accomplish. Having mentioned C# and C in the same breath, there's a hint that you want to call C programs from C# code, and thus need to generate C# equivalent APIs for the C code. For this you will need machinery to manipulate the type information provided, and to build the "text" for the C# declarations. For this, you are likely to find that you need other supporting tooling to do that part, too. Here GCC is a complete non-starter; it will offer you no additional help. Clang and DMS are both designed to be libraries of custom-tool building machinery.
Of course, this may all be moot depending on how much header file text you want to handle; if it is just one header file, doing it manually is probably easiest. You suggest you are willing to do that ("could hand-parse..."). In that case, all you really need to do is run the preprocessor and interpret the output. I believe you can do this with command line switches for GCC and Clang and even the MS compilers; I know DMS can do this. For easily available options here, see How do I see a C/C++ source file after preprocessing in Visual Studio?
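For the hand-parsing route, a rough C# sketch of "interpret the output" might look like the following. It assumes the text has already been preprocessed (for example with gcc -E or cl /P) and contains only simple prototypes with no function bodies; the class name, regex and sample header are illustrative assumptions, and function pointers, __declspec and similar vendor extensions are deliberately skipped rather than parsed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Rough sketch: only simple prototypes such as "int add(int a, int b);" are recognized.
static class PrototypeScan
{
    // Return type (lazy), function name, then a parameter list without nested parentheses.
    static readonly Regex Prototype = new Regex(
        @"^(?<ret>[\w\s\*]+?)\s*\b(?<name>\w+)\s*\((?<params>[^()]*)\)$");

    public static IEnumerable<(string ReturnType, string Name, string[] Parameters)> Find(string preprocessed)
    {
        // Drop the line markers that preprocessors emit (lines starting with '#').
        var lines = preprocessed.Split('\n')
            .Where(l => !l.TrimStart().StartsWith("#", StringComparison.Ordinal));
        string text = string.Join(" ", lines);

        foreach (string raw in text.Split(';'))
        {
            string statement = Regex.Replace(raw, @"\s+", " ").Trim();
            if (statement.StartsWith("typedef", StringComparison.Ordinal) || statement.Contains("{"))
                continue;

            Match m = Prototype.Match(statement);
            if (!m.Success)
                continue;

            // Parameters are kept in declaration order.
            string[] parameters = m.Groups["params"].Value
                .Split(',')
                .Select(p => p.Trim())
                .Where(p => p.Length > 0)
                .ToArray();
            yield return (m.Groups["ret"].Value.Trim(), m.Groups["name"].Value, parameters);
        }
    }

    static void Main()
    {
        string sample = "# 1 \"demo.h\"\nint add(int a, int b);\nconst char *name(void);\n";
        foreach (var f in Find(sample))
        {
            string parameters = string.Join("; ", f.Parameters);
            Console.WriteLine(f.Name + " returns '" + f.ReturnType + "' with parameters [" + parameters + "]");
        }
    }
}
```

This is only reasonable for a header you can inspect by eye afterwards; anything it silently skips is exactly the kind of declaration that needs the full machinery described above.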

Related

Convert assembly to C using Replace some groups with Regex in C#

I want to convert asm to C (assembly to C).
I saw this page on the web: http://www.textmaestro.com/InfoEx_17_Convert_Assembly.htm
(please see the page)
After that I tried to do this job using find and replace with Regex in C#.
I am not a computer science student, so I am not experienced with Regex.
I have been working on this for 5 days and I now know that I can't do it. I wrote a lot of code but without any success.
sample program:
mov r1,1;
mov r2,2;
should be converted to:
r1=1;
r2=2;
Please help me to do this correctly.

OP has (painfully) learned that regexps are not a good solution to problems that involve analysis or translation of software. Processing strings simply is not the same as building context-sensitive analyses of text with complex structure.
People keep re-learning this lesson. It is true that you can use repeated regex to simulate Post rewriting systems, and Post systems, being Turing capable, can technically do anything. It is also true that nobody really wants to, or more importantly, nobody can write a very complex program for a real Turing machine [or an equivalent Post system]. This is why we have all these other computer languages and tools. [The TextMaestro system to which OP refers is trying to be exactly that Post system.]
However, the task he wants to do is possible and practical with the proper tools: program transformation systems (PTS).
In particular, he should see this technical paper for a description of precisely how this has been done with one particular PTS: Pigs from sausages? Reengineering from assembler to C via FermaT transformations. Such a tool in effect is a custom compiler from assembly source code to the target language, and includes parsing, name (label) resolution, often data flow analysis and complex code generation and optimization. A PTS is used because it makes it relatively easy to build that kind of compiler. This tool has been used for at least Intel assembly to C, and mainframe (System 360/370/Z) assembly to C, for large-scale tasks. (I have no relationship to this tool but do have huge respect for the authors).
The naysayers in the comments seem to think this is impossible to do except for extremely constrained circumstances. It is true that the more one knows about the assembly code in terms of idioms, the somewhat easier this gets, but the technical approach in the paper is not limited to specific compiler output by any means. It is also true that truly arcane assembler code (especially self-modifying or having runtime code generation) is extremely difficult to translate.
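For the two-line sample in the question, a single substitution does work, which is probably why regexes look tempting at first. Here is a minimal C# sketch covering only lines of the form "mov dest,source;" with simple operands (the class name and pattern are illustrative assumptions, not a translation scheme):

```csharp
using System;
using System.Text.RegularExpressions;

static class MovRewrite
{
    // Matches only "mov <dest>,<source>;" on a line of its own, with word-like operands.
    static readonly Regex Mov = new Regex(
        @"^[ \t]*mov[ \t]+(?<dst>\w+)[ \t]*,[ \t]*(?<src>\w+)[ \t]*;[ \t]*(?=\r?$)",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Rewrites each matching line as a C-style assignment; other lines are left untouched.
    public static string Convert(string asm)
    {
        return Mov.Replace(asm, "${dst}=${src};");
    }

    static void Main()
    {
        Console.WriteLine(Convert("mov r1,1;\nmov r2,2;"));
    }
}
```

As soon as the input involves flags, jumps, labels or memory operands, the meaning of one line depends on others, and that is where the parsing and data flow analysis of a transformation system become necessary.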

Code parsing C#

I am researching ways, tools and techniques to parse code files in order to support syntax highlighting and IntelliSense in an editor written in C#.
Does anyone have any ideas/patterns & practices/tools/techniques for that?
EDIT: A nice source of info for anyone interested:
Parsing beyond Context-free grammars
ISBN 978-3-642-14845-3

My favourite parser for C# is Irony: http://irony.codeplex.com/ - I have used it a couple of times with great success
Here is a wikipedia page listing many more: http://en.wikipedia.org/wiki/Compiler-compiler

There are two basic approaches:
1) Parse the entire solution and everything it references so you understand all the types involved in the code
2) Parse locally and do your best to guess what types etc are.
The trouble with (2) is that you have to guess, and in some circumstances you just can't tell from a code snippet exactly what everything is. But if you're happy with the sort of syntax highlighting shown on (e.g.) Stack Overflow, then this approach is easy and quite effective.
To do (1) then you need to do one of (in decreasing order of difficulty):
Parse all the source code. Not possible if you reference 3rd party assemblies.
Use reflection on the compiled code to garner type information you can use when parsing the source.
Use the host IDE's (if available - so not applicable in your case!) code element interfaces to provide the information you need
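The reflection option can be sketched as follows: given a type resolved from a referenced assembly and the prefix the user has typed, list matching public members. This is a minimal illustration, assuming the editor has already worked out which type the expression before the dot has; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Reflection;

static class CompletionSketch
{
    // Returns names of public instance methods and properties that start with the typed prefix.
    public static string[] Candidates(Type type, string prefix)
    {
        return type
            .GetMembers(BindingFlags.Public | BindingFlags.Instance)
            .Where(m => m.MemberType == MemberTypes.Method || m.MemberType == MemberTypes.Property)
            .Where(m => !(m is MethodBase method && method.IsSpecialName)) // hide get_/set_ accessors
            .Select(m => m.Name)
            .Where(name => name.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .Distinct()
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToArray();
    }

    static void Main()
    {
        // What an editor might offer after "text.Sub" when text is known to be a string.
        Console.WriteLine(string.Join(", ", Candidates(typeof(string), "Sub")));
    }
}
```

The hard part remains deciding which type to ask about, and that still requires parsing the source being edited.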

You could take a look at how http://www.icsharpcode.net/ did it. They wrote a book doing just that, Dissecting a C# Application: Inside SharpDevelop; it even has a chapter called "Implement a parser to provide syntax highlighting and auto-completion as users type".

Make an executable at runtime

Ok, so I was wondering how one would go about creating a program that creates a second program (like how most compression programs can create self-extracting executables, but that's not what I need).
Say I have 2 programs. Each one containing a class. The one program I would use to modify and fill the class with data. The second file would be a program that also had the class, but empty, and its only purpose is to access this data in a specific way. I don't know, I'm thinking if the specific class were serialized and then "injected" into the second file. But how would one be able to do that? I've found modifying files that were already compiled fascinating, though I've never been able to make changes that didn't cause errors.
That's just a thought. I don't know what the solution would be, that's just something that crossed my mind.
I'd prefer some information in say c or c++ that's cross-platform. The only other language I'd accept is c#.
also
I'm not looking for third-party libraries, or things such as Boost. If anything, a shove in the right direction could be all I need.
++also
I don't want to be using a compiler.
Jalf actually read what I wrote
That's exactly what I would like to know how to do. I think that's fairly obvious by what I asked above. I said nothing about compiling the files, or scripting.
QUOTE "I've found modifying files that were already compiled fascinating"
Please read and understand the question first before posting.
thanks.

Building an executable from scratch is hard. First, you'd need to generate machine code for what the program would do, and then you need to encapsulate such code in an executable file. That's overkill unless you want to write a compiler for a language.
These utilities that generate a self-extracting executable don't really make the executable from scratch. They have the executable pre-generated, and the data file is just appended to the end of it. This works because the Windows executable format allows you to put data at the end of the file: the loader cares only about the "real executable" part (the exe header tells it how big that is; the rest is ignored).
For instance, try generating two self-extracting zips and doing a binary diff on them. You'll see their first X KBytes are exactly the same; what changes is the rest, which is not executable at all, it's just data. When the file is executed, it looks at what is found at the end of the file (the data) and unzips it.
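The append-to-the-end idea can be sketched in C# as below. The "stub" is just a placeholder file standing in for a prebuilt executable, and the 8-byte length footer is one possible layout chosen for this illustration, not the format any particular self-extractor uses.

```csharp
using System;
using System.IO;
using System.Text;

// Sketch: copy a prebuilt stub, append the data, then append the data length as a footer.
static class AppendedPayload
{
    public static void Write(string stubPath, string outputPath, byte[] payload)
    {
        File.Copy(stubPath, outputPath, overwrite: true);
        using (var stream = new FileStream(outputPath, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(payload);              // raw bytes, no length prefix
            writer.Write((long)payload.Length); // 8-byte footer
        }
    }

    // Reads the footer first, then seeks back to where the payload starts.
    public static byte[] Read(string path)
    {
        using (var stream = File.OpenRead(path))
        using (var reader = new BinaryReader(stream))
        {
            stream.Seek(-sizeof(long), SeekOrigin.End);
            long length = reader.ReadInt64();
            stream.Seek(-sizeof(long) - length, SeekOrigin.End);
            return reader.ReadBytes((int)length);
        }
    }

    static void Main()
    {
        string stub = Path.GetTempFileName();
        string output = Path.GetTempFileName();
        File.WriteAllBytes(stub, new byte[] { 0x4D, 0x5A }); // placeholder bytes, not a real executable
        Write(stub, output, Encoding.UTF8.GetBytes("serialized class data"));
        Console.WriteLine(Encoding.UTF8.GetString(Read(output)));
    }
}
```

In the scenario from the question, the first program would call something like Write with the serialized class as the payload, and the second program would open its own file and call something like Read at startup.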
Take a look at the wikipedia entry, go to the external links section to dig deeper:
http://en.wikipedia.org/wiki/Portable_Executable
I only mentioned Windows here but the same principles apply to Linux. But don't expect to have cross-platform results; you'll have to re-implement it for each platform. I couldn't imagine something that's more platform-dependent than the executable file. Even if you use C# you'll have to generate the native stub, which is different if you're running on Windows (under .NET) or Linux (under Mono).

Invoke a compiler with data generated by your program (write temp files to disk if necessary) and/or stored on disk?
Or is the question about the details of writing the local executable format?

Unfortunately with compiled languages such as C, C++, Java, or C#, you won't be able to just "run" new code at runtime, like you can do in interpreted languages like PHP, Perl, and ECMAScript. The code has to be compiled first, and for that you will need a compiler. There's no getting around this.
If you need to duplicate the save/restore functionality between two separate EXEs, then your best bet is to create a static library shared between the two programs, or a DLL shared between the two programs. That way, you write that code once and it's able to be used by as many programs as you want.
On the other hand, if you're really running into a scenario like this, my main question is, What are you trying to accomplish with this? Even in languages that support things like eval(), self modifying code is usually some of the nastiest and bug-riddled stuff you're going to find. It's worse even than a program written completely with GOTOs. There are uses for self modifying code like this, but 99% of the time it's the wrong approach to take.
Hope that helps :)

I had the same problem and I think that this solves all problems.
You can put whatever code you want there and, if it is correct, it will produce a second executable at runtime.
--ADD--
So in short you have some code which you can hard-code and store in the code of your 1st exe file or keep outside it. Then you run it and it compiles the aforementioned code. If everything is OK you will get a second executable, compiled at runtime. All this without any external lib!!

"Ok, so I was wondering how one would go about creating a program, that creates a second program"
You can look at CodeDom. Here is a tutorial

Have you considered embedding a scripting language such as Lua or Python into your app? This will give you the ability to dynamically generate and execute code at runtime.
From wikipedia:
Dynamic programming language is a term used broadly in computer science to describe a class of high-level programming languages that execute at runtime many common behaviors that other languages might perform during compilation, if at all. These behaviors could include extension of the program, by adding new code, by extending objects and definitions, or by modifying the type system, all during program execution. These behaviors can be emulated in nearly any language of sufficient complexity, but dynamic languages provide direct tools to make use of them.

Depending on what you call a program, Self-modifying code may do the trick.
Basically, you write code somewhere in memory as if it were plain data, and you call it.
Usually it's a bad idea, but it's quite fun.

C# Interpreter (without compilation) [closed]

Is there a ready-to-use C# interpreter out there that does not rely on runtime compilation?
My requirements are:
A scripting engine
Must Handle C# syntax
Must work on medium-trust environments
Must not use runtime compilation (CodeDomProvider ...)
Open source (or at least free of charge both for personal and professional use)
If this is not clear, I need something like Jint (http://jint.codeplex.com/), but which allows me to write C# scripts instead of JavaScript ones.
Thanks for your help.

Have you looked at paxScript.NET?

Check out the Mono project. They recently demoed CsharpRepl which sounds like what you're after. The PDC 2008 video here.
Update:
On a close look it seems like using Mono.CSharp service to evaluate scripts won't be possible. Currently it is linked to the Mono runtime and they don't expect it to run in a medium trust environment. See this discussion for more info.
One alternative possibility is to include the Mono C# compiler (sources here) in your project and use it to generate assemblies that you load from the file system. If you are worried about the resources required to load all those assemblies you might have to load them in a separate AppDomain.

"I need to evaluate 10000+ small scripts that are all different, compiling all of them would be just dramatically slow"
Interpreting these would be even more painfully slow. We have a similar issue that we address as follows:
We use the Gold Parser project to parse source code and convert it to an XML based 'generic language'. We run this through a transform that generates VB.Net source code (simply because it's case insensitive). We then compile these using the .Net runtime into a standalone DLL, and call this using heavily restricted access.
It sounds as though you are creating something like a dynamic website where people can create custom modules or snippets of functionality, but using C# to do this introduces a couple of main problems. C# has to be compiled; the only way around this is to interpret it at runtime, which is infeasible. And even if you do compile each snippet, you end up with 10,000 DLLs, which is impractical and unusable.
If your snippets rarely change, then I would consider programmatically wrapping them into a single set of source, with each having a unique name, then compiling them in a single shot (or as a timed process every 10 minutes?). This is what we do, as it also allows 'versioning' of people's sessions so they continue using the version of the DLL they had at the start of their session, but when every session stops using an old version then it's removed.
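The wrapping step might look like the sketch below, which only builds the combined source text and leaves compilation to whichever compiler service is available. It assumes, purely for illustration, that every snippet is a boolean expression over one int parameter named x and that the dictionary keys are valid, unique method names; GeneratedSnippets and SnippetWrapper are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SnippetWrapper
{
    // Builds one C# source file in which each snippet becomes a uniquely named static method.
    public static string BuildSource(IDictionary<string, string> snippets)
    {
        var source = new StringBuilder();
        source.AppendLine("public static class GeneratedSnippets");
        source.AppendLine("{");
        foreach (KeyValuePair<string, string> pair in snippets)
        {
            source.AppendLine("    public static bool " + pair.Key + "(int x)");
            source.AppendLine("    {");
            source.AppendLine("        return " + pair.Value + ";");
            source.AppendLine("    }");
        }
        source.AppendLine("}");
        return source.ToString();
    }

    static void Main()
    {
        var snippets = new Dictionary<string, string>
        {
            { "IsLarge", "x > 4" },
            { "IsEven", "x % 2 == 0" }
        };
        Console.WriteLine(BuildSource(snippets));
    }
}
```

Compiling the result once yields a single assembly for all snippets instead of one DLL per snippet.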
If your snippets change regularly throughout the day then I would suggest you look at an interpreted scripting language instead, even PHP, and mix your languages depending on the functionality you require. Products such as CScript and LINQPad all use the CodeDomProvider, because you have to have MSIL somewhere if you want to program compiled logic.
The only other option is to write your own interpreter and use reflection to access all the other libraries you need to access, but this is extremely complex and horrible.
As your requirements are effectively unachievable, I would suggest you take a step back and figure out a way of removing one or more restrictions: find a FullTrust environment to compile your snippets in, remove the need for full code support (i.e. move to interpreted code snippet support), or even change the whole framework to something non .NET.

LINQPad can work as a code snippet IDE. The application is very small and lightweight. It is free (as in beer) but not open-source. Autocompletion costs extra but not much ($19).
Edit: after reading over the comments in this post a little more carefully, I don't think LINQPad is what you want. You need something that can programmatically evaluate thousands of little scripts dynamically, right? I did this at work using IronRuby very easily. If you're willing to use a DLR language, this would probably be more feasible. I also did some similar work with some code that could evaluate a C# lambda expression passed in as a string, but that was extremely limited.

I have written an open source project, Dynamic Expresso, that can convert text expressions written using C# syntax into delegates (or expression trees). Expressions are parsed and transformed into Expression Trees without using compilation or reflection.
You can write something like:
var interpreter = new Interpreter();
var result = interpreter.Eval("8 / 2 + 2");
or
var interpreter = new Interpreter()
    .SetVariable("service", new ServiceExample());
string expression = "x > 4 ? service.SomeMethod() : service.AnotherMethod()";
Lambda parsedExpression = interpreter.Parse(expression,
    new Parameter("x", typeof(int)));
parsedExpression.Invoke(5);
My work is based on Scott Gu's article http://weblogs.asp.net/scottgu/archive/2008/01/07/dynamic-linq-part-1-using-the-linq-dynamic-query-library.aspx .

Or http://www.csscript.net/
Oleg wrote a good intro on CodeProject.

It doesn't handle exact C# syntax, but PowerShell is so well enmeshed with the .NET framework and is such a mature product, I think you would be unwise to ignore it as at least a possible solution. Most server products being put out by Microsoft are now supporting PowerShell for their scripting interface including Microsoft Exchange and Microsoft SQL Server.

I believe Mono has mint, an interpreter they use before implementing the JIT for a given platform. While the docs in the official site (e.g. Runtime) say it's just an intermediate state before consolidating the jitting VM, I'm pretty sure it was there the last time I compiled it on Linux. I can't quite check it right now, unfortunately, but maybe it's in the direction you want.

bungee# is the thing that you want. In a short time, bungee sharp will be an open source project at http://www.crssoft.com/Services/Bungee. You can create scripts with the same C# syntax. There is no assembly creation when you run the script; interpretation is done on the fly, so the performance is high. All the keywords are available, as in C#. I hope you will like it very much.

I faced the same problem. In one project I was looking to provide a generic way to specify conditions controlling when a certain letter has to be generated. In another project the conditions were controlling how cases were assigned to queues. In both of them the following solution worked perfectly:
The Language for the snippets - I chose JScript so that I do not have to worry about variable types.
The Compilation - yes, it requires full trust, but you can place your code in a separate assembly and give it full trust. Do not forget to mark it with the AllowPartiallyTrustedCallers attribute.
Number of code snippets - I treated every snippet as a method, not a class. This way multiple methods can be combined into a single assembly
Disk usage - I did all compilation in memory without saving the assembly to disk. It also helps if you need to reload it.
All of this works in production without any problems
Edit
Just to clarify 'snippet' - the conditions I am talking about are just boolean expressions. I programmatically add additional text to turn them into methods, and the methods into compilable classes.
Also I can do the same with C# although I still think JScript is better for code snippets
And BTW my code is open source feel free to browse. Just keep in mind there is a lot of code there unrelated to this discussion. Let me know if you need help to locate the pieces concerning the topic

This one works really well
c# repl and interactive interpreter

Is Snippet Compiler something you are looking for?

How do languages like C# and Java avoid C/C++-like independent compilation?

For my programming languages class, I'm writing a research paper on some papers by some important people in the history of language design. One by CAR Hoare struck me as odd because it speaks against independent compilation techniques used in C and later C++ before C even became popular.
Since this is primarily an optimization to speed up compilation times, what is it about Java and C# that make them able to avoid reliance on independent compilation? Is it a compiler technique or are there elements of the language that facilitate this? And are there any other compiled languages that used these techniques before them?

Short answer: Java and C# don't avoid separate compilation; they make full use of it.
Where they differ is that they don't require the programmer to write a pair of separate header/implementation files when writing a reusable library. The user writes the definition of a class once, and the compiler extracts the information equivalent to the "header" from that single definition and includes it in the output file as "type metadata". So the output file (a .jar full of .class files in Java, or a .dll assembly in .NET-based languages) is a combination of binaries AND headers in a single package.
Then when another class is compiled and it depends on the first class, it can look at the metadata instead of having to find a separate include file.
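On the .NET side, the header-like information can be read back with reflection. The sketch below prints the public method signatures declared on a type, roughly what a C header would have spelled out by hand; Calculator is a made-up example type and the output format is just one choice for this illustration.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Made-up type whose "header" is recovered from metadata.
public class Calculator
{
    public int Add(int left, int right) { return left + right; }
}

static class MetadataListing
{
    // Formats the public methods declared directly on a type as prototype-like strings.
    public static string[] Signatures(Type type)
    {
        const BindingFlags flags = BindingFlags.Public | BindingFlags.Static |
                                   BindingFlags.Instance | BindingFlags.DeclaredOnly;
        return type.GetMethods(flags)
            .Where(m => !m.IsSpecialName) // skip property accessors and operators
            .Select(m =>
            {
                string parameters = string.Join(", ",
                    m.GetParameters().Select(p => p.ParameterType.Name + " " + p.Name));
                return m.ReturnType.Name + " " + m.Name + "(" + parameters + ")";
            })
            .ToArray();
    }

    static void Main()
    {
        foreach (string signature in Signatures(typeof(Calculator)))
            Console.WriteLine(signature); // prints "Int32 Add(Int32 left, Int32 right)"
    }
}
```

A compiler referencing the assembly reads the same metadata, which is why no separate include file is needed.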
It happens that they target a virtual machine rather than a specific chip architecture, but that's a separate issue; they could put x86 machine code in as the binary and still have the header-like metadata in the same file as well (this is in fact an option in .NET, albeit rarely used).
In C++ compilers it is common to try to speed up compilation by using "pre-compiled headers". The metadata in .NET .dll and .class files is much like a pre-compiled header - already parsed and indexed, ready for rapid look-ups.
The upshot is that in these modern languages, there is one way of doing modularization, and it has the characteristics of a perfectly organised and hand-optimised C++ modular build system - pretty nifty, speaking ASFAC++B.

IMO, one of the biggest factors here is that both Java and .NET use intermediate languages; that means that the compiled unit (jar/assembly) contains, as a pre-requisite, a lot of expressive metadata about the types, methods, etc., meaning that it is already laid out conveniently for reference checking. The runtime still checks anyway, in case you are pulling a fast one ;-p
This isn't very far removed from the MIDL that underpins COM, although there the TLB is often a separate entity.
If I've misunderstood your meaning, please let me know...

You could consider a java .class file to be similar to a precompiled header file in C/C++. Essentially the .class file is the intermediate form that a C/C++ linker would need as well as all of the information contained in the header (Java just doesn't have a separate header).
From your comment in another post:
"I'm basically meaning the idea in C/C++ that each source file is its own individual compilation unit. This doesn't as much seem to be the case in C# or Java."
In Java (I cannot speak for C#, but I assume it is the same) each source file is its own individual compilation unit. I am not sure why you would think it is not... perhaps we have different definitions of compilation unit?

It requires some language support (otherwise, C/C++ compilers would do it too)
In particular, it requires that the compiler generates self-contained modules, which expose metadata that other modules can reference to call into them.
.NET assemblies are a straightforward example. All the files in a project are compiled together, generating one dll. This dll can be queried by .NET to determine which types it contains, so that other assemblies can call functions defined in it.
And to make use of this, it must be legal in the language to reference other modules.
In C++, what defines the boundary of a module? The language specifies that the compiler only considers data in its current compilation unit (.cpp file + included headers). There is no mechanism for specifying "I'd like to call function Foo in module Bar, even though I don't have the prototype or anything for it at compile-time". The only mechanism you have for sharing type information between files is with #includes.
There is a proposal to add a module system to C++, but it won't be in C++0x. Last I saw, the plan was to consider it for a TR1 after 0x is out.
(It's worth mentioning that the #include system in C/C++ was originally used because it'd speed up compilation. Back in the 70's, it allowed the compiler to process the code in a simple linear scan. It didn't have to build syntax trees or other such "advanced" features. Today, the tables have turned and it's become a huge bottleneck, both in terms of usability and compilation speed.)

The object files generated by a C/C++ compiler are meant to be read only by the linker, not by the compiler.

As to other languages: IIRC Turbo Pascal had "units" which you could use without having any source code. I think the point is to create metadata along with compiled code which can then be used by the compiler to figure out the interface to the module (i.e. signatures of functions, class layout etc.)
One problem with C/C++ which prevents just replacing #include with some kind of #import is also the preprocessor, which can completely change the meaning/syntax etc of included/imported modules. This would be very difficult (if not impossible) with a Java-like module system.
