declaration capture phase in compilation - c#

Languages like C and C++ rely on forward declarations to resolve cyclic dependencies in type or function declarations. In C#, this is no longer required because the declaration capture phase is split into two phases: one that captures symbol names and a second that actually constructs the symbol declarations.
Is there a standard name for the symbol name capture phase? I would assume that "declaration capture" would be left for the traditional phase that involves resolving all symbols in the declaration.

The C# compiler actually has a declaration phase where it builds the symbol table. The Roslyn C# compiler is not so clear, because not everything is done in large sweeping phases. Instead, each symbol is constructed individually, on demand. However, there is still a step where type and member declarations in syntax are converted into symbols. The binding phase comes after this logically, where references to type and member names are resolved using the declared symbol table.
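To make the declare-then-bind split concrete, here is a minimal toy sketch in Python. It is not how Roslyn or any real C# compiler is implemented; the function names `declare` and `bind` and the tuple layout of a declaration are assumptions made up for illustration. The point is only that once every name is recorded first, mutually dependent types resolve without forward declarations.

```python
# Toy model: a declaration is (type_name, base_name_or_None, field_type_names).
# Every name here is hypothetical; this is not Roslyn's design.

def declare(declarations):
    """Phase 1: record every declared type name without resolving references."""
    symbols = {}
    for name, base, fields in declarations:
        if name in symbols:
            raise ValueError(f"duplicate declaration: {name}")
        symbols[name] = {"base": base, "fields": fields}
    return symbols


def bind(symbols):
    """Phase 2: resolve each referenced name against the complete symbol table."""
    errors = []
    for name, info in symbols.items():
        referenced = list(info["fields"])
        if info["base"] is not None:
            referenced.append(info["base"])
        for ref in referenced:
            if ref not in symbols:
                errors.append(f"{name}: unknown type {ref}")
    return errors


# Order and Customer refer to each other; neither needs a forward declaration.
program = [
    ("Order", None, ["Customer"]),
    ("Customer", None, ["Order", "Address"]),
]
symbols = declare(program)
errors = bind(symbols)
print(errors)
```

Only the reference to `Address`, which is never declared, is reported; the cycle between `Order` and `Customer` binds cleanly because both names exist before binding starts.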

I think these two phases are called
parsing
binding
Parsing is syntactic. Binding is assigning meaning to identifiers and names.
C++ could do the same. It is just defined not to.

Microsoft is introducing Roslyn, the compiler as a service. This overview for Roslyn describes the compiler's three steps prior to code generation as being: lexical analysis, syntactic analysis and semantic analysis. The Roslyn project's overview confirms these descriptions, but with less precise language.

I found this blog post from Eric Lippert, which gives the best explanation of what I was looking for:
The C# language does not require that declarations occur before
usages, which has two impacts, again, on the user and on the compiler
writer. The impact on the user is that you don’t get to recompile just
the IL that changed when you change a file; the whole assembly is
recompiled. Fortunately the C# compiler is sufficiently fast that this
is seldom a huge issue. (Another way to look at this is that the
“granularity” of recompilation in C# is at the project level, not the
file level.)
The impact on the compiler writer is that we have to have a “two pass”
compiler. In the first pass, we look for declarations and ignore
bodies. Once we have gleaned all the information from the declarations
that we would have got from the headers in C++, we take a second pass
over the code and generate the IL for the bodies.
...
We then do a "declaration" pass where we make notes about the
location of every namespace and type declaration in the program. At
this point we're done looking at the source code for the first phase;
every subsequent pass is over the set of "symbols" deduced from the
declarations.
We then do a pass where we verify that all the types declared have no
cycles in their base types. We need to do this first because in every
subsequent pass we need to be able to walk up type hierarchies without
having to deal with cycles.
http://blogs.msdn.com/b/ericlippert/archive/2010/02/04/how-many-passes.aspx
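The base-type cycle check mentioned in the quote can be sketched roughly as follows. This is a toy Python illustration under the assumption that the declaration pass has already produced a map from each type name to its base type name; the function name `find_base_cycle` is hypothetical and this is not the compiler's actual algorithm.

```python
def find_base_cycle(bases):
    """bases maps a type name to its base type name (or None).

    Returns the type names forming a cycle, or None if every chain terminates.
    Bases that are not keys of the map (external types) end the walk.
    """
    for start in bases:
        seen = []
        current = start
        while current is not None and current in bases:
            if current in seen:
                # We walked back onto an earlier type: that suffix is the cycle.
                return seen[seen.index(current):]
            seen.append(current)
            current = bases[current]
    return None


ok = find_base_cycle({"Dog": "Animal", "Animal": None})
bad = find_base_cycle({"A": "B", "B": "C", "C": "A"})
print(ok, bad)
```

Once this check has passed, later passes can walk up a type hierarchy with a plain loop and know that it terminates.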

Related

How to detect static code dependencies in C# code in the presence of constants?

I use NDepend to analyze static code dependencies. However, it does not recognize dependencies introduced by constants, because constants are inlined by the compiler and so the dependency is not visible in the compiled assembly that NDepend inspects.
I am at a loss here. I cannot replace constants with enums: there is too much code and there are too many non-integer constants (like strings).
Theoretically, using the Roslyn API should help here, but I do not understand what exactly I am supposed to do. Given a source code file, do I need to build the syntax tree of the file and scan every node looking for a constant? I have not seen any nodes dedicated to constants, so it must be more complicated than just filtering the root.DescendantNodes().
Maybe there is an undocumented compiler option that helps somehow here. I could not find anything.
The context of this request is refactoring a big monolithic application and part of this work is identifying compile time dependencies.

Is there a way to find what Types are referenced by a c# assembly?

The Assembly class has a GetReferencedAssemblies method that returns the
referenced assemblies. Is there a way to find what Types are referenced?
The CLR won't be able to tell you at runtime. You would have to do some serious static analysis of the source files, similar to the static analysis done by ReSharper or Visual Studio.
Static analysis is a fairly major undertaking. You basically need a C# parser, a symbol table and plenty of time to work through all the cases that come up in abstract syntax trees.
Why can't the CLR tell you at run time? Code is just-in-time compiled, which means that CLR bytecode is converted into machine code just before execution. Reflection only tells you what is statically known about your types, and the CLR only learns that a type is referenced when the code that uses it actually runs: a type is loaded at execution time, at the point of just-in-time compilation.
Use System.Reflection.Assembly.GetTypes().
Types are not referenced separately from assemblies. If an assembly references another assembly, it automatically references (at least in the technical context) all the types within that assembly, as well. In order to get all the types defined (not referenced) in an assembly, you can use the Assembly.GetTypes method.
It may be possible, but sounds like a rather arduous task, to scan an assembly for which actual types it references (i.e. which types it actually invokes or otherwise mentions). This will probably involve working with IL. Something like this is best to be avoided.
Edit: Actually, when I think about it, this is not possible at all. Whatsoever. On a quite basic level. The thing is, types can be instantiated and referenced willy-nilly. It's not even uncommon for this to happen. Not to mention late binding. All this means trying to analyze an assembly for all the types it references is something like predicting the future.
Edit 2: Comments
While the question, as stated, isn't possible due to all sorts of dynamic references, it is possible to greatly shrink all sorts of binary files using difference encoding. This basically allows you to get a file containing the differences between two binary files, which in the case of executables/libraries tends to be vastly smaller than either of the actual files. Here are some applications that perform this operation. Note that bsdiff doesn't run on Windows, but there is a link to a port there, and you can find many more ports (including to .NET) with the aid of Google.
XDelta
bsdiff
If you look, you'll find many more such applications. One of the best parts is that they are totally self-contained and involve very little work on your part.
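For a feel of what difference encoding does, here is a minimal Python sketch. It is not the algorithm used by XDelta or bsdiff; it uses the standard library's `difflib.SequenceMatcher` as a stand-in, and the helper names `make_delta` and `apply_delta` are made up for illustration. The delta stores "copy this range from the old file" instructions plus only the bytes that are actually new.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus literal bytes for changed parts."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif j2 > j1:
            # 'replace' or 'insert': ship the new bytes literally.
            delta.append(("data", new[j1:j2]))
        # 'delete' needs no entry: those old bytes are simply never copied.
    return delta


def apply_delta(old: bytes, delta):
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old = b"hello world, version 1"
new = b"hello brave world, version 2"
delta = make_delta(old, new)
rebuilt = apply_delta(old, delta)
print(rebuilt == new)
```

Real tools are far more careful about match quality and about encoding the delta compactly, but the copy-plus-literal idea is the same.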

Order of Classes within an Assembly

What determines the order of classes within an Assembly?
And.. is there a way to change it?
Additional info: you can check the ordering either through reflection yourself, or you can use a tool like ILDASM, disable the alphabetical sorting, and then you will also get the order.
Order seems to be in a strange way determined by the compiler.
I already tried some things.. like renaming the classes (order stays the same), also editing the .csproj file to change the order of the .cs files.
My main focus is VS2008, C#, .NET 3.5.
Update: I do have a scenario where the order matters (external program going through my assembly through reflection) - and I need special order there.
Apart from this - you are totally right - order really should not matter.
I'm going to stick my neck out here and say this is an implementation detail and may well be decided by any particular compiler.
Since this is an implementation detail you shouldn't or needn't be concerned. Of course if this really is important (can't see why) you can always write your own IL.
I leave you with the following quote from Eric's blog:
Is compiling the same C# program twice guaranteed to produce the same
binary output?
No.
What determines the order of classes within an Assembly?
The compiler.
And.. is there a way to change it?
Write your own IL directly.
That being said, the order of the types within the assembly really doesn't matter. You can use the types with no regard to their order.

C# language specification "Program Instantiation" appears to be mis-identified

In the C# language specification a Program is defined as
Program the input to the compiler.
While an Application is defined as
Application an assembly that has an entry point
But, they define
Program instantiation — the execution of an application.
Given the definition of "Program", shouldn't this be...
Application instantiation — the execution of an application.
instead?
As far as I can tell, this term doesn't occur in the Microsoft version of the specification. The annotated ECMA spec has this annotation after "Program":
Programs, assemblies, applications and class libraries
This definition of program differs from common usage. In C#, a program is just the input to the compiler. The output of the compiler is an assembly, which is either an application or a class library.
There aren't any other annotations nearby though. It does seem somewhat odd, which is perhaps why it doesn't appear in the MS spec.
No.
They're definitions, so they can be whatever they want. Your mistake is attempting to find a semantic link in the word program, where there is none. They are, as you've noted, unrelated.
What they're saying is "this is how we use this term"; there's basically nothing wrong with choosing any term, as long as the definitions are consistent. foo, bar, and baz would have been just as correct as program instantiation. As long as the names are internally consistent, and the definitions are correct, the names could be anything. They're just labels.
Someone at Microsoft obviously thought that it was more important that the term program instantiation be reflected in its common usage. The term program probably didn't get the same treatment but, again, they're just names. And the names are "atomic": the word program is not at all related to the term program instantiation.
Since they're just labels, the terms can be replaced by anything. One possibility is:
X = the input to the compiler.
Y = an assembly that has an entry point
Z = the execution of Y.
Replacing any of the names with anything else makes no difference in their usage.
If I replace the above definition of Z with a new term XY:
XY = the execution of Y
this still holds. It's just a label; it gets its semantic content from the definition, not from its name. XY has no semantic relationship to X, and its relationship to Y is only incidental.
When you read definitions of things, especially technical specifications, it's important to keep this in mind. There's often no best term for something, as there are often multiple common terms for the same thing, and they're not often defined rigorously enough to be meaningful in a precise specification.
There's an entire branch of philosophy dedicated to issues like this, and causing a "conflict" in the sense that you cite is pretty much unavoidable.
The writers of the C# specification made their choice, and as long as it's internally consistent, it's "correct".
Based on given input, what the compiler outputs is still the program, only in the format of an assembly / application, such that it is:
Program, valid — a C# program constructed according to the syntax rules and diagnosable semantic rules.
I'll take a brave step here and say that we could remove ourselves from such specific context and look at an available definition of usage in English for Program:
(6) A set of coded instructions that enables a machine, especially a
computer, to perform a desired
sequence of operations.
(7) An instruction sequence in programmed instruction.
Both the input to the compiler and its output can be labelled by the above.
But actually, I'm going to have to vote to close as this is a question regarding the semantics of the English language.

How do languages like C# and Java avoid C/C++-like independent compilation?

For my programming languages class, I'm writing a research paper on some papers by some important people in the history of language design. One by CAR Hoare struck me as odd because it speaks against independent compilation techniques used in C and later C++ before C even became popular.
Since this is primarily an optimization to speed up compilation times, what is it about Java and C# that makes them able to avoid reliance on independent compilation? Is it a compiler technique or are there elements of the language that facilitate this? And are there any other compiled languages that used these techniques before them?
Short answer: Java and C# don't avoid separate compilation; they make full use of it.
Where they differ is that they don't require the programmer to write a pair of separate header/implementation files when writing a reusable library. The user writes the definition of a class once, and the compiler extracts the information equivalent to the "header" from that single definition and includes it in the output file as "type metadata". So the output file (a .jar full of .class files in Java, or a .dll assembly in .NET-based languages) is a combination of binaries AND headers in a single package.
Then when another class is compiled and it depends on the first class, it can look at the metadata instead of having to find a separate include file.
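The "binaries plus headers in one package" idea can be sketched with a toy model in Python. Nothing here reflects the real .class or .NET assembly formats; `compile_module`, `check_call` and the dictionary layout are hypothetical names chosen for illustration. The author writes each definition once, the "compiler" derives the header-like metadata from it, and a dependent compile checks call sites against that metadata alone.

```python
def compile_module(name, definitions):
    """definitions maps a function name to (parameter names, callable body).

    Returns one package holding both the derived "header" (metadata) and the code.
    """
    metadata = {fn: list(params) for fn, (params, _) in definitions.items()}
    code = {fn: body for fn, (_, body) in definitions.items()}
    return {"name": name, "metadata": metadata, "code": code}


def check_call(module, fn, args):
    """Check a call site using only the module's metadata, never its code."""
    signature = module["metadata"].get(fn)
    if signature is None:
        return f"{module['name']} has no function {fn}"
    if len(args) != len(signature):
        return f"{fn} expects {len(signature)} argument(s), got {len(args)}"
    return None


# The library author writes each definition exactly once.
math_lib = compile_module("MathLib", {"add": (["a", "b"], lambda a, b: a + b)})

print(check_call(math_lib, "add", [1, 2]))
print(check_call(math_lib, "add", [1]))
print(check_call(math_lib, "sub", [1, 2]))
```

The dependent "compile" never needs a separate header file: everything it checks comes out of the same package that carries the executable code.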
It happens that they target a virtual machine rather than a specific chip architecture, but that's a separate issue; they could put x86 machine code in as the binary and still have the header-like metadata in the same file as well (this is in fact an option in .NET, albeit rarely used).
In C++ compilers it is common to try to speed up compilation by using "pre-compiled headers". The metadata in .NET .dll and .class files is much like a pre-compiled header - already parsed and indexed, ready for rapid look-ups.
The upshot is that in these modern languages, there is one way of doing modularization, and it has the characteristics of a perfectly organised and hand-optimised C++ modular build system - pretty nifty, speaking ASFAC++B.
IMO, one of the biggest factors here is that both Java and .NET use intermediate languages; that means that the compiled unit (jar/assembly) contains, as a prerequisite, a lot of expressive metadata about the types, methods, etc., meaning that it is already laid out conveniently for reference checking. The runtime still checks anyway, in case you are pulling a fast one ;-p
This isn't very far removed from the MIDL that underpins COM, although there the TLB is often a separate entity.
If I've misunderstood your meaning, please let me know...
You could consider a java .class file to be similar to a precompiled header file in C/C++. Essentially the .class file is the intermediate form that a C/C++ linker would need as well as all of the information contained in the header (Java just doesn't have a separate header).
From your comment in another post:
"I'm basically meaning the idea in
C/C++ that each source file is its own
individual compilation unit. This
doesn't as much seem to be the case in
C# or Java."
In Java (I cannot speak for C#, but I assume it is the same) each source file is its own individual compilation unit. I am not sure why you would think it is not... perhaps we have different definitions of compilation unit?
It requires some language support (otherwise, C/C++ compilers would do it too)
In particular, it requires that the compiler generates self-contained modules, which expose metadata that other modules can reference to call into them.
.NET assemblies are a straightforward example. All the files in a project are compiled together, generating one dll. This dll can be queried by .NET to determine which types it contains, so that other assemblies can call functions defined in it.
And to make use of this, it must be legal in the language to reference other modules.
In C++, what defines the boundary of a module? The language specifies that the compiler only considers data in its current compilation unit (.cpp file + included headers). There is no mechanism for specifying "I'd like to call function Foo in module Bar, even though I don't have the prototype or anything for it at compile-time". The only mechanism you have for sharing type information between files is with #includes.
There is a proposal to add a module system to C++, but it won't be in C++0x. Last I saw, the plan was to consider it for a TR1 after 0x is out.
(It's worth mentioning that the #include system in C/C++ was originally used because it'd speed up compilation. Back in the 70's, it allowed the compiler to process the code in a simple linear scan. It didn't have to build syntax trees or other such "advanced" features. Today, the tables have turned and it's become a huge bottleneck, both in terms of usability and compilation speed.)
The object files generated by a C/C++ compiler are meant to be read only by the linker, not by the compiler.
As to other languages: IIRC Turbo Pascal had "units" which you could use without having any source code. I think the point is to create metadata along with compiled code which can then be used by the compiler to figure out the interface to the module (i.e. signatures of functions, class layout etc.)
One problem with C/C++ which prevents just replacing #include with some kind of #import is also the preprocessor, which can completely change the meaning/syntax etc of included/imported modules. This would be very difficult (if not impossible) with a Java-like module system.
