In the C# language specification a Program is defined as
Program — the input to the compiler.
While an Application is defined as
Application — an assembly that has an entry point.
But they define
Program instantiation — the execution of an application.
Given the definition of "Program", shouldn't this be...
Application instantiation — the execution of an application.
instead?
As far as I can tell, this term doesn't occur in the Microsoft version of the specification. The annotated ECMA spec has this annotation after "Program":
Programs, assemblies, applications and class libraries
This definition of program differs from common usage. In C#, a program is just the input to the compiler. The output of the compiler is an assembly, which is either an application or a class library.
There aren't any other annotations nearby though. It does seem somewhat odd, which is perhaps why it doesn't appear in the MS spec.
No.
They're definitions, so they can be whatever they want. Your mistake is attempting to find a semantic link in the word program, where there is none. They are, as you've noted, unrelated.
What they're saying is "this is how we use this term"; there's basically nothing wrong with choosing any term, as long as the definitions are consistent. foo, bar, and baz would have been just as correct as program instantiation. As long as the names are internally consistent, and the definitions are correct, the names could be anything. They're just labels.
Someone at Microsoft obviously thought that it was more important that the term program instantiation reflect its common usage. The term program probably didn't get the same treatment but, again, they're just names. And the names are "atomic": the word program is not at all related to the term program instantiation.
Since they're just labels, the terms can be replaced by anything. One possibility is:
X = the input to the compiler.
Y = an assembly that has an entry point
Z = the execution of Y.
Replacing any of the names with anything else makes no difference in their usage.
If I replace the above definition of Z with a new term XY:
XY = the execution of Y
this still holds. It's just a label; it gets semantic content from the definition, not from its name. XY has no semantic relationship to X, and its relationship to Y is only incidental.
When you read definitions of things, especially technical specifications, it's important to keep this in mind. There's often no best term for something, as there are often multiple common terms for the same thing, and they're not often defined rigorously enough to be meaningful in a precise specification.
There's an entire branch of philosophy dedicated to issues like this, and causing a "conflict" in the sense that you cite is pretty much unavoidable.
The writers of the C# specification made their choice, and as long as it's internally consistent, it's "correct".
Based on the given input, what the compiler outputs is still the program, only in the format of an assembly/application, such that it is:
Program, valid — a C# program constructed according to the syntax rules and diagnosable semantic rules.
I'll take a brave step here and say that we could remove ourselves from such a specific context and look at an available English dictionary definition of "program":
(6) A set of coded instructions that enables a machine, especially a computer, to perform a desired sequence of operations.
(7) An instruction sequence in programmed instruction.
Both the input to the compiler and its output can be labelled by the above.
But actually, I'm going to have to vote to close as this is a question regarding the semantics of the English language.
Related
I am wondering whether there are any examples (Googling, I haven't found any) of TAB auto-complete solutions for command line interfaces (consoles) that use ANTLR4 grammars for predicting the next term (as in a REPL model).
I've written a PL/SQL grammar for an open source database, and now I would like to implement a command line interface to the database that offers the user completion of statements according to the grammar, or possibly discovery of the proper database object name to use (e.g. a table name, a trigger name, the name of a column, etc.).
Thanks for pointing me in the right direction.
Actually it is possible! (Depending, of course, on the complexity of your grammar.) The problem with auto-completion and ANTLR is that you do not have a complete expression, yet you want to parse it. If you had a complete expression, it would be no big problem to know what kind of element belongs at each place and what can be used there. But you do not have a complete expression, and you cannot parse an incomplete one. So what you need to do is wrap the input in some wrapper/helper that completes the expression to create a parseable one. Note that nothing added merely to complete the expression matters to you; you will only ask for members up to the last character that was actually written.
So:
A) Create the wrapper that will change this (Excel formula) '=If(' into '=If()' (sketched below)
B) Parse the wrapped input
C) Realize that you are in the IF function at the first parameter
D) Return all that can go into that place.
It actually works; I have built complete IntelliSense editors for several simple languages. There is much more infrastructure involved than this, but the basic idea is as I wrote it. Just be careful: writing the wrapper is not easy, if not impossible, when the grammar is really complex. In that case, look at the Papa Carlo project. http://lakhin.com/projects/papa-carlo/
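As a rough illustration of step A in C# (the helper name InputWrapper is made up for this sketch), the following closes unbalanced parentheses so a partial formula such as '=If(' becomes the parseable '=If()':

using System.Collections.Generic;
using System.Text;

static class InputWrapper
{
    // Appends a ')' for every '(' that is still open, so a partial
    // expression like "=If(" becomes the parseable "=If()".
    // Real grammars need a smarter completion strategy than this.
    public static string Complete(string partialInput)
    {
        var open = new Stack<char>();
        foreach (char c in partialInput)
        {
            if (c == '(') open.Push(c);
            else if (c == ')' && open.Count > 0) open.Pop();
        }

        var completed = new StringBuilder(partialInput);
        while (open.Count > 0) { completed.Append(')'); open.Pop(); }
        return completed.ToString();
    }
}

Remember that everything appended this way exists only to make the input parseable; the completion candidates are still computed from the characters the user actually typed.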
As already mentioned, auto-completion is based on the follow set at a given position, simply because that is what we defined in the grammar to be the valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one), and this information is independent of the parser. And since a parser is made to parse valid input (while during auto-completion you usually have invalid input), it's not the right tool for this task.
Knowing which token can follow at a given position is useful for controlling the overall process (e.g. you don't want to show suggestions if only a string can appear), but it is usually not what you actually want to suggest (keywords aside). If an ID is possible at the current position, that doesn't tell you which IDs are actually allowed (a variable name? a namespace? etc.). So what you essentially need is 3 things:
1. A symbol table that provides you with all possible names, sorted by scope. Creating this depends heavily on the parsed language, but it is a task where a parser is very helpful. You may want to cache this info, as it is time-consuming to run this analysis step. (A minimal sketch of such a table follows this list.)
2. Determining in which scope you are when invoking auto-completion. You could use a parser here as well (maybe in conjunction with step 1).
3. Determining what type of symbol(s) you want to show. Many people think this is where a parser can give you all the necessary information (the follow set), but as mentioned above that's not true (keywords aside).
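As a rough illustration of step 1, here is a minimal sketch of a scoped symbol table in plain C#; the symbol kinds and the class shape are illustrative assumptions, not part of any ANTLR API:

using System.Collections.Generic;

enum SymbolKind { Table, Column, Trigger, Variable }

// A minimal scoped symbol table: each scope chains to its parent,
// and lookup walks outward so inner names shadow outer ones.
class Scope
{
    private readonly Dictionary<string, SymbolKind> symbols =
        new Dictionary<string, SymbolKind>();

    public Scope Parent { get; }

    public Scope(Scope parent = null) { Parent = parent; }

    public void Declare(string name, SymbolKind kind) { symbols[name] = kind; }

    // Collect every visible symbol of the requested kind, nearest scope first.
    public IEnumerable<string> Visible(SymbolKind kind)
    {
        for (Scope s = this; s != null; s = s.Parent)
            foreach (var entry in s.symbols)
                if (entry.Value == kind) yield return entry.Key;
    }
}

The completion engine would then ask, say, Visible(SymbolKind.Table) once steps 2 and 3 have determined that a table name belongs at the caret.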
In my blog post Universal Code Completion using ANTLR3 I addressed the 3rd step in particular. There I don't use a parser but simulate one; instead of stopping where a parser would, I stop when the caret position is reached (so it is essential that the input be valid syntax up to that point). After reaching the caret, the collection process starts, which not only collects terminal nodes (for keywords) but also looks at the rule names to learn what else needs to be collected. Using specific rule names is my way of putting context into the grammar: when the collection code finds a rule table_ref, it knows that it doesn't need to go further down the rule chain (to the ultimate ID token), but can instead use this information to provide a list of tables as suggestions.
With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it essentially does what I do manually in my implementation (with the ANTLR3 backend).
This is probably pretty hard to do.
Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.
For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.
This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. It's sort of an interesting idea, but I do not understand how his method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.
Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.
What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]")
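To make the FIRST-token idea concrete, here is a hedged sketch in plain C# (not tied to ANTLR or DMS) of the classic fixed-point computation of FIRST sets over a toy grammar; epsilon productions are deliberately ignored to keep it short:

using System;
using System.Collections.Generic;
using System.Linq;

static class FirstSets
{
    // Productions map a nonterminal (Item1) to a sequence of symbols (Item2).
    // Terminals are any symbols that never appear on a left-hand side.
    public static Dictionary<string, HashSet<string>> Compute(
        List<Tuple<string, string[]>> productions)
    {
        var nonterminals = new HashSet<string>(productions.Select(p => p.Item1));
        var first = nonterminals.ToDictionary(n => n, n => new HashSet<string>());

        bool changed = true;
        while (changed)                            // iterate to a fixed point
        {
            changed = false;
            foreach (var p in productions)
            {
                if (p.Item2.Length == 0) continue; // skip epsilon for brevity
                string head = p.Item2[0];
                var toAdd = nonterminals.Contains(head)
                    ? first[head].ToList()         // copy: avoid mutation while iterating
                    : new List<string> { head };
                foreach (string t in toAdd)
                    if (first[p.Item1].Add(t)) changed = true;
            }
        }
        return first;
    }
}

A completion engine built on a table-driven parser can evaluate exactly this kind of set against the parser state at the point where the user's input stops.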
Languages like C and C++ rely on forward declarations to resolve cyclic dependencies in type or function declarations. In C#, this is not required anymore because the declaration capture phase is split into two phases: one capturing symbol names, and a second one actually constructing the symbol declarations.
Is there a standard name for the symbol name capture phase? I would assume that "declaration capture" would be left for the traditional phase that involves resolving all symbols in the declaration.
The C# compiler actually has a declaration phase where it builds the symbol table. The Roslyn C# compiler is not so clear, because not everything is done in large sweeping phases. Instead, each symbol is constructed individually, on demand. However, there is still a step where type and member declarations in syntax are converted into symbols. The binding phase comes after this logically, where references to type and member names are resolved using the declared symbol table.
I think these two phases are called
parsing
binding
Parsing is syntactic. Binding is assigning meaning to identifiers and names.
C++ could do the same. It is just defined not to.
With .NET 5, Microsoft will be introducing Roslyn, the compiler as a service. This overview of Roslyn describes the compiler's three steps prior to code generation as: lexical analysis, syntactic analysis, and semantic analysis. The Roslyn project's overview confirms these descriptions, but with less precise language.
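Since Roslyn exposes these steps as an API, the split between the syntactic and the semantic (binding) phase can be observed directly. A minimal sketch, assuming a project that references the Microsoft.CodeAnalysis.CSharp package:

using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

class PhasesDemo
{
    static void Main()
    {
        // Lexical + syntactic analysis: parsing yields a tree, no meaning yet.
        var tree = CSharpSyntaxTree.ParseText(
            "class C { int F() { return G(); } int G() { return 1; } }");

        // Declaration + binding: the compilation builds the symbol table
        // and can then resolve names in the tree to symbols.
        var compilation = CSharpCompilation.Create("demo")
            .AddReferences(MetadataReference.CreateFromFile(
                typeof(object).Assembly.Location))
            .AddSyntaxTrees(tree);
        var model = compilation.GetSemanticModel(tree);

        var call = tree.GetRoot().DescendantNodes()
            .OfType<InvocationExpressionSyntax>().First();

        // Binding resolves the identifier G to the method symbol C.G().
        Console.WriteLine(model.GetSymbolInfo(call).Symbol);
    }
}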
I found this blog post from Eric Lippert which gives the best explanation of what I was looking for:
The C# language does not require that declarations occur before usages, which has two impacts, again, on the user and on the compiler writer. The impact on the user is that you don’t get to recompile just the IL that changed when you change a file; the whole assembly is recompiled. Fortunately the C# compiler is sufficiently fast that this is seldom a huge issue. (Another way to look at this is that the “granularity” of recompilation in C# is at the project level, not the file level.)

The impact on the compiler writer is that we have to have a “two pass” compiler. In the first pass, we look for declarations and ignore bodies. Once we have gleaned all the information from the declarations that we would have got from the headers in C++, we take a second pass over the code and generate the IL for the bodies.
...
We then do a "declaration" pass where we make notes about the
location of every namespace and type declaration in the program. At
this point we're done looking at the source code for the first phase;
every subsequent pass is over the set of "symbols" deduced from the
declarations.
We then do a pass where we verify that all the types declared have no
cycles in their base types. We need to do this first because in every
subsequent pass we need to be able to walk up type hierarchies without
having to deal with cycles.
http://blogs.msdn.com/b/ericlippert/archive/2010/02/04/how-many-passes.aspx
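For concreteness, here is a small example of the language rule the quoted passage describes: a C# member can be used textually before it is declared, because the declaration pass records every member before any body is compiled:

class Calculator
{
    // Helper() is used here even though its declaration appears later in
    // the file; the compiler's declaration pass has already recorded it.
    public int Compute() { return Helper() * 2; }

    private int Helper() { return 21; }
}

The same pattern with two free functions in C++ would require a forward declaration of the second function.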
I am currently testing out NDepend, and it gives me a warning that assemblies should be marked as CLSCompliant.
Our project is all C#, so it is not really needed.
What I am wondering is: are there any negative effects of marking a dll as clscompliant or should I just disable the warning?
Note: I am not asking what CLSCompliant means; that is covered here: What is the 'CLSCompliant' attribute in .NET?
This is one of those subtle cases... CLS compliance is probably of most importance to library authors, who can't control who the caller is. In your case, you state "our project is all C#", in which case you are right: it isn't needed. It adds restrictions (for example, on unsigned types) which might (or might not) affect representing your data in the most obvious way.
So: if it adds no value to you whatsoever, then frankly: turn that rule off. If you can add it for free (no code changes except the attributes), then maybe OK - but everything is a balance - effort vs result. If there is no gain here, don't invest time.
If you are a library author (merchant or OSS), then you should follow it.
There are several C# features that are not CLS-compliant, for example unsigned types. And since several languages are case-insensitive, there must be no types or members that differ only by case, e.g. MyObject and myObject; several other restrictions apply as well. Thus, if you don't plan to work with other .NET languages, there is no reason to mark your code as CLSCompliant.
The only negative effects would be compiler warnings when your code is marked as CLSCompliant but actually is not.
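For illustration, marking an assembly compliant and then exposing a non-CLS-compliant public signature produces a compiler warning such as CS3001:

using System;

[assembly: CLSCompliant(true)]

public class Api
{
    // warning CS3001: Argument type 'uint' is not CLS-compliant
    public void Process(uint value) { }

    // A CLS-compliant alternative uses a signed type in the public surface.
    public void Process(long value) { }
}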
Why doesn't any programming language automatically load the default libraries like stdio.h, iostream, or using System, so that declaring them can be avoided?
As these namespaces/libraries are required in almost any program, why do compilers expect them to be declared by the user?
Do any programs exist that use no namespaces/headers? Even if so, what is wrong with loading harmless default libraries?
I don't mean that I am too lazy to write a line of code; it just makes little sense (to me) for a compiler to demand the declaration of these so-called defaults and end up with a compilation error.
It's because there are programs which are written without the standard libraries. For example, there are plenty of C programs running on embedded systems that don't provide stdio.h, since it doesn't make any sense on those platforms (in C, such environments are referred to as "freestanding", as opposed to the more usual "hosted").
The “default” libraries are not “required in any program”, and indeed there are many cases where they are not even available (operating system kernel/drivers, microcontrollers, etc). And more in the mainstream, many high-level graphical programs use system-specific GUI/graphics libraries instead of standard I/O.
For stdio.h/iostream: the quick answer is that in the biggest part of your software they are not needed (certainly not both). Headless devices/servers should have a logging module instead, and GUIs don't always have a console to interface with.
Many languages (especially scripting languages, and languages that carry a standard runtime as part of the language spec) do do this.
The trade-off is convenience versus software-engineering goodness. The problem with opening namespaces by default is you end up with a lot of names being available immediately at the top level, which can cause name clashes and confusion, pollute Intellisense/autocompletion lists, etc.
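A concrete C# example of the clash problem: System.Threading and System.Timers both define a Timer, so importing both by default would make the bare name ambiguous:

using System.Threading;
using System.Timers;

class Clock
{
    // error CS0104: 'Timer' is an ambiguous reference between
    // 'System.Threading.Timer' and 'System.Timers.Timer'
    // Timer timer;

    // Disambiguation requires the fully qualified name.
    System.Timers.Timer timer = new System.Timers.Timer();
}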
To follow up on caf's answer.
You need to tell the compiler about these headers/libraries so that you do not have to include anything you do not want, because they are not needed in every program. Any programmer is able to write a library in C or C++ that does not depend on any runtime libraries. This makes it possible to write software that is as lean as possible, and to save memory/disk space/compile time/link time (pick what you need most). In low-level languages you should only pay for what you need, nothing more.
There is also the name collision problem. As language standards develop they provide more and more features, giving them more and more names, and the probability rises that a system name collides with a name defined in a user program. To avoid this, new features are defined in modules that are not included unless the program uses them. Since system libraries use many common words as their symbols (like open or restricted), the problem is serious.
But explicit module inclusion is not the only method of avoiding collisions. Others include: using a "standard" namespace for system names (e.g. C++'s namespace std), reserving names and name patterns (e.g. C's double underscores), and allowing redefinition (e.g. Forth).
Because libraries are components external to the language. If one day a library (or part of it) changes its headers or namespaces, the language elements don't change with it. The compiler checks only the syntax and rules of the programming language.
What is the simplest way to identify if a given method is reading or writing a member variable or property?
I am writing a tool to assist in an RPC system, in which access to remote objects is expensive. Being able to detect that a given object is not used in a method could allow us to avoid serializing its state. Doing it on source code is perfectly reasonable (but being able to do it on compiled code would be amazing).
I think I can either write my own simple parser or try to use one of the existing C# parsers and work with the AST. I am not sure if it is possible to do this with assemblies using reflection. Are there any other ways? What would be the simplest?
EDIT: Thanks for all the quick replies. Let me give some more information to make the question clearer. I definitely prefer correct, but it shouldn't be extremely complex either. What I mean is that we can't go too far checking for extreme or impossible cases (such as the passed-in delegates that were mentioned, which is a great point). It would be enough to detect those cases and assume everything could be used, not optimizing there. I would assume that those cases would be relatively uncommon.
The idea is for this tool to be handed to developers outside of our team, who should not have to be concerned with this optimization. The tool takes their code and generates proxies for our own RPC protocol. (We are using protobuf-net for serialization only, with neither WCF nor .NET Remoting.) For this reason, anything we use has to be free, or we wouldn't be able to deploy the tool due to licensing issues.
You can have simple or you can have correct - which do you prefer?
The simplest way would be to parse the class and the method body. Then identify the set of tokens which are properties and field names of the class. The subset of those tokens which appears in the method body are the properties and field names you care about.
This trivial analysis of course is not correct. If you had
class C
{
    int Length;
    void M() { int x = "".Length; }
}
Then you would incorrectly conclude that M references C.Length. That's a false positive.
The correct way to do it is to write a full C# compiler, and use the output of its semantic analyzer to answer your question. That's how the IDE implements features like "go to definition".
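For what it's worth, this is exactly the question a full compiler's semantic analyzer answers, and today Roslyn exposes one as an API. A minimal sketch (assuming the Microsoft.CodeAnalysis.CSharp package) that binds the identifiers in M's body and keeps only fields of the containing class, correctly rejecting the "".Length false positive:

using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

class FieldUseFinder
{
    static void Main()
    {
        var tree = CSharpSyntaxTree.ParseText(
            "class C { int Length; void M() { int x = \"\".Length; } }");
        var compilation = CSharpCompilation.Create("demo")
            .AddReferences(MetadataReference.CreateFromFile(
                typeof(object).Assembly.Location))
            .AddSyntaxTrees(tree);
        var model = compilation.GetSemanticModel(tree);

        var method = tree.GetRoot().DescendantNodes()
            .OfType<MethodDeclarationSyntax>().First();
        var containingType = model.GetDeclaredSymbol(method).ContainingType;

        // Bind every identifier in the body; keep only fields of this class.
        var fieldsUsed = method.Body.DescendantNodes()
            .OfType<IdentifierNameSyntax>()
            .Select(id => model.GetSymbolInfo(id).Symbol)
            .OfType<IFieldSymbol>()
            .Where(f => f.ContainingType.Equals(containingType));

        // Prints nothing: "".Length binds to string.Length, not C.Length.
        foreach (var f in fieldsUsed)
            Console.WriteLine(f.Name);
    }
}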
Before attempting to write this kind of logic yourself, I would check to see if you can leverage NDepend to meet your needs.
NDepend is a code dependency analysis tool ... and much more. It implements a sophisticated analyzer for examining relationships between code constructs and should be able to answer that question. It also operates on both source and IL, if I'm not mistaken.
NDepend exposes CQL - Code Query Language - which allows you to write SQL-like queries against the relationships between structures in your code. NDepend has some support for scripting and is capable of being integrated with your build process.
To complement LBushkin's answer on NDepend (disclaimer: I am one of the developers of this tool), NDepend can indeed help you with that. The Code LINQ Query (CQLinq) below matches methods that...
shouldn't provoke any RPC calls, but
are reading/writing any fields of any RPC types,
or are reading/writing any properties of any RPC types.
Notice how we first define the four sets (typesRPC, fieldsRPC, propertiesRPC, methodsThatShouldntUseRPC) and then match methods that violate the rule. Of course this CQLinq rule needs to be adapted to match your own typesRPC and methodsThatShouldntUseRPC:
warnif count > 0

// First define the types whose calls are RPC
let typesRPC = Types.WithNameIn("MyRpcClass1", "MyRpcClass2")

// Define instance fields of RPC types
let fieldsRPC = typesRPC.ChildFields()
                        .Where(f => !f.IsStatic).ToHashSet()

// Define instance property getters and setters of RPC types
let propertiesRPC = typesRPC.ChildMethods()
                            .Where(m => !m.IsStatic && (m.IsPropertyGetter || m.IsPropertySetter))
                            .ToHashSet()

// Define methods that shouldn't provoke RPC calls
let methodsThatShouldntUseRPC =
    Application.Methods.Where(m => m.NameLike("XYZ"))

// Match methods that shouldn't do any RPC call
// but that are using any RPC fields (reading or writing) or properties
from m in methodsThatShouldntUseRPC.UsingAny(fieldsRPC).Union(
          methodsThatShouldntUseRPC.UsingAny(propertiesRPC))
let fieldsRPCUsed = m.FieldsUsed.Intersect(fieldsRPC)
let propertiesRPCUsed = m.MethodsCalled.Intersect(propertiesRPC)
select new { m, fieldsRPCUsed, propertiesRPCUsed }
My intuition is that detecting which member variables will be accessed is the wrong approach. My first guess at a way to do this would be to just request serialized objects on an as-needed basis (preferably at the beginning of whatever function needs them, not piecemeal). Note that TCP/IP (i.e. Nagle's algorithm) should coalesce these requests if they are made in rapid succession and are small.
Eric has it right: to do this well, you need what amounts to a compiler front end. What he didn't emphasize enough is the need for strong flow-analysis capabilities (or a willingness to accept very conservative answers, possibly alleviated by user annotations). Maybe he meant that by the phrase "semantic analysis", although his example of "go to definition" needs only a symbol table, not flow analysis.
A plain C# parser could only be used to get very conservative answers (e.g., if method A in class C contains identifier X, assume it reads class member X; if A contains no calls then you know it can't read member X).
The first step beyond this is having a compiler's symbol table and type information (if method A refers to class member X directly, then assume it reads member X; if A contains no calls and mentions identifier X only in the context of accesses to objects which are not of this class type, then you know it can't read member X). You have to worry about qualified references, too; Q.X may read member X if Q is compatible with C.
The sticky point is calls, which can hide arbitrary actions. An analysis based on just parsing and symbol tables could determine that if there are calls, the arguments refer only to constants or to objects which are not of the class which A might represent (possibly inherited).
If you find an argument that has a C-compatible class type, now you have to determine whether that argument can be bound to this, requiring control and data flow analysis:
method A()
{
    Object q = this;
    ...
    ... q = that; ...
    ...
    foo(q);
}
foo might hide an access to X. So you need two things: flow analysis to determine whether the initial assignment to q can reach the call to foo (it might not; q=that may dominate all calls to foo), and call-graph analysis to determine which methods foo might actually invoke, so that you can analyze those for accesses to member X.
You can decide how far you want to go with this by simply making the conservative assumption "A reads X" any time you don't have enough information to prove otherwise. This will give you a "safe" answer (if not "correct", or what I'd prefer to call "precise").
Of frameworks that might be helpful, you might consider Mono, which surely parses and builds symbol tables. I don't know what support it provides for flow analysis or call graph extraction; I would not expect the Mono-to-IL front-end compiler to do a lot of that, as people usually hide that machinery in the JIT part of JIT-based systems. A downside is that Mono may be behind the "modern C#" curve; last time I heard, it handled only C# 2.0 but my information may be stale.
An alternative is our DMS Software Reengineering Toolkit and its C# Front End.
(Not an open source product).
DMS provides general source code parsing, tree building/inspection/analysis, general symbol table support, and built-in machinery for implementing control-flow analysis, data-flow analysis, points-to analysis (needed for "What does object O point to?"), and call graph construction. This machinery has all been tested by fire with DMS's Java and C front ends, and the symbol table support has been used to implement full C++ name and type resolution, so it's pretty effective. (You don't want to underestimate the work it takes to build all that machinery; we've been working on DMS since 1995.)
The C# Front End provides for full C# 4.0 parsing and full tree building. It presently does not build symbol tables for C# (we're working on this) and that's a shortcoming compared to Mono. With such a symbol table, however, you would have access to all that flow analysis machinery (which has been tested with DMS's Java and C front ends) and that might be a big step up from Mono if it doesn't provide that.
If you want to do this well, you have a considerable amount of work in front of you. If you want to stick with "simple", you'll have to make do with just parsing the tree and accept being very conservative.
You didn't say much about knowing if a method wrote to a member. If you are going to minimize traffic the way you describe, you want to distinguish "read", "write" and "update" cases and optimize messages in both directions. The analysis is obviously pretty similar for the various cases.
Finally, you might consider processing MSIL directly to get the information you need; you'll still have the flow-analysis and conservative-analysis issues. You might find the technical paper on the Java Orchestra System interesting; it describes a fully distributed Java object system that has to do the same basic analysis you want to do, and does so, IIRC, by analyzing class files and doing massive bytecode rewriting.
By RPC do you mean .NET Remoting? Or DCOM? Or WCF?
All of these offer the opportunity to monitor cross process communication and serialization via sinks and other constructs, but they are all platform specific, so you'll need to specify the platform...
You could listen for the event that a property is being read/written to with an interface similar to INotifyPropertyChanged (although you obviously won't know which method effected the read/write.)
I think the best you can do is explicitly maintain a dirty flag.