I am researching the .NET Common Language Infrastructure, and before I get into the nitty-gritty of the compiler I'm going to write, I want to be sure that certain features are available. In order to do that, I must understand how they work.
One feature I'm unsure of is the .NET Primary Interop Assembly embedding. I'm not quite sure how .NET goes about embedding only the types you use versus the types that are exposed by the types you use. From the bit of research I've done into this, I've noticed that it emits a bare-bones interface that utilizes vtable gap methods, where the method name format is VtblGap{0}_{1} where {0} is the index of the gap and {1} is the member size of the gap. These methods are marked rtspecialname and specialname. Whether this is accurate or not, is the question.
Assuming the above is true, how would I go about obtaining the necessary information to embed similar metadata into the resulting application?
From what I can tell, you can sort the MemberInfo objects by their metadata tokens to recover the declaration order, and the dispid information comes from the attributes in the interop assembly. The area I'm most confused about is the interfaces that are imported yet seem to have no direct correlation with the other embedded types: sequentially indexed interfaces that seem to be there for versioning reasons. Is their inclusion based on their indexing, or is some other logic used? An example is Microsoft.Office.Interop.Word: when you add a document to an Application and do something with it, it imports the document, its events, and so on.
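To make the token-ordering and gap idea concrete, here is a minimal sketch. It assumes that metadata token order matches declaration (slot) order, that gap sequence numbers start at 1, and that the gap name carries a leading underscore (`_VtblGap<index>_<count>`), which is how I understand the convention; check all three against ECMA-335 and real compiler output. `ISampleInterop` and `VtableGapSketch` are hypothetical names, not part of any interop assembly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for an interface imported from an interop assembly.
public interface ISampleInterop
{
    void First();
    void Second();
    void Third();
    void Fourth();
}

public static class VtableGapSketch
{
    // Approximates slot order by sorting the declared methods on metadata token.
    public static MethodInfo[] InSlotOrder(Type iface)
    {
        return iface
            .GetMethods(BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly)
            .OrderBy(m => m.MetadataToken)
            .ToArray();
    }

    // Lists the member names a trimmed copy of the interface would declare:
    // used methods keep their names, each run of unused slots becomes one gap.
    public static List<string> TrimmedLayout(Type iface, ISet<string> usedMethods)
    {
        var layout = new List<string>();
        int gapIndex = 1; // assumption: sequence numbers start at 1
        int run = 0;      // number of unused slots seen since the last used one
        foreach (MethodInfo m in InSlotOrder(iface))
        {
            if (!usedMethods.Contains(m.Name)) { run++; continue; }
            if (run > 0)
            {
                layout.Add("_VtblGap" + gapIndex++ + "_" + run);
                run = 0;
            }
            layout.Add(m.Name);
        }
        return layout; // trailing unused slots are simply dropped in this sketch
    }
}
```

With only `Third` marked as used, `TrimmedLayout(typeof(ISampleInterop), ...)` yields `_VtblGap1_2` followed by `Third`. Property accessors show up as `get_`/`set_` methods and would need the same treatment.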
Here's hoping someone in-the-know can clue me in on what else might be involved in embedding these types.
Related
I know the basics of metadata in C#/.NET, and I recently heard about .NET obfuscation.
I want to know: if I use an obfuscator to keep my assemblies from being understood, it will obfuscate the IL, but will it also change the metadata? And can I then add the assembly as a reference to my project and still see the real names of its classes and members?
These days most obfuscators can basically rewrite your assembly for you. The majority of the features include:
Renaming (tool vendors often will provide an option to create a map so you can manually map a renamed member to the original member name with a tool like Reflector)
String encryption - this encrypts string constants in the code (stored in the string heap area of the metadata) so if you open the file in Reflector they will usually show up encrypted. The encrypted values still get decrypted right before they are used.
IL obfuscation - control flow rewriting of the IL to produce spaghetti code that is difficult to follow
There are also other tools that go way beyond this but they all just raise the bar of what it takes to reverse something.
If you set a reference to an obfuscated dll/exe you'll see the obfuscated/renamed members, but if the vendor provides a map (most will) you can figure out which is which. You can also typically use interfaces that are not obfuscated if you need a readable api to use. An example would be Reflector - the addin apis are all interfaces that are not obfuscated but all implementations of the concrete classes are obfuscated.
Try using Confuser, as there's still no Deobfuscator for this one.
http://confuser.codeplex.com/
You won't see the original names of classes and methods, since it hashes them and applies many other protections besides. It becomes very hard to get anything useful out of the code afterwards.
The Assembly class has a GetReferencedAssemblies method that returns the referenced assemblies. Is there a way to find what Types are referenced?
The CLR won't be able to tell you at runtime. You would have to do some serious static analysis of the source files - similar to the static analysis done by ReSharper or Visual Studio.
Static analysis is a fairly major undertaking. You basically need a C# parser, a symbol table and plenty of time to work through all the cases that come up in abstract syntax trees.
Why can't the CLR tell you at run time? It is just-in-time compiled, which means that CLR bytecode is converted into machine code just before execution. Reflection only tells you what is statically known about your types, and the CLR only learns that a type is referenced when the code that uses it runs: a type is loaded at execution time, at the point of just-in-time compilation.
Use System.Reflection.Assembly.GetTypes().
Types are not referenced separately from assemblies. If an assembly references another assembly, it automatically references (at least in the technical context) all the types within that assembly, as well. In order to get all the types defined (not referenced) in an assembly, you can use the Assembly.GetTypes method.
It may be possible, though it sounds like a rather arduous task, to scan an assembly for the types it actually references (i.e. which types it actually invokes or otherwise mentions). This will probably involve working with IL. Something like this is best avoided.
Edit: Actually, when I think about it, this is not possible at all. Whatsoever. On a quite basic level. The thing is, types can be instantiated and referenced willy-nilly. It's not even uncommon for this to happen. Not to mention late binding. All this means trying to analyze an assembly for all the types it references is something like predicting the future.
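Dynamic references aside, the type references that a compiler records statically can be listed from the assembly's metadata. The sketch below is one way to do that, assuming the System.Reflection.Metadata library is available (it ships with modern .NET and is a separate package on .NET Framework). `TypeRefSketch` is a hypothetical name, and the result will not include types reached only through reflection or late binding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Reflection.Metadata;
using System.Reflection.PortableExecutable;

public static class TypeRefSketch
{
    // Returns the full names of the types listed in the TypeRef table of the
    // assembly at the given path, i.e. types defined elsewhere that it mentions.
    public static List<string> ReferencedTypeNames(string assemblyPath)
    {
        var names = new List<string>();
        using (FileStream stream = File.OpenRead(assemblyPath))
        using (var pe = new PEReader(stream))
        {
            MetadataReader md = pe.GetMetadataReader();
            foreach (TypeReferenceHandle handle in md.TypeReferences)
            {
                TypeReference tr = md.GetTypeReference(handle);
                string ns = md.GetString(tr.Namespace);
                string name = md.GetString(tr.Name);
                names.Add(ns.Length == 0 ? name : ns + "." + name);
            }
        }
        return names;
    }
}
```

Calling `TypeRefSketch.ReferencedTypeNames(typeof(SomeType).Assembly.Location)` gives the statically recorded references for the assembly containing `SomeType`; nested types appear under their simple name because their enclosing type is stored separately in the metadata.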
Edit 2: Comments
While what the question asks, as stated, isn't possible due to all sorts of dynamic references, it is possible to greatly shrink all sorts of binary files using difference encoding. This basically allows you to get a file containing the differences between two binary files, which in the case of executables/libraries tends to be vastly smaller than either of the actual files. Here are some applications that perform this operation. Note that bsdiff doesn't run on Windows, but there is a link to a port there, and you can find many more ports (including to .NET) with the aid of Google.
XDelta
bsdiff
If you look, you'll find many more such applications. One of the best parts is that they are totally self-contained and involve very little work on your part.
F# 3.0 has added type providers.
I wonder if it is possible to add this language feature to other languages running on the CLR like C# or if this feature only works well in a more functional/less OO programming style?
As Tomas says, it is theoretically straightforward to add this kind of feature to any statically-typed language (though still a lot of grunt-work).
I am not a meta-programming expert, but @SK-logic asks why not a general compile-time meta-programming system instead, and I shall try to answer. I don't think you can easily achieve what you can do with F# type providers using meta-programming, because F# type providers can be lazy and dynamically interactive at design-time. Let's take an example that Don has demoed in one of his earlier videos: a Freebase type provider. Freebase is kind of like a schematized, programmable Wikipedia; it has data on everything. So you can end up writing code along the lines of
for e in Freebase.Science.``Chemical Elements`` do
    printfn "%d: %s - %s" e.``Atomic number`` e.Name e.Discoverer.Name
or whatnot (I don't have the exact code offhand), but just as easily write code that gets information about baseball statistics, or when famous actors have been in drug rehab facilities, or a zillion other types of information available through Freebase.
From an implementation point of view, it is infeasible to generate a schema for all of Freebase and bring it into .NET a priori; you can't just do one compile-time step at the beginning to set all this up. You can do this for small data sources, and in fact many other type providers use this strategy, e.g. a SQL type provider gets pointed at a database and generates .NET types for all the types in that database. But this strategy does not work for large cloud data stores like Freebase, because there are too many interrelated types: if you tried to generate .NET metadata for all of Freebase, you'd find that there are so many millions of types (one of which is ChemicalElement with AtomicNumber and Discoverer and Name and many other fields, but there are literally millions of such types) that you need more memory than is available to a 32-bit .NET process just to represent the entire type schema.
So the F# type-provider strategy is an API architecture that allows type providers to supply information on demand, running at design-time within the IDE. Until you type e.g. Freebase.Science., the type provider does not need to know about the entities under the science categories, but once you do press . after Science, the type provider can go and query the APIs to learn one more level of the overall schema, to know what categories exist under Science, one of which is ChemicalElements. And then as you try to "dot into" one of those, it will discover that elements have atomic numbers and what-not. So the type provider lazily fetches just enough of the overall schema to deal with the exact code the user happens to be typing into the editor at that moment in time. As a result, the user still has the freedom to explore any part of the universe of information, but any one source code file or interactive session will only explore a tiny fraction of what is available. When it comes time to compile/codegen, the compiler need only generate enough code to accommodate exactly the bits that the user has actually used in their code, rather than the potentially huge runtime bits to afford the possibility of talking to the whole data store.
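The on-demand idea can be illustrated outside F# with a small sketch. This is not the type provider API; it is only a C# model of the laziness described above, in which one level of a schema is fetched the first time you "dot into" a node. `LazySchemaNode` and the `fetchChildren` callback are hypothetical names, and a real provider would query the remote service where the callback is invoked.

```csharp
using System;
using System.Collections.Generic;

public sealed class LazySchemaNode
{
    private readonly Func<string, IReadOnlyList<string>> fetchChildren;
    private Dictionary<string, LazySchemaNode> children; // stays null until first expanded

    public string Path { get; }

    public LazySchemaNode(string path, Func<string, IReadOnlyList<string>> fetchChildren)
    {
        Path = path;
        this.fetchChildren = fetchChildren;
    }

    // Models pressing "." in the editor: the first call fetches exactly one
    // level of the schema for this node, later calls reuse the cached level.
    public LazySchemaNode Dot(string member)
    {
        if (children == null)
        {
            children = new Dictionary<string, LazySchemaNode>();
            foreach (string name in fetchChildren(Path))
                children[name] = new LazySchemaNode(Path + "." + name, fetchChildren);
        }
        return children[member]; // throws KeyNotFoundException for unknown members
    }
}
```

Walking `root.Dot("Science").Dot("Chemical Elements")` triggers two fetches, one per level visited, no matter how large the rest of the schema is.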
(Maybe you can do that with some of today's meta-programming facilities now, I don't know, but the ones I learned about in school a long while back could not have easily handled this.)
As Brian and Tomas point out, there's nothing particularly "functional" about this feature. It's just a particularly slick way to provide metadata to the compiler.
The C# design team has been kicking around ideas like this for a long time. There was a proposal a few years before I joined the C# team for a feature that was going to be called "type blueprints" (or something like that) whereby a combination of XML documents, XML schema and custom code that proffered up type metadata could be used by the C# compiler. I don't recall the details and it never came to fruition, obviously. (Though it did influence the design and the implementation of the Visual Studio Tools for Office document format, which I was working on at the time.)
In any event, we have no plans on the immediate horizon for adding such a feature to C#, but we are watching with great interest to see if it does a good job of solving customer problems in F#.
(As always, Eric's musings about possible future features of unannounced and entirely hypothetical products are for entertainment purposes only.)
I don't see any technical reason why something like type providers couldn't be added to C# or similar languages. The only family of languages that makes it difficult to add type providers (in a similar way as in F#) is dynamically typed languages.
F# type providers rely on the fact that the type information generated by the provider propagates nicely through the program, so the editor can use it to show useful IntelliSense. In dynamically typed languages, this would require more elaborate IDE support (and "type providers" for dynamic languages reduce to just IDE or IntelliSense).
Why are they implemented directly as a feature of F#? I think the meta-programming system would have to be really complex (note that the types are not actually generated) to support this. The other things that could be done using it wouldn't contribute that much to the F# language (they would only make it too complex, which is a bad thing). However, you could get a similar thing if you had some sort of compiler extensibility.
In fact, I think this is how the C# team will add something like type providers in the future (they talked about compiler extensibility for some time now).
This question appears to have died, so I've decided to offer a bounty.
What I'm most interested in knowing is whether my scenario in the ETA1 below is viable and is used. If it isn't, then a good explanation of why not would be a good answer. Another good answer would be an alternative (but not including the InternalsVisibleTo attribute).
The best answer would be, yes, it's viable, everyone does it and customers love it!
ETA2: I think I've thought of a good solution. I provide the customer with a distributable edition that is as functional as their edition but is unlicensed and has the classes and members hidden, using attributes.
I can do this with compiler directives, on every single important member, but I wondered if there was some global way to hide all members of a class?
A simplified scenario:-
I have a class that extends a control in some way and I want to sell my class under two licenses:
(1) Standard - The customer gets x number of controls that use my class but can't instantiate the class (it's internal).
(2) Developer - The same as Standard except they can create their own controls that use my class.
My problem is that when the developer customer comes to sell their controls, they can't help but expose my class to all their customers.
--- Ignore this
The only way around it, in my scrambled mind, would be for the developer to somehow integrate my assembly into theirs, and in that way I can keep the constructor internal. Or, use the internals visible to attribute. / Ignore this ---
I'm sure someone here has had the same situation and any help would be greatly appreciated.
ETA1: I'm thinking aloud here, but I could have a list of permissible calling assembly names which the customer could add to. When they ship their product, their customers' assemblies would not be in the list and therefore they wouldn't be able to instantiate certain classes. (The list could obviously be hashed.)
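A minimal sketch of that allow-list idea follows, assuming the check keys on the simple name of the calling assembly and stores only SHA-256 hashes of permitted names. `CallerAllowList` and its members are hypothetical. Two caveats apply: simple assembly names are easy to spoof, so a real check would more likely compare strong-name public keys, and `Assembly.GetCallingAssembly` is sensitive to JIT inlining, hence the `NoInlining` attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Security.Cryptography;
using System.Text;

public static class CallerAllowList
{
    // Only hashes are kept, so the shipped list does not reveal the names.
    private static readonly HashSet<string> AllowedHashes = new HashSet<string>();

    public static string Hash(string assemblyName)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(assemblyName));
            return BitConverter.ToString(digest);
        }
    }

    // Stands in for the customer adding one of their own assemblies to the list.
    public static void Allow(string assemblyName)
    {
        AllowedHashes.Add(Hash(assemblyName));
    }

    public static bool IsAllowed(string assemblyName)
    {
        return AllowedHashes.Contains(Hash(assemblyName));
    }

    // Call this first thing in a protected constructor or factory method.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void DemandAllowedCaller()
    {
        string caller = Assembly.GetCallingAssembly().GetName().Name;
        if (!IsAllowed(caller))
            throw new UnauthorizedAccessException(
                "Assembly '" + caller + "' is not licensed to create this type.");
    }
}
```

The developer customer would register their control assembly once; an end user's assembly that tries to construct the protected class directly would fail the `DemandAllowedCaller` check.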
I believe you will store the licensing information (i.e., Standard and Developer) somewhere in the registry. In that case, I suppose the simpler solution would be to implement licensing through LicenseManager. This is what most .NET component vendors use.
http://msdn.microsoft.com/en-us/library/fe8b1eh9.aspx
Hope this helps !
I believe you've come up with the only real solution, assuming the runtime will support it. As long as yours is a separate DLL, if the developers can instantiate your objects then so can anyone else, whether you try to hide it behind a constructor, a factory, whatever.
I wonder, though, whether consumers might not even be able to get around that restriction by integrating the shipped assembly into their own?
Why don't you use license keys? Your class reads the license key and depending on what permissions the license offers it disables methods at runtime?
The license key could be defined in the config file.
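As a rough illustration of gating methods on a license key at runtime, here is a sketch. The key format (a "STD-" or "DEV-" prefix) and the names `LicenseLevel` and `LicensedFeature` are made up for the example; a real key would be verified cryptographically, and the key string would be read from wherever the application keeps its configuration.

```csharp
using System;

public enum LicenseLevel { None, Standard, Developer }

public sealed class LicensedFeature
{
    private readonly LicenseLevel level;

    // Hypothetical key format: the prefix alone decides the level here.
    public LicensedFeature(string licenseKey)
    {
        if (licenseKey != null && licenseKey.StartsWith("DEV-", StringComparison.Ordinal))
            level = LicenseLevel.Developer;
        else if (licenseKey != null && licenseKey.StartsWith("STD-", StringComparison.Ordinal))
            level = LicenseLevel.Standard;
        else
            level = LicenseLevel.None;
    }

    // Available to Standard and Developer licenses.
    public string Render()
    {
        Require(LicenseLevel.Standard);
        return "rendered";
    }

    // Available to Developer licenses only.
    public string Extend()
    {
        Require(LicenseLevel.Developer);
        return "extended";
    }

    private void Require(LicenseLevel needed)
    {
        if (level < needed)
            throw new InvalidOperationException("This operation requires a " + needed + " license.");
    }
}
```

Note that this only disables behavior; it does not hide the class, so anyone who obtains a developer key can still call everything.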
It's a tough one, just due to the nature of .NET. It's a shot in the dark, but you could look into products such as CodeVeil which provides assembly encryption at the IL level. Your assembly would essentially be shipped encrypted and the key would be handed to your customer. The customer would then be the only entity with the ability to decrypt your assembly instructions. Now, CodeVeil claims the following about its decryption keys:
Even though the key is stored in the application that does not make it insecure. In fact the key itself is not as important as the transformation of the data itself. CodeVeil also uses many runtime-protection operations to frustrate hackers attempting to capture the decrypted assembly. In addition CodeVeil uses a very special decryption system that decrypts only enough information for the .NET runtime to execute that specific method. The code is never stored in the same memory as the assembly itself so the decrypted code cannot be dumped to disk for analysis.
This is obviously a good thing, but this is the part you'd have to research, because I am not familiar with the other techniques they use as part of their decryption algorithm. The cool thing about this is that, if it works, your customers will be happy and THEY can make their customers happy by exposing parts of your assembly through their own API. At the same time your code stays protected from tools such as ILDASM and Reflector.
What is the simplest way to identify if a given method is reading or writing a member variable or property?
I am writing a tool to assist in an RPC system, in which access to remote objects is expensive. Being able to detect if a given object is not used in a method could allow us to avoid serializing its state. Doing it on source code is perfectly reasonable (but being able to do it on compiled code would be amazing).
I think I can either write my own simple parser or try to use one of the existing C# parsers and work with the AST. I am not sure if it is possible to do this with assemblies using Reflection. Are there any other ways? What would be the simplest?
EDIT: Thanks for all the quick replies. Let me give some more information to make the question clearer. I definitely prefer correct, but it definitely shouldn't be extremely complex. What I mean is that we can't go too far checking for extremes or impossibles (as the passed-in delegates that were mentioned, which is a great point). It would be enough to detect those cases and assume everything could be used and not optimize there. I would assume that those cases would be relatively uncommon.
The idea is for this tool to be handed to developers outside of our team, who should not have to be concerned with this optimization. The tool takes their code and generates proxies for our own RPC protocol. (We are using protobuf-net for serialization only, and neither WCF nor .NET Remoting.) For this reason, anything we use has to be free, or licensing issues would prevent us from deploying the tool.
You can have simple or you can have correct - which do you prefer?
The simplest way would be to parse the class and the method body. Then identify the set of tokens which are properties and field names of the class. The subset of those tokens which appears in the method body are the properties and field names you care about.
This trivial analysis of course is not correct. If you had
class C
{
    int Length;
    void M() { int x = "".Length; }
}
Then you would incorrectly conclude that M references C.Length. That's a false positive.
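The trivial analysis is easy to sketch, which also makes the false positive easy to reproduce. The code below is a deliberately naive illustration, not a recommended approach: it treats the method body as text, pulls out identifier-like tokens with a regular expression, and intersects them with the member names. `NaiveMemberScan` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveMemberScan
{
    // Returns every member name that appears as an identifier-like token in
    // the method body text. No scoping or type information is used at all.
    public static HashSet<string> MembersMentioned(string methodBody, IEnumerable<string> memberNames)
    {
        var members = new HashSet<string>(memberNames);
        var hits = new HashSet<string>();
        foreach (Match token in Regex.Matches(methodBody, @"\b[A-Za-z_]\w*"))
        {
            if (members.Contains(token.Value))
                hits.Add(token.Value);
        }
        return hits;
    }
}
```

Given the body of M from the example and the member list { "Length" }, this reports Length, even though the token refers to String.Length rather than C.Length.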
The correct way to do it is to write a full C# compiler, and use the output of its semantic analyzer to answer your question. That's how the IDE implements features like "go to definition".
Before attempting to write this kind of logic yourself, I would check to see if you can leverage NDepend to meet your needs.
NDepend is a code dependency analysis tool ... and much more. It implements a sophisticated analyzer for examining relationships between code constructs and should be able to answer that question. It also operates on both source and IL, if I'm not mistaken.
NDepend exposes CQL - Code Query Language - which allows you to write SQL-like queries against the relationships between structures in your code. NDepend has some support for scripting and is capable of being integrated with your build process.
To complete LBushkin's answer on NDepend (disclaimer: I am one of the developers of this tool), NDepend can indeed help you with that. The Code Query LINQ (CQLinq) query below matches methods that...
shouldn't provoke any RPC calls but
that are reading/writing any fields of any RPC types,
or that are reading/writing any properties of any RPC types.
Notice how first we define the 4 sets: typesRPC, fieldsRPC, propertiesRPC, methodsThatShouldntUseRPC - and then we match methods that violate the rule. Of course this CQLinq rule needs to be adapted to match your own typesRPC and methodsThatShouldntUseRPC:
warnif count > 0
// First define which types' calls are RPC
let typesRPC = Types.WithNameIn("MyRpcClass1", "MyRpcClass2")
// Define instance fields of RPC types
let fieldsRPC = typesRPC.ChildFields()
    .Where(f => !f.IsStatic).ToHashSet()
// Define instance property getters and setters of RPC types
let propertiesRPC = typesRPC.ChildMethods()
    .Where(m => !m.IsStatic && (m.IsPropertyGetter || m.IsPropertySetter))
    .ToHashSet()
// Define methods that shouldn't provoke RPC calls
let methodsThatShouldntUseRPC =
    Application.Methods.Where(m => m.NameLike("XYZ"))
// Filter methods that shouldn't do any RPC call
// but that are using any RPC fields (reading or writing) or properties
from m in methodsThatShouldntUseRPC.UsingAny(fieldsRPC).Union(
          methodsThatShouldntUseRPC.UsingAny(propertiesRPC))
let fieldsRPCUsed = m.FieldsUsed.Intersect(fieldsRPC)
let propertiesRPCUsed = m.MethodsCalled.Intersect(propertiesRPC)
select new { m, fieldsRPCUsed, propertiesRPCUsed }
My intuition is that detecting which member variables will be accessed is the wrong approach. My first guess at a way to do this would be to just request serialized objects on an as-needed basis (preferably at the beginning of whatever function needs them, not piecemeal). Note that TCP/IP (i.e. Nagle's algorithm) should bundle these requests together if they are small and made in rapid succession.
Eric has it right: to do this well, you need what amounts to a compiler front end. What he didn't emphasize enough is the need for strong flow analysis capabilities (or a willingness to accept very conservative answers possibly alleviated by user annotations). Maybe he meant that in the phrase "semantic analysis" although his example of "goto definition" just needs a symbol table, not flow analysis.
A plain C# parser could only be used to get very conservative answers (e.g., if method A in class C contains identifier X, assume it reads class member X; if A contains no calls then you know it can't read member X).
The first step beyond this is having a compiler's symbol table and type information (if method A refers to class member X directly, then assume it reads member X; if A contains no calls and mentions identifier X only in the context of accesses to objects which are not of this class type, then you know it can't read member X). You have to worry about qualified references, too; Q.X may read member X if Q is compatible with C.
The sticking point is calls, which can hide arbitrary actions. An analysis based on just parsing and symbol tables could determine that if there are calls, the arguments refer only to constants or to objects which are not of the class which A might represent (possibly inherited).
If you find an argument that has a C-compatible class type, now you have to determine whether that argument can be bound to this, requiring control and data flow analysis:
method A( ) {
    Object q=this;
    ...
    ...q=that;...
    ...
    foo(q);
}
foo might hide an access to X. So you need two things: flow analysis to determine whether the initial assignment to q can reach the call foo (it might not; q=that may dominate all calls to foo), and call graph analysis to determine what methods foo might actually invoke, so that you can analyze those for accesses to member X.
You can decide how far you want to go with this by simply making the conservative assumption "A reads X" anytime you don't have enough information to prove otherwise. This will give you a "safe" answer (if not "correct" or what I'd prefer to call "precise").
Of frameworks that might be helpful, you might consider Mono, which surely parses and builds symbol tables. I don't know what support it provides for flow analysis or call graph extraction; I would not expect the Mono-to-IL front-end compiler to do a lot of that, as people usually hide that machinery in the JIT part of JIT-based systems. A downside is that Mono may be behind the "modern C#" curve; last time I heard, it handled only C# 2.0 but my information may be stale.
An alternative is our DMS Software Reengineering Toolkit and its C# Front End.
(Not an open source product).
DMS provides general source code parsing, tree building/inspection/analysis, general symbol table support and built-in machinery for implementing control-flow analysis, data flow analysis, points-to analysis (needed for "What does object O point to?"), and call graph construction. This machinery has all been tested by fire with DMS's Java and C front ends, and the symbol table support has been used to implement full C++ name and type resolution, so it's pretty effective. (You don't want to underestimate the work it takes to build all that machinery; we've been working on DMS since 1995.)
The C# Front End provides for full C# 4.0 parsing and full tree building. It presently does not build symbol tables for C# (we're working on this) and that's a shortcoming compared to Mono. With such a symbol table, however, you would have access to all that flow analysis machinery (which has been tested with DMS's Java and C front ends) and that might be a big step up from Mono if it doesn't provide that.
If you want to do this well, you have a considerable amount of work in front of you. If you want to stick with "simple", you'll have to make do with just parsing the tree and being OK with being very conservative.
You didn't say much about knowing if a method wrote to a member. If you are going to minimize traffic the way you describe, you want to distinguish "read", "write" and "update" cases and optimize messages in both directions. The analysis is obviously pretty similar for the various cases.
Finally, you might consider processing MSIL directly to get the information you need; you'll still have the flow analysis and conservative analysis issues. You might find the following technical paper interesting; it describes a fully-distributed Java object system that has to do the same basic analysis you want to do, and does so, IIRC, by analyzing class files and doing massive byte code rewriting.
Java Orchestra System
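For the MSIL route, a first approximation can be built on plain Reflection: walk a method's IL bytes and report the field-access instructions. The sketch below does only that, with no flow analysis and no following of calls, so property accesses and anything hidden behind a call are missed. `FieldAccessScanner` and `ScanSample` are hypothetical names. `ldflda` (taking a field's address) is reported as a read here, although it can lead to either a read or a write.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical class to scan: Touch both reads and writes Count.
public class ScanSample
{
    public int Count;
    public void Touch() { Count = Count + 1; }
}

public static class FieldAccessScanner
{
    // Maps encoded opcode values to OpCode descriptors, built from OpCodes.
    private static readonly Dictionary<ushort, OpCode> Table = BuildTable();

    private static Dictionary<ushort, OpCode> BuildTable()
    {
        var table = new Dictionary<ushort, OpCode>();
        foreach (FieldInfo f in typeof(OpCodes).GetFields(BindingFlags.Public | BindingFlags.Static))
            if (f.GetValue(null) is OpCode op) table[unchecked((ushort)op.Value)] = op;
        return table;
    }

    // Yields each field touched directly by the method and whether it is stored to.
    public static IEnumerable<(FieldInfo Field, bool IsWrite)> Scan(MethodBase method)
    {
        MethodBody body = method.GetMethodBody();
        if (body == null) yield break; // abstract or extern methods have no IL
        byte[] il = body.GetILAsByteArray();
        Type[] typeArgs = method.DeclaringType != null ? method.DeclaringType.GetGenericArguments() : null;
        Type[] methodArgs = method.IsGenericMethod ? method.GetGenericArguments() : null;
        int pos = 0;
        while (pos < il.Length)
        {
            ushort code = il[pos++];
            if (code == 0xFE) code = (ushort)(0xFE00 | il[pos++]); // two-byte opcode
            OpCode op = Table[code];
            if (op.OperandType == OperandType.InlineField)
            {
                int token = BitConverter.ToInt32(il, pos);
                FieldInfo field = method.Module.ResolveField(token, typeArgs, methodArgs);
                yield return (field, op == OpCodes.Stfld || op == OpCodes.Stsfld);
            }
            pos += OperandSize(op, il, pos);
        }
    }

    private static int OperandSize(OpCode op, byte[] il, int pos)
    {
        switch (op.OperandType)
        {
            case OperandType.InlineNone: return 0;
            case OperandType.ShortInlineBrTarget:
            case OperandType.ShortInlineI:
            case OperandType.ShortInlineVar: return 1;
            case OperandType.InlineVar: return 2;
            case OperandType.InlineI8:
            case OperandType.InlineR: return 8;
            case OperandType.InlineSwitch: return 4 + 4 * BitConverter.ToInt32(il, pos);
            default: return 4; // tokens, 32-bit ints, long branches, 32-bit floats
        }
    }
}
```

Scanning `typeof(ScanSample).GetMethod("Touch")` reports Count once as a read and once as a write. To stay conservative, a tool built on this should treat any method containing a call as possibly touching everything until the callee has been analyzed too.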
By RPC do you mean .NET Remoting? Or DCOM? Or WCF?
All of these offer the opportunity to monitor cross process communication and serialization via sinks and other constructs, but they are all platform specific, so you'll need to specify the platform...
You could listen for the event that a property is being read/written to with an interface similar to INotifyPropertyChanged (although you obviously won't know which method effected the read/write.)
I think the best you can do is explicitly maintain a dirty flag.
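A dirty flag can be kept per member rather than per object, so that only changed state needs to be sent. The following is a minimal sketch under that assumption; `TrackedCustomer` and its members are hypothetical, and the serialization step itself is left out.

```csharp
using System;
using System.Collections.Generic;

public class TrackedCustomer
{
    private string name;
    private int visits;
    private readonly HashSet<string> dirty = new HashSet<string>();

    public string Name
    {
        get { return name; }
        set { if (name != value) { name = value; dirty.Add(nameof(Name)); } }
    }

    public int Visits
    {
        get { return visits; }
        set { if (visits != value) { visits = value; dirty.Add(nameof(Visits)); } }
    }

    // True when at least one property changed since the last MarkClean call.
    public bool IsDirty
    {
        get { return dirty.Count > 0; }
    }

    // Names of the properties that would need to be serialized.
    public IEnumerable<string> DirtyMembers
    {
        get { return dirty; }
    }

    // Call after the changed members have been sent to the remote side.
    public void MarkClean()
    {
        dirty.Clear();
    }
}
```

The RPC layer would check `IsDirty` before a call, send only `DirtyMembers`, and then call `MarkClean`. Assigning a property its current value does not mark it dirty.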