I am trying to produce a tool that can programmatically examine release-build binaries produced from identical C# code compiled on two separate machines at different times, conclude that the code was identical, and still pick up any changes that are present in the C# code used to produce those binaries.
I have tried a number of approaches, but to keep this short I'll just stick to the latest attempt.
I run ildasm with the /text option on the binaries and replace the GUIDs for anonymous fields etc. in the text, but when the binaries come from different PCs I find that the text produced by the ILDASM /text option is reordered. Binaries originating from the same code, compiled with the same setup on different machines, also appear heavily reordered. Any suggestion on how one might control this reordering of the IL would be much appreciated.
Cheers
PS: Any alternative strategies of reliably accomplishing this are also most welcome.
Waiting for Eric Lippert to wake up :) - community wiki created from #mikez's comment:
When a principal developer (Eric Lippert) on the compiler team speaks, you should listen: http://ericlippert.com/2012/05/31/past-performance-is-no-guarantee-of-future-results/ contains a detailed explanation and a strong recommendation against doing this (likely in response to this precise question):
Is compiling the same C# program twice guaranteed to produce the same binary output?
No.
I found that a solution along the lines of what Eric Lippert mentioned in his post (what his client ended up settling for) can be reached by setting the processor affinity of the compilation process to 01. After this, the executables/DLLs produced are almost identical, with the exception of some MVIDs and GUIDs. Running ILDASM on these binaries in text mode and building a simple hashing tool to strip away this random stuff provides such a solution. I am just providing this for the sake of completeness and to help others who may face this problem.
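The hashing tool mentioned above can be sketched roughly as follows. This is a minimal Python sketch, not the original tool; the MVID comment format and the GUID pattern are my assumptions about what ILDASM /text output needs stripped:

```python
import hashlib
import re

# Matches 8-4-4-4-12 hex GUIDs, e.g. the suffixes in generated type names.
GUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-"
    r"[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)

def normalize_il(il_text: str) -> str:
    """Strip per-build noise (MVID comment lines, GUID-shaped tokens)
    from ILDASM /text output so identical source yields identical text."""
    kept = []
    for line in il_text.splitlines():
        stripped = line.lstrip()
        # ILDASM emits the module version id as a comment line; drop it.
        if stripped.startswith("// MVID:") or stripped.startswith("// Image base:"):
            continue
        kept.append(GUID_RE.sub("GUID", line))
    return "\n".join(kept)

def il_fingerprint(il_text: str) -> str:
    """Hash of the normalized disassembly; equal hashes suggest equal code."""
    return hashlib.sha1(normalize_il(il_text).encode("utf-8")).hexdigest()
```

Two disassemblies that differ only in their MVID and GUIDs then produce the same fingerprint.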
Related
I'm using a build script to compile several C# projects. The binary output is copied to a result folder, overwriting the previous version of the files, and then added/committed to subversion.
I noticed that the binary output of the compilation is different even when there was no change to the source or environment at all. How is this possible? Isn't the binary result supposed to be exactly equal for the same input?
I'm not intentionally using any kind of special timestamps anywhere, but does the compiler (Microsoft, the one included in .NET 4.0) possibly add timestamps itself?
The reason I'm asking is I'm committing the output to subversion, and due to the way our build server works the checked in changes trigger a rebuild, causing the once again modified binary files to be checked in in a circle.
ANOTHER UPDATE:
Since 2015 the compiler team has been making an effort to get sources of non-determinism out of the compiler toolchain, so that identical inputs really do produce identical outputs. See the "Concept-determinism" tag on the Roslyn github for more details.
UPDATE: This question was the subject of my blog in May 2012. Thanks for the great question!
How is this possible?
Very easily.
Isn't the binary result supposed to be exactly equal for the same input?
Absolutely not. The opposite is true. Every time you run the compiler you should get a different output. Otherwise how could you know that you'd recompiled?
The C# compiler embeds a freshly generated GUID in an assembly on every compilation, thereby guaranteeing that no two compilations produce exactly the same result.
Moreover -- even without the GUID, the compiler makes no guarantees whatsoever that two "identical" compilations will produce the same results.
In particular, the order in which the metadata tables are populated is highly dependent on details of the file system; the C# compiler starts generating metadata in the order in which the files are given to it, and that can be subtly changed by a variety of factors.
due to the way our build server works the checked in changes trigger a rebuild, causing the once again modified binary files to be checked in in a circle.
I'd fix that if I were you.
Yes, the compiler includes a timestamp. Additionally, in some cases the compiler will auto-increment the assembly version number. I haven't seen any guarantee anywhere that the binary result is meant to be identical.
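Jon's point about the timestamp is easy to verify directly: the compiler writes a TimeDateStamp field into the COFF header of the PE file, and it differs between builds. A minimal sketch for reading it (assuming a well-formed PE image):

```python
import struct

def pe_timestamp(data: bytes) -> int:
    """Read the COFF TimeDateStamp from a PE image.

    Layout: the DOS header stores the PE header offset (e_lfanew) at 0x3C;
    the stamp sits 8 bytes past the start of the 'PE\\0\\0' signature
    (4 signature bytes + 2 Machine + 2 NumberOfSections).
    """
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE image")
    (stamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return stamp
```

Calling `pe_timestamp(open("MyAssembly.dll", "rb").read())` on two consecutive builds of the same source shows two different values.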
(Note that if the source is already in Subversion, I'd generally steer clear of also adding the binary files in there. I'd usually only include releases of third-party libraries. It depends on exactly what you're doing though.)
As mentioned by others, the compiler does generate a distinct build each time, hence the different result. What you are looking for is the ability to create deterministic builds, and this is now included as part of the Roslyn compiler.
Roslyn command line options
/deterministic Produce a deterministic assembly (including module
version GUID and timestamp)
Read more about this feature
https://github.com/dotnet/roslyn/blob/master/docs/compilers/Deterministic%20Inputs.md
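With that flag, verifying reproducibility reduces to comparing whole-file hashes of two builds. A sketch of the check (the `csc` invocation is an assumption that a Roslyn-era compiler is on the PATH; the function names are my own):

```python
import hashlib
import subprocess

def sha256_of(path: str) -> str:
    """Whole-file hash; with /deterministic, identical inputs hash identically."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build(source: str, out: str) -> None:
    # Hypothetical invocation: assumes csc is on the PATH.
    subprocess.run(["csc", "/deterministic", "/out:" + out, source], check=True)

# Usage sketch:
#   build("Program.cs", "a.exe"); build("Program.cs", "b.exe")
#   sha256_of("a.exe") == sha256_of("b.exe")  # expected with /deterministic
```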
As far as I know, only MS binaries differ on every compile. Twenty or so years ago it wasn't like that: the MS binaries were the same after every compile (assuming the source code was the same).
I want to convert asm to C (assembly to C).
I saw http://www.textmaestro.com/InfoEx_17_Convert_Assembly.htm
(please see the page)
on the web, and after that I tried to do this job using find-and-replace with regex in C#.
I am not a computer-field student, so I am not proficient with regex.
I have been working on this for 5 days, and now I know that I can't do it. I wrote a lot of code, but without any success.
sample program:
mov r1,1;
mov r2,2;
convert to :
r1=1;
r2=2;
please help me to do this correctly
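For the trivial `mov` pattern shown above, a regex translation is possible. A minimal sketch (it handles only this one two-operand instruction form and, as the answer below explains, will not generalize to real assembly):

```python
import re

# Handles only the two-operand `mov dst,src;` form from the question.
MOV_RE = re.compile(r"^\s*mov\s+(\w+)\s*,\s*(\w+)\s*;?\s*$", re.IGNORECASE)

def translate_line(line: str) -> str:
    """Turn `mov r1,1;` into `r1=1;`; pass anything else through unchanged."""
    m = MOV_RE.match(line)
    if m:
        return f"{m.group(1)}={m.group(2)};"
    return line
```

Anything beyond direct register-to-value moves (labels, flags, addressing modes, control flow) needs real parsing, which is where the answer's program transformation systems come in.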
OP has (painfully) learned that regexes are not a good solution to problems that involve analysis or translation of software. Processing strings is simply not the same as building context-sensitive analyses of text with complex structure.
People keep re-learning this lesson. It is true that you can use repeated regex to simulate Post rewriting systems, and Post systems, being Turing capable, can technically do anything. It is also true that nobody really wants to, or more importantly, nobody can write a very complex program for a real Turing machine [or an equivalent Post system]. This is why we have all these other computer languages and tools. [The TextMaestro system to which OP refers is trying to be exactly that Post system.]
However, the task he wants to do is possible and practical with the proper tools: program transformation systems (PTS).
In particular, he should see this technical paper for a description of precisely how this has been done with one particular PTS: See Pigs from sausages? Reengineering from assembler to C via FermaT transformations. Such a tool in effect is a custom compiler from assembly source code to the target language, and includes parsing, name (label) resolution, often data flow analysis and complex code generation and optimization. A PTS is used because they make it relatively easy to build that kind of compiler. This tool has been used for at least Intel assembly to C, and mainframe (System 360/370/Z) assembly to C, for large-scale tasks. (I have no relationship to this tool but do have huge respect for the authors).
The naysayers in the comments seem to think this is impossible to do except for extremely constrained circumstances. It is true that the more one knows about the assembly code in terms of idioms, the somewhat easier this gets, but the technical approach in the paper is not limited to specific compiler output by any means. It is also true that truly arcane assembler code (especially self-modifying or having runtime code generation) is extremely difficult to translate.
Has anyone come across a tool to report on commented-out code in a .NET app? I'm talking about patterns like:
//var foo = "This is dead";
And
/*
var foo = "This is dead";
*/
This won't be found by tools like ReSharper or FxCop which look for unreferenced code. There are obvious implications around distinguishing commented code from commented text but it doesn't seem like too great a task.
Is there any existing tool out there which can pick this up? Even if it was just reporting of occurrences by file rather than full IDE integration.
Edit 1: I've also logged this as a StyleCop feature request. Seems like a good fit for the tool.
Edit 2: Yes, there's a good reason why I'd like to do this and it relates to code quality. See my comment below.
You can get an approximate answer by using a regexp that recognizes comments that end with ";" or "}".
For a more precise scheme, see this answer:
Tool to find commented out VHDL code
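As a starting point, the heuristic above (flag // comments ending in ";" or "}") can be sketched in a few lines. The regex details, including skipping /// doc comments, are my own choices, and false positives on code-like prose remain possible:

```python
import re

# A // comment whose body ends in ';', '{' or '}' is probably commented-out
# code rather than prose; '///' doc comments are deliberately skipped.
CODE_COMMENT_RE = re.compile(r"^\s*//(?!/)\s*(.*[;{}])\s*$")

def find_commented_code(source: str):
    """Return (line number, comment body) pairs that look like dead code."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = CODE_COMMENT_RE.match(line)
        if m:
            hits.append((lineno, m.group(1)))
    return hits
```

Block comments (/* ... */) would need a preprocessing pass that rewrites them into line comments first, as one of the answers below describes doing.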
I've been down this road for the same reason. I more or less did what Ira Baxter suggested (though I focused on `variable_type variable = value`) and specifically looked for lines that consisted of zero or more whitespace characters at the beginning, followed by //, followed by code (and to handle /* */, I wrote a preprocessor that converted those into //'s). I tweaked the regex to cut down on false positives and also did a manual inspection just to be safe; fortunately, there were very few cases where the comment was doing pseudo-code-like things, as drachenstern suggests above; YMMV. I'd love to find a tool that could do this, but some false positives from valid but possibly overly detailed pseudo code are going to be really hard to rule out, especially if the authors are using literate programming techniques to make the code as "readable" as pseudo code.
(I also did this for VB6 code; on the one hand, the lack of semicolons made it harder to write an "easy" regex; on the other hand, the code used a lot fewer classes, so it was easier to match on variable types, which tend not to appear in pseudo code.)
Another option I looked at but didn't have available was to look at the revision control logs for lines that were code in version X and then //same line in version Y. This of course assumes that (a) they are using revision control, (b) you can view it, and (c) they didn't write the code and comment it out in the same revision. (And it gets a little trickier if they use /* */ comments.)
There is another option for this: Sonar. It is actually a Java-centric app, but there is a plugin that can handle C# code. Currently it can report on:
StyleCop errors
FxCop errors
Gendarme errors
Duplicated code
Commented code
Unit test results
Code coverage (using NCover, OpenCover)
It does take a while to scan, mostly due to the duplication checks (AFAIK it uses text matching rather than C# syntax trees), and if you use the internal default Derby database it can fail on large code bases. However, it is very useful for the code-base metrics you gain, and it has snapshot features that enable you to see how things have changed and (hopefully) got better over time.
Since StyleCop is not actively maintained (see https://github.com/StyleCop/StyleCop#considerations), I decided to roll out my own, dead-csharp:
https://github.com/mristin/dead-csharp
Dead-csharp uses heuristics to find code patterns in the comments. The comments starting with /// are intentionally ignored (so that you can write code in structured comments).
StyleCop will catch the first pattern. It suggests you use //// as a comment for code so that it will ignore the rule.
Seeing as you mentioned NDepend, it can also tell you Percentage Comment: http://www.ndepend.com/Metrics.aspx#PercentageComment. This is defined for applications, assemblies, namespaces, types and methods.
I'll give you a little bit of background first as to why I'm asking this question:
I am currently working in a strictly regulated industry, and as such our code is quite carefully looked over by official test houses. These test houses expect to be able to build the code and generate an .exe or .dll which is EXACTLY the same each and every time (without changing any code, obviously!). They check the MD5 and SHA1 of the executables they create to ensure this.
Up until this point I have predominantly been coding in C++, where (after a few project-setting tweaks) I managed to get the projects to rebuild consistently to the same MD5/SHA1. I am now using C# in a project and am having great difficulty getting the MD5s to match after a rebuild. I am aware that there are time stamps in the PE header of the file, and they have been cleared to 0. I am also aware that there is a GUID for the .exe, which again has been cleared to 00 00 00... etc. However, the files still don't match.
I'm using CFF Explorer to view and edit the PE header to remove the time and date stamps. After using a binary comparison tool, there are only two blocks of bytes in the .exe's that differ (both very small).
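For anyone repeating this comparison without a dedicated binary diff tool, here is a small sketch that locates the differing byte ranges between two equal-length builds (the function name and the merge threshold are my own):

```python
def diff_blocks(a: bytes, b: bytes, gap: int = 4):
    """Return (start, end) offsets of regions where two equal-length
    binaries differ, merging runs separated by fewer than `gap` bytes."""
    assert len(a) == len(b), "compare equal-length builds only"
    blocks = []
    start = None       # start of the block currently being collected
    last_diff = None   # offset of the most recent differing byte
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i
            elif i - last_diff >= gap:
                # Too far from the previous difference: close that block.
                blocks.append((start, last_diff + 1))
                start = i
            last_diff = i
    if start is not None:
        blocks.append((start, last_diff + 1))
    return blocks
```

On the files described above this would report exactly two small ranges: one at the PDB GUID and one at the <PrivateImplementationDetails> name.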
One of the inconsistent blocks appears just before some binary code which, in ASCII, details the path of the *Project*\obj\Release\xxx.pdb file.
EDIT: This is now known to be the GUID of the *.pdb file; however, I still don't know whether I can modify it without causing any errors.
The other block appears in the middle of what looks to be function names, ie. (a typical section) AssemblyName.GetName.Version.get_Version.System.IO.Ports.SerialPort.Parity.Byte.<PrivateImplementationDetails>{
then the different code block:
4A134ACE-D6A0-461B-A47C-3A4232D90816
followed by:
"}.ValueType.__StaticArrayInitTypeSize=7.$$method0x60000ab-1.RuntimeFieldHandle.InitializeArray`... etc..
Any ideas or suggestions would be most welcome!
Update: Roslyn seems to have a /feature:deterministic compiler flag for reproducible builds, although it's not 100% working yet.
You should be able to get rid of the debug GUID by disabling PDB generation. If not, setting the GUID to zeroes is fine - only debuggers look at that section (you won't be able to debug the assembly anymore, but it should still run fine).
The PrivateImplementationDetails are a bit more difficult - these are internal helper classes generated by the compiler for certain language constructs (array initializers, switch statements using strings, etc.). Because they are only used internally, the class name doesn't really matter, so you could just assign a running number to them.
I would do this by going through the #Strings metadata stream and replacing all strings of the form "<PrivateImplementationDetails>{GUID}" with "<PrivateImplementationDetails>{running number, padded to same length as a GUID}".
The #Strings metadata stream is simply the list of strings used by the metadata, encoded in UTF-8 and separated by \0; so finding and replacing the names should be easy once you know where the #Strings stream is inside the executable file.
Unfortunately the "metadata stream headers" containing this information are quite buried inside the file format. You'll have to start at the NT Optional Header, find the pointer to the CLI Runtime Header, resolve it to a file position using the PE section table (it's an RVA, but you need a position inside the file), then go to the metadata root and read the stream headers.
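Rather than walking those metadata headers, a cruder sketch simply scans the raw image for the pattern and substitutes a running number padded to GUID length, so every offset in the #Strings stream is preserved. This assumes the name is stored as plain text, as described above; it is a hypothetical illustration, not code validated against a real assembly:

```python
import re

# The 8-4-4-4-12 hex GUID embedded in the generated helper class name.
PRIV_RE = re.compile(
    rb"<PrivateImplementationDetails>\{"
    rb"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-"
    rb"[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}\}"
)

def neutralize_private_impl(raw: bytes) -> bytes:
    """Replace each GUID-named helper class with a running number padded to
    the 36 characters a GUID occupies, keeping every byte offset intact."""
    counter = 0
    def repl(m):
        nonlocal counter
        counter += 1
        return (b"<PrivateImplementationDetails>{"
                + str(counter).zfill(36).encode("ascii") + b"}")
    return PRIV_RE.sub(repl, raw)
```

Because the replacement is exactly the same length as the original name, no stream sizes or RVAs change, which is the property the answer's approach relies on.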
I'm not sure about this, but just a thought: are you using any anonymous types for which the compiler might generate names behind the scenes, which might be different each time the compiler runs? Just a possibility which occurred to me. Probably one for Jon Skeet ;-)
Update: You could perhaps also use Reflector addins for comparison and disassembly.
Regarding the PDB GUID problem, if you specify that a PDB shouldn't be generated at compilation for Release builds, does the binary still contain the PDB's file system GUID?
To disable PDB generation:
Right-click your project in Solution Explorer and select Properties.
From the menu along the left, select Build.
Ensure that the Configuration selection is Release (you'll still want a PDB for debugging).
Click the Advanced button in the bottom right.
Under Output / Debug Info, select None.
If you're building from the console, use /debug- to get the same result.
Take a look at the answers from this question. Especially at the external link provided in the third one.
EDIT:
I actually wanted to link to this article.
You said that after a few project tweaks you were able to get C++ apps to compile repeatably to the same SHA1/MD5 values. I'm in the same boat as you in being in an industry with a third party test lab that needs to rebuild exactly the same executables repeatably.
In researching how to make this happen in VS2005, I came across your post here. Could you share the project tweaks you did to make the C++ apps build to the same SHA1/MD5 values consistently? It would be of great help to me and perhaps to any others who share this requirement.
Use ildasm.exe to fully disassemble both programs and compare the IL. Then you can "clean" the code using text-based methods and (predictably) recompile it again.
I work on a team with about 10 developers. Some of the developers have very exacting formatting needs. I would like to find a pretty printer that I could configure to these specifications and then add to the build process. That way, no matter how badly other people mess up the formatting, the code will look acceptable when it is pulled down from source control.
The easiest solution is for the team lead to mandate a format and everyone use it. The VS defaults are pretty good.
Jeff Atwood did that to us here on Stack Overflow and while I rebelled at first, I got over it :) Makes everything much easier!
Coding standards are definitely something we have. The code formatting I am talking about is imposed by a grizzled architect who is, let's say, set in his ways and extremely particular. Let's just pretend that we cannot address the human factor. I was looking for a way to circumvent the whole human process.
The Visual Studio defaults sadly do not address line breaks very well. I am just making this line-chopping style up, but....
ServiceLocator.Logger.WriteDefault(string.format("{0}{1}"
,foo
,bar)
,Logging.SuperDuper);
Another example of formatting that Visual Studio is not too hot at....
if( foo
&& ( bar
|| baz
|| apples
|| oranges)
&& IsFoo()
&& IsBar() ){
}
Visual Studio does not play well at all with stuff like this. We are currently using ReSharper to allow for more granularity with formatting, but it sadly falls short in many areas.
Don't get me wrong, though: coding standards are great. The goal of the pretty printer as part of the build process is to get 'perfect'-looking code no matter how well people are paying attention or counting their spaces.
The edge cases around code formatting are very solvable, since it is a well-defined grammar.
As far as the VS defaults go I can only say: BSD style or die!
So all that brings me full circle back to: is there a configurable pretty printer for C#? As much as lexical analysis and parsing fascinate me, I have about had my fill after making a YAML C# tool chain.
Your issue was the primary motivation for creating NArrange (beta). It allows configurable reformatting of C# code, and you can use one common configuration file to be shared by the entire team. Since its focus is primarily on reordering members in classes and controlling regions, it is still lacking many necessary formatting options (especially formatting within member code lines).
The normal usage scenario is for each developer to run the tool prior to check-in. I'm not aware of anyone running it as part of their build process, but there is no reason why you couldn't, since it is a command-line tool. One idea I've contemplated is running NArrange on files as part of a pre-commit step: if the original file contents being checked in don't match the NArrange-formatted output on the source repository server, then the developer didn't reformat to the rules and a check-in error can be raised.
For more information, see my CodeProject article on Using NArrange to Organize C# Code.
Update 2023-02-25: NArrange appears to have moved to Github. The NArrange site (referenced above) is no longer available although there are copies in web.archive.org
I second Jarrod's answer. If you have 2 developers with conflicting coding preferences, then get the rest of the team to vote, and then get the boss to back the majority decision.
Additionally, the problem with trying to automatically apply a pretty printer like that is that there will always be exceptional cases where your blanket coding standard is not the best or most readable solution, and you will lose out by squashing them with an automated tool.
Coding Standards are just that, standards. They don't call them Coding Laws or Coding Rules, and there's a good reason for that.