I'm interested in data mining projects, and have always wanted to create a classification algorithm that would determine which specific check-ins need code-reviews, and which may not.
I've developed many heuristics for my algorithm, although I've yet to figure out the killer...
How can I programmatically check the computational complexity of a chunk of code?
Furthermore, and even more interesting: how could I use not just the code but also the diff that the source control repository provides to obtain better data?
I.e., if I add complexity to the code I'm checking in, but it reduces complexity in the code that is left, shouldn't that be considered 'good' code?
Interested in your thoughts on this.
UPDATE
Apparently I wasn't clear. I want this
double codeValue = CodeChecker.CheckCode(someCodeFile);
I want a number to come out based on how good the code was. I'll start with numbers like VS2008 gives when you calculate complexity, but would like to move to further heuristics.
Anyone have any ideas? It would be much appreciated!
Have you taken a look at NDepend? This tool can be used to calculate code complexity and supports a query language by which you can get an incredible amount of data on your application.
The NDepend web site contains a list of definitions of various metrics. Deciding which are most important in your environment is largely up to you.
NDepend also has a command line version that can be integrated into your build process.
Also, Microsoft's Code Analysis (ships with VS Team Suite) includes metrics which check the cyclomatic complexity of code, and raises a build error (or warning) if this number is over a certain threshold.
I don't know offhand, but it may be worth checking whether this number is configurable to your requirements. You could then modify your build process to run code analysis any time something is checked in.
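To give a feel for the kind of number being discussed, here is a minimal sketch in Python (chosen only for brevity) that approximates cyclomatic complexity by counting decision points in C#-like source text, and compares the score before and after a check-in. The keyword list and the names approx_complexity and complexity_delta are assumptions made for illustration; this is not how VS2008, Code Analysis, or NDepend compute their metrics, and it does not skip comments or string literals.

```python
import re

# Tokens that open an extra execution path in C#-like source.
# The list is an assumption for illustration, not a tool's real rule set.
DECISION_PATTERN = re.compile(
    r"\b(?:if|for|foreach|while|case|catch)\b|&&|\|\|"
)

def approx_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the text."""
    return 1 + len(DECISION_PATTERN.findall(source))

def complexity_delta(before: str, after: str) -> int:
    """Negative means the check-in left the code simpler than it found it."""
    return approx_complexity(after) - approx_complexity(before)

before = "if (a && b) { x(); } else if (c) { y(); } while (d) { z(); }"
after = "if (a) { x(); }"
print(approx_complexity(before), approx_complexity(after), complexity_delta(before, after))
```

A delta computed this way on the whole file before and after the diff would reward a check-in that adds a little complexity in one place while removing more elsewhere.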
See Semantic Designs C# Metrics Tool for a tool that computes a variety of standard metric values, both over complete files and over all reasonable subdivisions (methods, classes, ...).
The output is an XML document, but extracting the value(s) you want from that should be trivial with an XML reader.
Related
I'm looking into using SQ to track TechDebt/CodeSmells in our C# projects.
I've made a Roslyn analyzer that looks for e.g. [TechDebt(Smell.IndecentExposure,...)] attributes, and this works well using the SQ Roslyn SDK. However, the only options for the rules in SQALE are CONSTANT_ISSUE and LINEAR, whereas the effort for these items usually needs to be reviewed and can be different for each issue, rather than being fixed or dependent on the number of lines of code.
I can't find any way to change the issue effort on the server after a scan, and I don't see any actions for it in here either.
Is adding a changeEffortAction similar to the changeSeverityAction the simplest thing for me to do? I'm looking at ScannerReport.Issue.getGap() too; possibly I can hack the scanner output to put a value there originating from the attribute?
Is there a quicker way to accomplish what I'm trying to do here?
There are currently no provisions for updating effort post-analysis. Remediation costs are estimated on an "average developer, average day" basis.
At a high level, my problem is:
We have a couple of applications with millions of lines of legacy code (C# and SQL). I need to figure out which areas of the code are used most.
It may not be possible to find exact figures (especially in apps when code is being called based on user's action in GUI).
However, to get some rough figures, my thoughts are to find out:
1) The list of classes and methods
2) The number of times they are called from within the code (by means of direct method calls, delegates, etc.)
3) All the stored procs/DB functions (this should be fairly straightforward)
4) All the calls to stored procs
Could you please let me know if you are aware of any tools to achieve this?
Or any other ideas for fetching the above four details? Also, apart from these, is there any other way to do this analysis?
Thanks in advance!
I have used Red Gate's ANTS Profiler before:
http://www.red-gate.com/products/dotnet-development/ants-performance-profiler/
It's powerful and very easy to use (comes with a visual studio plugin). 14 days free!
One way you could achieve this is using Aspect Oriented Programming (AOP). I have used this previously in Java with the Spring Framework, but haven't used it before on .NET projects.
You could check out something like:
http://blogs.msdn.com/b/morgan/archive/2008/12/18/method-entry-exit-logging.aspx
This will give you an idea of how frequently methods are being called. You will simply need to collate the data in the logs into some form that gives you an overall idea of the usage patterns of the codebase.
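The idea of intercepting method entry and tallying calls can be sketched as follows. This is Python rather than .NET, and the decorator count_calls, the call_counts store, and the two sample functions are all hypothetical names invented for the illustration; a real AOP framework would weave this in without touching each method.

```python
import functools
from collections import Counter

# Hypothetical in-memory tally; a real system would write log lines instead.
call_counts = Counter()

def count_calls(func):
    """Wrap func so that every invocation is tallied under its name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1   # the "method entry" hook
        return func(*args, **kwargs)
    return wrapper

@count_calls
def load_order(order_id):
    return {"id": order_id}

@count_calls
def save_order(order):
    return True

for i in range(3):
    load_order(i)
save_order({"id": 1})

# Most frequently called methods first.
print(call_counts.most_common())
```

Sorting the tally by count gives the rough "most used" ranking; methods that never appear in it are candidates for dead code, subject to how representative the recorded usage was.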
Edit:
Further information on using this method can be found in other SO posts:
Logging entry and exit of methods along with parameters automagically?
https://stackoverflow.com/a/25825/685760
I currently have sensor data being dumped into a database. This is raw data, and needs an equation applied to it in order for it to make any sense to the end users. The problem I have, is that I do not know most of the formulas yet, and would also like the program to be flexible enough that when a new sensor is added to the system, the user would be able to enter in the calibration equation that would be able to convert the raw data into something useful.
I have never worked with letting a user enter an equation to manipulate data. I would appreciate any input that might help. What direction should I be looking in: should I be trying out lambda expression trees, evaluating the equation and compiling it using CodeDom, or looking in another direction? I have never done much with either lambda expression trees or CodeDom and, as always, I am on a fairly tight schedule, so the learning curve does count. I will have the opportunity to go back and make it better at a later date; they just need it up and running for now.
Thanks for any input.
I highly recommend FLEE for expression parsing/evaluation. It has a custom IL compiler that emits fast IL that doesn't have the memory problems that CodeDOM has.
It also has the desirable attribute of being easy to code with and extend.
I think you need to see what works for you. I also thought of the two only to find out you have mentioned them. I think the other alternative is to allow for parameters of a few major formulae to be stored (i.e. cubic, quadratic, exponential, log, ...) and one selected as the one to be used.
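The "store parameters for a few major formulae" alternative might look roughly like this. It is a sketch in Python; the formula families, the sensor names, and the coefficient values are all made up for illustration, and in practice the per-sensor row would live in the database next to the raw data.

```python
import math

# Hypothetical formula families; each maps (raw, params) to a calibrated value.
FORMULAS = {
    "polynomial": lambda raw, p: sum(c * raw ** i for i, c in enumerate(p)),
    "exponential": lambda raw, p: p[0] * math.exp(p[1] * raw),
    "log": lambda raw, p: p[0] + p[1] * math.log(raw),
}

# Hypothetical per-sensor configuration: (formula kind, coefficients).
SENSOR_CALIBRATION = {
    "temp_probe_1": ("polynomial", [-40.0, 0.1]),      # -40 + 0.1 * raw
    "flow_meter_7": ("polynomial", [0.0, 2.0, 0.5]),   # 2 * raw + 0.5 * raw^2
}

def calibrate(sensor_id, raw):
    """Look up the sensor's formula and apply it to one raw reading."""
    kind, params = SENSOR_CALIBRATION[sensor_id]
    return FORMULAS[kind](raw, params)

print(calibrate("temp_probe_1", 650))
print(calibrate("flow_meter_7", 2))
```

Adding a sensor then only means adding a row of coefficients, at the cost of being limited to the formula shapes you anticipated.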
I would personally use the expression trees because it is the cleanest. One problem with CodeDom is the memory leak caused by compiling code especially if the user changes the code and builds the formula multiple times. One solution would be to load the compiled code in a separate AppDomain and then unload the whole appdomain.
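To illustrate the expression-tree approach in general terms, here is a minimal sketch that parses a user-entered formula into a tree and evaluates it while allowing only arithmetic. It uses Python's standard ast module as a stand-in for .NET expression trees, so it shows the idea rather than the .NET API; the function name evaluate and the variable name raw are assumptions. Unlike a compiled expression tree, this version re-walks the tree on every evaluation.

```python
import ast
import operator

# Operators the user is allowed to use; anything else is rejected.
_BINARY = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def evaluate(expression, variables):
    """Evaluate an arithmetic expression over named variables, e.g. 'raw'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression element: " + type(node).__name__)
    return walk(ast.parse(expression, mode="eval"))

print(evaluate("(raw - 4) * 6.25", {"raw": 12}))
```

Because only whitelisted node types are accepted, input such as a function call is rejected with a ValueError instead of being executed.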
I need to write a simple source control system and wonder what algorithm I would use for file differences?
I don't want to look into existing source code due to license concerns. I need to have it licensed under MPL so I can't look at any of the existing systems like CVS or Mercurial as they are all GPL licensed.
Just to give some background, I just need some really simple functions: binary files in a folder, no subfolders, and every file behaves like its own repository. No metadata except for some permissions.
Overall really simple stuff, my single concern really is how to store only the differences of a file from revision to revision without wasting too much space but also without being too inefficient (Maybe store a full version every X changes, a bit like Keyframes in Videos?)
Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.
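For reference, the textbook dynamic-programming form of longest common subsequence, and a diff derived from it, can be sketched like this in Python. This is the plain O(n*m) time and memory version, not what any particular diff tool ships; the function names are my own. It works on any two sequences, including lists of lines or bytes objects.

```python
def lcs_table(a, b):
    """table[i][j] = length of the LCS of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table

def diff(a, b):
    """Yield (' ', x), ('-', x) or ('+', x) for kept, deleted, inserted items."""
    table = lcs_table(a, b)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            yield (" ", a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            yield ("-", a[i])
            i += 1
        else:
            yield ("+", b[j])
            j += 1
    for x in a[i:]:
        yield ("-", x)
    for x in b[j:]:
        yield ("+", x)

for mark, item in diff(["a", "b", "c", "d"], ["a", "c", "d", "e"]):
    print(mark, item)
```

The quadratic table is the main weakness for large files, which is why production tools use refinements of this idea.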
"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.
Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.
Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.
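A reverse-delta store can be sketched in a few lines of Python. This version leans on the standard library's difflib.SequenceMatcher to find matching byte runs, which is convenient for a demonstration but slow on large files; the copy/insert operation format and the class name ReverseDeltaStore are assumptions of this sketch, not the format of any existing system.

```python
from difflib import SequenceMatcher

def make_delta(source: bytes, target: bytes):
    """Describe how to rebuild target from source as copy/insert operations."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # reuse source[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal bytes from target
        # "delete" needs no operation: those source bytes are just not copied
    return ops

def apply_delta(source: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += source[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

class ReverseDeltaStore:
    """Keeps the newest revision in full and older ones as backward deltas."""
    def __init__(self, first: bytes):
        self.latest = first
        self.deltas = []   # deltas[k] rebuilds revision k from revision k + 1

    def commit(self, new: bytes):
        self.deltas.append(make_delta(new, self.latest))
        self.latest = new

    def get(self, revision: int) -> bytes:
        data = self.latest
        for delta in reversed(self.deltas[revision:]):
            data = apply_delta(data, delta)
        return data

store = ReverseDeltaStore(b"hello world")
store.commit(b"hello brave world")
store.commit(b"goodbye brave world!")
print(store.get(0), store.get(1), store.get(2))
```

Reading the latest revision costs nothing extra, while older revisions cost one delta application per step back; keeping an occasional full copy in the chain would bound that cost.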
How about looking at the source code of Subversion? It's licensed under the Apache License 2.0.
Gene Myers has written a good paper, "An O(ND) Difference Algorithm and its Variations". When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.
The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"
Lots of source control systems store deltas as computed essentially by "diff", e.g., the line-oriented complement of longest-common-subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.
For programming language source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming languages can be found at Smart Differencer.
If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.
Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).
Though fossil is GPL, the delta algorithm is based on rsync and described here
I was actually thinking about something similar to this the other day... (odd, huh?)
I don't have a great answer for you but I did come to the conclusion that if I were to write a file diff tool, that I would do so with an algorithm (for finding diffs) that functions somewhat like how REGEXes function with their greediness.
As for storing DIFFs... If I were you, instead of storing forward-facing DIFFs (i.e. you start with your original file and then compute 150 diffs against it when you're working with version 151), I would use stored DIFFs for your history but keep your latest file stored as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.
I work on a team with about 10 developers. Some of the developers have very exacting formatting needs. I would like to find a pretty printer that I could configure to these specifications and then add to the build processes. In this way no matter how badly other people mess up the format when it is pulled down from source control it will look acceptable.
The easiest solution is for the team lead to mandate a format and everyone use it. The VS defaults are pretty good.
Jeff Atwood did that to us here on Stack Overflow and while I rebelled at first, I got over it :) Makes everything much easier!
Coding standards are definitely something we have. The code formatting I am talking about is imposed by a grizzled architect who is, let's say, set in his ways and extremely particular. Let's just pretend that we cannot address the human factor. I was looking for a way to circumvent the whole human process.
The Visual Studio defaults sadly do not address line breaks very well. I am just making this line-chopping style up, but...
ServiceLocator.Logger.WriteDefault(string.Format("{0}{1}"
                                                 ,foo
                                                 ,bar)
                                   ,Logging.SuperDuper);
Another example of formatting Visual Studio is not too hot at:
if( foo
    && ( bar
         || baz
         || apples
         || oranges)
    && IsFoo()
    && IsBar() ){
}
Visual Studio does not play well at all with stuff like this. We are currently using ReSharper to allow for more granularity with formatting, but it sadly falls short in many areas.
Don't get me wrong though, coding standards are great. The goal of the pretty printer as part of the build process is to get 'perfect' looking code no matter how well people are paying attention or counting their spaces.
The edge cases around code formatting are very solvable since it is a well defined grammar.
As far as the VS defaults go I can only say: BSD style or die!
So all that brings me full circle back to: Is there a configurable pretty printer for C#? As much as lexical analysis and parsing fascinate me, I have about had my fill making a YAML C# tool chain.
Your issue was the primary intent for creating NArrange (beta). It allows configurable reformatting of C# code and you can use one common configuration file to be shared by the entire team. Since its focus is primarily on reordering members in classes and controlling regions, it is still lacking many necessary formatting options (especially formatting within member code lines).
The normal usage scenario is for each developer to run the tool prior to check-in. I'm not aware of anyone running it as part of their build process, but there is no reason why you couldn't, since it is a command-line tool. One idea that I've contemplated is running NArrange on files as part of a pre-commit step. If the original file contents being checked in don't match the NArrange-formatted output on the source repository server, then the developer didn't reformat to the rules and a check-in error can be raised.
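The pre-commit comparison described above could be sketched like this in Python. The formatter is passed in as a callable because how to invoke the real tool is not shown here; format_source and the stand-in strip_trailing are hypothetical and exist only to make the example runnable.

```python
def check_formatting(files, format_source):
    """Return the names of files whose content differs from its formatted form.

    files: mapping of file name -> source text being checked in.
    format_source: hypothetical callable wrapping whatever formatter is used.
    """
    return [name for name, text in files.items() if format_source(text) != text]

# Stand-in formatter for the demonstration only: strips trailing whitespace.
def strip_trailing(text):
    return "\n".join(line.rstrip() for line in text.split("\n"))

offenders = check_formatting(
    {"Good.cs": "class A\n{\n}\n", "Bad.cs": "class B   \n{\n}\n"},
    strip_trailing,
)
if offenders:
    print("Reformat before check-in:", ", ".join(offenders))
```

A hook built on this would reject the check-in whenever the list is non-empty, which leaves the reformatting itself to the developer rather than rewriting files on the server.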
For more information, see my CodeProject article on Using NArrange to Organize C# Code.
Update 2023-02-25: NArrange appears to have moved to GitHub. The NArrange site (referenced above) is no longer available, although there are copies on web.archive.org.
I second Jarrod's answer. If you have 2 developers with conflicting coding preferences, then get the rest of the team to vote, and then get the boss to back the majority decision.
Additionally, the problem with trying to automatically apply a pretty printer like that, is that there will always be exceptional cases where your blanket coding standard is not the best or most readable solution, and you will lose out by squashing them with an automated tool.
Coding Standards are just that, standards. They don't call them Coding Laws or Coding Rules, and there's a good reason for that.