Algorithm for Source Control System? - c#

I need to write a simple source control system and I'm wondering what algorithm I should use for computing file differences.
I don't want to look at existing source code because of license concerns: I need to license mine under the MPL, so I can't study existing systems like CVS or Mercurial, which are GPL-licensed.
Just to give some background: I only need some really simple functionality - binary files in a folder, no subfolders, and every file behaves like its own repository. No metadata except for some permissions.
Overall, really simple stuff. My single real concern is how to store only the differences of a file from revision to revision without wasting too much space, but also without being too inefficient (maybe store a full version every X changes, a bit like keyframes in video?).

Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.
"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.

Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.
Having said that, many modern revision control systems store complete files at each stage and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas is ultimately more efficient.

How about looking at the source code of Subversion? It's licensed under the Apache License 2.0.

Gene Myers has written a good paper An O(ND) Difference Algorithm and its Variations. When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.

The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"
Lots of source control systems store deltas as computed essentially by "diff", e.g., the line-oriented complement of longest common subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.
For programming language source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming languages can be found at Smart Differencer.
If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.
Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).

Though Fossil is GPL, its delta algorithm is based on rsync and is described here.

I was actually thinking about something similar to this the other day... (odd, huh?)
I don't have a great answer for you, but I did come to the conclusion that if I were to write a file diff tool, I would use an algorithm (for finding diffs) that works somewhat like the greedy matching behaviour of regexes.
As for storing diffs: if I were you, instead of storing forward-facing diffs (i.e. you start with your original file and then compute 150 diffs against it when you're working with version 151), store diffs for your history but keep your latest file as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.

Related

Fuzzy Text Matching C#

I'm writing a desktop UI (.NET WinForms) to assist a photographer in cleaning up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component that employs some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries which are actually the same word or phrase and only differ by whitespace, punctuation, or even a slight misspelling. The application will ultimately rely on the user to carry out the consolidation of phrases, but having an effective way to automatically find potential candidates will prove invaluable.
Let me introduce you to the Levenshtein distance formula. It is awesome:
http://en.wikipedia.org/wiki/Levenshtein_distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.
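For reference, here is a plain dynamic-programming implementation of Levenshtein distance in C#; nothing here is specific to any library:

    using System;

    public static class TextDistance
    {
        // Classic two-row dynamic-programming Levenshtein distance: O(m*n) time, O(n) space.
        public static int Levenshtein(string a, string b)
        {
            int[] prev = new int[b.Length + 1];
            int[] curr = new int[b.Length + 1];
            for (int j = 0; j <= b.Length; j++) prev[j] = j;

            for (int i = 1; i <= a.Length; i++)
            {
                curr[0] = i;
                for (int j = 1; j <= b.Length; j++)
                {
                    int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                    curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,   // insertion
                                                prev[j] + 1),      // deletion
                                       prev[j - 1] + cost);        // substitution
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.Length];
        }
    }

Dividing the distance by the length of the longer phrase gives a rough similarity score you can threshold when flagging consolidation candidates.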
I know this is an old question, but I feel like this answer can help people who are dealing with the same issue in current time.
Please have a look at https://github.com/JakeBayer/FuzzySharp
It is a C# NuGet package with several methods implementing different flavours of fuzzy search. Not sure, but perhaps Fosco's answer is also used in one of them.
Edit:
I just noticed a comment about this package, but I think it deserves a better place inside this question
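From memory, basic usage looks roughly like this; the Fuzz class and its Ratio/PartialRatio scores come from the FuzzySharp package, so double-check the current README before relying on the exact names:

    using FuzzySharp;  // NuGet package: FuzzySharp

    // Scores are 0-100, fuzzywuzzy-style; the threshold is up to you.
    int score = Fuzz.Ratio("mountain lake", "Mountain  Lake!");
    int partial = Fuzz.PartialRatio("lake", "mountain lake");
    bool likelyDuplicate = score > 85;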

Options for dealing with very very large strings

I have a C# program that uses a production grammar to generate 3D models of trees and flowers and similar organic entities (see the Wikipedia entry on L-systems for more info) - when I'm generating a large tree with leaves, I (expectedly) get exponential growth in the string, which would grow to hundreds of gigs if I let it (and I'd like to).
Constraints - I have to do this (sort of) in C# - the C++/native side is busy compiling and rendering the rather immense geometry that's produced.
So StringBuilder is right out --- even if it could handle it, I don't have enough memory!
I don't want to do a pure file based solution - waaaaaayyyyyyyy toooooooooooo sloooooooooooowwww!
I can't change the grammar - I realize I could compress the standard L-Systems notation, but it's a context sensitive grammar, so once you've got it working, you become positively superstitious about fiddling with it.
Things I've considered
Memory mapped files - I don't mind using P/Invoke to get to the native layer to support things, I just don't want to rewrite the whole production system in C++ - but I haven't found much in the way of handy libraries for C# to access this functionality
Low level mucking about with memory management/page faulting, etc - but hey, if I did that I might as well sell it as a product - makes the slow pure file based solution not look like such a bad idea
Anybody got any ideas here? How do I efficiently traverse/manipulate/expand multi-gig strings produced by a production grammar?
If you can upgrade to .NET 4.0, then you can use memory-mapped files without needing to P/Invoke.
http://msdn.microsoft.com/en-us/library/dd997372.aspx
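A minimal sketch of what that looks like in C# (the file name and capacity are made up; size it to whatever your production system actually needs):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Back a huge buffer with a file instead of the managed heap; the OS pages it
    // in and out as needed, so you can address far more than fits in RAM.
    long capacity = 2L * 1024 * 1024 * 1024;   // 2 GB for the sketch
    using (var mmf = MemoryMappedFile.CreateFromFile(
               "lsystem.bin", FileMode.Create, null, capacity))
    using (var accessor = mmf.CreateViewAccessor())
    {
        accessor.Write(0, (byte)'F');        // write symbols at arbitrary offsets
        byte first = accessor.ReadByte(0);   // read them back without loading the whole file
    }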
You're quite right that the typical approach to compression involves the notion of a pre-existing plaintext. What I'm talking about here is something like the idea of using a trie data structure as opposed to a dictionary. It's not just about passively compressing, but rather using an inherently more compact representation that encodes the redundancies implicitly. If you're at the 100G mark today, you're within an order of magnitude of bursting past the limits of affordable hard drives, so you might benefit from rethinking the solution.
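One way to read that idea for an L-system specifically (this is my own sketch, not an existing library): never materialize the expanded string at all, but keep the derivation as a DAG where repeated sub-expansions share a single node, and stream characters out of it lazily.

    using System.Collections.Generic;

    // Each node is either a terminal symbol or a sequence of references to other
    // nodes. Because repeated sub-derivations reference the same node, the stored
    // structure stays small even when the fully expanded string would be enormous.
    public class SymbolNode
    {
        public char? Terminal;                // set for leaves
        public List<SymbolNode> Children;     // set for expansions

        public IEnumerable<char> Expand()
        {
            if (Terminal.HasValue)
            {
                yield return Terminal.Value;
                yield break;
            }
            foreach (var child in Children)
                foreach (var c in child.Expand())   // lazy: nothing is ever concatenated
                    yield return c;
        }
    }

The geometry pass then consumes Expand() as a stream instead of indexing into a giant string.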
If this is only for your development machines then a "back to the future" solution might be a RAM Disk, aka RAM Drive.
A RAM disk or RAM drive is a block of RAM (primary storage or volatile memory) that a computer's software is treating as if the memory were a disk drive (secondary storage).
One product for example. Search for RAM Disk or RAM drive and you'll get a cornucopia of choices.

Pattern (regex) based searching systems

I'm looking for a way to search through terabytes of data for patterns matching regexes. The implementation does need to support a lot of the finer capabilities of regexes, such as beginning and end of line data, full TR1 support (preferably with POSIX and/or PCRE support), and the like. We're effectively using this application to test policy regarding storage of potentially sensitive information.
I've looked into indexing solutions, but the majority of the commercial suites don't seem to have the finer regex capabilities we'd like (to date, they've all utterly failed at parsing the complex regexes we're using).
This is a complicated problem because of the sheer amount of data we have and the limited system resources we can dedicate to the task of scanning (not much; it's just checks on policy compliance, so there isn't much of a budget there for hardware).
I looked into Lucene but I'm a little hesitant about using index systems that aren't fully capable of dealing with our regex battery, and while searching through the entire dataset would remedy this problem, we'd have to let the servers chug along at performing these actions for a couple weeks at least.
Any suggestions?
PowerGREP can handle any regular expression and has been designed for exactly this purpose. I've found it to be extremely fast searching through large amounts of data, but I haven't tried it on the order of terabytes yet. But since there's a 30 day trial, it's worth a shot, I'd say.
It's especially powerful when it comes to searching specific parts of files. You can section the file according to your own criteria, and then apply another search only on those sections. Plus, it has very good reporting capabilities.
You might want to take a look at Apache Hadoop. Enormous sites like Yahoo and Facebook use Hadoop for a variety of things, one of them being processing multi-TB of text logs.
In the Hadoop documentation there is an example of a distributed grep that could be scaled to handle any conceivable data set size.
There is also a SequenceFileInputFilter.RegexFilter in the Hadoop API if you wanted to roll your own solution.
I can only offer a high-level answer. Building on Tim's and shadit's answers, use a two-pass approach implemented as a MapReduce algorithm on EC2 or Azure Compute. In each pass the Map could take a chunk of data with an identifier and return to Reduce the identifier if a match is found, else a null value. Scale it as wide as you need to shrink the processing time.
The grep program is highly optimized for regex searching in files, to the point where I would say you could not beat it with any general-purpose regex library. Even that would be impractically slow for searching terabytes, so I think you're out of luck on doing full regex searches.
One option might be to use an indexer as a first-pass to find likely matches, then extract some bytes on either side of each match and run a full regex match on it.
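The second pass could be as simple as this sketch (the offset comes from whatever indexer you use; ConfirmMatch and the window size are made up for illustration):

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    // Re-check a candidate hit by reading a window around its byte offset and
    // running the full regex only on that window.
    static bool ConfirmMatch(string path, long candidateOffset, Regex fullPattern, int window = 4096)
    {
        using (var fs = File.OpenRead(path))
        {
            long start = Math.Max(0, candidateOffset - window);
            fs.Seek(start, SeekOrigin.Begin);
            var buffer = new byte[window * 2];
            int read = fs.Read(buffer, 0, buffer.Length);
            string text = Encoding.UTF8.GetString(buffer, 0, read);
            return fullPattern.IsMatch(text);
        }
    }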
disclaimer: i am not a search expert.
if you really need all the generality of regexps then there's going to be nothing better than trawling through all the data (but see comments below on speeding that up).
however, i would guess that is not really the case. so the first thing to do is see if you can use an index to identify possible documents. if, for example, you know that all your matches will include a word (any word) then you can index the words, use that to find the (hopefully small) set of documents that include that word, and then use grep or equivalent only on those files.
so, for example, maybe you need to find documents that have "FoObAr" at the start of the line. you would start with a caseless index to identify files that have "foobar" anywhere, and then grep (only) those for "^FoObAr".
next, how to grep as quickly as possible. you're likely going to be limited by io speed. so look at using several disks (there may be no need to use raid - you could just have one thread per disk). also, consider compression. you don't need random access to these files, and if they are text (i assume they are if you are grepping them) then they will compress nicely. that will reduce the amount of data you need to read (and store).
finally, note that if your index doesn't work for ALL queries, then it's probably not worth using. you can "grep" for all expressions in a single pass, and the expensive process is reading the data, not the details of the grep, so even if there is "just one" query that cannot be indexed, and you therefore need to scan everything, then building and using an index is probably not a good use of your time.
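On the compression point above, .NET can regex straight through compressed files without ever writing the decompressed text anywhere; a rough sketch (GrepGzip is just an illustrative name):

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Text.RegularExpressions;

    // Stream a gzip-compressed file line by line and yield the lines that match.
    static IEnumerable<string> GrepGzip(string gzPath, Regex pattern)
    {
        using (var file = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                if (pattern.IsMatch(line))
                    yield return line;
        }
    }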

Programmatically checking code complexity, possibly via c#?

I'm interested in data mining projects, and have always wanted to create a classification algorithm that would determine which specific check-ins need code-reviews, and which may not.
I've developed many heuristics for my algorithm, although I've yet to figure out the killer...
How can I programmatically check the computational complexity of a chunk of code?
Furthermore, and even more interesting: how could I use not just the code but the diff that the source control repository provides to obtain better data there?
I.e.: if I add complexity to the code I'm checking in, but it reduces complexity in the code that is left, shouldn't that be considered 'good' code?
Interested in your thoughts on this.
UPDATE
Apparently I wasn't clear. I want this:
double codeValue = CodeChecker.CheckCode(someCodeFile);
I want a number to come out based on how good the code was. I'll start with numbers like VS2008 gives when you calculate complexity, but would like to move to further heuristics.
Anyone have any ideas? It would be much appreciated!
Have you taken a look at NDepend? This tool can be used to calculate code complexity and supports a query language through which you can get an incredible amount of data on your application.
The NDepend web site contains a list of definitions of various metrics. Deciding which are most important in your environment is largely up to you.
NDepend also has a command line version that can be integrated into your build process.
Also, Microsoft's Code Analysis (ships with VS Team Suite) includes metrics which check the cyclomatic complexity of code, and raises a build error (or warning) if this number is over a certain threshold.
I don't know offhand, but it may be worth checking whether this threshold is configurable to your requirements. You could then modify your build process to run code analysis any time something is checked in.
See the Semantic Designs C# Metrics Tool for a tool that computes a variety of standard metric values, both over complete files and over all reasonable subdivisions (methods, classes, ...).
The output is an XML document, but extracting the value(s) you want from that should be trivial with an XML reader.
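If all you want to start with is the double codeValue = CodeChecker.CheckCode(...) shape from the question, a crude first cut is just counting decision points as a cyclomatic-complexity proxy. This is nothing like what NDepend or the Semantic Designs tool actually do, and CheckCode plus the weighting are invented here purely for the sketch:

    using System.IO;
    using System.Text.RegularExpressions;

    public static class CodeChecker
    {
        // Very naive proxy for cyclomatic complexity: count branch keywords and
        // boolean operators, normalized by line count. Real tools parse the code.
        static readonly Regex Decisions =
            new Regex(@"\b(if|while|for|foreach|case|catch)\b|&&|\|\|", RegexOptions.Compiled);

        public static double CheckCode(string codeFilePath)
        {
            string code = File.ReadAllText(codeFilePath);
            int decisionPoints = Decisions.Matches(code).Count;
            int lines = code.Split('\n').Length;
            return (decisionPoints + 1) / (double)lines;   // higher = denser branching
        }
    }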

How do text differencing applications work?

How do applications like DiffMerge detect differences in text files, and how do they determine when a line is genuinely new rather than simply appearing on a different line than in the file being compared against?
Is this something that is fairly easy to implement? Are there already libraries to do this?
Here's the paper that served as the basis for the UNIX command-line tool diff.
That's a complex question. Performing a diff means finding the minimum edit distance between the two files. That is, the minimum number of changes you must make to transform one file into the other. This is equivalent to finding the longest common subsequence of lines between the two files, and this is the basis for the various diff programs. The longest common subsequence problem is well known, and you should be able to find the dynamic programming solution on google.
The trouble with the dynamic programming approach is that it's O(n^2). It's thus very slow on large files and unusable for large, binary strings. The hard part in writing a diff program is optimizing the algorithm for your problem domain, so that you get reasonable performance (and reasonable results). The paper "An Algorithm for Differential File Comparison" by Hunt and McIlroy gives a good description of an early version of the Unix diff utility.
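To make that concrete, the dynamic-programming table for a line-based LCS looks like this in C# (illustrative only; quadratic, so not what you'd ship for big files):

    using System;

    public static class Lcs
    {
        // Length of the longest common subsequence of lines; the actual diff is
        // recovered by walking the table backwards. O(n*m) time and space, which
        // is why real tools (Hunt-McIlroy, Myers) work hard to avoid filling it.
        public static int Length(string[] a, string[] b)
        {
            int[,] dp = new int[a.Length + 1, b.Length + 1];
            for (int i = 1; i <= a.Length; i++)
                for (int j = 1; j <= b.Length; j++)
                    dp[i, j] = a[i - 1] == b[j - 1]
                        ? dp[i - 1, j - 1] + 1
                        : Math.Max(dp[i - 1, j], dp[i, j - 1]);
            return dp[a.Length, b.Length];
        }
    }

Lines of the old file that are not part of the common subsequence are reported as deletions, and lines of the new file that are not part of it as insertions.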
There are libraries. Here's one: http://code.google.com/p/google-diff-match-patch/
StackOverflow uses Beyond Compare for its diff. I believe it works by calling Beyond Compare from the command line.
It actually is pretty simple; diff programs are, most of the time, based on the Longest Common Subsequence problem, which can be solved using a graph algorithm.
This web page gives example implementations in C#.
