How do text differencing applications work? - c#

How do applications like DiffMerge detect differences in text files, and how do they determine when a line is new, and not just on a different line than the file being checked against?
Is this something that is fairly easy to implement? Are there already libraries to do this?

Here's the paper that served as the basis for the UNIX command-line tool diff.

That's a complex question. Performing a diff means finding the minimum edit distance between the two files, that is, the minimum number of changes you must make to transform one file into the other. This is equivalent to finding the longest common subsequence of lines between the two files, and this is the basis for the various diff programs. The longest common subsequence problem is well known, and you should be able to find the dynamic programming solution on Google.
The trouble with the dynamic programming approach is that it's O(n^2). It's thus very slow on large files and unusable for large, binary strings. The hard part in writing a diff program is optimizing the algorithm for your problem domain, so that you get reasonable performance (and reasonable results). The paper "An Algorithm for Differential File Comparison" by Hunt and McIlroy gives a good description of an early version of the Unix diff utility.
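For reference, here's a minimal sketch of that classic dynamic programming approach over lines in C# (the names are just for illustration); the lines that are not part of the LCS are exactly the additions and deletions a diff would report:

    using System;

    class LcsExample
    {
        // Length of the longest common subsequence of two line arrays,
        // using the classic O(n*m) dynamic programming table.
        static int LcsLength(string[] a, string[] b)
        {
            var dp = new int[a.Length + 1, b.Length + 1];
            for (int i = 1; i <= a.Length; i++)
                for (int j = 1; j <= b.Length; j++)
                    dp[i, j] = a[i - 1] == b[j - 1]
                        ? dp[i - 1, j - 1] + 1
                        : Math.Max(dp[i - 1, j], dp[i, j - 1]);
            return dp[a.Length, b.Length];
        }

        static void Main()
        {
            var oldLines = new[] { "a", "b", "c", "d" };
            var newLines = new[] { "a", "c", "d", "e" };
            Console.WriteLine(LcsLength(oldLines, newLines)); // prints 3
        }
    }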

There are libraries. Here's one: http://code.google.com/p/google-diff-match-patch/
StackOverflow uses Beyond Compare for its diff. I believe it works by calling Beyond Compare from the command line.

It actually is pretty simple; diff programs are, most of the time, based on the Longest Common Subsequence, which can be solved using a graph algorithm.
This web page gives example implementations in C#.

Related

Schedule with Constraints

I want to schedule tasks with constraints (similar to the job shop scheduling problem) and thought I could use something like the Microsoft Solver Foundation (I need to use C#). But as far as I know you can only solve problems by finding the optimal maximum or minimum, which takes way too long. I need an approximation, so the schedule does not have to be optimal (just as good as possible) with respect to the total time, but all the constraints must be fulfilled.
Any ideas how to approach this problem?
I would suggest using the Z3 solver. It provides a C# API. Basically, it is an SMT solver, which searches for a 'good enough' solution with respect to the given constraints. It can be rather difficult to define your problem in the SMT-LIB language, though.
If that's too hard for you, look at the MiniZinc or Clingo solvers - just generate the problem formulation as a text file, run the solver as a separate process from your C# code, and parse the solution back from its output.
EDIT
If you want to minimize a length of a schedule, you can try the following approach. Let's assume that there is a schedule of length K. Is your planning problem satisfiable under this assumption? Let's call a solver to find this out! Generate several problems with different K's and run the solver iteratively. Use binary search to reduce the number of trials.
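A minimal sketch of that search in C#, assuming a hypothetical IsSchedulableWithin(k) that encodes your constraints plus "total length <= k" and asks the solver for satisfiability:

    using System;

    class MakespanSearch
    {
        // Hypothetical: build the constraint problem with "schedule length <= k"
        // and ask the solver (Z3, MiniZinc, Clingo, ...) whether it is satisfiable.
        static bool IsSchedulableWithin(int k)
        {
            throw new NotImplementedException("call your solver here");
        }

        // Smallest feasible schedule length in [lo, hi], assuming hi is known
        // to be feasible (e.g. the length of running all tasks one after another).
        static int MinimalScheduleLength(int lo, int hi)
        {
            while (lo < hi)
            {
                int mid = lo + (hi - lo) / 2;
                if (IsSchedulableWithin(mid))
                    hi = mid;       // feasible: try shorter schedules
                else
                    lo = mid + 1;   // infeasible: a longer schedule is needed
            }
            return lo;
        }
    }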

How can I use SharpNLP to detect the possibility that a line of text is a sentence?

I've written a small C# program that compiles a bunch of words into a line of text, and I want to use NLP only to give me a percentage probability that the bunch of words is a sentence. I don't need tokens or tagging; all of that can happen in the background if it needs to be done. I have OpenNLP and SharpEntropy referenced in my project, but I'm getting the error "Array dimensions exceeded supported range." when using them, so I've also tried the IKVM-built OpenNLP without SharpEntropy, but without documentation I can't seem to wrap my head around the proper steps to get only the percentage probability.
Any help or direction would be appreciated.
I'll recommend 2 relatively simple measures that might help you classify a word sequence as sentence/non-sentence. Unfortunately, I don't know how well SharpNLP will handle either. More complete toolkits exist in Java, Python, and C++ (LingPipe, Stanford CoreNLP, GATE, NLTK, OpenGRM, ...)
Language-model probability: Train a language model on sentences with start and stop tokens at the beginning/end of the sentence. Compute the probability of your target sequence per that language model. Grammatical and/or semantically sensible word sequences will score much higher than random word sequences. This approach should work with a standard n-gram model, a discriminative conditional probability model, or pretty much any other language modeling approach. But definitely start with a basic n-gram model.
Parse tree probability: Similarly, you can measure the inside probability of recovered constituency structure (e.g. via a probabilistic context free grammar parse). More grammatical sequences (i.e., more likely to be a complete sentence) will be reflected in higher inside probabilities. You will probably get better results if you normalize by the sequence length (the same may apply to a language-modeling approach as well).
I've seen preliminary (but unpublished) results on tweets that seem to indicate a bimodal distribution of normalized probabilities: tweets judged more grammatical by human annotators often fell within a higher peak, and those judged less grammatical clustered into a lower one. But I don't know how well those results would hold up in a larger or more formal study.
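To make the language-model option concrete (this is not SharpNLP-specific), here is a toy bigram model with start/stop tokens and add-one smoothing; a real system would need far more training data and better smoothing, and the class and method names are just illustrative:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class BigramModel
    {
        private readonly Dictionary<string, Dictionary<string, int>> _counts =
            new Dictionary<string, Dictionary<string, int>>();

        // Train on tokenized sentences, adding <s> and </s> boundary markers.
        public void Train(IEnumerable<string[]> sentences)
        {
            foreach (var sentence in sentences)
            {
                var tokens = new[] { "<s>" }.Concat(sentence).Concat(new[] { "</s>" }).ToArray();
                for (int i = 1; i < tokens.Length; i++)
                {
                    if (!_counts.TryGetValue(tokens[i - 1], out var next))
                        _counts[tokens[i - 1]] = next = new Dictionary<string, int>();
                    next.TryGetValue(tokens[i], out int c);
                    next[tokens[i]] = c + 1;
                }
            }
        }

        // Average log-probability per transition; higher values suggest the
        // word sequence looks more like the sentences the model was trained on.
        public double Score(string[] words)
        {
            var tokens = new[] { "<s>" }.Concat(words).Concat(new[] { "</s>" }).ToArray();
            int vocab = _counts.Count + 1;
            double logProb = 0;
            for (int i = 1; i < tokens.Length; i++)
            {
                _counts.TryGetValue(tokens[i - 1], out var next);
                int count = 0, total = 0;
                if (next != null)
                {
                    next.TryGetValue(tokens[i], out count);
                    total = next.Values.Sum();
                }
                // Add-one smoothing keeps unseen bigrams from scoring -infinity.
                logProb += Math.Log((count + 1.0) / (total + vocab));
            }
            return logProb / (tokens.Length - 1); // normalize by length
        }
    }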

Algorithm for Source Control System?

I need to write a simple source control system and wonder what algorithm I would use for file differences?
I don't want to look into existing source code due to license concerns. I need to have it licensed under the MPL, so I can't look at any of the existing systems like CVS or Mercurial, as they are all GPL-licensed.
Just to give some background: I only need some really simple functionality - binary files in a folder, no subfolders, and every file behaves like its own repository. No metadata except for some permissions.
Overall it's really simple stuff; my single real concern is how to store only the differences of a file from revision to revision without wasting too much space, but also without being too inefficient (maybe store a full version every X changes, a bit like keyframes in videos?).
Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.
"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.
Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.
Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.
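A minimal sketch of the reverse-delta storage idea in C#; the IDeltaCodec abstraction is hypothetical and stands in for whatever delta algorithm you pick (LCS-based, rsync-style, ...):

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical abstraction over the delta algorithm.
    interface IDeltaCodec
    {
        byte[] Diff(byte[] from, byte[] to);    // delta that turns 'from' into 'to'
        byte[] Apply(byte[] from, byte[] delta);
    }

    class FileHistory
    {
        // The newest version is stored whole; older versions are reconstructed
        // by walking backwards through reverse deltas.
        private byte[] _latest = new byte[0];
        private readonly List<byte[]> _reverseDeltas = new List<byte[]>();

        public void Commit(byte[] newVersion, IDeltaCodec codec)
        {
            if (_latest.Length > 0)
                _reverseDeltas.Add(codec.Diff(newVersion, _latest)); // new -> old
            _latest = newVersion;
        }

        public byte[] GetVersion(int stepsBack, IDeltaCodec codec)
        {
            byte[] current = _latest;
            // Apply the most recent reverse deltas first.
            foreach (var delta in _reverseDeltas.AsEnumerable().Reverse().Take(stepsBack))
                current = codec.Apply(current, delta);
            return current;
        }
    }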
How about looking at the source code of Subversion? It's licensed under the Apache License 2.0.
Gene Myers has written a good paper An O(ND) Difference Algorithm and its Variations. When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.
The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"
Lots of source control systems store deltas as computed essentially by "diff", e.g., the line-oriented complement of longest common subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.
For programming language source code, one can compute Levenshtein distances over program structures. A set of tools for doing essentially this for a variety of popular programming languages can be found at Smart Differencer.
If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.
Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).
Though Fossil is GPL, its delta algorithm is based on rsync and is described here.
I was actually thinking about something similar to this the other day... (odd, huh?)
I don't have a great answer for you, but I did come to the conclusion that if I were to write a file diff tool, I would do so with an algorithm (for finding diffs) that behaves somewhat like regexes do with their greediness.
As for storing diffs: if I were you, instead of storing forward-facing diffs (i.e. you start with your original file and then compute 150 diffs against it when you're working with version 151), store diffs for your history but keep your latest file stored as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.

Pattern (regex) based searching systems

I'm looking for a way to search through terabytes of data for patterns matching regexes. The implementation does need to support a lot of the finer capabilities of regexes, such as beginning and end of line data, full TR1 support (preferably with POSIX and/or PCRE support), and the like. We're effectively using this application to test policy regarding storage of potentially sensitive information.
I've looked into indexing solutions, but the majority of the commercial suites don't seem to have the finer regex capabilities we'd like (to date, they've all utterly failed at parsing the complex regexes we're using).
This is a complicated problem because of the sheer amount of data we have and the limited system resources we can dedicate to the task of scanning (not much; it's just checks on policy compliance, so there isn't much of a budget for hardware).
I looked into Lucene, but I'm a little hesitant about using index systems that aren't fully capable of dealing with our regex battery, and while searching through the entire dataset would remedy this problem, we'd have to let the servers chug along performing these actions for at least a couple of weeks.
Any suggestions?
PowerGREP can handle any regular expression and has been designed for exactly this purpose. I've found it to be extremely fast searching through large amounts of data, but I haven't tried it on the order of terabytes yet. But since there's a 30 day trial, it's worth a shot, I'd say.
It's especially powerful when it comes to searching specific parts of files. You can section the file according to your own criteria, and then apply another search only to those sections. Plus, it has very good reporting capabilities.
You might want to take a look at Apache Hadoop. Enormous sites like Yahoo and Facebook use Hadoop for a variety of things, one of them being processing multi-TB of text logs.
In the Hadoop documentation there is an example of a distributed grep that could be scaled to handle any conceivable data set size.
There is also a SequenceFileInputFilter.RegexFilter in the Hadoop API if you wanted to roll your own solution.
I can only offer a high-level answer. Building on Tim's and shadit's answers, use a two-pass approach implemented as a MapReduce algorithm on EC2 or Azure Compute. In each pass the Map could take a chunk of data with an identifier and return to Reduce the identifier if a match is found, else a null value. Scale it as wide as you need to shrink the processing time.
The grep program is highly optimized for regex searching in files, to the point where I would say you could not beat it with any general-purpose regex library. Even that would be impractically slow for searching terabytes, so I think you're out of luck on doing full regex searches.
One option might be to use an indexer as a first-pass to find likely matches, then extract some bytes on either side of each match and run a full regex match on it.
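A sketch of that two-pass idea in C#: where the candidate offsets come from is up to the indexer, but the windowing and the final Regex check could look roughly like this:

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    class TwoStageSearch
    {
        // Re-check each candidate offset (reported by some index) with the full
        // regex over a window of bytes around it, instead of re-reading the file.
        static void ConfirmMatches(string path, long[] candidateOffsets,
                                   Regex pattern, int window = 4096)
        {
            using (var stream = File.OpenRead(path))
            {
                var buffer = new byte[window * 2];
                foreach (long offset in candidateOffsets)
                {
                    stream.Seek(Math.Max(0, offset - window), SeekOrigin.Begin);
                    int read = stream.Read(buffer, 0, buffer.Length);
                    string text = Encoding.UTF8.GetString(buffer, 0, read);
                    if (pattern.IsMatch(text))
                        Console.WriteLine($"{path}: confirmed match near offset {offset}");
                }
            }
        }
    }

One caveat: anchors like ^ and $ only behave as expected if the window is cut at line boundaries (or the regex uses RegexOptions.Multiline), so the window extraction may need to snap to newlines.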
Disclaimer: I am not a search expert.
If you really need all the generality of regexps then there's going to be nothing better than trawling through all the data (but see the comments below on speeding that up).
However, I would guess that is not really the case. So the first thing to do is see if you can use an index to identify possible documents. If, for example, you know that all your matches will include a word (any word) then you can index the words, use that to find the (hopefully small) set of documents that include that word, and then use grep or equivalent only on those files.
So, for example, maybe you need to find documents that have "FoObAr" at the start of a line. You would start with a caseless index to identify files that have "foobar" anywhere, and then grep (only) those for "^FoObAr".
Next, how to grep as quickly as possible. You're likely going to be limited by I/O speed. So look at using several disks (there may be no need to use RAID - you could just have one thread per disk). Also, consider compression. You don't need random access to these files, and if they are text (I assume they are if you are grepping them) then they will compress nicely. That will reduce the amount of data you need to read (and store).
Finally, note that if your index doesn't work for ALL queries, then it's probably not worth using. You can "grep" for all expressions in a single pass, and the expensive process is reading the data, not the details of the grep, so even if there is "just one" query that cannot be indexed, and you therefore need to scan everything, then building and using an index is probably not a good use of your time.
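To make the single-pass point concrete, here is a sketch that streams each (optionally gzip-compressed) file once and tests every compiled pattern per line, so the expensive I/O is not repeated per expression:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Text.RegularExpressions;

    class SinglePassGrep
    {
        static void Scan(IEnumerable<string> paths, IList<Regex> patterns)
        {
            foreach (var path in paths)
            {
                // Decompress on the fly when the file is gzip-compressed.
                using (Stream raw = File.OpenRead(path))
                using (Stream input = path.EndsWith(".gz")
                           ? new GZipStream(raw, CompressionMode.Decompress)
                           : raw)
                using (var reader = new StreamReader(input))
                {
                    string line;
                    int lineNo = 0;
                    while ((line = reader.ReadLine()) != null)
                    {
                        lineNo++;
                        // Every pattern is applied during the same read pass.
                        for (int i = 0; i < patterns.Count; i++)
                            if (patterns[i].IsMatch(line))
                                Console.WriteLine($"{path}:{lineNo} matched pattern {i}");
                    }
                }
            }
        }
    }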

Programmatically checking code complexity, possibly via c#?

I'm interested in data mining projects, and have always wanted to create a classification algorithm that would determine which specific check-ins need code-reviews, and which may not.
I've developed many heuristics for my algorithm, although I've yet to figure out the killer...
How can I programmatically check the computational complexity of a chunk of code?
Furthermore, and even more interesting: how could I use not just the code but the diff that the source control repository provides to obtain better data there?
That is, if I add complexity to the code I'm checking in, but it reduces complexity in the code that is left, shouldn't that be considered 'good' code?
Interested in your thoughts on this.
UPDATE
Apparently I wasn't clear. I want this:
double codeValue = CodeChecker.CheckCode(someCodeFile);
I want a number to come out based on how good the code is. I'll start with numbers like the ones VS2008 gives when you calculate complexity, but I'd like to move on to further heuristics.
Anyone have any ideas? It would be much appreciated!
Have you taken a look at NDepend? This tool can be used to calculate code complexity and supports a query language by which you can get an incredible amount of data on your application.
The NDepend web site contains a list of definitions of various metrics. Deciding which are most important in your environment is largely up to you.
NDepend also has a command line version that can be integrated into your build process.
Also, Microsoft's Code Analysis (ships with VS Team Suite) includes metrics which check the cyclomatic complexity of code, and raises a build error (or warning) if this number is over a certain threshold.
I don't know offhand, but it may be worth checking whether this number is configurable to your requirements. You could then modify your build process to run code analysis any time something is checked in.
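As a very crude illustration of the CheckCode-style function asked for above (the regex keyword count is only an approximation of cyclomatic complexity; a real tool such as NDepend or Code Analysis computes it from parsed code, and this naive version will miscount keywords inside strings and comments):

    using System.IO;
    using System.Text.RegularExpressions;

    static class CodeChecker
    {
        // Rough cyclomatic-complexity-style score: 1 plus the number of
        // decision points found by keyword/operator matching.
        private static readonly Regex DecisionPoints = new Regex(
            @"\b(if|while|for|foreach|case|catch)\b|&&|\|\|",
            RegexOptions.Compiled);

        public static double CheckCode(string someCodeFile)
        {
            string code = File.ReadAllText(someCodeFile);
            return 1 + DecisionPoints.Matches(code).Count;
        }
    }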
See Semantic Designs' C# Metrics Tool for a tool that computes a variety of standard metric values, both over complete files and over all reasonable subdivisions (methods, classes, ...).
The output is an XML document, but extracting the value(s) you want from that should be trivial with an XML reader.
