The software and text similarity tester SIM
SIM tests lexical similarity in natural language texts and in programs in C, Java, Pascal, Modula-2, Miranda, and Lisp. It is used
64-bit version -- compares over 4 GB (works also on 32-bit machines with software 64-bit emulator)
SIM 2.89 is available as C sources and as MSDOS binaries. (The C sources for the previous version, 2.77, are still available here.)
There is a Unix-style manual page.
The software similarity tester is very efficient and allows us to compare this year's students' work with that collected from many past years (much to the dismay of some, mostly non-CS, students). Students are told that their work is going to be compared, but some are non-believers ...
We are not afraid that students would try to tune their work to the similarity tester. We reckon if they can do that they can also do the exercise.
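To give an idea of what comparing texts at the lexical level can look like, here is a minimal sketch in Python. It is not SIM's algorithm, which is not described on this page; the token rules and the function names tokenize and longest_common_run are assumptions made for illustration only. The sketch splits two texts into tokens and reports the longest run of consecutive tokens they share, so differences in layout and white space do not matter.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens: identifiers/words, numbers,
    and single punctuation characters. White space is discarded.
    (Illustrative token rules, not those of SIM.)"""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", text.lower())

def longest_common_run(a, b):
    """Return (length, start_in_a, start_in_b) of the longest run of
    identical consecutive tokens shared by token lists a and b.
    Plain dynamic programming, O(len(a) * len(b)) time."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous token pair.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

# Hypothetical example: a new submission and an older one with different layout.
new = tokenize("for (i = 0; i < n; i++) sum += a[i];")
old = tokenize("total = 0;  for (i = 0; i < n; i++)   sum += a[i];")
length, start_new, start_old = longest_common_run(new, old)
print(length, start_new, start_old)
```

A quadratic comparison like this is fine for two short files, but it would be far too slow for checking one year's work against many past years; a practical tester needs something considerably smarter.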
Since this piece of handicraft did not qualify as research, there are no international papers on it. The work was described in Dutch in Dick Grune, Matty Huntjens, Het detecteren van kopieën bij informatica-practica, Informatie, 31, 11, Nov 1989, pp. 864-867 (lit. ref.). An English translation of the paper is also available. There is a (probably obsolete) terse technical report about the internal workings of the program.