Source code plagiarism detection tool that checks against online content
I asked this question on Stack Overflow and was directed to ask it here. The question is:
I wanted to ask about plagiarism detection tools for source code written in C++. The ones I found by searching Google only compare documents that you already have. What I'm looking for is a tool that compares against content on the internet (e.g., GitHub).
I think it is unlikely to be of use except for the most blatant and poorly done plagiarism. After all, I can copy code, paste it into my IDE, use Edit -> Refactor -> Change variable names, and get brand new code in no time.
@Davidmh: Surprising as it is, students keep trying to conduct such "most blatant and poorly done plagiarism".
While this isn't a real answer, what I use is Google. Take the code snippet and paste it into a Google search. You'll quickly find out whether there are any exact copies. It doesn't easily find code that has been refactored, but it does find the really lazy cheaters.
@O.R.Mapper They do indeed. There was an incident where someone copied handwritten homework word for word from another classmate. *They even copied their classmate's name at the top of the paper*. Then they both handed it in in class, one submission on top of the other (as if the TA never marks work sequentially). Stupid is as stupid does. If an intelligent plagiarizer substantially improves or obfuscates the code that they plagiarized, they probably actually managed to understand the program and you'll have a hard time proving they plagiarized.
@Davidmh: Modern plagiarism tools (such as Moss) are not fooled by changing variable names.
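To illustrate why renaming alone is a weak disguise, here is a minimal sketch of identifier normalization. It is not Moss's implementation, only one simple idea that a similarity checker could use; the `normalize` function, the `CPP_KEYWORDS` set (deliberately incomplete), and the crude regex tokenizer are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately incomplete keyword list for illustration only.
CPP_KEYWORDS = {"int", "double", "char", "void", "return", "for", "while", "if", "else"}

# Crude tokenizer: identifiers/keywords, integer literals, or any single symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT = re.compile(r"[A-Za-z_]\w*")


def normalize(source):
    """Replace every non-keyword identifier with the placeholder 'ID'."""
    tokens = []
    for tok in TOKEN.findall(source):
        if IDENT.fullmatch(tok) and tok not in CPP_KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(tok)
    return tokens


original = "int total = 0; for (int i = 0; i < n; i++) total += i;"
renamed = "int sum = 0; for (int k = 0; k < count; k++) sum += k;"

# After normalization the two snippets yield identical token sequences.
same = normalize(original) == normalize(renamed)
```

Once the names are gone, the two snippets are indistinguishable, so a checker comparing token sequences would still flag them.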
Not really meant for plagiarism, but there is Black Duck Software. It's a tool that checks against open source code, to allow companies to make sure that their employees didn't copy copy-left source code into their proprietary code base. https://www.blackducksoftware.com/ I assume it must also check for source code from the public domain or under the Apache/MIT licenses, since it is also useful for companies to know that a piece of code has been legally copied.
We use the MOSS (Measure of Software Similarity) system provided by Stanford.
I am unable to register for MOSS to download and use it, and I have not received any reply to the email I sent them.
I think MOSS only compares files within the corpus that you submit, not against code on the web. I'm not aware of anything that indexes code on the web for comparison in this way.
There are many general sites such as:
Sure, they will flag everything that "has to be copied", such as int main, but everything that "has to be copied" should be common to all of your students. It should then be fairly easy to exclude it with a simple script that removes anything shared by more than 80% of your students, preferably before submitting the code to the plagiarism checker.
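A minimal sketch of such a filtering script, assuming each submission is available as a string keyed by a student identifier. The function name `strip_common_lines` and the line-by-line comparison (ignoring leading and trailing whitespace) are my own choices for illustration, not part of any plagiarism checker.

```python
from collections import Counter


def strip_common_lines(submissions, threshold=0.8):
    """Drop lines that appear in more than `threshold` of all submissions.

    submissions: dict mapping a student id to that student's source text.
    Returns a dict with the same keys and the filtered source text.
    """
    # Count each distinct (whitespace-stripped) line once per submission.
    counts = Counter()
    for text in submissions.values():
        counts.update({line.strip() for line in text.splitlines() if line.strip()})

    limit = threshold * len(submissions)
    common = {line for line, n in counts.items() if n > limit}

    # Keep only non-blank lines that are not shared boilerplate.
    return {
        student: "\n".join(
            line
            for line in text.splitlines()
            if line.strip() and line.strip() not in common
        )
        for student, text in submissions.items()
    }


submissions = {
    "a": "int main() {\n  int x = 1;\n  return 0;\n}",
    "b": "int main() {\n  int y = 2;\n  return 0;\n}",
    "c": "int main() {\n  foo();\n  return 0;\n}",
    "d": "int main() {\n  bar();\n  return 0;\n}",
    "e": "int main() {\n  baz();\n  return 0;\n}",
}
filtered = strip_common_lines(submissions)
```

In this toy data set, the lines shared by all five students are removed and only the distinctive line of each submission remains. Comparing whole lines is crude; it will miss boilerplate that differs only in formatting.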
It would probably be easier using a website with an API such as: https://api.plagscan.com/guide
This doesn't answer the question; the OP was looking for a plagiarism detector for source code, not free-form text.