Machine Learning for Source-code Plagiarism Detection

Jitendra Yasaswi Bharadwaj Katta

Abstract

This thesis presents a set of machine learning and deep learning approaches for building source-code plagiarism detection systems. Plagiarism detection can be treated as assessing the amount of similarity between given entities, such as text documents or source-code, and can be formulated as a fine-grained pattern classification problem. The detection process begins by transforming each entity into a feature representation. These features represent their corresponding entities in a discriminative high-dimensional space in which similarity can be measured. Here, an entity is a solution to a programming assignment in a typical computer science course. The quality of the features determines the quality of detection.
As our first contribution, we propose a machine learning based approach for plagiarism detection in programming assignments using source-code metrics. Most well-known plagiarism detectors either employ a text-based approach or use features based on properties of the program at the syntactic level. However, both approaches succumb to code obfuscation, which is a major obstacle for automatic software plagiarism detection. Our proposed method uses source-code metrics as features, which are extracted from the intermediate representation of a program in a compiler infrastructure such as gcc. We demonstrate the use of unsupervised and supervised learning techniques on the extracted feature representations and show that our system is robust to code obfuscation. We validate our method on assignments from an introductory programming course. The preliminary results show that our system performs better than other popular tools such as MOSS.
To visualize the local and global structure of the features, we obtain low-dimensional representations of them using t-SNE, a variation of Stochastic Neighbor Embedding that can preserve neighborhood identity in low dimensions. Based on this idea of preserving neighborhood identity, we mine interesting information such as the diversity in student solution approaches to a given problem. The presence of well-defined clusters in the low-dimensional visualizations demonstrates that our features are capable of capturing interesting programming patterns.
As our second contribution, we demonstrate how deep neural networks can be employed to learn features for source-code plagiarism detection. We employ a character-level Recurrent Neural Network (char-RNN), a character-level language model, to map the characters in source-code to continuous-valued vectors called embeddings. We use these program embeddings as deep features for plagiarism detection in programming assignments. Many popular plagiarism detection tools are based on n-gram techniques at the syntactic level. However, these approaches fail to capture long-term dependencies (non-contiguous interactions) present in the source-code. In contrast, the proposed deep features capture non-contiguous interaction within n-grams. These features are generic in nature, and there is no need to fine-tune the char-RNN model again on program submissions from each individual problem set. Our experiments show the effectiveness of deep features in the task of classifying assignment program submissions as copy, partial-copy and non-copy.
As our final contribution, we demonstrate how to extract local deep features from source-code. We represent programs using local deep features and develop a framework to retrieve suspicious plagiarized cases for a given query program. Such representations are useful for identifying near-duplicate program pairs, where only a part of the program, such as certain lines or blocks of code, is copied. In such cases, obtaining local feature representations for a program is more useful than representing a program with a single global feature. We develop a retrieval framework using a Bag of Words (BoW) approach to retrieve suspected plagiarized and partially plagiarized (near-duplicate) cases for a given query program.
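As a rough illustration of Bag of Words retrieval over local features, the sketch below quantizes each local feature vector to its nearest codeword, builds a histogram per program, and ranks candidate programs against a query. It is a minimal sketch under stated assumptions, not the thesis implementation: the two-dimensional vectors, the fixed three-entry codebook, the use of cosine similarity for ranking, and all function and variable names are hypothetical choices made for this example.

```python
import math
from collections import Counter


def nearest_codeword(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )


def bow_histogram(local_features, codebook):
    """Count how many local features are assigned to each codeword."""
    counts = Counter(nearest_codeword(v, codebook) for v in local_features)
    return [counts.get(i, 0) for i in range(len(codebook))]


def cosine(u, v):
    """Cosine similarity between two histograms (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_features, corpus, codebook):
    """Rank corpus programs by BoW similarity to the query, most similar first."""
    q = bow_histogram(query_features, codebook)
    scored = [
        (name, cosine(q, bow_histogram(feats, codebook)))
        for name, feats in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): each program is a list of 2-D local feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
corpus = {
    "copy": [[0.9, 0.1], [1.1, 0.0], [0.1, 0.9]],
    "partial_copy": [[1.0, 0.1], [0.0, 0.1], [0.1, 0.0]],
    "non_copy": [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]],
}
query = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0]]

ranking = retrieve(query, corpus, codebook)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy setup a program that shares only some codewords with the query still receives a non-zero score, which is the property that makes histogram-style local representations attractive for near-duplicate cases. In practice the codebook would typically be learned from data, for example by clustering local features, rather than fixed by hand.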

Year of completion:	July 2018
Advisor :	Prof. C V Jawahar and Suresh Purini

Related Publications

Jitendra Yasaswi, Suresh Purini and C. V. Jawahar - Plagiarism Detection in Programming Assignments Using Deep Features, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017.
Jitendra Yasaswi Bharadwaj Katta, Srikailash G, Anil Chilupuri, Suresh Purini and C. V. Jawahar - Unsupervised Learning Based Approach for Plagiarism Detection in Programming Assignments, ISEC 2017.