Code Similarity – Programming Systems Laboratory

Dynamic Code Similarity: This is a multidisciplinary project, joint with Profs. Simha Sethumadhavan and Tony Jebara. “Code clones” are statically similar code fragments that usually arise from copy/paste or from independently writing lookalike code; best practice removes clones (refactoring) or tracks them (e.g., to ensure bugs fixed in one clone are also fixed in the others). This part of the project instead studies dynamically similar code under two different similarity models. One model is functional similarity, which finds code fragments that exhibit similar input/output behavior during execution. Our other dynamic similarity model is the novel notion of behavioral similarity, which we call “code relatives”. Two or more code fragments are deemed code relatives if their executions are similar. We model this as finding similarities among the dynamic data dependency graphs representing instruction-level execution traces. We used machine learning techniques to devise a (relatively) fast inexact subgraph isomorphism algorithm to cluster these execution-level similarities. Our experiments show that both of our tools find most of the same “similar” code as the best static code clone detectors, and also find many similar fragments that static detectors cannot, because the code looks very different even though it is functionally and/or behaviorally similar. However, dynamic detection will not necessarily find all static code clones, because lookalike code involving polymorphism need not exhibit the same function or behavior. Our behavioral and functional similarity detectors do not always find the same similarities, because two or more code fragments may compute the same function using very different algorithms. Thus these kinds of techniques complement each other.
Beyond the conventional applications of static code clone detection, dynamic similarity detection also addresses malware detection, program understanding, re-engineering legacy software to use modern APIs, and informing design of hardware accelerators and compiler optimizations.

Static Code Similarity: We also investigate static similarity detection to augment our similarity detection toolkit. This work is joint with Prof. Baishakhi Ray of the University of Virginia and Prof. Jonathan Bell of George Mason University. Unlike most other static code clone research, we look for similarities at the instruction level rather than in the source code, so our techniques can work even on obfuscated executables where no source code is available and conventional static detectors therefore cannot be applied. This situation arises for both malware and misappropriated intellectual property. We exploit the increasingly popular notion of “big code”, i.e., training on open-source repositories, using features that combine instruction-level call graph analysis and topic modeling (an NLP-based machine learning technique). We believe we can effectively deobfuscate most suspect code by finding similarities within a corpus consisting of known code and its obfuscated counterparts. Our approach handles control flow transformations and the introduction of extraneous methods, not just changes to method names.
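
To give a feel for why instruction-level features survive renaming, here is a deliberately simplified Python sketch. It swaps in a plain bag-of-opcodes cosine similarity in place of the call graph and topic modeling features described above, so it should be read as an illustration of the general idea only. The opcode sequences are made-up examples using JVM-style mnemonics, and the helper names (opcode_vector, cosine_similarity) are hypothetical.

```python
import math
from collections import Counter


def opcode_vector(instructions):
    """Count how often each opcode mnemonic occurs in a method body."""
    return Counter(instructions)


def cosine_similarity(a, b):
    """Cosine of the angle between two opcode count vectors."""
    dot = sum(a[op] * b[op] for op in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up instruction sequences (assumptions for this example).
# Method and variable names do not appear at all, so renaming them
# cannot change the feature vector.
original = opcode_vector(["iload", "iload", "iadd", "ireturn"])
# The same method with junk instructions inserted by an obfuscator.
obfuscated = opcode_vector(["iload", "iload", "nop", "iadd", "nop", "ireturn"])
# An unrelated method that shares no opcodes with the original.
unrelated = opcode_vector(["aload", "invokevirtual", "pop", "return"])

if __name__ == "__main__":
    print("original vs obfuscated:", cosine_similarity(original, obfuscated))
    print("original vs unrelated: ", cosine_similarity(original, unrelated))
```

A count-based vector like this ignores instruction order and call structure, so it would be easy to fool in practice; richer features are what make handling control flow transformations and extraneous methods plausible.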

Contact Gail Kaiser (kaiser@cs.columbia.edu)

Team Members

Faculty
Gail Kaiser

Former Graduate Students
Fang-Hsiang (“Mike”) Su
Jonathan Bell
Kenny Harvey
Apoorv Patwardhan

Links

Publications

Fang-Hsiang Su, Jonathan Bell, Kenneth Harvey, Simha Sethumadhavan, Gail Kaiser and Tony Jebara. Code Relatives: Detecting Similarly Behaving Software. 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), November 2016. Artifact accepted as platinum.

Fang-Hsiang Su, Jonathan Bell, Gail Kaiser and Simha Sethumadhavan. Identifying Functionally Similar Code in Complex Codebases. 24th IEEE International Conference on Program Comprehension (ICPC), May 2016, pp. 1-10. (ACM SIGSOFT Distinguished Paper Award)

Fang-Hsiang Su, Jonathan Bell, and Gail Kaiser. Challenges in Behavioral Code Clone Detection (Position Paper). 10th International Workshop on Software Clones (IWSC), affiliated with IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2016, volume 3, pp. 21-22. (People’s Choice Award for Best Position Paper)

Software

Download DyCLink from GitHub.

Download HitoshiIO from GitHub.

Download the Code Similarity Experiments toolkit from GitHub.