Large-scale and Robust Code Authorship Identification with Deep Feature Learning

Abstract

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is challenging, because software code comes in various formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples: a certain number of samples per author and a specific size per sample. To this end, this work proposes a deep learning-based approach to software authorship attribution that enables large-scale, format-independent, language-oblivious, and obfuscation-resilient authorship identification. The approach learns deep authorship attributes with a recurrent neural network and uses an ensemble random forest classifier to de-anonymize programmers at scale. Comprehensive experiments evaluate the approach over the entire Google Code Jam (GCJ) dataset across all years (2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results show high accuracy while requiring fewer samples per author. On source code, the approach identifies 8,903 GCJ authors, the largest-scale dataset used so far, with an accuracy of 92.3%. On the real-world dataset, it achieves an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the approach is resilient to language specifics: it identifies authors across four programming languages (C, C++, Java, and Python) and authors writing in mixed languages (e.g., Java/C++, Python/C++). It also withstands sophisticated obfuscation (e.g., using Tigress for C), with an accuracy of 93.42% on a set of 120 authors. On executable binaries, the approach achieves 95.74% accuracy in identifying 1,500 programmers of software binaries, with similar results when binaries are generated with different compilation options or optimization levels, or with symbol information stripped. Finally, it achieves 93.86% accuracy in identifying 1,500 programmers of binaries obfuscated with all features of the Obfuscator-LLVM tool.
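
To make the pipeline concrete, here is a minimal sketch of the architecture the abstract describes: a recurrent network that learns fixed-size authorship features from code-token sequences, followed by a random forest ensemble trained on those features for scalable author identification. This is an illustrative sketch under stated assumptions (PyTorch and scikit-learn, toy random data, arbitrary layer sizes), not the authors' implementation.

import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class AuthorshipFeatureRNN(nn.Module):
    # GRU encoder mapping a code-token sequence to a fixed-size feature vector.
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_authors=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_authors)  # used only for pre-training

    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))  # h: (1, batch, hidden_dim)
        return h.squeeze(0)                     # deep authorship features

# Hypothetical toy data: token-id sequences (e.g., lexed source files) and author labels.
vocab_size, seq_len, n_samples, n_authors = 500, 40, 64, 10
X = torch.randint(0, vocab_size, (n_samples, seq_len))
y = torch.randint(0, n_authors, (n_samples,))

model = AuthorshipFeatureRNN(vocab_size, num_authors=n_authors)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):  # brief supervised pre-training of the feature encoder
    opt.zero_grad()
    loss = loss_fn(model.head(model(X)), y)
    loss.backward()
    opt.step()

# Freeze the encoder and train a random forest on the learned features.
with torch.no_grad():
    feats = model(X).numpy()
clf = RandomForestClassifier(n_estimators=100).fit(feats, y.numpy())
print("training accuracy:", clf.score(feats, y.numpy()))

The split mirrors the motivation stated in the abstract: the recurrent encoder captures stylistic regularities in the token stream, while the forest ensemble scales classification to thousands of authors more gracefully than a single softmax layer.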

Author and article information

Journal: ACM Transactions on Privacy and Security (ACM Trans. Priv. Secur.)
Publisher: Association for Computing Machinery (ACM)
ISSN: 2471-2566 (print); 2471-2574 (electronic)
Published online: July 19, 2021; issue date: November 30, 2021
Volume 24, Issue 4, pp. 1-35
Affiliations:
[1] Loyola University Chicago
[2] Sungkyunkwan University
[3] University of Central Florida
[4] Ewha Womans University
DOI: 10.1145/3461666
© 2021
