Cnerator: A Python application for the controlled stochastic generation of standard C source code

The Big Code and Mining Software Repositories research lines analyze large amounts of source code to improve software engineering practices. Massive codebases are used to train machine learning models aimed at improving the software development process. One example is decompilation, where C code and...

Full description

Bibliographic Details
Main Authors: Francisco Ortin, Javier Escalada
Format: Article
Language:English
Published: Elsevier 2021-07-01
Series:SoftwareX
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S235271102100056X
Description
Summary:The Big Code and Mining Software Repositories research lines analyze large amounts of source code to improve software engineering practices. Massive codebases are used to train machine learning models aimed at improving the software development process. One example is decompilation, where C code and its compiled binaries can be used to train machine learning models to improve decompilation. However, obtaining massive codebases of portable C code is not an easy task, since most applications use particular libraries, operating systems, or language extensions. In this paper, we present Cnerator, a Python application that provides the stochastic generation of large amounts of standard C code. It is highly configurable, allowing the user to specify the probability distributions of each language construct, properties of the generated code, and post-processing modifications of the output programs. Cnerator has been successfully used to generate code that, utilized to train machine learning models, has improved the performance of existing decompilers. It has also been used in the implementation of an infrastructure for the automatic extraction of code patterns.
ISSN:2352-7110