Learning Best Coding Practices from Examples
At Packmind, our Ph.D. student, Corentin Latappy, led a research project in late 2022 involving teams from LaBRI (Bordeaux, France), Montpellier University, and IMT Mines Ales (both also in France). This work resulted in a scientific paper entitled “MLinter: Learning Coding Practices from Examples—Dream or Reality?”, to be published at the SANER 2023 conference (IEEE International Conference on Software Analysis, Evolution and Reengineering). This post summarizes the work; you’ll find the full paper here if you want to go further.
What problem do we want to solve?
Packmind is a knowledge-sharing solution for software engineers. It promotes collaboration to create, share, and disseminate coding practices. With Packmind, developers regularly organize workshops to discuss their good and bad practices. Once they have identified a practice, they can also define regular expressions that detect it automatically. The problem is that writing these regular expressions is complex. This raises the question of whether they could be generated automatically using machine learning, and that is the R&D project we present here.
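To give a concrete idea of what such a hand-written rule looks like, here is a minimal sketch in Python (the rule, helper, and regex are hypothetical illustrations, not an actual Packmind rule) that flags loose equality in JavaScript lines:

```python
import re

# Hypothetical, deliberately simplistic rule: flag loose equality (== / !=)
# in JavaScript, while letting strict equality (=== / !==) pass.
# Illustrates the kind of regex developers currently have to write by hand.
LOOSE_EQUALITY = re.compile(r"[^=!<>]==(?!=)|!=(?!=)")

def find_violations(source: str):
    """Return (line_number, line) pairs matched by the hand-written rule."""
    return [(i, line) for i, line in enumerate(source.splitlines(), start=1)
            if LOOSE_EQUALITY.search(line)]

print(find_violations("if (x == null) {}\nif (y === 1) {}"))
# -> [(1, 'if (x == null) {}')]
```

Even this small rule misses edge cases (for instance `==` at the very start of a line), which illustrates why writing and maintaining such regular expressions is hard.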
What is the main idea?
In Packmind, a practice has positive and negative examples (the Do/Don’t model). The idea of our R&D project is to train a model for each practice based on its examples. We then hope this model can automatically detect similar code (positive and negative) when analyzing a codebase later.
Here is an example of a best coding practice in Packmind:
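For instance, a practice such as “Use strict equality (===) in JavaScript” (the ESLint eqeqeq rule) comes with “Don’t” and “Do” examples. The sketch below shows how such a practice could be represented as data; the structure and field names are assumptions for illustration, not the actual Packmind model:

```python
# Hypothetical representation of a Packmind practice with Do/Don't examples.
# Field names are assumptions for illustration only.
practice = {
    "name": "Use strict equality (===) in JavaScript",  # ESLint rule: eqeqeq
    "dont_examples": [            # negative examples ("Don't")
        "if (value == null) { reset(); }",
    ],
    "do_examples": [              # positive examples ("Do")
        "if (value === null) { reset(); }",
    ],
}
```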
What are the research questions?
There are two research questions in our R&D project.
RQ1: How many examples are needed to learn a practice? In Packmind, developers typically define a practice with only a few examples.
RQ2: Which code examples are best for learning a practice? Should we provide only the examples attached to the practice, or should we add unrelated code to train the models properly? Should positive or negative examples be preferred?
What experiments were done to verify the hypothesis?
We used CodeBERT, a state-of-the-art ML model, to learn from code examples. CodeBERT is a model pre-trained on six programming languages: Python, Java, JavaScript, PHP, Ruby, and Go.
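As a rough idea of the setup (a sketch under our own assumptions, not the exact training code from the paper), CodeBERT can be loaded through the Hugging Face transformers library and fine-tuned as a binary classifier (violation vs. no violation) for each practice:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained CodeBERT checkpoint and add a binary classification head
# (label 0 = conforming line, label 1 = violation of the practice).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)

# Tokenize a candidate JavaScript line and get the model's prediction.
inputs = tokenizer("if (value == null) { reset(); }", return_tensors="pt")
logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted label (meaningless before fine-tuning)
```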
The dataset. We focused on the JavaScript language and selected a base of 58 best practices extracted from ESLint. We analyzed 550 popular GitHub projects and found 13 million examples, covering 38 of the 58 best practices. The remaining 20 did not have enough examples, so they were discarded.
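To give an idea of how such violations can be collected (a simplified sketch, not the exact pipeline used in the paper), ESLint can be run on each project with its JSON formatter and the flagged lines grouped by rule:

```python
import json
import subprocess
from collections import defaultdict

def collect_violations(project_dir: str):
    """Run ESLint on a project and group the offending lines by rule.

    Simplified sketch: assumes ESLint is installed and configured
    in the project.
    """
    result = subprocess.run(
        ["npx", "eslint", ".", "--format", "json"],
        cwd=project_dir, capture_output=True, text=True,
    )
    violations = defaultdict(list)
    for file_report in json.loads(result.stdout):
        for message in file_report["messages"]:
            violations[message["ruleId"]].append(
                (file_report["filePath"], message["line"])
            )
    return violations
```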
The train sets. We varied two dimensions to build the training sets.
- The first dimension is the dataset size (Small = 10 examples, Medium = 100, Large = 1,000).
- The second dimension is the content of the dataset:
- 50% non-conforming code (“Don’t”) and 50% repaired code (turned into “Do”);
- 50% non-conforming code and 50% existing code (code with no violation and no fixes);
- 50% non-conforming code, 25% repaired code, and 25% existing code.
Combining these two dimensions gives nine possible configurations (3×3). We randomly selected lines to build a training set for each configuration, as sketched below.
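Here is a sketch of that construction step (simplified, with hypothetical helper names): a training set is sampled from three pools of lines according to the chosen size and mix:

```python
import random

def build_train_set(dont_lines, repaired_lines, existing_lines,
                    size=1000, mix=(0.5, 0.25, 0.25), seed=42):
    """Sample a training set of `size` labeled lines.

    `mix` gives the proportions of (non-conforming, repaired, existing) code,
    e.g. (0.5, 0.5, 0.0), (0.5, 0.0, 0.5) or (0.5, 0.25, 0.25).
    Labels: 1 = violation, 0 = conforming code. Sketch only.
    """
    rng = random.Random(seed)
    n_dont, n_rep, n_exist = (round(size * p) for p in mix)
    samples = (
        [(line, 1) for line in rng.sample(dont_lines, n_dont)] +
        [(line, 0) for line in rng.sample(repaired_lines, n_rep)] +
        [(line, 0) for line in rng.sample(existing_lines, n_exist)]
    )
    rng.shuffle(samples)
    return samples
```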
Measuring model efficiency. Once the models were trained on these configurations, we evaluated them with two validation methods:
- Balanced validation: the validation set is built with the same process and the same number of examples as the training set;
- Real validation: to simulate a realistic scenario, we use, for each best practice, 5 full source code files, each containing at least one non-conforming line.
For both methods, we ensured that no selected line also appears in the training set.
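To make the evaluation concrete, here is a minimal sketch (not the paper’s exact metrics code) of line-level precision and recall computed against the ground truth:

```python
def precision_recall(predictions, ground_truth):
    """Compute line-level precision and recall.

    `predictions` and `ground_truth` are sets of (file, line) pairs flagged
    as violations. Sketch of the evaluation idea only.
    """
    true_positives = len(predictions & ground_truth)
    precision = true_positives / len(predictions) if predictions else 1.0
    recall = true_positives / len(ground_truth) if ground_truth else 1.0
    return precision, recall
```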
What were the results of the experiments?
The first result was expected: the larger the training set, the better the results. This makes sense, since ML algorithms need a lot of data to perform well.
Our second result is less intuitive. Even with large training sets (1,000 examples), a significant number of false positives remains. Even though the learned models have good detection scores (above 95%, sometimes even 99%), they still make mistakes, and when they are used in real conditions (analyzing a project with several thousand lines), those rare mistakes add up to many false alarms. This effect is unfortunately well known.
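A quick back-of-the-envelope calculation shows why (illustrative numbers, not measurements from the paper): even a small false-positive rate turns into many false alarms on a large, mostly clean codebase:

```python
# Illustrative numbers only, not results from the paper.
lines_analyzed = 10_000        # lines scanned in a typical project
true_violations = 20           # most lines do not violate the practice
false_positive_rate = 0.01     # a model that is "99% correct" on clean lines

false_alarms = (lines_analyzed - true_violations) * false_positive_rate
print(false_alarms)  # ~100 lines wrongly reported, far more than the 20 real ones
```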
In conclusion, the approach as it stands is not yet ready to be integrated into Packmind: the datasets available for each practice are limited, and the error rate still needs to improve.
What are the next steps?
So far we have focused on CodeBERT. We now want to explore other ML techniques, such as anomaly detection models.
We’ll keep working to turn this ambition from a dream into a reality. Meanwhile, Packmind already offers a real solution to share your best coding practices. Want to give it a try?