At Promyze, our Ph.D. student, Corentin Latappy, led a research project in late 2022 involving teams from the LaBRI (Bordeaux, France), the University of Montpellier, and IMT Mines Alès (both also in France). This work resulted in a scientific paper entitled “MLinter: Learning Coding Practices from Examples—Dream or Reality?”, to be published at the SANER 2023 conference (IEEE International Conference on Software Analysis, Evolution and Reengineering). This post summarizes the work; you’ll find the full paper here to go further.
Promyze is a knowledge-sharing solution for software engineers. It promotes collaboration to create, share, and disseminate coding practices. With Promyze, developers regularly organize workshops to discuss their good and bad practices. Once they have identified practices, developers can define regular expressions that detect them automatically. The problem is that writing these regular expressions is complex, which raises the question of generating them automatically with machine learning algorithms. That is the R&D project we present here.
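To make this concrete, here is a minimal sketch of what a regex-based practice detector can look like. The rule below is a hypothetical example flagging loose equality (`==`) in JavaScript, not an actual Promyze rule:

```python
# Sketch: a hypothetical regex rule flagging a discouraged pattern
# (loose equality "==" in JavaScript), applied line by line.
import re

RULE = re.compile(r"[^=!<>]==[^=]")  # matches "==" but not "===", "!==", ">=", ...

def find_violations(source: str):
    """Yield (line_number, line) for every line matching the rule."""
    for number, line in enumerate(source.splitlines(), start=1):
        if RULE.search(line):
            yield number, line

for n, line in find_violations("if (a == b) {\nif (a === b) {"):
    print(n, line)  # only line 1 is flagged
```

Writing and maintaining such expressions for every practice quickly becomes tedious, hence the appeal of learning detectors from examples instead.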
In Promyze, a practice has positive and negative examples (the Do/Don’t model). The idea of our R&D project is to train a model for each practice based on its examples, hoping that this model will then automatically detect similar examples (positive and negative) when analyzing code later.
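Concretely, this amounts to training one binary classifier per practice on its labeled example lines. The sketch below shows what this could look like with CodeBERT, the model we focused on in this work; the example lines and hyperparameters are illustrative, not the exact setup of the paper:

```python
# Sketch: fine-tune a CodeBERT classifier for ONE practice from its labeled
# example lines. Data and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # 0 = Don't, 1 = Do
)

# Positive and negative example lines of the practice (Do / Don't).
examples = [("if (x === undefined) {", 1), ("if (x == undefined) {", 0)]

encodings = tokenizer([code for code, _ in examples],
                      truncation=True, padding=True, return_tensors="pt")
labels = torch.tensor([label for _, label in examples])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few epochs over the small example set
    optimizer.zero_grad()
    out = model(**encodings, labels=labels)
    out.loss.backward()
    optimizer.step()
```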
Here is an example of a coding practice in Promyze:
Our R&D project addresses two research questions.
RQ1: How many examples are needed to learn a practice? In Promyze, developers define practices with only a few examples.
RQ2: What are the best code examples to learn a practice? Should we provide only code examples of the practice, or should we add unrelated code to train the models properly? Should positive or negative examples be preferred?
The training sets. We defined two dimensions, each with three possible values, to build the training sets.
Combining these two dimensions yields nine possible configurations (3×3). We randomly selected lines to build a training set for each configuration.
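As a sketch, building one training set per configuration could look like the following; the dimension names and values below are assumptions for illustration (the exact ones are detailed in the paper):

```python
# Sketch: one training set per (size, composition) configuration.
# Both dimensions and their values are illustrative assumptions.
import random

SIZES = [10, 100, 1000]                           # assumed size dimension
COMPOSITIONS = ["positive", "negative", "mixed"]  # assumed composition dimension

def build_train_set(lines_by_label, size, composition, seed=42):
    """Randomly select `size` labeled lines respecting the composition."""
    rng = random.Random(seed)
    if composition == "mixed":
        pool = lines_by_label["positive"] + lines_by_label["negative"]
    else:
        pool = lines_by_label[composition]
    return rng.sample(pool, min(size, len(pool)))

# Nine (3 x 3) configurations in total.
configs = [(s, c) for s in SIZES for c in COMPOSITIONS]
```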
Measuring model efficiency. From these training sets, we then used two validation methods.
For each method, we ensured that no selected line was already present in the training set.
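A minimal sketch of that check, assuming lines are compared verbatim:

```python
# Sketch: select validation lines while guaranteeing none of them appeared
# in the training set (verbatim comparison is an assumption for illustration).
def select_validation_lines(train_lines, candidate_lines, size):
    seen = set(train_lines)
    return [line for line in candidate_lines if line not in seen][:size]
```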
Our first result was expected: the larger the training set, the better the results. This is logical, as ML algorithms need a lot of data to be efficient.
Our second result is less intuitive. Even with large training sets (1,000 examples), a significant number of false positives remains. Even though the learned models have good detection scores (above 95%, sometimes even 99%), they still make errors, and when used in real conditions, analyzing a project with several thousand lines, they will report many spurious violations. This bias is unfortunately well known.
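To see why a seemingly high score is not enough, here is a back-of-the-envelope computation; the project size below is a hypothetical figure for illustration:

```python
# Illustration: even a 99%-accurate line classifier produces many false
# reports on a realistically sized project. The line count is hypothetical.
lines_analyzed = 5_000
accuracy = 0.99
expected_errors = lines_analyzed * (1 - accuracy)
print(expected_errors)  # -> ~50 lines wrongly flagged
```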
In conclusion, as it stands, the approach is not yet ready to be integrated into Promyze: the size of the available datasets is limited, and the error rate still needs to improve.
So far, we have focused on CodeBERT. We want to explore other ML techniques, such as anomaly detection models.
We’ll keep working to turn our ambition from a dream into a reality. Meanwhile, Promyze already offers a real solution to share your best coding practices. Want to give it a try?
Promyze, the collaborative platform dedicated to improving developers’ skills through the definition and sharing of best practices.
Crafted from Bordeaux, France.