Updating Method-Level Comments using Generative AI

This study explores the feasibility of using GitDiff to automatically update method-level comments with GenAI. We leverage the information contained in a GitDiff, i.e., a patch representing the changes between two states of a file, to help the model infer which code modifications should be reflected in the updated method-level comment. As depicted in the figure below, we evaluate the following two GenAI architectures:

CodeT5 (small): A code-specific text-to-text transformer from the T5 family.
Gemma 2B IT: A decoder-only LLM released by Google.

To study how effective the Git-Diff input is at improving the quality of updated comments, we conduct four experiments, shown in the figure below:

We obtained the following average METEOR scores for CodeT5-small and Gemma 2B IT:

| Experiment | CodeT5-small | Gemma 2B IT |
| --- | --- | --- |
| New Code | 0.1127 | - |
| Old Comment + New Code (Baseline) | 0.2384 | 0.05095 |
| Old Code + Old Comment + New Code | 0.6942 | - |
| Old Code + Old Comment + Git-Diff | 0.7608 | 0.05794 |

Note: Due to computational constraints, we could not evaluate Gemma on the "New Code" and "Old Code + Old Comment + New Code" experiments. Additionally, we did not use any instruction when fine-tuning or evaluating Gemma for this task.
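
The four input configurations in the table above can be sketched as a small helper that assembles the model input for each experiment. This is only an illustration: the function name `build_model_input`, the numbering of experiments 1 to 4 in table order, and the `<OldComment>` and `<GitDiff>` tags are assumptions, and the actual training scripts may format their inputs differently.

```python
def build_model_input(experiment, old_code="", old_comment="", new_code="", git_diff=""):
    """Assemble the source text fed to the model for a given experiment.

    Assumption: experiments are numbered 1-4 in the order of the results table,
    and each field is introduced by a tag such as <OldCode>.
    """
    fields_by_experiment = {
        1: [("NewCode", new_code)],
        2: [("OldComment", old_comment), ("NewCode", new_code)],
        3: [("OldCode", old_code), ("OldComment", old_comment), ("NewCode", new_code)],
        4: [("OldCode", old_code), ("OldComment", old_comment), ("GitDiff", git_diff)],
    }
    if experiment not in fields_by_experiment:
        raise ValueError(f"Unknown experiment number: {experiment}")
    # Join the tagged fields into a single input string.
    return " ".join(f"<{tag}> {text}" for tag, text in fields_by_experiment[experiment])


# Example: the Git-Diff configuration (assumed to be experiment 4).
print(build_model_input(
    4,
    old_code="int add(int a, int b) { return a + b; }",
    old_comment="Adds two integers.",
    git_diff="-int add(int a, int b)\n+int add(int a, int b, int c)",
))
```

In every configuration the target sequence would be the updated comment; only the context given to the model changes.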

Dataset

The dataset used for our study can be found here. The dataset is already split into training, validation, and test sets. Execute the respective scripts under the Dataset directory to apply the necessary preprocessing and compute GitDiff from <OldCode> and <NewCode>.
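
As a rough illustration of how a GitDiff-style patch can be derived from <OldCode> and <NewCode>, the sketch below uses Python's standard `difflib` module. The helper name `compute_git_diff` and the sample Java snippets are hypothetical, and the scripts in the Dataset directory may compute the diff differently (for example, by calling Git directly).

```python
import difflib


def compute_git_diff(old_code: str, new_code: str, context: int = 3) -> str:
    """Return a unified diff (Git-style patch text) between two versions of a method."""
    diff_lines = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile="old",
        tofile="new",
        n=context,      # number of unchanged context lines around each change
        lineterm="",    # input lines carry no newline, so emit none either
    )
    return "\n".join(diff_lines)


# Hypothetical example: a method gains a third parameter.
old_code = "int add(int a, int b) {\n    return a + b;\n}"
new_code = "int add(int a, int b, int c) {\n    return a + b + c;\n}"
print(compute_git_diff(old_code, new_code))
```

Removed lines are prefixed with `-` and added lines with `+`, so the patch makes the code change explicit instead of requiring the model to compare two full method bodies.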

Executing CodeT5 or Gemma 2B IT

CodeT5:
1. The fine-tuning script can be found under the "Training Script T5" directory.
2. To execute the script on the cluster, run the T5.sh file.
3. To execute the script locally, use the command below:
```
python TrainingScript.py [OPTIONS]
```

Gemma 2B IT:
1. The fine-tuning script can be found under the "Training-Script-Gemma" directory.
2. To execute the script on Compute Canada's Narval cluster, run the launch_training_accelarate.sh file.
3. To execute the script locally, use the command below:

python finetuning_gemma_for_cc.py [OPTIONS]

OPTIONS:
 data_dir: path to the data directory
 experiment: experiment number. Default = 4
 max_epochs: maximum number of epochs to train Gemma. Default = 1
 batch_size: batch size to use for training, validation, and testing. Default = 4
 max_new_tokens: maximum number of tokens to be generated by Gemma. Default = 128
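
For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the actual contents of the training scripts: the `--name` flag style, the `build_parser` helper, and `data_dir` being required are guesses; only the option names and default values come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="Fine-tune a model for comment updating.")
    # Assumption: data_dir has no documented default, so it is treated as required.
    parser.add_argument("--data_dir", type=str, required=True,
                        help="path to the data directory")
    parser.add_argument("--experiment", type=int, default=4,
                        help="experiment number")
    parser.add_argument("--max_epochs", type=int, default=1,
                        help="maximum number of epochs to train")
    parser.add_argument("--batch_size", type=int, default=4,
                        help="batch size for training, validation, and testing")
    parser.add_argument("--max_new_tokens", type=int, default=128,
                        help="maximum number of tokens to generate")
    return parser


# Example: override only the batch size and keep the other defaults.
args = build_parser().parse_args(["--data_dir", "data", "--batch_size", "8"])
print(args)
```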
