SELMA Logo

SELMA

A Semantic Local Code Search Platform

Abstract

Finding the right code snippet is not a trivial task. Online platforms such as GitHub.com or searchcode.com provide search tools, but they are limited to publicly available, internet-hosted code. During the development of research prototypes or confidential tools, however, it is preferable to store source code locally, which makes external code search tools impractical. Here, we present SELMA: a local code search platform that enables term-based and semantic retrieval of source code. SELMA searches source code and comments, annotates undocumented code to enable term-based search in natural language, and trains locally deployed neural models for code retrieval.

On this website you can find the link to our demo video, model, and source code.

Hugging Face Model

Our CodeColBERT model can be used directly or downloaded from the Hugging Face Model Hub: ddrg/codecolbert.

Demonstration Videos

Videos of our system can be found here.

Source Code

Our source code is online here.

System

SELMA Code Search
Searching on the SELMA front page.

SELMA supports searching code from Git repositories. During system setup, the user needs to add one or several repositories to an index once; a single index can contain multiple repositories. When adding a new repository, the user is asked whether to add it to an existing index or to create a new one. Once the Git repository is cloned, the index is built. In the screenshot above, two indexes exist that can be selected. Moreover, SELMA includes Code Expansion, the generation of documentation for code snippets, which facilitates natural-language search.
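Conceptually, the indexing step amounts to building an inverted index over the files of the cloned repository. The following is a minimal sketch in Python, not SELMA's actual implementation; the file extensions and the identifier-based tokenization are illustrative assumptions:

```python
import re
from collections import defaultdict
from pathlib import Path

TOKEN = re.compile(r"[A-Za-z_]\w*")

def tokenize(text):
    # Split source text into lowercased identifier-like terms.
    return [t.lower() for t in TOKEN.findall(text)]

def build_index(repo_dir, extensions=(".py", ".java")):
    # Map each term to the set of files it occurs in.
    index = defaultdict(set)
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            for term in tokenize(path.read_text(errors="ignore")):
                index[term].add(str(path))
    return index
```

A term-based search then reduces to looking up the query terms in this mapping and ranking the files that contain them.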

Expanded Documentation and Source Code
Workflow from adding a repository to searching it.

With SELMA you can choose between two index types: BM25 and CodeColBERT. BM25 offers stable results but only performs string-matching search. CodeColBERT is based on powerful Transformer encoder models that are trained to "understand" the user's query and find matching code snippets, so it delivers better results than BM25. We therefore recommend using the CodeColBERT index. However, it requires at least one GPU to run, while BM25 does not. If the server on which SELMA is deployed has no access to a CUDA-capable GPU, we suggest using the BM25 index.
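BM25's string-matching nature can be seen in a minimal sketch of the Okapi BM25 scoring formula. This is an illustration only, not SELMA's index implementation; documents are given as pre-tokenized term lists:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs: list of token lists; returns one BM25 score per document.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)                 # term frequency in this document
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue                # no string match -> no contribution
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

code = [["def", "mean", "of", "values"], ["def", "sort", "items"]]
bm25_scores(["mean"], code)     # only the first snippet scores > 0
bm25_scores(["average"], code)  # no lexical match at all: both score 0
```

The last line shows the limitation: a semantically equivalent query term ("average" instead of "mean") finds nothing, which is exactly what Code Expansion and CodeColBERT address.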

To enable semantic code search even when using BM25, SELMA offers the Code Expansion feature. Using two Transformer-based models, SELMA generates documentation for each code snippet. The code is thus annotated with natural-language terms, which helps the BM25 algorithm find more relevant results for natural-language queries. We apply two different models to increase the diversity of generated terms, since we found that the output of the two models overlaps by only 36%–50%; we therefore recommend using both models for code expansion. However, each model can be disabled individually, and the entire feature can be activated or deactivated during setup. Keep in mind that generating documentation strings can take some time. The following image shows the impact of the code expansion feature; the documentation generated by the models is depicted in blue. This way, the user can also find this code snippet when searching for "mean" even though this word occurs neither in the code nor in its original documentation.
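The 36%–50% overlap between the two models' outputs could be quantified, for example, as the Jaccard similarity between the sets of terms they generate for the same snippet; the exact overlap measure and the example terms below are assumptions for illustration:

```python
def term_overlap(terms_a, terms_b):
    # Jaccard similarity: shared distinct terms / all distinct terms.
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical documentation terms generated for the same snippet:
model_a = ["compute", "mean", "of", "values"]
model_b = ["calculate", "mean", "of", "numbers"]
term_overlap(model_a, model_b)  # 2 shared / 6 distinct terms = 1/3
```

A low overlap means the two models contribute largely complementary vocabulary, which is why using both enlarges the set of natural-language terms a BM25 query can match.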

Expanded Documentation and Source Code
Expanded (generated) documentation and its source code, BM25 searches both, CodeColBERT searches source code.

The speed at which the system delivers results depends heavily on several factors: the hardware configuration of the server, the size of the index (which corresponds to the size of the code repositories), and the specific type of index employed. In general, BM25 provides a swift search experience. The CodeColBERT index, on the other hand, is approximately 10 times slower, but it remains usable in real-world scenarios, as demonstrated in our videos. As previously mentioned, if a GPU is not available, we recommend opting for BM25. Moreover, when dealing with an extensive volume of code, CodeColBERT might exhibit slower performance. In such cases, we suggest starting with CodeColBERT and assessing its retrieval speed; if it fails to meet your expectations, switching to BM25 is advantageous due to its faster retrieval times.

Recommended Settings

The following table summarizes which index should be used in which case.

                 BM25          BM25 with Code Expansion   CodeColBERT
Results          -             +                          ++
GPU Requirement  Not Required  Not Required               Required
Speed            ++            ++                         +

Comparison to State-of-the-Art Systems

To evaluate our retrieval system, we compare our setup with other systems on the CodeSearchNet Java benchmark. The benchmark uses 90 natural language queries and evaluates systems with nDCG and nDCG (within). The latter ignores documents that were not judged, while for standard nDCG unjudged documents count as not relevant.

Model                                nDCG    nDCG (within)
neuralbowhybrid-2020 (best on CSN)   --      0.425
Baseline NBoW-NBoW                   0.500   0.355
Baseline 1D-CNN-1D-CNN               0.407   0.189
Baseline biRNN-biRNN                 0.165   0.056
Baseline SelfAtt-SelfAtt             0.431   0.233
Baseline SelfAtt-NBoW                0.514   0.340
Baseline ElasticSearch               0.257   0.190
BM25 (ours)                          0.408   0.205
BM25+CodeExpansion (ours)            0.466   0.247
CodeColBERT (ours)                   0.507   0.202
CodeColBERT+CodeExpansion (ours)     0.507   0.207
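To illustrate how the two metrics in the table differ, here is a minimal sketch of nDCG versus nDCG (within). The document names and relevance judgments are hypothetical, and this is a simplified version of the metric, not the benchmark's evaluation script:

```python
import math

def dcg(gains):
    # Discounted cumulative gain over a ranked list of relevance values.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranking, judgments, within=False):
    if within:
        # nDCG (within): drop documents that were never judged.
        ranking = [doc for doc in ranking if doc in judgments]
    # Plain nDCG: unjudged documents count as relevance 0.
    gains = [judgments.get(doc, 0.0) for doc in ranking]
    ideal = dcg(sorted(judgments.values(), reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Hypothetical query: "a" is highly relevant, "b" mildly, "x" unjudged.
judgments = {"a": 3.0, "b": 1.0}
ranking = ["x", "a", "b"]
ndcg(ranking, judgments)               # < 1.0: "x" counts as irrelevant
ndcg(ranking, judgments, within=True)  # 1.0: judged docs are in ideal order
```

This explains why the two columns in the table can diverge: a system can rank its judged documents well (high nDCG within) while also retrieving many unjudged documents that drag down plain nDCG, or vice versa.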