Embeddings

Embeddings are a way of representing data in a vectorised format, making it easy and efficient to find similar documents.
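
As a rough illustration of how vectors make similarity search possible, the sketch below ranks a few toy documents by cosine similarity. The three-dimensional vectors and the helper names are made up for illustration, and cosine similarity is only one common measure; this page does not state which measure the index uses.

```ruby
# Cosine similarity between two equal-length numeric vectors.
# Returns a value in [-1.0, 1.0]; higher means more similar.
def cosine_similarity(a, b)
  raise ArgumentError, 'vectors must have the same length' unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
EXAMPLE_EMBEDDINGS = {
  'login fails with 500 error' => [0.9, 0.1, 0.0],
  'cannot sign in, server error' => [0.8, 0.2, 0.1],
  'update the README badge' => [0.0, 0.1, 0.9]
}.freeze

# Rank documents by similarity to the query vector, most similar first.
def most_similar(query, embeddings)
  embeddings.sort_by { |_text, vector| -cosine_similarity(query, vector) }.map(&:first)
end
```

Calling `most_similar([0.9, 0.1, 0.0], EXAMPLE_EMBEDDINGS)` puts the two login-related documents ahead of the unrelated one.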

Currently, embeddings are generated only for issues.

Architecture

Embeddings are stored in Elasticsearch, which is also used for Advanced Search.

```mermaid
graph LR
  A[database record] --> B[ActiveRecord callback]
  B --> C[build embedding reference]
  C -->|add to queue| N[queue]
  E[cron worker every minute] <-->|pull from queue| N
  E --> G[deserialize reference]
  G --> H[generate embedding]
  H <--> I[AI Gateway]
  I <--> J[Vertex API]
  H --> K[upsert document with embedding]
  K --> L[Elasticsearch]
```

The process is driven by Search::Elastic::ProcessEmbeddingBookkeepingService, which adds references to a Redis queue and pulls them from it.
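
The add and pull steps can be sketched with an in-memory stand-in for the Redis queue. Everything below is hypothetical: the class name, the method names, and the `"<class name>|<record id>"` reference format are assumptions made for illustration, not the actual service API.

```ruby
# In-memory stand-in for the Redis-backed queue (hypothetical, illustration only).
class ExampleEmbeddingQueue
  def initialize
    @items = []
  end

  # Add a serialized reference to the end of the queue.
  def add(reference)
    @items.push(reference)
  end

  # Remove and return up to `limit` references from the front of the queue.
  def pull(limit)
    @items.shift(limit)
  end

  def size
    @items.size
  end
end

# Hypothetical reference format: "<class name>|<record id>".
def serialize_reference(klass_name, id)
  "#{klass_name}|#{id}"
end

# Turn a serialized reference back into its class name and integer ID.
def deserialize_reference(reference)
  klass_name, id = reference.split('|', 2)
  { klass: klass_name, id: Integer(id) }
end
```

A reference is serialized when a record changes, sits in the queue, and is deserialized later by the worker that generates the embedding.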

Adding to the embedding queue

The following process description uses issues as an example.

An issue embedding is generated from the content "issue with title '#{issue.title}' and description '#{issue.description}'".
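
A minimal sketch of that content string, using a hypothetical `ExampleIssue` struct in place of the real model; only the string template comes from the text above.

```ruby
# Hypothetical stand-in for the real issue model.
ExampleIssue = Struct.new(:title, :description, keyword_init: true)

# Build the text that is sent for embedding generation.
def embedding_content(issue)
  "issue with title '#{issue.title}' and description '#{issue.description}'"
end
```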

Using ActiveRecord callbacks defined in Search::Elastic::IssuesSearch, an embedding reference is added to the embedding queue when an issue is created, or when its title or description is updated, provided that embedding generation is available for the issue.
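
The condition can be expressed as a small predicate. This is a plain-Ruby sketch, not the callback code in Search::Elastic::IssuesSearch; the method name and its keyword arguments are assumptions.

```ruby
# Attributes whose changes require a new embedding (from the text above).
EMBEDDING_FIELDS = %w[title description].freeze

# Hypothetical helper: should a saved issue be added to the embedding queue?
# `created` is true for a newly created record, `changed_fields` lists the
# attribute names changed by the save, and `embeddings_available` reflects
# whether embedding generation is available for the issue.
def enqueue_embedding?(created:, changed_fields:, embeddings_available:)
  return false unless embeddings_available

  created || (changed_fields & EMBEDDING_FIELDS).any?
end
```

For example, an update that only changes labels does not enqueue a reference, while an update to the title does.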

Pulling from the embedding queue

A Search::ElasticIndexEmbeddingBulkCronWorker cron worker runs every minute and does the following:

```mermaid
graph LR
  A[cron] --> B{endpoint throttled?}
  B -->|no| C[schedule 16 workers]
  C -.->|each worker| D{endpoint throttled?}
  D -->|no| E[fetch 19 references from queue]
  E -.->|each reference| F[increment endpoint]
  F --> G{endpoint throttled?}
  G -->|no| H[call AI Gateway to generate embedding]
```

Because every reference increments the endpoint counter and the throttle is checked again before each call to the AI Gateway, the rate limit setting of 450 embeddings per minute is not exceeded, even with 16 concurrent processes generating embeddings at the same time.
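
The flow above can be simulated in a few lines. This is a sequential sketch under stated assumptions: `ExampleEndpointCounter` is a hypothetical stand-in for the real rate limiter, "throttled" is assumed to mean the counter has gone past the limit, and what happens to a reference that hits the throttle is not modeled because the text does not describe it.

```ruby
RATE_LIMIT = 450           # embeddings per minute (from the text above)
WORKERS = 16               # workers scheduled per cron run
REFERENCES_PER_WORKER = 19 # references fetched by each worker

# Hypothetical per-minute counter standing in for the real rate limiter.
class ExampleEndpointCounter
  attr_reader :count

  def initialize(limit)
    @limit = limit
    @count = 0
  end

  # Assumption: the endpoint is throttled once the count exceeds the limit.
  def throttled?
    @count > @limit
  end

  def increment
    @count += 1
  end
end

# Simulate one cron run and return the number of AI Gateway calls made.
def simulate_cron_run(counter, queue_size)
  generated = 0
  return generated if counter.throttled?

  WORKERS.times do
    break if counter.throttled?

    batch = [REFERENCES_PER_WORKER, queue_size].min
    queue_size -= batch
    batch.times do
      counter.increment
      break if counter.throttled? # stop before calling the AI Gateway

      generated += 1
    end
  end
  generated
end
```

With these numbers a single run makes at most 16 × 19 = 304 calls, and a second run against the same counter stops after 146 more, so the total never passes 450.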

Backfilling

An Advanced Search migration is used to perform the backfill. It adds references to the queue in batches, which are then processed by the cron worker as described above.
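
A minimal sketch of batched enqueueing, assuming a plain array as the queue and a hypothetical `"Issue|<id>"` reference format. The real migration's batch size and API are not described here, so `batch_size` is an illustrative parameter.

```ruby
# Hypothetical sketch: add references for the given issue IDs to the queue
# in fixed-size batches. Returns the number of batches processed.
def backfill_references(issue_ids, queue, batch_size: 1000)
  batches = 0
  issue_ids.each_slice(batch_size) do |batch|
    batch.each { |id| queue << "Issue|#{id}" }
    batches += 1
  end
  batches
end
```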

Adding a new embedding type

The following process outlines the steps to get embeddings generated and stored in Elasticsearch.

1. Do a cost and resource calculation to see if the Elasticsearch cluster can handle embedding generation or if it needs additional resources.
2. Decide where to store embeddings. Look at the existing indices in Elasticsearch and if there isn't a suitable existing index, create a new index.
3. Add embedding fields to the index: example.
4. Update the way content is generated to accommodate the new type.
5. Add a new unit primitive: here and here.
6. Use Elastic::ApplicationVersionedSearch to access callbacks and add the necessary checks for when to generate embeddings. See Search::Elastic::IssuesSearch for an example.
7. Backfill embeddings: example.
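
For the step that adds embedding fields, a mapping fragment might look like the sketch below. It assumes an Elasticsearch `dense_vector` field named `embedding` with 768 dimensions; the field name and dimension count match the verification query and the prerequisite check elsewhere on this page, but the `index` and `similarity` settings are illustrative assumptions, not the project's actual mapping.

```ruby
# Hypothetical mapping fragment for an embedding field.
# `dense_vector`, `dims`, `index`, and `similarity` are Elasticsearch mapping
# parameters; the chosen values here are assumptions for illustration.
def example_embedding_mapping(dims: 768)
  {
    properties: {
      embedding: {
        type: 'dense_vector',
        dims: dims,
        index: true,
        similarity: 'cosine'
      }
    }
  }
end
```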

Adding issue embeddings locally

Prerequisites

1. Make sure Elasticsearch is running.
2. If you have an existing Elasticsearch setup, make sure the AddEmbeddingToIssues migration has been completed by executing the following until it returns:

   ```
   Elastic::MigrationWorker.new.perform
   ```

3. Make sure you can run GitLab Duo features on your local environment.
4. Ensure running the following in a Rails console outputs an embedding (a vector of 768 dimensions). If not, there is a problem with the AI setup.

   ```
   Gitlab::Llm::VertexAi::Embeddings::Text.new('text', user: nil, tracking_context: {}, unit_primitive: 'semantic_search_issue').execute
   ```
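
To check the output of that call programmatically, a small helper such as the following could be used. It assumes the call returns a flat array of numbers; that return shape is an assumption, and `valid_embedding?` is a hypothetical name.

```ruby
EXPECTED_DIMENSIONS = 768

# Hypothetical check: is `result` a flat array of numbers with the expected length?
def valid_embedding?(result, dimensions: EXPECTED_DIMENSIONS)
  result.is_a?(Array) &&
    result.length == dimensions &&
    result.all? { |value| value.is_a?(Numeric) }
end
```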

Running the backfill

To backfill issue embeddings for a project's issues, run the following in a Rails console:

```
Gitlab::Duo::Developments::BackfillIssueEmbeddings.execute(project_id: project_id)
```

The task adds the issues to a queue and processes them in batches, indexing embeddings into Elasticsearch. It respects a rate limit of 450 embeddings per minute. If you run into problems, reach out to @maddievn or #g_global_search in Slack.

Verify

If the following returns 0, all issues for the project have embeddings. Replace PROJECT_ID with your project ID:

```
curl "http://localhost:9200/gitlab-development-issues/_count" \
  --header "Content-Type: application/json" \
  --data '{"query": {"bool": {"filter": [{"term": {"project_id": PROJECT_ID}}], "must_not": [{"exists": {"field": "embedding"}}]}}}' | jq '.count'
```
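
The same check can be run from Ruby. The sketch below builds the query body used in the curl command and posts it with `Net::HTTP`; the method names are hypothetical, and the host and index name are taken from the curl example, so adjust them if your local setup differs.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Query body: documents for the project that have no `embedding` field.
def missing_embeddings_query(project_id)
  {
    query: {
      bool: {
        filter: [{ term: { project_id: project_id } }],
        must_not: [{ exists: { field: 'embedding' } }]
      }
    }
  }
end

# Post the query to the _count endpoint and return the count.
# Requires a running Elasticsearch instance at `base_url`.
def missing_embeddings_count(project_id, base_url: 'http://localhost:9200', index: 'gitlab-development-issues')
  uri = URI("#{base_url}/#{index}/_count")
  body = JSON.generate(missing_embeddings_query(project_id))
  response = Net::HTTP.post(uri, body, 'Content-Type' => 'application/json')
  JSON.parse(response.body).fetch('count')
end
```

A result of 0 from `missing_embeddings_count(project_id)` means every indexed issue in the project has an embedding.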