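Implementing Semantic Search with Spring AI and PGVector

1. Overview

Search is a fundamental concept in software: the goal is to find relevant information in a large dataset, that is, to locate specific items within a collection.

This article walks through implementing semantic search with Spring AI, PGVector, and Ollama.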

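2. Basic Concepts

Semantic search is a search technique that uses the meaning of words, rather than exact keyword matches, to find the most relevant results. Building a semantic search application requires a few key concepts:

  • Word embedding: a type of word representation in which words with similar meanings have similar representations. An embedding converts a word into a numeric vector that machine learning models can work with.
  • Semantic similarity: a measure of how close two pieces of text are in meaning. It is used to compare the meaning of words, sentences, or documents.
  • Vector space model: a mathematical model that represents text as vectors in a high-dimensional space. Each word or document becomes a vector, and the similarity between two items is computed from the distance between their vectors.
  • Cosine similarity: a similarity measure between two non-zero vectors in an inner product space, defined as the cosine of the angle between them. It is 1 for vectors that point in the same direction and 0 for orthogonal vectors, and it is a common way to compare vectors in the vector space model.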
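
To make the last concept concrete, here is a minimal sketch of cosine similarity in plain Java. The class name CosineSimilarityDemo and the three-dimensional vectors are made up for illustration; real embeddings, such as the ones used later in this article, have hundreds of dimensions.

```java
final class CosineSimilarityDemo {

    // cos(a, b) = (a . b) / (|a| * |b|), defined for non-zero vectors of equal length.
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for zero vectors");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical toy "embeddings", not produced by any real model.
        double[] book = {1, 2, 0};
        double[] novel = {2, 4, 0};  // same direction as book
        double[] car = {0, 0, 3};    // orthogonal to book

        System.out.println(cosineSimilarity(book, novel)); // close to 1.0
        System.out.println(cosineSimilarity(book, car));   // 0.0
    }
}
```

Because only the angle matters, scaling a vector does not change its cosine similarity to another vector, which is why book and novel above count as maximally similar.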

3. Prerequisites

First, we need Docker installed on our machine to run PGVector and Ollama.

Then, we add the Spring AI Ollama and PGVector dependencies to the Spring application:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>

We also add Spring Boot's Docker Compose support to manage the Ollama and PGVector Docker containers:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-docker-compose</artifactId>
    <version>3.1.1</version>
</dependency>

In addition to the dependencies, we describe both services in a docker-compose.yml file:

services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_DB: vectordb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "5434:5432"
    healthcheck:
      test: [ "CMD-SHELL", "pg_isready -U postgres" ]
      interval: 10s
      timeout: 5s
      retries: 5

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
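      # The health check runs inside the container, so it targets the container port (11434), not the host port (11435).
      test: [ "CMD", "curl", "-f", "http://localhost:11434/api/health" ]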
      interval: 10s
      timeout: 5s
      retries: 10

volumes:
  ollama_data:

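4. Configuring the Application

Next, we configure the Spring Boot application to use the Ollama and PGVector services.

The application.yml file defines several properties. Pay particular attention to the ollama and vectorstore properties: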

spring:
  ai:
    ollama:
      init:
        pull-model-strategy: when_missing
        chat:
          include: true
      embedding:
        options:
          model: nomic-embed-text
    vectorstore:
      pgvector:
        initialize-schema: true
        dimensions: 768
        index-type: hnsw
  docker:
    compose:
      file: docker-compose.yml
      enabled: true
  datasource:
    url: jdbc:postgresql://localhost:5434/vectordb
    username: postgres
    password: postgres
    driver-class-name: org.postgresql.Driver
  jpa:
    database-platform: org.hibernate.dialect.PostgreSQLDialect

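We use nomic-embed-text as the Ollama embedding model. Because pull-model-strategy is set to when_missing, Spring AI pulls the model automatically if it has not been downloaded yet.

The PGVector settings configure the vector store in three ways: initialize-schema: true creates the database schema on startup, dimensions: 768 sets the vector dimension to a common embedding size (it must match the output of the embedding model), and index-type: hnsw uses a Hierarchical Navigable Small World (HNSW) index for fast approximate nearest-neighbor search.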

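5. Implementing Semantic Search

We'll build a smart book search engine that lets users search for books by their content.

First, we build a simple search feature with PGVector; then we enhance it with Ollama to provide more context-aware responses.

Let's define a Book class that represents the book entity: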

public record Book(String title, String author, String description) {
}

Before we can search for books, we need to load the book data into the PGVector store:

void run() {
    var books = List.of(
            new Book("The Great Gatsby", "F. Scott Fitzgerald", "The Great Gatsby is a 1925 novel by American writer F. Scott Fitzgerald. Set in the Jazz Age on Long Island, near New York City, the novel depicts first-person narrator Nick Carraway's interactions with mysterious millionaire Jay Gatsby and Gatsby's obsession to reunite with his former lover, Daisy Buchanan."),
            new Book("To Kill a Mockingbird", "Harper Lee", "To Kill a Mockingbird is a novel by the American author Harper Lee. It was published in 1960 and was instantly successful. In the United States, it is widely read in high schools and middle schools."),
            new Book("1984", "George Orwell", "Nineteen Eighty-Four: A Novel, often referred to as 1984, is a dystopian social science fiction novel by the English novelist George Orwell. It was published on 8 June 1949 by Secker & Warburg as Orwell's ninth and final book completed in his lifetime."),
            new Book("The Catcher in the Rye", "J. D. Salinger", "The Catcher in the Rye is a novel by J. D. Salinger, partially published in serial form in 1945-1946 and as a novel in 1951. It was originally intended for adults but is often read by adolescents for its themes of angst, alienation, and as a critique on superficiality in society."),
            new Book("Lord of the Flies", "William Golding", "Lord of the Flies is a 1954 novel by Nobel Prize-winning British author William Golding. The book focuses on a group of British")
    );

    List<Document> documents = books.stream()
            .map(book -> new Document(book.toString()))
            .toList();

    vectorStore.add(documents);
}

The method above adds the sample book data to the PGVector store. With the data in place, we can implement the semantic search feature.
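
Note that each Document is created from book.toString(), so the text that gets embedded is the record's generated string form. The sketch below shows what that string looks like; the BookTextDemo class and the embeddingText helper are hypothetical and only illustrate an alternative, more sentence-like formatting that is not part of the original code.

```java
final class BookTextDemo {

    record Book(String title, String author, String description) {
    }

    // Hypothetical alternative: build a more natural sentence for embedding
    // instead of relying on the record's generated toString().
    static String embeddingText(Book book) {
        return "%s by %s. %s".formatted(book.title(), book.author(), book.description());
    }

    public static void main(String[] args) {
        Book book = new Book("1984", "George Orwell", "A dystopian novel.");

        // Records generate toString() as Name[component=value, ...].
        System.out.println(book);
        // prints: Book[title=1984, author=George Orwell, description=A dystopian novel.]

        System.out.println(embeddingText(book));
        // prints: 1984 by George Orwell. A dystopian novel.
    }
}
```

This is also why the search responses later in the article come back in the Book[title=..., author=..., description=...] form.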

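5.1. Semantic Search

Our goal is a semantic search API that lets users find books based on their content.

Let's define a controller that interacts with PGVector to perform the similarity search: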

@RestController
@RequestMapping("/books")
class BookSearchController {
    final VectorStore vectorStore;
    final ChatClient chatClient;

    BookSearchController(VectorStore vectorStore, ChatClient.Builder chatClientBuilder) {
        this.vectorStore = vectorStore;
        this.chatClient = chatClientBuilder.build();
    }
}
...

Next, we create a POST /search endpoint that accepts the user's search query and returns a list of matching books:

@PostMapping("/search")
List<String> semanticSearch(@RequestBody String query) {
    return vectorStore.similaritySearch(SearchRequest.builder()
        .query(query)
        .topK(3)
        .build())
      .stream()
      .map(Document::getText)
      .toList();
}

Note that we use the VectorStore#similaritySearch method, which performs a semantic search over the books we stored earlier.
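
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The following is a minimal in-memory sketch of that idea, not how PGVector or Spring AI implement it: the two-dimensional vectors, the TopKSearchDemo class, and the topK helper are hypothetical, and a real store would typically use an index such as HNSW instead of scanning every entry.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class TopKSearchDemo {

    record Scored(String text, double similarity) {
    }

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Brute-force search: score every stored vector, sort by similarity, keep the best k.
    static List<String> topK(Map<String, double[]> store, double[] query, int k) {
        return store.entrySet().stream()
                .map(e -> new Scored(e.getKey(), cosineSimilarity(query, e.getValue())))
                .sorted(Comparator.comparingDouble(Scored::similarity).reversed())
                .limit(k)
                .map(Scored::text)
                .toList();
    }

    public static void main(String[] args) {
        // Made-up vectors standing in for the embeddings of three book descriptions.
        Map<String, double[]> store = Map.of(
                "1984", new double[]{0.9, 0.1},
                "Lord of the Flies", new double[]{0.7, 0.6},
                "The Great Gatsby", new double[]{0.1, 0.9});

        double[] query = {1.0, 0.0}; // stand-in for the embedded search query

        System.out.println(topK(store, query, 2));
        // prints: [1984, Lord of the Flies]
    }
}
```

The topK(3) setting in the endpoint plays the same role as k here: it caps how many of the closest documents are returned.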

After starting the application, we can run a search.

Here is an example that uses cURL to search for 1984:

curl -X POST --data "1984" http://localhost:8080/books/search

The response contains three books: one exact match and two partial matches:

[
  "Book[title=1984, author=George Orwell, description=Nineteen Eighty-Four: A Novel, often referred to as 1984, is a dystopian social science fiction novel by the English novelist George Orwell.]",
  "Book[title=The Catcher in the Rye, author=J. D. Salinger, description=The Catcher in the Rye is a novel by J. D. Salinger, partially published in serial form in 1945–1946 and as a novel in 1951.]",
  "Book[title=To Kill a Mockingbird, author=Harper Lee, description=To Kill a Mockingbird is a novel by the American author Harper Lee.]"
]

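5.2. Enhancing Semantic Search with Ollama

We can integrate Ollama in three steps to generate paraphrased responses that add context to the semantic search results:

  1. Retrieve the top three matching book descriptions for the search query.
  2. Feed these descriptions to Ollama to generate a more natural, context-aware response.
  3. Summarize and paraphrase the information in the reply to give clearer, more relevant insight.

In BookSearchController, we create a new method that uses Ollama to generate a paraphrased answer to the query: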

@PostMapping("/enhanced-search")
String enhancedSearch(@RequestBody String query) {
    String context = vectorStore.similaritySearch(SearchRequest.builder()
        .query(query)
        .topK(3)
        .build())
      .stream()
      .map(Document::getText)
      .reduce("", (a, b) -> a + b + "\n");

    return chatClient.prompt()
      .system(context)
      .user(query)
      .call()
      .content();
}
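
The method above passes the concatenated descriptions to the model as the system message without any further instruction. As a rough sketch of how that context string is assembled, and of one possible way to wrap it in an explicit instruction, consider the plain-Java example below. The ContextAssemblyDemo class, the helper names, and the prompt wording are all assumptions for illustration and do not come from the original code or from Spring AI.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ContextAssemblyDemo {

    // Same reduction as in enhancedSearch: every document is followed by a newline.
    static String joinWithReduce(List<String> docs) {
        return docs.stream().reduce("", (a, b) -> a + b + "\n");
    }

    // Equivalent idea with a collector: newlines only between documents.
    static String joinWithCollector(List<String> docs) {
        return docs.stream().collect(Collectors.joining("\n"));
    }

    // Hypothetical system prompt that asks the model to stay within the retrieved context.
    static String systemPrompt(String context) {
        return """
                Answer the user's question using only the book descriptions below.
                If they do not contain the answer, say so.

                %s""".formatted(context);
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Book[title=1984, ...]", "Book[title=Lord of the Flies, ...]");

        System.out.println(systemPrompt(joinWithCollector(docs)));
    }
}
```

With such a helper, the call in enhancedSearch could pass systemPrompt(context) to .system(...) instead of the raw context. Whether this improves the answers depends on the chat model, so treat it as something to experiment with rather than a required change.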

Now, let's test the enhanced semantic search by sending a POST request to the /books/enhanced-search endpoint:

curl -X POST --data "1984" http://localhost:8080/books/enhanced-search
1984 is a classic dystopian novel written by George Orwell. Here's an excerpt from the book:

"He loved Big Brother. He even admired him. After all, who wouldn't? Big Brother was all-powerful, all-knowing, and infinitely charming. And now that he had given up all his money in bank accounts with his names on them, and his credit cards, and his deposit slips, he felt free."

This excerpt sets the tone for the novel, which depicts a totalitarian society where the government exercises total control over its citizens. The protagonist, Winston Smith, is a low-ranking member of the ruling Party who begins to question the morality of their regime.

Would you like to know more about the book or its themes?

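Instead of returning three separate book descriptions as the plain semantic search does, Ollama synthesizes the most relevant information from the search results. In this example, 1984 is the most relevant match, so Ollama focuses on a detailed summary of it rather than listing unrelated books. This resembles the help a human assistant would give and makes the search results more engaging and insightful.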

6. Conclusion

This article showed how to implement semantic search with Spring AI, PGVector, and Ollama.


Ref: https://www.baeldung.com/spring-ai-pgvector-semantic-search