Revolutionizing Search with AI: Semantic Search and RAG

Artificial Intelligence (AI) has transformed the way we search for information. Search behavior has evolved from simple keyword searches like "running shoes" to specific and personalized queries like "comfortable shoes for jogging” or “athletic shoes for beginners”. This change will continue as customer behavior evolves with technology.

A McKinsey study reports that 71% of users expect personalized search results and are often frustrated when these expectations are not met.

The traditional keyword-based systems do struggle with natural queries and miss the context in the process, while semantic search understands the intent behind the query and delivers more relevant results.

Hence, we leveraged AI by merging our understanding of Large Language Models (LLM), vector databases, and Retrieval Augmented Generation (RAG), to create an advanced semantic search system to respond to complex but natural human queries.

Limitations of traditional searches and LLMs

We examined the problem closely to ensure that our solution effectively addresses the challenges with traditional keyword-based systems and to build an outcome, that can be put to use.

Here are some reasons why we chose to go ahead with our semantic search and custom result generation experiment:

Challenges with traditional search systems

Speed and efficiency: Traditional search methods may not deliver quick or efficient results whereas semantic search can sort through large amounts of data in real time to find results faster.

Relevance: Traditional search relies on exact keyword matches and often misses the context of the search queries. Semantic search, on the other hand, understands the intent behind a query and looks for relevant results.

Complex infrastructure: When it comes to traditional searches, integrating AI functionalities can be quite a complex task. It involves setting up additional databases and fine-tuning models just to understand search queries better.

High latency: Dealing with large sets of records in traditional databases becomes a slow process and at times expensive when it comes to storing and searching for information.

Drawbacks of foundational LLMs

LLMs are advanced AI-driven models designed to understand and generate human-like text. Some examples of foundational LLMs include OpenAI's ChatGPT or GPT-3.5-Turbo, and Anthropic's Claude.

Static nature: LLMs, such as GPT, have a fixed knowledge base, and they can't adapt to or include new real-time information once they've completed their training.

Lack of domain-specific knowledge: LLMs are trained for general tasks and would not have specific details from private or recent datasets, like a company's latest products or updates.

Black box functionality: It can be quite challenging to understand where the answers generated by an LLM come from as it's not always clear which sources or reasoning it relies on.

Cost and inefficiency: Developing and launching foundational models like GPT demands significant resources and makes it difficult for many organizations to customize them for specific tasks.

Limited contextual understanding: Although LLMs can produce clear responses without extra assistance (like using RAG), they may not consistently provide the most context-aware answers. For example, helping a customer book a flight often requires real-time data and context, which a basic GPT model might lack.

Accuracy: Using LLMs without any tweaks or adjustments can sometimes result in errors, incorrect information, or less-than-ideal answers for specific, detailed tasks.

User queries have changed from simple keyword-based questions to more complex, context-heavy inquiries. This shift shows how people increasingly depend on technology to understand detailed and varied questions.

While traditional keyword-based systems such as Solr, Google PSE, and Algolia expect users to adjust their questions to fit the system's limitations, newer AI-enhanced platforms are setting a new standard. They're now adapting to users, grasping their intentions, understanding the context, and even picking up on emotions.

This change marks a move towards more intuitive, conversational, and human-friendly interactions with technology, reflecting our natural desire for clear, relevant, and instant responses.

Designing a semantic search and RAG system

Our goal was to create a system that could understand language more effectively, efficiently manage vast amounts of data, and deliver personalized results. We focused on addressing the challenges posed by traditional search methods and the drawbacks of foundational LLMs.

We aimed to use our insights to build a system capable of finding the most relevant information and providing useful answers. Our approach was simple, focused on practical solutions, and leveraged the following key features:

AI-powered search

We wanted to develop a search system that uses semantic search. We achieved this by taking user search queries and converting them into numerical vectors using machine-learned meaning in the backend. These sets of numbers or vectors were then matched against a database of similar vectors to identify the most similar results.

We enhanced the search results by understanding the user's intended meaning and providing more relevant results. It's worth noting that semantic search can sometimes produce inaccurate results when dealing with short queries, typically comprising just two or three specific keywords.

Contextual response generation with AI

We improved the user experience by providing responses that are aware of the context of the user's queries through an 'Ask AI' function. This function considers all the relevant results gathered from previous smart searches.

The data returned from the smart search can be used as context for the user's query and sent to the OpenAI Chat API to generate a continuous text response. This process is known as Retrieval Augmented Generation.

With these features, we believe our AI-driven search system can greatly enhance the way people search for information.

Our toolbox

We chose the following tools for storing data:

Pinecone: It's an AI solution that uses vectors for scalability and doesn't require any maintenance or troubleshooting. It speeds up data searches within milliseconds, applies filters to metadata, and supports indexes that help find precise search results for various tasks.
Meilisearch: This is a free and open-source alternative to Algolia, that comes with an exciting experimental feature allowing us to store vectors with traditional data. It combines the key features of both Algolia and Pinecone and delivers results in just milliseconds.
OpenAI API: This API gives access to all the general-purpose models, including GPT-4 and more. It can be easily integrated into applications to handle complex tasks and enhance experiences with AI.

As for the demo, we chose the Tokio web server (built on Rust) for the backend and its main query handler due to its high performance. It's reliable and provides a safe environment, making it an ideal choice for building an API.

Combining AI with Rust brings benefits like speed, handling multiple tasks simultaneously, ensuring memory safety, and the ability to work with C libraries. Additionally, Rust provides a supportive community and a growing set of libraries for creating web services and APIs.

The demo search application

Challenges with the RAG system

Limited model context windows and large text blogs

A context window is like the number of tokens or text the model can take in before it generates more text. For example, common models like GPT-3.5-Turbo have a context window of 4,096 tokens, which equals about 3,000 English words. Meanwhile, bigger models like GPT-4 have 32,768 tokens, which would be about 24,500 words. (These are approximate figures and can vary depending on the text and how it's processed by the tokenizer model.)

When we add more data, these can become limitations. Some models, like Claude 2, have a wider context window of 100,000 tokens, but you start seeing diminishing results, and issues like hallucinations pop up.

We tried out different approaches, but for our experiment, we went with the simplest one, "trimming." However, we considered the Pinecone scores before cutting out parts of the context.

We're also exploring other ideas, like creating smaller vectors and using their metadata to connect back to the main data. This could potentially save a lot of tokens.

A steerable prompt for the GPT model

We tried out various prompts to arrive at what we have now. In this experiment, we used the following prompt: (Note that prompting changes with models and version. The one we used here is GPT-4 2023 June 13th snapshot)

    
    {
    // The 'system' role is what drives the model, in other words, defines the goal or purpose of the model to behave.
        role: "system",
        content: `You are an advanced legal aid search engine bot, developed by ILAO - Illinois Legal Aid Online. Your primary role is to deliver highly relevant, accurate, and useful search results to users based on their Query and the available Context.
    Please follow these guidelines strictly:
    1. Provide responses directly related to the user's Query. If the query is unclear or insufficient, summarize the Context and include any pertinent details about the Query.
    2. Don't ask the user questions as they don't have the capability to respond.
    3. Don't introduce yourself. The goal is to provide search results swiftly and efficiently.
    4. Strive to provide the best possible results for each Query, like a dedicated legal search engine. 
    5. Use the Context provided to craft comprehensive, succinct, and user-friendly answers to the Query.
    6. Refer to results from the Context using [context-id] notation for citation. For example: 'some text [1] some other text [2]'.
    7. Do not include the full text of cited sources. These will be managed by separate software. Try to avoid citing the sources too many times.
    8. In cases where the Query relates to multiple subjects sharing the same name, formulate separate responses for each subject to ensure clarity.
    9. Utilize markdown formatting for clarity and readability.
    10. Limit responses to a maximum of 300 words to provide concise and focused answers.
    Remember, your ultimate goal is to assist users in navigating legal information quickly and accurately, in line with the mission of Illinois Legal Aid Online.`,
    }

This prompt may evolve as we work on improving it. We're looking to refine it by referring to the details provided in "What are tokens and how to count them?" by OpenAI.

Streaming responses in JS

When we were working on the demo, the official OpenAI JS library didn't have support for streaming responses. So, we used an alternative library called "openai-ext," which not only allowed us to implement streaming responses but also made it easier to manage the buttons' states.

API latency

Pinecone claims that some customers can get their search results in under 100 ms, but we haven't been able to achieve the same speed with our HTTP requests to the API. We're still figuring out how to reach that kind of latency.

One way could be by using gRPC implementation in Python, or maybe some other method we haven't tried yet. We're also exploring options for on-premises solutions with custom search algorithms that might give us response times faster than 100 ms.

Lately, we've noticed that the OpenAI embedding API has been slowing down. Initially, a few results occasionally took over 300 ms, but now it's happening more often. It's not as fast as we'd like for an instant search experience.

To make it work smoothly, either OpenAI needs to upgrade its servers to generate embeddings faster, or we'll need to find on-premises solutions, like using local Bert models for embeddings, which could give us an average response time of less than 60 ms.

Dive deeper into our demo search experience

Take a closer look at our demo search experience in action. You can explore our semantic search in action in two informative blogs:

Revolutionizing Search with AI: Diving Deep into Semantic Search - This blog will give you an inside look at our demo application, explaining how we implemented semantic search and built the infrastructure using Rust.

Revolutionizing Search with AI: RAG for Contextual Response - In this blog, we uncover the inner workings of RAG paired with GPT. You'll discover how we transform user queries into personalized responses that make interactions feel truly human.

What's next in AI search series

We are continuously seeking new ways to make the experience more personal for our clients and their users. Along with tackling current challenges, here are some of the ideas we're exploring:

On-premises and open-source alternatives

Our clients have diverse needs. Some prefer on-premises solutions, while others rely on open-source software. Non-profit organizations often seek a balance between the two. To cater to this variety, we're considering different technologies to expand our offerings. These include cloud-based LLMs like Azure-hosted GPTs and Claude 2, open-source LLaMA 2, vector solutions such as local Milivus, and hybrid search solutions like Typesense and Meilisearch.

Fine-tuning search experience

We're running experiments with feedback loops and using user data to personalize search results and improve the one-on-one user experience. OpenAI recently announced that their GPT-3.5-turbo model can be fine-tuned with custom data, which makes the Reinforcement Learning through Human Feedback (RLHF) approach easier and more cost-effective.

Summary

We believe that semantic search can transform the search experience, making it more intuitive and user-friendly. It has the potential to simplify the process of finding information, even when dealing with complex queries.

Our journey with AI continues to evolve through these ideas and search experiments. We are constantly striving to innovate solutions that can bring exceptional experiences to today's digital platforms, setting the stage for tomorrow's personalized AI-driven Digital Experience Platforms (DXPs).