In this third installment of the 'Revolutionizing Search with AI' series, we take a deeper look at how semantic search combined with RAG can elevate search engine capabilities. Our first blog provided an overview of the complete application, and the second explored the intricacies of semantic search and the backend of our demo application.
Our primary objective has been to enhance the search experience through the introduction of a feature known as 'Ask AI,' which is similar to Google's 'Converse' function in the new Google Search Generative Experience (SGE). This feature goes beyond basic keyword searches. It considers all the relevant results from our previous searches, leveraging data from the Pinecone Vector Database.
This information then acts as the context for the new query and is sent to the OpenAI Chat API to generate a dynamic text response in real time. This process is called Retrieval Augmented Generation (RAG). By weaving these features together, we're confident that the AI-powered search system we're building can transform the search engine landscape.
In this blog, we'll explore RAG and how we've applied its principles to create a more user-friendly search engine experience.
Understanding Retrieval Augmented Generation (RAG)
Back in 2020, Patrick Lewis and his team at Meta (then Facebook AI Research) introduced a concept known as Retrieval Augmented Generation. However, it wasn't until organizations recognized the potential of advanced language models like GPT-4 from OpenAI and Claude 2 from Anthropic, with their ability to handle large amounts of contextual information, that this technique gained widespread adoption.
These models allow us to enhance the quality of generated content by incorporating real-time data. As we started to use ChatGPT and similar large language models, it became evident that while they excel at content generation, they are not without their limitations.
Drawbacks of current LLMs
- Limited to training data: These models, like static textbooks, rely solely on the data they were initially trained on. As a result, they lack knowledge of recent developments not included in their training data.
- Broad but not specialized: Foundational language models like GPT and Claude are built to handle a wide range of tasks effectively. Even so, they may not perform as well when it comes to specialized knowledge and domain-specific tasks.
- Lack of transparency: Since these models are designed to handle a wide range of information from various sources, it can be challenging to trace which specific data they used to generate their responses.
- Cost and expertise barrier: Training or fine-tuning these models can be financially challenging for many organizations. For instance, a cutting-edge model like GPT requires investments in the millions of dollars, making it a costly option, particularly for smaller companies.
- Hallucinations: Because these models are general and don't always have access to reference data, they can sometimes generate responses that may not be entirely accurate.
How does RAG work?
RAG begins by retrieving data. In our case, this involves a semantic search using the Pinecone Vector Database API. This search finds relevant information for the user's query.
Next, RAG enhances the initial user query with this data. It then feeds this improved prompt into a Generative AI model, like OpenAI's GPT-4 via their 'Chat Completion API.'
This process results in the final response or answer to the user's query. RAG combines retrieval and generation techniques to provide context-aware and informed responses.
In our system, we employ semantic search within a vector database where our data is stored. This database is designed to understand natural language, such as the user's query or prompt for an LLM.
What sets it apart is its flexibility; it can be updated or modified just like any other database. It solves the challenge of dealing with static or frozen LLMs.
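To make the flow concrete, here is a minimal TypeScript sketch of the three steps. Our demo's backend does this in Rust, and the endpoint host, environment variable names, and prompt wording below are illustrative placeholders rather than the demo's exact values.

```typescript
// Minimal RAG flow: retrieve with Pinecone, augment the prompt, generate with GPT.
// PINECONE_HOST and the API keys are illustrative placeholders.
type Match = { id: string; score: number; metadata?: { text?: string } };

async function askAI(question: string): Promise<string> {
  // 1. Retrieve: embed the question and run a semantic search in Pinecone.
  const embRes = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-ada-002", input: question }),
  });
  const queryVector: number[] = (await embRes.json()).data[0].embedding;

  const searchRes = await fetch(`https://${process.env.PINECONE_HOST}/query`, {
    method: "POST",
    headers: { "Api-Key": process.env.PINECONE_API_KEY!, "Content-Type": "application/json" },
    body: JSON.stringify({ vector: queryVector, topK: 5, includeMetadata: true }),
  });
  const matches: Match[] = (await searchRes.json()).matches ?? [];

  // 2. Augment: fold the retrieved documents into the prompt as context.
  const context = matches.map((m) => m.metadata?.text ?? "").join("\n---\n");

  // 3. Generate: ask the chat model to answer from that context.
  const chatRes = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4",
      messages: [
        { role: "system", content: "Answer using only the provided context." },
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
      ],
    }),
  });
  return (await chatRes.json()).choices[0].message.content;
}
```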
Benefits of RAG
- Offers a dynamic experience, unlike static models.
- Tailored to meet specific business needs by incorporating relevant data.
- Enhances transparency by avoiding the black box issue.
- More cost-effective compared to training or fine-tuning a language model from scratch.
- Simplifies the process by eliminating the need for complex prompts.
- Makes it possible to create domain-specific chatbots and similar apps in minutes.
Blending Pinecone Vector Store with GPT
In RAG, two key components play a vital role. The first is the vector database, which serves as the repository of up-to-date information. For our purposes, we rely on the Pinecone Vector Database API; its user-friendly nature and flexibility make it an ideal choice. The second component is the text generation model, typically an LLM such as GPT or Claude.
Combining the Pinecone Vector Store with the power of GPT brings a new dimension to the search experience. We seamlessly integrated these elements to go beyond mere semantic search, delivering a more comprehensive and context-aware solution.
A deeper dive into our backend solution
Here is a glimpse of our solution.
Similar to the ‘q’ route we developed in the previous blog of the series, we are using Rust here with a different route named ‘qa’.
In our demo, we focused on optimizing a specific aspect: latency. To achieve this, we implemented multiple in-memory caches.
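As an illustration of the idea (the actual caches live in the Rust backend, and the names and TTL values below are arbitrary), a minimal in-memory TTL cache for expensive steps such as query embeddings and Pinecone lookups might look like this:

```typescript
// A tiny TTL cache of the kind used to cut repeat-query latency.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazily evict stale entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// One cache per expensive step, keyed by the normalized user query.
const embeddingCache = new TtlCache<number[]>(10 * 60 * 1000); // query embeddings
const searchCache = new TtlCache<unknown>(5 * 60 * 1000);      // Pinecone results
```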
Let's explore the custom trimming logic we implemented, starting with the 'model_selector' function.
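We won't reproduce the Rust implementation here, but the idea is to pick a chat model whose context window can hold the assembled prompt. The sketch below is only a plausible TypeScript illustration; the model names and thresholds are assumptions, not the demo's exact rules.

```typescript
// Illustrative only: pick a chat model whose context window fits the prompt.
// Model names and thresholds are assumptions, not the demo's exact rules.
function modelSelector(promptTokens: number, reservedForAnswer = 512): string {
  const needed = promptTokens + reservedForAnswer;
  if (needed <= 4096) return "gpt-3.5-turbo";      // 4k context window
  if (needed <= 16384) return "gpt-3.5-turbo-16k"; // 16k context window
  // Anything larger has to be trimmed down first (see the trimmers below).
  return "gpt-3.5-turbo-16k";
}
```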
Now, let's consider a crucial part of the trimming process, the 'document_vec_trimmer' function. We've developed a smart trimmer that doesn't simply remove content to fit within a model's constraints; it takes Pinecone scores into account and trims in a way that respects each document's importance.
Please note that another efficient approach involves breaking the original data down into smaller segments before vectorizing them. That would allow us to include key information about the source data in the metadata of each database entry.
For our demo development, we opted for a simpler approach.
Take a quick look at the final function within this API, the simple 'context_trimmer'.
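As a rough TypeScript illustration (the demo's version is Rust, and token counts here are approximated at about four characters per token rather than with a real tokenizer such as tiktoken), a simple final trim over the assembled context could look like this:

```typescript
// Simple final safety trim: cap the assembled context at a token budget.
// Tokens are approximated at ~4 characters each; a real tokenizer would give exact counts.
function contextTrimmer(context: string, maxTokens: number): string {
  if (Math.ceil(context.length / 4) <= maxTokens) return context;

  const hardCut = context.slice(0, maxTokens * 4);
  // Prefer to cut at the last sentence boundary inside the budget, if there is one.
  const lastStop = hardCut.lastIndexOf(". ");
  return lastStop > 0 ? hardCut.slice(0, lastStop + 1) : hardCut;
}
```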
The remaining components of the streaming solution are managed within the demo React site that we created.
We are using the 'openai-ext' library to handle streaming responses.
It's worth noting that a third-party library like 'openai-ext' may no longer be necessary with the introduction of 'openai-node' v4, which offers built-in streaming capabilities.
Let's get the request ready for OpenAI, which is essentially the message array for the LLM.
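The request is essentially the standard Chat Completions message array: a system message carrying the instructions, and a user message carrying the trimmed context plus the question. A minimal sketch with placeholder names:

```typescript
// Assemble the augmented message array for the Chat Completions request.
// `systemPrompt` stands in for the instruction prompt discussed later in this post.
function buildMessages(systemPrompt: string, trimmedContext: string, userQuery: string) {
  return [
    { role: "system" as const, content: systemPrompt },
    {
      role: "user" as const,
      content: `Context:\n${trimmedContext}\n\nQuestion: ${userQuery}`,
    },
  ];
}
```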
Next, let's check the 'openai-ext' configuration for generating streaming responses using the augmented message array we've prepared.
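We won't reproduce the demo's exact configuration here. Roughly following the openai-ext README, a client-side streaming call has the shape below; treat the handler names and signatures as approximate and check the library's documentation before relying on them.

```typescript
import { OpenAIExt } from "openai-ext";

// Example inputs; in the demo these come from the search box and the 'qa' API.
const userQuery = "How does RAG reduce hallucinations?";
const trimmedContext = "...retrieved and trimmed documents...";

// Approximate shape of a client-side streaming call with openai-ext.
const xhr = OpenAIExt.streamClientChatCompletion(
  {
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${trimmedContext}\n\nQuestion: ${userQuery}` },
    ],
  },
  {
    // A client-side key is fine for a demo but shouldn't ship in production.
    apiKey: "YOUR_OPENAI_API_KEY",
    handler: {
      onContent(content: string, isFinal: boolean) {
        // Called repeatedly with the answer generated so far; update UI state here.
      },
      onDone() {
        // Streaming finished; e.g. switch the button from Stop Response to Regenerate.
      },
      onError(error: Error) {
        console.error(error);
      },
    },
  }
);

// The returned request can be aborted to implement "Stop Response":
// xhr.abort();
```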
Below is how we managed the interactive button states for actions like Ask AI, Stop Response, and Regenerate.
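The demo's component is more involved, but a simplified React sketch of the three-button behavior looks like this; the `startStream` helper is an assumed wrapper around the streaming call above, not a function from the demo's codebase.

```tsx
import { useRef, useState } from "react";

// Simplified sketch of the Ask AI / Stop Response / Regenerate button logic.
export function AskAIControls({
  startStream,
}: {
  // Assumed interface: starts the streaming request, returns the underlying XHR,
  // and calls onDone when the stream finishes.
  startStream: (onDone: () => void) => XMLHttpRequest;
}) {
  const [status, setStatus] = useState<"idle" | "streaming" | "done">("idle");
  const xhrRef = useRef<XMLHttpRequest | null>(null);

  const ask = () => {
    setStatus("streaming");
    xhrRef.current = startStream(() => setStatus("done"));
  };

  const stop = () => {
    xhrRef.current?.abort(); // cancel the in-flight streaming request
    setStatus("done");
  };

  return (
    <div>
      {status === "idle" && <button onClick={ask}>Ask AI</button>}
      {status === "streaming" && <button onClick={stop}>Stop Response</button>}
      {status === "done" && <button onClick={ask}>Regenerate</button>}
    </div>
  );
}
```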
A glimpse of our demo application
The challenges we faced
Huge text blogs and limited API/model context windows
We explored various approaches to address the issue of dealing with extensive text documents and the constraints of limited API or model context windows. For our experiment, we opted for the simplest solution, which involved "trimming" the content. What set our approach apart is that we considered Pinecone scores before making trimming decisions.
The following is a rough outline of the formula we used for trimming (a code sketch of these steps follows the list):
- Normalize scores to [0, 1] range: Normalized Score = (Score - MinScore) / (MaxScore - MinScore)
- Calculate adjusted scores: Adjusted Score = Normalized Score^2
- Calculate the adjusted ratio for each document: Adjusted Ratio = Adjusted Score / Total Adjusted Score
- Calculate the total number of excess tokens across all documents: Excess Tokens = Total Token Count - Target Token Size
- Determine trim ratio based on the excess tokens and total tokens: Trim Ratio = Excess Tokens / Total Token Count
- For each document, calculate the target token count: Target Token Count = max(((1 - Trim Ratio) * Current Token Count), Minimum Tokens in Document)
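To make these steps concrete, here is a TypeScript sketch (the demo's trimmer is written in Rust). The steps above leave open exactly how the adjusted ratio feeds into the final per-document target, so the sketch makes one explicit assumption: each document's share of the excess tokens is weighted by one minus its adjusted ratio, so higher-scoring documents give up fewer tokens.

```typescript
// TypeScript sketch of the steps above (the demo's trimmer is written in Rust).
// ASSUMPTION: each document's share of the excess is weighted by (1 - adjusted ratio),
// i.e. higher-scoring documents are trimmed less aggressively.
interface ScoredDoc {
  score: number;      // Pinecone similarity score
  tokenCount: number; // current token count of the document text
}

function targetTokenCounts(docs: ScoredDoc[], targetTokenSize: number, minTokens = 50): number[] {
  if (docs.length === 0) return [];

  const totalTokens = docs.reduce((sum, d) => sum + d.tokenCount, 0);
  const excessTokens = totalTokens - targetTokenSize;
  if (excessTokens <= 0) return docs.map((d) => d.tokenCount); // already fits

  // Steps 1-3: normalize scores to [0, 1], square them, and turn them into ratios.
  const min = Math.min(...docs.map((d) => d.score));
  const max = Math.max(...docs.map((d) => d.score));
  const adjusted = docs.map((d) => (max === min ? 1 : ((d.score - min) / (max - min)) ** 2));
  const totalAdjusted = adjusted.reduce((a, b) => a + b, 0);
  const adjustedRatios = adjusted.map((a) => a / totalAdjusted);

  // Steps 4-6: spread the excess across documents, trimming low scorers harder,
  // and never let a document drop below its minimum.
  const weights = adjustedRatios.map((r) => 1 - r);
  const totalWeight = weights.reduce((a, b) => a + b, 0);

  return docs.map((d, i) => {
    const share = totalWeight > 0 ? weights[i] / totalWeight : 1 / docs.length;
    const target = Math.round(d.tokenCount - excessTokens * share);
    return Math.max(target, Math.min(minTokens, d.tokenCount));
  });
}
```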
Another approach we've explored, in an internal application, is dividing extensive content or blogs into smaller data segments. This can be done using methods like paragraph splitting or more advanced techniques, such as contextual-understanding splits, where we dissect the content based on its context and meaning.
Regardless of the approach chosen for splitting, we then convert these segments into vectors and store them. This ensures that when we create the augmented prompt for the LLM, we minimize data loss during trimming and save costs, especially when the fetched data is limited.
We'll delve into these advanced splitting methods in a future blog.
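For reference, here is a minimal sketch of the simple paragraph-splitting variant: it embeds each chunk with the OpenAI embeddings API and upserts it to Pinecone with source metadata. The index host, ID scheme, and metadata fields are illustrative choices, not the internal application's actual code.

```typescript
// Minimal sketch of paragraph-level chunking + upsert into Pinecone.
// PINECONE_HOST, the metadata fields, and the ID scheme are illustrative.
async function indexDocument(docId: string, sourceUrl: string, text: string) {
  // Naive split: one chunk per paragraph (advanced splits are a topic for a later blog).
  const chunks = text.split(/\n{2,}/).map((p) => p.trim()).filter((p) => p.length > 0);

  for (const [i, chunk] of chunks.entries()) {
    // Embed the chunk with the OpenAI embeddings API.
    const res = await fetch("https://api.openai.com/v1/embeddings", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "text-embedding-ada-002", input: chunk }),
    });
    const embedding: number[] = (await res.json()).data[0].embedding;

    // Upsert the vector with enough metadata to attribute and reassemble the source.
    await fetch(`https://${process.env.PINECONE_HOST}/vectors/upsert`, {
      method: "POST",
      headers: { "Api-Key": process.env.PINECONE_API_KEY!, "Content-Type": "application/json" },
      body: JSON.stringify({
        vectors: [
          {
            id: `${docId}-${i}`,
            values: embedding,
            metadata: { sourceUrl, chunkIndex: i, text: chunk },
          },
        ],
      }),
    });
  }
}
```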
Writing a steerable prompt for the GPT model
While we aimed to simplify prompt engineering, it remains essential to guide the model in utilizing the provided or augmented context effectively.
We conducted experiments with various prompt styles to arrive at our current approach.
Below is the prompt we employed to develop our AI Lawyer. Please note that it may evolve with user feedback and design updates. For data privacy, we've shortened the organization name to ***.
We are also working on a better prompt by referring to the specifics in OpenAI's guide “What are tokens and how to count them?”.
Streaming response in JS
Streaming responses aren't currently supported by the official OpenAI library (support is expected in the next major version, 4.0, and is already available in the v4 beta). We had to explore alternatives, and the most straightforward choice was 'openai-ext,' which also simplifies button state management.
Pinecone API latency
Pinecone mentions that some customers achieve query speeds of less than 100 ms, but we haven't been able to achieve the same level of speed through HTTP requests to the API.
We're still in the process of experimenting with methods to reach this kind of latency, which may involve gRPC implementation in Python or other unexplored approaches.
The OpenAI embedding API performance has shown some slowdown in recent weeks. During our initial testing, we observed response times ranging from approximately 250 to 500 ms. It has now become significantly slower.
While it remains a top-notch solution, its current speed doesn't align with the requirements of a search engine. We are hopeful that OpenAI will upgrade its servers to enable faster embedding generation.
Below is the current latency taken from the demo hosted on a remote server.
We've already experimented with quicker and more efficient methods for generating embeddings and conducting searches. We managed to achieve a response time of approximately 50 ms using open-source alternatives.
Next, we plan to continue our exploration by experimenting with more objective-oriented models. Interestingly, some benchmarks have shown that even state-of-the-art models can face challenges in specific domains. Models that might rank below them in general tests excel in particular areas. We'll further delve into open-source variations to uncover new possibilities.
We plan to launch a platform soon where you can play with all of these experiments. In conclusion, our journey to enhance the AI-driven search experience is an ongoing one, marked by experimentation and discovery. We will continue to share our insights in upcoming blogs.