Implementing a Full Stack Production RAG with Docker, Ray, Qdrant and LM Studio

Sarthak Arora
6 min read · May 14, 2024

Introduction

In this tutorial, I will deploy a full stack Retrieval Augmented Generation (RAG) AI application using Docker, which can help developers understand how to deploy applications in production. My use case is text extraction and processing from PDF documents. To serve LLMs locally, I will use LM Studio.

System Architecture

  • Qdrant: Vector store for storing and retrieving text embeddings (a one-line way to start it locally is shown right after this list).
  • Ray: Handles parallel processing for generating embeddings, which is often the most computationally intensive part of the application.
  • LM Studio: Serves the LLM locally over an OpenAI-compatible HTTP API.
  • Gradio: Helps create user interfaces quickly.
  • Langchain: Orchestration framework that ties together the various AI components, such as LLMs, vector stores, and knowledge graphs.
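
The steps below assume a Qdrant instance is already reachable at http://localhost:6333. If you do not have one running, a common way to start it locally (an assumption on my part; any Qdrant deployment listening on port 6333 works) is the official Docker image:

docker run -p 6333:6333 qdrant/qdrant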

Python Implementation

STEP 0: Pre-requisites

Below are the 10 essential packages, along with their uses, that are required to replicate this project (a sample requirements.txt is sketched after the list).

  1. ray — For distributed and parallel computation, crucial for handling intensive tasks like text embedding generation.
  2. qdrant-client — To interact with the Qdrant vector database for storing and querying embeddings.
  3. gradio — To build interactive web interfaces that allow users to interact with your machine learning model.
  4. transformers — Provides pre-trained models and utilities for natural language processing tasks.
  5. PyPDF2 — Enables the extraction of text from PDF files, which is necessary for processing document data.
  6. nltk — Useful for text processing tasks such as tokenization, stemming, and removing stopwords.
  7. torch — PyTorch library, essential for building and running deep learning models.
  8. numpy — A fundamental package for scientific computing with Python, often used for data manipulation and analysis.
  9. sentence-transformers — A library for computing dense vector representations for sentences and paragraphs, based on transformer models.
  10. docker — For creating and managing application containers.
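
For convenience, a minimal requirements.txt covering the Python packages above might look like the following (Docker itself is installed separately, and in practice you should pin the exact versions you tested with; the langchain packages are included because the imports in Step 1 use them):

ray
qdrant-client
gradio
transformers
PyPDF2
nltk
torch
numpy
sentence-transformers
langchain
langchain-openai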

STEP 1: Import the necessary libraries

import os
import io
import re

import pandas as pd
import PyPDF2
import nltk
import torch
import ray
import gradio as gr

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from transformers import AutoModel, AutoTokenizer

from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.history_aware_retriever import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
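
The NLTK-based preprocessing in Step 4 relies on corpora that are not bundled with the library, so download them once before running the app (these are the standard corpus names for tokenization, stopword removal, and lemmatization):

# One-time download of the NLTK data used by the preprocessing step
nltk.download('punkt')      # tokenizer models for word_tokenize
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer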

STEP 2: Utilizing Ray for Parallel Processing

Ray is used here to parallelize the generation of text embeddings. Tasks are distributed across multiple workers to speed up the computation, which is crucial for handling large datasets efficiently. Note that calling ray.init() more than once in the same process raises an error, so initialize it only once.

ray.init()


@ray.remote
def compute_embeddings(doc, model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Check if CUDA is available and set the device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Move the model to the specified device

    inputs = tokenizer(doc, return_tensors='pt', padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    embeddings = outputs.last_hidden_state[:, 0, :].squeeze()
    return embeddings.cpu().numpy()  # Move embeddings back to CPU if needed and convert to a numpy array


def create_db(docs):
    print('Ray has been initiated')
    model_name = "BAAI/bge-large-en"

    # Dispatch the embedding computation tasks to Ray workers
    embedding_futures = [compute_embeddings.remote(doc, model_name) for doc in docs]

    # Retrieve results from Ray's object store
    embeddings = ray.get(embedding_futures)

    # Keep each text together with its Ray-computed embedding; this structure is
    # not consumed by the LangChain call below, but it is handy if you upsert the
    # vectors yourself (see the qdrant-client sketch after this listing)
    documents_with_embeddings = [{"text": doc, "embeddings": emb} for doc, emb in zip(docs, embeddings)]

    url = "http://localhost:6333"

    # Qdrant.from_texts expects an Embeddings object rather than precomputed
    # vectors, so the same BGE model is passed via LangChain's wrapper
    embedding_model = HuggingFaceBgeEmbeddings(model_name=model_name)
    vectorStore = Qdrant.from_texts(
        docs, embedding_model, url=url, prefer_grpc=False, collection_name="vector_db_2"
    )
    print("Vector DB Created successfully")

    ray.shutdown()
    print('Ray has been shut down')
    return vectorStore
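
Because Qdrant.from_texts re-embeds the texts internally, the Ray-computed vectors above are not stored as-is. If you would rather upsert them directly, a sketch using qdrant-client is shown below; the 1024-dimensional vector size of BAAI/bge-large-en and the page_content payload key (the default key LangChain's Qdrant wrapper reads) are assumptions based on this setup, not part of the original pipeline.

from qdrant_client import QdrantClient
from qdrant_client.http import models

def upsert_precomputed(docs, embeddings, collection_name="vector_db_2"):
    """Store the Ray-computed vectors in Qdrant without re-embedding."""
    client = QdrantClient(url="http://localhost:6333")

    # BAAI/bge-large-en produces 1024-dimensional embeddings
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    )

    client.upsert(
        collection_name=collection_name,
        points=[
            models.PointStruct(
                id=i,
                vector=emb.tolist(),
                payload={"page_content": doc},  # payload key LangChain's Qdrant wrapper expects
            )
            for i, (doc, emb) in enumerate(zip(docs, embeddings))
        ],
    )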

STEP 3: PDF Text Extraction

Let’s extract text from PDF documents using PyPDF2. This function iterates through the PDF pages and concatenates their text into a single string (a quick local test is shown after the listing).

def extract_text_from_pdf(pdf_content):
    """Extract text from a PDF file content."""
    pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_content))
    pdf_text = ""
    for page in pdf_reader.pages:
        page_text = page.extract_text()
        pdf_text += page_text if page_text else ""
    return pdf_text
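
To sanity-check the extraction outside Gradio, you can feed the function raw bytes read from disk; the file name below is just a placeholder:

# Quick local test of the extraction step; "sample.pdf" is a placeholder path
with open("sample.pdf", "rb") as f:
    raw_text = extract_text_from_pdf(f.read())
print(raw_text[:500])  # inspect the first 500 characters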


STEP 4: Text Preprocessing

Now let’s remove figures, tables, and in-text citations, and then apply tokenization, stemming, lemmatization, and stopword removal (a sketch of the remove_figures_tables_citations helper follows the listing).

def preprocess_text(text):
    """Preprocess the text by removing stop words, stemming, and lemmatizing."""
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word.lower() not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens]
    return ' '.join(lemmatized_tokens)


def final_result(pdf_file, query):
    # Extract text from the PDF file content
    pdf_text = extract_text_from_pdf(pdf_file)

    # Clean and preprocess the extracted text
    cleaned_text = remove_figures_tables_citations(pdf_text)
    preprocessed_text = preprocess_text(cleaned_text)

    # Create a list of documents (currently just one document)
    docs = [preprocessed_text]

    # Create the vector database; the query is collected by the interface but
    # only comes into play once a retrieval/LLM answering step is wired in
    vectorStore = create_db(docs)

    return "Processed and created the vector database successfully!"

STEP 5: Deploying Mistral-7b on LM Studio

- Install LM Studio: Begin by downloading the installer from [https://lmstudio.ai](https://lmstudio.ai).

- Download an LLM: Open LM Studio, then search for and download an LLM such as “TheBloke/Mistral-7B-Instruct-v0.2-GGUF” (approximately 4GB in size). LM Studio only accepts models in the GGUF format, since that is the format required by llama.cpp, the library it uses under the hood to run models.

- Navigate to the Local Server: Click on the Local Server tab (icon with <->) on the left.

- Load the LLM: Select the downloaded LLM from the dropdown menu.

- Start the Server: Click the green ‘Start Server’ button to initiate the server. This launches an OpenAI-compatible endpoint at “http://localhost:1234” (LM Studio’s default port; port 6333 is used by Qdrant). A sketch of connecting LangChain to this endpoint follows this list.

- Minimize LM Studio: Once the server is running, you can minimize the app. The server will continue to operate and handle API requests.
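
Since LM Studio exposes an OpenAI-compatible API, the ChatOpenAI class imported in Step 1 can be pointed at the local server. The snippet below is a minimal sketch; the model identifier and the placeholder API key are assumptions that depend on what you have loaded in LM Studio:

# Point LangChain's ChatOpenAI at the local LM Studio server instead of OpenAI
llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local endpoint
    api_key="lm-studio",                  # LM Studio does not validate the key
    model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # must match the model loaded in LM Studio
    temperature=0.2,
)

response = llm.invoke("Summarize what Retrieval Augmented Generation is in one sentence.")
print(response.content)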

STEP 6: Gradio Web Interface

This step implements an interface for uploading a PDF and entering a query, and displays the processing result on a web page.

# Gradio Interface Setup
interface = gr.Interface(
    fn=final_result,
    inputs=[
        gr.File(type="binary", label="Upload a PDF"),
        gr.Textbox(lines=2, label="Enter your query")
    ],
    outputs="text",
    title="Process PDF and Query",
    description="Upload a PDF file and enter a query. The app will process the PDF and perform some operations based on the query."
)


# Launch the interface; bind to 0.0.0.0 so it is reachable from outside the Docker container
if __name__ == "__main__":
    interface.launch(server_name="0.0.0.0", server_port=7860)

Docker Deployment

The Dockerfile configures the environment for the web application, as illustrated below:

# Start with the Python 3.10 image
FROM python:3.10


# Set the working directory to /webapp
WORKDIR /webapp


# Copy the requirements file, which lists every library (with versions) needed to build the image
COPY requirements.txt .


RUN pip install -r requirements.txt


# Copy the Python script generated from your notebook
COPY webapp.py .


# Expose the Gradio port (7860, its default) along with the Qdrant (6333) and LM Studio (1234) ports
EXPOSE 7860
EXPOSE 6333
EXPOSE 1234
# Command to run your Gradio app
CMD ["python", "webapp.py"]

Make sure the above is saved as Dockerfile without any file extension in your project directory. Also, ensure you have a requirements.txt file in the same directory, listing all the Python packages your application needs.

Open your terminal or command prompt and navigate to the directory containing your Dockerfile and requirements.txt. Use the following command to build the Docker image:

docker build -t webapp-image .

Here, -t webapp-image tags the image with the name webapp-image, and the . at the end tells Docker to use the Dockerfile in the current directory.

Once the image is built, you can run a container from it. Since you expose multiple ports (7860, 6333, and 1234) in your Dockerfile, you should map these ports to ports on your host machine.

docker run -p 7860:7860 -p 6333:6333 -p 1234:1234 webapp-image

Local Deployment with LM Studio

In a local deployment, the main configuration task is making sure the three services can reach each other: the Gradio container must be able to call Qdrant on port 6333 and LM Studio’s server on port 1234, both of which run on the host rather than inside the container. On Docker Desktop this is usually done by addressing the host as host.docker.internal instead of localhost; on Linux the same hostname can be mapped explicitly. This setup lets developers manage resources and tweak each component independently for optimal performance; an example run command is shown below.
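
For example, the run command below (an assumption, not a required setup) maps host.docker.internal to the host gateway on Linux; the application code would then use host.docker.internal instead of localhost when connecting to Qdrant (port 6333) and LM Studio (port 1234):

docker run \
  --add-host=host.docker.internal:host-gateway \
  -p 7860:7860 \
  webapp-image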


Conclusion

This deployment strategy using Docker and Ray provides a scalable, efficient solution for document processing applications. The use of Ray for parallel processing significantly enhances the system’s ability to handle large volumes of data, making it an ideal choice for enterprise-level applications.


A big shoutout to Abhishek Maurya for practically doing everything here! He’s worked with me as an intern at Jupiter and I’ve had the most wonderful experience mentoring him.

If you’re here, do give me a follow on Medium and connect with me on LinkedIn to chat more about Data Science!
