Building Multimodal AI in Healthcare Using GPT and Qdrant

Sarthak Arora
10 min read · Oct 2, 2024


Image by AI about AI. Heh.

The accompanying code for this project is here: https://github.com/iasarthak/healthcare_multimodal_ai

AI is transforming everything — and fast. New tools and APIs drop almost daily, and staying updated is no easy feat. Luckily, many great wrappers (not rappers!) simplify the process, making it easier than ever to build AI-powered tools for any use case.

Recently, Abhishek and I attempted a project to do just that. And trust me — it was easier than you’d think! Our goal was to help radiologists by creating a tool that not only analyzes medical reports and images but also suggests possible next steps to provide the best care for their patients.

In technical terms, it’s a RAG (Retrieval-Augmented Generation) system. But for us, it’s more like a “RAG-to-riches” story, where “riches” means productivity. And who wouldn’t want that, especially when doctors are already putting in long hours? This project aims to give them an extra helping hand.

So, let’s dive into how we built it!

So, what is Multimodal AI and how are we implementing it?

Multimodal AI involves the integration of different types of data — such as text, images, and audio — into a single model capable of processing complex information in a way that mimics human cognition.

This approach allows the AI system to leverage multiple sources of information to gain a comprehensive understanding of a subject. For instance, in the healthcare domain, a multimodal model can analyze both medical images and textual patient records to deliver more accurate diagnostic insights.

The future of AI is inherently multimodal, as it aligns with how humans interpret and process information through various senses. This capability enables more robust, context-aware models that can excel in diverse applications such as healthcare, autonomous driving, and content generation.

In this project, we aim to demonstrate how to build a multimodal AI system using GPT-4o for natural language understanding and Qdrant for managing and querying vector embeddings. This system will combine text and image data to provide a powerful diagnostic tool for healthcare applications.

Pre-requisites

Before you begin, ensure that you have the following:
- Python 3.8+ installed.
- API Keys for GPT-4o and Qdrant.
- Installed dependencies such as fastembed, qdrant-client, openai, and gradio. (Don’t worry, these are also listed in the repo’s requirements.txt file; the link to the repo is at the end.)
- Dataset: Medical scans and corresponding patient records in text format. (Explained below.)
- Patience (this is the most important one, you know!)

Dataset

For this project, we are using the Radiology Objects in Context (ROCO) dataset, a large-scale medical imaging dataset that includes thousands of anonymized captions and corresponding medical scans.

The ROCO dataset is specifically tailored for radiological contexts, making it ideal for developing multimodal models that analyze both medical images and associated textual information.

In the repository, I’ve prepared a subset of the ROCO dataset in the data folder, which contains a variety of medical scans and related text descriptions. This sample includes a few images representing different medical conditions, sufficient for demonstrating the project’s capabilities.

You can access the full dataset here. After forking the repository, you can run the included script to download the dataset to your local machine.

Ensure you review the dataset’s licensing and accessibility guidelines for research purposes.
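
To get a concrete feel for the data layout, here is a small peek at the captions file, assuming the tab-separated image_id/caption format and the data folder layout that the embedding script later in this post expects:

import pandas as pd

# Peek at the sample captions file (tab-separated: image_id, caption),
# assuming you run this from the repository root with the sample data in place.
captions = pd.read_csv("data/captions.txt", sep="\t", header=None, names=["image_id", "caption"])
print(captions.head())
print(len(captions), "captions; the scans live in data/images/ with file names matching image_id")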

Architecture

Here’s a fairly intuitive architecture diagram that shows how our project is structured.

Steps to Implement Multimodal AI

1. Sign up for GPT-4o and Qdrant:

- GPT-4o: Create an account on the OpenAI platform and obtain an API key to access GPT-4o’s natural language processing capabilities. Refer to this guide to get the API key: https://www.merge.dev/blog/chatgpt-api-key. (A minimal sketch of one way to store the key in a config module follows this list.)
- Qdrant: Register on the Qdrant website to manage and query your vector embeddings efficiently. For this project we store the vector embeddings locally, and the Qdrant Python client is expressive enough to let us do this with a few commands.
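
The GPT client shown later in this post reads the key via Config.OPENAI_API_KEY from a config module. That file isn’t reproduced here, but a minimal sketch of what it might contain, assuming the key is stored in an environment variable, looks like this:

import os

class Config:
    # Minimal sketch of config.py (assumption: the key lives in an environment variable).
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")

Keeping the key in an environment variable avoids committing it to source control.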

2. Data Preparation and Embedding Creation:

Text Embeddings - To convert patient records into vector embeddings, we use the FastEmbed library. Each record is transformed into a vector representation, with metadata such as patient ID, diagnosis, and symptoms included in the payload.

Image Embeddings - To convert medical images into vector embeddings, we use the CLIP model. Each image is processed into a vector, and metadata such as scan type and diagnosis is stored along with it.

File: src/embeddings_utils.py


from typing import List
from fastembed import TextEmbedding, ImageEmbedding

TEXT_MODEL_NAME = "Qdrant/clip-ViT-B-32-text"
IMAGE_MODEL_NAME = "Qdrant/clip-ViT-B-32-vision"

def convert_text_to_embeddings(documents: List[str], embedding_model: str = TEXT_MODEL_NAME) -> List:
    text_embedding_model = TextEmbedding(model_name=embedding_model)
    text_embeddings = list(text_embedding_model.embed(documents))  # embed() returns a generator, so materialize it
    return text_embeddings

def convert_image_to_embeddings(images: List[str], embedding_model: str = IMAGE_MODEL_NAME) -> List:
    image_model = ImageEmbedding(model_name=embedding_model)
    images_embedded = list(image_model.embed(images))
    return images_embedded
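
A quick sanity check of these helpers, with an illustrative caption and a hypothetical image path, might look like this:

# Sanity check (the sample caption and image path are illustrative, not from the dataset).
text_vecs = convert_text_to_embeddings(["Chest X-ray showing mild cardiomegaly."])
image_vecs = convert_image_to_embeddings(["data/images/sample_scan.jpg"])  # hypothetical path
print(len(text_vecs[0]), len(image_vecs[0]))  # CLIP ViT-B/32 produces 512-dimensional vectors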

3. Storing Embeddings in Qdrant:

We store the text and image embeddings in the Qdrant vector database. The metadata associated with each embedding helps in performing accurate similarity searches later.

Named vectors like ‘text’ and ‘image’ refer to structured embeddings of each input modality. By naming them, we can easily distinguish between text and image data during processing. This ensures that the model handles each input appropriately and combines them effectively in tasks like retrieval or content generation, improving the system’s accuracy and coherence.

File: src/create_data_embeddings.py

import os
import uuid

import pandas as pd
from fastembed import TextEmbedding, ImageEmbedding
from qdrant_client import QdrantClient, models

from src.embeddings_utils import convert_text_to_embeddings, convert_image_to_embeddings, TEXT_MODEL_NAME, \
    IMAGE_MODEL_NAME

DATA_PATH = '/Users/sarthak/Documents/Work/Personal_Projects/healthcare_multimodal_ai/data/'

def create_uuid_from_image_id(image_id):
    NAMESPACE_UUID = uuid.UUID('12345678-1234-5678-1234-567812345678')
    return str(uuid.uuid5(NAMESPACE_UUID, image_id))

def create_embeddings(collection_name):
    # Read captions txt data
    path = DATA_PATH + 'captions.txt'
    caption_df = pd.read_csv(path, sep='\t', header=None, names=['image_id', 'caption'])

    # Read images
    image_directory = os.listdir(DATA_PATH + 'images')

    # Filter out images that are not in the captions
    images = []
    for image in image_directory:
        if image.split('.')[0] in caption_df['image_id'].values:
            images.append(image)

    # Create image_id, caption, image_path list of dictionaries
    image_docs = []
    for image in images:
        image_id = image.split('.')[0]
        caption = caption_df[caption_df['image_id'] == image_id]['caption'].values[0]
        image_path = DATA_PATH + 'images/' + image
        image_docs.append({'image_id': image_id, 'caption': caption, 'image_path': image_path})

    # Convert captions to embeddings using FastEmbed (CLIP text tower)
    captions = [doc['caption'] for doc in image_docs]
    embeddings = convert_text_to_embeddings(captions)
    for idx, embedding in enumerate(embeddings):
        image_docs[idx]['caption_embedding'] = embedding

    # Convert images to embeddings using CLIP (vision tower)
    image_embeddings = convert_image_to_embeddings([doc['image_path'] for doc in image_docs])
    for idx, embedding in enumerate(image_embeddings):
        image_docs[idx]['image_embedding'] = embedding

    # Save the embeddings to the vector database (in-memory Qdrant instance)
    client = QdrantClient(":memory:")

    text_model = TextEmbedding(model_name=TEXT_MODEL_NAME)
    text_embeddings_size = text_model._get_model_description(TEXT_MODEL_NAME)["dim"]

    image_model = ImageEmbedding(model_name=IMAGE_MODEL_NAME)
    image_embeddings_size = image_model._get_model_description(IMAGE_MODEL_NAME)["dim"]

    if not client.collection_exists(collection_name):
        client.create_collection(
            collection_name=collection_name,
            vectors_config={
                "image": models.VectorParams(size=image_embeddings_size, distance=models.Distance.COSINE),
                "text": models.VectorParams(size=text_embeddings_size, distance=models.Distance.COSINE),
            }
        )
    client.upload_points(
        collection_name=collection_name,
        points=[
            models.PointStruct(
                # Convert image_id to a deterministic UUID
                id=create_uuid_from_image_id(doc['image_id']),
                vector={
                    "text": doc['caption_embedding'],
                    "image": doc['image_embedding'],
                },
                payload={
                    "image_id": doc['image_id'],
                    "caption": doc['caption'],
                    "image_path": doc['image_path']
                }
            )
            for doc in image_docs
        ]
    )
    return client
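
One thing to note: QdrantClient(":memory:") keeps the collection in RAM, so the embeddings are rebuilt every time the process starts. That is fine for a demo; for anything larger you could point the client at a persistent Qdrant instance instead. Using the returned client is then straightforward:

# Build the in-memory collection once and reuse the returned client for all searches.
client = create_embeddings("medical_images_text")
print(client.count("medical_images_text").count)  # number of indexed image/caption pairs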

4. Query Conversion and Similarity Search:

We convert the user’s query (either text or image) into a vector embedding and perform a similarity search in the Qdrant database to find the most relevant text or images. This allows for both ‘query by text’ and ‘query by image,’ improving retrieval accuracy across modalities.

File: src/embeddings_utils.py

# Search for similar text and get corresponding images as well
def search_similar_text(collection_name, client, query, limit=3):
    text_model = TextEmbedding(model_name=TEXT_MODEL_NAME)
    search_query = text_model.embed([query])
    search_results = client.search(
        collection_name=collection_name,
        query_vector=('text', list(search_query)[0]),
        with_payload=['image_path', 'caption'],
        limit=limit,
    )
    return search_results

# Search for similar images and get corresponding text as well
def search_similar_image(collection_name, client, query_image_path, limit=3):
    # Convert the query image into an embedding using the same model used for image embeddings
    image_embedding_model = ImageEmbedding(model_name=IMAGE_MODEL_NAME)

    # Embed the provided query image (assumed to be a file path)
    query_image_embedding = list(image_embedding_model.embed([query_image_path]))[0]

    # Perform the similarity search in the Qdrant collection for image embeddings
    search_results = client.search(
        collection_name=collection_name,
        query_vector=('image', query_image_embedding),
        with_payload=['image_path', 'caption'],  # Fetch image paths and captions as metadata
        limit=limit,
    )
    return search_results

def merge_results(text_results, image_results):
    # Combine based on some metadata, or simply concatenate
    combined_results = text_results + image_results
    return combined_results
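
For reference, here is roughly how these helpers are exercised once the collection from the previous step exists; the query text and image path below are illustrative:

# Illustrative usage; client is the in-memory Qdrant client returned by create_embeddings().
hits = search_similar_text("medical_images_text", client, "pleural effusion on chest X-ray", limit=3)
for hit in hits:
    print(round(hit.score, 3), hit.payload["caption"], hit.payload["image_path"])

# Query by image instead of text (hypothetical path to a local scan), then merge both result sets.
image_hits = search_similar_image("medical_images_text", client, "data/images/sample_scan.jpg")
combined = merge_results(hits, image_hits)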

5. Response Generation with GPT-4o:

We use the retrieved data to prompt the GPT-4o model, generating a detailed response based on the relevant context.

Make sure you give the model a detailed, intuitive role. That helps it return more appropriate results.

File: src/gpt_utils.py

import base64
import json

import requests

from config import Config

class GPTClient:
    def __init__(self):
        self.api_key = Config.OPENAI_API_KEY
        self.api_url = "https://api.openai.com/v1/chat/completions"

    def query(self, prompt, retrieved_contexts, user_image=None):
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }

        chatbot_role = """
        You are a radiologist with 30 years of experience.
        You analyse medical scans and text, and help diagnose underlying issues.
        """

        # Initialize message structure with the system prompt
        messages = [
            {"role": "system", "content": chatbot_role},
            {"role": "user", "content": [
                {"type": "text", "text": prompt}
            ]}
        ]

        # Add the user-uploaded image (if any)
        if user_image:
            with open(user_image, "rb") as image_file:
                base64_image = base64.b64encode(image_file.read()).decode('utf-8')
            messages[1]["content"].append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            })

        messages[1]["content"].append({
            "type": "text",
            "text": "Additional context that you may use as a reference. Use it only if you feel it is relevant to "
                    "the case. "
                    "NOTE: These are not the patient's images. They are from other patients and can be used as a "
                    "reference, if required."
        })

        # Add the retrieved contexts, which should include both captions and corresponding images
        for context in retrieved_contexts:
            caption = context.payload['caption']
            image_path = context.payload['image_path']

            # Add the caption to the message
            messages[1]["content"].append({
                "type": "text",
                "text": f"Caption: {caption}"
            })

            # Add the corresponding image to the message
            with open(image_path, "rb") as image_file:
                base64_image = base64.b64encode(image_file.read()).decode('utf-8')
            messages[1]["content"].append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            })

        # Prepare the payload for the GPT API
        data = {
            "model": "gpt-4o",
            "messages": messages,
            "max_tokens": 600
        }

        # Send the request to the API
        response = requests.post(self.api_url, headers=headers, data=json.dumps(data))
        return response.json()

    def process_response(self, response):
        if 'choices' in response and len(response['choices']) > 0:
            return response['choices'][0]['message']['content']
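
As a rough standalone check, outside the full RAG flow, the client can also be called with an empty context list; the query and image path here are illustrative:

# Standalone sketch of the GPT client (no retrieved context; query and path are illustrative).
gpt = GPTClient()
raw = gpt.query(
    "What abnormalities do you see in this scan?",
    retrieved_contexts=[],
    user_image="data/images/sample_scan.jpg",
)
answer = gpt.process_response(raw)
print(answer if answer else raw.get("error"))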

6. Tying All the Pieces Together:

Here, we set up the multimodal RAG system class. The process_query method orchestrates the retrieval and passes the query, along with the retrieved context, to GPT-4o.

File: src/multimodal_rag_system.py

from src.create_data_embeddings import create_embeddings
from src.embeddings_utils import search_similar_text, search_similar_image, merge_results
from src.gpt_utils import GPTClient

COLLECTION_NAME = "medical_images_text"

class MultimodalRAGSystem:
    collection_name: str

    def __init__(self):
        self.gpt_client = GPTClient()
        self.qdrant_client = create_embeddings(COLLECTION_NAME)
        self.collection_name = COLLECTION_NAME

    def process_query(self, query, query_image_path=None, top_k=3):
        # 1. Text-based search on the user's query
        search_results_text = search_similar_text(self.collection_name, self.qdrant_client, query, limit=top_k)

        # 2. Image-based search if a query image is provided
        search_results_image = []
        if query_image_path:  # Only perform image retrieval if an image path is provided
            search_results_image = search_similar_image(self.collection_name, self.qdrant_client, query_image_path, limit=top_k)

        # 3. Combine both results - merging text and image results
        combined_results = merge_results(search_results_text, search_results_image)

        # 4. Query GPT with the context and images
        gpt_response = self.gpt_client.query(query, combined_results, query_image_path)

        # 5. Process and return the response
        return self.gpt_client.process_response(gpt_response)
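
Before wiring up the UI, the whole pipeline can be exercised from a Python shell; the query and scan path below are illustrative:

# Illustrative end-to-end call; constructing the system also builds the embeddings collection.
rag = MultimodalRAGSystem()
answer = rag.process_query(
    "65-year-old male with a persistent cough. What does this scan suggest?",
    query_image_path="data/images/sample_scan.jpg",  # hypothetical local scan
)
print(answer)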

7. Showcasing Results Using Gradio:

We use Gradio to create an interactive user interface where users can input text queries and upload medical images. The system will then provide a detailed response based on the multimodal data.

File: src/main.py

import gradio as gr
from PIL import Image
from multimodal_rag_system import MultimodalRAGSystem

# Initialize the MultimodalRAGSystem
system = MultimodalRAGSystem()

# Define the Gradio function that will process the user input and image
def chatbot_interface(user_query, user_image=None):
    if user_image is not None:
        # Convert numpy array to a PIL Image and save it temporarily
        user_image = Image.fromarray(user_image)
        user_image_path = ("/Users/sarthak/Documents/Work/Personal_Projects/healthcare_multimodal_ai/data"
                           "/user_input_image.jpg")
        user_image.save(user_image_path)
    else:
        user_image_path = None

    # Get the response from the Multimodal AI system
    response = system.process_query(user_query, query_image_path=user_image_path)
    return response

# Create the Gradio interface with text input and image input
interface = gr.Interface(
    fn=chatbot_interface,
    inputs=[gr.components.Textbox(lines=5, label="User Query"), gr.components.Image(label="Upload Medical Image")],
    outputs=gr.components.Textbox(),
    title="Multimodal Medical Assistant",
    description="Ask medical-related questions and upload relevant medical images for analysis."
)

# Launch the Gradio interface
if __name__ == "__main__":
    interface.launch()

Output

Voila! The screenshots below show our first attempts at uploading scans and getting a relevant analysis. In the background, the system augments the prompt with similar retrieved cases.

If no relevant image is found, the system falls back to a preliminary analysis and suggests potential next steps.

Neck Scan: Using Gradio interface for querying, uploading and getting responses related to the images/query
Chest Scan: Using Gradio interface for querying, uploading and getting responses related to the images/query
Knee Scan: Using Gradio interface for querying, uploading and getting responses related to the images/query

Next Steps

We can improve this further by adding as many historical cases as possible in the form of images and captions. For example, a diagnostic/imaging centre that has handled thousands of cases has a wealth of data that can be used to augment our queries and support more informed decisions about new cases.

There is a lot of tweaking that can be done to the prompt as well, which would depend on the use-case of the application.

If applied right, this could prove to be a fantastic tool for radiologists, with the potential to cut case turnaround time in half.

References and Resources

Github Repo: https://github.com/iasarthak/healthcare_multimodal_ai

GPT-4 Documentation: https://www.openai.com/research

Qdrant Documentation: https://qdrant.tech/documentation/

FastEmbed Library: https://github.com/qdrant/fastembed

CLIP Paper: https://arxiv.org/abs/2103.00020

A big shoutout to Abhishek Maurya for being a huge help here. He’s worked with me as an intern at Jupiter and I’ve had the most wonderful experience mentoring him.

If you’re here, do give me a follow on Medium and connect with me on LinkedIn to chat more about Data Science!
