import ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from chromadb.config import Settings
from chromadb import Client
from langchain.vectorstores import Chroma
import gradio as gr
import re
from concurrent.futures import ThreadPoolExecutor
localRAGchatbot
You should not try to render this file until you have run the code in the accompanying slides. You must have Ollama and DeepSeek-R1 installed on your machine, as described in the lecture slides. You must also be running Ollama via ollama serve in a separate terminal window.
We will follow a modification of the tutorial by Aashi Dutt at https://www.datacamp.com/tutorial/deepseek-r1-rag
Installations
You must install at least the following packages (maybe more; I have over 300 Python packages installed on my machine, and may just happen to have some required packages already installed). Note that, when you render this file, the following chunk will not be executed. This is because the keyword bash at the beginning of the chunk is not in curly braces. The reason for this is that the chunk should only be run once; the other chunks need to be run every time you want to run the code. You should copy and paste this first chunk into your terminal and run it.
pip install langchain chromadb gradio ollama pymupdf langchain_ollama langchain_chroma
pip install -U langchain-community
Unfortunately, I experienced some problems with the packages. I ran pip-upgrade, which is defined as

alias pip-upgrade="pip list -o | cut -f1 -d' ' | tr ' ' '\n' | awk '{if(NR>=3)print}' | cut -d' ' -f1 | xargs -n1 pip install -U"

in my .bash_profile. Somewhere in a kadillion lines of output was an error message that I did not catch at first, saying that my version of pydantic was incompatible with my version of pydantic-core. You can check your versions by running pip list. You may need to downgrade pydantic-core to 2.27.2 as I did. You can do this by running pip install pydantic-core==2.27.2 after uninstalling pydantic-core with pip uninstall pydantic-core. Why would you need to do this? I found I was getting a lot of errors from pydantic that I could not resolve and that were not mentioned online. That's when I looked back at the pip-upgrade output and found the error message.
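If you would rather check the versions from Python than scan the pip list output, the following short sketch (my own addition; it assumes both packages expose a __version__ attribute, which recent releases do) prints them directly:

import pydantic
import pydantic_core

# Print the installed versions so you can confirm they are compatible
print("pydantic:", pydantic.__version__)
print("pydantic-core:", pydantic_core.__version__)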
Package imports
You must import the following packages but be careful to just run this chunk first and watch the output. A couple of the functions are deprecated but still work. That may change and the error messages should give clues about what to do in that case.
The document
The whole point of this tutorial is to create a chatbot that can answer questions about the book Foundations of LLMs by Xiao and Zhu. The following chunk loads the document. It assumes you have downloaded the document and saved it in the same directory as this file. You then split it into smaller chunks, use DeepSeek-R1 to generate embeddings, and store the embeddings in a vector store.
Packages
You will use a number of packages to accomplish this. First is PyMuPDFLoader, which allows you to load a PDF document. You can read more about it at https://pymupdf.readthedocs.io/en/latest/rag.html. It has many more capabilities we won't use and handles many other filetypes. We're just going to load a well-behaved PDF.
Second, we split the text into chunks. The RecursiveCharacterTextSplitter is a very popular tool provided by LangChain for, as you might guess, splitting large amounts of text, such as our 231-page book, into small chunks. LangChain is a popular framework for building LLM applications; it orchestrates the various tools you use. You can read more about it at https://python.langchain.com/docs/.
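To get a feel for how chunk_size and chunk_overlap behave before splitting the whole book, here is a minimal sketch on a made-up string (not part of the main pipeline). Note that split_text works on raw strings, whereas split_documents, used below, works on loaded documents:

# Toy example: split a short string into overlapping chunks
toy_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
toy_text = "Large language models are trained on vast corpora of text. " * 5
toy_chunks = toy_splitter.split_text(toy_text)
print(len(toy_chunks))  # number of chunks produced
print(toy_chunks[0])    # first chunk, roughly chunk_size characters long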
Next we generate embeddings for each chunk, using DeepSeek-R1. The actual generation process is quite time-consuming on a typical laptop. You can see in the comments that it takes seven minutes on my machine and that you can tell when it's done by watching the window running ollama serve.
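If you want to see what a single embedding looks like before committing to the full run, you can embed one sentence. This is just a sketch; it assumes ollama serve is running and the deepseek-r1 model has already been pulled:

from langchain_community.embeddings import OllamaEmbeddings

# Embed one short string and inspect the result: a plain list of floats
emb = OllamaEmbeddings(model="deepseek-r1")
test_vector = emb.embed_query("What is a transformer?")
print(len(test_vector))   # the embedding dimension
print(test_vector[:5])    # first few components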
What are we going to do with these embeddings, which constitute a vector representation of the text? We will use them to create a vector database, which we will use to answer questions about the book. For this, we’ll use ChromaDB, a popular vector database. You can read more about it at https://docs.trychroma.com/.
The setup code
# Step 1: Load the document using PyMuPDFLoader
loader = PyMuPDFLoader("Xiao2025.pdf")
documents = loader.load()

# Step 2: Split text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Step 3: Initialize Ollama embeddings
embedding_function = OllamaEmbeddings(model="deepseek-r1")

# Step 4: Parallelize embedding generation
def generate_embedding(chunk):
    return embedding_function.embed_query(chunk.page_content)

# This next line takes about 7 minutes on my M1 Macbook Pro with 32GB RAM;
# you can tell when it's done by watching the window running `ollama serve`
with ThreadPoolExecutor() as executor:
    embeddings = list(executor.map(generate_embedding, chunks))

# Step 5: Recreate the collection
client = Client(Settings())
# client.delete_collection(
#     name="foundations_of_llms"
# )  # Delete any existing collection if needed
collection = client.create_collection(name="foundations_of_llms")

# Step 6: Add documents and embeddings to Chroma
for idx, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk.page_content],
        metadatas=[{"id": idx}],
        embeddings=[embeddings[idx]],
        ids=[str(idx)],  # Ensure IDs are strings
    )

print("Embeddings stored successfully!")
/var/folders/hg/8vdwcz0s305br3gmnqmf1rrw0000gs/T/ipykernel_90862/3000413336.py:10: LangChainDeprecationWarning: The class `OllamaEmbeddings` was deprecated in LangChain 0.3.1 and will be removed in 1.0.0. An updated version of the class exists in the :class:`~langchain-ollama package and should be used instead. To use it run `pip install -U :class:`~langchain-ollama` and import as `from :class:`~langchain_ollama import OllamaEmbeddings``.
embedding_function = OllamaEmbeddings(model="deepseek-r1")
Embeddings stored successfully!
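Once the chunk above has finished, a quick sanity check on the collection is reassuring. This sketch is my own addition, not part of the tutorial; it reuses the client, collection, and embedding_function created above:

# How many chunks made it into the collection?
print(collection.count())

# Retrieve the three chunks nearest to a test question
query_vector = embedding_function.embed_query("What is attention in a transformer?")
results = collection.query(query_embeddings=[query_vector], n_results=3)
print(results["documents"][0][0][:300])  # first 300 characters of the closest chunk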
Retrieve the context to answer the question
Next you have to initialize a retriever that will retrieve the context from the vector store.
# Initialize retriever using the Chroma collection
retriever = Chroma(
    collection_name="foundations_of_llms",
    client=client,
    embedding_function=embedding_function,
).as_retriever()


def retrieve_context(question):
    results = retriever.get_relevant_documents(question)
    context = "\n\n".join([doc.page_content for doc in results])
    return context
/var/folders/hg/8vdwcz0s305br3gmnqmf1rrw0000gs/T/ipykernel_90862/403020567.py:3: LangChainDeprecationWarning: The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-chroma package and should be used instead. To use it run `pip install -U :class:`~langchain-chroma` and import as `from :class:`~langchain_chroma import Chroma``.
retriever = Chroma(
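You can exercise the retriever on its own before wiring it into the full pipeline. For example (assuming the collection above has been populated):

# Pull the most relevant chunks for a sample question and preview them
sample_context = retrieve_context("How does retrieval-augmented generation work?")
print(sample_context[:500])  # first 500 characters of the retrieved context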
Create the prompt
Next, create the prompt, send it to DeepSeek-R1 using Ollama, and obtain a response. If this doesn't work, it may mean that you are not running Ollama via ollama serve in a separate terminal window.
def query_deepseek(question, context):
    # Format the input as a structured prompt
    formatted_prompt = f"Question: {question}\n\nContext: {context}"

    # Send the prompt to DeepSeek-R1 using Ollama
    response = ollama.chat(
        model="deepseek-r1", messages=[{"role": "user", "content": formatted_prompt}]
    )

    # Extract and clean the response (strip DeepSeek-R1's <think>...</think> reasoning)
    response_content = response["message"]["content"]
    final_answer = re.sub(
        r"<think>.*?</think>", "", response_content, flags=re.DOTALL
    ).strip()
    return final_answer
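A quick direct call with a hand-written context string shows what the function returns once the <think> block has been stripped. The context below is made up purely for illustration:

# Call the model directly with a toy context
toy_context = "Retrieval-augmented generation (RAG) combines a retriever with a text generator."
print(query_deepseek("What is RAG?", toy_context))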
Define the RAG pipeline
Next, define the RAG (retrieval augmented generation) pipeline.
def rag_pipeline(question):
    # Retrieve context from the vector store
    context = retrieve_context(question)

    # Generate an answer using DeepSeek-R1
    answer = query_deepseek(question, context)
    return answer
Actually run the RAG pipeline.
def ask_question(question):
    # Run the RAG pipeline
    return rag_pipeline(question)
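Before launching the web interface, you can try the whole pipeline once from the console; expect a noticeable wait, since both retrieval and generation run on the local model:

# End-to-end test of the RAG pipeline
print(ask_question("Who are the authors of Foundations of LLMs?"))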
Gradio interface
Finally, create a Gradio interface for the chatbot. This will appear as a web page in your browser. The user can enter as many prompts as they like. Gradio is a popular library for creating web interfaces for machine learning models. You can read more about it at https://gradio.app/.
# Create a Gradio interface
interface = gr.Interface(
    fn=ask_question,
    inputs="text",
    outputs="text",
    title="RAG Chatbot: Foundations of LLMs",
    description="Ask any question about the Foundations of LLMs book. Powered by DeepSeek-R1.",
)
Note that I have commented out the following line that actually launches the Gradio interface. This is only for rendering purposes: it should be commented out when you render the document, and uncommented when you want to actually run the code. The interface will then appear in your browser if you point to http://localhost:7860.
# Launch the Gradio app
# interface.launch(debug=True)
Here is a screenshot of the interface, answering a simple question.