Building an AI-Powered Document Retrieval System with Docling and Granite 3.1
A recipe from Granite
In this recipe, you’ll learn to harness the power of advanced tools to build AI-powered document retrieval systems. It will guide you through:
- Document Processing: Learn to handle documents from various sources, parse and transform them into usable formats, and store them in vector databases using Docling.
- Retrieval-Augmented Generation (RAG): Understand how to connect large language models (LLMs) like Granite 3.1 with external knowledge bases to enhance query responses and generate valuable insights.
- LangChain for Workflow Integration: Discover how to use LangChain to streamline and orchestrate document processing and retrieval workflows, enabling seamless interaction between different components of the system.
This recipe leverages three cutting-edge technologies:
- Docling: An open-source toolkit for parsing and converting documents.
- Granite™ 3.1: A state-of-the-art LLM available via an API through Replicate, providing robust natural language capabilities.
- LangChain: A powerful framework for building applications powered by language models, designed to simplify complex workflows and integrate external tools seamlessly.
By the end of this recipe, you will:
- Gain proficiency in document processing and chunking.
- Integrate vector databases to enhance retrieval capabilities.
- Utilize RAG to perform efficient and accurate data retrieval for real-world applications.
This recipe is designed for AI developers, researchers, and enthusiasts looking to enhance their knowledge of document management and advanced NLP techniques.
Prerequisites
- Familiarity with Python programming.
- Basic understanding of large language models and natural language processing concepts.
Step 1: Setting up the environment
Ensure you are running Python 3.10 or 3.11 in a freshly created virtual environment.
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 12), "Use Python 3.10 or 3.11 to run this notebook."
Step 2: Install dependencies
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
transformers \
langchain_community \
langchain_huggingface \
langchain_milvus \
docling \
replicate
Step 3: Selecting System Components
Choose your Embeddings Model
Specify the model to use for generating embedding vectors from text. Here we will be using one of the Granite Embedding models.
To use a model from another provider, replace this code cell with one from this Embeddings Model recipe.
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
# Tokenizer for the Granite 3.1 instruct model
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")
# Embedding model and its tokenizer; the tokenizer is also used later to size document chunks
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
embeddings_tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-embedding-30m-english")
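As an optional sanity check (an addition to the recipe, with an arbitrary sample string), you can embed a short piece of text and confirm the model returns a fixed-length vector:
# Optional sanity check: embed a short query and inspect the vector dimensionality.
sample_vector = embeddings_model.embed_query("mixed martial arts weight classes")
print(f"Embedding dimension: {len(sample_vector)}")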
Use the Granite 3.1 model
Select a Granite model from the ibm-granite org on Replicate. Here we use the Replicate LangChain client to connect to the model.
To get set up with Replicate, see Getting Started with Replicate.
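The model cell below reads your API token with get_env_var. If you prefer, you can also place the token in the process environment yourself beforehand; this is a minimal optional sketch, and the placeholder string must be replaced with your own token.
# Optional: make your Replicate API token available to this session.
# Replace the placeholder with your own token, or export REPLICATE_API_TOKEN before launching the notebook.
import os
os.environ.setdefault("REPLICATE_API_TOKEN", "<your-replicate-api-token>")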
To connect to a model on a provider other than Replicate, substitute this code cell with one from the LLM component recipe.
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var
model = Replicate(
    model="ibm-granite/granite-3.1-8b-instruct",
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000,  # Set the maximum number of tokens to generate as output.
        "min_tokens": 100,  # Set the minimum number of tokens to generate as output.
    },
)
Now that we have connected to the model, let's try asking it a question.
query = "Who won in the Pantoja vs Asakura fight at UFC 310?"
prompt_guide_template = """\
<|start_of_role|>user<|end_of_role|>{prompt}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""
prompt = prompt_guide_template.format(prompt=query)
output = model.invoke(prompt)
print(output)
Now, I know that UFC 310 happened in 2024, and this does not seem to be the right Pantoja. The model doesn't seem to know the answer, but at least it understands that this matchup did not occur. Let's see if it has some specific UFC rules information.
query1 = "How much weight allowance is allowed in non championship fights in the UFC?"
prompt = prompt_guide_template.format(prompt=query1)
output = model.invoke(prompt)
print(output)
Based on the official UFC rules, this is also incorrect. Let's try getting some documents that contain this information for the model.
Choose your Vector Database
Specify the database to use for storing and retrieving embedding vectors.
To connect to a vector database other than Milvus, replace this code cell with one from this Vector Store recipe.
from langchain_milvus import Milvus
import tempfile
db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")
vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)
Step 4: Building the Vector Database
In this example, we use Docling to convert a set of source documents into text, split the text into chunks, derive embedding vectors using the embedding model, and load the chunks into the vector database. Creating this vector database allows us to easily search across our documents, enabling us to use RAG.
Use Docling to download the documents, convert to text, and split into chunks
Here we have found a website that gives us information on UFC 310, as well as a PDF of the official UFC rules. Below, we will see that Docling can both convert and chunk the two documents.
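If you would like to see the conversion step in isolation first, here is a minimal optional sketch that converts one of the sources used below and prints the start of the Markdown Docling produces; it is not required for the pipeline that follows.
# Optional: convert a single document with Docling and preview the resulting Markdown.
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert(source="https://media.ufc.tv/discover-ufc/Unified_Rules_MMA.pdf")
print(result.document.export_to_markdown()[:500])  # first 500 characters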
# Docling imports
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.labels import DocItemLabel
from langchain_core.documents import Document
# Here are our documents, feel free to add more documents in formats that Docling supports
sources = [
    "https://www.ufc.com/news/main-card-results-highlights-winner-interviews-ufc-310-pantoja-vs-asakura",
    "https://media.ufc.tv/discover-ufc/Unified_Rules_MMA.pdf",
]
converter = DocumentConverter()
# Convert and chunk our documents
i = 0
texts: list[Document] = [
    Document(page_content=chunk.text, metadata={"doc_id": (i := i + 1), "source": source})
    for source in sources
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(converter.convert(source=source).document)
    # Keep only chunks that contain text or paragraph items
    if any(filter(lambda c: c.label in [DocItemLabel.TEXT, DocItemLabel.PARAGRAPH], iter(chunk.meta.doc_items)))
]
print(f"{len(texts)} document chunks created")
# Print all created documents
for document in texts:
    print(f"Document ID: {document.metadata['doc_id']}")
    print(f"Source: {document.metadata['source']}")
    print(f"Content:\n{document.page_content}")
    print("=" * 80)  # Separator for clarity
Populate the vector database
NOTE: Population of the vector database may take over a minute depending on your embedding model and service.
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")
Step 5: RAG with Granite
Now that we have successfully converted and vectorized our documents, we can set up our RAG pipeline.
Retrieve relevant chunks
Here we will test the as_retriever method by searching our newly created vector database for chunks that are relevant to our original query.
retriever = vector_db.as_retriever()
docs = retriever.invoke(query)
print(docs)
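As an optional variation (not part of the original flow), as_retriever also accepts search_kwargs if you want explicit control over how many chunks are returned:
# Optional: limit retrieval to the top 4 most similar chunks.
retriever = vector_db.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke(query)
print(f"Retrieved {len(docs)} chunks")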
Looks like it pulled some chunks that contain the information we are looking for. Let's go ahead and construct our RAG pipeline.
Create the prompt for Granite 3.1
Next, we construct the prompt pipeline. This creates the prompt that holds the retrieved chunks from our previous search and feeds them to the model as context for answering our question.
from langchain.prompts import PromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# Create a prompt template for question-answering with the retrieved context.
prompt_template = """<|start_of_role|>system<|end_of_role|>\
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{input}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""
# Assemble the retrieval-augmented generation chain.
qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(model, qa_chain_prompt)
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)
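If you want to see roughly what the model will receive, you can render the prompt template with placeholder values. This is an optional check, and the placeholder strings are purely illustrative.
# Optional: preview the assembled prompt with placeholder values.
print(qa_chain_prompt.format(context="<retrieved chunks go here>", input="<your question>"))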
Generate a retrieval-augmented response to a question
Using the chunks from the similarity search as context, we invoke the RAG chain with our original question. The chain returns a dictionary, and we print its answer field, which contains the model's response.
output = rag_chain.invoke({"input": query})
print(output['answer'])
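The chain's output dictionary also includes the retrieved documents under the context key, which is handy for checking which sources the answer drew on:
# Optional: see which document chunks were passed to the model as context.
for doc in output["context"]:
    print(doc.metadata["source"])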
Awesome! It looks like the model figured out our first question. Let's see if it can figure out the rule we were looking for.
output = rag_chain.invoke({"input": query1})
print(output['answer'])
Awesome! We can now see that we have created a pipeline that can successfully leverage knowledge from multiple document types for generation.
Next Steps
- Explore advanced RAG workflows for other industries.
- Experiment with other document types and larger datasets.
- Optimize prompt engineering for better Granite responses.
Thank you for using this recipe!