Build a containerized PDF to Markdown converter using Docling and Streamlit ☁️

Alain Airom (Ayrom)
5 min read · Feb 4, 2025

Introduction

In a recent discussion, a business partner asked me to help them start using Docling to prepare documents for a RAG (retrieval-augmented generation) pipeline. Project details aside, the application is meant to run on a Kubernetes cluster.

This sample walks through the steps to create a Python application that uses Docling's document conversion capabilities, with Streamlit for the web UI. The application can then be containerized and deployed on a k8s cluster.

What is Docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

Features

  • 🗂️ Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
  • 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
  • 🧬 Unified, expressive DoclingDocument representation format
  • ↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
  • 🔒 Local execution capabilities for sensitive data and air-gapped environments
  • 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
  • 🔍 Extensive OCR support for scanned PDFs and images
  • 💻 Simple and convenient CLI
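
As a minimal sketch of the conversion path these features describe (not the application itself), the snippet below turns one document into a Markdown file stored next to it. DocumentConverter, convert and export_to_markdown are the same Docling calls used in the application further down; the helper names markdown_path_for and convert_to_markdown are made up for illustration, and the Docling import is deferred so the module can be loaded even where Docling is not installed.

```python
from pathlib import Path


def markdown_path_for(source: Path) -> Path:
    """Return the path of the Markdown file written next to `source`."""
    return source.with_suffix(".md")


def convert_to_markdown(source: Path) -> Path:
    """Convert `source` with Docling's default settings and save the Markdown."""
    # Deferred import: Docling is a heavy dependency, so it is only loaded
    # when a conversion is actually requested.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    document = converter.convert(source).document
    target = markdown_path_for(source)
    target.write_text(document.export_to_markdown(), encoding="utf-8")
    return target
```

Calling convert_to_markdown(Path("report.pdf")) would write report.md in the same folder; "report.pdf" is only a placeholder name.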

The sample application

Disclaimer: this is a deliberately simple implementation, meant to showcase Docling's capabilities. It should be hardened before being used in a scaled or production-grade application.

On Intel machines (an Intel Mac in my case), some troubleshooting is most likely needed to get the configuration running, and a Conda environment will probably be required. Here is what I had to do on my laptop.

# conda installation
v=3; f=Miniconda${v}-latest-MacOSX-x86_64.sh; cd $TMPDIR; { curl -LfOsS https://repo.anaconda.com/miniconda/$f ; cd -; } && bash $TMPDIR/$f -b && echo ". ~/miniconda${v}/etc/profile.d/conda.sh" >> ~/.bash_profile; . ~/miniconda${v}/bin/activate

# initialization
conda init


source ~/.bashrc
conda create --name py11 python==3.11

# activation
conda activate py11

And here is the application:

import streamlit as st
import os

from pathlib import Path

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrMacOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    st.title("File Selection App")

    # Initialize session state for storing the concatenated path
    if 'full_file_path' not in st.session_state:
        st.session_state.full_file_path = None

    # Add description
    st.write("Select a file from your computer to view its path information")

    # File uploader with type filtering
    allowed_types = ["pdf", "doc", "docx", "ppt", "pptx", "jpg", "jpeg"]
    uploaded_file = st.file_uploader(
        "Choose a file",
        type=allowed_types,
        help="Supported files: PDF, Word, PowerPoint, and JPEG"
    )

    # Display file details and run the conversion once a file is uploaded
    if uploaded_file is not None:
        # Get the current working directory
        current_dir = os.getcwd()

        # Build the path where the upload is saved
        full_path = os.path.join(current_dir, uploaded_file.name)

        # Streamlit keeps uploads in memory, so write the bytes to disk
        # to give Docling a real file to open
        with open(full_path, "wb") as out_file:
            out_file.write(uploaded_file.getvalue())

        # Store the concatenated path in session state
        st.session_state.full_file_path = full_path

        file_details = {
            "Filename": uploaded_file.name,
            "Full Path": full_path,
            "File size": f"{uploaded_file.size / 1024:.2f} KB",
            "File type": uploaded_file.type,
            "Concatenated Path Variable": st.session_state.full_file_path
        }

        st.write("### File Details:")
        for key, value in file_details.items():
            st.write(f"**{key}:** {value}")

        # Example of using the concatenated path in the app
        st.write("### Using the Concatenated Path")
        st.code(f"""
# You can access the full file path anywhere in your app using:
full_file_path = st.session_state.full_file_path

# Example usage:
if st.session_state.full_file_path:
    # Do something with the path
    print(f'Working with file: {st.session_state.full_file_path}')
"""
        )

        # Add a button to demonstrate using the path
        if st.button("Print Path to Console"):
            st.write(f"Path has been printed to console: {st.session_state.full_file_path}")
            print(f"Full file path: {st.session_state.full_file_path}")

        # Add a note about the path
        st.info("Note: The full path shown is where the app saves a copy of the uploaded file. "
                "To get the original file path, you would need to use a different approach as browsers "
                "don't provide the original file path for security reasons.")

        # Convert the saved file with Docling
        input_doc = st.session_state.full_file_path

        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.do_cell_matching = True

        # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
        # TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
        # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
        # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
        # ocr_options = OcrMacOptions(force_full_page_ocr=True)
        # ocr_options = RapidOcrOptions(force_full_page_ocr=True)
        ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
        pipeline_options.ocr_options = ocr_options

        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                )
            }
        )

        # Run the conversion and print the Markdown to the console
        doc = converter.convert(input_doc).document
        md = doc.export_to_markdown()
        print(md)


if __name__ == "__main__":
    main()
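
One detail worth making explicit: Streamlit hands the upload to the script as an in-memory file object, so Docling can only open it by path once the bytes are on disk. The sketch below shows one possible way to do that with a temporary directory. The helper name save_upload and the fallback file name are assumptions made for this example; they are not part of Streamlit or Docling.

```python
import tempfile
from pathlib import Path


def save_upload(filename: str, data: bytes, dest_dir: Path) -> Path:
    """Write uploaded bytes into `dest_dir` and return the resulting path."""
    # Keep only the final path component so a name like "../../x.pdf"
    # cannot escape the destination directory.
    safe_name = Path(filename).name
    if safe_name in ("", ".", ".."):
        safe_name = "upload.bin"  # hypothetical fallback name
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / safe_name
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_upload("../sample.pdf", b"%PDF-1.7 demo bytes", Path(tmp))
        print(saved.name, saved.stat().st_size)
```

In the app, this could be called as save_upload(uploaded_file.name, uploaded_file.getvalue(), some_directory), and the returned path passed to the converter.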

To make the application distributable, I set up the following “requirements.txt” (not the kind of requirements file a real Python developer would write 😅). Standard-library modules such as pathlib, logging, time and datetime ship with Python, so they do not belong in it.

streamlit
docling
pandas
tensorflow
To test the app locally, the following simple command line does the job!

streamlit run app.py

And here is the Dockerfile.

# Use an official Python runtime as a parent image
FROM python:3.11-slim

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1

# Set the working directory in the container
WORKDIR /app

# Install system dependencies
# (tesseract-ocr provides the CLI used by TesseractCliOcrOptions in app.py)
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

# Copy only requirements first to leverage Docker cache
COPY requirements.txt .

# Install Python dependencies
RUN pip install --upgrade pip setuptools wheel \
    && pip install -r requirements.txt \
    && rm -rf /root/.cache/pip

# Copy the rest of the application
COPY . .

# Expose Streamlit port
EXPOSE 8501

# Add healthcheck
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Set Streamlit configuration
ENV STREAMLIT_SERVER_PORT=8501
ENV STREAMLIT_SERVER_ADDRESS=0.0.0.0

# Run the application
CMD ["streamlit", "run", "app.py"]

I use Podman to build and deploy the image; with docker aliased to podman, all the usual Docker commands can be used.

# build the image
docker build -t app:latest .
# run in the foreground
docker run -p 8501:8501 app:latest
# or run detached
docker run -d -p 8501:8501 app:latest
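
Once the container is running, the same health endpoint that the Dockerfile's HEALTHCHECK queries can be probed from the host. Here is a rough sketch using only the standard library; the function names health_url and is_healthy and the default base URL are assumptions for this example, while the /_stcore/health path comes from the Dockerfile above.

```python
import http.client
import urllib.request


def health_url(base_url: str) -> str:
    """Build the Streamlit health endpoint URL from a base URL."""
    return base_url.rstrip("/") + "/_stcore/health"


def is_healthy(base_url: str = "http://localhost:8501", timeout: float = 3.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP error statuses
        # and malformed responses.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "not reachable")
```

The same kind of check is what a Kubernetes readiness or liveness probe would perform against the pod, although there it would normally be declared in the deployment manifest rather than scripted.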

Conclusion

This article shows how easily Docling can be used to get enterprise documents ready for ingestion, for example by a RAG application.

Thanks for reading 🙏
