
What Are Vision-Language Models (VLMs) and How Do They Work?

10 min read · Jun 17, 2025


Decoding VLMs: A Simple Explanation of Vision-Language Models

Image courtesy of NVIDIA

Introduction

In the rapidly evolving landscape of artificial intelligence, Vision-Language Models (VLMs) represent a significant leap forward, bridging the previously distinct domains of computer vision and natural language processing. Unlike earlier AI systems that excelled at either “seeing” (analyzing images) or “understanding” (processing text), VLMs are multimodal powerhouses, capable of seamlessly interpreting and generating content that incorporates both visual and linguistic information. At their core, VLMs typically integrate a vision encoder, which extracts meaningful features from images or videos, with a powerful language model (often a large language model, or LLM) that understands and generates human-like text. These two components are trained together on massive datasets of image-text pairs, enabling the VLM to learn intricate correlations between what it sees and how it is described.

This ability to connect visual cues with semantic meaning unlocks a vast array of applications, from generating descriptive image captions and answering complex questions about visual content to aiding in visual search, content moderation, and even assisting autonomous systems in comprehending their surroundings.

In the broader ecosystem of generative AI, VLMs serve as crucial building blocks, often working in conjunction with other specialized tools. For instance, a VLM might analyze an image to provide context for a text-to-image generator, allowing users to create visually consistent new images based on textual prompts and existing visual elements. Similarly, in an AI-powered analytical pipeline, a VLM could process large amounts of video data to summarize events, detect anomalies, or extract specific information, which then feeds into other AI models for further analysis, decision-making, or automated reporting. This synergistic integration allows for the creation of more intelligent, intuitive, and versatile AI applications that can better perceive and interact with the complex, multimodal world around us.
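To make the vision-encoder-plus-language-model idea concrete, here is a minimal, hedged sketch of using an off-the-shelf VLM for image captioning with the Hugging Face `transformers` library. The model choice (`Salesforce/blip-image-captioning-base`) and the image path are illustrative assumptions and have nothing to do with Docling itself.

```python
# Minimal sketch: caption an image with an off-the-shelf vision-language model.
# Assumes `transformers` and `pillow` are installed; model and image are illustrative.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Any local image path or URL works here; "example.jpg" is a placeholder.
result = captioner("example.jpg")
print(result[0]["generated_text"])  # a one-sentence description of the image
```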

Implementing VLM by Docling

Docling significantly simplifies the implementation and use of Vision-Language Models (VLMs) for end-to-end document conversion. At its core, Docling offers a `VlmPipeline` designed to process documents with a VLM and output the results in various formats, such as DocTags (its preferred choice), Markdown, or HTML. It provides clear pathways for both Command Line Interface (CLI) and Python users to integrate VLMs into their workflows, and even offers pre-configured options for local model execution using popular frameworks such as Hugging Face Transformers and MLX for Apple devices.

Docling supports a range of readily available local VLM instances, including SmolDocling, Qwen2.5-VL, Pixtral, Gemma, Granite Vision, and Phi-4, allowing users to select models based on performance and device compatibility. Furthermore, the `VlmPipeline` is highly customizable, enabling users to configure other Hugging Face models by specifying their `repo_id`, prompt, and inference parameters. Beyond local execution, Docling also supports offloading inference to remote services that offer an OpenAI-compatible API, such as vLLM or Ollama, providing even greater flexibility and scalability for VLM deployment.


The `VlmPipeline` in Docling lets you convert documents end-to-end using a vision-language model.

Docling supports vision-language models that output:

- DocTags (e.g. [SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)), the preferred choice
- Markdown
- HTML


To run Docling with local models through the `VlmPipeline`:

=== "CLI"

```bash
docling --pipeline vlm FILE
```

=== "Python"

See also the example [minimal_vlm_pipeline.py](./../examples/minimal_vlm_pipeline.py).

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
        ),
    }
)

doc = converter.convert(source="FILE").document
```
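The returned `document` object can then be exported to the desired format, for example Markdown, as the later examples in this article show:

```python
# Export the converted document to Markdown (used again in the examples below).
print(doc.export_to_markdown())
```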

## Available local models

By default, the vision-language models run locally.
Docling lets you choose between the Hugging Face [Transformers](https://github.com/huggingface/transformers) framework and [MLX](https://github.com/Blaizzy/mlx-vlm) (for Apple devices with MPS acceleration).

The following table reports the models currently available out-of-the-box.

| Model instance | Model | Framework | Device | Num pages | Inference time (sec) |
| ---------------|------ | --------- | ------ | --------- | ---------------------|
| `vlm_model_specs.SMOLDOCLING_TRANSFORMERS` | [ds4sd/SmolDocling-256M-preview](https://huggingface.co/ds4sd/SmolDocling-256M-preview) | `Transformers/AutoModelForVision2Seq` | MPS | 1 | 102.212 |
| `vlm_model_specs.SMOLDOCLING_MLX` | [ds4sd/SmolDocling-256M-preview-mlx-bf16](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16) | `MLX`| MPS | 1 | 6.15453 |
| `vlm_model_specs.QWEN25_VL_3B_MLX` | [mlx-community/Qwen2.5-VL-3B-Instruct-bf16](https://huggingface.co/mlx-community/Qwen2.5-VL-3B-Instruct-bf16) | `MLX`| MPS | 1 | 23.4951 |
| `vlm_model_specs.PIXTRAL_12B_MLX` | [mlx-community/pixtral-12b-bf16](https://huggingface.co/mlx-community/pixtral-12b-bf16) | `MLX` | MPS | 1 | 308.856 |
| `vlm_model_specs.GEMMA3_12B_MLX` | [mlx-community/gemma-3-12b-it-bf16](https://huggingface.co/mlx-community/gemma-3-12b-it-bf16) | `MLX` | MPS | 1 | 378.486 |
| `vlm_model_specs.GRANITE_VISION_TRANSFORMERS` | [ibm-granite/granite-vision-3.2-2b](https://huggingface.co/ibm-granite/granite-vision-3.2-2b) | `Transformers/AutoModelForVision2Seq` | MPS | 1 | 104.75 |
| `vlm_model_specs.PHI4_TRANSFORMERS` | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | `Transformers/AutoModelForCausalLM` | CPU | 1 | 1175.67 |
| `vlm_model_specs.PIXTRAL_12B_TRANSFORMERS` | [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b) | `Transformers/AutoModelForVision2Seq` | CPU | 1 | 1828.21 |

_Inference time is computed on a MacBook M3 Max using the example page `tests/data/pdf/2305.03393v1-pg9.pdf`. The comparison is done with the example [compare_vlm_models.py](./../examples/compare_vlm_models.py)._

To choose a different model, the code snippet above can be extended as follows:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
)
from docling.datamodel import vlm_model_specs

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.SMOLDOCLING_MLX,  # <-- change the model here
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

doc = converter.convert(source="FILE").document
```

### Other models

Other models can be configured by directly providing the Hugging Face `repo_id`, the prompt and a few more options.

For example:

```python
# Note: import paths below follow the module used elsewhere in this article;
# the exact location of AcceleratorDevice may vary between Docling versions.
from docling.datamodel.accelerator_options import AcceleratorDevice
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import (
    InferenceFramework,
    InlineVlmOptions,
    ResponseFormat,
    TransformersModelType,
)

pipeline_options = VlmPipelineOptions(
    vlm_options=InlineVlmOptions(
        repo_id="ibm-granite/granite-vision-3.2-2b",
        prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
        response_format=ResponseFormat.MARKDOWN,
        inference_framework=InferenceFramework.TRANSFORMERS,
        transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,
        supported_devices=[
            AcceleratorDevice.CPU,
            AcceleratorDevice.CUDA,
            AcceleratorDevice.MPS,
        ],
        scale=2.0,
        temperature=0.0,
    )
)
```
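As with the built-in model specifications, this `pipeline_options` object is then passed to the `DocumentConverter` exactly as in the previous snippets:

```python
# Wire the custom options into the converter in the same way as before.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

doc = converter.convert(source="FILE").document
```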


## Remote models

In addition to local models, the `VlmPipeline` can offload inference to a remote service hosting the models.
Any remote inference service can be used; the key requirement is that it exposes an OpenAI-compatible API. This includes vLLM, Ollama, and others. A minimal sketch of such a setup follows the example link below.

More details on how to connect to remote inference services can be found in the following example:

- [vlm_pipeline_api_model.py](./../examples/vlm_pipeline_api_model.py)
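As a hedged illustration, the snippet below points the `VlmPipeline` at a locally running Ollama server through Docling's `ApiVlmOptions`. The endpoint URL follows Ollama's default OpenAI-compatible address, while the model name and prompt are illustrative assumptions; refer to the linked example for the authoritative version.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # required to allow calls to a remote endpoint
    vlm_options=ApiVlmOptions(
        # Ollama's default OpenAI-compatible endpoint; adjust for vLLM or other services.
        url="http://localhost:11434/v1/chat/completions",
        params={"model": "granite3.2-vision:2b"},  # model name is an assumption
        prompt="Convert this page to markdown.",
        timeout=90,
        response_format=ResponseFormat.MARKDOWN,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

doc = converter.convert(source="FILE").document
```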

Docling provides a powerful and systematic way to compare the performance and output quality of various Vision-Language Models (VLMs) through its `VlmPipeline`. As illustrated in the code below, Docling allows users to define a suite of VLMs, encompassing different architectures (SmolDocling, Qwen2.5-VL, Pixtral, Gemma, Granite Vision, and Phi-4) and inference frameworks (Hugging Face Transformers and MLX, with a platform check that excludes MLX models when not running on macOS). For each model in this comparison set, Docling processes the specified input documents (e.g., a PDF file), measuring metrics such as the inference time per page and the cumulative prediction time for the entire document.

Beyond raw speed, the script also facilitates qualitative comparison by generating outputs for each VLM in multiple formats, including JSON (for structured data), Markdown, and HTML, so users can directly inspect conversion accuracy and fidelity. Finally, all the collected data, including the source document, model ID, framework used, number of pages processed, and total inference time, is consolidated into a clear, easy-to-read table using the tabulate library. This comparison capability empowers developers and researchers to make informed decisions when selecting the most suitable model for their document processing tasks, balancing performance, output quality, and compatibility with their hardware environment.

```python
# Compare VLM models
# ==================
#
# This example runs the VLM pipeline with different vision-language models.
# Their runtime as well as output quality is compared.

import json
import sys
import time
from pathlib import Path

from docling_core.types.doc import DocItemLabel, ImageRefMode
from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS
from tabulate import tabulate

from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
)
from docling.datamodel.pipeline_options_vlm_model import InferenceFramework
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline


def convert(sources: list[Path], converter: DocumentConverter):
    model_id = pipeline_options.vlm_options.repo_id.replace("/", "_")
    framework = pipeline_options.vlm_options.inference_framework
    for source in sources:
        print("================================================")
        print("Processing...")
        print(f"Source: {source}")
        print("---")
        print(f"Model: {model_id}")
        print(f"Framework: {framework}")
        print("================================================")
        print("")

        res = converter.convert(source)

        print("")

        fname = f"{res.input.file.stem}-{model_id}-{framework}"

        inference_time = 0.0
        for i, page in enumerate(res.pages):
            inference_time += page.predictions.vlm_response.generation_time
            print("")
            print(
                f" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:"
            )
            print(page.predictions.vlm_response.text)
            print(" ---------- ")

        print("===== Final output of the converted document =======")

        with (out_path / f"{fname}.json").open("w") as fp:
            fp.write(json.dumps(res.document.export_to_dict()))

        res.document.save_as_json(
            out_path / f"{fname}.json",
            image_mode=ImageRefMode.PLACEHOLDER,
        )
        print(f" => produced {out_path / fname}.json")

        res.document.save_as_markdown(
            out_path / f"{fname}.md",
            image_mode=ImageRefMode.PLACEHOLDER,
        )
        print(f" => produced {out_path / fname}.md")

        res.document.save_as_html(
            out_path / f"{fname}.html",
            image_mode=ImageRefMode.EMBEDDED,
            labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE],
            split_page_view=True,
        )
        print(f" => produced {out_path / fname}.html")

        pg_num = res.document.num_pages()
        print("")
        print(
            f"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}"
        )
        print("====================================================")

        return [
            source,
            model_id,
            str(framework),
            pg_num,
            inference_time,
        ]


if __name__ == "__main__":
    sources = [
        "tests/data/pdf/2305.03393v1-pg9.pdf",
    ]

    out_path = Path("scratch")
    out_path.mkdir(parents=True, exist_ok=True)

    ## Use VlmPipeline
    pipeline_options = VlmPipelineOptions()
    pipeline_options.generate_page_images = True

    ## On GPU systems, enable flash_attention_2 with CUDA:
    # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA
    # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True

    vlm_models = [
        ## DocTags / SmolDocling models
        vlm_model_specs.SMOLDOCLING_MLX,
        vlm_model_specs.SMOLDOCLING_TRANSFORMERS,
        ## Markdown models (using MLX framework)
        vlm_model_specs.QWEN25_VL_3B_MLX,
        vlm_model_specs.PIXTRAL_12B_MLX,
        vlm_model_specs.GEMMA3_12B_MLX,
        ## Markdown models (using Transformers framework)
        vlm_model_specs.GRANITE_VISION_TRANSFORMERS,
        vlm_model_specs.PHI4_TRANSFORMERS,
        vlm_model_specs.PIXTRAL_12B_TRANSFORMERS,
    ]

    # Remove MLX models if not on Mac
    if sys.platform != "darwin":
        vlm_models = [
            m for m in vlm_models if m.inference_framework != InferenceFramework.MLX
        ]

    rows = []
    for vlm_options in vlm_models:
        pipeline_options.vlm_options = vlm_options

        ## Set up pipeline for PDF or image inputs
        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_cls=VlmPipeline,
                    pipeline_options=pipeline_options,
                ),
                InputFormat.IMAGE: PdfFormatOption(
                    pipeline_cls=VlmPipeline,
                    pipeline_options=pipeline_options,
                ),
            },
        )

        row = convert(sources=sources, converter=converter)
        rows.append(row)

    print(
        tabulate(
            rows, headers=["source", "model_id", "framework", "num_pages", "time"]
        )
    )

    print("see if memory gets released ...")
    time.sleep(10)
```

The Python code below serves as a practical demonstration of Docling’s `VlmPipeline`, showcasing how to convert documents end-to-end using Vision-Language Models. It illustrates two primary ways of using Docling's VLM capabilities, both fetching a PDF from an arXiv link and converting its content into Markdown.

Default VLM Pipeline Usage

The first section of the code sets up a DocumentConverter with minimal configuration. By simply specifying VlmPipeline as the desired pipeline for PDF input, Docling automatically employs its default VLM settings. This typically means using a pre-selected VLM (like SmolDocling) and its default inference framework (often Hugging Face Transformers). The converter.convert(source=source).document line then processes the PDF, and doc.export_to_markdown() extracts the content into a readable Markdown format, demonstrating a straightforward, out-of-the-box conversion.

Custom VLM Pipeline with MPS Acceleration

The second part of the code delves into more explicit VLM configuration, particularly for users with Apple devices equipped with MPS (Metal Performance Shaders) acceleration. Here, a VlmPipelineOptions object is created and specifically configured to use vlm_model_specs.SMOLDOCLING_MLX. This SMOLDOCLING_MLX specification instructs Docling to load the SmolDocling model optimized for the MLX framework, leveraging the MPS accelerator for potentially faster inference on compatible hardware. The converter is then initialized with these custom pipeline_options, and the document conversion proceeds similarly, highlighting how Docling allows users to fine-tune VLM selection and hardware utilization for optimized performance based on their environment. The final print(doc.export_to_markdown()) again outputs the converted document, allowing for a comparison of output or performance if this were part of a larger benchmark.

This sample effectively illustrates Docling’s flexibility in providing both convenient default settings for quick usage and granular control over VLM selection and hardware configuration for more advanced scenarios.

```python
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

source = "https://arxiv.org/pdf/2501.17887"

###### USING SIMPLE DEFAULT VALUES
# - SmolDocling model
# - Using the transformers framework

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
        ),
    }
)

doc = converter.convert(source=source).document

print(doc.export_to_markdown())


###### USING MACOS MPS ACCELERATOR
# For more options see the compare_vlm_models.py example.

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.SMOLDOCLING_MLX,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

doc = converter.convert(source=source).document

print(doc.export_to_markdown())
```

SmolDocling Ultra-Compact VLM

SmolDocling, introduced in the paper “SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion”, significantly simplifies the implementation of VLMs by offering a highly efficient and self-contained solution for document understanding. Unlike traditional approaches that often rely on large, resource-intensive foundation models or complex ensembles of specialized tools, SmolDocling provides an end-to-end VLM designed specifically for comprehensive document processing. Its ultra-compact nature, with only 256 million parameters, drastically reduces computational requirements and memory footprint, making it practical to deploy on devices with limited resources, including consumer-grade hardware.

A key innovation of SmolDocling is its ability to generate DocTags, a universal markup format that precisely captures all page elements, from text and tables to images and equations, along with their structural and spatial context. This unified representation streamlines the conversion process and ensures that all critical information within a document is accurately preserved and organized. By offering robust performance in recognizing diverse document features while competing with much larger VLMs, SmolDocling lets developers implement powerful document conversion capabilities with less overhead, fostering the creation of more efficient and scalable AI applications.

Conclusion

Bringing all these points together, the advancement of Vision-Language Models (VLMs) marks a pivotal moment in AI, seamlessly merging computer vision and natural language understanding to unlock unprecedented capabilities in processing multimodal data. Tools like Docling are at the forefront of this revolution, not only by providing a flexible pipeline for deploying various VLMs — whether locally or via remote services — but also by offering critical functionalities for comparative analysis of these models. This enables users to meticulously assess trade-offs in runtime performance and output quality across different VLM architectures and frameworks. Furthermore, the emergence of specialized, highly efficient models such as SmolDocling underscores a significant trend towards creating compact, end-to-end solutions that are accessible and performant even on resource-constrained devices. By facilitating comprehensive document conversion through universal markup formats like DocTags, these innovations collectively empower developers and researchers to build more intelligent, intuitive, and scalable AI applications that truly understand and interact with the complex visual and textual world.

Written by Alain Airom (Ayrom)

IT guy... sharing my hands-on experiences and technical subjects of my interest. A bit "touche à tout"!
