Andrej Baranovskij Blog: vLLM

Showing posts with label vLLM. Show all posts

Tuesday, April 21, 2026

Gemma 4 for Structured Data Extraction: Can It Beat Qwen 3.5?

In this video, I put Gemma 4 to the test on a real-world task — extracting structured data from bank statements — and benchmark it head-to-head against Mistral's Ministral and Qwen 3.5.

I run both the MoE and Dense variants of Gemma 4 to see how architecture affects accuracy on financial documents, then compare the results side-by-side.

My takeaway: Gemma 4 holds its own and performs on par with Qwen 3.5 — a strong result for local structured extraction workflows.

Thursday, April 2, 2026

Running Multiple Models on One GPU with vLLM and GPU Memory Utilization

In this video I show how to run multiple vLLM model instances on the same GPU (Nvidia) in parallel by adjusting the --gpu-memory-utilization flag.

You'll see:

- How to launch separate vLLM servers for different models

- How to split GPU memory between them without running out of VRAM

This approach works when you want to serve several smaller models concurrently on limited hardware.

Tuesday, March 24, 2026

How to Cache vLLM Model in FastAPI for Faster Inference

I show you how to keep your vLLM model loaded in FastAPI cache for much faster inference — without reloading it on every request.

Wednesday, November 26, 2025

Ollama and MLX-VLM Accuracy Review (Qwen3-VL and Mistral Small 3.2)

I was running detail tests to compare accuracy for the same models (Qwen3-VL and Mistral Small 3.2) running on Ollama and MLX-VLM (recent 0.3.7 version). MLX-VLM runs faster, but with lower accuracy. The same is valid across different models.

Tuesday, September 16, 2025

Ollama vs MLX Inference Speed on Mac Mini M4 Pro 64GB

MLX runs faster on first inference, but thanks to model caching or other optimizations by Ollama, second and next inference runs faster on Ollama.

Wednesday, September 10, 2025

Advanced Structured Data Processing in Sparrow

I added instruction and validation functionality into Sparrow. This allows to process business logic with document data directly through Sparrow query. For example, it allows to check if given fields are present in the document.

Monday, June 30, 2025

Solving Upwork Client Task with Sparrow

I show how Sparrow can be used to handle complex Upwork task, with accurate table data extraction. Key requirement is to prevent Vision LLM hallucinations, this is achieved by Sparrow hybrid data processing approach.

Tuesday, June 10, 2025

Solving Vision LLM Number Formatting Issues Using PaddleOCR and Sparrow

Discover how to fix number formatting errors in vision LLMs like Mistral! In this video, I show how Mistral misreads "56,000" as "56000" and how combining PaddleOCR’s text extraction with Sparrow’s spatial data processing solves this hallucination issue.

Monday, April 28, 2025

Vision LLM on Mac Mini M4 Pro: Real-World MLX Performance

I discuss the real-world MLX performance of Sparrow for structured data extraction with public access. The current Sparrow online instance runs on a Mac Mini M4 Pro with 64GB of memory. On average, it processes one page in 100 seconds. I explain why tokens-per-second measurements can be misleading when evaluating structured data extraction.

Tuesday, December 17, 2024

Streamlined Table Data Extraction with Sparrow | Table Transformer, Qwen2 VL, MLX, & Mac Mini M4 Pro

Learn how to streamline table data extraction with Sparrow, Table Transformer, Qwen2 VL, and MLX on the Mac Mini M4 Pro. Simplify your workflow and get accurate results!

Sunday, September 8, 2024

Table Parsing with Qwen2-VL-7B

I show how to retrieve structured JSON output from table image using Qwen2-VL-7B. This VLLM performs OCR and data mapping tasks all out of the box, also it can return structured JSON output without use of intermediate frameworks.