Ollama's new engine for multimodal models

(ollama.com)

Comments

simonw 16 May 2025
The timing on this is a little surprising given llama.cpp just finally got a (hopefully) stable vision feature merged into main: https://simonwillison.net/2025/May/10/llama-cpp-vision/

Presumably Ollama had been working on this for quite a while already - it sounds like they've broken their initial dependency on llama.cpp. Being in charge of their own destiny makes a lot of sense.

oezi 16 May 2025
I wish "multimodal" implied text, image, and audio (and potentially video). If a model supports only image generation or image analysis, "vision model" seems the more appropriate term.

We should aim to distinguish multimodal models such as Qwen2.5-Omni from Qwen2.5-VL.

In this sense: Ollama's new engine adds vision support.

tommica 16 May 2025
Side tangent: why is ollama frowned upon by some people? I've never really gotten any explanation other than "you should run llama.cpp yourself".
ics 16 May 2025
I'll have to try this later, but I appreciate that the article gets straight to the point with practical examples before the details.
andy_xor_andrew 16 May 2025
They talk a lot about this new engine, but I'd love to see details on how it's actually implemented. llama.cpp is a herculean feat; if you're going to claim to have a replacement for it, an explanation of how you did it would be good!

Based on this part:

> We set out to support a new engine that makes multimodal models first-class citizens, and getting Ollama’s partners to contribute more directly to the community - the GGML tensor library.

And from clicking through a github link they had:

https://github.com/ollama/ollama/blob/main/model/models/gemm...

My takeaway is that GGML (the tensor library that is the backbone of llama.cpp) must expose a C interface (an FFI) that can be invoked from Go, so in the Ollama Go code they can write their own implementations of model behavior (like Gemma 3) that just call into the GGML machinery. I think I have that right? I would have expected a detail like that to be front and center in the blog post.
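If I have that right, the shape of it would be something like the toy sketch below: cgo against GGML's C header, with the compute graph built on the Go side. (The link flags are hypothetical and the calls are from the older CPU-only slice of the GGML API as I remember it, not Ollama's actual binding layer, which I believe lives under their ml/ directory.)

    package main

    /*
    #cgo LDFLAGS: -lggml
    #include "ggml.h"
    */
    import "C"

    import "fmt"

    func main() {
        // Allocate a GGML context with a small scratch buffer.
        params := C.struct_ggml_init_params{
            mem_size:   C.size_t(16 * 1024 * 1024),
            mem_buffer: nil,
            no_alloc:   false,
        }
        ctx := C.ggml_init(params)
        defer C.ggml_free(ctx)

        // Build a trivial graph: c = a * b, element-wise over 4 floats.
        a := C.ggml_new_tensor_1d(ctx, C.GGML_TYPE_F32, 4)
        b := C.ggml_new_tensor_1d(ctx, C.GGML_TYPE_F32, 4)
        c := C.ggml_mul(ctx, a, b)

        C.ggml_set_f32(a, 2.0)
        C.ggml_set_f32(b, 3.0)

        // Evaluate the graph on the CPU with one thread.
        graph := C.ggml_new_graph(ctx)
        C.ggml_build_forward_expand(graph, c)
        C.ggml_graph_compute_with_ctx(ctx, graph, 1)

        fmt.Println(C.ggml_get_f32_1d(c, 0)) // prints 6
    }

A model's forward pass (like Gemma 3's) would then just be ordinary Go code stringing those tensor ops together.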

clpm4j 16 May 2025
The whole '*llama' naming convention in the LLM world is more confusing to me than it probably should be. So many llamas running around out here.
yossi_peti 16 May 2025
Their example of "understanding and translating vertical Chinese spring couplets to English" has a lot of mistakes in it. I'm guessing the person who wrote the blog post to show off that example doesn't actually know Chinese.

What is actually written:
Top: 家和国盛 ("family in harmony, nation prospering")
Left: 和谐生活人人舒畅迎新春 ("a harmonious life, everyone at ease, welcoming the new spring")
Right: 平安社会家家欢乐辞旧岁 ("a peaceful society, every family joyful, seeing off the old year")

What Ollama saw:
Top: 盛和家国 (the correct characters, but in the wrong order)
Left: it reads "新春" ("new spring") as 舒畅 ("comfortable")
Right: 家家欢欢乐乐辞旧岁 (it duplicates characters and omits the first four)

bradly 16 May 2025
The strength of Ollama for me was the ease of running a single Docker command and being up and running locally without any tinkering. But with image and video, Docker is no longer an option for me, because Docker does not use the GPU. I'm curious how Ollama plans to support its Docker integration going forward, or whether that's a less important part of the project than I'm giving it credit for.
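For reference, the one-liner I mean, plus the GPU variant that Ollama's docs give for Linux with the NVIDIA Container Toolkit installed (so the GPU gap may be specific to platforms like macOS):

    # CPU-only one-liner that made Ollama easy to adopt:
    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

    # GPU passthrough on Linux (requires the NVIDIA Container Toolkit):
    docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama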
newusertoday 16 May 2025
Why does the Ollama engine have to change to support new models? Every time a new model comes out, Ollama has to be upgraded.
mark_l_watson 16 May 2025
I have mostly used Ollama to run local models for close to a year and love it, but I have barely touched the LLaVA-style multimodal support because all my personal use cases are text-based.

Question: what cool and useful multimodal projects have people here built using local models?

I am looking for personal project ideas.
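The most basic building block, in case it sparks ideas: base64-encode an image and post it to Ollama's /api/generate endpoint, then script whatever you like on top (photo captioning, screenshot search, and so on). A minimal sketch in Go, assuming a LLaVA-class model is already pulled and the file path is a placeholder:

    package main

    import (
        "bytes"
        "encoding/base64"
        "encoding/json"
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func main() {
        // Read a local image and base64-encode it for the API.
        img, err := os.ReadFile("photo.jpg") // placeholder path
        if err != nil {
            panic(err)
        }

        body, _ := json.Marshal(map[string]any{
            "model":  "llava", // any multimodal model you have pulled
            "prompt": "Describe this image in one sentence.",
            "images": []string{base64.StdEncoding.EncodeToString(img)},
            "stream": false,
        })

        // Ollama listens on port 11434 by default.
        resp, err := http.Post("http://localhost:11434/api/generate",
            "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        out, _ := io.ReadAll(resp.Body)
        fmt.Println(string(out)) // JSON whose "response" field holds the description
    }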

JKCalhoun 16 May 2025
Does Ollama support the "user context" that higher-level LLM products like ChatGPT have?

I'm not clear on what these are called (or how they're implemented), but I mean roughly 1) the initial prompt/context (the kind that, for example, Grok got in trouble over recently) and 2) the kind of saved context that lets ChatGPT know things about your prompt history so it can better answer future queries.

(My use of Ollama has been pretty bare-bones, and I haven't seen anything covering these higher-level features in --help.)
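From what I can tell, 1) maps onto the system prompt, which Ollama does support via a Modelfile (and via a system field in the API); for 2), the ChatGPT-style saved memory, there doesn't seem to be a built-in equivalent, so you'd have to roll your own on top. A minimal Modelfile sketch (the model name and prompt are just examples):

    # Modelfile: bake a persistent system prompt into a local variant
    FROM llama3.2
    PARAMETER temperature 0.7
    SYSTEM """You are a concise assistant. The user prefers short answers with examples."""

Build and run it with:

    ollama create my-assistant -f Modelfile
    ollama run my-assistant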

ac29 16 May 2025
I am amused that one of the handful of examples they chose to use is wrong:

"The best way to get to Stanford University from the Ferry Building in San Francisco depends on your preferences and budget. Here are a few options:

1. *By Car*: Take US-101 South to CA-85 South, then continue on CA-101 South."

The CA-85 junction is significantly farther down US-101 than Palo Alto, so those directions overshoot Stanford.

Koshima 16 May 2025
The timing makes sense if you consider the broader trend in the LLM space. We're moving from just text to more integrated, multimodal experiences, and having a tightly controlled engine like this could be a game changer for developers building apps that require real-time, context-rich understanding.