A step-by-step guide to running Qwen 3.6 27B locally on a 24GB GPU using Podman, NVIDIA CDI GPU passthrough, and llama.cpp model presets.

Hardware and Software Setup

Hardware:

  • CPU: AMD Ryzen 9 5950X (16 cores)
  • RAM: 64 GB DDR4
  • GPU 1: NVIDIA RTX 3090 (24GB VRAM) - Primary inference device
  • GPU 2: NVIDIA GTX 1070 Ti (8GB VRAM) - Not used in this test

Software:

  • OS: Fedora 44 Workstation
  • podman 5.8.2
  • NVIDIA driver version: 580.159.03 (kept on the 580 branch to retain support for the GTX 1070 Ti; newer drivers drop legacy architecture support)
  • NVIDIA Container Toolkit 1.19.0
  • Container image: ghcr.io/ggml-org/llama.cpp:full-cuda (CUDA 12-based image)
  • llama.cpp version: 9309 (6d57c26ef)
  • Model: Qwen 3.6 27B - Unsloth GGUF UD-Q4_K_XL quantization

Qwen 3.6

Qwen 3.6 is an open-weight LLM from Alibaba. The 27B parameter variant fits on a 24GB consumer GPU and handles tool calling, coding, and reasoning tasks proficiently. Running it locally keeps your data on your machine.

llama.cpp

llama.cpp is the most widely used open-source inference library for GGUF models. It is written in C/C++ and actively maintained, and popular frontends like Ollama and LM Studio build on it. The official container image supports CUDA out of the box. Compared to using Ollama directly, running llama.cpp on Podman gives you finer control over KV cache type, batch size, and model presets, at the cost of a bit more manual configuration.

Podman

I use Podman instead of Docker for its rootless default, daemon-less architecture, and systemd integration via Quadlet units.

GPU passthrough with NVIDIA CDI

NVIDIA recommends using the Container Device Interface (CDI) for GPU access in containers. CDI simplifies device passthrough, offering better compatibility with modern Podman workflows, including rootless setups and fine-grained device selection.

Prerequisites:

  1. Ensure NVIDIA drivers are installed and nvidia-smi functions correctly.
  2. Install the NVIDIA Container Toolkit:

# Add the NVIDIA Container Toolkit repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Install the toolkit (Version 1.19.0)
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.0-1
sudo dnf install -y \
    nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}

Starting with NVIDIA Container Toolkit v1.18.0, the CDI specification is automatically generated and updated by the nvidia-cdi-refresh systemd service. This service generates the CDI specification at /var/run/cdi/nvidia.yaml when:

  • The NVIDIA Container Toolkit is installed or upgraded
  • The NVIDIA GPU drivers are installed or upgraded
  • The system reboots

This ensures the CDI specifications stay current with the installed drivers and devices, making CDI Devices available for use with container engines like Podman.

To list the available CDI Devices, run:

$ nvidia-ctk cdi list
INFO[0000] Found 5 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=GPU-38125bdc-f63a-09b2-bce7-b35c0f95d12c
nvidia.com/gpu=GPU-797ee00e-48ea-2830-1220-fef25cf6e475
nvidia.com/gpu=all
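
Since only the RTX 3090 does the inference in this setup, a narrower selection than nvidia.com/gpu=all is possible. The sketch below is one way to approach it, under a few assumptions: cdi_uuid_devices is a hypothetical helper name, it filters a listing passed on stdin, and the commented commands assume that CDI index 0 is the RTX 3090 on your machine. nvidia-smi -L prints each GPU together with its UUID, which helps match a UUID entry to a physical card.

```shell
# Keep only the UUID-based CDI device names from an `nvidia-ctk cdi list`
# listing read on stdin (hypothetical helper, not part of the toolkit).
cdi_uuid_devices() {
  grep -E '^nvidia\.com/gpu=GPU-[0-9a-f-]+$'
}

# Usage on a real host (not run here):
#   nvidia-ctk cdi list | cdi_uuid_devices
#   nvidia-smi -L    # shows which UUID belongs to which card
#   # Pass a single GPU by index (assumed to be the RTX 3090):
#   podman run --rm --device nvidia.com/gpu=0 \
#     docker.io/nvidia/cuda:13.2.1-base-opensuse15 nvidia-smi -L
```

UUID names stay stable if the index order ever changes, which is why they may be the safer choice for a long-lived setup.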

On Fedora, SELinux may block containers from using GPU devices. If that happens, enable the container_use_devices boolean:

sudo setsebool -P container_use_devices true

Finally, verify that everything works by running nvidia-smi inside an NVIDIA CUDA container:

$ podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:13.2.1-base-opensuse15 nvidia-smi
Tue May 26 14:09:41 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.03             Driver Version: 580.159.03     CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:0D:00.0  On |                  N/A |
| 45%   55C    P8             47W /  280W |   21866MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1070 Ti     Off |   00000000:0E:00.0 Off |                  N/A |
|  0%   47C    P8              5W /  180W |       8MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Download Qwen 3.6 27B from Hugging Face

The UD-Q4_K_XL quantization balances quality and VRAM usage well for 24GB GPUs, making it ideal for local coding, reasoning, and tool-calling workloads. There is also the Qwen 3.6 35B-A3B version, a sparse Mixture of Experts model with 35B total parameters but only 3B active per token. It is faster but requires more VRAM and leaves less space available for context.

Model selection: the UD-Q4_K_XL variant quantized by Unsloth. It leaves enough free VRAM for a large context window: with an f16 KV cache you can typically reach 64K tokens, and with q8_0 KV cache quantization you can extend that to 128K.

Download the model from Hugging Face and save it as ~/.cache/llama/models/Qwen3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf.

Alternatively, use the Hugging Face CLI (hf) to download the Q4 model together with the mmproj projection model:

hf download unsloth/Qwen3.6-27B-GGUF --local-dir ~/.cache/llama/models/Qwen3.6/ --include "*mmproj-BF16*" --include "*UD-Q4_K_XL*"
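
Before moving on, it may be worth confirming that both files ended up where the presets later in this post expect them. The following is a minimal sketch; check_model_files is a hypothetical helper name, and the two file names are the ones referenced by the presets (Qwen3.6-27B-UD-Q4_K_XL.gguf and mmproj-BF16.gguf).

```shell
# Report whether the model and projection files exist in the given directory.
# Defaults to the download location used in this post.
check_model_files() {
  local dir=${1:-"$HOME/.cache/llama/models/Qwen3.6"}
  local missing=0 f
  for f in Qwen3.6-27B-UD-Q4_K_XL.gguf mmproj-BF16.gguf; do
    if [ -f "$dir/$f" ]; then
      echo "ok       $dir/$f"
    else
      echo "missing  $dir/$f" >&2
      missing=1
    fi
  done
  # Non-zero exit status when at least one file is absent
  return "$missing"
}

# Usage: check_model_files
```

If a file is reported missing, check whether the download placed it in a subdirectory and adjust the preset paths accordingly.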

Running llama.cpp in Podman

We will use the official llama.cpp image built for CUDA 12: ghcr.io/ggml-org/llama.cpp:full-cuda. The community around llama.cpp is very active, and new images are released frequently with improvements and bug fixes.

Here is the run command for our setup:

podman run \
  --name llamacpp \
  --device nvidia.com/gpu=all \
  -p 127.0.0.1:8080:8080 \
  -v ~/.cache/llama/models:/models:ro,z \
  ghcr.io/ggml-org/llama.cpp:full-cuda \
  --server \
  --parallel 1 \
  --models-preset /models/presets.ini \
  --flash-attn on \
  --host 0.0.0.0 \
  --port 8080

Key Flags Explained:

  • --device nvidia.com/gpu=all: Passes all NVIDIA GPUs via CDI specifications.
  • -p 127.0.0.1:8080:8080: Binds the port only to localhost, preventing external access. Router mode (see --models-preset below) is experimental and lacks authentication: this setup is meant for local use only.
  • -v ~/.cache/llama/models:/models:ro,z: Mounts the model directory as read-only (ro) and relabels it for SELinux compatibility (z).
  • --parallel 1: Sets the number of concurrent request slots to 1. By default, llama.cpp sets this to auto, which picks a value based on available hardware (on my machine, 4). Each parallel slot reserves its own KV cache portion, consuming VRAM even when idle. Since this setup targets a single user with no need for concurrent requests, forcing it to 1 maximizes available memory for the context window. Keep in mind that this also prevents the server from batching requests efficiently, so overall throughput drops as soon as multiple clients connect.
  • --models-preset /models/presets.ini: Starts the server in router mode, exposing multiple model presets behind a single endpoint. Models are loaded on demand, so you can switch between presets such as coding and vision without restarting the server.
  • --flash-attn on: Enables Flash Attention, which reduces memory usage and speeds up inference for large context windows. The benefit is largest with long contexts (for example, 32K+ tokens) and minimal for short prompts. The model and its KV cache still have to fit in VRAM.
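
Since Quadlet was one of the reasons for choosing Podman, here is a rough sketch of how the same flags might be carried into a Quadlet unit. Treat it as an assumption-laden starting point rather than a tested unit: write_llamacpp_quadlet is a hypothetical helper, the unit name llamacpp.container is my choice, and I am assuming that AddDevice accepts the CDI device name and that %h expands to the home directory in your Podman version.

```shell
# Write a Quadlet unit mirroring the podman run command above.
# The target directory is an argument so the file can be inspected in a
# scratch location before it goes into ~/.config/containers/systemd.
write_llamacpp_quadlet() {
  local dir=$1
  mkdir -p "$dir"
  cat > "$dir/llamacpp.container" <<'EOF'
[Unit]
Description=llama.cpp router server (Qwen 3.6 presets)

[Container]
ContainerName=llamacpp
Image=ghcr.io/ggml-org/llama.cpp:full-cuda
AddDevice=nvidia.com/gpu=all
PublishPort=127.0.0.1:8080:8080
Volume=%h/.cache/llama/models:/models:ro,z
Exec=--server --parallel 1 --models-preset /models/presets.ini --flash-attn on --host 0.0.0.0 --port 8080

[Install]
WantedBy=default.target
EOF
}

# Usage (not run here):
#   write_llamacpp_quadlet "$HOME/.config/containers/systemd"
#   systemctl --user daemon-reload
#   systemctl --user start llamacpp.service
```

The rest of this post keeps using the plain podman run command, so the unit is optional.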

Model Presets

Recent llama.cpp builds support a presets .ini file.

Create ~/.cache/llama/models/presets.ini. Each section defines a distinct behavior profile for the same model file.

Thinking - General

Balanced preset for complex reasoning tasks like problem solving, strategic planning and data analysis.

[qwen36-thinking-general]
model = /models/Qwen3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf
cache-type-k = f16
cache-type-v = f16
ctx-size = 32768
temp = 1.0
min-p = 0.0
top-p = 0.95
top-k = 20
presence-penalty = 0.0
repeat-penalty = 1.0

Thinking - Coder

Higher context window, lower temperature: use it for complex coding tasks like algorithm design.

preserve_thinking includes reasoning steps in the output (useful for debugging) and keeps them in the KV cache, which reduces redundant re-reasoning in multi-turn conversations and saves tokens.

[qwen36-thinking-coder]
model = /models/Qwen3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf
cache-type-k = f16
cache-type-v = f16
ctx-size = 65536
temp = 0.6
min-p = 0.0
top-p = 0.95
top-k = 20
presence-penalty = 0.0
repeat-penalty = 1.0
chat-template-kwargs = {"preserve_thinking": true}

For larger context windows on a 24GB GPU, quantize the KV cache to q8_0. This lowers VRAM usage at the cost of some accuracy and slower prompt processing.

[qwen36-thinking-coder-max-ctx]
model = /models/Qwen3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf
cache-type-k = q8_0
cache-type-v = q8_0
ctx-size = 131072
temp = 0.6
min-p = 0.0
top-p = 0.95
top-k = 20
presence-penalty = 0.0
repeat-penalty = 1.0
chat-template-kwargs = {"preserve_thinking": true}

Instruct

Disables chain of thought for faster responses. Use for straightforward tasks: text processing, translation, documentation.

[qwen36-instruct]
model = /models/Qwen3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf
cache-type-k = f16
cache-type-v = f16
ctx-size = 65536
temp = 0.7
min-p = 0.0
top-p = 0.8
top-k = 20
presence-penalty = 1.5
repeat-penalty = 1.0
reasoning = off

Vision

The vision preset adds the mmproj projection model so Qwen can process images alongside text. The mmproj file must correspond exactly to the model variant you’re using; if it doesn’t match, the server will fail to load the preset. If you downloaded the model with the Hugging Face CLI command suggested earlier, you already have the right projection model.

[qwen36-vision]
model = /models/Qwen3.6/Qwen3.6-27B-UD-Q4_K_XL.gguf
mmproj = /models/Qwen3.6/mmproj-BF16.gguf
cache-type-k = f16
cache-type-v = f16
ctx-size = 32768
temp = 0.7
min-p = 0.0
top-p = 0.8
top-k = 20
presence-penalty = 1.5
repeat-penalty = 1.0
reasoning = off
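
With five sections in one file, a typo in a section header is easy to miss. As a quick sanity check, the sketch below prints the section names found in the file; list_presets is a hypothetical helper name, and it only looks at lines of the form [name]. The server prints the same names at startup, so the two lists should match.

```shell
# Print the preset names (INI section headers) defined in a presets file.
list_presets() {
  sed -n 's/^\[\(.*\)]$/\1/p' "$1"
}

# Usage: list_presets ~/.cache/llama/models/presets.ini
```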

VRAM Considerations

On the 24GB RTX 3090, the f16 KV cache presets comfortably fit up to 64K of context. For larger windows, q8_0 cache quantization extends the reach to 128K with a small quality loss. Actual VRAM usage depends on the context consumed per request, so monitor it with nvidia-smi or btop during your workload.
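
To get a feel for why q8_0 roughly doubles the reachable context, here is a back-of-the-envelope estimate for a standard attention layout, where each layer stores one K and one V tensor per token. This is only a sketch: kv_cache_mib is a hypothetical helper, and the layer, head, and dimension values in the usage lines are placeholders, not the real Qwen 3.6 27B numbers (those can be read from the GGUF metadata, and only layers that actually keep a KV cache should be counted). The block sizes are the ggml storage sizes: 32 f16 values take 64 bytes, and a q8_0 block of 32 values takes 34 bytes.

```shell
# Estimate KV cache size in MiB.
# Args: layers kv_heads head_dim ctx_tokens cache_type(f16|q8_0)
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  local bytes_per_block   # bytes used to store a block of 32 values
  case "$type" in
    f16)  bytes_per_block=64 ;;  # 32 values x 2 bytes
    q8_0) bytes_per_block=34 ;;  # 32 int8 values + one 2-byte scale
    *) echo "unsupported cache type: $type" >&2; return 1 ;;
  esac
  # Leading 2 = one K tensor plus one V tensor per layer
  local bytes=$(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_block / 32 ))
  echo $(( bytes / 1024 / 1024 ))
}

# Placeholder architecture: 16 layers, 4 KV heads, head dimension 256
kv_cache_mib 16 4 256 65536  f16    # prints 4096
kv_cache_mib 16 4 256 131072 q8_0   # prints 4352
```

With these placeholder values, doubling the context while switching from f16 to q8_0 grows the cache by only about 6%, which matches the intuition behind the 64K and 128K presets.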

Verifying the setup

Once the container is up and running, you can review its logs with:

$ podman logs -f llamacpp 
0.00.184.426 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.184.539 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.184.605 I srv          init: running without SSL
0.00.184.827 I srv          init: using 31 threads for HTTP server
0.00.185.838 I srv   load_models: Loaded 0 cached model presets
0.00.186.739 I srv   load_models: Loaded 5 custom model presets from /models/presets.ini
0.00.187.141 I srv    operator(): Available models (5) (*: custom preset)
0.00.187.143 I srv    operator():   * qwen36-instruct
0.00.187.143 I srv    operator():   * qwen36-thinking-coder
0.00.187.143 I srv    operator():   * qwen36-thinking-coder-max-ctx
0.00.187.143 I srv    operator():   * qwen36-thinking-general
0.00.187.144 I srv    operator():   * qwen36-vision
0.00.187.206 I srv          main: starting router server, no model will be loaded in this process
0.00.187.208 I srv         start: binding port with default address family
0.00.188.497 I srv          main: router server is listening on http://0.0.0.0:8080
0.00.188.498 W srv          main: NOTE: router mode is experimental
0.00.188.499 W srv          main:       it is not recommended to use this mode in untrusted environments

You can also interact with the server through the llama.cpp web interface, available at http://localhost:8080/.

llama.cpp GUI

I asked qwen36-thinking-coder to “Write a tetris clone in a single html file”. The model processed the 20-token prompt in 0.2s (126.14 tokens/second) and generated the response in 2 min 48s: 5403 tokens in total, at a generation speed of 32.10 tokens/second. Here is the end result.

Other ways to interact with llama.cpp include OpenWebUI or, through the API, AI coding agents such as OpenCode or Pi.

If you prefer to call the API directly, you can test the qwen36-instruct preset with a simple curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen36-instruct",
    "messages": [
      {"role": "user", "content": "Write the history of programming languages in one page."}
    ]
  }'
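
For repeated use, the curl call can be wrapped in a small function that switches presets through the model field and prints only the reply text. This is a sketch under assumptions: chat_body and ask are hypothetical helper names, jq is assumed to be installed, and the response is assumed to follow the usual OpenAI-style shape with the text under choices[0].message.content.

```shell
# Build an OpenAI-style chat request body; jq takes care of JSON escaping.
chat_body() {
  jq -c -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# Send a prompt to a preset and print only the reply text.
ask() {
  chat_body "$1" "$2" |
    curl -sS http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d @- |
    jq -r '.choices[0].message.content'
}

# Usage (needs the server running):
#   ask qwen36-instruct 'Summarize what a KV cache is in two sentences.'
#   ask qwen36-thinking-coder 'Write a binary search in C.'
```

Because models are loaded on demand, the first request to a preset that is not yet loaded will take noticeably longer than the following ones.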

Caveats

  • Router mode is experimental. The llama.cpp logs flag it explicitly: this is fine for local use, but not something to expose on a network.
  • KV cache quantization trades accuracy for context. If you need a 128K window, expect some quality loss on long conversations.
  • Flash Attention requires a GPU backend and sufficient VRAM. It won’t help if the model itself doesn’t fit.

What’s next

llama.cpp recently merged support for Multi-Token Prediction (MTP). Early benchmarks suggest up to 2x faster inference with MTP enabled. A follow-up post will benchmark it with MTP on and off on the same 24GB GPU setup.