[Bug]: Android Vulkan GPU offload crashes during llama.cpp backend tensor allocation #2399

@pharoxe

🐛 Bug Description

We are trying to run QVAC local LLM inference with hardware acceleration enabled on Android using the llama.cpp Vulkan backend.

On a OnePlus 10R with a Mali-G610 MC6 GPU, model initialization crashes with a native SIGSEGV whenever Vulkan/GPU offload is enabled, even with gpu_layers: 1. The crash happens inside @qvac/llm-llamacpp during backend tensor buffer allocation, before generation begins.

CPU mode works, but GPU mode consistently crashes the app.

🔄 Steps to Reproduce

Device:

OnePlus 10R / CPH2423
Android 15
MediaTek Dimensity 8100-Max / MT6895
GPU: Mali-G610 MC6
Vulkan device API: 1.1.177
Vulkan instance API: 1.3.0
Vulkan driver: Mali-G610 MC6
Driver info: v1.r32p1-01eac0.b89152572cfa9465230812a8225a45a0
Driver version: 32.21.0
Vendor: ARM / 0x13B5
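
For reference, the device API version above (1.1.177) is lower than the instance API version (1.3.0). A minimal sketch of how an app could decode the packed Vulkan version number and gate GPU offload on it is shown here. The bit layout is the standard Vulkan one; `shouldAttemptGpu` and the minimum version passed to it are hypothetical app-side choices, not something QVAC or llama.cpp is known to require, and how the app obtains the packed value (for example from a native module) is left out.

```typescript
// Decoded form of a packed Vulkan version (layout of VK_MAKE_API_VERSION).
interface VulkanVersion {
  variant: number;
  major: number;
  minor: number;
  patch: number;
}

// Unpack variant (3 bits), major (7 bits), minor (10 bits), patch (12 bits).
function decodeVulkanVersion(packed: number): VulkanVersion {
  return {
    variant: packed >>> 29,
    major: (packed >>> 22) & 0x7f,
    minor: (packed >>> 12) & 0x3ff,
    patch: packed & 0xfff,
  };
}

// True when `v` is at least major.minor (patch is ignored).
function isAtLeast(v: VulkanVersion, major: number, minor: number): boolean {
  return v.major > major || (v.major === major && v.minor >= minor);
}

// Hypothetical gate: the minimum version is the app's own policy (assumption).
function shouldAttemptGpu(
  deviceApiVersion: number,
  minMajor: number,
  minMinor: number
): boolean {
  return isAtLeast(decodeVulkanVersion(deviceApiVersion), minMajor, minMinor);
}
```

With this device's version packed as `(1 << 22) | (1 << 12) | 177`, `shouldAttemptGpu(packed, 1, 2)` is false and `shouldAttemptGpu(packed, 1, 1)` is true.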

Repro:

  • Build and install an Expo/React Native Android app using the QVAC SDK.

  • Download/install a small GGUF LLM model.

  • Enable hardware acceleration.

  • Initialize the QVAC llama.cpp model with GPU/Vulkan enabled.

  • Use minimal GPU offload, e.g. gpu_layers: 1.

  • Start profiler/chat generation.

  • App crashes during model initialization.

Models tested:

Qwen 3.5 Mobile 0.8B GGUF Q4_K_M
Llama 3.2 1B / 3B Q4 variants
MedPsy 1.7B GGUF

Mitigations attempted:

VK_LOADER_LAYERS_DISABLE="*"
GGML_VK_FORCE_LINEAR="1"
GGML_VK_DISABLE_F16="1"
cache-type-k: f16
cache-type-v: f16
flash-attn: off
split-mode: none
main-gpu: integrated
Removed no_mmap, because QVAC/llama parser rejected it as invalid
Updated to QVAC SDK 0.12.0 / @qvac/llm-llamacpp 0.22.1

✅ Expected Behavior

With hardware acceleration enabled and gpu_layers: 1, the model should initialize successfully using Vulkan GPU offload, or fail gracefully with a JavaScript/native error that can be caught and used to fall back to CPU.
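
To make the expected behavior concrete, here is a minimal sketch of the fallback we would like to be able to write on the JavaScript side. It only works if the native layer reports the Vulkan allocation failure as a rejected promise; today the process dies with SIGSEGV, so the `catch` branch is never reached. `ModelConfig`, `Loader`, and `loadWithCpuFallback` are hypothetical names for illustration and are not part of the QVAC SDK API.

```typescript
// Hypothetical config shape; `gpuLayers` mirrors the gpu_layers option.
interface ModelConfig {
  gpuLayers: number;
}

// Hypothetical loader signature standing in for the SDK's model-load call.
type Loader<M> = (config: ModelConfig) => Promise<M>;

interface LoadResult<M> {
  model: M;
  usedGpu: boolean;
  gpuError?: unknown; // set when the GPU attempt failed and CPU was used
}

// Try the requested GPU config first; on a catchable error, retry on CPU.
async function loadWithCpuFallback<M>(
  load: Loader<M>,
  config: ModelConfig
): Promise<LoadResult<M>> {
  if (config.gpuLayers > 0) {
    try {
      return { model: await load(config), usedGpu: true };
    } catch (gpuError) {
      // Only reachable if the native failure surfaces as a rejected promise.
      const model = await load({ ...config, gpuLayers: 0 });
      return { model, usedGpu: false, gpuError };
    }
  }
  return { model: await load(config), usedGpu: false };
}
```

If the CPU retry also fails, its error propagates to the caller unchanged.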

❌ Actual Behavior

The app crashes with a native SIGSEGV during model initialization.

The crash occurs in libqvac__llm-llamacpp.0.22.1.so, specifically in the llama.cpp backend buffer allocation path:

ggml_backend_alloc_ctx_tensors_from_buft
llama_model::create_backend_buffers
llama_model::load_tensors
llama_model_load_from_file
common_init_from_params
LlamaModel::init
JsInterface::activate

This appears to be the same root issue across tested models: native Vulkan backend allocation crashes before generation.

📜 Stack Trace / Error Output

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0
Process: io.daemon.mobile
Thread: mqt_v_js

Cause: null pointer dereference

#04 libqvac__llm-llamacpp.0.22.1.so
#05 libqvac__llm-llamacpp.0.22.1.so
#06 ggml_backend_alloc_ctx_tensors_from_buft + 116
#07 llama_model::create_backend_buffers(ggml_backend_sched*, std::__ndk1::vector<ggml_backend_buffer_type*, std::__ndk1::allocator<ggml_backend_buffer_type*> >&) + 1008
#08 llama_model::load_tensors(llama_model_loader&) 
#10 llama_model_load_from_file + 132
#11 common_init_result::common_init_result(common_params&) + 456
#12 common_init_from_params(common_params&) + 56
#13 initFromConfig(...) + 752
#14 LlamaModel::init(bool) + 1948
#16 InitLoader::waitForLoadInitialization() + 68
#18 qvac_lib_inference_addon_cpp::JsInterface::activate(...) + 112
#19 libbare-kit.so js_callback_s::on_call(...)
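
In case it helps with comparing traces across models, below is a small sketch of how frames in the format pasted above can be parsed and checked for gaps in numbering (this trace skips some frame numbers). `parseFrames` and `missingFrameIndices` are hypothetical helpers written for this report, and the line format is assumed to be `#NN symbol + offset` with the offset optional.

```typescript
interface Frame {
  index: number;
  symbol: string;
  offset?: number; // byte offset after "+", when present
}

// Parse lines of the form "#NN symbol + offset"; other lines are skipped.
function parseFrames(trace: string): Frame[] {
  const frames: Frame[] = [];
  for (const raw of trace.split("\n")) {
    const m = /^#(\d+)\s+(.*?)(?:\s+\+\s+(\d+))?$/.exec(raw.trim());
    if (!m) continue;
    const frame: Frame = { index: Number(m[1]), symbol: m[2] ?? "" };
    if (m[3] !== undefined) frame.offset = Number(m[3]);
    frames.push(frame);
  }
  return frames;
}

// Frame numbers absent between the lowest and highest parsed index.
function missingFrameIndices(frames: Frame[]): number[] {
  if (frames.length === 0) return [];
  const indices = frames.map((f) => f.index);
  const lo = Math.min(...indices);
  const hi = Math.max(...indices);
  const missing: number[] = [];
  for (let i = lo; i <= hi; i++) {
    if (indices.indexOf(i) === -1) missing.push(i);
  }
  return missing;
}
```

Comparing the parsed `symbol` sequence between two crashes is a quick way to confirm they share the same allocation path.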

💻 Platform / OS

Android (Expo)

🖥️ OS Version

Android 15

⚙️ Runtime Environment

Expo (React Native)

📦 Runtime Version

Expo: 54.0.34, React Native: 0.81.5, react-native-bare-kit: 0.14.2, bare-pack: 2.0.1, React: 19.1.0

🏷️ SDK Version

0.12.0

📋 Relevant Dependencies

@qvac/sdk: 0.12.0
@qvac/llm-llamacpp: 0.22.1
@qvac/cli: 0.5.0
react-native-bare-kit: 0.14.2
bare-pack: 2.0.1
expo: 54.0.34
react-native: 0.81.5
react: 19.1.0

🔁 Frequency

Always (100%)

🔥 Severity

Critical - Complete blocker, no workaround

🩹 Workaround

None for GPU offload. Running in CPU mode avoids the crash.

📎 Additional Context

Can QVAC confirm whether @qvac/llm-llamacpp Vulkan offload is expected to work on Mali-G610 MC6 / Vulkan 1.1.177? If this GPU/driver is supported, we would appreciate guidance on a Mali-safe config or build option to avoid the crashing allocation path in ggml_backend_alloc_ctx_tensors_from_buft.

Specifically, is there a QVAC-supported way to:

  • Disable pinned/staging memory behavior for Vulkan
  • Force safer Mali memory types/layouts
  • Avoid Vulkan backend allocation for full context/scratch buffers when only gpu_layers: 1 is set
  • Gracefully fall back to CPU instead of native SIGSEGV when Vulkan allocation fails
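
Because a SIGSEGV cannot be caught from JavaScript, the only app-side fallback we can think of is a crash marker persisted across launches. A minimal sketch follows, assuming a synchronous persistent key-value store. `FlagStore`, `GPU_INIT_MARKER`, `beginGpuInit`, and `markGpuInitSucceeded` are hypothetical names, not QVAC APIs.

```typescript
// Hypothetical synchronous key-value store; on device it must be backed by
// storage that survives a process crash.
interface FlagStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
  remove(key: string): void;
}

const GPU_INIT_MARKER = "gpu_init_in_progress"; // hypothetical key name

// Decide whether to try GPU on this launch. A leftover marker means the
// previous GPU attempt never completed, e.g. because the process died.
function beginGpuInit(store: FlagStore): boolean {
  if (store.get(GPU_INIT_MARKER) !== null) {
    return false; // previous attempt did not finish: stay on CPU
  }
  store.set(GPU_INIT_MARKER, "1");
  return true;
}

// Call once GPU model initialization has returned successfully.
function markGpuInitSucceeded(store: FlagStore): void {
  store.remove(GPU_INIT_MARKER);
}
```

The app would call `beginGpuInit` before model initialization and `markGpuInitSucceeded` after it returns. The marker has to be flushed to disk before the native call starts, and it stays set until something clears it (for example after an SDK or driver update), so this costs one crash per device and is a stopgap rather than a fix.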

✅ Checklist

  • I have searched existing issues for duplicates
  • I have included a minimal reproduction
  • I am using a supported SDK version
