[Bug]: Android Vulkan GPU offload crashes during llama.cpp backend tensor allocation

### 🐛 Bug Description

We are trying to run QVAC local LLM inference with hardware acceleration enabled on Android using the llama.cpp Vulkan backend.

On a OnePlus 10R with a Mali-G610 MC6 GPU, model initialization crashes with a native SIGSEGV whenever Vulkan/GPU offload is enabled, even with gpu_layers: 1. The crash happens inside @qvac/llm-llamacpp during backend tensor buffer allocation, before generation begins.

CPU mode works, but GPU mode consistently crashes the app.

### 🔄 Steps to Reproduce

Device:

> OnePlus 10R / CPH2423
> Android 15
> MediaTek Dimensity 8100-Max / MT6895
> GPU: Mali-G610 MC6
> Vulkan device API: 1.1.177
> Vulkan instance API: 1.3.0
> Vulkan driver: Mali-G610 MC6
> Driver info: v1.r32p1-01eac0.b89152572cfa9465230812a8225a45a0
> Driver version: 32.21.0
> Vendor: ARM / 0x13B5
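
The device API version (1.1.177) is lower than the instance API version (1.3.0), which matters if GPU offload is ever gated on the device's Vulkan version. As a reference sketch, this decodes the packed 32-bit apiVersion integer using the bit layout from the Vulkan specification. The function and type names are our own for illustration, and the driver version field is vendor-specific and not covered here.

```typescript
// Field layout of a packed Vulkan API version (per the Vulkan specification):
// variant: bits 31-29, major: bits 28-22, minor: bits 21-12, patch: bits 11-0.
interface VulkanVersion {
  variant: number;
  major: number;
  minor: number;
  patch: number;
}

// Hypothetical helper name; decodes a packed apiVersion value.
function decodeVulkanVersion(packed: number): VulkanVersion {
  return {
    variant: packed >>> 29,
    major: (packed >>> 22) & 0x7f,
    minor: (packed >>> 12) & 0x3ff,
    patch: packed & 0xfff,
  };
}

// Formats the version as "major.minor.patch", ignoring the variant field.
function formatVulkanVersion(v: VulkanVersion): string {
  return `${v.major}.${v.minor}.${v.patch}`;
}
```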

Repro:

- Build and install Expo/React Native Android app using QVAC SDK.
- Download/install a small GGUF LLM model.
- Enable hardware acceleration.
- Initialize QVAC llama.cpp model with GPU/Vulkan enabled.
- Use minimal GPU offload, e.g. gpu_layers: 1.
- Start profiler/chat generation.
- App crashes during model initialization.

Models tested:

- Qwen 3.5 Mobile 0.8B GGUF Q4_K_M
- Llama 3.2 1B / 3B Q4 variants
- MedPsy 1.7B GGUF


Mitigations attempted:

- VK_LOADER_LAYERS_DISABLE="*"
- GGML_VK_FORCE_LINEAR="1"
- GGML_VK_DISABLE_F16="1"
- cache-type-k: f16
- cache-type-v: f16
- flash-attn: off
- split-mode: none
- main-gpu: integrated
- Removed no_mmap, because QVAC/llama parser rejected it as invalid
- Updated to QVAC SDK 0.12.0 / @qvac/llm-llamacpp 0.22.1

### ✅ Expected Behavior

With hardware acceleration enabled and gpu_layers: 1, the model should initialize successfully using Vulkan GPU offload, or fail gracefully with a JavaScript/native error that can be caught and used to fall back to CPU.
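
To make the expected behavior concrete, here is a minimal sketch of the fallback we would like to be able to write. It assumes GPU initialization fails with a catchable error. The load callback, Backend, and LoadResult are hypothetical names for illustration and are not part of the QVAC SDK. Today the native SIGSEGV terminates the process before any catch block can run.

```typescript
// Hypothetical shapes for illustration; not QVAC SDK types.
type Backend = "gpu" | "cpu";

interface LoadResult<M> {
  model: M;
  backend: Backend;
  gpuError?: unknown; // set only when the GPU attempt threw and CPU was used
}

// `load` is an injected, hypothetical function wrapping whatever call the app
// uses to initialize the model on the given backend.
async function loadWithCpuFallback<M>(
  load: (backend: Backend) => Promise<M>,
): Promise<LoadResult<M>> {
  try {
    return { model: await load("gpu"), backend: "gpu" };
  } catch (gpuError) {
    // Reachable only if GPU init reports failure as a thrown/rejected error.
    return { model: await load("cpu"), backend: "cpu", gpuError };
  }
}
```

With this in place the app could log gpuError and continue on CPU without the user seeing a crash.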

### ❌ Actual Behavior

The app crashes with a native SIGSEGV during model initialization.

The crash occurs in libqvac__llm-llamacpp.0.22.1.so, specifically in the llama.cpp backend buffer allocation path:

> ggml_backend_alloc_ctx_tensors_from_buft
> llama_model::create_backend_buffers
> llama_model::load_tensors
> llama_model_load_from_file
> common_init_from_params
> LlamaModel::init
> JsInterface::activate

This appears to be the same root issue across tested models: native Vulkan backend allocation crashes before generation.

### 📜 Stack Trace / Error Output

```shell
Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0
Process: io.daemon.mobile
Thread: mqt_v_js

Cause: null pointer dereference

#04 libqvac__llm-llamacpp.0.22.1.so
#05 libqvac__llm-llamacpp.0.22.1.so
#06 ggml_backend_alloc_ctx_tensors_from_buft + 116
#07 llama_model::create_backend_buffers(ggml_backend_sched*, std::__ndk1::vector<ggml_backend_buffer_type*, std::__ndk1::allocator<ggml_backend_buffer_type*> >&) + 1008
#08 llama_model::load_tensors(llama_model_loader&) 
#10 llama_model_load_from_file + 132
#11 common_init_result::common_init_result(common_params&) + 456
#12 common_init_from_params(common_params&) + 56
#13 initFromConfig(...) + 752
#14 LlamaModel::init(bool) + 1948
#16 InitLoader::waitForLoadInitialization() + 68
#18 qvac_lib_inference_addon_cpp::JsInterface::activate(...) + 112
#19 libbare-kit.so js_callback_s::on_call(...)
```

### 💻 Platform / OS

Android (Expo)

### 🖥️ OS Version

Android 15

### ⚙️ Runtime Environment

Expo (React Native)

### 📦 Runtime Version

Expo: 54.0.34, React Native: 0.81.5, react-native-bare-kit: 0.14.2, bare-pack: 2.0.1, React: 19.1.0

### 🏷️ SDK Version

0.12.0

### 📋 Relevant Dependencies

```json
{
  "@qvac/sdk": "0.12.0",
  "@qvac/llm-llamacpp": "0.22.1",
  "@qvac/cli": "0.5.0",
  "react-native-bare-kit": "0.14.2",
  "bare-pack": "2.0.1",
  "expo": "54.0.34",
  "react-native": "0.81.5",
  "react": "19.1.0"
}
```

### 🔁 Frequency

Always (100%)

### 🔥 Severity

Critical - Complete blocker, no workaround

### 🩹 Workaround

_No response_

### 📎 Additional Context

Can QVAC confirm whether @qvac/llm-llamacpp Vulkan offload is expected to work on Mali-G610 MC6 / Vulkan 1.1.177? If this GPU/driver is supported, we would appreciate guidance on a Mali-safe config or build option to avoid the crashing allocation path in ggml_backend_alloc_ctx_tensors_from_buft.

Specifically, is there a QVAC-supported way to:

- Disable pinned/staging memory behavior for Vulkan
- Force safer Mali memory types/layouts
- Avoid Vulkan backend allocation for full context/scratch buffers when only gpu_layers: 1 is set
- Gracefully fall back to CPU instead of native SIGSEGV when Vulkan allocation fails
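
Because a native SIGSEGV cannot be caught from JavaScript, one app-side stopgap for the last point is a crash sentinel. The app persists a flag just before GPU initialization and clears it afterwards. If the flag is still set on the next launch, the previous attempt did not finish, so the app chooses CPU. This is only a sketch under assumptions: the KeyValueStore interface, the key name, and the function names are hypothetical, and a real app would need storage that survives process death.

```typescript
// Hypothetical synchronous key-value store. In an app this must be backed by
// persistent storage so the flag survives a native crash.
interface KeyValueStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
  delete(key: string): void;
}

const GPU_INIT_FLAG = "gpu_init_in_progress"; // hypothetical key name

// A leftover flag means the last GPU attempt never finished, so pick CPU.
function chooseBackend(store: KeyValueStore): "gpu" | "cpu" {
  return store.get(GPU_INIT_FLAG) === "1" ? "cpu" : "gpu";
}

// Call immediately before starting native GPU initialization.
function markGpuInitStarted(store: KeyValueStore): void {
  store.set(GPU_INIT_FLAG, "1");
}

// Call once GPU initialization has returned (success or caught error).
function markGpuInitFinished(store: KeyValueStore): void {
  store.delete(GPU_INIT_FLAG);
}

// In-memory store for demonstration only; it does not survive a crash.
function createMemoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    get: (key) => data.get(key),
    set: (key, value) => { data.set(key, value); },
    delete: (key) => { data.delete(key); },
  };
}
```

The cost is one crash per device before the fallback takes effect, so a supported native fallback would still be preferable.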

### ✅ Checklist

- [x] I have searched existing issues for duplicates
- [x] I have included a minimal reproduction
- [x] I am using a supported SDK version
