docker model runner llmfit launch #837

Open

sathiraumesh wants to merge 4 commits into docker:main from sathiraumesh:launch-llmfit-747

Conversation

@sathiraumesh (Contributor) commented Apr 6, 2026

Changes

  • Add llmfit CLI tool to docker model launch

fixes #747

@sourcery-ai bot (Contributor) left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In printAppConfig, the host port is only printed when ca.defaultHostPort > 0, which means an explicitly overridden --port for apps with no defaultHostPort (e.g. future apps like llmfit) would be hidden; consider checking hostPort > 0 instead so overrides are reflected in the config output.
  • In launchContainerApp, a non-zero portOverride is silently ignored when ca.containerPort == 0 (e.g. for llmfit); consider either validating and erroring on --port usage in that case or logging a message so users know their port setting is not applied.
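The second point could be handled with a small guard before launch. A minimal sketch, assuming hypothetical names (`containerApp`, `validatePortOverride`) that may not match the actual code in `cmd/cli/commands/launch.go`:

```go
package main

import (
	"errors"
	"fmt"
)

// containerApp is a hypothetical stand-in for the struct discussed in the
// review; the real fields in cmd/cli/commands/launch.go may differ.
type containerApp struct {
	containerPort   int
	defaultHostPort int
}

// validatePortOverride rejects an explicit --port for apps that expose no
// container port (e.g. llmfit), instead of silently ignoring it.
func validatePortOverride(ca containerApp, portOverride int) error {
	if portOverride > 0 && ca.containerPort == 0 {
		return errors.New("--port is not supported for this app: it exposes no container port")
	}
	return nil
}

func main() {
	llmfit := containerApp{containerPort: 0}
	if err := validatePortOverride(llmfit, 8080); err != nil {
		fmt.Println("error:", err)
	}
}
```

Erroring out (rather than logging) has the advantage of failing fast: the user immediately learns their flag had no effect, instead of discovering it later.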
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `printAppConfig`, the host port is only printed when `ca.defaultHostPort > 0`, which means an explicitly overridden `--port` for apps with no defaultHostPort (e.g. future apps like `llmfit`) would be hidden; consider checking `hostPort > 0` instead so overrides are reflected in the config output.
- In `launchContainerApp`, a non-zero `portOverride` is silently ignored when `ca.containerPort == 0` (e.g. for `llmfit`); consider either validating and erroring on `--port` usage in that case or logging a message so users know their port setting is not applied.

## Individual Comments

### Comment 1
<location path="cmd/cli/commands/launch.go" line_range="223-224" />
<code_context>
+		if ca.containerPort > 0 {
+			cmd.Printf("  Container port: %d\n", ca.containerPort)
+		}
+		if ca.defaultHostPort > 0 {
+			cmd.Printf("  Host port:      %d\n", hostPort)
+		}
 		if ca.envFn != nil {
</code_context>
<issue_to_address>
**issue (bug_risk):** Host port visibility should depend on the effective hostPort value, not defaultHostPort.

Using `ca.defaultHostPort > 0` here can suppress valid user overrides. For example, if `defaultHostPort == 0` but the user passes `--port`, `hostPort` will be non-zero yet the host port line won’t print. This should check `hostPort > 0` instead so the output matches the actual binding.
</issue_to_address>
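A minimal sketch of the suggested fix, using a hypothetical helper (`formatHostPort`) rather than the actual `printAppConfig` signature:

```go
package main

import "fmt"

// formatHostPort illustrates the review point: compute the effective host
// port first, then gate the output on it, so an explicit --port override is
// shown even when defaultHostPort is zero.
func formatHostPort(defaultHostPort, portOverride int) string {
	hostPort := defaultHostPort
	if portOverride > 0 {
		hostPort = portOverride
	}
	if hostPort > 0 { // was: ca.defaultHostPort > 0, which hid overrides
		return fmt.Sprintf("  Host port:      %d\n", hostPort)
	}
	return ""
}

func main() {
	// defaultHostPort == 0 but --port given: the line should still print.
	fmt.Print(formatHostPort(0, 8080))
}
```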


@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request adds support for the llmfit container application and updates the launch command to support container apps without port mappings. The changes include conditional logic for port display and Docker flags, along with corresponding tests. Review feedback identifies a logic error in how host ports are displayed when no default is provided and notes a formatting inconsistency in the containerApps configuration map.

@ericcurtin (Contributor) commented

More work is needed, I think; this doesn't behave as expected. I tested on macOS and got this:

$ ./cmd/cli/model-cli launch llmfit
{
  "models": [
    {
      "best_quant": "Q5_K_M",
      "category": "Chat",
      "context_length": 4096,
      "estimated_tps": 76.7,
      "fit_level": "Marginal",
      "gguf_sources": [],
      "installed": false,
      "is_moe": true,
      "license": null,
      "memory_available_gb": 6.32,
      "memory_required_gb": 4.3,
      "moe_offloaded_gb": null,
      "name": "microsoft/Phi-mini-MoE-instruct",
      "notes": [
        "CPU-only: model loaded into system RAM",
        "MoE architecture, but expert offloading requires a GPU",
        "No GPU -- inference will be slow",
        "Best quantization for hardware: Q5_K_M (model default: Q4_K_M)",
        "Baseline estimated speed: 76.7 tok/s"
      ],
      "parameter_count": "7.6B",
      "params_b": 7.65,
      "provider": "Microsoft",
      "release_date": null,
      "run_mode": "CPU",
      "runtime": "llama.cpp",
      "runtime_label": "llama.cpp",
      "score": 89.2,
      "score_components": {
        "context": 100.0,
        "fit": 100.0,
        "quality": 73.0,
        "speed": 100.0
      },
      "total_memory_gb": 4.3,
      "use_case": "Instruction following, chat",
      "utilization_pct": 68.0
    },
    {
      "best_quant": "Q4_K_M",
      "category": "General",
      "context_length": 128000,
      "estimated_tps": 75.9,
      "fit_level": "Marginal",
      "gguf_sources": [
        {
          "provider": "unsloth",
          "repo": "unsloth/LFM2-8B-A1B-GGUF"
        }
      ],
      "installed": false,
      "is_moe": true,
      "license": null,
      "memory_available_gb": 6.32,
      "memory_required_gb": 4.7,
      "moe_offloaded_gb": null,
      "name": "LiquidAI/LFM2-8B-A1B",
      "notes": [
        "Context capped at 8192 tokens for estimation (model supports up to 128000; use --max-context to override)",
        "CPU-only: model loaded into system RAM",
        "MoE architecture, but expert offloading requires a GPU",
        "No GPU -- inference will be slow",
        "Baseline estimated speed: 75.9 tok/s"
      ],
      "parameter_count": "8.3B",
      "params_b": 8.34,
      "provider": "Liquid AI",
      "release_date": "2025-10-07",
      "run_mode": "CPU",
      "runtime": "llama.cpp",
      "runtime_label": "llama.cpp",
      "score": 86.5,
      "score_components": {
        "context": 100.0,
        "fit": 100.0,
        "quality": 70.0,
        "speed": 100.0
      },
      "total_memory_gb": 4.7,
      "use_case": "General purpose text generation",
      "utilization_pct": 74.4
    },
    {
      "best_quant": "Q6_K",
      "category": "Chat",
      "context_length": 4096,
      "estimated_tps": 80.5,
      "fit_level": "Marginal",
      "gguf_sources": [
        {
          "provider": "mradermacher",
          "repo": "mradermacher/OLMoE-1B-7B-0125-Instruct-GGUF"
        }
      ],
      "installed": false,
      "is_moe": true,
      "license": null,
      "memory_available_gb": 6.32,
      "memory_required_gb": 3.9,
      "moe_offloaded_gb": null,
      "name": "allenai/OLMoE-1B-7B-0125-Instruct",
      "notes": [
        "CPU-only: model loaded into system RAM",
        "MoE architecture, but expert offloading requires a GPU",
        "No GPU -- inference will be slow",
        "Best quantization for hardware: Q6_K (model default: Q4_K_M)",
        "Baseline estimated speed: 80.5 tok/s"
      ],
      "parameter_count": "6.9B",
      "params_b": 6.92,
      "provider": "allenai",
      "release_date": null,
      "run_mode": "CPU",
      "runtime": "llama.cpp",
      "runtime_label": "llama.cpp",
      "score": 83.6,
      "score_components": {
        "context": 100.0,
        "fit": 100.0,
        "quality": 59.0,
        "speed": 100.0
      },
      "total_memory_gb": 3.9,
      "use_case": "Instruction following, chat",
      "utilization_pct": 61.7
    },
    {
      "best_quant": "Q6_K",
      "category": "General",
      "context_length": 262144,
      "estimated_tps": 185.0,
      "fit_level": "Marginal",
      "gguf_sources": [],
      "installed": false,
      "is_moe": true,
      "license": null,
      "memory_available_gb": 6.32,
      "memory_required_gb": 3.6,
      "moe_offloaded_gb": null,
      "name": "apolo13x/Qwen3.5-35B-A3B-quantized.w4a16",
      "notes": [
        "Context capped at 8192 tokens for estimation (model supports up to 262144; use --max-context to override)",
        "CPU-only: model loaded into system RAM",
        "MoE architecture, but expert offloading requires a GPU",
        "No GPU -- inference will be slow",
        "Best quantization for hardware: Q6_K (model default: Q4_K_M)",
        "Baseline estimated speed: 185.0 tok/s"
      ],
      "parameter_count": "6.4B",
      "params_b": 6.38,
      "provider": "apolo13x",
      "release_date": null,
      "run_mode": "CPU",
      "runtime": "llama.cpp",
      "runtime_label": "llama.cpp",
      "score": 82.5,
      "score_components": {
        "context": 100.0,
        "fit": 100.0,
        "quality": 61.0,
        "speed": 100.0
      },
      "total_memory_gb": 3.6,
      "use_case": "General purpose",
      "utilization_pct": 57.0
    },
    {
      "best_quant": "Q8_0",
      "category": "Chat",
      "context_length": 4096,
      "estimated_tps": 125.0,
      "fit_level": "Marginal",
      "gguf_sources": [],
      "installed": false,
      "is_moe": true,
      "license": null,
      "memory_available_gb": 6.32,
      "memory_required_gb": 2.1,
      "moe_offloaded_gb": null,
      "name": "microsoft/Phi-tiny-MoE-instruct",
      "notes": [
        "CPU-only: model loaded into system RAM",
        "MoE architecture, but expert offloading requires a GPU",
        "No GPU -- inference will be slow",
        "Best quantization for hardware: Q8_0 (model default: Q4_K_M)",
        "Baseline estimated speed: 125.0 tok/s"
      ],
      "parameter_count": "3.8B",
      "params_b": 3.76,
      "provider": "Microsoft",
      "release_date": null,
      "run_mode": "CPU",
      "runtime": "llama.cpp",
      "runtime_label": "llama.cpp",
      "score": 82.0,
      "score_components": {
        "context": 100.0,
        "fit": 86.6,
        "quality": 60.0,
        "speed": 100.0
      },
      "total_memory_gb": 2.1,
      "use_case": "Instruction following, chat",
      "utilization_pct": 33.2
    }
  ],
  "system": {
    "available_ram_gb": 6.32,
    "backend": "CPU (ARM)",
    "cpu_cores": 14,
    "cpu_name": "0",
    "gpu_count": 0,
    "gpu_name": null,
    "gpu_vram_gb": null,
    "gpus": [],
    "has_gpu": false,
    "total_ram_gb": 7.65,
    "unified_memory": false
  }
}

@sathiraumesh (Contributor, Author) commented Apr 6, 2026

Hi @ericcurtin, can you detail the expected behavior? If you want the TUI version, we'd need to make more changes to implement it. Is that what you're expecting?

@ericcurtin (Contributor) commented

Yes, the TUI version



Development

Successfully merging this pull request may close these issues.

docker model launch llmfit

2 participants