Models
    - gpt-oss-120b, Meta Llama 3.2, or Gemma (just depends on what I’m doing)

  Hardware
    - Apple M4 Max (128 GB RAM)
      paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking (rough link setup sketched below)

  Software
    - Claude Code
    - RA.Aid
    - llama.cpp

  For CUDA workloads, I use an NVIDIA RTX 2080 in an older System76 workstation.
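  In case it helps anyone copying this setup: the Ubuntu side of the USB-C link can be brought up roughly like this (a hedged sketch, not my exact config; the `thunderbolt0` interface name and the 10.55.0.x addresses are illustrative assumptions):

```sh
# Load the Thunderbolt/USB4 networking driver, which exposes the
# USB-C link as an Ethernet-style interface (often thunderbolt0).
sudo modprobe thunderbolt-net

# Static address on a private point-to-point subnet; the subnet is
# an arbitrary example, not my actual addressing.
sudo ip addr add 10.55.0.2/24 dev thunderbolt0
sudo ip link set thunderbolt0 up

# The Mac gets the matching address (e.g. 10.55.0.1) on its
# Thunderbolt Bridge interface via System Settings -> Network.
ping -c 3 10.55.0.1
```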

  Process

    I create a solid INSTRUCTIONS.md for Claude/RA.Aid that specifies the task and the production process, along with a task list the agent maintains. I use Claude agents with an Agent Organizer that helps determine which agents to use. The pipeline creates the architecture, PRD, and security design, writes the code, and then lints, tests, and does a code review.
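    For the curious, here’s roughly the shape mine takes (a hedged sketch: the headings and the example task are illustrative, not a copy of my actual file):

```markdown
# INSTRUCTIONS.md

## Task
Add rate limiting to the public API. (Placeholder task for illustration.)

## Production process
1. Agent Organizer selects the sub-agents for this task.
2. Architecture, PRD, and security design are drafted.
3. Code is written against the PRD.
4. Lint, test, and code review; repeat until all checks pass.

## Task list (maintained by the agent)
- [x] Draft PRD and security design
- [ ] Implement rate-limiting middleware
- [ ] Add tests
- [ ] Final code review
```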

What does the GPD Win 4 do in this scenario? Is there a step w/ Agent Organizer that decides if a task can go to a smaller model on the Win 4 vs a larger model on your Mac?

What sorts of token/s are you getting with each model?

Model performance summary:

| Model | Format | Speed | Hugging Face repo |
| --- | --- | --- | --- |
| openai/gpt-oss-120b | MLX (MXFP4) | ~66 tokens/sec | `lmstudio-community/gpt-oss-120b-MLX-8bit` |
| google/gemma-3-27b | MLX (4-bit) | ~27 tokens/sec | `mlx-community/gemma-3-27b-it-qat-4bit` |
| qwen/qwen3-coder-30b | MLX (8-bit) | ~78 tokens/sec | `Qwen/Qwen3-Coder-30B-A3B-Instruct` |
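If anyone wants to reproduce these numbers: the MLX repos above can be timed with the mlx-lm CLI, which prints prompt and generation speeds in tokens-per-sec after each run (a hedged sketch; the prompt and `--max-tokens` value are arbitrary, and throughput will vary with context length and machine):

```sh
# Install the MLX runtime (Apple Silicon only).
pip install mlx-lm

# One-off generation; mlx_lm.generate reports prompt and generation
# tokens-per-sec when it finishes.
mlx_lm.generate \
  --model mlx-community/gemma-3-27b-it-qat-4bit \
  --prompt "Write a binary search in Python." \
  --max-tokens 256
```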

Will reply back and add Meta Llama performance shortly.

What is the Agent Organizer you use?

It’s a Claude agent prompt. I don’t recall who originally shared it, so I can’t yet attribute the source, but I’ll track that down shortly and add proper attribution here.

Here’s the Claude agent markdown:

https://github.com/lst97/claude-code-sub-agents/blob/main/ag...

Edit: Updated from the old Pastebin link to the GitHub version. Attribution found: lst97 on GitHub.
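For anyone who hasn’t used Claude Code sub-agents: each one is a markdown file (typically under `.claude/agents/`) with YAML frontmatter for the agent’s name, description, and allowed tools, followed by its system prompt. A hedged sketch of the general shape, not the actual contents of the linked file:

```markdown
---
name: agent-organizer
description: Decides which specialist sub-agents should handle the current task.
tools: Read, Grep, Glob
---

You are the Agent Organizer. Given the task in INSTRUCTIONS.md,
choose which specialist agents (architect, security reviewer, coder,
tester, code reviewer) should run, in what order, and report the
plan back before any work starts.
```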

Funny how it looks like the Claude agent was itself written by Claude...