Hey HN, we’re Andrew and Derik at RunRL (https://runrl.com/). We've built a platform to improve models and agents with reinforcement learning. If you can define a metric, we'll make your model or agent better, without you having to think about managing GPU clusters.

Here's a demo video: https://youtu.be/EtiBjs4jfCg

I (Andrew) was doing a PhD in reinforcement learning on language models, and everyone kept...not using RL because it was too hard to get running. At some point I realized that someone's got to sit down and actually write a good platform for running RL experiments.

Once I built it, people started using it for antiviral design, formal verification, browser agents, and a bunch of other cool applications, so we decided to make a startup out of it.

How it works:

- Choose an open-weight base model (weights are necessary for RL updates; Qwen3-4B-Instruct-2507 is a good starting point)

- Upload a set of initial prompts ("Generate an antiviral targeting the SARS-CoV-2 protease", "Prove this theorem", "What's the average summer high in Windhoek?")

- Define a reward function, using Python, an LLM-as-a-judge, or both (see the sketch just after this list)

- For complex settings, you can define an entire multi-turn environment

- Watch the reward go up!
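
For a concrete picture, a reward function is just Python that maps a completion to a score. Here's a minimal sketch (the function name, signature, and "answer" field are illustrative assumptions, not the exact API):

    def reward(completion, **kwargs):
        # Score 1.0 if the completion contains this example's expected answer,
        # else 0.0. In practice the reward can be anything you can compute in
        # Python: a docking score, a proof checker's verdict, a judge's rating...
        expected = kwargs.get("answer", "")  # hypothetical per-example field
        return 1.0 if expected and expected in completion else 0.0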

For most well-defined problems, a small open model + RunRL outperforms frontier models. (For instance, we've seen Qwen-3B do better than Claude 4.1 Opus on antiviral design.) This is because LLM intelligence is notoriously "spiky": models are often decent-but-not-great at common-sense knowledge, randomly good at a few domains, and mistake-prone on lots of other tasks. RunRL creates spikes precisely on the tasks where you need them.

Pricing: $80/node-hour. Most models up to 14B parameters fit on one node (0.6-1.2 TB of VRAM). We do full fine-tuning, at the cost of parameter-efficiency (with RL, people seem to care a lot about the last few percent gains in e.g. agent reliability).

Next up: continuous learning; tool use. Tool use is currently in private beta, which you can join here: https://forms.gle/D2mSmeQDVCDraPQg8

We'd love to hear any thoughts, questions, or positive or negative reinforcement!

Was excited to see something about reinforcement learning as I'm working on training an agent to play a game, but apparently all reinforcement learning nowadays is for LLMs.

Yeah, for better or worse, the way the median startup interfaces with AI these days is through an LLM API, and that's what all the workflows are built around, so that's what we're targeting. Though, depending on what you're trying to do, I wouldn't discount starting with a pretrained model: there was a famous result from 2022 showing that pretraining a model on _Wikipedia_ made training on Atari games more than twice as efficient [0]. These days, LLMs come with huge priors about the real world that make them great starting points for a surprisingly diverse set of tasks (e.g. see the chemistry example in our video!)

[0]: https://arxiv.org/abs/2201.12122

Have you heard of https://puffer.ai? Might fit your use case

This is really neat! Didn’t realize it could be this simple to run RL on models. Quick question: How would I specify the reward function for tool use? or is this something you automatically do for me when I specify the available tools and their uses?

Thanks! Our goal is to make RL "just work", with completely automated GPU provisioning, algorithm selection, and SFT warm-up, while giving people the ability to switch away from the defaults if they want to.

The way tools currently work in the beta: you add tools via MCP to the configuration, and they get passed to the model as additional context. The model might then choose to use a tool during inference; the tool is automatically called and its output is returned as a tool message. If you really want to, you could parse the tool output as part of reward calculation, but I expect you'd usually base the reward just on the model's completion. I can give more details if there's a specific tool setup you're envisioning!

To add to this, you can currently manually parse tool calls in your environment's step function, but we'll be rolling out a UI that makes this easier soon.
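
As a rough illustration (the message format, helper methods, and Gym-style return shape are all assumptions, not the exact environment API), a step function that handles tool calls could look something like:

    import json

    class ToolEnv:
        def step(self, model_message):
            # If the model emitted a tool call, execute it and feed the result
            # back as a tool message so the rollout can continue.
            if model_message.get("tool_calls"):
                call = model_message["tool_calls"][0]["function"]
                args = json.loads(call["arguments"])
                result = self.run_tool(call["name"], args)  # hypothetical helper
                return {"role": "tool", "content": str(result)}, 0.0, False
            # Otherwise treat the message as the final answer and score it.
            return None, self.score(model_message["content"]), True  # hypothetical scorer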

Very neat! A) If I want to have a different grading rubric per example (and grade with an LLM as a judge), do I do this through the reward function? B) What's the pricing on the deployed API? (Is it per token?)

A) You could have an additional field in the jsonl file that says which rubric to use; your reward function can then access it via `kwargs["rubric"]` and return a reward based on that example's preferred rubric.
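
A sketch of what that can look like, with an LLM judge grading against the per-example rubric (the OpenAI client and judge model here are just examples; any judge works):

    from openai import OpenAI

    client = OpenAI()  # example judge provider

    def reward(completion, **kwargs):
        # Grade the completion against this example's own rubric and map
        # the judge's 0-10 verdict to a reward in [0, 1].
        rubric = kwargs["rubric"]
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",  # example judge model
            messages=[{
                "role": "user",
                "content": (
                    f"Rubric:\n{rubric}\n\nResponse:\n{completion}\n\n"
                    "Score the response against the rubric. "
                    "Reply with a single number from 0 to 10."
                ),
            }],
        )
        try:
            return float(verdict.choices[0].message.content.strip()) / 10.0
        except ValueError:
            return 0.0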

B) The deployed API is currently free, but startup takes a few minutes and it runs on a small GPU node, so it's not terribly fast. If you'd like production-level inference, email us at founders@runrl.com and we can set you up with something much faster (charged per token, depending on model size).

Is there any credence to the view that these startups are basically DSPy wrappers?

DSPy is great for prompt optimization but not so much for RL fine-tuning (their support is "extremely EXPERIMENTAL"). The nice thing about RL is that the exact prompts don't matter so much. You don't need to spell out every edge case, since the model will get an intuition for how to do its job well via the training process.

Isn’t the latest trend in RL mostly about prompt optimization as opposed to full fine-tuning?

Prompt optimization is very cool, and we use it for certain problems! The main goal with this launch is to democratize access to "the real thing": in many cases, full RL gets you the last few percent of reliability in complex agentic workflows where prompt optimization doesn't quite get you far enough.

There are also lots of interesting possibilities, such as RLing a model on a bunch of environments and then prompt-optimizing it for each specific one, which seems way better than, like, training and hot-swapping many LoRAs. In any case, _someone_ ought to provide a full RL API, and we're here to do that well!

Thanks. Is this mainly for verifiable tasks, or any general task?

It's for any task that has an "eval", which usually means verifiable tasks or ones that can be judged by LLMs (e.g. see [0]). There's also been recent work, such as BRPO [1] and similar approaches, on giving more and more "non-verifiable" tasks verifiable rewards!

[0]: https://runrl.com/blog/funniest-joke

[1]: https://arxiv.org/abs/2506.00103

There needs to be some way of automatically assessing performance on the task, though this could be with a Python function or another LLM as a judge (or a combination!)
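
For the combined case, a quick sketch (both helpers are hypothetical stand-ins for whatever check and judge you already have, and the weights are arbitrary):

    def reward(completion, **kwargs):
        # Mix a hard, verifiable check with a softer LLM-judge score.
        hard = 1.0 if check_passes(completion) else 0.0  # hypothetical verifier
        soft = judge_score(completion)  # hypothetical judge returning a float in [0, 1]
        return 0.8 * hard + 0.2 * soft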

Perhaps less about DSPy, and more about this: https://github.com/OpenPipe/ART

ART is also great, though since it's built on top of Unsloth it's geared towards single GPU QLoRA training. We use 8 H100s as a standard, so we can handle larger models and full-parameter fine-tunes.

Interesting, do you have benchmarks on FFT vs QLoRA for RL?

We should publish some; the first-order effect seems to be that LoRAs significantly hurt small-model performance vs FFT, with less of an effect for large models. This is maybe because large models have more built-in skills, so a LoRA suffices to elicit an existing skill, whereas small models need to do more actual learning (holding the number of parameter updates constant). In general I think it's better to get a performant small model with FFT than a performant large model with a large LoRA, which is why we default to FFT, but I agree that we should publish more details here.

Thanks! Personally, I've found that FFT isn't necessarily a strict improvement over (Q)LoRA, since it can more easily lead to instability in the model; hence the bit of extra scrutiny.

Curious to see your thoughts and results whenever you get something out.

I'd love to see something that can RL an agent (of sorts) that interacts with an interactive theorem prover (like Lean4, Coq, or Isabelle/HOL), probably via a harness rather than plain shell-like interaction, and actively exploits the fact that discovery itself is harmless beyond the inference and oracle cost of investigating an abandoned branch.

I.e., it's not at all like a typical game, because at no point does "success rate without relying on rollback/savestate-reloading" actually matter. An agent that splits its compute evenly between abandoned (exploratory) branches and the path that ends up in the solution the formal verifier checks, while having a near-100% solve rate on the problems fed to it, is a VERY GOOD agent.

That's because this task, unlike most RL tasks, is one where the agent uses discovery to log an interaction trace that can be trivially and mechanically trimmed down to a verifiable proof of the provided problem. I.e., the hard part is finding ANY path that solves the problem without spending exponential amounts of compute brute-forcing it over the bounded state size of practical relevance, since that would take longer than the heat death of the universe; i.e., it's theoretically impractical.

Most RL tasks want an agent that is particularly good at its task; and while the effort spent to find a proof certainly matters (if only because lower cost means the agent can train on more instances with the same training budget), it's much less relevant than the solve rate itself: the fraction of problems for which any verifiably correct proof sequence can be found at some definable level of effort, expressed as e.g. number of shots, total compute budget for the instance, ratio of exploration nodes to nodes that end up in the final proof sequence, etc.
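
In code, the reward shape I'm after is roughly this (names are made up, just to show that exploration cost never enters the score):

    def reward_for_instance(trace, verifier):
        # All that matters is whether the logged trace can be mechanically
        # trimmed to a proof the verifier accepts; compute spent on abandoned
        # branches doesn't appear anywhere.
        candidate = trim_to_solution_path(trace)  # hypothetical trimming step
        return 1.0 if verifier.accepts(candidate) else 0.0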

Non-benchmark usage would mostly entail semi-synthetic, crowd-sourced datasets: open sub-instances from practical applications of formal verification, plus more-synthetic instances derived from very coarse high-level questions (mechanically broken down into more-manageable chunks before the RL agent gets to work), like "given these more-specific rules of what is _actually_ UB and what is only UB in ANSI but actually defined in the specific toolchain that we use: does that C program over there contain ANY UB?" or "is there ANY way that input at that file/network socket over there to that program over here could execute arbitrary code?". Given that, there'd be no economic incentive to solve any given instance more than once, beyond what is necessary to keep the RL training process itself stable.

That task also lends itself to semi-online learning as every supplied instance essentially pays once for a verified solution and the overall process should deliver solid ROI. Running a single GPU cluster/pile for both training and inference would allow higher utilization at the cost of running with some variable amount of latency between rolling out an episode and training on the completed episode's oracle-verified rewards.

Having an RL agent that's really good at search across some space sounds very powerful in general; "proofs as search" makes this an appealing target. Back in the day, when I did more fundamental RL research, we worked on an extension of SoRB [0] where an additional meta-level objective was learning better heuristics to explore the search space faster; it would be exciting to figure out what a good setup for this looks like in today's LLM-policy-gradient world!

[0]: https://arxiv.org/abs/1906.05253
