I'm always skeptical because you can tune a model to pass the benchmarks, and then when you actually use it, it isn't practically useful the way an extremely general model is.
Cool work though, really excited for the potential of slimming down models.
I find it's often very language- and sector-dependent. I still see a massive difference in systems programming (normally C++ and Rust) between any open model I've tried and something like Sonnet 4.5 (not really tried 4.6). And honestly, even the big models (like Opus 4.6) struggle in many cases.
Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (minimax2.5, GLM-4.7, Qwen3, 3.5 and -coder variants) spends so much time trying to produce something syntactically sensible that the compiler will accept that, by the time it has finished, it barely seems to have any "momentum" left to actually solve the problem. Pretty much anything but the most trivial change ends up in another loop of trying to get the code working again, often losing the intent of that change in the process.
My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".
Having plenty of local GPU power, I'd love to be able to actually use it, and I'm already wary about some of the training data use and its interactions with the license of the code I'm "sending" to the cloud models...
> Perhaps these things aren't well represented in the training data for these open models
I know from first-hand experience that at least a couple of the SOTA providers use third-party providers for supervised finetuning, with instructions that are heavily geared towards a specific set of languages as well. But the base dataset from the major providers is likely to be sufficiently better that it matters less, and the big models are good enough at transferring what they learn that extra training on the core languages they care about seems to carry over at least somewhat to other languages (you see this with natural language too - they do really well for many minor languages that make up a minuscule proportion of the training data).
(I won't say much more regarding the SFT/RLHF work due to NDAs - plural; I know who one of the providers is; I don't know who the one or more others are as the intermediary I did some work for obscured it well enough that I couldn't really violate the NDA even if I wanted to)
You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.
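To make the retry idea concrete, here is a minimal sketch of a generate-test-retry loop. It is not ATLAS's actual code: `generate`, `run_tests`, and `solve_with_retries` are hypothetical names, and `p_at_least_one` assumes attempts succeed independently with the same probability, which feedback-driven retries won't exactly satisfy.

```python
from typing import Callable, Optional


def solve_with_retries(
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes its tests, or None.

    `generate(task, feedback)` and `run_tests(code)` are hypothetical hooks:
    run_tests returns None on success, otherwise an error message.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task, feedback)
        error = run_tests(candidate)
        if error is None:
            return candidate
        # Tell the next attempt what went wrong with this one.
        feedback = error
    return None


def p_at_least_one(p: float, k: int) -> float:
    """Chance that at least one of k attempts works, assuming independence."""
    return 1 - (1 - p) ** k
```

Under that independence assumption, a model that is right half the time gets to `p_at_least_one(0.5, 3) == 0.875` with three attempts, which is why sampling more candidates helps so much.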
But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote, which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
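As a rough illustration of "trained ahead of time on labeled fingerprints", here is a toy scorer. It is an assumption-heavy stand-in: a plain logistic regression in pure Python rather than whatever network the repo actually uses, with made-up 2-D fingerprints. The names `train_cost_scorer` and `cost` are hypothetical. Labels are 1 for wrong and 0 for correct, so a higher score means "more likely wrong".

```python
import math
from typing import List, Sequence, Tuple


def cost(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Score a fingerprint: near 0 looks correct, near 1 looks wrong."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_cost_scorer(
    fingerprints: Sequence[Sequence[float]],
    is_wrong: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights with per-example gradient steps on the logistic loss."""
    w = [0.0] * len(fingerprints[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(fingerprints, is_wrong):
            err = cost(w, b, x) - y  # gradient of the loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Made-up toy data: correct solutions cluster near the origin.
toy_fingerprints = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
toy_is_wrong = [0, 0, 1, 1]
w, b = train_cost_scorer(toy_fingerprints, toy_is_wrong)
```

Real embeddings have thousands of dimensions and are nowhere near this cleanly separable, so how well the scorer works depends entirely on the training data.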
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
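The selection step itself is small. A sketch, assuming hypothetical `embed` and `cost` callables standing in for the model's embedding call and the trained Cost Field (neither name comes from the repo):

```python
from typing import Callable, List, Sequence, Tuple


def pick_lowest_cost(
    candidates: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    cost: Callable[[Sequence[float]], float],
) -> Tuple[str, List[float]]:
    """Score every candidate's fingerprint and return the cheapest one.

    Also returns all scores so the caller can fall back to the
    next-best candidate if the chosen one fails its tests.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [cost(embed(c)) for c in candidates]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores
```

Scoring is just a forward pass through a tiny network, so it is far cheaper than running each candidate in a sandbox. Since the pick is only right about 88% of the time, the chosen candidate still has to go through the tests.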
Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.
> it's not clear to me based on the description how this could all be done efficiently.
Depends how you define efficiency. The power use of this rig is a lot less than that of the large data centers that serve trillion-parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than what the frontier models charge.
Seems like the key insight is to train a small model that acts as a heuristic for embeddings that resemble quality code. I imagine a lot depends on how well this model is trained. And you could probably create specialized versions for different languages and domains.
Another interesting approach could be to use this setup with a language like Clojure or Common Lisp, which facilitates interactive development. If you could hook up the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.
I'm super confused. The small model "cost field" `rag-api/geometric_lens/cost_field.py` was trained on PASS_TASKS like "Write a function that counts vowels in a string." and FAIL_TASKS like "Write a function that converts a regular expression string to an NFA using Thompson's construction, then converts the NFA to a DFA.".
So it seems like it's a difficulty classifier for task descriptions written in English.
This is then used to score embeddings of Python code, which is a completely different distribution.
Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.
But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
> But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
It does, because hallucinations and low confidence share characteristics in the embedding vector which the small neural network learns to recognize. And the fact that it continuously learns based on the feedback loop is pretty slick.
I think the goal is to have a light heuristic that helps find plausibly useful solutions. They're still going to go through a testing phase as a next step, so this is just a very simple filter to decide what's even worth testing.
I tried to read the project documentation, but I got overwhelmed by the aimless AI generated documentation that has a nebulous goal of documenting absolutely everything, but never explaining anything.
If the author actually wanted to explain his project he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference-time learning is cost-prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference-time learning to become viable, and that's exactly what ATLAS does to achieve a 74.6% pass rate on LiveCodeBench and thereby outperform Claude Sonnet with a small 14B open-weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"
That's a very long sequence of unexplained buzzwords that could mean absolutely anything.
I think this is because when you shrink it down, the model ends up space constrained and each “neuron” ends up having to do multiple duties. It can still be tuned to perform well at specific tasks, but no longer generalizes as well. It’s somewhat unintuitive, but models that are larger are often simpler than smaller ones for this same reason.