we have an alignment blog post dropping soon! scaling up in the next couple of months, then hopefully opening up an API or licensing it.

Benchmarks are really fun, and we have lots of secret ones. Our main thesis is that you should use the same benchmarks to measure a human's ability to use a computer as you would an AI model's. That definitely means a suite of continuous long-term planning tasks (games), plus everyday things such as marking emails as spam.

definitely! we are looking into more interpretability + visualizations in general as we scale up.