CnakeCharmer - https://github.com/dleemiller/CnakeCharmer
https://huggingface.co/datasets/CnakeCharmer/CnakeCharmer
This project started from a belief that LLMs should be better at Python-to-Cython code translation than they are. So we started building a large set of parallel implementations.
Then I realized that Claude Code was much better at working on the data when using tools (MCP) to check and iterate. The scope turned into a platform for creating the SFT agentic trace dataset using sandboxed tools for compilation, testing, linting, address sanitizing and benchmarking.
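To give a rough idea of the check-and-iterate loop described above, here is a minimal sketch of a correctness-plus-benchmark check on one parallel pair. It is not the project's actual tooling: `check_pair`, `py_sum_squares` and `candidate_sum_squares` are hypothetical names, and the "candidate" is plain Python standing in for a compiled Cython translation so the example runs without a compiler.

```python
import timeit
from typing import Callable, Sequence


def py_sum_squares(n: int) -> int:
    """Reference pure-Python implementation (hypothetical example)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def candidate_sum_squares(n: int) -> int:
    """Stand-in for a compiled translation; uses the closed form here."""
    return (n - 1) * n * (2 * n - 1) // 6


def check_pair(
    reference: Callable[[int], int],
    candidate: Callable[[int], int],
    inputs: Sequence[int],
    repeats: int = 5,
) -> dict:
    """Compare outputs on every input, then time both on the last input."""
    inputs = list(inputs)
    mismatches = [x for x in inputs if reference(x) != candidate(x)]
    report = {"passed": not mismatches, "mismatches": mismatches}
    if not mismatches and inputs:
        sample = inputs[-1]
        # Best-of-N wall-clock time for a single call of each version.
        ref_t = min(timeit.repeat(lambda: reference(sample), number=1, repeat=repeats))
        cand_t = min(timeit.repeat(lambda: candidate(sample), number=1, repeat=repeats))
        report["speedup"] = ref_t / cand_t if cand_t > 0 else float("inf")
    return report


if __name__ == "__main__":
    print(check_pair(py_sum_squares, candidate_sum_squares, [0, 1, 10, 1000]))
```

An agent would get a report like this back from a tool call and revise the translation whenever `passed` is false. The benchmark only runs once the outputs match, so a fast but wrong candidate never gets a speedup number.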
We still need to bulk up the GRPO dataset with a large number of good unmatched Python examples. But early results using only SFT on gpt-oss 20b are quite good.