I'm learning about "AI programming" by working on some toy problems, like an automated subtitle translator that takes both the existing English subtitles and a centre-weighted mono audio track extracted from the video file and feeds both to an AI.

My big takeaway lesson from this is that the APIs are clumsy, the frameworks are very rough, and we're still very much in the territory of having to roll your own bespoke solutions for everything instead of the whole thing "just working". For example:

Large file uploads are very inconsistent between providers. You get fun issues like a completed file upload being unusable because there's an extra "processing" step that you have to poll-wait for. (Surprise!)
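
The workaround is a poll-wait loop after every upload. A minimal sketch; everything here (`client.files.get`, the state names) is a placeholder, because each provider spells these differently, which is exactly the problem:

```python
import time

def wait_until_ready(client, file_id, timeout=300, interval=5):
    """Poll a (hypothetical) files API until processing finishes.

    `client.files.get` and the "processing"/"ready"/"failed" states
    are placeholders; every provider names these differently.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = client.files.get(file_id)
        if info.state == "ready":
            return info
        if info.state == "failed":
            raise RuntimeError(f"upload {file_id} failed server-side")
        time.sleep(interval)
    raise TimeoutError(f"upload {file_id} still processing after {timeout}s")
```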

The vendors all expose a "list models" API, but none of them returns consistent, useful metadata.
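
So in practice you end up writing a normalization shim. A minimal sketch, where the per-provider field names are my best guesses from poking at the responses, not anything documented:

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    """Lowest-common-denominator metadata; any vendor may omit any field."""
    id: str
    context_window: int | None = None

def normalize(provider: str, raw: dict) -> ModelInfo:
    # Field names below are illustrative; each provider's list-models
    # payload has to be reverse-engineered separately.
    if provider == "openai":
        return ModelInfo(id=raw["id"])  # little more than an id comes back
    if provider == "google":
        return ModelInfo(id=raw.get("name", ""),
                         context_window=raw.get("inputTokenLimit"))
    return ModelInfo(id=raw.get("id") or raw.get("name", "unknown"))
```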

Automatic context caching isn't.

Multi-modal inputs are still very "early days". Models are terrible at mixed-language input and multiple speakers, and get confused by background noise, music, and singing.

You can tell an AI to translate the subtitles to language 'X', and it will... most of the time. If you also provide the audio, it can get confused and decide it's being asked to transcribe instead, and hand you back a fresh set of English subtitles.
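
The only mitigation I can sketch is an over-explicit prompt; redundancy seems to reduce (but not eliminate) the transcribe-instead-of-translate failure mode. This wording is illustrative, not a recipe:

```python
# Illustrative system prompt; fill in target_lang with str.format().
SYSTEM_PROMPT = """\
You are a subtitle translator.
- Translate the provided English subtitles into {target_lang}.
- The audio track is supplied ONLY as context for tone and timing.
- Do NOT transcribe the audio. Do NOT return English text.
- Output only {target_lang} subtitle text.
"""
```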

With some providers, JSON schemas are a hint, not a constraint.
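
So you end up validating the output yourself. A minimal sketch using the `jsonschema` package; the schema here is a stand-in for whatever your subtitle records actually look like:

```python
import json
import jsonschema  # pip install jsonschema

# Stand-in schema for one translated subtitle record.
SCHEMA = {
    "type": "object",
    "properties": {
        "index": {"type": "integer"},
        "text": {"type": "string"},
    },
    "required": ["index", "text"],
}

def parse_or_reject(raw: str) -> dict:
    """Trust nothing: parse, then validate against our own schema."""
    obj = json.loads(raw)             # may raise json.JSONDecodeError
    jsonschema.validate(obj, SCHEMA)  # may raise jsonschema.ValidationError
    return obj
```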

Some providers *cough*oogle*cough* don't support all JSON Schema constructs, so you can't safely use their API with arbitrary input types.

If you ask for one whole JSON document back, the response takes long enough that you'll hit timeout errors.

If you stream your results instead, you have to handle reassembly and parsing yourself; the frameworks don't handle this scenario well yet.
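
Here's the kind of reassembly glue you end up writing. It assumes you've prompted the model to emit one JSON object per line, since chunk boundaries land anywhere, including mid-string:

```python
import json

def assemble_stream(chunks):
    """Reassemble streamed text deltas and yield complete JSON objects.

    Assumes the model was asked to emit one JSON object per line;
    `chunks` is whatever text fragments your SDK's stream hands you.
    """
    buffer = ""
    for delta in chunks:
        buffer += delta
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # trailing object without a final newline
        yield json.loads(buffer)
```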

You'd think a JSON Lines (JSONL) schema would be perfect for this scenario, but it's explicitly not supported by some providers!

Speaking of failures, you also get refusals and other undocumented errors you'll only discover in production. If you're maintaining a history or sliding window of context, you have to carefully maintain snapshots so you can roll back and retry. With most APIs you don't even know whether the error was a temporary or permanent condition, or if your retry loop is eating into your budget.
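
The snapshot-and-rollback dance looks roughly like this; the `client.chat` call is a placeholder for whichever SDK you're on:

```python
import copy

def call_with_rollback(client, history, user_turn, max_retries=3):
    """Append a turn, call a (placeholder) chat API, and restore the
    history snapshot on failure so a bad call can't poison the window."""
    snapshot = copy.deepcopy(history)
    history.append(user_turn)
    for attempt in range(max_retries):
        try:
            reply = client.chat(messages=history)  # placeholder call
            history.append(reply)
            return reply
        except Exception:
            # Most APIs won't tell you if this is transient or permanent,
            # so any retry policy here is a guess.
            if attempt == max_retries - 1:
                history[:] = snapshot  # roll back in place
                raise
```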

Context size management is extra fun now that none of the mainstream models provide their tokenizer for offline use. Sometimes the input fits into the context window, sometimes it doesn't. You have to back off and retry with problem-specific heuristics.
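
One such heuristic, sketched: estimate roughly four characters per token (a rule of thumb for English, and wrong for everything else), batch greedily, and keep a safety margin:

```python
def rough_tokens(text: str) -> int:
    # ~4 chars/token is a rule of thumb for English; without the real
    # tokenizer it's about the best you can do offline.
    return len(text) // 4

def fit_to_budget(subtitle_lines, budget_tokens):
    """Greedily batch lines until the estimate nears the budget,
    keeping 20% headroom because the estimate can be badly wrong."""
    batch, used = [], 0
    for line in subtitle_lines:
        cost = rough_tokens(line)
        if used + cost > budget_tokens * 0.8:
            break
        batch.append(line)
        used += cost
    return batch, subtitle_lines[len(batch):]  # (batch, remainder)
```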

Ironically, the APIs are so new and undergoing so much churn that the AI models know nothing about them. And anyway, how could they? None of them are properly documented! Google just rewrote everything into the new "GenAI" SDK and OpenAI has a "Responses" API which is different from their "Chat" API... I don't know how. It just is.