Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like I threw 15 dollars into the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
Did you consider their peak hours and model usage multiplier? Read the green box https://docs.z.ai/devpack/overview#usage-instruction
I had the Lite plan and I NEVER maxed out the quota because I took these things into account. If I had, for example, switched over to GLM-5-Turbo, I could easily have burned through the quota.
I just read it and honestly it left an even worse taste in my mouth.
>GLM-5.2 and GLM-5-Turbo are advanced models designed to rival Claude Opus model. Its usage will be deducted at 3 × during peak hours and 2 × during off-peak hours.
Claude certainly does not punish me for using their best models. Why should this "up and coming" company do it?
I thought the up and coming AI companies were supposed to have some kind of leverage in terms of price/performance (see DeepSeek's insanely cheap V4 Flash and Pro).
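For anyone trying to gauge what the quoted multipliers mean in practice, here is a rough sketch of the arithmetic. The 3x peak and 2x off-peak factors come from the quoted docs; the 1x rate for other models and the quota of 120 units are assumptions made up for illustration, not figures from Z.ai.

```python
# Multipliers quoted from the docs for GLM-5.2 and GLM-5-Turbo.
PEAK_MULTIPLIER = 3
OFF_PEAK_MULTIPLIER = 2
# Assumption: every other model is deducted at 1x.
BASE_MULTIPLIER = 1


def cost_per_prompt(peak: bool, premium_model: bool) -> int:
    """Quota units deducted for a single prompt."""
    if not premium_model:
        return BASE_MULTIPLIER
    return PEAK_MULTIPLIER if peak else OFF_PEAK_MULTIPLIER


def prompts_before_limit(quota: int, peak: bool, premium_model: bool) -> int:
    """How many prompts fit into a quota window at a given rate."""
    return quota // cost_per_prompt(peak, premium_model)


# Hypothetical quota of 120 units per window.
HYPOTHETICAL_QUOTA = 120
premium_peak = prompts_before_limit(HYPOTHETICAL_QUOTA, peak=True, premium_model=True)
premium_off_peak = prompts_before_limit(HYPOTHETICAL_QUOTA, peak=False, premium_model=True)
other_model = prompts_before_limit(HYPOTHETICAL_QUOTA, peak=False, premium_model=False)
print(premium_peak, premium_off_peak, other_model)
```

Under those assumptions, the same window holds a third as many premium prompts during peak hours as it would at the base rate.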
With a Claude Code plan, can you generate as many tokens with Opus as you can with Haiku before filling your 5-hour window? The same thing is going on here.
How are you using it? I have the lite plan and I've only ever maxed my weekly usage a few hours before reset. I will concede that I'm not a super heavy LLM user but it's been really good for me.
My workflow is usually:
- read file. I want to achieve X, how do? Do not implement anything.
- I would do a, b and c
- sketch a brief implementation of your suggestion
- <code> (not writing files yet)
- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?
- <code>
- nice, implement this
- starts writing files, run tests, etc.
Try pointing it to a small codebase, or even ask it to conjure information found online.
You'll see that it quickly gives up. Thing is, they seem to count cache hits as if they were non-cached tokens.
I won't be subscribing again, that's for sure. I am not paying iPhone money for a Xiaomi.
That's what I've been doing. I use crush normally. While the codebases are by no means huge, they're not tiny either.
Are you using it in an agentic workflow? Just reading the codebase will consume a lot of cached tokens, but z.ai seemingly counts these as normal input tokens for rate limiting purposes.
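To illustrate why this would matter so much for agentic use, here is a toy model of an agent loop that re-sends the whole conversation every turn. All the numbers (20,000 tokens of codebase context, 500 new tokens per turn, 10 turns, a 10% weight for cached tokens) are hypothetical, and whether z.ai really meters cache hits at full weight is only a guess from observed behaviour, not something confirmed.

```python
def metered_tokens(
    context_tokens: int,
    new_tokens_per_turn: int,
    turns: int,
    cached_weight_percent: int,
) -> int:
    """Total input tokens counted against the limit over an agent session.

    Every turn re-sends the full history (served from cache) plus some new
    tokens. cached_weight_percent is how heavily cached tokens are counted:
    100 means the same as fresh input, 10 means a tenth.
    """
    total = 0
    cached = context_tokens
    for _ in range(turns):
        total += cached * cached_weight_percent // 100 + new_tokens_per_turn
        cached += new_tokens_per_turn  # new tokens become history next turn
    return total


# Hypothetical session: 20k tokens of context, 500 new tokens per turn, 10 turns.
full_weight = metered_tokens(20_000, 500, 10, cached_weight_percent=100)
discounted = metered_tokens(20_000, 500, 10, cached_weight_percent=10)
print(full_weight, discounted)
```

With these made-up figures, counting cache hits at full weight meters roughly eight times as many tokens as the discounted case, which would fit with hitting the limit after a handful of prompts.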
I'm not entirely sure what an agentic workflow could mean today but I think so. I use a coding agent (crush), prompt it to brainstorm an implementation with me (or sometimes I know exactly how I want to implement it but ask it to challenge it), and correct any wrong assumptions or ask for the implementation to look different from what it suggested if I don't like it. Then finally, when I'm positive I've cleared up the most important assumptions, I ask it to actually write and edit files, run tests and such (this just ends up being an "implement this").
With any model I've tried I've found it to be a huge pain to have it fix things where it made a wrong assumption without the code becoming a mess and burning a lot of tokens. I'm aware that not everyone works like this but I'm still very opinionated on what the end result should look like so I can still work on it without an LLM.