You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models for a bit to see if you would actually use them before dropping a lot on local hardware. A 128 gig MacBook Pro isn’t going to get you an amazing model, and certainly not amazing speed. GLM 5.2 wants something like 350+ gigs at fp4 iirc.

I ran glm 5.2 on rented 8x h200 it could only do 2x concurrency at a cost of $40 an hour. It felt great but dang I wish it was cheaper... It needs 750 at fp8

what was the concurrency limitation? that node should be able to support a lot more

> You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models

You don't even need to go that far. For example, with Exoscale Dedicated Inference[1] you just point it at the Hugging Face for the model and quantisation you want to test and it automagically spits out an OpenAI-compatible API endpoint.

[1] https://www.exoscale.com/ai-cloud-infrastructure/dedicated-i...

(I have no relationship with Exoscale, this particular product just crossed my radar recently)

I think they're just suggesting renting as a way to test that the hardware they're considering purchasing would actually be able to do what they need.

> I think they're just suggesting renting as a way to test

Well, yes, I understood that.

Which is why I started with the words "You don't even need to go that far.".

To re-phrase what I said in clearer terms:

Instead of renting an instance, then messing around with configuring Linux and whatever via SSH or Ansible or whatever. Just point a Hugging Face link at this magic service and get a ready-to-go API back. Enabling you to test your desired model spec with minimum fuss.

Ultimately the guy wants his own hardware. So why waste time messing around with someone else's VM if you just want to test a specific model spec. That is the TL;DR.