When I experimented with this a few years back, a true NxN room (every participant connected directly to every other) would cap out at around 8 people on PCs and 4 on mobile. The bottleneck is encoding and decoding the video. For larger rooms you need a server to route the video to all recipients; this is called an SFU (Selective Forwarding Unit). With an SFU you can have hundreds of participants, but not everyone can speak or be seen at once.
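To make the scaling difference concrete, here is a rough back-of-the-envelope sketch in Python. It assumes each peer connection in a mesh runs its own encoder (typical browser behaviour, but an assumption here), and that an SFU room caps how many remote videos each client receives. The function names and the `max_visible` parameter are made up for illustration, not taken from any library.

```python
def mesh_load(n: int) -> dict:
    """Per-client video work in a full NxN mesh of n participants."""
    others = max(n - 1, 0)
    return {
        "encodes_per_client": others,  # one outgoing stream per remote peer
        "decodes_per_client": others,  # one incoming stream per remote peer
    }


def sfu_load(n: int, max_visible: int) -> dict:
    """Per-client video work when an SFU forwards streams.

    max_visible is a hypothetical cap on how many remote videos the
    server forwards to each client (e.g. only the active speakers).
    """
    others = max(n - 1, 0)
    return {
        "encodes_per_client": 1 if n > 0 else 0,  # single upload to the server
        "decodes_per_client": min(others, max_visible),
    }


if __name__ == "__main__":
    print(mesh_load(8))        # 7 encodes and 7 decodes per client
    print(sfu_load(200, 9))    # 1 encode and at most 9 decodes per client
```

In the mesh the per-client work grows with the room size, while with the SFU it stays flat once the visibility cap is reached, which is why not everyone can be seen at once in a large room.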
For audio-only, the sky is the limit. I used to work on a voice-based social media platform, and you need an SFU there as well, but I added a few mixing features so that multiple incoming audio streams would be mixed together into a single outgoing one. It was very fun (and scalable).
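The mixing idea can be sketched roughly like this. It is not the code I actually shipped, just a minimal illustration that assumes the streams are already decoded to 16-bit signed PCM frames at the same sample rate. The name `mix_frames` is made up for the example.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames: Sequence[Sequence[int]]) -> List[int]:
    """Mix several decoded PCM frames into one by summing samples.

    Assumes every frame holds 16-bit signed samples at the same sample
    rate. The sum is clamped to the int16 range so loud overlapping
    speakers clip instead of wrapping around.
    """
    if not frames:
        return []
    # Only mix as many samples as the shortest frame provides.
    length = min(len(frame) for frame in frames)
    mixed = []
    for i in range(length):
        total = sum(frame[i] for frame in frames)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


if __name__ == "__main__":
    speaker_a = [1000, -2000, 30000]
    speaker_b = [500, 500, 10000]
    print(mix_frames([speaker_a, speaker_b]))  # [1500, -1500, 32767]
```

A real mixer would usually also leave each listener's own voice out of the mix they receive, and would re-encode the result before sending it, but the core is just this per-sample sum.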