I only want to see how it performs on the Bullshit-benchmark https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
GPT is not even close to Claude in terms of responding to BS.
My current hunch is that this benchmark captures most of the relevant gap between Anthropic and the rest. "Can't distinguish truth from fiction" has long been one of the deeper complaints about LLMs, and the bullshit benchmark seems like a clever approach to testing at least some of that.