If it's trained on street view data, it's not unlikely that the model can associate a particular piece of context with street view. For example, a picture can carry telltale signs of street view content, such as blurred faces and street signs, watermarks, etc.

Even if it's not directly trained on street view data, it has probably encountered street view content in its training dataset.

The training process doesn't preserve the information the LLM would need to infer that. Its answer can't be anything other than plausible-sounding nonsense, which is what these models do best.

I think the test the OP performed (picking a random street view and letting the model pinpoint it) would indicate that it has ingested some kind of geographic information in a structured manner.
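For what it's worth, a minimal sketch of how such a test could be scored, assuming you already have the model's guessed coordinates and the true street view coordinates (the model-query step itself is left out, and the sample coordinate pairs below are purely illustrative, not real results):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical (model guess, ground truth) pairs for a handful of test images.
results = [
    ((48.8589, 2.3200), (48.8566, 2.3522)),      # central Paris
    ((35.6812, 139.7671), (35.6586, 139.7454)),  # central Tokyo
]

for guess, truth in results:
    err = haversine_km(*guess, *truth)
    print(f"guess {guess} vs truth {truth}: off by {err:.1f} km")
```

If the errors are consistently on the order of kilometres rather than hundreds or thousands of kilometres across many random locations, that's hard to explain as plausible-sounding guessing alone.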