This reminds me of how trajectory prediction networks for autonomous driving used to use a CNN to encode scene context (from map and object detection rasters) until vectornet showed up