Anthropic have done some great work on neural interpretability that gets at the core of this problem.