Applied Mechanistic Interpretability in Legal Contexts — We are developing novel approaches to identifying and mitigating racial bias in LLMs. Using model pruning, a technique that selectively removes components of a model's internal structure, we deactivated or removed specific neurons identified as contributing to biased behavior.
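To make the pruning step concrete, here is a minimal sketch of how neuron-level deactivation can be implemented with PyTorch forward hooks. The model name, layer indices, and neuron indices are hypothetical placeholders for illustration, not the ones identified in our work.

```python
# Minimal sketch (not the study's exact code) of deactivating specific MLP
# neurons in a causal LM via PyTorch forward hooks. Model name, layer indices,
# and neuron indices below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any GPT-2-style causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Suppose a prior bias analysis flagged these MLP neurons:
# {layer index: [neuron indices within that layer's MLP hidden state]}
biased_neurons = {3: [17, 402], 7: [88]}  # hypothetical indices

def make_ablation_hook(neuron_idx):
    """Return a forward hook that zeroes out the flagged neurons."""
    def hook(module, inputs, output):
        output[..., neuron_idx] = 0.0  # silence these neurons downstream
        return output
    return hook

handles = []
for layer_idx, idx in biased_neurons.items():
    # In GPT-2, mlp.c_fc produces each block's MLP hidden activations.
    mlp_hidden = model.transformer.h[layer_idx].mlp.c_fc
    handles.append(mlp_hidden.register_forward_hook(make_ablation_hook(idx)))

# The pruned model is then used exactly like the original one.
prompt = "The loan applicant was described as"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:  # remove hooks to restore the unpruned model
    h.remove()
```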
Breaking Down Bias paper: We identify racially biased neurons in language models and then carve them out. The resulting model displays reduced or no bias while remaining virtually as useful as the unpruned model. We also show that racial bias is not always located in the same parts of a model: different neurons induce racial bias depending on the context (e.g. purchasing an item, financial decision making, and so on). From a legal perspective, the results suggest that there may be limits to holding model developers (like OpenAI) liable for harmful behavior, given that no one-size-fits-all fix is available at the development stage. Instead, it may be more effective to hold the companies deploying the models accountable for how they are used in specific contexts (e.g. a credit card company using OpenAI's models to assist with restaurant recommendations), rather than expecting model developers to address all potential harms in advance.
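As a rough illustration of how context-dependent bias localization might be probed, the sketch below compares MLP activations on counterfactual prompt pairs that differ only in a racial attribute, grouped by decision context. The prompts, model, and thresholding rule are assumptions made for illustration, not the paper's actual procedure.

```python
# Rough sketch, under illustrative assumptions, of locating context-dependent
# bias: compare MLP activations on counterfactual prompt pairs, per context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Counterfactual prompt pairs grouped by context (illustrative examples only).
contexts = {
    "purchasing": ("The white customer asked to buy", "The Black customer asked to buy"),
    "finance": ("The white applicant requested a loan", "The Black applicant requested a loan"),
}

def mlp_activations(text):
    """Return per-layer MLP activations, mean-pooled over tokens."""
    acts = []
    hooks = [
        block.mlp.c_fc.register_forward_hook(
            lambda mod, inp, out, store=acts: store.append(out.mean(dim=1).squeeze(0))
        )
        for block in model.transformer.h
    ]
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return torch.stack(acts)  # shape: (n_layers, n_neurons)

# Flag neurons whose activations shift most between the paired prompts; the
# flagged sets can then be compared across contexts to see whether different
# neurons drive bias in different settings.
for name, (prompt_a, prompt_b) in contexts.items():
    diff = (mlp_activations(prompt_a) - mlp_activations(prompt_b)).abs()
    flagged = torch.nonzero(diff > diff.mean() + 3 * diff.std(), as_tuple=False)
    print(name, [tuple(x.tolist()) for x in flagged[:5]])  # (layer, neuron) pairs
```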