Publication

Human-Machine Legal Preference Benchmarking

Overview

An Experimental Pilot Evaluating the Comparative Performance of LLMs, Enterprise Tools, and Humans on Legal Tasks

Abstract

This study introduces an approach to evaluating Large Language Models (LLMs) on legal work through lawyer preference ratings rather than traditional benchmarks. Using a pregnancy discrimination case, we compared outputs from foundation models, enterprise legal tools, and human lawyers across four litigation tasks. Senior partners conducted blind reviews of the outputs. The results lent empirical weight to long-held assumptions: enterprise tools demonstrated remarkable consistency, while both LLMs and human lawyers showed striking performance swings. More importantly, the study reveals that traditional metrics may be fundamentally misaligned with what experienced lawyers actually value in legal work. The findings challenge existing benchmarking approaches and suggest that comprehensiveness may be less important than clarity and precision in legal work products.
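
As a rough illustration of how blind preference reviews of this kind might be aggregated, the sketch below computes per-system mean ratings and head-to-head win rates from reviewer scores. The reviewer names, task labels, 1-5 scale, and scores are hypothetical assumptions for illustration only and do not reflect the study's actual instrument or results.

    # Minimal sketch (hypothetical data): aggregating blind preference ratings.
    from collections import defaultdict
    from itertools import combinations

    # Each record: (reviewer, task, system, score on an assumed 1-5 preference scale)
    ratings = [
        ("reviewer_1", "demand_letter", "foundation_model", 4),
        ("reviewer_1", "demand_letter", "enterprise_tool", 4),
        ("reviewer_1", "demand_letter", "human_lawyer", 5),
        ("reviewer_2", "demand_letter", "foundation_model", 2),
        ("reviewer_2", "demand_letter", "enterprise_tool", 4),
        ("reviewer_2", "demand_letter", "human_lawyer", 3),
    ]

    # Mean score per system across all blind reviews.
    totals = defaultdict(list)
    for _, _, system, score in ratings:
        totals[system].append(score)
    means = {s: sum(v) / len(v) for s, v in totals.items()}

    # Head-to-head win rate: within each (reviewer, task) review, compare systems pairwise.
    by_review = defaultdict(dict)
    for reviewer, task, system, score in ratings:
        by_review[(reviewer, task)][system] = score

    wins = defaultdict(lambda: [0, 0])  # system -> [wins, comparisons]
    for scores in by_review.values():
        for a, b in combinations(scores, 2):
            if scores[a] == scores[b]:
                continue  # ties excluded from win-rate comparisons
            winner = a if scores[a] > scores[b] else b
            for s in (a, b):
                wins[s][1] += 1
            wins[winner][0] += 1

    for system, mean in sorted(means.items(), key=lambda kv: -kv[1]):
        w, n = wins[system]
        rate = w / n if n else 0.0
        print(f"{system}: mean={mean:.2f}, head-to-head win rate={rate:.0%}")

A pairwise win-rate summary like this is one simple way to rank systems from blind preference data; more elaborate choices (e.g., Bradley-Terry-style models) are possible but are not implied by the abstract above.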