AI-Powered Lawyering: AI Reasoning Models, RAG, and the Future of Legal Practice
Original Research Article
Journal of Law and Empirical Analysis 2026, Vol. 3(1) 220–250
© The Author(s) 2026
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/2755323X261427048
journals.sagepub.com/home/lex

AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice

Daniel Schwarcz1,*, Sam Manning2,*, J.J. Prescott3,*, Patrick Barry3, David R. Cleveland1, and Beverly Rich4
Abstract
Generative AI is set to transform the legal profession, though its most promising uses and ultimate effects are still unclear. While AI models like GPT-4 improve efficiency, they can also “hallucinate” and may undermine legal judgment, particularly in complex tasks typically handled by skilled lawyers. This article examines two emerging AI innovations that may mitigate these concerns: Retrieval Augmented Generation (RAG), which grounds AI-powered analysis in legal sources, and AI reasoning models, which structure complex reasoning before generating output. We conduct the first randomized controlled trial assessing these technologies, assigning upper-level law students to complete legal tasks using a RAG-powered legal AI tool (Vincent AI, 2024), an AI reasoning model (OpenAI’s o1-preview), or no AI. We find that both AI tools significantly enhance legal work quality, a marked contrast with previous research examining older large language models like GPT-4. Moreover, these newer models appear to maintain the efficiency benefits associated with older AI technologies. Our findings also show that these AI tools significantly boost productivity in five out of six tested legal tasks, with statistically significant gains of anywhere from 50% to 130%. They perform exceptionally well in complex tasks like drafting persuasive letters and analyzing complaints. Notably, o1-preview improves the analytical depth of work product and Vincent AI avoids introducing more hallucinations, suggesting that integrating domain-specific RAG capabilities with reasoning models could yield even larger improvements.

Keywords
artificial intelligence, lawyer productivity, lawyering, randomized controlled trial, reasoning models, retrieval augmented generation
1 University of Minnesota Twin Cities, Minneapolis, MN, USA
2 GovAI, Washington, DC, USA
3 University of Michigan, Ann Arbor, MI, USA
4 University of Southern California Gould School of Law, CA, USA
* Schwarcz, Manning, and Prescott are co-first authors.
Corresponding Author: Daniel Schwarcz, University of Minnesota Twin Cities, 229 19th Av S, Minneapolis, MN 55455, USA. Email: schwarcz@umn.edu
Creative Commons Non Commercial CC BY-NC: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

1. Introduction
Generative AI is poised to transform the legal profession in the coming years. Yet the scope and nature of this transformation remain uncertain. Some legal technology enthusiasts foresee a fundamental restructuring of the industry, where AI automates countless legal tasks and even replaces certain types of lawyers entirely (Brescia, 2024; Susskind & Susskind, 2023). Skeptics, however, argue that while AI may streamline certain aspects of legal work, it is unlikely to alter the core nature of lawyering (Armour et al., 2022). In this article, we bring new empirical evidence to bear on these competing claims. Using data from the first randomized controlled trial to test how next-generation AI tools affect the way lawyers perform core legal tasks, we examine the impact of these systems on the quality, efficiency, and productivity of written legal work. Our findings offer one of the first systematic glimpses into how these emerging technologies are likely to shape the organization and substance of lawyering in practice.
The stakes of the debate about the implications of AI for legal practice are substantial and far-reaching.
Billions of dollars are pouring into AI-driven legal startups, and industry giants like Westlaw and LexisNexis are racing to integrate AI into their existing platforms (LexisNexis, 2025; Thomson Reuters, 2025). Across the profession, lawyers, from Big Law partners to legal aid attorneys, are grappling with how best to incorporate AI into their work (Beioley & Criddle, 2023; Chien et al., 2025; Kim & Chien, 2025; Garg & Ma, 2025). Even judges are exploring ways AI may help them adjudicate cases and draft opinions (Arbel & Hoffman, 2024; Liu & Li, 2024; Re, 2024). Meanwhile, law schools face growing uncertainty about how to prepare students for the profession’s increasingly uncertain future (Bliss, 2024; Head & Willis, 2024). As this conversation unfolds, empirical research has begun to investigate AI’s likely impact on legal practice. Most prior work, however, focuses on AI benchmarking, which evaluates how AI outputs compare both across models and to human outputs (Guha et al., 2023; Vals.AI, 2025). But for most lawyers, the more relevant question is how human lawyers with access to AI tools perform compared to human lawyers who do not have access to these tools. Although a handful of studies concentrate on this question, a salient shortcoming of this existing research has been its use of older AI models, such as ChatGPT-3.5 and GPT-4 (Choi et al., 2024; Choi & Schwarcz, 2025; Nielsen et al., 2024). Because these models have only a modest ability to break down analytically complex tasks or draw from the most relevant and accurate legal source materials, their usefulness to practicing lawyers engaged in sophisticated legal work may be limited. By contrast, two emerging technologies have the potential to significantly enhance AI’s capacity to facilitate human legal work by improving reasoning capabilities and grounding outputs in authoritative legal materials.
The first is Retrieval Augmented Generation (RAG), a technique that integrates generative AI with trustworthy, authentic legal source materials. Unlike traditional models that rely solely on their training data to answer prompts, legal AI systems with RAG capabilities can retrieve relevant legal texts—such as case law, statutes, and regulations—before generating output. Perhaps even more importantly, RAG makes it easier for humans to check an AI’s work product by consulting the underlying sources on which the software relies to generate an answer. The second major advance relevant to the legal profession is a new class of generative AI language models known as “reasoning models.” Developers explicitly design these models—unlike earlier AI chatbots—to draw on additional computational resources at the point of use, planning responses before generating them, much like a human taking longer to think and outline their thoughts before answering a complex question. To better understand AI’s impact on the future of lawyering, and especially the potential of these two types of cutting-edge innovations, we conduct a randomized controlled trial to determine how RAG and reasoning models affect lawyering and legal work. Our study evaluates the performance of 137 law students from the University of Minnesota and the University of Michigan on various legal tasks with or without access to AI tools. We asked each participant to complete six realistic legal tasks developed in collaboration with practicing lawyers. For two tasks, participants received no AI assistance; for two, they had access to the mid-2024 version of Vincent AI, a leading tool that integrates RAG and automated prompting assistance; and for the remaining two, they worked with an early AI reasoning model (o1-preview). We randomized the assignment of AI tools and control conditions to ensure a balanced distribution across participants. 
In our empirical work, we test the overarching hypothesis that giving participants access to these kinds of tools will enhance both the quality of legal work product and the efficiency of task completion. For Vincent AI, we expect quality gains to come primarily from improved accuracy and greater professionalism, reflecting the system’s ability to draw directly from relevant source materials. By contrast, we anticipate that o1-preview may strengthen the analytical rigor of lawyering, owing to the model’s superior capacity (relative to GPT-4) to structure, refine, and organize output. Our results largely confirm these hypotheses. The findings reveal that access to both Vincent AI and o1-preview leads to substantial, statistically significant improvements in the speed of lawyering. Also, with respect to at least four of the six tasks, access to AI tools considerably improves the quality of legal work product. While the speed-related gains are comparable in magnitude to those observed in prior research examining the impact of GPT-4 on lawyering, the quality enhancements of AI-augmented legal work product mark a significant departure from the findings of earlier studies. We also observe variation in how, and the extent to which, these two AI tools enhance the quality of legal work product. This variation largely, but not entirely, matches our hypotheses. For Vincent AI, quality improvements are primarily seen in the clarity, organization, and professionalism of the work. The tool’s impact on accuracy, however, is mixed. On the one hand, we find no evidence that overall accuracy scores (which depend on whether work product includes and properly characterizes the most relevant legal authorities and facts) significantly improve with access to Vincent AI, aside from a single task-specific gain with respect to analyzing a complaint.
On the other hand, Vincent AI-supported work appears to contain fewer “hallucinated” citations to nonexistent source materials (3 total) than work produced using o1-preview (11 total). For o1-preview, we find stronger and more widespread improvements in the quality of legal work compared to Vincent AI. Most notably, in addition to enhancing clarity, organization, and professionalism, o1-preview produces substantial improvements in the strength of legal analysis (reflecting logical coherence and nuanced reasoning), especially in the two most analytically demanding tasks we test. Our findings suggest that the value of access to AI tools may depend on the nature of the legal work. Improvements in quality from the two AI tools are concentrated in litigation-oriented tasks; these gains do not appear to extend to the one transactionally oriented task we evaluate, which involved drafting a short contract. We also document through post-experiment surveys that most participants feel their experience with the two AI tools in the study increases their likelihood of using similar tools in the future. Many also report gaining proficiency in using these tools over the course of the experiment. This positive subjective experience from using the two tools is particularly pronounced for Vincent AI relative to o1-preview. The implications of our findings for Vincent AI and o1-preview are each independently significant. Considered together, however, they are even more noteworthy: these two AI technologies enhance legal work in distinct yet complementary ways. For Vincent AI, the primary such mechanism is RAG. Notably, however, Vincent AI 2024 uses an ensemble of non-reasoning OpenAI models, such as GPT-4 and GPT-4o, as its core foundation models. By contrast, o1-preview offers technological improvements to foundation models, which can be integrated into legal AI tools like Vincent AI. Moreover, o1-preview was the first reasoning model to be made publicly available.
Many new generations have since improved on predecessors (Wiggers, 2025). These facts together suggest that AI models are already enhancing lawyering in ways that extend beyond the effects we report in this article.
2. Background
OpenAI’s public release of the Large Language Model (LLM) ChatGPT in late 2022 marked a pivotal moment in the development of AI and in expectations about its future impact. Although ChatGPT’s core design was not entirely novel (like earlier chatbots, it generated text by predicting the next token in a sequence), the technology reshaped the AI landscape with its remarkable ability to produce high-quality responses across diverse types of queries. ChatGPT’s performance stemmed from several key advances. First, the model’s size was much larger than prior LLMs, growing from 117 million parameters in early iterations to 175 billion in later versions (OpenAI, 2022). Second, ChatGPT’s training included Reinforcement Learning from Human Feedback, a fine-tuning method that uses human evaluations to align outputs with user intent (Choi et al., 2022). After ChatGPT’s public release, commentators worldwide began speculating about whether the underlying technology could revolutionize legal practice (Susskind & Susskind, 2023). This excitement grew with reliable indications that early versions of ChatGPT could achieve passing, albeit low, grades on a range of law school exams simply by processing the exam text (Choi et al., 2022; Nay et al., 2023). Further research indicated that more advanced AI models could attain higher scores on law school exams and even pass the bar exam, heightening expectations that AI would significantly reshape the legal profession (Alimardini, 2024; Katz et al., 2024).1 Although this early work highlights the capacity of AI-produced output to score well on legal exams, the more pressing set of empirical questions relates to how access to generative AI tools will affect legal practice by attorneys, especially in light of the broad consensus that ethical and practical considerations necessitate some human involvement in legal services (Browning, 2023; Pierce & Goutos, 2024; Wendel, 2019; Yamane, 2020).
Throughout late 2023 and 2024, several studies began to explore this issue, finding preliminary evidence to suggest that tools like GPT-4 could significantly enhance lawyers’ speed for certain legal tasks, but limited evidence that these tools could consistently improve the quality of legal work (Choi et al., 2024; Choi & Schwarcz, 2025; Kim & Chien, 2025). Evidence also emerged that GPT-4 is vulnerable to producing hallucinations when used for legal research (Dahl et al., 2024).2 By contrast, other contemporary research focusing on nonlegal writing tasks provided support for the idea that giving humans access to ChatGPT could enhance their performance (Noy & Zhang, 2023). In short order, several key innovations in AI and legal technology sparked renewed enthusiasm about the technology’s potential to enhance the practice of law. First, several companies introduced specialized Retrieval Augmented Generation (RAG) AI systems for lawyering tasks. RAG systems integrate LLMs with legal search engines and document retrieval systems (Lewis et al., 2020), enabling these tools to respond to queries based on authoritative legal materials. Widely touted for its potential to minimize or even eliminate hallucinations (Ju, 2024; LexisNexis, 2024), RAG also enhances transparency, allowing users to verify LLM responses using underlying sources (Grupen & Pereyra, 2024).3 To date, however, little empirical evidence exists on the impact of RAG-enabled legal technology on human lawyering. One study suggests that RAG-enabled legal AI tools can and do hallucinate (Magesh et al., 2024).4 But the study assesses only the capabilities of legal research tools in isolation, without human involvement. Another recent study finds that many RAG-enabled legal tools outperform human lawyers at basic tasks, such as document extraction, summarization, and transcript analysis, but perform worse than human lawyers on more complex tasks (Vals.AI, 2025). 
But this study too compares AI-produced output with human output, not the work of humans with AI-tool access to the work of humans without access to AI tools. The second major AI development relevant to lawyering is the rise in mid-2024 of a new class of “reasoning models,” designed specifically to handle complex logical and analytical tasks. OpenAI introduced the first such model with o1-preview; since then, both OpenAI and other AI developers, including Google and DeepSeek, have released more advanced reasoning models (Wiggers, 2025). Reasoning models mark a significant departure from earlier LLMs like ChatGPT-3.5 and GPT-4 by allocating more compute at the time of inference, allowing them to process prompts step-by-step in ways earlier models did not. By constructing an internal chain of reasoning, these models continuously reevaluate initial output to refine the answers they ultimately produce. A large-scale reinforcement learning algorithm implemented during training further enhances this ability, optimizing how the model evaluates and adjusts its reasoning. This shift is significant enough that, to highlight the distinction, OpenAI introduced an entirely new naming convention for its first reasoning model: “o1” (OpenAI, 2024). Early evidence from other domains demonstrates that these reasoning models can outperform their predecessors in complex tasks across fields such as mathematics, coding, and medical diagnosis (Brodeur et al., 2024).5 However, there is limited evidence regarding how giving attorneys access to tools relying on these models may affect human performance on legal tasks. We offer the first systematic test of that possibility.
3. Methodology
We use a randomized controlled trial to assess the potential impact of emerging AI reasoning models and specialized legal AI platforms on the future of lawyering. We focus on the two leading generative AI models as of late 2024. The first, o1-preview, is a general-purpose AI reasoning model released by OpenAI in September 2024. The second, VLex’s Vincent AI, is a specialized AI tool for lawyers that uses RAG to facilitate the work of lawyers. At the time of the study, Vincent AI used an ensemble of non-reasoning models, including GPT-4 and GPT-4o, as its underlying foundation models. We began recruitment for the experiment in September 2024 at the University of Minnesota Law School and the University of Michigan Law School. We sent recruitment emails to all second- and third-year law students, as well as Master of Laws (LL.M.) students, at these institutions. These emails had the subject line “U-M Research Opportunity: $300 to Experiment with AI Tools.” More than 250 students expressed interest in participating. Of these students, 153 formally enrolled in the study, 125 students completed all tasks, and 137 completed at least one task. (Because some students completed only a subset of the six tasks, the number of observations varies by analysis, and certain regressions include fewer than the full 137 participants.) During the enrollment process, we collected basic demographic and academic information about participants, including their class year, first-year law school GPA (for second- and third-year law students), and their prior use of generative AI tools in the three months prior to enrollment. After participants formally enrolled, we randomly assigned them to one of three groups of equal size. We confirmed balance across these groups with respect to first-year GPA, student status, law school affiliation, and prior AI use. These group assignments were stable throughout the experiment. We present summary statistics for the participants who completed the study, broken down by group assignments, in Table 1.

Table 1. Baseline Balance Across Task–AI Assignment Groups

| Group size | Group A | Group B | Group C | Total (N) |
|---|---|---|---|---|
| Participants (N) | 47 | 45 | 45 | 137 |
| GPA (Mean) | 3.275 | 3.330 | 3.323 | 120 |
| Missing (%) | 12.8% | 11.1% | 13.3% | |
| F-test p-value for GPA: 0.731 | | | | |
| Student type (Proportions) | | | | |
| 2L student | 46.8% | 37.8% | 46.7% | 60 |
| 3L student | 40.4% | 51.1% | 40.0% | 60 |
| LL.M. student | 12.8% | 11.1% | 13.3% | 17 |
| Missing (%) | 0.0% | 0.0% | 0.0% | |
| Chi-sq p-value for student type: 0.832 | | | | |
| Prior AI use (Proportions) | | | | |
| 0 times | 19.1% | 13.3% | 17.8% | 23 |
| 1–5 times | 38.3% | 33.3% | 33.3% | 48 |
| 6–10 times | 19.1% | 26.7% | 15.6% | 28 |
| 11–20 times | 6.4% | 4.4% | 8.9% | 9 |
| More than 20 times | 17.0% | 22.2% | 24.4% | 29 |
| Missing (%) | 0.0% | 0.0% | 0.0% | |
| Chi-sq p-value for AI use: 0.903 | | | | |
| School (Proportions) | | | | |
| University of Michigan | 38.3% | 40.0% | 33.3% | 51 |
| University of Minnesota | 61.7% | 60.0% | 66.7% | 86 |
| Missing (%) | 0.0% | 0.0% | 0.0% | |
| Chi-sq p-value for school: 0.793 | | | | |

Notes. This table reports baseline participant characteristics for Groups A, B, and C. Groups differ only in how the six legal tasks are paired with AI conditions; all participants complete tasks under all three conditions (No AI, o1-preview, and Vincent AI). Reported means and proportions therefore reflect participant-level characteristics rather than differential exposure to AI conditions. Standard deviations are shown in parentheses for continuous variables. “Missing (%)” reports the share of participants within each group with missing data for the corresponding variable; first-year GPA is unavailable for LL.M. students, so missing GPA values reflect program enrollment rather than attrition or nonresponse. P-values are from F-tests of equality of means for continuous variables and chi-square tests of independence for categorical variables. N denotes the number of participants in each group.
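As a rough illustration of the balance checks reported in Table 1, the sketch below recomputes the chi-square test of independence for law school affiliation. The cell counts are an assumption: they are back-calculated from the reported group sizes (47, 45, 45) and percentages, not taken from the authors' data, and the helper name `chi_square_independence` is hypothetical. A 2 × 3 table has two degrees of freedom, and in that special case the chi-square upper-tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# ASSUMPTION: counts reconstructed from Table 1 percentages and group sizes.
school_counts = [
    [18, 18, 15],  # University of Michigan: Groups A, B, C
    [29, 27, 30],  # University of Minnesota: Groups A, B, C
]

stat, dof = chi_square_independence(school_counts)
# Valid only because dof == 2: P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"chi-square = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

Under these reconstructed counts, the p-value comes out close to the 0.793 reported for school in Table 1. For tables with other degrees of freedom, the closed form above does not apply and a statistics library would be the more sensible choice.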
Study participants completed the experiment remotely from October 1, 2024, to October 31, 2024, using a Canvas interface. The study began with all participants completing three online training modules. These modules included both general training on the use of AI models for legal work and training specifically tailored to Vincent AI. We then presented all participants with six lawyering tasks, each with task-specific instructions regarding the use of generative AI. For instance, for Task One, we prohibited participants in Group A from using any generative AI, required Group B participants to use only o1-preview, and required Group C participants to use only Vincent AI. These instructions varied systematically across groups and tasks, ensuring that each participant completed two tasks without AI, two tasks using o1-preview, and two tasks using Vincent AI. To ensure that the six tasks reflected realistic scenarios typically assigned to first- or second-year law firm associates, we developed all of them in collaboration with one or more practicing attorneys. We set task time limits using guidance from practicing lawyers regarding the amount of time they expected a junior associate would typically need to complete each task. These tasks, along with their respective time limits, are as follows:
· Task One: Draft an email for a client (60-min time limit).
· Task Two: Draft a legal memo for a partner (240-min time limit).
· Task Three: Analyze a complaint and draft a written analysis (120-min time limit).
· Task Four: Draft a nondisclosure agreement (NDA) for a client (180-min time limit).
· Task Five: Draft a motion to consolidate (150-min time limit).
· Task Six: Draft a persuasive letter addressing the enforceability of a covenant not to compete (150-min time limit).
In designing our experiments, we sought to create tasks that meaningfully differed in the lawyering skills they required.
Task One requires participants to explain a relatively straightforward set of legal precedents in plain language for a non-lawyer client. Task Two, by contrast, expects them to grapple with a complex contract-interpretation issue and to compare competing case law across jurisdictions. Task Three asks participants to identify key features of a longer document (a complaint), evaluate the elements of the claim, and consider both legal and non-legal factors in formulating a response strategy. Task Four, the most distinct, requires participants to complete a transactional task by adapting a template document to a new context. Task Five asks participants not only to develop an argument but also to present it in the form of a motion that would meet a court’s formal requirements. Finally, Task Six requires participants to draft a document akin to a legal opinion by evaluating the relative strengths of competing arguments. By structuring the tasks to test different facets of legal practice, we aim to assess whether particular types of AI tools are better suited to assisting with some types of legal work than others.
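The counterbalanced design described above, in which every participant completes two tasks under each condition, can be sketched as a cyclic rotation. This is only an illustration: the text specifies the assignment for Task One alone (Group A without AI, Group B with o1-preview, Group C with Vincent AI), so the rotation used here for Tasks Two through Six, along with the names `CONDITIONS`, `GROUPS`, `TASKS`, and `assign_conditions`, is a hypothetical assumption rather than the study's actual schedule.

```python
from collections import Counter

CONDITIONS = ["No AI", "o1-preview", "Vincent AI"]
GROUPS = ["A", "B", "C"]
TASKS = ["One", "Two", "Three", "Four", "Five", "Six"]

def assign_conditions():
    """Return {group: {task: condition}} using a cyclic, Latin-square-style rotation.

    ASSUMPTION: only the Task One column matches the article's description;
    the pairing for the remaining tasks is hypothetical.
    """
    schedule = {}
    for g, group in enumerate(GROUPS):
        schedule[group] = {
            task: CONDITIONS[(g + t) % len(CONDITIONS)]
            for t, task in enumerate(TASKS)
        }
    return schedule

schedule = assign_conditions()
for group in GROUPS:
    counts = Counter(schedule[group].values())
    print(f"Group {group}: {schedule[group]} -> {dict(counts)}")
```

A rotation of this kind guarantees two properties the design relies on: each group sees every condition exactly twice, and each task is completed under all three conditions across the groups.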
Appendix B contains the text of all assigned tasks.10 We instructed all participants to complete the six tasks in order and to report the amount of time they spent working on each task. Three co-authors evaluated all work product, with each grader responsible for evaluating two tasks. We used a blind grading process to ensure that graders were unaware of the experimental condition and participant identity and characteristics. All three graders have legal practice experience and were uninvolved in data collection or analysis. Before the experiment began, the three grading co-authors collaboratively developed standardized grading rubrics to ensure consistency of grading across tasks. We measure five core attributes of quality legal work across all six tasks: Accuracy (the precision and usefulness of the research), Analysis (the depth and insightfulness of the reasoning), Organization (the structure of the work product), Clarity (the quality and persuasiveness of the writing), and Professionalism (the extent to which directions were followed). We use a standard 1–7 scale to assess performance on each attribute for each task. Each grader adapted this general rubric to create a version tailored for each of the two tasks they were assigned to grade. For example, the rubric for Accuracy in Task Two (Draft Legal Memo) lists the precise legal authorities that should be identified, their key holdings, the relevant insurance policy language to be highlighted, and the critical facts that should be incorporated into the analysis. Similar refinements for each rubric reflect the task’s distinctive doctrinal and practical demands. Each task rubric also includes a separate binary metric for whether any cited sources or assertions appear to be hallucinated, either because they do not exist or because their descriptions are entirely inaccurate. Appendix C contains the final grading rubrics for all tasks.
We use participants’ self-reported time spent on each assigned task, together with each task’s time limit, to construct measures of efficiency and task-level productivity that are comparable across heterogeneous forms of legal work. We define efficiency as the fraction of the task’s allotted time spent completing the task, so that lower values of this measure correspond to faster task completion. Combining this efficiency measure with participants’ quality attribute scores, we define task-level productivity as the sum of the five quality attribute scores divided by the fraction of allotted time spent. These outcome measures allow us to examine how access to AI tools affects the speed with which legal tasks are completed and how variation in time spent interacts with the quality of work product. To evaluate treatment effects, we use an ordinary least squares (OLS) regression framework with two treatment indicator variables. Conceptually, we can write our baseline specification as follows:

$$y_i = \beta_0 + \beta_1 \mathrm{Vincent}_i + \beta_2 \mathrm{o1preview}_i + \epsilon_i, \tag{1}$$

where $y_i$ represents our outcome measures for participant $i$ (e.g., on a quality attribute, time spent, or productivity), and $\mathrm{Vincent}_i$ and $\mathrm{o1preview}_i$ indicate whether participant $i$ had permission and received encouragement to use Vincent AI or o1-preview, respectively. Estimates for $\beta_1$ and $\beta_2$ represent the average treatment effects of Vincent AI and o1-preview relative to completing tasks without access to AI assistance. Equation (1) relies on random assignment to avoid confounding factors.
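The efficiency and productivity definitions, and the logic of Equation (1), can be illustrated with a short sketch. Everything numeric here is hypothetical: the six observations, the 60-minute limit, and the function and variable names (`efficiency`, `productivity`, `group_mean`, `effects`) are made up for illustration and are not the study's data or code. The sketch relies on one standard fact: when a regression contains only an intercept and mutually exclusive treatment indicators, the OLS coefficients equal differences in group means, so no regression library is needed.

```python
from statistics import mean

def efficiency(minutes_spent, time_limit):
    """Fraction of the allotted time used; lower values mean faster completion."""
    return minutes_spent / time_limit

def productivity(scores, minutes_spent, time_limit):
    """Sum of the five 1-7 quality scores divided by the fraction of time used."""
    return sum(scores) / efficiency(minutes_spent, time_limit)

TIME_LIMIT = 60  # minutes; hypothetical single task

# HYPOTHETICAL observations: (condition, five quality scores, self-reported minutes)
observations = [
    ("No AI",      [4, 4, 5, 4, 5], 60),
    ("No AI",      [5, 4, 4, 5, 4], 48),
    ("Vincent AI", [5, 4, 6, 5, 6], 45),
    ("Vincent AI", [4, 5, 5, 6, 6], 39),
    ("o1-preview", [5, 6, 6, 5, 6], 42),
    ("o1-preview", [6, 6, 5, 6, 5], 30),
]

def group_mean(condition):
    """Mean productivity among observations in the given condition."""
    return mean(
        productivity(scores, minutes, TIME_LIMIT)
        for cond, scores, minutes in observations
        if cond == condition
    )

# With only an intercept and two exclusive dummies, OLS reduces to these means.
beta_0 = group_mean("No AI")
effects = {
    "beta_0": beta_0,
    "beta_1_vincent": group_mean("Vincent AI") - beta_0,
    "beta_2_o1preview": group_mean("o1-preview") - beta_0,
}
for name, value in effects.items():
    print(f"{name}: {value:.2f}")
```

Because productivity divides by the fraction of time used, a participant who earns the same quality scores in half the time doubles their productivity score, which is what makes the measure comparable across tasks with different time limits.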
A possible robustness check would be to expand the regression specification to account for the participant-specific information that we collected during enrollment:

$$y_i = \beta_0 + \beta_1 \mathrm{Vincent}_i + \beta_2 \mathrm{o1preview}_i + \gamma X_i + \epsilon_i, \tag{2}$$

where $X_i$ is a vector of control variables including first-year GPA (for non-LL.M.s), indicators for law school class year (2L, 3L, LL.M.), and self-reported prior AI use (during the three months prior to enrollment). We measure the effects of access to AI tools for each type of task along each outcome dimension (e.g., an attribute like Accuracy) because there are reasons to believe that access to AI tools might affect the quality of lawyering across some but not all quality dimensions and for some but not all types of legal tasks. Following Equations (1) and (2), one straightforward way to proceed would be to produce treatment estimates for each task and attribute, with approximately 125 observations per test, treating each attribute outcome for each task separately and conducting the test by separate regressions for each task/attribute. By pooling the data, however, we can also estimate treatment effects for AI access overall (across tasks and outcomes) and across each task and each outcome, and at the same time, we can account for more potential confounders, such as fixed differences across participants. For these reasons, we opt to employ a flexible pooled OLS regression framework, which allows us to account for the fact that tasks and attributes may have baseline differences and that participants contribute as many as 30 scores (6 tasks × 5 attributes) that may not be independent of each other. To analyze our experimental data, we estimate:

$$y_{ijk} = \beta_0 + \beta_1 \mathrm{Vincent}_{ijk} + \beta_2 \mathrm{o1preview}_{ijk} + \delta_i + \mu_j + \theta_k + \epsilon_{ijk}, \tag{3}$$

where $\delta_i$, $\mu_j$, and $\theta_k$ represent participant, task, and attribute fixed effects.
To elaborate, for each participant, we have performance information about six tasks and five attribute scores per task using the same 1–7 scale, which gives us 30 total observations per participant (if the participant completed all tasks). Using a multi-way fixed effects model allows us to control for participant fixed effects (to account for the possibility that unobserved participant quality or skill may explain some of our results) and to incorporate task and attribute fixed effects that absorb systematic differences across tasks and attributes. We adjust for correlated errors within participant by clustering standard errors at the participant level. In addition, we can omit the participant-level controls ($X_i$) that we consider including in Equation (2), which are constant across a participant’s 30 observations, because they are mechanically absorbed by our participant fixed effects. When we aggregate in this way, we implicitly weight all tasks and attributes equally when assessing outcomes, but this approach allows us to provide a concise summary of our findings. We begin by presenting our results from this approach below before exploring treatment effects at the attribute and task levels. As a further robustness check, we also estimate a system of seemingly unrelated regressions (SUR). SUR is helpful in our setting because each participant-task generates five related quality scores (Accuracy, Analysis, Organization, Clarity, and Professionalism) that may have unexplained components that move together, for example, when a participant experiences a task-specific shock. For instance, a misread prompt or an unexpected interruption during a task can depress all five quality scores for that task simultaneously.
A pooled OLS model with participant fixed effects cannot absorb this kind of within-task correlation because participant fixed effects capture only stable differences across participants (and the other fixed effects control for fixed differences across tasks and attributes), not shocks unique to a single participant–task combination. SUR can address this gap by estimating the five attribute equations jointly and allowing their residuals to be correlated for the same participant and task, thereby improving efficiency and testing whether our findings depend on assuming these errors are independent. In practice, our SUR estimates closely track the pooled OLS and fixed-effects results, suggesting that task-specific shocks that affect multiple quality dimensions do not materially influence our estimated treatment effects. We report full SUR results in Table A1 in Appendix A.

At the conclusion of the experiment, we also invited all participants to complete a post-experiment survey about their experiences with completing the tasks and using the AI tools to conduct legal work. We collected responses from 120 participants. At the time of the survey, participants had not yet received grades or feedback on their submitted work. We preregistered a rough outline of the experiment prior to analyzing our results; the pre-analysis plan is archived with the American Economic Association’s registry for randomized controlled trials.13
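The SUR robustness check described above can be restated compactly. The notation below is introduced here purely for illustration and does not appear in the article: $y^{(k)}_{ij}$ denotes the score of participant $i$ on task $j$ for attribute $k$, and any fixed effects or controls in the actual specification are omitted for brevity. The key feature is the second expression, which lets the five attribute residuals for the same participant and task covary.

```latex
y^{(k)}_{ij} = \beta^{(k)}_0 + \beta^{(k)}_1 \mathrm{Vincent}_{ij}
             + \beta^{(k)}_2 \mathrm{o1preview}_{ij} + \epsilon^{(k)}_{ij},
\qquad k = 1, \dots, 5,
\qquad
\mathrm{Cov}\!\left(\epsilon^{(k)}_{ij},\, \epsilon^{(k')}_{ij}\right) = \sigma_{kk'} .
```

Under this sketch, setting $\sigma_{kk'} = 0$ for all $k \neq k'$ corresponds to the independence assumption that the SUR check is meant to relax.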
4. Findings
In this section, we present our empirical results, beginning with the core causal effects of access to reasoning models or RAG systems on the quality, efficiency, and productivity of legal work. We then supplement these findings with qualitative assessments of how AI shaped the substance and structure of participants’ work product, along with an exploration of heterogeneous treatment effects across participants and descriptive evidence from a brief post-experiment survey. Taken together, these results provide a comprehensive picture of how access to tools like o1-preview and Vincent AI alters multiple dimensions of lawyering performance.
4.1 Effects of AI Access on Work Quality
We begin by presenting our quality-related findings at the highest level of aggregation. How does giving lawyers access to o1-preview or to Vincent AI affect the overall quality of their legal work? We report these findings in Table 2. Weighting all tasks and attributes equally, we find that both tools improve work-product quality by somewhere between 0.26 and 0.53 points on our seven-point scale. (At this level of aggregation, as Table 2 shows, including participant fixed effects in our specification turns out to matter very little to our estimates.) A direct test of equality indicates that the larger o1-preview estimated effect is statistically distinguishable from the Vincent AI estimated effect at the 10% level, though it falls just short of the 5% threshold: the 0.26-point difference between the two coefficients yields a t-statistic of approximately 1.9.

One obviously important question remains, however: how large are these quality improvements in legal work product in practical terms? Using the observed dispersion of the control-group scores in Table 2, we can directly standardize the treatment effects to assess the magnitudes of our findings. Relative to control-group standard deviations, access to Vincent AI improves quality by roughly 0.15 to 0.20 standard deviations, while o1-preview produces gains of approximately 0.30 to 0.35 standard deviations. Effects of this magnitude are readily observable to trained graders and are large by the standards of other intervention studies: improvements on the order of one quarter to one half of a standard deviation are typically regarded as meaningful in medical competency training (Veloski et al., 2006), active-learning reforms in higher education (Crouch & Mazur, 2001), and managerial communication interventions (DeRue et al., 2012).
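The standardization step described above is a single division, sketched here using the pooled point estimates and the control-group standard deviation reported in Table 2. The function and variable names are illustrative assumptions, not taken from the study's code.

```python
def standardized_effect(effect_points: float, control_sd: float) -> float:
    """Express a raw effect (points on the 1-7 scale) in control-group SD units."""
    if control_sd <= 0:
        raise ValueError("control_sd must be positive")
    return effect_points / control_sd


# Values as reported in Table 2: (without participant FEs, with participant FEs).
CONTROL_SD = 1.60
TABLE_2_EFFECTS = {
    "Vincent AI": (0.26, 0.27),
    "o1-preview": (0.52, 0.53),
}

standardized = {
    tool: tuple(standardized_effect(effect, CONTROL_SD) for effect in pair)
    for tool, pair in TABLE_2_EFFECTS.items()
}

for tool, (without_fe, with_fe) in standardized.items():
    print(f"{tool}: {without_fe:.2f} SD without FEs, {with_fe:.2f} SD with FEs")
```

The resulting values fall inside the ranges quoted in the text: roughly 0.15 to 0.20 SD for Vincent AI and 0.30 to 0.35 SD for o1-preview.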
Because standard deviations capture the typical spread in the quality of work product without access to AI tools (i.e., of the control group), expressing effects in standard-deviation units (SD) provides an intuitive sense of magnitude: under a normal approximation, a 0.30 to 0.35 SD improvement is roughly equivalent to moving a lawyer from the 50th percentile of writing quality to about the 62nd to 64th percentile, and a 0.20 SD improvement corresponds to a shift to roughly the 58th percentile. As we will see, when we examine specific attributes for individual tasks, standardized improvements become more pronounced, as we would expect given that effects are likely to be concentrated along particular dimensions of quality and for certain tasks, even though the estimates of these task-specific effects are naturally less precise because each estimate draws on fewer repeated observations for a single task–attribute combination. For example, in Table 9, we find evidence that o1-preview improves Professionalism on the Analysis of Complaint task by nearly a full standard deviation (0.92 SD improvement), an effect size consistent with moving a lawyer from the 50th percentile to roughly the 82nd percentile in terms of quality (a striking improvement in practical terms and entirely separate from accompanying speed gains). Relative to these benchmarks, the increases from AI access we observe here, especially the roughly half-point gain associated with o1-preview, represent substantial and practically significant improvements in the quality of written legal work.

To unpack these results, we next examine aggregate treatment effects across all legal tasks and across all quality attributes. Tables 3 and 4 present these complementary views. Table 3 reports the effect of AI access on each quality attribute pooled across all six tasks (for example, the average effect on Accuracy across all tasks, weighting each task equally). Table 4 reports the effect of AI access on overall performance within each task pooled across the five quality dimensions (for example, the effect on the Draft Legal Memo averaged across Accuracy, Analysis, Organization, Clarity, and Professionalism). Because Table 3 aggregates across tasks, those specifications include task fixed effects; because Table 4 aggregates across attributes, those specifications include attribute fixed effects. Both tables report results with and without participant fixed effects.

Table 2. Treatment Effects on Overall Performance Pooled Across Tasks and Attributes

| No AI mean (SD) | N | AI tool | Effect w/o participant FEs (SE) | Effect with participant FEs (SE) |
| --- | --- | --- | --- | --- |
| 4.14 (1.60) | 3,834 | Vincent AI | 0.26*** (0.10) | 0.27*** (0.10) |
| | | o1-preview | 0.52*** (0.09) | 0.53*** (0.09) |

Notes. Effects are shown as absolute changes in the mean quality score (on the 1–7 scale) relative to the No AI condition. Outcomes are pooled across all six tasks and all five quality attributes, weighting each task–attribute observation equally. Standard errors are clustered at the participant level. N denotes the number of participant–task–attribute observations contributing to the fully pooled regression. ***p < 0.01, **p < 0.05, *p < 0.1.

Table 3 shows clear, statistically significant improvements in Clarity, Organization, and Professionalism when participants have access to either AI tool (although the Vincent AI effect on Organization is significant only at the 10% level). For Analysis, we observe a statistically significant improvement for o1-preview, a result consistent with the model’s more advanced reasoning capabilities. In contrast, we find no evidence that either tool improves Accuracy, including Vincent AI, the RAG system. Across all attributes, o1-preview generally appears to produce larger gains in quality than Vincent AI, although the difference between the two tools is clearly statistically significant only for Organization and Professionalism. Notably, in this specification, adjusting for participant fixed effects has virtually no impact on the magnitude or significance of the treatment effects.
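The percentile equivalents used earlier in this subsection to interpret standardized effects can be reproduced with the standard normal distribution. This is a minimal sketch that assumes quality scores are approximately normally distributed, an assumption made here for illustration and not tested in the article; the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_after_shift(effect_sd: float, start_percentile: float = 50.0) -> float:
    """Percentile reached after improving by effect_sd standard deviations.

    Assumes scores follow a normal distribution (illustrative assumption).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + effect_sd)


for effect in (0.20, 0.30, 0.35, 0.92):
    print(f"{effect:.2f} SD -> percentile {percentile_after_shift(effect):.1f}")
```

The same function can start from a percentile other than the median, which is useful because an equal SD shift moves a lawyer through fewer percentile ranks near the tails than near the middle of the distribution.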
In Table 4, we see that access to AI tools improves task performance, although the size of these improvements varies substantially across tasks. Focusing on our specification that includes participant fixed effects, we find that o1-preview produces large and statistically significant improvements on two tasks—Analysis of Complaint and Draft Persuasive Letter—and marginal improvements on Draft Client Email and Draft Motion to Consolidate. Vincent AI shows a similar pattern, with a large effect on Analysis of Complaint and marginal gains on Draft Client Email, Draft Persuasive Letter, and Draft Motion to Consolidate. Neither tool appears to improve performance on the NDA or legal memo tasks, although the point estimate for Vincent AI on Draft Legal Memo is positive and sizable. These results stand in contrast to the findings of a previous randomized controlled trial of GPT-4, which detects no statistically significant improvements in overall quality across the four tasks tested in that study (Choi et al., 2024). The fact that the fixed-effects estimates differ more noticeably in Table 4 than in Table 3 reflects the different role participant-level heterogeneity plays when outcomes are aggregated across attributes within a task. In practical terms, some participants consistently write more clearly, reason more effectively, or follow instructions more closely across all five quality dimensions, and these stable differences can distort task-level treatment effects unless they are held constant. Once we control for each participant’s overall writing and analytic ability, the treatment effects more clearly reflect how AI assistance shapes performance on each specific task, independent of which student was assigned to complete it. Overall, these fixed-effects results reinforce the conclusion that o1-preview provides the most consistent gains, particularly on tasks requiring substantial reasoning and written advocacy. 
Figures 1 through 6 plot the raw distributions of aggregate quality scores for each task, separately for the control group, the Vincent AI group, and the o1-preview group. These figures are kernel density plots that show, for each group, the share of participants (y-axis) receiving each possible score (x-axis). They allow us to observe whether the improvements we report in Table 4 stem from broad rightward shifts in the full score distribution, reductions in low-scoring submissions, or changes concentrated near the upper tail. Consistent with the fixed-effects estimates, the distributions for o1-preview show clear rightward movements for Analysis of Complaint, Draft Motion to Consolidate, and Draft Persuasive Letter, with fewer low-scoring submissions and a greater concentration of high-quality work. By contrast, the distributions for Draft NDA and Draft Legal Memo overlap substantially, visually reinforcing the null treatment effects we identify in Table 4.

Table 3. Treatment Effects on Quality Attribute Performance Pooled Across Tasks

| Attribute | No AI mean (SD) | N | AI tool | Effect w/o participant FEs (SE) | Effect with participant FEs (SE) |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 4.03 (1.71) | 767 | Vincent AI | 0.01 (0.10) | 0.03 (0.11) |
| | | | o1-preview | 0.09 (0.12) | 0.10 (0.12) |
| Analysis | 3.87 (1.53) | 767 | Vincent AI | 0.16 (0.12) | 0.17 (0.11) |
| | | | o1-preview | 0.35*** (0.11) | 0.36*** (0.11) |
| Organization | 4.25 (1.55) | 767 | Vincent AI | 0.23* (0.13) | 0.25* (0.13) |
| | | | o1-preview | 0.62*** (0.12) | 0.63*** (0.12) |
| Clarity | 4.24 (1.37) | 766 | Vincent AI | 0.46*** (0.09) | 0.47*** (0.10) |
| | | | o1-preview | 0.60*** (0.10) | 0.61*** (0.10) |
| Professionalism | 4.32 (1.76) | 767 | Vincent AI | 0.42*** (0.14) | 0.43*** (0.14) |
| | | | o1-preview | 0.95*** (0.12) | 0.95*** (0.13) |

Notes. Effects are shown as absolute changes in the mean attribute score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single quality attribute, pooling outcomes across all six tasks and weighting tasks equally. Standard errors are clustered at the participant level. N denotes the number of participant–task observations contributing to the attribute-specific regression; N varies slightly across rows because some participants are missing scores for particular attributes on one or more tasks. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

Across all tasks, we see only modest differences in the shape or spread of the distributions, with no strong evidence that AI access meaningfully narrows or widens the overall variance of scores. Some density plots appear “taller” for one treatment group than another, but this typically reflects a slightly tighter clustering of scores around the modal region rather than a systematic compression or expansion of the distribution. On tasks with clear average improvements, such as Analysis of Complaint and Draft Persuasive Letter, the uplift reflects broad rightward movement rather than polarization between stronger and weaker performers. Conversely, on tasks with no discernible average effect, such as Draft NDA, the distributions are nearly indistinguishable, indicating that the null results are not due to offsetting positive and negative effects at different points in the performance spectrum. Together, these distributional patterns show that the strongest quality improvements occur on tasks requiring sustained reasoning and written advocacy and reinforce the suggestion that these gains appear consistently across the distribution rather than being confined to a narrow subset of participants.

Table 4. Treatment Effects on Task Performance Pooled Across Attributes

| Task | No AI mean (SD) | N | AI tool | Effect w/o participant FEs (SE) | Effect with participant FEs (SE) |
| --- | --- | --- | --- | --- | --- |
| Draft client email | 3.86 (1.70) | 675 | Vincent AI | 0.68** (0.30) | 0.28* (0.16) |
| | | | o1-preview | 0.37 (0.29) | 0.56* (0.29) |
| Draft legal memo | 3.46 (1.41) | 625 | Vincent AI | 0.38 (0.23) | 0.23 (0.23) |
| | | | o1-preview | 0.72*** (0.23) | 0.10 (0.28) |
| Analysis of complaint | 4.88 (1.32) | 635 | Vincent AI | 0.39* (0.22) | 1.00*** (0.33) |
| | | | o1-preview | 0.50** (0.24) | 0.94*** (0.30) |
| Draft NDA | 5.28 (0.88) | 635 | Vincent AI | 0.01 (0.16) | 0.43 (0.30) |
| | | | o1-preview | 0.22 (0.17) | 0.08 (0.29) |
| Draft motion to consolidate | 3.50 (1.30) | 634 | Vincent AI | 0.42* (0.25) | 0.28* (0.16) |
| | | | o1-preview | 1.01*** (0.22) | 0.41 (0.27) |
| Draft persuasive letter | 3.91 (1.86) | 630 | Vincent AI | 0.35 (0.34) | 0.28* (0.16) |
| | | | o1-preview | 0.82*** (0.31) | 1.28*** (0.30) |

Notes. Effects are shown as absolute changes in the mean quality score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single legal task, pooling outcomes across all five quality attributes and weighting attributes equally. Standard errors are clustered at the participant level. N denotes the number of participant–attribute observations contributing to the task-specific regression. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

Figure 1. Total Score Distributions, Draft Client Email.
Notes: This figure displays kernel-smoothed distributions of total scores for the Draft Client Email task by AI condition. Total score is the sum of all five quality attributes for a given task. Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s allowable score range (0–35); these boundary violations are a mechanical artifact of the smoothing procedure rather than observed scores. Total N = 135.
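The boundary artifact mentioned in the figure notes follows directly from how kernel smoothing works. The sketch below uses a Gaussian kernel with an arbitrary bandwidth and a handful of made-up total scores; the kernel, bandwidth, and data are all assumptions for illustration, since the article does not report which smoothing settings produced the figures.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical total scores; none exceeds the maximum attainable total of 35.
scores = [12, 18, 21, 25, 28, 30, 33, 35]
density = gaussian_kde(scores, bandwidth=2.0)

# Every Gaussian bump has infinite support, so the estimate is strictly
# positive even at values no submission could receive.
print(f"highest observed score: {max(scores)}")
print(f"estimated density at 36: {density(36.0):.4f}")
```

Because each observation contributes a small bell curve centered on its score, observations near the top of the range spill some estimated density past the maximum, which is why the plotted curves can extend beyond attainable scores without any such score having been awarded.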
We next examine treatment effects at the most disaggregated level: each combination of legal task and quality attribute. For each form of legal work we assess in this study (Draft Client Email, Draft Legal Memo, Analysis of Complaint, Draft NDA, Draft Motion to Consolidate, and Draft Persuasive Letter), we estimate both a pooled OLS model and a participant fixed-effects model. The pooled estimates provide a comparison of treated and untreated submissions while adjusting for systematic differences across tasks and attributes, whereas the participant fixed-effects specification also adjusts for stable differences in baseline writing and analytic ability across individuals. Reporting both allows readers to assess whether our estimates reflect differences in which participants happened to perform a given type of work and whether the results persist once we compare each participant’s AI-assisted work to their own baseline performance. Importantly, although we report one coefficient per task–attribute combination, the fixed-effects estimates draw on each participant’s full set of submissions, enabling the model to filter out underlying skill differences using our full data set. Because this approach more effectively isolates the impact of AI assistance on legal work quality, we rely on the fixed-effects estimates in Tables 5 through 9 and provide the pooled OLS results in Appendix A for comparison.

Figure 2. Total Score Distributions, Draft Legal Memo.
Notes: This figure displays kernel-smoothed distributions of total scores for the Draft Legal Memo task by AI condition. Total score is the sum of all five quality attributes for a given task. Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s allowable score range (0–35); these boundary violations are a mechanical artifact of the smoothing procedure rather than observed scores. Total N = 125.

Figure 3. Total Score Distributions, Analysis of Complaint.
Notes: This figure displays kernel-smoothed distributions of total scores for the Analysis of Complaint task by AI condition. Total score is the sum of all five quality attributes for a given task. Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s allowable score range (0–35); these boundary violations are a mechanical artifact of the smoothing procedure rather than observed scores. Total N = 127.

Across the task–attribute results, a consistent pattern emerges: AI assistance improves several dimensions of legal work quality, but the strength and scope of these improvements vary meaningfully across both tasks and attributes. The most frequent and robust gains appear in Clarity, Organization, and Professionalism, where both tools, but especially o1-preview, deliver large improvements across multiple forms of legal writing (often close to one standard deviation of the control-group scores). Improvements in Analysis are more selective, concentrated in tasks that require substantial legal reasoning and structured explanation (e.g., Draft Persuasive Letter).

Figure 4. Total Score Distributions, Draft NDA.
Notes: This figure displays kernel-smoothed distributions of total scores for the Draft NDA task by AI condition. Total score is the sum of all five quality attributes for a given task. Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s allowable score range (0–35); these boundary violations are a mechanical artifact of the smoothing procedure rather than observed scores. Total N = 127.

Figure 5. Total Score Distributions, Draft Motion to Consolidate.
Notes: This figure displays kernel-smoothed distributions of total scores for the Draft Motion to Consolidate task by AI condition. Total score is the sum of all five quality attributes for a given task. Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s allowable score range (0–35); these boundary violations are a mechanical artifact of the smoothing procedure rather than observed scores. Total N = 127.

Within this broader pattern, our estimates of the effects of access to AI tools on Analysis of Complaint stand out. On this task, o1-preview produces large, statistically significant improvements across all five quality attributes (including one of the few statistically significant gains in Accuracy reported in Table 5), while Vincent AI also appears to yield substantial, though more modest, benefits. These unusually strong gains help drive the aggregate improvements we report in Table 3, particularly in Analysis, Organization, Clarity, and Professionalism. The complaint analysis task, which requires issue identification, synthesis of factual allegations, and articulation of a reasoned legal argument, aligns closely with the strengths of a reasoning-oriented model like o1-preview and may help to explain why the RAG-based Vincent AI model, although helpful, produces smaller improvements.

Figure 6. Total Score Distributions, Draft Persuasive Letter.
Notes: This figure displays kernel-smoothed distributions of total scores for the Draft Persuasive Letter task by AI condition. Total score is the sum of all five quality attributes for a given task. Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s allowable score range (0–35); these boundary violations are a mechanical artifact of the smoothing procedure rather than observed scores. Total N = 126.

Table 5. Treatment Effects on Task-Level Accuracy

| Task | No AI mean (SD) | N | AI tool | Effect (SE) | % change |
| --- | --- | --- | --- | --- | --- |
| Draft client email | 3.25 (1.71) | 135 | Vincent AI | 0.05 (0.15) | 1.6% |
| | | | o1-preview | 0.21 (0.29) | 6.3% |
| Draft legal memo | 3.02 (1.20) | 125 | Vincent AI | 0.09 (0.32) | +3.0% |
| | | | o1-preview | 0.67** (0.33) | 22.1% |
| Analysis of complaint | 5.10 (1.26) | 127 | Vincent AI | 0.95*** (0.33) | +18.6% |
| | | | o1-preview | 0.98*** (0.33) | +19.2% |
| Draft NDA | 5.70 (0.80) | 127 | Vincent AI | 0.60* (0.33) | 10.6% |
| | | | o1-preview | 0.05 (0.32) | 0.9% |
| Draft motion to consolidate | 3.78 (1.33) | 127 | Vincent AI | 0.05 (0.15) | 1.4% |
| | | | o1-preview | 0.27 (0.30) | 7.2% |
| Draft persuasive letter | 3.44 (1.89) | 126 | Vincent AI | 0.05 (0.15) | 1.5% |
| | | | o1-preview | 0.85*** (0.29) | +24.6% |

Notes. Effects are shown as absolute changes in the mean Accuracy score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single legal task. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing Accuracy observations for the corresponding task. Sample sizes may differ across Tables 5–9 because task completion and attribute scoring vary across tasks and attributes. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

At the other end of the spectrum, we find no evidence that access to either AI tool helps to improve performance on the NDA task. For this form of transactional drafting, neither o1-preview nor Vincent AI produces significant gains on any quality attribute, and the distributions of raw scores overlap almost completely across treatment conditions. Several factors may explain these null results, though each remains speculative.
One possibility is that transactional drafting receives less emphasis in typical law school training, giving participants a weaker foundation for using AI effectively on this task, which would leave open the possibility that experienced lawyers might benefit more from access to AI tools. Another is that these tools, especially the RAG-based Vincent AI model, may be better suited to litigation-oriented reasoning than to transactional drafting. A third is that, because we supplied a template for the NDA task (a common feature of transactional practice), participants had less discretion in drafting, which likely limited the scope for AI-driven quality improvement. Finally, the design of the task required only modest factual customization, leaving limited room for quality gains regardless of whether AI tools were available. In any event, the estimates from the NDA task contribute little to the aggregate results in Tables 2–4, and the precise reasons why tasks requiring structured reasoning, legal analysis, or written advocacy exhibit the largest gains remain an open question for future research.

Table 6. Treatment Effects on Task-Level Analysis

| Task | No AI mean (SD) | N | AI tool | Effect (SE) | % change |
| --- | --- | --- | --- | --- | --- |
| Draft client email | 3.64 (1.62) | 135 | Vincent AI | 0.04 (0.32) | +1.1% |
| | | | o1-preview | 0.02 (0.32) | 0.5% |
| Draft legal memo | 3.16 (1.13) | 125 | Vincent AI | 0.16 (0.32) | +5.0% |
| | | | o1-preview | 0.24 (0.33) | 7.7% |
| Analysis of complaint | 4.62 (1.39) | 127 | Vincent AI | 1.02*** (0.33) | +22.0% |
| | | | o1-preview | 0.97*** (0.33) | +20.9% |
| Draft NDA | 4.93 (0.91) | 127 | Vincent AI | 0.66** (0.33) | 13.3% |
| | | | o1-preview | 0.02 (0.32) | 0.3% |
| Draft motion to consolidate | 3.42 (1.29) | 127 | Vincent AI | 0.16 (0.32) | +4.8% |
| | | | o1-preview | 0.16 (0.33) | +4.7% |
| Draft persuasive letter | 3.46 (1.85) | 126 | Vincent AI | 0.35 (0.33) | +10.2% |
| | | | o1-preview | 1.38*** (0.33) | +39.8% |

Notes. Effects are shown as absolute changes in the mean Analysis score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single legal task. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing Analysis observations for the corresponding task. Sample sizes may differ across Tables 5–9 because task completion and attribute scoring vary across tasks and attributes. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

Table 7. Treatment Effects on Task-Level Organization

| Task | No AI mean (SD) | N | AI tool | Effect (SE) | % change |
| --- | --- | --- | --- | --- | --- |
| Draft client email | 4.05 (1.60) | 135 | Vincent AI | 0.08 (0.32) | +2.1% |
| | | | o1-preview | 0.49 (0.32) | +12.1% |
| Draft legal memo | 4.14 (1.57) | 125 | Vincent AI | 0.07 (0.32) | 1.8% |
| | | | o1-preview | 0.19 (0.33) | +4.6% |
| Analysis of complaint | 4.80 (1.26) | 127 | Vincent AI | 1.16*** (0.33) | +24.2% |
| | | | o1-preview | 1.05*** (0.33) | +21.8% |
| Draft NDA | 5.05 (0.79) | 127 | Vincent AI | 0.49 (0.33) | 9.7% |
| | | | o1-preview | 0.11 (0.32) | 2.2% |
| Draft motion to consolidate | 3.47 (1.39) | 127 | Vincent AI | 0.63* (0.32) | +18.2% |
| | | | o1-preview | 0.81** (0.33) | +23.3% |
| Draft persuasive letter | 4.00 (1.99) | 126 | Vincent AI | 0.18 (0.33) | +4.5% |
| | | | o1-preview | 1.51*** (0.33) | +37.9% |

Notes. Effects are shown as absolute changes in the mean Organization score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single legal task. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing Organization observations for the corresponding task. Sample sizes may differ across Tables 5–9 because task completion and attribute scoring vary across tasks and attributes. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

These task–attribute patterns also help clarify why the fixed-effects results in Tables 5 through 9 differ from the pooled OLS estimates that we report in Appendix A.
Because the fixed-effects specification adjusts for each participant’s overall writing and analytic ability, it removes baseline differences that can inflate or attenuate the pooled estimates, particularly on tasks where stronger or weaker writers are unevenly distributed across treatment groups. For instance, the pooled OLS results suggest that o1-preview improves Analysis on Draft Legal Memo and Draft Motion to Consolidate, but these effects weaken or disappear once we include participant fixed effects, whereas the large improvements on Analysis of Complaint and Draft Persuasive Letter remain robust. This contrast underscores the idea that the fixed-effects model more cleanly captures how AI assistance alters the quality of different types of legal work, independent of who performs that work.

Taken together, the task–attribute results show that AI tools deliver the greatest quality improvements on legal work that demands reasoning, narrative synthesis, and persuasive writing, while highly structured or template-driven tasks (most notably Draft NDA) benefit far less.

Table 8. Treatment Effects on Task-Level Clarity

| Task | No AI mean (SD) | N | AI tool | Effect (SE) | % change |
| --- | --- | --- | --- | --- | --- |
| Draft client email | 4.11 (1.47) | 135 | Vincent AI | 0.58* (0.32) | +14.2% |
| | | | o1-preview | 1.27*** (0.32) | +30.9% |
| Draft legal memo | 3.35 (1.33) | 125 | Vincent AI | 0.55* (0.32) | +16.6% |
| | | | o1-preview | 0.16 (0.33) | +4.8% |
| Analysis of complaint | 5.05 (1.15) | 127 | Vincent AI | 0.84** (0.33) | +16.7% |
| | | | o1-preview | 0.71** (0.33) | +14.0% |
| Draft NDA | 5.14 (0.77) | 127 | Vincent AI | 0.46 (0.33) | 8.9% |
| | | | o1-preview | 0.07 (0.32) | 1.4% |
| Draft motion to consolidate | 3.42 (0.84) | 126 | Vincent AI | 0.33 (0.32) | +9.5% |
| | | | o1-preview | 0.16 (0.33) | +4.5% |
| Draft persuasive letter | 4.44 (1.41) | 126 | Vincent AI | 0.88*** (0.33) | +19.8% |
| | | | o1-preview | 1.29*** (0.33) | +29.0% |

Notes. Effects are shown as absolute changes in the mean Clarity score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single legal task. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing Clarity observations for the corresponding task. Sample sizes may differ across Tables 5–9 because task completion and attribute scoring vary across tasks and attributes. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

Table 9. Treatment Effects on Task-Level Professionalism

| Task | No AI mean (SD) | N | AI tool | Effect (SE) | % change |
| --- | --- | --- | --- | --- | --- |
| Draft client email | 4.30 (1.95) | 135 | Vincent AI | 0.36 (0.32) | +8.3% |
| | | | o1-preview | 1.26*** (0.32) | +29.3% |
| Draft legal memo | 3.63 (1.56) | 125 | Vincent AI | 0.44 (0.32) | +12.1% |
| | | | o1-preview | 0.68** (0.33) | +18.6% |
| Analysis of complaint | 4.83 (1.52) | 127 | Vincent AI | 1.41*** (0.33) | +29.2% |
| | | | o1-preview | 1.40*** (0.33) | +28.9% |
| Draft NDA | 5.58 (0.91) | 127 | Vincent AI | 0.33 (0.33) | 6.0% |
| | | | o1-preview | 0.18 (0.32) | 3.2% |
| Draft motion to consolidate | 3.40 (1.57) | 127 | Vincent AI | 0.35 (0.32) | +10.2% |
| | | | o1-preview | 0.82** (0.33) | +24.2% |
| Draft persuasive letter | 4.23 (1.97) | 126 | Vincent AI | 0.40 (0.33) | +9.5% |
| | | | o1-preview | 1.75*** (0.33) | +41.3% |

Notes. Effects are shown as absolute changes in the mean Professionalism score (on the 1–7 scale) relative to the No AI condition. Each row reports results for a single legal task. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing Professionalism observations for the corresponding task. Sample sizes may differ across Tables 5–9 because task completion and attribute scoring vary across tasks and attributes. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

In addition to evaluating legal work using our five quality-related metrics, we separately tracked instances of hallucinations, which we define as citations to entirely fabricated sources or to real sources cited based on a fundamentally inaccurate understanding of their content. The results, presented in Figure 7, indicate that while hallucinations are rare with these tools, they do occur. Although the small number of reported hallucinations limits our ability to draw definitive conclusions about their comparative likelihood, the data tentatively suggest that RAG-based systems like Vincent AI may reduce hallucinations. Indeed, we identified fewer hallucinations in tasks completed with Vincent AI (3 total) than in those completed without any AI assistance at all (4 total).14 By contrast, tasks completed with o1-preview exhibit a substantially higher absolute number of hallucinations (11 total).
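The "% change" columns in Tables 5 through 9 appear to express each effect as a share of the corresponding No AI mean. The article does not state the formula explicitly, so the sketch below is an inference from the reported figures, and the function name is hypothetical. The two inputs are taken from Table 7 (Organization): Vincent AI on Draft Motion to Consolidate and Vincent AI on Analysis of Complaint.

```python
def percent_change(effect: float, control_mean: float) -> float:
    """Effect expressed as a percentage of the No AI (control) mean."""
    if control_mean == 0:
        raise ValueError("control_mean must be nonzero")
    return 100.0 * effect / control_mean


# (effect, No AI mean) pairs as reported in Table 7.
motion_vincent = percent_change(0.63, 3.47)
complaint_vincent = percent_change(1.16, 4.80)

print(f"Draft motion to consolidate, Vincent AI: {motion_vincent:.1f}%")
print(f"Analysis of complaint, Vincent AI: {complaint_vincent:.1f}%")
```

Recomputing other cells this way can differ from the published percentage by about a tenth of a point, which is what one would expect if the published percentages were calculated from unrounded coefficients.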
4.2 Effects of AI Access on Time Use
We next examine how access to AI tools affects efficiency in completing legal work. Prior work provides strong evidence that AI assistance meaningfully accelerates legal analysis: Choi et al. (2024) report that access to GPT-4 reduces the time needed to complete realistic lawyering tasks by 12 to 32%. Our study enables us to evaluate whether similar efficiency gains arise when participants use either a reasoning-oriented model (o1-preview) or a RAG-based legal tool (Vincent AI). Because the six tasks in our experiment vary substantially in their allocated time (from 60 minutes for Draft Client Email to 240 minutes for Draft Legal Memo), we measure time spent as the percentage of the allotted time participants devote to each task. This standardized metric enables direct comparison across heterogeneous forms of legal work, although using absolute minutes yields the same substantive conclusions. As with our quality analysis, we rely on the participant fixed-effects specification to isolate the causal effect of AI assistance on speed. We report these results in Table 10.

Across nearly all tasks, access to AI tools substantially reduces the percentage of the allotted time participants use to complete their work. Under the participant fixed-effects specification, both Vincent AI and o1-preview reduce time use by roughly 20–28% on five of the six tasks we evaluate, with the lone exception of Draft NDA, where neither tool appears to meaningfully affect time spent. The only other modest effect occurs for o1-preview on Draft Client Email, which shows a smaller but directionally similar reduction. These magnitudes closely resemble the 12–32% reductions documented by Choi et al. (2024) for users of GPT-4, though our findings exhibit a more consistent pattern across a broader array of legal work and across two distinct classes of AI systems.
Overall, our results provide convincing empirical evidence that access to either Vincent AI or o1-preview meaningfully accelerates many forms of legal work, even as the gains vary somewhat across task type.

Figures 8 through 13 provide additional insight into how AI affects the distribution of time spent across the six forms of legal work. These density plots show, for each treatment group, the distribution of the number of minutes participants used to complete each task. For tasks such as Draft Client Email, Analysis of Complaint, Draft Motion to Consolidate, and Draft Persuasive Letter, the Vincent AI and o1-preview curves shift noticeably to the left, indicating broad-based reductions in time spent rather than changes concentrated among only a few participants. By contrast, the curves for Draft NDA overlap almost exactly across groups, visually reinforcing the notion that drafting a nondisclosure agreement is the one task where neither tool produces meaningful speed gains or quality improvements, perhaps because of the task’s template-driven, transactional structure or its limited opportunities for substantive modification.

Figure 7. Hallucinations. Notes: This figure reports the number of submissions by task and AI condition in which at least one hallucination was identified. Hallucinations are defined as citations to entirely fabricated sources or to real sources cited based on a fundamentally inaccurate understanding of their content. Because hallucinations are rare events in the sample, the figure is intended to be descriptive rather than to support formal statistical inference.

Table 11, which reproduces the GPT-4 results from Choi et al. (2024), provides a useful benchmark for interpreting these patterns. Choi et al. find sizable and uniform reductions in completion time across all four tasks they evaluate, raising the question whether newer tools, such as a reasoning model like o1-preview or a RAG-based system like Vincent AI, might generate even greater or more consistent speed gains. Our data offer no evidence of such differences: the density plots show similarly consistent leftward shifts for five of the six tasks we evaluate, with neither model class systematically outperforming the other or GPT-4 in terms of time savings.
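The notes to Figures 8 through 13 observe that kernel-smoothed densities extend beyond each task's time limit even though no participant could exceed it. The sketch below illustrates why, assuming a Gaussian kernel; the article does not state which kernel or bandwidth was used, so the bandwidth of 5 minutes, the sample of times, and the function names `gaussian_kde` and `mass_above` are all hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on the samples."""
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def mass_above(samples, bandwidth, limit):
    """Share of the smoothed density that lies above the limit."""
    tails = (
        0.5 * math.erfc((limit - s) / (bandwidth * math.sqrt(2.0)))
        for s in samples
    )
    return sum(tails) / len(samples)


# Invented minutes used on a 60-minute task; three participants use all of it.
minutes = [35, 42, 48, 55, 58, 60, 60, 60]
density = gaussian_kde(minutes, bandwidth=5.0)

print(max(minutes) <= 60)                  # True: no observation exceeds the limit
print(density(65) > 0.0)                   # True: the smoothed curve still does
print(0.0 < mass_above(minutes, 5.0, 60))  # True: some mass sits past 60 minutes
```

Each observation at the limit contributes a symmetric bump, half of which falls past the boundary. The overshoot is therefore a property of the smoothing rather than of participant behavior, as the figure notes state.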
4.3 Effects of AI Access on Productivity
Finally, we examine how access to AI tools affects productivity. We define productivity as a measure of quality-adjusted legal work that is normalized by the fraction of the task’s allotted time a participant spends completing the task. In practice, task-level productivity equals the sum of the five quality attribute scores divided by the share of allotted time spent. This construction yields a productivity measure that is comparable across tasks with different time limits. Table 12 presents estimates from specifications that include participant fixed effects, isolating the effect of AI access on productivity using within-participant variation across tasks. As a robustness check, we report corresponding pooled OLS estimates without participant fixed effects in Appendix A (Table A8), which show a similar pattern of results.

The productivity gains associated with the availability of AI assistance are striking. Access to Vincent AI generates substantial improvements on five of the six tasks, ranging from roughly +50% to +90%. Access to o1-preview produces comparably large and, on several tasks, even larger gains, improving productivity on four tasks with increases of about +75% to +130%. The largest effects appear on tasks requiring structured reasoning and written advocacy, especially Analysis of Complaint, Draft Legal Memo, and Draft Persuasive Letter. Participants complete these tasks far faster while maintaining or improving quality. To put these magnitudes in perspective, a 130% increase in productivity means that lawyers using o1-preview produce more than twice as much quality-adjusted work after accounting for the fraction of the task’s allotted time spent (for example, by completing a task in less than half the allotted time without sacrificing quality or by delivering substantially higher-quality work in the same amount of time).
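The productivity definition and the interpretation of a 130% gain can be made concrete with a short sketch. It assumes only what the text states (five quality attribute scores summed, then divided by the share of allotted time used); the function name `productivity` and the example scores are hypothetical and do not reflect the study's actual rating scale.

```python
def productivity(quality_scores, share_of_time):
    """Sum of the five quality attribute scores over the share of time used."""
    if len(quality_scores) != 5:
        raise ValueError("expected five quality attribute scores")
    if share_of_time <= 0:
        raise ValueError("share of allotted time must be positive")
    return sum(quality_scores) / share_of_time


# Invented scores: identical quality, but the second submission takes half
# as large a share of the allotted time.
baseline = productivity([4, 4, 4, 4, 4], share_of_time=0.5)    # 40.0
faster = productivity([4, 4, 4, 4, 4], share_of_time=0.25)     # 80.0
print(faster / baseline)     # 2.0, i.e. a +100% change

# A +130% change multiplies productivity by 2.3. Holding quality fixed,
# that corresponds to using 1 / 2.3 of the baseline share of time.
print(round(1 / 2.3, 3))     # 0.435
```

Under this definition, halving time at constant quality doubles productivity, so a 130% gain requires either cutting time to about 43% of its baseline share at constant quality or some combination of higher quality and less time.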
These impressive gains reflect modest improvements in quality coupled with large, consistent reductions in the amount of time spent on the task. The same qualitative pattern appears in pooled OLS specifications that omit participant fixed effects, reported in Appendix A (Table A8). The sole exception to this pattern of productivity improvements is the NDA task, where we find no evidence in our experiment that either tool affects performance, mirroring the null effects on both quality and time for this form of transactional drafting.
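The logic of a participant fixed-effects specification can be illustrated with a within transformation. The sketch below is a simplified illustration and not the estimation code behind Tables 10 and 12: it assumes a single binary treatment, omits any task-level controls and clustered standard errors, and uses invented data. The function names `within_estimate` and `pooled_difference` are hypothetical. The toy data are deliberately unbalanced so that the contrast with a pooled comparison is visible; in the experiment itself, each participant completed two tasks under each condition.

```python
from collections import defaultdict


def within_estimate(rows):
    """Slope of outcome on treatment after removing participant means."""
    by_person = defaultdict(list)
    for person, treated, outcome in rows:
        by_person[person].append((treated, outcome))

    numerator = 0.0
    denominator = 0.0
    for observations in by_person.values():
        mean_t = sum(t for t, _ in observations) / len(observations)
        mean_y = sum(y for _, y in observations) / len(observations)
        for t, y in observations:
            numerator += (t - mean_t) * (y - mean_y)
            denominator += (t - mean_t) ** 2
    return numerator / denominator


def pooled_difference(rows):
    """Raw difference in mean outcomes between treated and untreated rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Invented data: outcome = participant baseline + 3 * treated, no noise.
# Participant p2 has the higher baseline and is treated more often.
rows = [
    ("p1", 0, 10.0), ("p1", 0, 10.0), ("p1", 1, 13.0),
    ("p2", 0, 20.0), ("p2", 1, 23.0), ("p2", 1, 23.0),
]

print(round(within_estimate(rows), 6))     # 3.0
print(round(pooled_difference(rows), 2))   # 6.33
```

Subtracting each participant's own mean removes stable differences in baseline ability, so the within estimate recovers the effect of 3 built into the toy data, while the pooled comparison mixes that effect with the difference between participants.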
4.4 Qualitative Assessment of AI-Assisted Work
To gain deeper insight into how access to AI tools affects the quality of legal work, our three grading co-authors conducted a post-analysis qualitative review of the task submissions they evaluated, this time with knowledge of whether the participant in question had access to o1-preview, Vincent AI, or neither AI tool. Their observations provide a complementary perspective on the quantitative patterns we document above and highlight the kind of legal writing and analysis that AI access tends to generate.

Table 10. Treatment Effects on Task Time Spent (Share of Allotted Time)
Task | No AI Mean (SD) | N | AI tool | Effect | SE | % change
Draft client email | 0.84 (0.16) | 134 | Vincent AI | -0.17*** | (0.03) | -20.2%
Draft client email | 0.84 (0.16) | 134 | o1-preview | -0.05 | (0.06) | -6.4%
Draft legal memo | 0.76 (0.19) | 124 | Vincent AI | -0.18*** | (0.05) | -23.5%
Draft legal memo | 0.76 (0.19) | 124 | o1-preview | -0.20*** | (0.06) | -25.7%
Analysis of complaint | 0.89 (0.14) | 126 | Vincent AI | -0.25*** | (0.05) | -27.6%
Analysis of complaint | 0.89 (0.14) | 126 | o1-preview | -0.22*** | (0.05) | -25.1%
Draft NDA | 0.54 (0.25) | 127 | Vincent AI | -0.06 | (0.06) | -11.9%
Draft NDA | 0.54 (0.25) | 127 | o1-preview | -0.02 | (0.05) | -3.9%
Draft motion to consolidate | 0.64 (0.27) | 127 | Vincent AI | -0.17*** | (0.03) | -26.5%
Draft motion to consolidate | 0.64 (0.27) | 127 | o1-preview | -0.21*** | (0.05) | -33.0%
Draft persuasive letter | 0.74 (0.24) | 126 | Vincent AI | -0.17*** | (0.03) | -22.9%
Draft persuasive letter | 0.74 (0.24) | 126 | o1-preview | -0.16*** | (0.05) | -22.1%
Notes. Effects are shown as absolute changes in the fraction of the task’s allotted time that the participant spent relative to the No AI condition. Each row reports results for a single task and AI tool. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing time-spent observations for the corresponding task. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

A consistent theme emerges from the qualitative review: legal professionals who have access to AI tools produce writing that is clearer, more polished, and easier to read than work produced without access to AI tools.
Their sentences are more concise, their paragraphs flow more smoothly, and the overall structure presents information in a more coherent and user-friendly manner. Their submissions also contain far fewer surface-level errors (such as typos, comma splices, and other distracting mistakes), which AI support helps eliminate. These stylistic improvements align closely with the quantitative gains we observe in Clarity, Organization, and Professionalism, and any differences between the work product of people using Vincent AI and those using o1-preview appear modest along these dimensions. By contrast, the quality of writing varies widely when people performing legal work do not have access to AI tools. As one grading co-author, using a bowling analogy, put it: it is as if people with access to AI tools are not only playing with bumpers built into the gutters (to prevent huge mistakes) but also are told which ball to use, which shoes to use, and where to aim.

Figure 8. Time Spent Distributions, Draft Client Email. Notes: This figure displays kernel-smoothed distributions of time spent on the Draft Client Email task by AI condition. Time spent is measured in minutes and is bounded by the task’s time limit (0–60 minutes). Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s time limit; these boundary violations are a mechanical artifact of the smoothing procedure rather than observed behavior. Total N = 134.

Figure 9. Time Spent Distributions, Draft Legal Memo. Notes: This figure displays kernel-smoothed distributions of time spent on the Draft Legal Memo task by AI condition. Time spent is measured in minutes and is bounded by the task’s time limit (0–240 minutes). Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s time limit; these boundary violations are a mechanical artifact of the smoothing procedure rather than observed behavior. Total N = 124.
Figure 10. Time Spent Distributions, Analysis of Complaint. Notes: This figure displays kernel-smoothed distributions of time spent on the Analysis of Complaint task by AI condition. Time spent is measured in minutes and is bounded by the task’s time limit (0–120 minutes). Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s time limit; these boundary violations are a mechanical artifact of the smoothing procedure rather than observed behavior. Total N = 126.

Figure 11. Time Spent Distributions, Draft NDA. Notes: This figure displays kernel-smoothed distributions of time spent on the Draft NDA task by AI condition. Time spent is measured in minutes and is bounded by the task’s time limit (0–180 minutes). Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s time limit; these boundary violations are a mechanical artifact of the smoothing procedure rather than observed behavior. Total N = 127.

Figure 12. Time Spent Distributions, Draft Motion to Consolidate. Notes: This figure displays kernel-smoothed distributions of time spent on the Draft Motion to Consolidate task by AI condition. Time spent is measured in minutes and is bounded by the task’s time limit (0–150 minutes). Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s time limit; these boundary violations are a mechanical artifact of the smoothing procedure rather than observed behavior. Total N = 127.

Figure 13. Time Spent Distributions, Draft Persuasive Letter. Notes: This figure displays kernel-smoothed distributions of time spent on the Draft Persuasive Letter task by AI condition. Time spent is measured in minutes and is bounded by the task’s time limit (0–150 minutes). Because the distributions are kernel-smoothed, estimated densities extend beyond the task’s time limit; these boundary violations are a mechanical artifact of the smoothing procedure rather than observed behavior. Total N = 126.

Importantly, the stabilizing, “raise-the-floor” effect of AI access appears to be less pronounced in Analysis and Accuracy. Still, in several tasks, AI assistance seems to help legal writers focus more effectively on the central legal questions. For example, practitioners with AI access are less likely to veer off on tangents and more likely to concentrate on key material issues. AI tools also appear to reduce the amount of time people spend floundering during the research stage in a way that might otherwise leave too little time to write. These patterns suggest a connection between AI’s speed and its quality-related benefits: by streamlining early-stage research and issue-spotting, access to AI tools may effectively free lawyers to invest more time in analysis and refinement of their work product.

Yet the positive effects of AI support on Analysis and Accuracy are inconsistent at best, especially for Vincent AI. AI assistance, and access to Vincent AI in particular, tends to be more beneficial when the legal task involves a narrow, well-defined issue with a clearly articulated deliverable. Conversely, its advantages diminish on broader tasks that require identifying the key issues. Vincent AI users are more likely to struggle with task identification on such broad tasks. AI users are also more prone to respond to a far broader set of issues than they should, and they frequently include fewer relevant citations. In some cases, they provide case names without citations or rely on non-binding administrative or secondary sources. Additionally, AI access seems to lead participants to oversimplify legal questions or, in the case of o1-preview, to omit legal authorities altogether.
Our qualitative review bears out our expectation that access to o1-preview might generate a disproportionate number of hallucinated sources in legal work. Although the overall number of hallucinated citations was very small (18 across all treatment groups on a total of 768 tasks), the pattern of inaccuracies was clear. For example, o1-preview users occasionally cited cases that were entirely fabricated, meaning they did not exist under the names or citations that participants provided. A more subtle issue relates to the types of sources that AI users might use to establish or defend their claims. Vincent AI users appeared to be more likely to include obscure and often unnecessary sources, setting them apart from other participants.
4.5 Effects of AI Access Across Baseline Skill Levels
Table 11. Average Time Spent Completing Tasks With and Without GPT-4 (Choi et al., 2024)
Task | No GPT-4 (Std. Dev.) | With GPT-4 (Std. Dev.) | Difference (95% CI) | % Diff. | p value
Complaint Drafting | 160.69 (72.38) | 122.00 (66.80) | -38.77 (-64.00, -13.36) | -24.1 | 0.0018
Contract Drafting | 69.72 (32.00) | 47.59 (31.09) | -22.40 (-33.71, -10.91) | -32.1 | 0.0000
Employee Handbook | 37.24 (9.55) | 29.41 (13.42) | -7.84 (-12.03, -3.74) | -21.1 | 0.0000
Client Memo | 244.41 (58.03) | 215.69 (72.96) | -28.75 (-52.59, -5.05) | -11.8 | 0.0152
Notes. This table reproduces results reported in Choi et al. (2024) showing the average time their participants spent completing several legal tasks (in minutes) with and without access to GPT-4. See Choi et al. (2024) for details on the experimental design and estimation strategy.

Table 12. Treatment Effects on Task-Level Productivity
Task | No AI Mean (SD) | N | AI tool | Effect | SE | % change
Draft client email | 23.57 (10.70) | 134 | Vincent AI | 17.16*** | (3.21) | +72.8%
Draft client email | 23.57 (10.70) | 134 | o1-preview | 4.97 | (6.50) | +21.1%
Draft legal memo | 24.08 (10.02) | 124 | Vincent AI | 19.27*** | (6.00) | +80.0%
Draft legal memo | 24.08 (10.02) | 124 | o1-preview | 25.86*** | (7.23) | +107.4%
Analysis of complaint | 28.55 (9.63) | 126 | Vincent AI | 26.08*** | (8.80) | +91.4%
Analysis of complaint | 28.55 (9.63) | 126 | o1-preview | 22.07*** | (6.35) | +77.3%
Draft NDA | 67.12 (50.93) | 127 | Vincent AI | 15.72 | (13.45) | +23.4%
Draft NDA | 67.12 (50.93) | 127 | o1-preview | 3.24 | (10.53) | +4.8%
Draft motion to consolidate | 35.18 (24.05) | 127 | Vincent AI | 17.16*** | (3.21) | +48.8%
Draft motion to consolidate | 35.18 (24.05) | 127 | o1-preview | 32.11*** | (9.92) | +91.3%
Draft persuasive letter | 28.93 (13.15) | 126 | Vincent AI | 17.16*** | (3.21) | +59.3%
Draft persuasive letter | 28.93 (13.15) | 126 | o1-preview | 38.01*** | (12.81) | +131.4%
Notes. Effects are shown as absolute changes in task productivity, defined as the sum of the five quality attribute scores divided by the fraction of the task’s allotted time that the participant spent, relative to the No AI condition. Each row reports results for a single task and AI tool. Standard errors are clustered at the participant level. All estimates are from specifications that include participant fixed effects. N denotes the number of participants contributing valid quality and time-spent observations for the corresponding task. These estimates are not adjusted for multiple hypothesis testing. ***p < 0.01, **p < 0.05, *p < 0.1.

In addition to assessing how access to o1-preview and Vincent AI can affect the performance of legal work on average, we also explore how outcomes vary with participants’ baseline skill levels. Prior research suggests that when GPT-4 does affect the quality of legal work, it does so unevenly, benefiting those with lower initial skill levels by more than those with higher baseline proficiency (Choi et al., 2024; Choi & Schwarcz, 2025). To assess whether similar patterns appear in our data, Figures 14 through 17 plot two outcomes, productivity and quality, averaged across the two tasks each participant completed under each experimental condition, against participants’ first-year GPAs. Figures 14 and 16 report the relationship between GPA and average task-level productivity, while Figures 15 and 17 present the corresponding relationship for average task-level quality scores, where each task-level score is computed as the mean of the five quality attribute scores. In each figure, outcomes are shown separately for participants with and without access to an AI tool.
If o1-preview disproportionately benefits lower-GPA participants, we would expect a wider gap between the two lines on the left side of each graph where baseline skill may be lower, with the o1-preview line lying noticeably above the no-AI line. Although Figure 14 shows some evidence of convergence as first-year GPAs increase, the pattern does not indicate pronounced heterogeneous effects, suggesting that the productivity gains from o1-preview access are relatively constant across participant skill levels. By contrast, Figure 15 suggests that the differential effect of o1-preview access by ability, as measured by first-year GPA, is more pronounced when we focus on task scores. Specifically, the near convergence in Figure 15 of the two lines as GPAs increase implies that access to o1-preview provides a greater boost in quality for potential users with lower baseline skill levels as compared to potential users with higher baseline skills (as measured by first-year GPA, which we acknowledge is limited in important respects). Even among the highest skill levels, however, we detect no evidence to indicate that access to o1-preview reduces the overall quality of work.

Figures 16 and 17 repeat this same analysis for Vincent AI. Interestingly, these figures reveal a pattern opposite to the one we observe for o1-preview. In terms of productivity, a differential effect across baseline skill level (measured by first-year GPA) is evident from the convergence of lines in Figure 16. By contrast, Vincent AI’s effect on overall scores appears relatively uniform across baseline skill levels, as shown in Figure 17. Taken together, these results suggest that Vincent AI’s heterogeneous productivity effects arise chiefly from differential time savings: lower-skill participants speed up more with access to Vincent AI, whereas quality scores remain relatively uniform across the skill distribution.
This pattern is somewhat at odds with the one for o1-preview, where heterogeneity arises primarily through differences in quality gains rather than differences in time savings.

Figure 14. Productivity and First-Year Law School GPA, o1-preview. Notes: This figure plots participant-level productivity against first-year law school GPA. Productivity is defined as total score divided by the fraction of the task’s allotted time that the participant spent on the task. For each condition, productivity is averaged across the two tasks completed under that condition. Each participant therefore contributes two observations, one from tasks completed without AI assistance and one from tasks completed with access to o1-preview. Fitted lines summarize the conditional relationship between GPA and productivity and are intended to be illustrative. Total N = 122 participants.

Another way to assess the relative impact of AI access across participants with different baseline skill levels is to measure baseline skill not by first-year GPA but by participants’ scores on tasks completed without AI assistance. Figures 18 and 19 illustrate this approach for o1-preview and Vincent AI, respectively. In these figures, the outcome on the y-axis is each participant’s average change in task score (standardized within task, based on No-AI scores) when using the AI tool relative to their own performance on tasks completed without AI. Because baseline skill in this setting is measured using participants’ own task scores rather than an external proxy such as GPA, these figures are more susceptible to mechanical regression-to-the-mean effects and should therefore be interpreted descriptively and with caution. This score-based approach reveals even more pronounced differences in how the two tools affect the quality of legal work product across the skill distribution.
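The construction of the two axes in Figures 18 and 19 can be sketched as follows. This is an illustrative reading of the figure notes rather than the authors' code: the task names and scores are invented, the helper names `standardize` and `average_z` are hypothetical, and the use of the sample standard deviation is an assumption, since the notes do not say which variant was applied.

```python
from statistics import mean, stdev

# Invented No AI scores by task, used as the standardization reference.
NO_AI_SCORES = {
    "task_a": [10, 12, 14],
    "task_b": [20, 25, 30],
    "task_c": [8, 10, 12],
    "task_d": [15, 20, 25],
}


def standardize(task, score):
    """Standardize a score with the No AI mean and SD for the same task."""
    reference = NO_AI_SCORES[task]
    return (score - mean(reference)) / stdev(reference)


def average_z(task_scores):
    """Average standardized score across one participant's tasks."""
    return mean(standardize(task, score) for task, score in task_scores.items())


# One hypothetical participant: two tasks without AI, two tasks with an AI tool.
no_ai_tasks = {"task_a": 14, "task_b": 25}
ai_tasks = {"task_c": 12, "task_d": 25}

baseline = average_z(no_ai_tasks)          # x-axis value
change = average_z(ai_tasks) - baseline    # y-axis value
print(baseline, change)                    # 0.5 0.5
```

Because the AI-assisted scores are standardized against the No AI distribution for the same task, a positive change means the participant ranked higher relative to unassisted peers when using the tool than when working unassisted.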
Importantly, although values below zero are consistent with the possibility that AI reduces performance for high-skill participants, we cannot interpret them in this way. Such negative differences can arise mechanically from both regression to the mean and the assignment of participants to different tasks with varying baseline difficulty and treatment intensity: specifically, if a participant’s AI-assisted tasks are ones with lower baseline scores or smaller AI effects, the resulting difference may be negative even when AI improves performance on those tasks. At the same time, unreported regression analyses that interact AI access with baseline performance measures yield qualitatively similar patterns, suggesting that the heterogeneity observed in Figures 18 and 19 is not driven solely by regression to the mean, even though task-level composition may continue to contribute to variation in individual-level differences.

Figure 15. Total Score and First-Year Law School GPA, o1-preview. Notes: This figure plots participant-level total score against first-year law school GPA. Total score is defined as the sum of the five quality attributes for a given task and is averaged across the two tasks completed under each condition. Each participant therefore contributes two observations, one from tasks completed without AI assistance and one from tasks completed with access to o1-preview. Fitted lines summarize the conditional relationship between GPA and total score and are intended to be illustrative. Total N = 122 participants.

Taken together, our findings reinforce a pattern documented in earlier GPT-4 studies (e.g., Choi & Schwarcz, 2025): AI tools appear to provide the largest gains for participants with lower baseline skill, while offering smaller (and in some cases slightly negative) estimated effects for those with the strongest baseline performance. These negative estimates are concentrated in settings where measurement and task-composition issues complicate interpretation and may not imply systematically worse performance. Accordingly, we find no reliable support for the idea that AI access meaningfully harms the quality of work for higher-skilled participants. Instead, the results suggest that access to cutting-edge AI systems primarily “raises the floor,” narrowing performance gaps rather than uniformly shifting the entire distribution upward. These heterogeneous effects underline that the value of access to AI assistance depends not only on the specific AI tool and the legal task, but also on the user, with important implications for how legal organizations and educators integrate AI into training and practice.
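A small simulation can illustrate why negative changes among high-baseline participants need not indicate that AI access lowers performance. The sketch assumes that each observed score equals a stable skill level plus independent noise and that the AI tool adds the same constant gain for everyone; the gain of 0.3, the unit variances, the sample size, and the function names are all hypothetical and are not calibrated to the study's data.

```python
import random


def simulate(n=5000, gain=0.3, seed=0):
    """Return (baseline score, change with AI) pairs under a constant gain."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        skill = rng.gauss(0.0, 1.0)
        baseline = skill + rng.gauss(0.0, 1.0)        # noisy No AI score
        with_ai = skill + gain + rng.gauss(0.0, 1.0)  # same gain for everyone
        pairs.append((baseline, with_ai - baseline))
    return pairs


def ols_slope(pairs):
    """Least-squares slope of the change on the baseline score."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


pairs = simulate()
slope = ols_slope(pairs)
mean_change = sum(change for _, change in pairs) / len(pairs)

# Top quartile by baseline score.
top = sorted(pairs)[len(pairs) * 3 // 4:]
share_negative = sum(1 for _, change in top if change < 0) / len(top)

print(slope < 0)             # True: changes fall as baseline scores rise
print(share_negative > 0.5)  # True: most top-quartile changes are negative
print(mean_change > 0)       # True: yet the average change is positive
```

With these variances the expected slope is -0.5 even though every simulated participant benefits equally, because a high baseline score partly reflects favorable noise that does not recur on the AI-assisted tasks. This is the mechanical pattern that motivates reading the score-based figures descriptively.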
4.6 Post-Experiment Survey Results
Our post-experiment survey results are generally consistent with our quantitative findings on quality, efficiency, and productivity, though our analysis surfaces a few discrepancies. Figure 20 displays the average responses to several survey questions, each of which we designed to elicit a distinct dimension of participants’ experiences with the two AI tools: participants’ intended future use of AI tools, their perceived improvement in proficiency with the AI tools over the course of the experiment, the extent to which the AI tools enhanced their overall satisfaction as they completed their tasks, their perceptions of how the tools affected the quality of their work, and their perceptions of how the tools affected their speed of completion. These data indicate that participants generally believed that both AI tools enhanced the quality of their work and increased their speed in completing their legal tasks. Interestingly, participants perceived o1-preview as more effective for boosting speed and Vincent AI as more helpful for enhancing quality. These subjective impressions diverge in important respects from the actual results of our experiment: in practice, the two tools produce similar gains in speed, while o1-preview delivers broader and more substantial improvements in quality across tasks and attributes. Even so, the survey responses in Figure 20 reveal that participants had a largely positive experience using both tools, with markedly strong approval expressed for Vincent AI.

Figure 16. Productivity and First-Year Law School GPA, Vincent AI. Notes: This figure plots participant-level productivity against first-year law school GPA. Productivity is defined as total score divided by the fraction of the task’s allotted time that the participant spent on the task. For each condition, productivity is averaged across the two tasks completed under that condition. Each participant therefore contributes two observations, one from tasks completed without AI assistance and one from tasks completed with access to Vincent AI. Fitted lines summarize the conditional relationship between GPA and productivity and are intended to be illustrative. Total N = 122 participants.

Figure 21 displays the average overall helpfulness ratings of the two AI tools, computed for each task using the ratings provided by participants who completed that task under the relevant AI condition. These results reveal that participants also have somewhat unreliable intuitions about where the tools were most and least helpful. In particular, their perceptions do not fully align with the actual performance data, which do not provide any reliable evidence that o1-preview or Vincent AI enhances speed or quality in the NDA drafting task as compared to the other five tasks. Yet participants registered only small to moderate reductions in perceived helpfulness for the NDA task, despite the quantitative data showing that both systems (especially o1-preview) delivered substantially stronger gains on most other tasks.
5. Limitations

Although our experiment provides valuable evidence about the impact of giving lawyers access to AI reasoning models and RAG-based AI systems as of 2024, our findings are subject to several important limitations involving both our study population and the design of the experiment.

To begin with, our participants were not fully licensed lawyers but upper-level law students. This raises a significant question about how well the results generalize to licensed practitioners engaging in real legal practice. We chose this structure because hiring a sufficiently large sample of practicing lawyers to complete a wide range of tasks would have been prohibitively expensive and hard to administer in a manner that ensured compliance with our experimental protocols. Moreover, we believe that upper-level students at the two highly selective law schools in our study serve as a reasonably good proxy for junior associates at typical law firms, especially because many third-year students have already worked as summer associates at large law firms. Even so, the extent to which this generalization holds true remains uncertain and costly to test. Also uncertain is the extent to which our results would generalize to upper-level law students at less selective law schools. A further related limitation is the possibility of a demand effect: although graders were blind to treatment status, participants knew when they were permitted to use an AI tool, which could have influenced how much effort they invested and thereby raised or lowered performance for reasons that are independent of access to the tool itself (a concern mitigated somewhat by the fact that most participants finished with time to spare).

Figure 17. Total Score and First-Year Law School GPA, Vincent AI. Notes: This figure plots participant-level total score against first-year law school GPA. Total score is defined as the sum of the five quality attributes for a given task and is averaged across the two tasks completed under each condition. Each participant therefore contributes two observations, one from tasks completed without AI assistance and one from tasks completed with access to Vincent AI. Fitted lines summarize the conditional relationship between GPA and total score and are intended to be illustrative. Total N = 122 participants.

We also cannot know whether our results extend to more experienced lawyers. On one hand, it is plausible that senior lawyers would exhibit smaller gains in efficiency and quality from access to AI tools because they tend to be more expert at their craft and more knowledgeable about the law. Such a pattern would mirror findings from other contexts where AI often enables less experienced workers to perform at the level of more experienced ones, while offering limited returns for those who are already highly skilled (Brynjolfsson et al., 2025). On the other hand, senior lawyers might benefit even more than young lawyers, particularly in avoiding the consequences of AI failures, since their greater expertise could better equip them to critically evaluate and refine AI-generated output (Schwarcz & Choi, 2023).

Another limitation concerns the scope of the legal tasks we asked participants to perform. Because we believe our subjects most closely resemble junior lawyers, we designed tasks that would be appropriate for individuals in that role. None of the six tasks, however, is one that senior lawyers or law firm partners would typically perform. Moreover, since law students generally receive more training in litigation than in transactional work, our tasks are predominantly litigation-oriented, with only our NDA task squarely transactional. Additionally, while we designed the tasks to test a broad variety of legal reasoning skills that might be required of young lawyers working in litigation-oriented settings, we cannot be sure how well we achieved this objective.
For these reasons, the somewhat specific features of our chosen tasks might constrain the generalizability of our findings. Moreover, the only law-specific tool we test, Vincent AI, was designed primarily to support litigation-related tasks, though at the time of testing, the version we used did include some features aimed at transactional work. This litigation-oriented design may help explain why we do not observe statistically significant effects on the transactional NDA task. We did not evaluate tools specifically built for that type of task. More generally, performance improvement, both actual and perceived, may be in part a function of how users experience the software, entirely separately from its raw ability to generate high-quality outputs. Differences in user experience may improve or reduce user performance even when the underlying AI models are identical.

Figure 18. Change in Standardized Performance Relative to No-AI Benchmark, o1-preview. Notes: This figure plots each participant’s average standardized performance on tasks completed without AI assistance (x-axis) against the participant’s average change in standardized performance when using o1-preview (y-axis). For each task, the score is standardized using the mean and standard deviation of scores among participants who completed that task without AI assistance. For each participant, the x-axis value is the average standardized score across the two tasks completed without AI assistance. The y-axis value equals the participant’s average standardized score across the two tasks completed with o1-preview minus the participant’s baseline average standardized score. Values on the x-axis therefore reflect relative standing when working without AI assistance. Fitted lines summarize the conditional relationship between baseline performance and AI-related changes and are intended to be illustrative. Total N = 122 participants.
Other potential limitations involve how AI use may change over time. For example, our study cannot provide a longitudinal perspective on how lawyers’ use of AI might change in the future as they become more familiar with the tools or have regular access to other AI systems and technology. In addition, we did not counterbalance the tasks by varying the order in which participants completed them. It is therefore possible that participants improved in their ability to use the AI tools over the course of the experiment or that they relied on the tools differently in earlier versus later tasks. Given our results (which do not show performance improving monotonically) and the fact that tasks differ meaningfully in content, we doubt these possible confounders significantly bias our results. Still, we cannot discount these concerns.
A final potential limitation concerns the grading process. We conducted all evaluations blindly, without knowledge of whether task submissions had been produced with AI assistance. Nonetheless, it is possible that we could intuit which tasks were more likely to have been supported by AI and unconsciously graded those tasks more favorably (or more harshly). We sought to mitigate this risk by strictly separating graders from those who managed the data, instructing graders not to speculate about AI involvement, and using prespecified rubrics to constrain discretion. Even so, these safeguards were surely not perfect, as illustrated by the inherent subjectivity in applying many elements of the rubrics to individual task submissions.
Figure 19. Change in Standardized Performance Relative to No-AI Benchmark, Vincent AI. Notes: This figure plots each participant’s average standardized performance on tasks completed without AI assistance (x-axis) against the participant’s average change in standardized performance when using Vincent AI (y-axis). For each task, the score is standardized using the mean and standard deviation of scores among participants who completed that task without AI assistance. For each participant, the x-axis value is the average standardized score across the two tasks completed without AI assistance. The y-axis value equals the participant’s average standardized score across the two tasks completed with Vincent AI minus the participant’s baseline average standardized score. Values on the x-axis therefore reflect relative standing when working without AI assistance. Fitted lines summarize the conditional relationship between baseline performance and AI-related changes and are intended to be illustrative. Total N = 122 participants.
6. Discussion
Our findings demonstrate that Vincent AI and o1-preview each independently improve the quality of certain types of legal work and increase the productivity of the people who produce it. Each AI system appears to do so through distinct and independent technological innovations, which can be and already are being combined with one another in updated legal technology tools. It stands to reason that the combined effects of these technologies on legal practice are likely to be greater than our results suggest. Consider the primary mechanism through which Vincent AI is likely to affect legal work beyond facilitating access to a foundation model: Retrieval Augmented Generation (RAG). Perhaps the most significant limitation of RAG-based AI systems is their inability to reliably identify and leverage the most relevant sources among the millions of potentially relevant cases, statutes, regulations, and secondary materials.
This challenge is particularly acute in legal analysis, where one of the key difficulties facing lawyers is determining which materials are most pertinent and how best to use them in constructing an argument. This difficulty may help explain why access to Vincent AI does not appear to improve Accuracy or Analysis across any of the six tasks, even though it yields fewer hallucinations than access to o1-preview and even fewer than working without any AI support. Legal accuracy and analytical quality depend not only on correctly summarizing legal sources but also on strategically selecting and persuasively leveraging the most compelling authorities to support an argument. Our results suggest that reasoning models like o1-preview have the potential to excel in precisely these areas relative to older AI models such as GPT-4.
Figure 20. Self-Reported Experiences with o1-preview and Vincent AI. Notes: This figure reports mean participant ratings on postexperiment survey questions assessing perceived quality impact, perceived speed impact, self-reported satisfaction, self-assessed improvement, intended future use, and perceived helpfulness of the two AI tools. Ratings are measured on a 1–5 scale. Perceived helpfulness scores are averaged across all six tasks. Error bars represent standard errors of the mean. Total N ranges from 113 to 114 participants, depending on the survey item.
Figure 21. Perceived Helpfulness by Task and AI Tool. Notes: This figure reports mean participant ratings of the perceived helpfulness of o1-preview and Vincent AI for each of the six assigned tasks. Ratings are measured on a 1–5 scale and are reported only for tasks on which participants used the corresponding AI tool. Error bars represent standard errors of the mean. Total N varies by task–AI condition pair, ranging from 35 to 41 participants.
Just as o1-preview and increasingly advanced reasoning models can enhance RAG-based tools by addressing their primary weakness, the reverse is also true. When combined with extensive legal databases, RAG technology can mitigate a key shortcoming of general-purpose foundation models in legal analysis: their lack of direct access to authoritative legal source materials. This limitation is evident in our results, which show a higher hallucination rate when participants used o1-preview than when they used Vincent AI or when they had no access to AI at all. Additionally, this limitation is underscored by the lack of statistically significant improvement in Accuracy across the six tasks—except for Analysis of Complaint, where we directly provided the key source material (the complaint) to all participants.
A further reason that our results are likely to understate the potential impact of AI on lawyering is more familiar: AI technology is continuing to improve at a blistering pace. Even the specific technology we test in this experiment is already out of date at the time of this writing. But this point has special salience in the context of our study, because the reasoning model we test (o1-preview) and find to improve the quality of human legal reasoning in ways that differ from any other previously tested model was the very first reasoning model to be made publicly available. The pace of innovation in AI, or any other field, is typically greatest when the first version of a new approach initially appears. Indeed, since OpenAI publicly released o1-preview—an event that immediately preceded the start of our experiment in October 2024—the company has released several new generations of reasoning models. Not surprisingly, the company’s more recent reasoning models, such as o3, substantially outperform the previous o1 model on numerous objective benchmarks (Wiggers, 2025).
As AI continues to improve, the legal community must thoughtfully consider how best to integrate these tools. Legal institutions will need to develop systematic, empirical methods for evaluating AI’s capabilities on a broad range of legal tasks (Schwarcz et al., 2025). Some law firms and commentators have attempted to address this gap by developing legal benchmarks to objectively evaluate AI tools (Fei et al., 2023; Grupen & Pereyra, 2024; Vals.AI, 2025). But these benchmarks are becoming increasingly inadequate for measuring AI’s legal capabilities. First, many are saturated. AI models have already achieved near-maximum—or even superhuman—performance, leaving little room for meaningful improvement. Second, because valuable lawyering tasks cannot be easily measured formulaically, the real-world relevance of AI performance on these tests is often unclear. Third, such benchmarking studies do not seek to evaluate how access to AI impacts the practice of human lawyers.
Our approach to studying legal AI—randomized controlled trials focused on realistic lawyering tasks—offers key evidence for better understanding the role AI will play in legal practice. Unlike benchmarks, RCTs allow researchers to evaluate AI’s impact on humans’ ability to do what lawyers do in the real world. Given the transformative potential of AI on the profession, we believe it is important and timely for clients, law firms, and law schools to start embedding such trials into their operations and adapt accordingly.
7. Conclusion
This article presents the first rigorous empirical evidence that access to Retrieval Augmented Generation (RAG) and reasoning models can significantly enhance the quality of legal work in realistic lawyering tasks, while preserving the efficiency gains observed with earlier generations of generative AI. Our findings demonstrate that access to early reasoning models improves not only the clarity, organization, and professionalism of legal work but also the depth and rigor of legal analysis itself. Additionally, we provide suggestive evidence that access to late 2024 versions of RAG-enabled legal AI tools may be able to reduce hallucinations in human legal work to levels comparable to those found in work completed without AI assistance. The distinct yet complementary strengths of these technologies indicate that their integration could yield even greater benefits, a development already taking shape in emerging legal tech. The rapid advancement of reasoning models also indicates that the improvements observed in this study may only be the beginning of AI’s transformative potential for legal practice.
Acknowledgements
For helpful comments and guidance, we thank Pablo Arredondo, Victor Bennett, Jonathan Choi, Christoph Engel, Miryam Gorelashvili, Dan Ho, Christina Lee, Eric Martinez, Bryan Mechell, Amy Monahan, Daniel Rock, Kyle Rozema, Alan Rozenshtein, James Snelson, Tim Sullivan and Peter Wills, the editor, and two anonymous reviewers. Tomás Aguirre and Jared Sloan provided excellent research assistance. The anonymized data and code for the experiment and analysis are available online and upon request.
ORCID iDs
Daniel Schwarcz https://orcid.org/0009-0002-5019-4096
David R. Cleveland https://orcid.org/0009-0005-6174-1165
J.J. Prescott https://orcid.org/0000-0001-5483-3516
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: For generous financial support of this project, we thank Fredrikson & Byron PA, Robins Kaplan LLC, University of Minnesota Law School, and University of Michigan Law School. Neither OpenAI nor VLex (the company that owns Vincent AI) provided financial support for this project, but they both did make their AI platforms freely available to study participants.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Notes
1. An OpenAI press release even claimed that the model not only passed the exam but also ranked among the top 10% of human test-takers. However, further analysis revealed that the top 10% figure was an overstatement—largely because it was based on a comparison with February exam takers, who historically performed below average (Martínez, 2024). But even after correcting for this oversight, GPT-4’s performance remains significantly above the passing threshold.
2. The concern over hallucinations gained significant traction among lawyers and judges in May 2023 when an otherwise routine legal dispute made global headlines. In what would be the first of many similar incidents, a New York lawyer submitted a court filing containing references to entirely fictitious cases. When questioned in court, the lawyer admitted to using ChatGPT to write his brief. He further explained that, after the tool initially provided the citations, he had explicitly asked whether they were real. ChatGPT affirmed they were. The judge publicly reprimanded the lawyer, sparking widespread media coverage and cementing the incident as a cautionary tale among legal professionals (Merken, 2023; Weiser, 2023).
3. For many legal tech companies, RAG is the primary mechanism by which they claim to deliver value beyond general-purpose models like ChatGPT. To be sure, most AI-enabled legal technology also includes automated or embedded prompting.
4. Critics rightly note that the study’s queries were designed in ways that would increase the likelihood of hallucinations and did not necessarily reflect how lawyers would use AI in practice. Some tools, for example, analyze uploaded documents and generate a series of suggested questions tailored to the document type. Others present users with a menu of capabilities, each triggering a set of pre-formulated prompts. In some cases, legal technology companies embed prompts within their interfaces in ways that users do not see but that enhance output. However, there is increasingly good reason to believe that these automated prompting tools provide limited value, as foundation models are continuously improving at generating high-quality responses without specialized prompting and are becoming increasingly adept at detecting context.
5. For example, OpenAI reports that its first reasoning model, o1, ranked in the 89th percentile on competitive programming questions, placed among the top 500 students in the USA Mathematical Olympiad qualifier, and exceeded PhD-level accuracy on a test covering physics, biology, and chemistry (OpenAI, 2024). More recent reasoning models, like o3 and DeepSeek’s R1, have achieved even better scores on various benchmarks.
6. Choi et al. (2024) shares a co-author with this study and uses a randomized controlled experiment similar to the one used in this Article.
7. Of course, the students who expressed interest in participating were likely more enthusiastic about AI, or more interested in learning about it, than the overall student population that received recruiting emails.
8. This approach ensured that each participant completed two tasks without AI assistance, two tasks with the assistance of o1-preview, and two tasks with the assistance of Vincent AI. This structure makes it especially important that we effectively randomized assignment to the three groups.
9. These training modules were developed and delivered by a co-author, a representative from Vincent AI, and a research librarian. Each module included a 20- to 30-minute video, with two modules also incorporating short exercises. The first module focused on the use of general-purpose AI tools for legal research, highlighting the risks of AI “hallucinations” and the dangers of over-reliance on AI at the expense of independent legal reasoning. Participants were encouraged to use AI as an aid to enhance their work rather than as a substitute for their own judgment. The second and third modules provided tailored instruction on Vincent AI, covering its various features and workflows and offering guidance on distinguishing between AI-generated text and content from primary sources.
10. We designed four of the six tasks (Tasks One, Two, Five and Six) to focus on research-oriented tasks for which retrieval-augmented generation using legal source materials was expected to be especially beneficial. However, we also strove to vary the complexity of the tasks, the extent to which they were litigation or transaction oriented, and the extent to which they required an objective or persuasive analysis.
11. To encourage participants to complete the assigned work efficiently and effectively, we instructed them as follows: “As with all assignments completed in connection with this experiment, you should approach the assignment as if you are a junior attorney who has been asked to produce work for a fee-sensitive client. While you can take up to the maximum time allotment to complete the task, you should stop working at the point where you would feel comfortable submitting your work product to a supervising attorney, given that your client would prefer to minimize the amount they pay for your work product. If you reach the end of the maximum time allocation and have not finished, you should simply turn in the work product you were able to produce within the allotted time. Do not spend any more than the maximum time on any assignment. As a reminder, your study compensation is not based on the actual time spent completing these assignments. Timekeeping is only used to gather data on the efficiency of both methods of completion.” It is possible that this instruction might have triggered a demand effect in which participants perceived that researchers cared most about time or efficiency, causing them to rush through their tasks. However, we have little reason to believe that this would cause a differential time effect across the treatment and control groups. Additionally, although we recognize that client fee sensitivity varies across firms and contexts, we believe that most firms and supervising attorneys place significant value on the efficiency with which junior lawyers complete assignments.
12. Each of the three grading co-authors graded two of the tasks that aligned most closely with their expertise. To ensure anonymity in the grading process, the three co-authors responsible for grading were different from the co-authors who coordinated the experiment and handled the data.
13. The Impact of Specialized AI tools for Lawyering Tasks, AEARCTR-0014957 (December 20, 2024), at https://www.socialscienceregistry.org/trials/14957.
14. We are not fully able to explain the four hallucinated sources that appeared in work produced without the assistance of AI. However, we suspect that the underlying sources do exist but that there were sufficiently significant errors in the citation details that we were not able to locate them easily with the information participants provided.
References
Alimardani, A. (2024). Generative artificial intelligence vs. law students: An empirical study on criminal law exam performance. Law, Innovation & Technology, 16(2), 777–819. https://doi.org/10.1080/17579961.2024.2392932
Arbel, Y., & Hoffman, D. A. (2024). Generative interpretation. NYU Law Review, 99(2), 451–520. https://doi.org/10.2139/ssrn.4526219
Armour, J., Parnham, R., & Sako, M. (2022). Augmented lawyering. University of Illinois Law Review, 2022(1), 71–138. https://doi.org/10.2139/ssrn.3688896
Beioley, K., & Criddle, C. (2023). Allen & Overy introduces AI chatbot to lawyers in search of efficiencies. Financial Times. https://www.ft.com/content/baf68476-5b7e-4078-9b3e-ddfce710a6e2
Bliss, J. (2024). Teaching law in the age of generative AI. Jurimetrics, 64(2), 111–161. https://doi.org/10.2139/ssrn.4682456
Brescia, R. H. (2024). What’s a lawyer for? Artificial intelligence and third-wave lawyering. Florida State University Law Review, 51(3), 543–596.
Brodeur, P. G., Buckley, T. A., Kanjee, Z., Goh, E., Bin Ling, E., Jain, P., Cabral, S., Abdulnour, R.-E., Haimovich, A., Freed, J. A., Olson, A., Morgan, D. J., Hom, J., Gallo, R., Horvitz, E., Chen, J., Manrai, A. K., & Rodman, A. (2024). Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv. https://arxiv.org/abs/2412.10849
Browning, J. G. (2023). Robot lawyers don’t have disciplinary hearings—real lawyers do: The ethical risks and responses in using generative artificial intelligence. Georgia State University Law Review, 40(2), 917–966. https://gsulawreview.org/article/120262-robot-lawyers-don-t-have-disciplinary-hearings-real-lawyers-do-the-ethical-risks-and-responses-in-using-generative-artificial-intelligence
Brynjolfsson, E., Li, D., & Raymond, L. (2025). Generative AI at work. Quarterly Journal of Economics, 140(2), 889–942. https://doi.org/10.1093/qje/qjae044
Chien, C. V., Kim, M., Raj, A., & Rathish, R. (2025). How generative AI can help address the access to justice gap through the courts. Loyola Law Review, 57(1), 850–915. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4683309
Choi, J. H., Hickman, K. E., Monahan, A. B., & Schwarcz, D. (2022). ChatGPT goes to law school. Journal of Legal Education, 71(1), 387–416. https://jle.aals.org/home/vol71/iss3/2/
Choi, J. H., Monahan, A., & Schwarcz, D. (2024). Lawyering in the age of artificial intelligence. Minnesota Law Review, 109(2), 147–205. https://minnesotalawreview.org/article/lawyering-in-the-age-of-artificial-intelligence/
Choi, J. H., & Schwarcz, D. (2025). AI assistance in legal analysis: An empirical study. Journal of Legal Education, 73(2), 384–420. [forthcoming].
Crouch, C. H., & Mazur, E. (2001). Peer instruction: Ten years of experience and results. American Journal of Physics, 69(9), 970–977. https://doi.org/10.1119/1.1374249
Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large legal fictions: Profiling legal hallucinations in large language models. Journal of Legal Analysis, 16(1), 64–93. https://doi.org/10.1093/jla/laae003 https://academic.oup.com/jla/article/16/1/64/7699227
DeRue, D. S., Nahrgang, J. D., Hollenbeck, J. R., & Workman, K. M. (2012). A quasi-experimental study of after-event reviews and leadership development. Journal of Applied Psychology, 97(5), 997–1015. https://doi.org/10.1037/a0028244
Fei, Z., Shen, X., Zhu, D., Zhou, F., Han, Z., Zhang, S., Chen, K., Shen, Z., & Ge, J. (2023). Lawbench: Benchmarking legal knowledge of large language models. arXiv preprint arXiv:2309.16289.
Garg, A., & Ma, M. (2025). Opportunities and challenges in legal AI. Stanford Law School. https://law.stanford.edu/publications/opportunities-and-challenges-in-legal-ai/
Grupen, N., & Pereyra, J. (2024). BigLaw bench – Retrieval. Harvey.ai. https://www.harvey.ai/blog/biglaw-bench-retrieval
Guha, N., Nyarko, J., Ho, D. E., Ré, C., Chilton, A., Narayana, A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D. N., Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023). LEGALBENCH: A collaboratively built benchmark for measuring legal reasoning in large language models. In Advances in Neural Information Processing Systems (36). NeurIPS.
Head, A., & Willis, S. (2024). Assessing law students in a GenAI world to create knowledgeable future lawyers. International Journal of Legal Professions, 31(1), 293–321. https://doi.org/10.1080/09695958.2024.2379785
Ju, J. (2024). Retrieval-augmented generation in legal tech. Thomson Reuters.
Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2024). GPT-4 passes the bar exam. Philosophical Transactions of the Royal Society, 382(2270), 1–17. https://doi.org/10.1098/rsta.2023.0254
Kim, M., & Chien, C. V. (2025). Generative AI and legal aid: Results from a field study and 100 use cases to bridge the access to justice gap. Loyola Law Review, 57(1), 903–988. https://digitalcommons.lmu.edu/cgi/viewcontent.cgi?article=3210&context=llr
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9479. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf
LexisNexis. (2024). How Lexis+ AI delivers hallucination-free linked legal citations. https://www.lexisnexis.com/community/insights/legal/b/product-features/posts/how-lexis-ai-delivers-hallucination-free-linked-legal-citations?srsltid=AfmBOoqTS9Qf0Uo9szlTpe7BYrcumIH7KpseJibAYIvn8vD2rY7awt_
LexisNexis. (2025). LexisNexis introduces Protégé personalized AI assistant with agentic AI, making it easier to power complex legal task completion. https://www.lexisnexis.com/community/pressroom/b/news/posts/lexisnexis-introduces-protege-personalized-ai-assistant-with-agentic-ai-making-it-easier-to-power-complex-legal-task-completion?srsltid=AfmBOorPsYbZvmWKsiC2VnaYaKZ6iXAcv06Z96xVIFCRY4xpmcKGB1v
Liu, J. Z., & Li, X. (2024). How do judges use large language models? Evidence from Shenzhen. Journal of Legal Analysis, 16(1), 235–262. https://doi.org/10.1093/jla/laae009
Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024). Hallucination-free? Assessing the reliability of leading AI legal research tools. arXiv. https://arxiv.org/abs/2405.20362
Martínez, E. (2024). Re-evaluating GPT-4’s bar exam performance. Artificial Intelligence and Law, 1(1), 1–24. https://doi.org/10.1007/s10506-024-09396-9
Merken, S. (2023). New York lawyers sanctioned for using fake ChatGPT cases in legal brief. Reuters. https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/
Nay, J. J., Karamardian, D., Lawsky, S. B., Tao, W., Bhat, M., Jain, R., Lee, A. T., Choi, J. H., & Kasai, J. (2023). Large language models as tax attorneys: A case study in legal capabilities emergence. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 382, 1–15. https://doi.org/10.1098/rsta.2022.0449
Nielsen, A., Skylaki, S., Norkute, M., & Stremitzer, A. (2024). Building a better lawyer: Experimental evidence that artificial intelligence can increase legal work efficiency. Journal of Empirical Legal Studies, 21(3), 979–1020. https://doi.org/10.1111/jels.12396
Noy, S., & Zhang, W. (2023). Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654), 187–192. https://doi.org/10.1126/science.adh2586
OpenAI. (2022). Introducing ChatGPT. https://openai.com/blog/chatgpt
OpenAI. (2024). OpenAI o1 system card. https://openai.com/index/openai-o1-system-card/
Pierce, N. A., & Goutos, S. L. (2024). Why lawyers must responsibly embrace generative AI. Berkeley Business Law Journal, 21(1), 469–516. https://doi.org/10.15779/Z389K45V33
Re, R. M. (2024). Artificial authorship and judicial opinions. George Washington Law Review, 92(6), 1558–1589.
Schwarcz, D., & Choi, J. H. (2023). AI tools for lawyers: A practical guide. Minnesota Law Review Headnotes, 108, 1–39. https://doi.org/10.2139/ssrn.4404017
Schwarcz, D., Das, D., Kang, D., & McDonnell, B. H. (2025). Thinking like a lawyer in the age of generative AI: Cognitive limits on AI adoption among lawyers. Journal of Institutional and Theoretical Economics, (forthcoming). Minnesota Legal Studies Research Paper No. 25-31. https://doi.org/10.2139/ssrn.5260645
Susskind, R., & Susskind, R. E. (2023). Tomorrow’s lawyers: An introduction to your future.
Thomson Reuters. (2025). Thomson Reuters Launches CoCounsel Legal: Transforming Legal Work with Agentic AI and Deep Research. https://www.thomsonreuters.com/en/press-releases/2025/august/thomson-reuters-launches-cocounsel-legal-transforming-legal-work-with-agentic-ai-and-deep-research
Vals.AI. (2025). Vals legal AI report. https://www.vals.ai/vlair
Veloski, J., Boex, J. R., Grasberger, M. J., Evans, A., & Wolfson, D. B. (2006). Systematic review of the literature on assessment, feedback and physicians’ clinical performance. Medical Teacher, 28(2), 117–128. https://doi.org/10.1080/01421590600622665
Weiser, B. (2023). Here’s what happens when your lawyer uses ChatGPT. New York Times. https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html
Wendel, W. B. (2019). The promise and limitations of artificial intelligence in the practice of law. Oklahoma Law Review, 72(1), 21–49. https://digitalcommons.law.ou.edu/cgi/viewcontent.cgi?params=/context/olr/article/1376/&path_info=02_wendel_article_blu7.pdf
Wiggers, K. (2025). OpenAI launches o3-mini, its latest ‘reasoning’ model. TechCrunch. https://techcrunch.com/2025/01/31/openai-launches-o3-mini-its-latest-reasoning-model/
Yamane, N. (2020). Artificial intelligence in the legal field and the indispensable human element legal ethics demands. Georgetown Journal of Legal Ethics, 33(3), 877–890. https://www.law.georgetown.edu/legal-ethics-journal/wp-content/uploads/sites/24/2020/09/GT-GJLE200038.pdf