Report · April 1, 2026

Artificial Intelligence Index Report 2026

Stanford HAI

Source of record

Stanford HAI

ai-index · benchmarks · policy · economy · responsible-ai

Abstract

The AI Index Report is an independent, comprehensive annual assessment of the state of artificial intelligence, produced by the Stanford Institute for Human-Centered AI (HAI). The 2026 edition tracks and benchmarks AI progress across research and development, technical performance, responsible AI, the economy, science, medicine, education, policy and governance, and public opinion — drawing tog…

Full text

About this report

Introduction

Contributors

Introduction: Loredana Fattorini, Yolanda Gil, Vanessa Parli, Ray Perrault, Sha Sajadieh

Chapter 1, Research and Development: Usman Anwar, Theo Burn, Emily Chen, Rachel Cook, Jean-Stanislas Denain, Meredith Ellison, Loredana Fattorini, Nicole Finn, Isabella Florez, Yolanda Gil, Tom Hurd, Nabeel Khan, James Landay, Shayne Longpre, Nestor Maslej, Magdalena Ortiz, Khalifa Oyebanji, Orestis Papakyriakopoulos, Vanessa Parli, Ray Perrault, Tom Piercey, Jennifer Rachford, Thomas Richadson, Vesna Sabljakovic-Fritz, Sha Sajadieh, Lapo Santarlasci, Sebastian Sardina, Andrew Shi, Yoav Shoham, Seth Polsley, Daniel Weld, Kevin Xu, Meg Young

Chapter 2, Technical Performance: Erik Brynjolfsson, Loredana Fattorini, Yolanda Gil, Tasha Kim, Sanmi Koyejo, Nestor Maslej, Juan Carlos Niebles, Sukrut Oak, Vanessa Parli, Ray Perrault, Sha Sajadieh, Yoav Shoham, Toby Walsh, Daniel Weld, Henry Zhang

Chapter 3, Responsible AI: Gabriel Morgan Asaftei, Rishi Bommasani, Virginia Dignum, Loredana Fattorini, Yolanda Gil, Nestor Maslej, Katherine Ottenbreit, Vanessa Parli, Juan Nicolas Pava, Ray Perrault, Brittany Presten, Cécile Prinsen, Roger Roberts, Sha Sajadieh, Lapo Santarlasci, Abby Sticha, Elham Tabassi, Yuanhao Zou

Chapter 4, Economy: Tara Balakrishnan, Bharat Chandar, Erik Brynjolfsson, Ruyu Chen, Michael Chui, Heather English, Loredana Fattorini, Yolanda Gil, Bryce Hall, Heather Hanselman, Rosie Hood, Akash Kaura, Elena Magrini, Nestor Maslej, James Manyika, Rebecca Milde, David Nguyen, Katherine Ottenbreit, Vanessa Parli, Ray Perrault, Courtney Prabhakar, Brittany Presten, Roger Roberts, Sha Sajadieh, Lapo Santarlasci, Alex Singla, Alex Sukharevsky, Casey Weston, Henry Zhang

Chapter 5, Science: Michael Clear, Steven Dillmann, Loredana Fattorini, Yolanda Gil, James Manyika, Vipin Kumar, Minjoon Kouh, Suhas Mahesh, Vanessa Parli, Ray Perrault, Sha Sajadieh

Chapter 6, Medicine: Russ Altman, Peter Brodeur, Akshay Chaudhari, Muhammad Ahmed Tajammul Chaudhury, Jonathan Chen, Matthew DeVerna, Abdoul Jalil Djiberou Mahamadou, Loredana Fattorini, Andrea Fisher, Ethan Goh, Yolanda Gil, Jeff Hancock, Tina Hernandez-Boussard, Malte Jensen, Rohan Koodli, Arman Koul, Sanmi Koyejo, Alejandro Lozano, Danielle Luz, David Magnus, Stephen P. Ma, Bethel Mieso, Fateme Nateghi Haredasht, Madelena Ng, Natalie Pageler, Ayush Pandit, Vanessa Parli, Ray Perrault, Sean Riordan, Ronald Robertson, Austin Schoeffler, Christoph Sadée, Sha Sajadieh, Min Woo Sun, Kotoha Togami, Sang Truong, Chong Wang, Dennis Wall, David Wu

Chapter 7, Education: Carla Brodley, Joshua Childs, Lisa Cruz Novohatski, Loredana Fattorini, Yolanda Gil, Rachel Goins, Laura Hinton, Sonia Koshy, James Landay, Kirsten Lundgren, Jacqueline McCune, Vanessa Parli, Ray Perrault, Sha Sajadieh, Bryan Twarek, Rebecca Zarch

Chapter 8, Policy and Governance: Virginia Dignum, Loredana Fattorini, Johannes Fritz, Yolanda Gil, Nestor Maslej, Vanessa Parli, Juan Nicolas Pava, Ray Perrault, Sha Sajadieh, Lapo Santarlasci, Kamran Sattary, Tyler Lenox Smith, Elham Tabassi, Russell Wald

Chapter 9, Public Opinion: Erik Brynjolfsson, Matt Carmichael, Zack Devlin-Foltz, Loredana Fattorini, Nadja Flechner, Yolanda Gil, Connacher Murphy, Vanessa Parli, Juan Nicolas Pava, Ray Perrault, Matt Reynolds, Sha Sajadieh, Russell Wald, Henry Zhang

The AI Index thanks the following organizations and individuals who provided data for inclusion in this year’s report:

Epoch AI: Jean-Stanislas Denain
GitHub: Kevin Xu
Lightcast: Elena Magrini, Rebecca Milde
LinkedIn: Rosie Hood, Akash Kaura, Casey Weston
Quid: Heather English
Zeki: Tom Hurd
McKinsey & Company: Heather Hanselman, Katherine Ottenbreit, Brittany Presten, Cécile Prinsen, Roger Roberts, Abby Sticha

The AI Index also thanks Jeanina Matias, Nancy King, Carolyn Lehman, Shana Lynch, Jonathan Mindes, and Johanna Friedman for their help in preparing this report; Christopher Ellis and Madeleine Wright for their help in maintaining the AI Index website; and Annie Benisch, Marc Gough, Caroline Meinhardt, Drew Spence, Casey Weston, and Daniel Zhang for their work in helping promote the report.

Top Takeaways

1. AI capability is not plateauing. It is accelerating and reaching more people than ever. Industry produced over 90% of notable frontier models in 2025, and several of those models now meet or exceed human baselines on PhD-level science questions, multimodal reasoning, and competition mathematics. On a key coding benchmark, SWE-bench Verified, performance rose from 60% to nearly 100% of the human baseline in a single year. Organizational adoption reached 88%, and 4 in 5 university students now use generative AI.

2. The U.S.-China AI model performance gap has effectively closed. U.S. and Chinese models have traded the lead multiple times since early 2025. In February 2025, DeepSeek-R1 briefly matched the top U.S. model, and as of March 2026 Anthropic’s top model leads by just 2.7%. The U.S. still produces more top-tier AI models and higher-impact patents, while China leads in publication volume, citations, patent output, and industrial robot installations. South Korea stands out for its innovation density, leading the world in AI patents per capita.

3. The United States hosts the most AI data centers, with the majority of their chips fabricated by one Taiwanese foundry. The United States hosts 5,427 data centers, more than 10 times as many as any other country, and it consumes more energy than any other country. A single company, TSMC, fabricates almost every leading AI chip, making the global AI hardware supply chain dependent on one foundry in Taiwan, though a TSMC expansion in the U.S. began operations in 2025.

4. AI models can win a gold medal at the International Mathematical Olympiad but cannot reliably tell time, an example of what researchers call the jagged frontier of AI. Gemini Deep Think earned a gold medal at the IMO, yet the top model reads analog clocks correctly just 50.1% of the time. AI agents leapt from 12% to roughly 66% task success on OSWorld, which tests agents on real computer tasks across operating systems, though they still fail roughly 1 in 3 attempts on structured benchmarks.

5. Robots still fail at most household tasks, even as they excel in controlled environments. Robots succeed in only 12% of household tasks, highlighting how far AI is from mastering the physical world. On RLBench, robotic manipulation in software-based simulations has reached 89.4% success, but the gap between predictable lab settings and unpredictable household environments is wide.

6. Responsible AI is not keeping pace with AI capability, with safety benchmarks lagging and incidents rising sharply. Almost all leading frontier AI model developers report results on capability benchmarks, but reporting on responsible AI benchmarks remains spotty. Documented AI incidents rose to 362, up from 233 in 2024. Adding to the challenge, recent research found that improving one responsible AI dimension, such as safety, can degrade another, such as accuracy.

7. The United States leads in AI investment, but its ability to attract global talent is declining. U.S. private AI investment reached $285.9 billion in 2025, more than 23 times the $12.4 billion invested in China, though private investment figures alone likely understate China’s total AI spending, given its government guidance funds. The U.S. also led in entrepreneurial activity with 1,953 newly funded AI companies in 2025, more than 10 times the next closest country. However, the number of AI researchers and developers moving to the U.S. has dropped 89% since 2017, with an 80% decline in the last year alone.

8. AI adoption is spreading at historic speed, and consumers are deriving substantial value from tools they often access for free. Generative AI reached 53% population adoption within three years, faster than the PC or the internet, though the pace varies by country and correlates strongly with GDP per capita. Some countries show higher-than-expected adoption, such as Singapore (61%) and the United Arab Emirates (54%), while the U.S. ranks 24th at 28.3%. The estimated value of generative AI tools to U.S. consumers reached $172 billion annually by early 2026, with the median value per user tripling between 2025 and 2026.

9. Productivity gains from AI are appearing in many of the same fields where entry-level employment is starting to decline. Studies show productivity gains of 14% to 26% in customer support and software development, with weaker or negative effects in tasks requiring more judgment. AI agent deployment remains in single digits across nearly all business functions. In software development, where AI’s measured productivity gains are clearest, U.S. developers ages 22 to 25 saw employment fall nearly 20% from 2024, even as the headcount for older developers continues to grow.

10. AI’s environmental footprint is expanding alongside its capabilities. Grok 4’s estimated training emissions reached 72,816 tons of CO2 equivalent. AI data center power capacity rose to 29.6 GW, comparable to New York state at peak demand, and annual GPT-4o inference water use alone may exceed the drinking water needs of 1.2 million people.

11. AI models for science can outperform human scientists, though bigger models do not always perform better. Frontier models outperform human chemists on average on ChemBench, yet they score below 20% on replication in astrophysics and 33% on Earth observation questions. A 111-million-parameter protein language model, MSAPairformer, beat previous leading methods on ProteinGym, and a 200-million-parameter genomics model, GPNStar, outperformed a model nearly 200 times larger. Most AI foundation models for science come from cross-sector collaborations, in contrast with the industry-dominated landscape of general-purpose AI.

12. AI is transforming clinical care, but rigorous evidence remains limited. AI tools that automatically generate clinical notes from patient visits saw substantial adoption in 2025. Across multiple hospital systems, physicians reported up to 83% less time spent writing notes and significant reductions in burnout. Beyond certain tools, however, the evidence base for clinical AI remains thin. A review of more than 500 clinical AI studies found that nearly half relied on exam-style questions rather than real patient data, with only 5% using real clinical data.

13. Formal education is lagging behind AI, but people are learning AI skills at every stage of life. Over 80% of U.S. high school and college students now use AI for school-related tasks, but only half of middle and high schools have AI policies in place, and just 6% of teachers say those policies are clear. Outside the classroom, AI engineering skills are accelerating fastest in the United Arab Emirates, Chile, and South Africa. The number of new AI PhDs in the U.S. and Canada increased 22% from 2022 to 2024, and the PhDs who make up that increase took jobs in academia, not in industry.

14. AI sovereignty is becoming a defining feature of national policy, but capabilities remain uneven, even as open-source development helps to redistribute who participates. National AI strategies are expanding, particularly among developing economies, and state-backed investments in AI supercomputing are rising in parallel, a sign of growing ambitions for domestic control over AI ecosystems. Yet model production remains concentrated in the U.S. and China. Open-source development is starting to redistribute participation, with contributions from the rest of the world now outpacing Europe and approaching the United States on GitHub, fueling more linguistically diverse models and benchmarks.

15. AI experts and the public have very different perspectives on the technology’s future, and global trust in institutions to manage AI is fragmented. When it comes to how people do their jobs, 73% of experts expect a positive impact, compared with just 23% of the public, a 50-point gap. Similar divides appear for AI’s impact on the economy and medical care. Globally, trust in governments to regulate AI varies. Among surveyed countries, the United States reported the lowest level of trust in its own government to regulate AI, at 31%. Globally, the EU is trusted more than the United States or China to regulate AI effectively.

Chapter 1: Research and Development

1.1 Notable AI Models

This section starts with the models themselves. Using Epoch AI’s curated dataset of notable models, it examines where frontier AI models come from, how they are deployed, and what it takes to build them. Epoch AI designates models as notable based on criteria such as state-of-the-art advancements, historical significance, or high citation rates. Because the dataset is manually curated, it is neither a census of all AI models nor a full map of model development activity, and trends should be read as patterns within this curated set. The sections that follow track the infrastructure and inputs behind these systems, including compute, data centers, energy costs, and open-source software, before looking at the broader research ecosystem through publications, patents, and talent. This chapter focuses on the research and development pipeline and its inputs. The next chapter, Technical Performance, reviews model capabilities and benchmark performance in detail.

By National Affiliation [2]

Notable model production remains concentrated within a small number of countries (Figures 1.1.1–1.1.3). Historically, the United States has produced the most notable models, followed by China. This pattern continued in 2025: the United States released 59 notable AI models, China 35, and South Korea 8. The number of new model releases declined year over year across all major geographic areas.

Figure 1.1.1 [3]: Number of notable AI models by select geographic areas, 2025. United States 59, China 35, South Korea 8, Canada 1, France 1, Hong Kong 1, Singapore 1, United Kingdom 1. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.2: Number of notable AI models by select geographic areas, 2003–25. Values for 2025: United States 59, China 35, Europe 2. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.3: Number of notable AI models by geographic area, 2003–25 (sum). Map with legend bins 1–10, 11–20, 21–60, 61–180, and 181–640. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

By Sector and Organization

The development of notable AI models continues to be concentrated in industry (Figures 1.1.4 and 1.1.5). Over the past decade, the share produced by industry has grown steadily and is now the largest by a wide margin (91.2%). In 2025, Epoch AI identified two notable AI models originating from academia, compared to 93 from industry.

Within industry, a small set of organizations accounts for a large share of releases (Figures 1.1.6 and 1.1.7). In 2025, the top contributors were OpenAI (20), Google (14), and Alibaba (11). Since 2014, Google has produced the largest number of notable models, followed by Meta and OpenAI. Within academia, Tsinghua University (26), Stanford University (26), and Carnegie Mellon University (25) have been the most prolific over the past decade.

Figure 1.1.4: Number of notable AI models by sector, 2003–25. Values for 2025: Industry 93, Industry-academia collaboration 5, Academia 2, Other 2. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.5: Notable AI models (% of total) by sector, 2003–25. Values for 2025: Industry 91.18%, Industry-academia collaboration 4.90%, Academia 1.96%, Other 1.96%. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.6 [4]: Number of notable AI models by organization, 2025 (organizations grouped as academia, industry, or nonprofit). OpenAI 20, Google 14, Alibaba 11, Anthropic 7, xAI 5, DeepSeek 4, LG AI Research 4, Meta 4, Tsinghua University 4, ByteDance 3, Moonshot 3, Nvidia 3, University of Illinois 3, Z.ai (Zhipu AI) 3, MiniMax 2, Shanghai AI Lab 2, Allen Institute for AI (Ai2) 1, Ant Group 1, Baidu 1, CUHK Shenzhen Research Institute 1. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

[1] New and historic models are continually added to the Epoch AI database, so the total year-by-year counts of models included in this year’s AI Index might not exactly match those published in last year’s report. The data is based on a snapshot taken on April 22, 2026.
[2] A machine learning model is associated with a specific country if at least one author of the paper introducing it is affiliated with an institution based in that country. In cases where a model’s authors come from several countries, double-counting can occur.
[3] This chart highlights model releases from a select group of geographic areas. More comprehensive data on model releases by country will be available in the upcoming AI Index Global Vibrancy Tool.
[4] In the organizational tally figures, research published by DeepMind is classified under Google.
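The sector percentages in Figure 1.1.5 follow directly from the 2025 counts in Figure 1.1.4. The short sketch below reproduces that arithmetic; the variable and function names are illustrative and are not part of the AI Index or Epoch AI tooling.

```python
# Counts of notable AI models released in 2025, by sector (Figure 1.1.4).
counts_2025 = {
    "Industry": 93,
    "Industry-academia collaboration": 5,
    "Academia": 2,
    "Other": 2,
}


def sector_shares(counts):
    """Return each sector's share of the total, in percent."""
    total = sum(counts.values())
    return {sector: 100 * n / total for sector, n in counts.items()}


shares = sector_shares(counts_2025)
for sector, pct in shares.items():
    print(f"{sector}: {pct:.2f}%")
```

With 102 models in total, industry's 93 models work out to the 91.18% reported in the figure.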
Figure 1.1.7: Number of notable AI models by organization, 2014–25 (sum; organizations grouped as academia, industry, or nonprofit). Google 193, Meta 87, OpenAI 60, Microsoft 42, Nvidia 30, Stanford University 26, Tsinghua University 26, Alibaba 25, Carnegie Mellon University 25, UC Berkeley 20, University of Washington 19, University of Oxford 17, MIT 15, Allen Institute for AI (Ai2) 13, Anthropic 13, Baidu 13, Salesforce 12, New York University 11, ByteDance 10, Chinese University of Hong Kong 10. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Model Release

Release patterns for notable AI models have continued to shift toward controlled access (Figure 1.1.8). In 2025, API access was the most common release type, with 47 of 102 models made available this way, and API-only releases have steadily increased since 2020. The second most common release type was “open weights (unrestricted),” meaning the models are fully available for use, modification, and redistribution. The remaining models were released in a mix of access types, including “hosted access (no API)” [5], “open weights (restricted use)” [6], and “open weights (noncommercial).” The “unknown” designation refers to models that have unclear or undisclosed access types, and “unreleased” models remain proprietary, accessible only to their developers or select partners.

Training code is becoming even less accessible than model code overall (Figure 1.1.9). In 2025, 81 of 102 notable models were released without their corresponding training code, compared to 4 that made their code “open source.” In 2020, models with open-source and unreleased training code were about the same in number, but by 2023 the majority were unreleased, and the gap has continued to widen.
This growing opacity limits the ability of external researchers to reproduce results, audit development, and validate safety claims. These challenges are central to the responsible AI and governance discussions in Chapter 3 and Chapter 8.

Figure 1.1.8 [7]: Number of notable AI models by access type, 2014–25. Stacked bars by access type: API access, hosted access (no API), open weights (noncommercial), open weights (restricted use), open weights (unrestricted), unreleased, and unknown. Totals shown above the bars: 2014: 33; 2015: 29; 2016: 49; 2017: 55; 2018: 37; 2019: 68; 2020: 50; 2021: 78; 2022: 91; 2023: 119; 2024: 98; 2025: 102. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

[5] Hosted access refers to using computing resources or services (such as software, hardware, or storage) provided remotely by a third party, rather than personally owning or managing them. Instead of running software or infrastructure locally, hosted access involves accessing these resources via the cloud or another remote service, typically over the internet. For example, using GPUs through platforms like AWS, Google Cloud, or Microsoft Azure, rather than running them on one’s own hardware, is considered hosted access.
[6] Open-weights models are released under varying levels of restriction: “noncommercial” limits use to research purposes, “restricted use” permits broader use with some conditions, and “unrestricted” places no limitations on use, modification, or redistribution.
[7] Not all models in the Epoch database are categorized by access type, so the totals in Figures 1.1.8 and 1.1.9 may not fully align with those reported elsewhere in the chapter.
Figure 1.1.9: Number of notable AI models by training code access type, 2014–25. Stacked bars by training code access type: open source, open (restricted use), open (noncommercial), unreleased, and unknown. Totals shown above the bars: 2014: 33; 2015: 29; 2016: 49; 2017: 55; 2018: 37; 2019: 68; 2020: 50; 2021: 78; 2022: 91; 2023: 119; 2024: 98; 2025: 102. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Parameter and Compute Trends

Parameter counts for notable AI models increased significantly from the early 2010s through 2022, driven by the growing complexity of model architectures, greater data availability, improvements in hardware, and the proven efficacy of larger models (Figures 1.1.10–1.1.12 [8]). Since then, growth in reported parameter counts has flattened, but the reported figures likely understate actual growth because key data points are missing. Several of the most resource-intensive models released in recent years, including those from OpenAI, Anthropic, and Google, have not publicly disclosed parameter counts, training dataset sizes, or training duration.

Similarly, training dataset sizes and training duration increased through the early 2020s, with leading models training on tens of trillions of tokens over periods exceeding 100 days. Again, due to limited disclosure from major frontier labs, the more recent data is incomplete.

[8] Several of the figures in this section use a log scale to reflect the exponential growth in AI model parameters and compute in recent years.
Figure 1.1.10: Number of parameters of notable AI models by sector, 2003–25. Parameter count (log scale) against publication date, by sector (academia, industry, industry-academia collaboration, other). Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.11: Training dataset size of notable AI models, 2010–25. Training dataset size in tokens (log scale) against publication date. Labeled models: AlexNet, Transformer, GPT-3 175B (davinci), PaLM (540B), GPT-4 (Mar 2023), Llama 3.1-405B, DeepSeek-V3, Qwen2.5-72B, GLM-4.6, Qwen3-Max, Olmo 3. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.12: Training time of notable AI models, 2010–25. Training time in days (log scale) against publication date. Labeled models: AlexNet, Transformer, BERT-Large, RoBERTa Large, GPT-3 175B (davinci), Megatron-Turing NLG 530B, PaLM (540B), GPT-4 (Mar 2023), Llama 3.1-405B, GLM-4.6, Grok 3, Olmo 3. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Since compute can be estimated even when not directly reported, training compute trends for notable models show clear growth over the same period (Figures 1.1.13 and 1.1.14). Compute requirements for notable models have risen by several orders of magnitude, with industry accounting for the highest values. Of the two countries with the highest model output, the United States continues to produce more computationally intensive models than China. However, the comparison in recent years cannot be fully substantiated because developers of U.S. models have not directly reported their training compute.
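To make the idea of estimating unreported compute concrete, the sketch below shows two back-of-the-envelope approaches: counting arithmetic operations from model size and training data, and multiplying hardware throughput by training time. It is only an illustration, not Epoch AI's actual methodology. The factor of 6 FLOP per parameter per token is a widely used rule of thumb for dense transformers, and every input value (parameter count, token count, chip count, peak throughput, utilization, duration) is hypothetical.

```python
PETA = 1e15  # FLOP per petaFLOP, the unit used on the figures' axes


def compute_from_architecture(n_params, n_tokens, flop_per_param_token=6.0):
    """Operation-counting estimate in FLOP.

    Assumes roughly 6 FLOP per parameter per training token (forward plus
    backward pass) for a dense transformer. This is a rule of thumb only.
    """
    return flop_per_param_token * n_params * n_tokens


def compute_from_hardware(n_chips, peak_flop_per_s, utilization, training_days):
    """Hardware-based estimate in FLOP: chips x peak throughput x utilization x time."""
    seconds = training_days * 24 * 3600
    return n_chips * peak_flop_per_s * utilization * seconds


# Hypothetical inputs chosen only for illustration; they describe no real model.
arch_estimate = compute_from_architecture(n_params=70e9, n_tokens=15e12)
hw_estimate = compute_from_hardware(
    n_chips=10_000, peak_flop_per_s=1e15, utilization=0.4, training_days=30
)

print(f"Architecture-based: {arch_estimate / PETA:.3g} petaFLOP")
print(f"Hardware-based:     {hw_estimate / PETA:.3g} petaFLOP")
```

When both kinds of input are available, comparing the two estimates is a simple consistency check; a large disagreement suggests that one of the assumed inputs, often the utilization rate, is off.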
Figure 1.1.13 [9]: Training compute of notable AI models by sector, 2003–25. Training compute in petaFLOP (log scale) against publication date, by sector (academia, industry, industry-academia collaboration, other). Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.1.14: Training compute of select notable AI models in the United States and China, 2018–25. Training compute in petaFLOP (log scale) against publication date. Labeled models: GPT-3 175B (davinci), GPT-4 (Mar 2023), Grok-2, Claude 3.5 Sonnet, Grok 4, DeepSeek-V3, Doubao-pro, ERNIE 3.0 Titan, Qwen2.5-72B, Qwen3-Max, GLM-4.6. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

[9] Estimating training compute is an important aspect of AI model analysis, yet it often requires indirect measurement. When direct reporting is unavailable, Epoch estimates compute by using hardware specifications and usage patterns or by counting arithmetic operations based on model architecture and training data. In cases where neither approach is feasible, benchmark performance can serve as a proxy to infer training compute by comparing models with known compute values. Full details of Epoch’s methodology can be found in the documentation section of their website.

Highlight: Will Models Run Out of Data?

Last year, the AI Index highlighted concerns around data bottlenecks and the sustainability of the scaling approach as it relates to training data.
Leading AI researchers have publicly claimed that the available pool of high-quality human text and web data for training large models has been exhausted, a state often referred to as “peak data.” This has continued to raise industry-wide concerns about the sustainability of scaling laws, which have historically depended on ever-larger datasets. One set of projections from Epoch AI suggests that, under certain assumptions, the estimated depletion date could fall between 2026 and 2032.

Synthetic Data in Pre-training

Limits on the availability of real-world data may be less consequential if synthetic data (data generated by AI systems) can be used to improve the performance of subsequent models. Previous editions of the AI Index found no definitive evidence that synthetic data improves model performance during the pre-training phase. The 2024 report referenced research suggesting that model performance can collapse when real training data is replaced with synthetic data. The 2025 report noted more recent findings that such collapse can be avoided if real data remains part of the training set, but that simply adding more data does not necessarily lead to performance gains.

The consensus remains largely unchanged. There is still no definitive evidence that synthetic data can fully offset real-data depletion in pre-training contexts. However, recent research suggests that synthetic data may offer value in more limited settings. Hybrid training approaches, which combine real and synthetic data, can significantly accelerate training, sometimes by a factor of 5 to 10 at scale, without surpassing real data in final model performance. Training on purely synthetic data has shown promise for smaller models or narrowly defined tasks, such as classification, code generation, or work in low-resource languages, but these gains have not generalized to large, general-purpose language models.
Where synthetic-only training has achieved performance comparable to real data, it has typically involved substantially smaller models that are not directly comparable to current state-of-the-art systems. For example, the SYNTHLLM family of models, trained entirely on synthetic data, achieves strong results yet still lags behind leading models on major benchmarks (Figure 1.1.15).

Figure 1.1.15. Source: Qin et al., 2025

[10] Pre-training refers to the initial phase of model development in which a model is trained (typically via self-supervised learning) on large, general-purpose datasets to acquire broad linguistic or multimodal representations. Post-training refers to subsequent refinement of the base model, through techniques such as supervised fine-tuning or reinforcement learning, to specialize behavior, improve alignment, or optimize performance on particular tasks.

Data-centric Methods

Discussions on data availability often overlook an important shift in recent AI research. Performance gains are increasingly driven by improving the quality of existing datasets, not by acquiring more data. Rather than scaling data indiscriminately, researchers are putting more effort into pruning, curating, and refining training inputs. Data-centric methods emphasize performance improvements through practices such as cleaning labels, deduplicating samples, and constructing higher-quality datasets. A growing body of research shows that training models on low-quality or polluted data can significantly degrade performance. Likewise, recent evidence illustrates that data pruning, which selects the most informative training inputs, often outperforms approaches that train on all available data indiscriminately.
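As a rough illustration of two of the practices named above, deduplicating samples and pruning to the most informative inputs, here is a minimal sketch. It is a toy under stated assumptions and not the pipeline of any study or model cited in this report: deduplication is exact-match only (after lowercasing and whitespace normalization), and `quality_score` is a hypothetical stand-in heuristic, whereas real pipelines typically rely on fuzzy matching and learned quality classifiers.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(samples):
    """Keep the first occurrence of each normalized text (exact-match dedup)."""
    seen, kept = set(), []
    for text in samples:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def quality_score(text):
    """Hypothetical heuristic: share of distinct words, zero for very short texts."""
    words = normalize(text).split()
    if len(words) < 4:
        return 0.0
    return len(set(words)) / len(words)


def prune(samples, keep_fraction):
    """Keep the top-scoring fraction of samples, preserving their original order."""
    n_keep = max(1, int(len(samples) * keep_fraction))
    ranked = sorted(
        range(len(samples)), key=lambda i: quality_score(samples[i]), reverse=True
    )
    return [samples[i] for i in sorted(ranked[:n_keep])]


corpus = [
    "The cat sat on the mat.",
    "the  cat sat on the mat.",  # duplicate up to case and spacing
    "buy now buy now buy now buy now",  # repetitive, low-information text
    "Gradient descent updates parameters along the negative gradient.",
    "ok",  # too short to be informative
]

unique = deduplicate(corpus)
curated = prune(unique, keep_fraction=0.5)
print(curated)
```

In this toy run the near-duplicate sentence is dropped first, and pruning then keeps the two samples with the most varied wording.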
Recent large-scale model development illustrates this paradigm in practice. Olmo 3 researchers prioritized large-scale deduplication, quality-aware data selection, and stage-specific training curricula rather than indiscriminate data scaling. These interventions, combined with iterative feedback loops to evaluate and refine candidate data mixes, allowed their models to achieve competitive performance despite training on substantially fewer tokens than other leading state-of-the-art models (Figure 1.1.16). Olmo 3.1’s Think 32B model, for example, contains roughly 32 billion parameters, roughly 90 times fewer than Grok 4’s 3 trillion, yet it achieves comparable performance on several benchmarks, including the American Invitational Mathematics Examination (AIME) 2025 [11].

Figure 1.1.16: Model performance on AIME 2025. OLMo 3 78.1%, Claude Opus 4.5 91.3%, Grok-4 92.7%, GPT-5 (high) 94.3%, Gemini 1.5 Pro 95.7%. Source: Artificial Analysis, 2026; Ai2, 2025 | Chart: 2026 AI Index report

Synthetic Data in Post-training

Recent research shows that synthetically generated data can be effective for improving model performance in post-training settings, including fine-tuning, alignment, instruction tuning, and reinforcement learning. A growing body of research released in 2025 supports this finding. Evidence suggests that synthetic post-training data is effective in few-shot generation settings, for improving long-context capabilities, for optimizing reinforcement learning workflows, and for strengthening reasoning more broadly.

Prevalence of Synthetic Content

Since the launch of ChatGPT in November 2022, there have been predictions that the internet would soon become overrun by AI-generated content. Recent research from Graphite suggests that beginning in January 2025, over 50% of newly published online content was generated by AI (Figure 1.1.17). Others have projected that the share in 2026 could be even higher.
11 The American Invitational Mathematics Examination (AIME) is an annual high school math competition widely used as a benchmark for AI mathematical reasoning, with each year’s exam providing a fresh test set. 27 H I G H L I G H T: 1.1 N OTA B L E A I M O D E L S | R E S E A R C H A N D D E V E LO P M E N T | A I I N D E X R E P O R T 2026 Given growing concerns about the suitability of synthetic data for training AI systems, this trend raises questions about the long-term reliability of current scaling trajectories. In response, many firms that depend on high-quality training data have increasingly turned to proprietary sources. In May 2025, the New York Times entered into a licensing agreement with Amazon to allow its content to be used for training purposes. By mid-2025, Meta was reportedly engaged in similar discussions with news organizations, while health and life sciences companies such as Bristol Myers Squibb have pursued comparable strategies. These developments suggest that firms training frontier AI systems are adjusting their data acquisition strategies as the volume of openly available training data continues to decline. Jan-2020 Jul-2020 Jan-2021 Jul-2021 Jan-2022 Jul-2022 Jan-2023 Jul-2023 Jan-2024 Jul-2024 Jan-2025 0% 20% 40% 60% 80% 100% % of generated content 48.28%, Human 51.72%, AI AI-generated content vs. human content Source: Graphite.io, 2025 | Chart: 2026 AI Index report Figure 1.1.17 28
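The data-centric practices named in this highlight (deduplicating samples and filtering for quality) can be illustrated with a minimal sketch. This is not the pipeline used by any model discussed in the report; the function names, the alphabetic-ratio heuristic, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(samples: list[str]) -> list[str]:
    """Exact deduplication on normalized text, keeping the first occurrence."""
    seen: set[str] = set()
    unique: list[str] = []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def quality_score(text: str) -> float:
    """Hypothetical heuristic: share of non-space characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def curate(samples: list[str], min_score: float = 0.8) -> list[str]:
    """Deduplicate first, then drop samples below the (assumed) quality bar."""
    return [s for s in deduplicate(samples) if quality_score(s) >= min_score]


corpus = [
    "The cat sat on the mat.",
    "the  cat sat on the mat.",   # near-verbatim duplicate
    "$$$ 1234 #### 5678",         # low-quality noise
    "Models learn from curated data.",
]
curated = curate(corpus)
print(curated)
```

Production pipelines typically rely on fuzzy deduplication and learned quality classifiers rather than exact hashes and character ratios; the sketch only shows the order of operations.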

1.2 Compute and Infrastructure

The development of AI models requires significant infrastructure investment. As training processes have expanded in scale and complexity, the underlying hardware has also improved in both speed and efficiency. In turn, these gains shape what kinds of models researchers and labs can realistically build. The growth in training compute discussed in the previous section would not have been possible without corresponding improvements in hardware capabilities. This section draws on data from Epoch AI to track hardware performance, adoption, and aggregate computing capacity over time.

Performance and Efficiency

Peak computational performance of machine learning hardware has increased exponentially across releases between 2008 and 2025 (Figure 1.2.1). The gains are especially visible at lower precision types, where precision refers to the number of bits used to represent numerical values. Lower precision formats such as FP16 and Tensor-FP16/BF16 now show the highest performance levels and have become standard in many training and inference settings.

Figure 1.2.1: Peak computational performance of ML hardware for different precisions, 2008–25. Performance (FLOP/s, log scale) by publication date for FP32, FP16, TF32 (19-bit), and Tensor-FP16/BF16. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Hardware for Notable Models

Hardware adoption patterns among notable AI models reflect the gains in performance and efficiency (Figure 1.2.2). Over the 2017–25 period shown, A100-class hardware accumulated the most notable models, reaching a cumulative 84 by 2025. The previous-generation V100 still accounts for a sizable share (69 models). Newer hardware, such as the H100, has seen rapid early adoption (28 models), while other categories, such as TPU v3 and TPU v4, show relatively stable curves.

Figure 1.2.2: Cumulative number of notable AI models trained by accelerator, 2017–25. A100: 84; V100: 69; Other: 54; TPU v3: 44; H100: 28; TPU v4: 28; P100: 6; H800: 4. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Global Computing Capacity

The supply of AI computing capacity from major chip designers has continued to increase (Figure 1.2.3). Total capacity has increased by an estimated 3.3x per year since 2022, reaching approximately 17.1 million H100-equivalents. Nvidia AI chips currently account for over 60% of total compute, with Google and Amazon supplying much of the remainder and Huawei holding a small but growing share. The growth in compute capacity tracks closely with investment patterns described in Chapter 4, where leading AI companies have increased their capital expenditure and infrastructure has become the fastest-growing focus area of private AI funding.

12 Since these estimates are inferred from revenue data, financial disclosures, and analyst reports, they reflect broader trends rather than exact counts. Data coverage also varies by manufacturer; Nvidia and Google data starts in 2022, while others start in 2024.
Figure 1.2.3: Global computing capacity from AI chips across major designers, 2022–25. Cumulative compute capacity (H100e) by quarter, 2022Q1 to 2025Q4, for Nvidia, Google, Amazon, AMD, and Huawei; the total reaches 17.07M H100e in 2025Q4. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Data Center Power Capacity

The expansion of computing capacity carries a direct energy cost. Total AI data center power capacity reached approximately 29.6 GW by Q4 2025, close to New York state’s peak demand of roughly 31 GW (Figure 1.2.4). AI chip power, measured by thermal design power (TDP), accounted for roughly 11.8 GW of the total, with the remainder attributed to cooling, networking, and other data center infrastructure. This estimate is based on the rated power capacity of leading AI chips sold over time, with a multiplier of approximately 2.5 applied to account for the additional power drawn by supporting infrastructure.

Figure 1.2.4: Global AI data center power capacity, 2022–25. Cumulative power capacity (GW) by quarter, 2022Q1 to 2025Q4. In 2025Q4: AI chip power (TDP) 11.82 GW; other AI data center power (cooling, networking, etc.) 17.74 GW; total 29.56 GW. Reference lines: peak usage in New York state ≈ 31 GW; Netherlands ≈ 19 GW; New Zealand ≈ 7 GW. Source: Epoch AI, 2026 | Chart: 2026 AI Index report
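The power estimate described above (chip TDP scaled by a multiplier of about 2.5) can be reproduced with a short calculation. The sketch below uses the Q4 2025 values quoted in this section; the function and constant names are illustrative, and the small gap between the computed total and the reported 29.56 GW is assumed to come from rounding in the source.

```python
# Approximate multiplier stated in the text for cooling, networking, and
# other supporting infrastructure.
INFRASTRUCTURE_MULTIPLIER = 2.5

# Reference value from the Figure 1.2.4 residue (approximate).
NEW_YORK_PEAK_GW = 31.0


def total_power_gw(chip_tdp_gw: float,
                   multiplier: float = INFRASTRUCTURE_MULTIPLIER) -> float:
    """Scale rated chip power up to an all-in data center estimate."""
    return chip_tdp_gw * multiplier


chip_tdp_q4_2025 = 11.82                      # GW of AI chip TDP, Q4 2025
total = total_power_gw(chip_tdp_q4_2025)      # all-in estimate
overhead = total - chip_tdp_q4_2025           # cooling, networking, etc.
share_of_ny_peak = total / NEW_YORK_PEAK_GW   # fraction of NY peak demand

print(f"total: {total:.2f} GW, overhead: {overhead:.2f} GW")
print(f"share of New York peak demand: {share_of_ny_peak:.1%}")
```

Because the multiplier is itself an approximation, the output is best read as an order-of-magnitude check rather than a measurement.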

1.3 Data Centers

The physical infrastructure underlying AI development extends beyond the models and compute described in the previous sections. Data centers are where compute is housed, and their capacity, geographic distribution, and underlying supply chains shape what AI systems can be built and where. This section draws on data from Cloudscene to track the global distribution of data centers and introduces an overview of the broader AI infrastructure ecosystem to provide context for the geographic and supply chain dynamics.

AI Infrastructure: Beyond GPUs

Modern AI data centers depend on a combination of compute, storage, communications, and specialized hardware that enables AI systems to run at large scale. GPUs and custom accelerators such as Tensor Processing Units (TPUs) are the most widely discussed, but they are only one layer of a broader infrastructure stack. All data processed by these chips is held in high-bandwidth memory (HBM), which supports moving large volumes of data in and out efficiently. The leading manufacturers of HBM are SK Hynix (South Korea), Samsung (South Korea), and Micron (United States). During training, GPUs must continuously share data with one another, which requires fast, high-throughput network connectivity achieved with fiber-optic cables running high-bandwidth networking architectures such as InfiniBand.

The supply chain behind this hardware adds another dimension. Companies like Nvidia and AMD design but do not manufacture chips. Instead, they provide designs to specialized semiconductor foundries, primarily the Taiwan Semiconductor Manufacturing Company (TSMC) and Samsung Foundry, which fabricate the chips at the nanometer scales modern AI hardware requires. The fabricated chips are then packaged and tested by assembly companies such as ASE Group (Taiwan) and Amkor Technology (United States). TSMC is a single point of dependency in the global AI supply chain, as it fabricates virtually every leading AI chip, including Nvidia’s Blackwell GPUs and AMD’s MI300X. There are high barriers to entry at every layer: overcoming them requires decades of accumulated expertise, specialized equipment, and significant capital investment.

The infrastructure ecosystem is relevant beyond AI capabilities, as it shapes education priorities and workforce development. Chapter 7 (Education) distinguishes between AI software-related and AI hardware-related degrees. That distinction is also relevant here, where different countries play different roles across the hardware supply chain.

Geographic Distribution

Most of the world’s data center infrastructure is located in a small number of countries (Figures 1.3.1 and 1.3.2). In 2025, the United States led by a wide margin, with 5,427 data centers, more than 10 times the count of any other country. Germany (529), the United Kingdom (523), and China (449) followed, while the majority of the remaining countries each had fewer than 300 facilities. The U.S. lead is clear, but the rankings of other countries should be read with the understanding that data center counts do not capture differences in facility size, computing capacity, or utilization.
Figure 1.3.1: Global distribution of data centers, 2025 (map; legend bins: 1–9, 10–19, 20–39, 40–59, 60–99, 100–149, 150–249, 250–349, 350–529, 530+, no available data). Source: Cloudscene, 2025 | Chart: 2026 AI Index report

Figure 1.3.2: Number of data centers by geographic area, 2025. United States: 5,427; Germany: 529; United Kingdom: 523; China: 449; Canada: 337; France: 322; Australia: 314; Netherlands: 298; Russia: 251; Japan: 222; Brazil: 197; Mexico: 173; Italy: 168; India: 153; Poland: 144. Source: Cloudscene, 2025 | Chart: 2026 AI Index report
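The "more than 10 times" comparison in the Geographic Distribution discussion can be checked directly against the Figure 1.3.2 counts. The sketch below is only an arithmetic illustration; the variable names are hypothetical, and the share it computes covers the 15 listed countries rather than the whole world.

```python
# Data center counts for the 15 countries shown in Figure 1.3.2 (2025).
data_centers = {
    "United States": 5427, "Germany": 529, "United Kingdom": 523,
    "China": 449, "Canada": 337, "France": 322, "Australia": 314,
    "Netherlands": 298, "Russia": 251, "Japan": 222, "Brazil": 197,
    "Mexico": 173, "Italy": 168, "India": 153, "Poland": 144,
}

# Rank countries from most to fewest facilities.
ranked = sorted(data_centers.items(), key=lambda item: item[1], reverse=True)
(leader, leader_count), (runner_up, runner_up_count) = ranked[0], ranked[1]

# Ratio between the leader and the second-ranked country.
lead_ratio = leader_count / runner_up_count

# Share of the leader among the listed countries only (not a global share).
listed_total = sum(data_centers.values())
leader_share_of_listed = leader_count / listed_total

print(f"{leader} has {lead_ratio:.1f}x the facilities of {runner_up}")
print(f"{leader} share among listed countries: {leader_share_of_listed:.1%}")
```

As the text cautions, facility counts say nothing about size or computing capacity, so these ratios describe counts only.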

1.4 Energy and Environmental Impact

As AI systems have scaled and become more widely deployed, their energy consumption and environmental footprint have become increasingly visible. The compute and infrastructure trends described in the preceding sections translate into substantial energy and water demands and carbon emissions. This section examines those costs across three areas of AI development: training, inference, and data center energy usage. The analysis draws from Epoch AI’s model-level data, recent academic benchmarking research (Jegham et al., 2025), the International Energy Agency’s reporting on data centers (IEA, 2025), and de Vries and Gao (2025).

Training

Leading machine learning hardware has grown more efficient since 2016, as measured in FLOP/s per watt (Figure 1.4.1). Leading chips deliver about 10 times more computation per watt than those available a decade ago, with Nvidia B200 and Google TPU v5e among the most efficient. However, models have scaled faster than efficiency has improved, so the total power required to train frontier systems has continued to increase.

Total power draw for training models has grown by several orders of magnitude since the early 2010s (Figure 1.4.2). The most compute-intensive models in the data set, such as Grok 3 and Llama 4 Behemoth, required upward of 100 million watts during training. Due to limited disclosure by their developers, power draw information is not available for many of the newest models.

Carbon emissions from training have increased even more sharply (Figure 1.4.3). Training AlexNet in 2012 produced an estimated 0.01 tons of CO2 equivalent, while training Grok 4 in 2025 produced about 72,816 tons. To put this into context, that is more than 1,000 times the lifetime carbon emissions of an average car (63 tons).
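The scale of the training-emissions comparison is easier to see as a ratio. The following sketch uses only the figures quoted in this section (72,816 tons for Grok 4, 0.01 tons for AlexNet, and 63 tons for an average car’s lifetime); the variable names are illustrative.

```python
# Estimated training emissions, in tons of CO2 equivalent (from the text).
GROK_4_TONS = 72_816
ALEXNET_TONS = 0.01

# Lifetime emissions of an average car, including fuel (from Figure 1.4.3).
CAR_LIFETIME_TONS = 63

# How many average car lifetimes one Grok 4 training run corresponds to.
car_lifetimes = GROK_4_TONS / CAR_LIFETIME_TONS

# Growth factor in estimated training emissions from AlexNet to Grok 4.
growth_factor = GROK_4_TONS / ALEXNET_TONS

print(f"Grok 4 training is roughly {car_lifetimes:,.0f} car lifetimes")
print(f"Emissions grew by a factor of about {growth_factor:,.0f} since AlexNet")
```

Both inputs are estimates, so the resulting ratios carry the same uncertainty as the underlying emissions figures.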
Larger models generally produce more emissions, but not always: emissions also depend on hardware efficiency, training duration, and the carbon intensity of the energy sources used. DeepSeek v3, for example, produced approximately 597 tons, which is much less than models of comparable size (Figure 1.4.4).

Figure 1.4.1: Energy efficiency of leading machine learning hardware, 2016–25. Energy efficiency (FLOP/s per watt, log scale) by publication date for leading and non-leading hardware. Labeled hardware: NVIDIA P100, Google TPU v2, Google TPU v3, Google TPU v4, NVIDIA Tesla V100 SXM2 32 GB, Google TPU v4i, NVIDIA A100, Google TPU v5e, NVIDIA B100, NVIDIA H100 SXM5 80GB, NVIDIA B200, NVIDIA GB200. Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.4.2: Total power draw required to train frontier models, 2011–25. Total power draw required (watts, log scale) by publication date. Labeled models: GPT-3 175B (davinci), PaLM (540B), GPT-4, Llama 3.1-405B, Grok 3, Llama 4 Behemoth (preview). Source: Epoch AI, 2026 | Chart: 2026 AI Index report

Figure 1.4.3: Estimated carbon emissions from training select AI models and real-life activities, 2012–25 (tons of CO₂ equivalent). AlexNet: 0.01; VGG16: 0.31; BERT-Large: 2.60; RoBERTa Large: 5.50; GPT-3: 588; Megatron-Turing NLG: 1,432; GLM-130B: 301; Falcon-180B: 2,973; GPT-4: 5,184; DeepSeek v3: 597; Llama 3.1 405B: 8,930; Grok 3: 59,200; Grok 4: 72,816. Real-life comparisons: air travel (1 passenger, NY↔SF): 0.99; human life (avg., 1 year): 5.51; American life (avg., 1 year): 18.08; car usage (avg., incl. fuel, 1 lifetime): 63. Source: AI Index, 2026; Strubell et al., 2019 | Chart: 2026 AI Index report

Figure 1.4.4: Estimated carbon emissions and number of parameters by select AI models. Carbon emissions (tons of CO₂ equivalent, log scale) against number of parameters (log scale) for the models listed in Figure 1.4.3. Source: AI Index, 2026 | Chart: 2026 AI Index report

Inference

Training costs have typically received the most attention, but inference represents a growing share of AI’s total energy footprint. Once a model is deployed at scale, the cumulative energy required to serve queries can exceed the one-time cost of training within months. Recent benchmarking by Jegham et al. (2025) provides per-model estimates of inference energy consumption and carbon emissions for medium-length prompts (defined as approximately 1,000 input tokens and 1,000 output tokens).

Among the top 15 models by energy consumption in 2025, DeepSeek V3.2 Exp and DeepSeek V3.2 consumed the most per query (about 23 Wh each), followed by GPT-5 (high) at 21.9 Wh (Figure 1.4.5). Models such as Claude 4 Opus and GPT-5 mini (medium) sit at the lower end, consuming between 5 and 6 Wh.

When ranked by carbon emissions, the models follow a similar pattern (Figure 1.4.6). DeepSeek V3.2 Exp and DeepSeek V3.2 produced the highest emissions per medium-length prompt, approximately 14 grams of CO2 equivalent each. For comparison, Claude 4 Opus and Mistral Medium 3 were the lowest at 1.6 and 1.5 grams, respectively. There is a wide spread even among models released in the same year, showing not only that inference efficiency varies but also that higher capability is not necessarily proportional to environmental cost.
Figure 1.4.5 (see footnote 13): Model energy consumption for medium-length prompts. Average energy consumption (Wh) per model, grouped by release year (2023, 2024, 2025). Source: Jegham et al., 2025 | Chart: 2026 AI Index report

Figure 1.4.6 (see footnote 14): Model carbon emissions for medium-length prompts. Average carbon emissions (gCO₂e) per model, grouped by release year (2023, 2024, 2025). Source: Jegham et al., 2025 | Chart: 2026 AI Index report

13 This figure shows the top 15 models by energy consumption for 2024 and 2025. The full set of models is available through the source dashboard.

14 This figure shows the top 15 models by energy consumption for 2024 and 2025. The full set of models is available through the source dashboard.

At the level of a single query, the numbers seem more modest. A short GPT-4o query consumes approximately 0.42 Wh, which is 40% more than a Google search at 0.3 Wh (Figure 1.4.7). A daily session of eight medium-length queries uses about 9.7 Wh, comparable to the energy needed to charge two smartphones (10 Wh). But across hundreds of millions of daily queries, the consumption scales into something much larger. The same scaling dynamic is true for water consumption (Figure 1.4.8). Annual estimates for GPT-4o inference range from about 1.3 to 1.6 million kiloliters; even the low end exceeds the annual drinking water needs of 1.2 million people.

Figure 1.4.7: Per-query and daily energy consumption: GPT-4o vs. common activities (Wh). 1 Google search: 0.30; GPT-4o (Mar 2025), short query: 0.42; GPT-4o (Mar 2025), medium query: 1.21; GPT-4o (Mar 2025), long query: 1.79; daily session (8 messages, short queries): 3.37; daily session (8 messages, medium queries): 9.71; charging 2 phones: 10.00. Source: AI Index, 2025 | Chart: 2026 AI Index report

Figure 1.4.8: Annual water consumption: GPT-4o vs. real-world baselines (kL). 1 person annual drinking water (U.S. avg.): 1; 1 Olympic swimming pool: 2,500; 500 Olympic swimming pools (aggregate): 1,250,000; 1.2 million people annual drinking water (aggregate): 1,314,000; GPT-4o inference (minimum estimate): 1,334,991; GPT-4o inference (maximum estimate): 1,579,680. Source: AI Index, 2025 | Chart: 2026 AI Index report

Data Center Usage

The power demands of models and queries add up to a much larger infrastructure footprint. The estimated power demand from AI accelerator modules reached approximately 5,200 MW cumulatively through 2024 (Figure 1.4.9). Nvidia accounted for the largest share, which is consistent with the company’s leading position in global AI chip capacity (as discussed in Section 1.2). When including the full systems supporting those accelerators (servers, cooling, networking), estimated demand reached approximately 9,400 MW (Figure 1.4.10). However, these figures from de Vries and Gao (2025) carry uncertainty from variation in utilization rates and facility-level efficiency, as reflected in the error bars on the charts.

To put that scale in perspective, the cumulative power demand of all-in AI systems is comparable to the national electricity consumption of Switzerland or Austria, and roughly half that of Bitcoin mining (Figure 1.4.11). Excluding crypto, global data centers accounted for the highest estimated power demand at around 47,000 MW, with AI hardware making up a growing share of that total.

Figure 1.4.9: Estimated power demand of AI accelerator modules. Megawatts (MW) for 2023, 2024, and cumulative, split by Nvidia, AMD, and other AI accelerator modules. Source: de Vries-Gao, 2025 | Chart: 2026 AI Index report

Figure 1.4.10: Estimated power demand of all-in AI systems. Megawatts (MW) for 2023, 2024, and cumulative, split by Nvidia, AMD, and other AI systems. Source: de Vries-Gao, 2025 | Chart: 2026 AI Index report

Figure 1.4.11: Estimated power demand: AI hardware vs. national consumption, bitcoin mining, and global data centers. Power demand (thousands of megawatts) for Ireland, AI accelerator modules, Switzerland, Austria, Finland, all-in AI systems, the Netherlands, Bitcoin mining, United Kingdom, France, and data centers (excl. crypto). Source: de Vries-Gao, 2025 | Chart: 2026 AI Index report

Cost, however, has been moving in the opposite direction. Since 2006, the cost of GPU computation has fallen by more than 99% (Figure 1.4.12). This decline has been key to enabling the scaling trends described throughout this chapter, making it economically feasible to train and deploy models at levels that would have been cost prohibitive even a decade ago.

At the regional level, data center electricity consumption has increased across all major regions, and it is projected to continue to rise through 2030 (Figure 1.4.13). The United States accounts for the largest share, followed by China, Europe, and the rest of Asia.

Figure 1.4.12: GPU computation cost, 2006–24. GPU cost index (2006 = 1), falling to 0.002 in 2024. Source: International Energy Agency (2025), Energy and AI, IEA, Paris | Chart: 2026 AI Index report

Figure 1.4.13 (see footnote 15): Data center electricity consumption by region, 2020–30. Electricity consumption (TWh) for the United States, China, Europe, Asia excluding China, and the rest of the world. Source: International Energy Agency (2025), Energy and AI, IEA, Paris | Chart: 2026 AI Index report

15 Data in this chart reflects IEA projections rather than observed consumption.
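Returning to the per-query figures from the Inference discussion in this section, the arithmetic behind the "modest per query, large in aggregate" point can be sketched as follows. The per-query values come from Figure 1.4.7; the daily query volume of 500 million is a purely hypothetical placeholder chosen for illustration, not a figure from the report.

```python
# Per-query energy estimates in watt-hours (from Figure 1.4.7).
GOOGLE_SEARCH_WH = 0.30
GPT4O_SHORT_WH = 0.42
GPT4O_MEDIUM_WH = 1.21
CHARGE_TWO_PHONES_WH = 10.0

# A short GPT-4o query relative to a Google search (about 1.4x, i.e. 40% more).
short_vs_search = GPT4O_SHORT_WH / GOOGLE_SEARCH_WH

# A daily session of eight medium-length queries; the report's chart shows
# 9.71 Wh, presumably computed from unrounded per-query values.
daily_session_wh = 8 * GPT4O_MEDIUM_WH


def daily_energy_mwh(queries_per_day: int, wh_per_query: float) -> float:
    """Aggregate per-query energy into megawatt-hours per day."""
    return queries_per_day * wh_per_query / 1_000_000


# HYPOTHETICAL volume, used only to show how the per-query cost scales.
ASSUMED_QUERIES_PER_DAY = 500_000_000
aggregate_mwh = daily_energy_mwh(ASSUMED_QUERIES_PER_DAY, GPT4O_SHORT_WH)

print(f"short query vs. search: {short_vs_search:.2f}x")
print(f"daily session: {daily_session_wh:.2f} Wh "
      f"(two phone charges: {CHARGE_TWO_PHONES_WH:.0f} Wh)")
print(f"assumed aggregate: {aggregate_mwh:.0f} MWh per day")
```

Changing the assumed query volume or the query length scales the aggregate linearly, which is why per-query efficiency differences between models matter at deployment scale.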

1.5 Open-Source AI Software

Overview

The preceding sections have focused on notable frontier models and the infrastructure required to build and maintain them. Open-source platforms like GitHub and Hugging Face offer a different view that captures the developer ecosystem experimenting with and building on AI models. Much of this activity is not reflected in academic publications or frontier model releases. The AI Index analyzes data from both platforms (see footnote 16) to better understand how open-source AI development is evolving over time.

16 Chinese researchers often use alternatives to GitHub, such as Gitee and GitCode, for code sharing, but the data from those sites is not included in this report. A full methodological description is available in the Appendix.

AI Development Activity

Projects

The scale of open-source development has grown steadily. The number of AI-related GitHub projects increased from 1,549 in 2011 to approximately 5.6 million in 2025, a 23.7% increase from 2024 (Figure 1.5.1). However, most repositories consist of personal or experimental work and receive minimal attention. When filtering for projects with at least 10 stars, a rough proxy for community engagement, the count drops to 206,880 in 2025 (Figure 1.5.2). The growth trajectory is similar for both measures.

Figure 1.5.1: Number of GitHub AI projects, 2011–25 (in millions), reaching 5.58 million in 2025. Source: GitHub, 2025 | Chart: 2026 AI Index report

Figure 1.5.2: Number of GitHub AI projects with at least 10 stars, 2011–25 (in thousands), reaching 206.88 thousand in 2025. Source: GitHub, 2025 | Chart: 2026 AI Index report

The geographic distribution of more visible open-source AI projects has shifted over time (Figure 1.5.3). Among projects with at least 10 stars, the United States accounted for the largest share in 2025 (31.7%), though that has declined steadily from nearly 80% in 2011 as developers in other regions have increased their presence on the platform. Europe and the rest of the world have grown in number of projects, while China’s share has leveled off since 2019. India remains a growing contributor, representing 5.2% of projects with at least 10 stars. Because GitHub data does not capture Chinese developers who use domestic platforms such as Gitee or GitCode, China’s share of global open-source AI activity is likely understated. The existing geographic attribution for China uses self-reported location rather than IP-based geolocation.
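Two quantities implied by the project counts above are worth making explicit: the share of AI projects that clear the 10-star filter, and the 2024 project count implied by the 23.7% year-over-year increase. The sketch below derives both from the reported 2025 figures (5.58 million projects from Figure 1.5.1 and 206,880 projects with at least 10 stars); the implied 2024 value is a back-calculation, not a number reported in the text.

```python
# Reported 2025 values (Figures 1.5.1 and 1.5.2).
PROJECTS_2025 = 5_580_000          # all AI-related GitHub projects
PROJECTS_10_STARS_2025 = 206_880   # projects with at least 10 stars
GROWTH_FROM_2024 = 0.237           # 23.7% year-over-year increase

# Share of projects that pass the 10-star engagement filter.
starred_share = PROJECTS_10_STARS_2025 / PROJECTS_2025

# Back out the 2024 count implied by the stated growth rate (derived value).
implied_projects_2024 = PROJECTS_2025 / (1 + GROWTH_FROM_2024)

print(f"share with >= 10 stars: {starred_share:.1%}")
print(f"implied 2024 project count: {implied_projects_2024:,.0f}")
```

The small share passing the filter is consistent with the observation that most repositories are personal or experimental.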
43 1.5 O P E N -S O U R C E A I S O F T WA R E | R E S E A R C H A N D D E V E LO P M E N T | A I I N D E X R E P O R T 2026 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0% 10% 20% 30% 40% 50% 60% 70% 80% AI projects (% of total) 5.18%, India 11.01%, China 24.47%, Europe 27.63%, Rest of the world 31.71%, United States GitHub AI projects with at least 10 stars (% of total) by geographic area, 2011–25 Source: GitHub, 2025 | Chart: 2026 AI Index report Figure 1.5.317 Beyond project counts, GitHub stars provide another signal of developer interest and engagement in opensource communities (Figure 1.5.4). The total number of stars for AI projects increased from 14 million in 2023 to 18.2 million in 2025.18 All major geographic regions saw year-over-year increases. However, the geographic pattern for stars differs from the project share data above. Despite its declining share of projects, the United States accumulated the highest number of stars at 30 million cumulatively (Figure 1.5.5). So while opensource activity becomes more geographically distributed, the projects with the most engagement remain disproportionately U.S.-based. 17 In previous AI Index reports, project locations for China and Hong Kong were determined using IP-based geolocation. Due to frequent VPN usage in these regions that resulted in systematic misclassification, self-reported profile locations are now used for China and Hong Kong, while IP-based geolocation continues to be applied for all other countries. 18 Figure 1.5.4 shows new stars given to GitHub projects within a year, not the total accumulated over time. 
Stars 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0 5 10 15 Number of GitHub stars (in millions) 18.18 Number of GitHub stars in AI projects, 2011–25 Source: GitHub, 2025 | Chart: 2026 AI Index report Figure 1.5.4 44 1.5 O P E N -S O U R C E A I S O F T WA R E | R E S E A R C H A N D D E V E LO P M E N T | A I I N D E X R E P O R T 2026 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0 5 10 15 20 25 30 Number of cumulative GitHub stars (in millions) 2.50, India 9.00, China 12.99, Europe 15.27, Rest of the world 30.02, United States Number of GitHub stars by geographic area, 2011–25 Source: GitHub, 2025 | Chart: 2026 AI Index report Figure 1.5.5 Model and Dataset Ecosystem To complement the GitHub view, this section uses metadata from Hugging Face, a widely used community platform and open repository for AI models and datasets. The analysis focuses on assets created or uploaded between 2022 and 2025 to understand recent activity and adoption trends (Figures 1.5.6 and 1.5.7). Upload activity has continued to rise over the last few years, with a marked increase after the second quarter of 2024. From 2023 to 2025, model uploads more than tripled, while dataset uploads grew fourfold. Download distribution also shifted after 2023. Geographically19 U.S.-developed models lost share to unaffiliated users. On the developer side, major private actors such as Google and Meta have shifted from being the principal authors to accounting for a relatively small share of downloads, while communities such as Sentence Transformers and the BERT community have grown (Figure 1.5.8). A large share of total model downloads fell into an “Others” category, reflecting the wider distribution of development activity even as the most downloaded models were tied to a small number of sources. 19 Data was obtained in collaboration with researchers from Longpre et al. (2025). 
Their dataset provides Hugging Face model download data that the authors describe as consistent and relatively complete. It was validated with the Hugging Face team, is reported to be less noisy than raw counts, and includes cleaned and imputed missing metadata. It is released as a weekly panel rather than an all-time-downloads cross-section. Coverage spans March 2020 to August 2025 and includes the top 200 most-downloaded Hugging Face models per week. These models account for 49.6% of total normalized, filtered downloads. This restriction focuses the analysis on models with higher observed download volume, reduces long-tail variation, and may support more stable estimates. 45 1.5 O P E N -S O U R C E A I S O F T WA R E | R E S E A R C H A N D D E V E LO P M E N T | A I I N D E X R E P O R T 2026 3 4 6 9 15 21 24 34 42 47 53 73 89 82 153 3 19 18 31 53 70 96 102 118 156 260 217 297 295 257 332 2022Q1 2022Q2 2022Q3 2022Q4 2023Q1 2023Q2 2023Q3 2023Q4 2024Q1 2024Q2 2024Q3 2024Q4 2025Q1 2025Q2 2025Q3 2025Q4 0 50 100 150 200 250 300 Models Datasets Number of models and datasets (in thousands) Number of models and datasets on Hugging Face, 2022–25 Source: Hugging Face, 2025 | Chart: 2026 AI Index report 73% 72% 76% 68% 67% 68% 70% 68% 61% 54% 48% 36% 6% 5% 10% 14% 20% 18% 17% 14% 12% 12% 11% 16% 16% 14% 15% 14% 20% 21% 14% 24% 51% 60% 59% 58% 54% 54% 50% 34% 35% 27% 6% 21% 19% 23% 15% 17% 13% 8% 6% 9% 24% 42% 6% 7% 6% 25% 15% 10% 7% 9% 11% 10% 11% 12% 5% 5% 8% 8% 8% 8% 6% 5% 5% 6% 6% 6% 7% 7% 2020Q2 2020Q3 2020Q4 2021Q1 2021Q2 2021Q3 2021Q4 2022Q1 2022Q2 2022Q3 2022Q4 2023Q1 2023Q2 2023Q3 2023Q4 2024Q1 2024Q2 2024Q3 2024Q4 2025Q1 2025Q2 2025Q3 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% United States Unaǅliated user International/Online China United Kingdom Germany Others Finland France India Japan Download share Global distribution of downloads among top Hugging Face models, Q2 2020–Q3 2025 Source: Longpre et al., 2025; Hugging Face, 2025 | Chart: 2026 AI Index report 
[20] (Figure 1.5.6) The data shown in this chart comes from the publicly accessible Hugging Face repository. For more details, refer to the Appendix.
[21] (Figure 1.5.7) Data source: Longpre et al. (2025). For more details, refer to the Appendix.

[Figure 1.5.8: Download share by developer among top Hugging Face models, Q2 2020–Q3 2025 (download share by quarter; legend as extracted: Google, OpenAI, timm, stable-diffusion, sd-concepts-library, Bingsu, Deepseek, Meta, Other Top 20 (inclusive), lllyasviel, lmstudio-community). Source: Longpre et al., 2025; Hugging Face, 2025 | Chart: 2026 AI Index report]

[22] (Figure 1.5.8) Data source: Longpre et al. (2025). For more details, refer to the Appendix.
[23] (Figure 1.5.9) Data source: Longpre et al. (2025). For more details, refer to the Appendix.

The most popular model types have shifted over the last three years. Text embedders, classifiers, and audio models, which together accounted for nearly 70% of downloads in 2022, fell to less than 6% in 2025 (Figure 1.5.9). Text generation, multimodal, and video generation models have grown in their place. Text generation led in 2025, accounting for more than 42% of total downloads. Image generation models also increased steadily, remaining the second most downloaded category.
Despite these shifts, downloads remain highly concentrated, with roughly 81% of 2025 downloads associated with the top three categories.

[Figure 1.5.9: Download share by modality among top Hugging Face models, Q3 2022–Q3 2025. Source: Longpre et al., 2025; Hugging Face, 2025 | Chart: 2026 AI Index report]
2022: Text embed/class 57.46%; Image generation 10.88%; Audio models 10.82%; Text generation 10.63%; Image embedding 7.52%; Undocumented 1.41%; Multimodal embedding 0.87%; Multimodal generation 0.30%; Tabular models 0.11%
2025: Text generation 42.46%; Image generation 25.61%; Multimodal generation 13.30%; Video generation 5.47%; Undocumented 4.56%; Audio models 2.88%; Text embed/class 2.71%; Multimodal embedding 1.72%; Image embedding 1.28%; Tabular models 0.00%
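The combined shares quoted for Figure 1.5.9 can be checked directly from the chart labels. The following is a minimal sketch: the dictionaries transcribe the 2022 and 2025 labels, while the names `SHARES_2022`, `SHARES_2025`, `LEGACY`, `combined_share`, and `top_n_share` are illustrative and do not come from the report.

```python
# Download shares (%) by modality, transcribed from the Figure 1.5.9 labels.
SHARES_2022 = {
    "Text embed/class": 57.46, "Image generation": 10.88, "Audio models": 10.82,
    "Text generation": 10.63, "Image embedding": 7.52, "Undocumented": 1.41,
    "Multimodal embedding": 0.87, "Multimodal generation": 0.30,
    "Tabular models": 0.11,
}
SHARES_2025 = {
    "Text generation": 42.46, "Image generation": 25.61,
    "Multimodal generation": 13.30, "Video generation": 5.47,
    "Undocumented": 4.56, "Audio models": 2.88, "Text embed/class": 2.71,
    "Multimodal embedding": 1.72, "Image embedding": 1.28, "Tabular models": 0.00,
}

# Categories the text groups together (text embedders/classifiers plus audio).
LEGACY = ("Text embed/class", "Audio models")


def combined_share(shares, categories):
    """Sum the shares of the named categories, rounded to two decimals."""
    return round(sum(shares[name] for name in categories), 2)


def top_n_share(shares, n=3):
    """Sum the n largest shares, rounded to two decimals."""
    return round(sum(sorted(shares.values(), reverse=True)[:n]), 2)


print(combined_share(SHARES_2022, LEGACY))  # 68.28
print(combined_share(SHARES_2025, LEGACY))  # 5.59
print(top_n_share(SHARES_2025))             # 81.37
```

The first two results correspond to the "nearly 70%" and "less than 6%" figures in the text, and the third gives the combined share of the three largest 2025 categories.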

1.6 Publications

The first half of this chapter tracked the models, infrastructure, and energy behind AI development. This section shifts to research output, specifically English-language AI publications and citations. Publications offer a longitudinal signal of AI research activity at scale, and the AI Index has tracked them consistently over time. While publication volume is not a measure of research quality, and not all research appears in indexed databases, this approach offers a consistent method for tracking the research frontier year over year. The analysis draws from OpenAlex, a bibliographic database [24] the AI Index has used since 2025, and considers both publication volume and downstream influence through citation patterns.

[24] OpenAlex is a fully open catalog of scholarly metadata, including scientific papers, authors, institutions, and more. The AI Index used OpenAlex as a bibliographic database and automatically classified AI-related research using the latest version of the CSO Classifier. The CSO Classifier (v3.3) is an automated text classification system designed to categorize research papers in computer science using a comprehensive ontology of 15,000 topics and 166,000 relationships, including emerging fields like GenAI, large language models (LLMs), and prompt engineering. It processes metadata (such as title and abstract) through three modules: a syntactic module for exact topic matches, a semantic module leveraging word embeddings to infer related topics, and a post-processing module that refines results by filtering outliers and adding relevant higher-level areas.

Total Number of AI Publications

Total AI publication output continues to rise. AI publications more than doubled between 2013 and 2024, increasing from roughly 102,000 to about 258,000 (Figure 1.6.1). Growth continued in 2024, though at a slower rate, with publications increasing 6.3% from 2023.
AI research now makes up a substantial portion of the broader computer science ecosystem, accounting for 40.9% of all computer science publications in OpenAlex.

[Figure 1.6.1: AI publications in CS worldwide, 2013–24 (number of AI publications in CS, in thousands, and AI publications in CS as % of total); labeled value: 257.89 thousand. Source: AI Index, 2025 | Chart: 2026 AI Index report]

By Venue

In 2024, journals accounted for the largest share of AI publications (47%), followed by conferences (23.5%) (Figure 1.6.2). Since 2013, both journal and conference publications have increased in absolute numbers, though their relative shares have shifted. The proportion of AI publications appearing in conferences has steadily declined from 36.6% in 2013 to its current level. The most recent year’s results, however, may also reflect a lag in venue assignment, as papers often appear first in repositories [25] like arXiv before being formally published in a journal or conference.

[25] In this context, ‘repository’ refers to preprint platforms such as arXiv, where researchers post papers prior to or independent of formal peer-reviewed publication in a journal or conference.

[Figure 1.6.2: Number of AI publications in CS by venue type, 2013–24 (in thousands); labeled values: Journal 124.00, Repository 62.52, Conference 62.03, Book 13.04, Other 1.54, Dissertation 0.76. Source: AI Index, 2026 | Chart: 2026 AI Index report]

Conference Attendance

Publication venue patterns capture where AI research is formally published, while conference attendance offers a complementary view of research community engagement.
Across the 16 major conferences tracked by the AI Index (AAAI, AAMAS, CVPR, EMNLP, FAccT, ICAPS, ICCV, ICLR, ICML, ICRA, IJCAI, IROS, KR, NeurIPS, UAI, and IUI), total attendance increased in 2024 from the previous year (Figure 1.6.3). The largest conferences, including NeurIPS, CVPR, and ICML, continued to draw the highest attendance, while smaller ones such as ICAPS, KR, and UAI maintained stable participation levels (Figures 1.6.4 and 1.6.5). This data should be interpreted with caution, as many conferences have recently switched to virtual or hybrid formats. Conference organizers report that exact attendance is difficult to measure at virtual conferences, which allow for higher attendance by researchers from around the world. The AI Index reports total attendance figures, encompassing virtual, hybrid, and in-person participation.

[Figure 1.6.3: Attendance at select AI conferences, 2010–25 (number of attendees, in thousands); labeled value: 101.12. Source: AI Index, 2025 | Chart: 2026 AI Index report]

[Figure 1.6.4: Attendance at larger conferences, 2010–25 (number of attendees, in thousands); labeled values: NeurIPS 26.38, ICLR 11.04, CVPR 9.38, ICML 8.00, IROS 8.00, ICCV 7.73, ICRA 7.01, ACL 6.33, AAAI 6.29, EMNLP 6.24, IJCAI 2.37. Source: AI Index, 2025 | Chart: 2026 AI Index report]

[26] (Figure 1.6.4) The significant spike in ICML attendance in 2021 was likely due to the conference being held virtually that year.
[Figure 1.6.5: Attendance at smaller conferences, 2010–25 (number of attendees, in thousands); labeled values: FAccT 0.63, AAMAS 0.60, UAI 0.39, IUI 0.30, ICAPS 0.27, KR 0.18. Source: AI Index, 2025 | Chart: 2026 AI Index report]

By National Affiliation [28]

In 2024, China accounted for 17.8% of AI publications, compared to 11.1% from Europe and 7.6% from India (Figure 1.6.6).[29] Chinese AI publications also accounted for 20.6% of all AI citations in 2024, followed closely by Europe at 19.5% and the United States at 12.6% (Figure 1.6.7). The United States saw a decline of 3 percentage points in publication share, though its citation share remained relatively unchanged (12.6% in 2024 vs. 13.03% in 2023). The “unknown” share in publication data rose to 39.3% in 2024, a spike that likely reflects changes in metadata coverage. The geographic distribution of publications and citations adds context to the notable model trends discussed earlier in the chapter, where a relatively small number of countries account for a disproportionate share of activity.

[27] (Figure 1.6.5) IUI 2021 and 2022 were held exclusively virtually.
[28] Regions in this chapter are classified according to the World Bank analytical grouping. The AI Index determines an author’s country affiliation using the “countries” field from the authorship data. This field lists all the countries with which an author is affiliated, as retrieved from OpenAlex based on institutional affiliations. These affiliations can be explicitly stated in the paper or inferred from the author’s most recent publications. When counting publications by country, the AI Index assigns one count to each country linked to the publication.
For example, if a paper has three authors, two affiliated with institutions in the United States and one in China, the publication is counted once for the United States and once for China.

[29] A publication may have an “unknown” country affiliation when the author’s institutional affiliation is missing or incomplete. This issue arises due to various factors, including unstructured or omitted institution names, platform functional deficiencies, group authorship practices, unstandardized affiliation labeling, document type inconsistencies, or the author’s limited publication record. The problem as it relates to OpenAlex is addressed in this paper; however, the issue of missing institutions pertains to other bibliographic databases as well.

[30] (Figure 1.6.6) For the sake of brevity, the AI Index visualized results for a select group of countries. However, complete results for all countries will be available on the AI Index’s Global Vibrancy Tool by the end of 2026. For immediate access to country-specific research and development data, please contact the AI Index team.
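The counting rule described in footnote 28 (one count per country linked to a publication, regardless of how many authors share that country) can be sketched as follows. This is an illustration only: the nested-list input layout and the name `count_by_country` are assumptions made for the example, not the AI Index pipeline or the OpenAlex schema.

```python
from collections import Counter


def count_by_country(publications):
    """Count each publication once per distinct country among its authors.

    `publications` is a list of papers; each paper is a list of authors;
    each author is a list of country names (hypothetical layout).
    """
    counts = Counter()
    for authors in publications:
        # Flatten all authors' countries, then de-duplicate within the paper.
        distinct = {country for countries in authors for country in countries}
        counts.update(distinct)
    return counts


# The footnote's example: three authors, two in the United States, one in China.
paper = [["United States"], ["United States"], ["China"]]
counts = count_by_country([paper])
print(counts["United States"], counts["China"])  # 1 1
print(sum(counts.values()))                      # 2
```

Because a single paper can add one count to several countries, country totals can sum to more than the number of publications.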
[Figure 1.6.6: AI publications in CS (% of total) by select geographic areas, 2013–24; labeled values: Unknown 39.25%, China 17.76%, Rest of the world 17.10%, Europe 11.05%, India 7.55%, United States 7.29%. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[Figure 1.6.7: AI publications in CS (% of total) by region, 2013–24; labeled values: Unknown 37.33%, East Asia and Pacific 26.42%, Europe and Central Asia 13.52%, South Asia 8.25%, North America 8.18%, Middle East and North Africa 4.27%, Latin America and the Caribbean 1.21%, Sub-Saharan Africa 0.83%. Source: AI Index, 2026 | Chart: 2026 AI Index report]

By Sector

Academia produced the majority of AI publications in 2024 (68.1%), followed by government institutions (12.4%), industry (11.5%), and nonprofit organizations (4.6%) (Figure 1.6.8). The sector breakdown varies by region (Figure 1.6.9). In the United States, a higher share of AI publications came from industry (24.6%) compared to China (18%), where government institutions were more significant contributors (25.1%). Europe had the highest percentage of AI publications originating from academia (55.3%).

[Figure 1.6.8: AI publications in CS (% of total) by sector, 2013–24; labeled values: Academia 68.13%, Government 12.44%, Industry 11.47%, Nonprofit 4.64%, Other 3.33%. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[31] For Figure 1.6.8 and Figure 1.6.9, publications with unknown affiliations were excluded.
[Figure 1.6.9: AI publications in CS (% of total) by sector and geographic area, 2024 (sectors: Academia, Industry, Nonprofit, Government; areas: United States, Europe, China); bar labels as extracted: 50.85%, 24.46%, 13.11%, 11.58%; 55.30%, 17.27%, 17.06%, 10.38%; 54.72%, 25.10%, 17.96%, 2.22%. Source: AI Index, 2026 | Chart: 2026 AI Index report]

By Topic

AI research in 2024 remained concentrated in a small set of core topics, though the breadth of areas continued to expand. Similar to the previous year, the most prevalent research topic was machine learning (37%), followed by computer vision (22.4%), pattern recognition (11.2%), and natural language processing (10%) (Figure 1.6.10). Publications on generative AI continued to show sharp growth, extending the trend from previous years. Because the AI Index topic classifier can assign multiple topic labels to a single publication, topic totals should be read as overlapping categories rather than mutually exclusive ones.

[Figure 1.6.10: Number of AI publications by select top topics, 2013–24 (in thousands); labeled values: Machine learning 195.89, Computer vision 118.49, Pattern recognition 59.05, Natural language processing 52.99, Generative AI 34.21, Knowledge based systems 20.59, Evolutionary computation 17.46, Multi-agent systems 13.97, Logic and reasoning 11.50, Robotics 5.01. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[32] (Figure 1.6.10) The AI Index categorized papers using its own topic classifier. It is possible for a single publication to be assigned multiple topic labels.

Top 100 Publications

The AI Index identified the 100 most-cited AI publications for each year from 2021 to 2024 using citation data from OpenAlex. Due to citation lag, this set can shift as citations accumulate over time. The publication volume data above captures the scale of research activity, while the top 100 offers a more selective view of which work is gaining the most recognition and influence.

[33] The full methodological guide can be accessed in the Appendix, along with the list of the top 100 articles.
[34] A publication can have multiple authors from different countries or organizations. If a paper includes authors from multiple countries, each country is credited once. As a result, some of the totals in this section exceed 100.

By National Affiliation

The geographic distribution of the top 100 has shifted over time (Figure 1.6.11). The United States still ranks highest in top-cited publications each year, though its count has gradually declined from 64 in 2021 to 46 in 2024. China’s count increased to 41 in 2024, from 34 in 2023, and Australia increased to 14 highly cited publications, up from 2 in 2023 and 6 in 2021.

[Figure 1.6.11: Number of highly cited publications in top 100 by select geographic areas, 2021–24 (areas: United States, China, Hong Kong, Australia, United Kingdom, Germany, Switzerland, Singapore, Canada, India). Source: AI Index, 2026 | Chart: 2026 AI Index report]

By Sector and Organization

The sector composition of the top 100 remained consistent, with academia producing the most top-cited publications year over year (Figure 1.6.12). Industry contributions declined sharply from 17 in 2021 and 19 in 2022 to six in 2024, even as industry’s share of notable model releases has continued to grow (Section 1.1).
The organization distribution varies by year, though output remains concentrated among a small set of institutions (Figure 1.6.13). In 2024, Stanford University and Google led with seven publications each, and the Chinese Academy of Sciences and Microsoft followed closely, with five each.

[35] (Figure 1.6.12) The “other” category includes sectors and intersector collaborations other than industry, academia, and industry-academia collaborations (e.g., industry and government, academia and nonprofit). Some institutions lack data for 2021 because they did not have papers included in the top 100 that year. Since papers can have multiple authors from different institutions, the total institutional tags in Figure 1.6.13 may exceed 100. Also, because two of the papers had authors with an unknown sectoral affiliation in 2022, the total sum of publications in Figure 1.6.12 is 98.
[36] (Figure 1.6.13) Universities are abbreviated as follows: CUHK = The Chinese University of Hong Kong; HKU = The University of Hong Kong; HUST = Huazhong University of Science and Technology; MIT = Massachusetts Institute of Technology; NTU Singapore = Nanyang Technological University, Singapore.
[Figure 1.6.13: Number of highly cited publications in top 100 by organization, 2021–24 (organizations: Google, Stanford University, CUHK, Chinese Academy of Sciences, HKU, Microsoft, Peking University, Tencent, Amazon, HUST, Hong Kong University of Science and Technology, MIT, NTU Singapore, Tsinghua University, Baidu). Source: AI Index, 2026 | Chart: 2026 AI Index report]

1.7 Patents

While publications track research outputs, patents offer insight into applied innovation and commercial development. This section examines trends in global AI patents over time. Patents can provide another lens for tracking innovation across organizations and geographic areas, particularly in applied AI contexts. Similar to publications data, there are notable delays before AI patent data becomes available, with 2024 being the most recent year accessible. The analysis draws from patent-level bibliographic records in PATSTAT Global, a comprehensive database provided by the European Patent Office (EPO).[37]

[37] More details on the methodology behind this section’s patent analysis can be found in the Appendix.
[38] (Figure 1.7.1) Patent standards and laws vary across countries and regions, so these charts should be interpreted with caution. More detailed country-level patent information will be released in a subsequent edition of the AI Index’s Global Vibrancy Tool.

Global Trends

Globally, the number of granted AI patents has grown roughly 34-fold, from 3,866 in 2010 to 131,121 in 2024 (Figure 1.7.1). Between 2023 and 2024, patent grants rose by 8.2%. China accounts for the majority, at 74.2% of the global total (Figures 1.7.2 and 1.7.3). The United States is the next major contributor at 12.1% (about 15,920 patents), followed by Europe (3%) and India (0.4%). Over the past decade, the United States’ share has declined steadily from a peak of 42.8% in 2015, while China’s share has risen from under 20% to its current level. Patents and publications reflect different stages in the R&D pipeline, so China’s lead in both, while not directly correlated, is consistent with the country’s growing research presence described earlier. Other regional leaders emerge when patent activity is normalized by population size (Figure 1.7.4).
In 2024, South Korea had the highest number of granted AI patents per 100,000 inhabitants (14.3), followed by Luxembourg (12.3) and China (7.0).

[Figure 1.7.1: Number of AI patents granted worldwide, 2010–24 (in thousands); labeled value: 131.12. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[Figure 1.7.2: Granted AI patents (% of world total) by select geographic areas, 2010–24; labeled values: China 74.24%, United States 12.06%, Rest of the world 10.35%, Europe 2.95%, India 0.40%. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[Figure 1.7.3: Number of AI patents granted by select geographic areas, 2010–24 (in thousands); labeled values: China 97.99, United States 15.92, Rest of the world 13.66, Europe 3.89, India 0.53. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[Figure 1.7.4: Granted AI patents per 100,000 inhabitants by country, 2024: South Korea 14.31, Luxembourg 12.25, China 6.95, United States 4.68, Japan 4.30, Singapore 1.31, Germany 1.30, Sweden 0.70, Finland 0.67, France 0.62, United Kingdom 0.60, Australia 0.45, Greece 0.35, Denmark 0.32, Switzerland 0.21. Source: AI Index, 2026 | Chart: 2026 AI Index report]

Forward Citations Flow

When newly filed patents reference earlier ones, those references are called forward citations. These are often used as a proxy for influence, since they indicate how often an invention informs later work.
By this measure, the United States accounts for over half of all AI patent forward citations, a signal of downstream influence that contrasts with its 12.1% share of patent volume (Figure 1.7.5). China ranks second despite producing the largest volume of patents by a wide margin. The relationship between forward citations and technological impact is not straightforward and has been called into question (Higham et al., 2021). There is also a strong home bias across all countries, with most citations occurring domestically, a well-documented pattern in patent citation geography (Jaffe et al., 1993; Cotterlaz et al., 2025; Verluise et al., 2025). That said, the cross-border flows are not symmetric. Chinese patents are cited frequently in U.S. filings, while U.S. patents appear far less often in Chinese ones.

[Figure 1.7.5: Global distribution of forward citations to AI patents by geographic area, 2010–24 (flows from citing country to cited country); labeled shares: United States 51.91%, China 29.81%, Japan 6.86%, South Korea 4.79%, Europe 4.17%, Rest of the world 2.46%. Source: AI Index, 2026 | Chart: 2026 AI Index report]

[39] (Figure 1.7.5) Each data point in the figure reflects forward citations to AI
patents, grouped at the patent family level to represent unique inventions rather than individual filings. Values are expressed as shares of all AI patent forward citations for patents granted between 2010 and 2024.

Speed of Knowledge Diffusion

Patent citation lag (the time between a patent’s publication and its first forward citation) can be used to measure how quickly knowledge diffuses within a discipline. Most AI patents receive their first citation within two to three years, reflecting relatively fast diffusion. The speed varies by country (Figure 1.7.6). U.S. patents tend to be cited sooner and more consistently over time, with only 19% remaining uncited compared to 32% to 44% in other geographic areas. Japan’s patents show early but narrower influence. Patents from China and South Korea experience slower initial citation, but after about six years citation activity stabilizes across all regions. The patterns are consistent with the forward citation data above, but differences in citation norms and home bias may also play a role.

[Figure 1.7.6: Speed of AI patent knowledge diffusion by geographic area (x-axis: years since publication, 0–10; y-axis: survival probability, i.e., no citation yet); labeled values: United States 0.19, Japan 0.32, Europe 0.35, South Korea 0.42, China 0.44, Rest of the world 0.44. Source: AI Index, 2026 | Chart: 2026 AI Index report]

Technological Proximity

Technological proximity [41] measures whether countries are converging on similar types of AI innovation or pursuing distinct paths. Using a method proposed by Bar et al. (2012), the analysis [42] compares how closely each country’s AI patent portfolio aligns with the two largest reference points, the United States and China (Figure 1.7.7). Overlap is scored on a scale from 0 (no similarity) to 1 (identical).
Most countries cluster in the upper right, meaning their AI patents cover similar technological areas to both the U.S. and China, with a stronger lean toward the U.S. portfolio. India and Australia, for example, have patent portfolios that show close to 80% overlap with both. Denmark is the least similar to either reference point, showing only a 45% overlap with China and a 52% overlap with the United States. This is because Denmark’s AI patents are concentrated in energy and wind-related technology categories (patent codes Y02E, F03D, F05B) rather than the core computing and data-processing categories (G06F, G06N, G06K) that dominate both the U.S. and Chinese portfolios. While most countries’ AI innovation portfolios are structured similarly, national industrial strengths tend to influence where AI is applied.

[40] (Figure 1.7.6) This figure plots the probability of not being cited, so curves with sharper drops indicate shorter lags.
[41] Also known as the min-complement proximity measure.
[42] Technological classes are identified by International Patent Classification (IPC) and Cooperative Patent Classification (CPC) codes.
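The overlap score can be illustrated with a short sketch. It assumes the min-complement proximity is computed as the sum, over technology classes, of the smaller of the two portfolios' class shares, which is how the measure is commonly defined; the report's Appendix may differ in details such as class granularity or patent-family handling. The patent counts below are hypothetical and chosen only to contrast a computing-heavy portfolio with a wind-energy-heavy one; the function names are likewise illustrative.

```python
def class_shares(counts):
    """Convert patent counts per technology class into portfolio shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("portfolio has no patents")
    return {code: n / total for code, n in counts.items()}


def proximity(counts_a, counts_b):
    """Sum over classes of the smaller share: 1 = identical mix, 0 = no overlap.

    Assumed form of the min-complement proximity; not taken from the report.
    """
    a, b = class_shares(counts_a), class_shares(counts_b)
    return sum(min(a.get(code, 0.0), b.get(code, 0.0)) for code in set(a) | set(b))


# Hypothetical patent counts per class (illustrative only, not report data).
computing_heavy = {"G06F": 50, "G06N": 30, "G06K": 20}
wind_heavy = {"Y02E": 40, "F03D": 30, "F05B": 10, "G06N": 20}

print(round(proximity(computing_heavy, computing_heavy), 2))  # 1.0
print(round(proximity(computing_heavy, wind_heavy), 2))       # 0.2
```

In this toy case the two portfolios share only the G06N class, so the score equals the smaller of the two G06N shares. A portfolio concentrated in classes that the reference country rarely patents in scores low under this definition.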
[Figure 1.7.7: AI patent portfolios’ technological proximity to the United States and China, 2010–24 (axes: proximity to the United States and proximity to China, each from 0.40 to 1.00; points: Australia, Austria, Brazil, Canada, China, Denmark, European Patent Office, Finland, France, Germany, Greece, India, Israel, Italy, Japan, Luxembourg, Malaysia, Netherlands, Poland, Russia, Singapore, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, United Kingdom, United States, WIPO, Rest of the world). Source: AI Index, 2026 | Chart: 2026 AI Index report]

HIGHLIGHT: AI Patent Examples

1 Patent CN111431996A: Resource configuration method and device, equipment and medium, 2022, China
A machine-learning prediction model determines how to allocate computing resources across multiple services in a cluster. The system learns from historical and real-time signals (such as traffic volumes and CPU, memory, and network usage) to infer the right resource configuration. This enables automated, dynamic scaling decisions without relying on manual rules.

2 Patent US11436777B1: Machine learning-based hazard visualization system, 2022, United States
The system trains machine learning models to forecast hazard attributes (time, path, severity) for specific locations and identify infrastructure in geospatial imagery. It combines model outputs to annotate maps, showing where hazards intersect with critical assets. The system also supports causal inference, for example, identifying infrastructure repeatedly affected by hazards. These capabilities rely on learned prediction and image-recognition models rather than deterministic mapping logic.
3 Patent US2023239456A1: Display system with ML-based stereoscopic view synthesis over a wide field of view, 2025, United States
This head-mounted display uses machine-learning techniques, including depth estimation and reconstruction, to create perspective-correct stereoscopic images from external cameras. Neural models handle real-time vision challenges like disocclusion, artifact reduction, and sharpening by inferring scene geometry and filling gaps where camera viewpoints fail to align with the user’s eyes. ML is a core component of the VR/AR passthrough rendering pipeline.

The publications and patents discussed above reflect research and development outputs. Using Zeki data, the AI Index examined the geographic distribution and mobility patterns of the authors and inventors behind this work over time. This section covers a narrower slice of AI talent activity than the broader labor market indicators discussed in Chapter 4 (Economy). Zeki identifies talent outside of China based on observable AI outputs such as research, data depositories, and new models. The dataset covers 2010 to 2025 across a group of countries in North America, Europe, Asia, Latin America, and the Middle East.[43]

[43] For more details, refer to the Appendix.

1.8 AI Authors and Inventors

Geographic Distribution

In 2025, the largest number of identified AI authors and inventors came from the United States (220,520), followed by India (50,460) and Germany (48,520) (Figure 1.8.1). The United Kingdom (34,370), Canada (31,450), and France (18,820) formed a second tier, with Australia, the Netherlands, Italy, Brazil, Switzerland, and others making up the broader distribution of contributors.

Looking at the data on a per capita basis surfaces countries with relatively high levels of AI activity that are not visible in total volume, as with the per capita patent data in Section 1.7. Switzerland led with 110.5 AI authors and inventors per 100,000 inhabitants, followed closely by Singapore (109.5) (Figure 1.8.2). Countries with smaller populations, such as Finland (77.6), the Netherlands (77.6), and Denmark (65.3), rank above larger nations including Germany (58.1) and the United Kingdom (49.6).
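The per capita figures follow from dividing a headcount by population and scaling to 100,000 inhabitants. As a rough consistency check, the headcounts and rates quoted above for Germany and the United Kingdom can be combined to recover the population each pair implies. This is a sketch under stated assumptions: the function names are illustrative, and the population figures actually used by Zeki or the AI Index are not given in this section.

```python
def per_100k(count, population):
    """Rate per 100,000 inhabitants."""
    return count / population * 100_000


def implied_population(count, rate_per_100k):
    """Population implied by a headcount and its per-100,000 rate."""
    return count / rate_per_100k * 100_000


# (headcount, rate per 100,000) pairs quoted in the text for 2025.
reported = {"Germany": (48_520, 58.1), "United Kingdom": (34_370, 49.6)}

for country, (count, rate) in reported.items():
    population = implied_population(count, rate)
    print(f"{country}: about {population / 1e6:.1f} million inhabitants implied")
# Germany: about 83.5 million inhabitants implied
# United Kingdom: about 69.3 million inhabitants implied
```

Both implied populations are of a plausible size for these countries, which suggests the headcounts and per capita rates are mutually consistent.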
Figure 1.8.1: Number of top AI authors and inventors by country, 2025 (in thousands). Source: Zeki Data, 2025 | Chart: 2026 AI Index report
United States 220.52; India 50.46; Germany 48.52; United Kingdom 34.37; Canada 31.45; France 18.82; Australia 14.54; Netherlands 13.96; Italy 13.23; Brazil 11.10; Switzerland 9.98; Spain 9.17; Sweden 8.52; Singapore 6.61; Japan 6.28; South Korea 5.96; Israel 5.19; Finland 4.38; Denmark 3.90; Saudi Arabia 3.43; United Arab Emirates 3.09

Figure 1.8.2: Top AI authors and inventors per 100,000 inhabitants by country, 2025. Source: Zeki Data, 2025 | Chart: 2026 AI Index report
Switzerland 110.45; Singapore 109.51; Sweden 80.63; Finland 77.61; Netherlands 77.61; Canada 76.16; Denmark 65.25; United States 64.84; Germany 58.10; Australia 53.43; Israel 52.01; United Kingdom 49.64; United Arab Emirates 28.38; France 27.46; Italy 22.43

By Education Level
The educational profile of top AI authors and inventors varies by country, though in most countries PhD holders and those with master’s degrees together account for the majority in 2025 (Figure 1.8.3). The United Kingdom (51.1%) and Australia (50.5%) have the highest share of PhD holders, followed by Switzerland (43.6%), South Korea (42.5%), and the United States (42%). India and Brazil show a more varied distribution, with comparatively lower shares of PhD holders and a wider spread across other degree levels.
Figure 1.8.3: Percentage of top AI authors and inventors by education level and country, 2010–25. Y-axis: % of AI authors and inventors by education level. Source: Zeki Data, 2025 | Chart: 2026 AI Index report
Values labeled on each country panel:
Australia: PhD 50.46%, MA/MS 25.70%, BA/BS 14.81%, Other 8.68%, Diploma 0.30%, HS 0.05%
Brazil: Other 41.85%, MA/MS 31.60%, PhD 18.16%, BA/BS 8.16%, HS 0.18%, Diploma 0.05%
Canada: MA/MS 39.90%, PhD 36.56%, Other 12.52%, BA/BS 10.74%, Diploma 0.21%, HS 0.07%
Denmark: PhD 38.75%, MA/MS 38.72%, Other 16.66%, BA/BS 5.68%, Diploma 0.17%, HS 0.03%
Finland: MA/MS 44.63%, PhD 36.09%, Other 15.82%, BA/BS 3.26%, Diploma 0.13%, HS 0.08%
France: PhD 40.65%, MA/MS 34.19%, Other 23.64%, BA/BS 1.27%, Diploma 0.16%, HS 0.08%
Germany: MA/MS 45.34%, PhD 25.83%, Other 23.77%, BA/BS 3.91%, Diploma 1.04%, HS 0.11%
India: MA/MS 34.54%, BA/BS 33.83%, PhD 18.52%, Other 12.85%, HS 0.20%, Diploma 0.07%
Israel: MA/MS 39.94%, PhD 32.54%, Other 17.05%, BA/BS 10.39%, Diploma 0.06%, HS 0.02%
Italy: PhD 33.91%, Other 30.44%, MA/MS 29.57%, Diploma 3.92%, BA/BS 1.82%, HS 0.34%
Japan: PhD 35.70%, MA/MS 34.84%, Other 22.28%, BA/BS 7.16%, HS 0.02%, Diploma 0.02%
Netherlands: MA/MS 49.33%, PhD 29.78%, Other 15.22%, BA/BS 5.44%, Diploma 0.14%, HS 0.10%
Saudi Arabia: PhD 38.71%, MA/MS 31.02%, BA/BS 22.37%, Other 7.58%, Diploma 0.21%, HS 0.10%
Singapore: PhD 35.99%, MA/MS 32.98%, BA/BS 19.58%, Other 10.96%, Diploma 0.35%, HS 0.14%
South Korea: PhD 42.50%, MA/MS 31.76%, Other 16.00%, BA/BS 9.60%, HS 0.11%, Diploma 0.02%
Spain: PhD 39.14%, MA/MS 32.36%, Other 24.70%, BA/BS 3.17%, Diploma 0.49%, HS 0.15%
Sweden: MA/MS 42.08%, PhD 34.03%, Other 18.15%, BA/BS 5.49%, Diploma 0.13%, HS 0.12%
Switzerland: PhD 43.63%, MA/MS 33.92%, Other 19.02%, BA/BS 3.00%, Diploma 0.33%, HS 0.10%
United Arab Emirates: MA/MS 38.90%, PhD 32.59%, BA/BS 19.08%, Other 9.13%, Diploma 0.19%, HS 0.11%
United Kingdom: PhD 51.13%, MA/MS 26.09%, Other 13.80%, BA/BS 8.66%, Diploma 0.23%, HS 0.09%
United States: PhD 41.99%, MA/MS 34.28%, BA/BS 12.51%, Other 11.01%, HS 0.14%, Diploma 0.06%

By Gender
The gender gap among AI authors and inventors is visible across all countries, with men making up the majority in every case, though the size of the gap varies (Figure 1.8.4). In Brazil, South Korea, and Japan, more than 80% of identified AI talent is male. Female representation is somewhat higher in Saudi Arabia (32.3%), Australia (30.3%), Canada (29.6%), and Italy (29.5%), but no country comes close to parity. More significantly, in almost every country, the male-female ratio has remained flat from 2010 to 2025. Even with the growth in AI talent overall, there has been no meaningful progress on gender balance. Chapter 7 (Education) describes a similar pattern in AI-related degree attainment, where women remain underrepresented across all levels.
Figure 1.8.4: Top AI authors and inventors (% of total) by gender, 2010–25. Y-axis: top AI authors and inventors (% of total). Source: Zeki Data, 2025 | Chart: 2026 AI Index report
Values labeled on each country panel (male / female):
Australia 69.73% / 30.27%; Brazil 80.16% / 19.84%; Canada 70.41% / 29.59%; Denmark 77.83% / 22.17%; Finland 73.23% / 26.77%; France 74.92% / 25.08%; Germany 78.24% / 21.76%; India 71.96% / 28.04%; Israel 78.43% / 21.57%; Italy 70.47% / 29.53%; Japan 82.50% / 17.50%; Netherlands 73.43% / 26.57%; Saudi Arabia 67.72% / 32.28%; Singapore 73.11% / 26.89%; South Korea 81.44% / 18.56%; Spain 77.38% / 22.62%; Sweden 75.13% / 24.87%; Switzerland 78.45% / 21.55%; United Arab Emirates 73.83% / 26.17%; United Kingdom 72.57% / 27.43%; United States 72.35% / 27.65%

By Specialization
AI authors and inventors are distributed across a range of specialization areas, though each country shows its own emphasis (Figure 1.8.5). Healthcare and bioinformatics, computer vision and image processing, and software engineering are among the most common areas globally, accounting for 10% or more of the pool in several countries. A few country-level patterns connect to findings discussed earlier in this chapter. South Korea, for example, has the highest share of talent in hardware, VLSI, and IoT (20%), consistent with its role in the semiconductor supply chain described in Section 1.3. Brazil has the highest share of software engineering talent (18%), while Saudi Arabia leads in security, privacy, and cryptography (15%).

Mobility
Mobility is measured through net flow: the number of AI authors and inventors who move into a country minus the number who move out (Figure 1.8.6). The United States has remained net positive since 2020, meaning it attracts more talent than it loses, though the magnitude has declined from a peak of 324.6 in 2022 to 26.0 in 2025. Most other countries operate on a smaller scale. Saudi Arabia (3.1) and Denmark (2.1) were among the few with positive net flow in 2025. Canada, which showed strong inflow around 2020, declined to -7.1 by 2025. Germany also showed negative net flow at -2.4, while India had the largest net outflow at -16.9 in 2025. These flows are relevant in the context of other factors, including immigration policy and the geographic distribution of investment and employment, discussed further in Chapter 4’s section on labor markets.
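The net flow measure can be sketched in a few lines. This is a minimal illustration, not Zeki's method: the monthly inflow and outflow counts are invented, the function names are hypothetical, and the trailing 12-month average is an assumption based on the "12-month rolling avg." axis label on Figure 1.8.6.

```python
def net_flow(inflow, outflow):
    """Monthly net flow: arrivals minus departures, month by month."""
    if len(inflow) != len(outflow):
        raise ValueError("inflow and outflow must cover the same months")
    return [arrivals - departures for arrivals, departures in zip(inflow, outflow)]


def rolling_mean(values, window=12):
    """Trailing mean over `window` points; the first window - 1 points yield no value."""
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


# Hypothetical monthly counts for one country over 14 months.
inflow = [30, 28, 35, 40, 32, 31, 29, 33, 36, 30, 27, 34, 20, 18]
outflow = [25, 30, 30, 31, 33, 30, 30, 30, 30, 31, 30, 30, 30, 30]

monthly = net_flow(inflow, outflow)
smoothed = rolling_mean(monthly)  # one value per complete 12-month window
print(smoothed)
```

A positive smoothed value means the country gained more authors and inventors than it lost over the preceding 12 months; a negative value means the reverse.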
Figure 1.8.5

Figure 1.8.6 [44]: Net flow of top AI authors and inventors by country, 2010–25. Y-axis: net flow of top AI authors and inventors (12-month rolling avg.). Source: Zeki Data, 2025 | Chart: 2026 AI Index report
Values labeled on each country panel:
Australia -1.75; Brazil -3.58; Canada -7.08; Denmark 2.08; Finland -1.08; France 0.08; Germany -2.42; India* -16.92; Israel -1.92; Italy 0.00; Japan 0.00; Netherlands -1.83; Saudi Arabia 3.08; Singapore -2.58; South Korea -1.75; Spain 2.58; Sweden -0.58; Switzerland 0.25; United Arab Emirates 2.58; United Kingdom 5.17; United States* 26.00
[44] Asterisks indicate that a country’s y-axis is scaled differently than the y-axis for the other countries.

Chapter 2: Technical Performance | AI Index Report 2026

2.5 Performance in Specific Domains

Software: SWE-bench; Terminal-Bench; Vibe Code Bench
Mathematics: FrontierMath; MathArena; Highlight: Theorem Proving
Finance: TaxEval; MortgageTax; CorpFin; Finance Agent
Law: CaseLaw; LegalBench

2.7 Robotics and Autonomous Motion

Robotics: RLBench; BEHAVIOR-1K; Highlight: Humanoid Robotics; Highlight: Physical AI and Foundation Models for Robotics
Self-Driving Cars: Deployment; Technical Innovations and New Benchmarks; Safety

2.1 Overall Performance Trends

This section examines patterns in AI performance, from the pace at which models are reaching human-level baselines to how competition among leading models and countries has narrowed. It also assesses where the tools used to measure this progress are themselves falling short.

To enable comparison across diverse evaluation tasks, performance metrics are scaled to a common reference point. The scaling methodology, developed by the AI Index team, calibrates each benchmark so that the best-performing model in a given year is measured as a percentage of the established human baseline for that task. For example, using this approach, a value of 105% indicates that a model performs 5% better than the human baseline. The benchmarks included in this analysis represent tasks that can be structurally evaluated. They may not fully capture the breadth of capabilities required for real-world AI deployment. The Benchmarking AI subsection later in this section explores these limitations in detail.

Technical Performance Benchmarks vs. Human Performance
AI performance continued to improve across a broad set of benchmark categories in 2025, with some of the largest gains appearing on tasks that were well below human baseline performance just a few years ago (Figure 2.1.1). Frontier systems now meet or exceed established human performance levels on long-running benchmarks, including ImageNet, SuperGLUE, and MMLU. Since last year’s report, several benchmarks designed to test more advanced reasoning have reached or approached the human baseline, including PhD-level science questions (GPQA Diamond), multimodal reasoning (MMMU), and mathematical reasoning (AIME). Models are still performing below the baseline in the areas of autonomous software engineering (SWE-bench Verified) and agent-based multimodal computer use (OSWorld), but the pace of improvement is accelerating rapidly. On SWE-bench Verified, for example, performance rose from approximately 60% in 2024 to close to 100% in 2025.

Figure 2.1.1 [1]: Select AI Index technical performance benchmarks vs. human performance, 2012–2025. Y-axis: performance relative to the human baseline (%), with the human baseline marked as a reference line. Series: image classification (ImageNet Top-5); visual reasoning (VQA); English language understanding (SuperGLUE); medium-level reading comprehension (SQuAD 2.0); multitask language understanding (MMLU); competition-level mathematics (MATH); PhD-level science questions (GPQA Diamond); multimodal understanding and reasoning (MMMU); agent multimodal computer use (OSWorld); mathematical reasoning (AIME); autonomous software engineering (SWE-bench Verified). Source: AI Index, 2026 | Chart: 2026 AI Index report
[1] In Figure 2.1.1, the values are scaled to establish a standard metric for comparing different benchmarks. The scaling function is calibrated such that the performance of the best model for each year is measured as a percentage of the human baseline for a given task. A value of 105% indicates, for example, that a model performs 5% better than the human baseline.

Closed- vs. Open-Weight Models
The performance gap between leading closed-weight and open-weight models has fluctuated over the past three years, with open-weight systems closing in and then falling behind as new proprietary models are released (Figure 2.1.2). In May 2023, the leading closed-weight model (GPT-4-0314) outperformed the top open-weight model (Vicuna-13B) by 174 points (15.2%) on the Arena Leaderboard. Stronger open-weight releases, including Mixtral, WizardLM, and Llama-3.1-405B, narrowed the gap to just 7 points (0.5%) by August 2024. Over the past year, that trend reversed with the arrival of new closed-weight frontier systems such as o1-preview and Gemini 2.5 Pro. As of March 2026, the top closed-weight model, Claude Opus 4.6 (1,503), led the top open-weight model, GLM-5 (1,454), by 49 points (3.4%). While closed-weight models still lead, open-weight models are far more competitive than they were a few years ago.

Figure 2.1.2 [2]: Performance of top closed vs. open models on the Arena, May 2023 to January 2026. Y-axis: Arena score (Elo-like). Latest labeled values: top closed model 1,503; top open model 1,454. Labeled closed models: gpt-4-0314, gpt-4-0125-preview, o1-preview, gpt-5-high, claude-opus-4-6-thinking. Labeled open models: vicuna-13b, mixtral-8×7b-instruct-v0.1, llama-3.1-405b-instruct-fp8, qwen3-235b-a22b-instruct-2507, glm-5. Source: Arena, 2026 | Chart: 2026 AI Index report
[2] Source: the Arena historical leaderboard (Public, Style Control On), exported in March 2026.

US vs. China Technical Performance
The United States’ substantial lead in 2023 shrank considerably by early 2025, and the performance gap has remained narrow since then (Figure 2.1.3). In February 2025, DeepSeek-R1 (1,400) trailed the leading U.S. model (o1-2024-12-17, 1,405) by just 5 Arena points (0.4%). As of March 2026, the top U.S. model (Claude Opus 4.6, 1,503) led the top Chinese model (Dola-Seed-2.0 Preview, 1,464) by 39 points (2.7%). Over the past year, the gap has fluctuated between near parity and low single digits. This convergence is particularly notable because it has emerged from two distinct development environments and institutional contexts, including the research dynamics examined in Chapter 1 and the investment patterns discussed in Chapter 4.

Figure 2.1.3 [3]: Performance of top United States vs. Chinese models on the Arena, May 2023 to January 2026. Y-axis: Arena score (Elo-like). Latest labeled values: top US model 1,503; top China model 1,464. Labeled US models: gpt-4-0314, o1-preview, gemini-3-pro, claude-opus-4-6. Labeled Chinese models: chatglm-6b, qwen-14b-chat, glm-4-0520, deepseek-r1, qwen3-235b-a22b-instruct-2507, dola-seed-2.0-preview. Source: Arena, 2026 | Chart: 2026 AI Index report
[3] Source: the Arena historical leaderboard (Public, Style Control On), exported in March 2026.

Model Performance Converges at the Frontier
Frontier models became even more tightly clustered over the past year, as several companies moved into a very narrow performance band at the top of the Arena Leaderboard (Figure 2.1.4). In early 2023, OpenAI had a clear lead, with its top model scoring 1,322 compared to Google’s 1,117. This gap narrowed steadily through 2024 as Google, Anthropic, and others released stronger models. By February 2025, DeepSeek had briefly matched and surpassed the top U.S. systems on Arena. In last year’s report, the top four models spanned roughly 97 points; as of March 2026, the top four are separated by fewer than 25 points. Anthropic leads at 1,503, followed closely by xAI (1,495), Google (1,494), and OpenAI (1,481). Alibaba (1,449) and DeepSeek (1,424) trail only modestly. Meta’s Arena performance has flattened since early 2025, reflecting a slowdown in competitive releases, though newer models could be in the pipeline for 2026. As leading models become harder to distinguish on benchmark performance, factors such as cost, latency, reliability, and domain-specific optimization may play a greater role in user adoption.

Figure 2.1.4 [4]: Performance of top models on the Arena by select providers, May 2023 to January 2026. Y-axis: Arena score (Elo-like). Latest labeled values: Anthropic 1,503; xAI 1,495; Google 1,494; OpenAI 1,481; Alibaba 1,449; DeepSeek 1,424; Mistral AI 1,416; Meta 1,335. Source: Arena, 2026 | Chart: 2026 AI Index report
[4] Source: the Arena historical leaderboard (Public, Style Control On), exported in March 2026.

Benchmarking AI
Benchmarks still anchor much of how AI’s technical progress is measured, but their limitations are more visible. Since last year’s report, the AI Index has expanded its analysis to examine where benchmarks remain useful and where they fall short.

Several challenges highlighted in previous editions of this report persist. Benchmark saturation, where models reach scores so high that a test can no longer distinguish between them, remains a concern. Tests designed to be harder often remain useful for only a few years before systems surpass them. As Chapter 1 documents, reporting discrepancies continue, and the most capable modern models are now among the least transparent. The growing opacity and nonstandard prompting techniques make model-to-model comparisons unreliable, and third-party evaluations have documented cases where models perform more poorly in independent testing than in developer-reported results. In addition, contamination, which occurs when models are exposed to test set data during training, can lead to falsely inflated scores. In 2025, Meta faced criticism that its Llama 4 model was optimized using specialized variants to improve leaderboard rankings and may have trained on benchmark test data, though the company disputed these claims. Additionally, audits of widely used benchmarks revealed that many remain poorly constructed, with inadequate documentation, no reporting of statistical significance, and a lack of replication scripts.
Even when benchmark scores are technically valid, strong benchmark performance does not always translate to real-world utility. Last year’s report also highlighted how difficult it is to benchmark more complex, interactive forms of intelligence, which matter even more for current AI systems. Even though many benchmarks for multiagent coordination, human–AI interaction, tool-using agents, and physical-world robotics have been proposed (e.g., for robotic manipulation, embodied reasoning, and agentic tasks), they remain underdeveloped. These domains are inherently harder to standardize, as physical tasks involve unpredictable environments, diverse hardware, and a range of valid approaches that resist repeatable scoring. Later sections of this chapter report on several of these benchmarks in detail.

The benchmarking landscape has seen several developments that extend beyond these recurring concerns. First, there is a growing case for evaluations that measure human–AI collaboration rather than AI performance in isolation. Most widely used benchmarks test systems without human involvement, even though many real deployments involve people supervising, steering, and integrating AI outputs. Recent work argues that the field should adopt centaur evaluations, assessments in which humans and AI jointly solve tasks, because these better reflect actual use and allow measurement of human-centered qualities like interpretability and helpfulness that conventional benchmarks ignore.

Second, new methods have emerged to address invalid benchmark questions. A review by Stanford researchers identified the proportion of invalid questions across nine widely used benchmarks, with error rates ranging from 2% on MMLU Math to 42% on GSM8K (Figure 2.1.5). Truong et al., 2025, introduced a framework that uses statistical analysis of response patterns to flag problematic items for expert review, achieving up to 84% precision.
Separately, Cheng et al., 2025, have proposed shifting toward “certificate-grade,” peer-based evaluation frameworks: community-governed, proctored systems with secure environments, continuously refreshed test items, and delayed result disclosure.

Figure 2.1.5: Invalid question detection across nine benchmarks. Axis label: Precision@50. Source: Truong et al., 2025 | Chart: 2026 AI Index report
MMLU Math 2%; OpenBookQA 2%; MMLU Cli 6%; MMLU Med 6%; AIR-Bench 9%; MedQA 23%; ThaiExam 26%; MMLU 5Sub 31%; GSM8K 42%

Third, questions have been raised about the reliability of popular public benchmarking platforms such as the Arena. A recent analysis (Singh et al., 2025) argues that platform dynamics could affect ranking accuracy. If providers are able to iterate on or swap model variants outside the public record, this introduces selection effects that make comparisons less straightforward. The study also points out data-access asymmetries and shows that additional Arena-style interaction data can improve performance on Arena-derived evaluations, suggesting that leaderboard standing may partly reflect adaptation to the platform rather than general capability alone.

Finally, while capability evaluations are widespread, assessments of social impacts remain fragmented and incomplete. Reul et al., 2025, found that developers’ reporting of bias and environmental impact is often sparse and declining, while third-party researchers more rigorously assess harms such as harmful content and performance disparities. Because only developers can disclose key information about data, labor practices, and training infrastructure, current evaluation practices provide a strong picture of what models can do but a far weaker account of their societal consequences. Chapter 3 examines responsible AI evaluation in further detail.
In this chapter, the AI Index continues to report on benchmarks as key indicators of technical progress. Scores are sourced from leaderboards, public repositories, and company disclosures, including papers, blog posts, and product releases. The AI Index assumes that company-reported results are accurate. All scores reflect the state of the field as of early 2026; subsequent model releases may have surpassed these benchmarks.
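The scaling used for Figure 2.1.1, in which each benchmark's best score in a given year is expressed as a percentage of the human baseline, can be sketched as follows. The function name and all scores are hypothetical, and the AI Index's actual calibration may involve details not described in this section.

```python
def relative_to_human(best_model_score, human_baseline):
    """Express a benchmark score as a percentage of the human baseline."""
    if human_baseline <= 0:
        raise ValueError("human baseline must be positive")
    return 100.0 * best_model_score / human_baseline


# Hypothetical best-model accuracies on one benchmark, keyed by year.
best_by_year = {2023: 52.0, 2024: 71.5, 2025: 94.5}
human_baseline = 90.0  # hypothetical human accuracy on the same benchmark

scaled = {
    year: round(relative_to_human(score, human_baseline), 1)
    for year, score in best_by_year.items()
}
print(scaled)
```

In this made-up series, the 2025 value lands at 105%, which under the report's convention would read as 5% better than the human baseline. Putting every benchmark on this common scale is what lets tasks with different raw metrics share one chart.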

2.2 Language

Language understanding and generation continue to serve as foundational capabilities for modern AI systems. This section examines how models perform on tasks requiring comprehension of complex text, production of coherent responses, and execution of specialized language-based operations. The benchmarks range from general-purpose question answering to specific technical capabilities like function calling and text embedding.

Understanding
Language understanding benchmarks measure how well models can comprehend and reason over text across a broad range of subjects, from the humanities to highly technical materials. As performance has improved, evaluation has shifted toward harder test sets that are less susceptible to familiarity or memorization. The goal is to track where models are improving rather than where they are simply reaching the upper limits of current benchmarking tools.

MMLU: Massive Multitask Language Understanding
MMLU remains a widely cited measure of broad knowledge across disciplines. Introduced in 2024, the MMLU-Pro benchmark assesses performance with over 12,000 questions and a 10-option multiple-choice format designed to better test reasoning. This expanded answer set has a measurable impact on model performance evaluation. Compared to the original MMLU benchmark, model accuracy on MMLU-Pro typically drops by 16%–33%, which provides better differentiation between top models. For example, GPT-4o and GPT-4-Turbo appeared to have a 1% gap on standard MMLU, but on MMLU-Pro the spread widens to 9%. The newer benchmark’s design reduces prompt sensitivity and strengthens reasoning evaluation. Previously, MMLU showed around 4%–5% sensitivity to prompt variations versus an estimated 2% with MMLU-Pro. In addition, reasoning methods such as chain-of-thought tend to yield much better performance on MMLU-Pro than direct answer strategies.

As of early 2026, top model performance on MMLU-Pro is tightly clustered, with the leading 15 models all scoring at or above 87% (Figure 2.2.1). Google’s Gemini-3.1-Pro leads at 91.2%, followed by Gemini-3-Pro (Thinking) at 90.1% and GPT-o1 at 89.3%. Models that employ thinking strategies tend to appear higher in the rankings, outperforming their standard counterparts, which are grouped in the 87%–88% range. The overall spread between the top-ranked and 15th-ranked model is just over 4 percentage points, illustrating how competitive the frontier has become on broad knowledge tasks. This tight clustering is also consistent with the convergence pattern described in Section 2.1.

Figure 2.2.1 [5]: MMLU-Pro: overall accuracy. Source: MMLU-Pro Leaderboard, 2026 | Chart: 2026 AI Index report
Gemini-3.1-Pro 91.16%; Gemini-3-Pro (11/25) 90.10%; GPT-o1 89.30%; Claude-4.6-Opus (Thinking) 89.10%; Gemini-3-Flash (12/25) 88.60%; Qwen3.5-397B-A17B 87.80%; Seed2.0-Lite 87.70%; Claude-4.5-Sonnet (Thinking) 87.40%; GPT-5.2 87.40%; Claude-4-Opus (Thinking) 87.30%; Claude-4.5-Opus (Thinking) 87.30%; Hunyuan-T1 87.20%; K2.5-1T-A32B 87.10%; GPT-5 (high) 87.10%; Grok-4 87.00%
[5] Source: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro.

Generation
Generation benchmarks focus on the quality of model outputs, looking at clarity, helpfulness, instruction-following, and style. Unlike knowledge-style tests, these evaluations often depend on human judgment since some dimensions are subjective and dependent on both the prompt and the user. Preference-based tests help measure that subjectivity and are a useful complement to traditional benchmarks for tracking how models perform in real-world settings.

Arena Leaderboard
The Arena (formerly LMArena) is an interactive platform with a community-driven ranking system that allows users to directly compare outputs of large language models (LLMs) on identical prompts and then vote on which they favor. Evaluations are blind to minimize bias toward particular model providers or architectures. By aggregating thousands of comparisons, the platform generates Elo ratings, a ranking system borrowed from chess. This approach emphasizes user experience and practical utility, capturing aspects of model quality that structured benchmarks cannot, including human judgment on real-world tasks. The user-centered approach does have limitations, as preferences may not align with correctness and may not be fully representative of model use cases or contexts. Singh et al. (2025) highlight potential sources of bias, such as order bias, length bias, or style preferences, that are not correlated with output accuracy. As mentioned earlier, evaluations such as Arena are best read as a complementary view rather than an absolute score of model quality.

Elo ratings on the Text Arena are tightly clustered as of early 2026, with the top 15 models spanning roughly 46 points (Figure 2.2.2). Claude-Opus-4-6-Thinking leads at approximately 1,510, followed closely by Gemini-3.1-Pro-Preview. The gap narrows further down the rankings, and confidence intervals overlap for many models. So, while Anthropic and Google models appear throughout the top ranks, no single model dominates the leaderboard.
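To give a sense of what rating gaps of this size mean in head-to-head terms, the sketch below applies the standard chess Elo expected-score formula with its conventional 400-point scale. This is an assumption made for illustration: the charts in this chapter label Arena scores as "Elo-like," and the Arena's actual rating model may differ, so the outputs are rough guides and not Arena-reported win rates.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score for A against B (0.5 means evenly matched)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Gaps quoted in this chapter: 5 points (US vs. China, February 2025),
# 46 points (spread of the top 15 on the Text Arena), and
# 174 points (closed vs. open weights, May 2023).
for gap in (5, 46, 174):
    probability = expected_score(1500 + gap, 1500)
    print(f"{gap:>3}-point gap -> expected score {probability:.3f}")
```

Under this formula only the rating difference matters, not the absolute level, which is why the base rating of 1500 used here is arbitrary.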
[Figure 2.2.2: Text Arena: Elo rating. Elo rating by model, axis from 1,440 to 1,520. Models shown: gpt-5.1-high, claude-sonnet-4-6, gemini-3-flash (thinking-minimal), grok-4.1, claude-opus-4-5-20251101, dola-seed-2.0-preview, claude-opus-4-5-20251101-thinking-32k, gemini-3-flash, grok-4.1-thinking, gpt-5.2-chat-latest-20260210, gemini-3-pro, grok-4.20-beta1, gemini-3.1-pro-preview, claude-opus-4-6-thinking, claude-opus-4-6. Source: Arena, 2026 | Chart: 2026 AI Index report.]

6 Source: https://arena.ai/leaderboard/text.

Specialized Language Tasks

Beyond general understanding and generation, language models need to handle tasks that make them usable for practical deployment. Three key capabilities in deployed applications are retrieval-augmented generation (RAG), function calling, and text embedding. Benchmarks that track these capabilities are particularly useful because they test not only fluency but also whether models can operate as part of a larger system. They also make it easier to compare models in settings where performance depends not just on the base model but also on factors such as retrieval quality or how outputs are parsed and executed.

RAG: Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) provides a way for models to deliver accurate, up-to-date information beyond the knowledge encoded in model parameters during training. At inference time, RAG systems augment model responses with information retrieved from external sources. Standard RAG pipelines retrieve individual text chunks based on query similarity, which can struggle when answering questions that require synthesizing information across documents. To address this problem, Microsoft Research introduced Graph RAG in 2024, which enables more effective responses to queries by structuring source material into a knowledge graph and generating community summaries that capture high-level themes. Other variants focus on improving multistep retrieval or reranking passages before generation. As expected, these architectural choices involve trade-offs between answer quality, latency, and cost.

Context windows, discussed later in this section, have important implications for RAG systems. Extended context windows can support retrieval of more material, though that does not guarantee better performance, since models must still attend reliably to information across the entire window.

Berkeley Function Calling Leaderboard

Function calling allows a model to use external tools and APIs by generating structured requests that another system can run, then folding the results back into its response. It is a foundational capability for agent frameworks, where models need to take actions or retrieve information beyond their training data.

The Berkeley Function Calling Leaderboard (BFCL) evaluates models on their function-calling ability and has evolved considerably since its initial release. The current iteration, BFCL V4, shifts the focus toward holistic agent evaluations. Agentic tasks account for 40% of the overall score, multiturn interactions for 30%, and the remainder is split across live, nonlive, and hallucination categories. The agentic component tests web search and memory, while the multiturn component evaluates multistep dialogues. Earlier versions focused more narrowly on single-turn function calling.

Overall accuracy on the BFCL varies widely as of early 2026. The top 15 models span a range of roughly 21 percentage points (Figure 2.2.3). Claude models occupy three of the top six positions, with Claude-Opus-4.5 leading at 77.5%. Performance also differs by evaluation mode, reflecting the trade-offs between general capability and task-specific optimization. For example, Grok-4-0709 scores 63% in prompt mode but drops to 61.4% in function-calling mode, while Grok-4-1 scores higher in its fast-reasoning variant (69.6%) than in its non-reasoning counterpart (58.3%).
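The function-calling loop described above (the model emits a structured request, another system runs it, and the result is folded back into the response) can be sketched as follows. This is a minimal illustration under stated assumptions: `fake_model`, `get_weather`, the `TOOLS` registry, and the JSON request shape are all hypothetical stand-ins, not the API of any real model provider or of the BFCL harness.

```python
import json


def get_weather(city):
    """Hypothetical stand-in for an external API; returns canned data."""
    return {"city": city, "temp_c": 18}


# Hypothetical tool registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather}


def fake_model(prompt):
    """Stand-in for an LLM that answers with a structured tool request."""
    return json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})


def run_tool_call(raw_request):
    """Parse the model's request, validate the tool name, and execute it."""
    request = json.loads(raw_request)
    name = request.get("name")
    if name not in TOOLS:
        # Requesting a tool that does not exist is rejected rather than run.
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**request.get("arguments", {}))


raw = fake_model("What is the weather in Paris?")
result = run_tool_call(raw)
# Fold the tool result back into the final response text.
answer = f"It is {result['temp_c']} degrees C in {result['city']}."
```

Validating the tool name before execution is a deliberate design choice in this sketch: it separates well-formed requests from calls to tools that were never offered, which is the kind of failure a hallucination category is meant to surface.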
[Figure 2.2.3: Berkeley Function Calling: overall accuracy. Overall accuracy by model, lowest to highest: Gemini-2.5-Flash (FC) 56.24; DeepSeek-V3.2-Exp (Prompt + Thinking) 56.73; Command A Reasoning (FC) 57.06; Grok-4-1-fast-non-reasoning (FC) 58.29; Moonshotai-Kimi-K2-Instruct (FC) 59.06; Grok-4-0709 (FC) 61.38; Grok-4-0709 (Prompt) 62.97; o3-2025-04-16 (Prompt) 63.05; Gemini-3-Pro-Preview (FC) 68.14; Claude-Haiku-4-5-20251001 (FC) 68.70; Grok-4-1-fast-reasoning (FC) 69.57; GLM-4.6 (FC thinking) 72.38; Gemini-3-Pro-Preview (Prompt) 72.51; Claude-Sonnet-4-5-20250929 (FC) 73.24; Claude-Opus-4-5-20251101 (FC) 77.47. Source: Berkeley Function Calling Leaderboard, 2026 | Chart: 2026 AI Index report.]

7 Source: https://gorilla.cs.berkeley.edu/leaderboard.html.

MTEB: Massive Text Embedding Benchmark

The Massive Text Embedding Benchmark (MTEB) evaluates different embedding models across a set of tasks that require semantic understanding. It includes over 50 datasets spanning eight task categories, which makes it harder for models to look strong by optimizing for a single use case rather than performing well across different settings.

The top average task score on MTEB (English v2) has risen steadily since 2022, coinciding with the broader adoption of large-scale pretraining techniques for embedding models. In 2025, the top score reached 76, rising approximately 11 points since 2023 (Figure 2.2.4). However, the best models still fall short of a perfect score.

[Figure 2.2.4: MTEB (English v2): average score. Average task score by year, 2019 to 2025; latest value shown: 75.97. Source: MTEB Leaderboard, 2026 | Chart: 2026 AI Index report.]

8 Source: https://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v1%29 https://arxiv.org/abs/2502.13595.
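Text embeddings of the kind MTEB evaluates are also what a standard RAG pipeline relies on when it retrieves chunks by query similarity. The sketch below shows that retrieval step with cosine similarity. The three-dimensional vectors and chunk names are hand-made assumptions for illustration; a real embedding model would produce much higher-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, index, k=2):
    """Return the k chunk names whose embeddings are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vec, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical values, not output of any real model).
chunks = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "privacy notice": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded user question

retrieved = top_k(query_vec, chunks, k=2)
```

Because each chunk is scored independently against the query, this style of retrieval has no built-in way to combine evidence spread across documents, which is the limitation that graph-based and multistep variants try to address.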
Highlight: The Gap Between Long Context Windows and Deep Understanding

Context windows, the amount of text a model can process in a single input, have grown by almost 30x per year since mid-2023 (Figure 2.2.5). Models that once accepted a few thousand tokens can now process 1 million or more. At the upper end, this is equivalent to multiple books or an entire codebase in a single pass. On two long-context benchmarks, Fiction.liveBench, which measures narrative comprehension, and MRCR, which measures multi-needle retrieval, the input length at which leading models achieve 80% accuracy has increased even faster, at roughly 250x over a nine-month period (Burnham and Adamczewski, 2025). However, bigger context windows do not translate into deeper understanding, as the gap between accepted and usable context length is wide.

Recent research points to different reasons for this gap. On one expert-level, long-context benchmark (LongBench v2), human experts scored just 53.7% accuracy under a 15-minute time limit, and the best model scored 57.7% (Bai et al., 2025). This is a narrow margin in contrast to the structured benchmarks where models have surpassed human baselines, and it reflects the difficulty of deep comprehension over long inputs. Models that were prompted to reason through the material step by step did perform better than those asked to answer immediately, suggesting that how a model works through long text matters as much as the amount of text it can accept.

Other research has found that models handle simple lookups well but struggle when asked to find multiple pieces of matching information or to apply conditions across a very long document, tasks that would be straightforward for a human scanning the same text (Yu et al., 2025). Models can complete these tasks if guided to check each item one by one, but this approach is slow and expensive.
Longer inputs also carry practical costs: slower response times, higher operating expenses, and reduced accuracy for information that appears later in the input.

Measuring long-context ability also remains difficult. When a model scores well on a long-context test, it is not always clear whether it genuinely processed the full input or simply relied on knowledge it already had. Yang et al. (2025) introduced a metric designed to separate these two factors and found that model rankings shifted substantially. For example, a model that ranked seventh on raw scores ranked first when only long-context ability was measured, underscoring the importance of distinguishing a model's specific ability to handle long inputs from its overall capability.

If the gap between context window size and effective utilization narrows, models may improve their ability to work on tasks that unfold over hours or days and sustain longer chains of reasoning (Denain and Ho, 2025). Developing evaluations that reliably distinguish true long-context ability from general model capability will be important for tracking that progress and ensuring that benchmark gains reflect real improvements.

[Figure 2.2.5]
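The two growth figures quoted in this highlight use different time spans (30x per year versus 250x over nine months), so a quick conversion helps compare them. The sketch below annualizes a growth factor, assuming constant exponential growth over the measured period; that assumption is ours, not a claim from the cited studies.

```python
def annualized_factor(factor, months):
    """Convert a growth factor observed over `months` into a per-year factor,
    assuming constant exponential growth across the period."""
    return factor ** (12.0 / months)


# Figures quoted in the text above.
window_growth_per_year = 30.0                        # context window size
usable_growth_per_year = annualized_factor(250.0, 9.0)  # 80%-accuracy input length
```

Under this assumption, 250x over nine months corresponds to growth on the order of 1,500x per year, which is why the text describes usable input length on these two benchmarks as growing even faster than nominal window size.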

2.3 Image and Video

Beyond language, many models process visual inputs, and their video and image capabilities have advanced significantly. This section examines model performance along two dimensions: understanding, or how well models comprehend and reason over video content, and generation, which concerns the quality of AI-produced images and videos.

Understanding

Video understanding benchmarks measure how well models can track actions, objects, and events across frames rather than reasoning over a single image. As performance on earlier benchmarks has improved, evaluation has shifted toward tasks that demand multistep temporal reasoning and domain-specific knowledge applied to video.

MVBench

MVBench evaluates whether multimodal models can move beyond static image understanding to handle the complexities of video. This includes interpreting motion, temporal sequences, and shifting context across frames. Its focus on temporal reasoning makes it a useful benchmark for tracking performance in more dynamic visual environments.

The top-performing model on MVBench reaches 74.1% average accuracy, with JT-VL-Chat and JT3.5 tied at that score (Figure 2.3.1). In early 2026, performance across the top 15 models spans a range of roughly 23 percentage points. VideoChat2 has the lowest average accuracy of the group (51.1%), while several VideoChat2 variants are grouped in the middle tier (60%–65%).

[Figure 2.3.1: MVBench: average accuracy. Average accuracy by model, lowest to highest: VideoChat2 51.10%; Kwai-VideoLLM 54.73%; ST-LLM 54.85%; PLLaVA 34B 58.10%; CVLM 58.77%; VideoChat2_mistral 60.40%; VideoChat2_HD_mistral 62.30%; Video-CCAM-4B-v1.1 62.80%; Video-CCAM-9B-v1.1 64.60%; InternVideo2-8B-HD-Chat-f16 67.25%; TimeMarker 67.42%; Video-CCAM-7B-v1.2 69.23%; Woven-VLLM 73.81%; JT3.5 74.08%; JT-VL-Chat 74.08%. Source: MVBench Leaderboard, 2026 | Chart: 2026 AI Index report.]

9 Source: https://huggingface.co/spaces/OpenGVLab/MVBench_Leaderboard.
Video-MMMU

Video-MMMU is a large, multimodal, multidisciplinary benchmark for learning from educational videos, comprising 300 expert-level videos averaging roughly 506 seconds across six disciplines and 30 subjects. Each video is paired with three sets of questions that test progressively deeper understanding. Perception questions test whether a model can pull key details from text/audio; comprehension questions test whether it grasps the concept or solution strategy; and adaptation questions require applying that knowledge to a new scenario. Adaptation questions reuse MMMU/MMMU-Pro items for STEM fields and custom case studies for art/humanities, so models have to go beyond the specific video. The benchmark also introduces a Δknowledge metric to track how much a model’s performance improves after processing the video.

As of 2025, no model has reached the human baseline of 74.4% on Video-MMMU overall accuracy (Figure 2.3.2). The best-performing model, Keye-VL-1.5-8B, scores 66%, followed closely by Claude-3.5-Sonnet (65.8%). The lowest-scoring model is VILA1.5-8B at 20.9%, leaving a 45 percentage point range across the leaderboard.

The Δknowledge results reveal a further gap between human and model learning (Figure 2.3.3). Human experts gain 33.1 percentage points after watching the video, while the best model on this metric, GPT-4o, gains only about half of that (15.6 points). About a third of models even show negative Δknowledge, meaning their performance declines after processing the video.
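The text above describes Δknowledge only as the improvement in performance after processing the video. One way such a metric can be computed is as a normalized gain: the improvement divided by the headroom that was available before the video. The sketch below assumes that form. The formula is an assumption here and should be checked against the Video-MMMU paper, and the accuracy values are hypothetical.

```python
def delta_knowledge(acc_before, acc_after):
    """Normalized gain in percent (assumed form, not confirmed by the report):
    improvement relative to the headroom left before watching the video.
    Both accuracies are percentages in [0, 100]."""
    if not 0.0 <= acc_before < 100.0:
        raise ValueError("acc_before must be in [0, 100)")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Hypothetical accuracies, for illustration only.
gain = delta_knowledge(40.0, 55.0)   # model improves after the video
loss = delta_knowledge(50.0, 45.0)   # model gets worse after the video
```

A negative value, as in the second example, is the case the text flags: the model answers fewer adaptation questions correctly after seeing the video than before.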
[Figure 2.3.2: Video-MMMU: overall accuracy. Human baseline: 74.44%. 2024 models: VILA1.5-8B 20.89%; LongVA-7B 23.98%; Llama-3.2-11B 30.00%; LLaVA-OneVision-7B 33.89%; VILA1.5-40B 34.00%; LLaVA-Video-7B 36.11%; InternVL2-8B 37.44%; VideoChat-Flash-7B@448 41.67%; mPLUG-Owl3-7B 42.00%; Qwen-2.5-VL-7B 47.44%; LLaVA-OneVision-72B 48.33%; LLaVA-Video-72B 49.67%; Gemini 1.5 Flash 49.78%; Aria 50.78%; Gemini 1.5 Pro 53.89%; GPT-4o 61.22%; Claude-3.5-Sonnet 65.78%. 2025 models: MAmmoTH-VL-8B 41.78%; InternVideo2.5-Chat-8B 43.00%; VideoLLaMA3-7B 47.00%; Video-RTS 52.70%; GLM-4V-PLUS-0111 57.56%; Qwen-2.5-VL-72B 60.22%; Kimi-VL-A3B-Thinking-2506 65.22%; Keye-VL-1.5-8B 66.00%. Source: Video-MMMU Leaderboard, 2026 | Chart: 2026 AI Index report.]

10 Source: https://videommmu.github.io/#Leaderboard.

[Figure 2.3.3: Video-MMMU: Δknowledge. Δknowledge, lowest to highest: InternVL2-8B -8.50%; LongVA-7B -7.00%; LLaVA-OneVision-7B -5.60%; LLaVA-Video-7B -5.30%; Gemini 1.5 Flash -3.30%; GLM-4V-PLUS-0111 -1.70%; VideoChat-Flash-7B@448 -1.30%; VideoLLaMA3-7B -0.50%; Keye-VL-1.5-8B 0.00%; MAmmoTH-VL-8B 1.50%; Qwen-2.5-VL-7B 2.20%; InternVideo2.5-Chat-8B 3.00%; Aria 3.20%; Kimi-VL-A3B-Thinking-2506 3.50%; VILA1.5-8B 5.90%; LLaVA-OneVision-72B 6.60%; LLaVA-Video-72B 7.10%; mPLUG-Owl3-7B 7.50%; Gemini 1.5 Pro 8.70%; VILA1.5-40B 9.40%; Qwen-2.5-VL-72B 9.70%; Claude-3.5-Sonnet 11.40%; GPT-4o 15.60%; Human expert 33.10%. Source: Video-MMMU Leaderboard, 2026 | Chart: 2026 AI Index report.]

Generation

While the above benchmarks test how well models interpret existing visual content, generation benchmarks assess how well models can produce it. Evaluation spans both human preference rankings and automated quality metrics, since generated video must satisfy subjective expectations as well as technical criteria such as coherence, fidelity, and controllability. Of these, controllability has become an especially important focus, reflecting whether models can follow user intent while maintaining natural motion and scene dynamics. This has also brought video generation closer to the idea of world models, where systems aim to predict how visual scenes evolve over time.

[Figure 2.3.4: Midjourney generations over time: “a hyper-realistic image of Harry Potter.” Versions shown: V1, February 2022; V2, April 2022; V3, July 2022; V4, November 2022; V5, March 2023; V6, December 2023; V6.1, July 2024; V7, April 2025. Source: Midjourney, 2025.]

Arena: Vision

The Arena platform also hosts a Vision Arena that applies the same blind-comparison, Elo-based methodology described in the earlier section for language to image generation models. Human preference is an important signal for image generation, as qualities like aesthetic appeal and visual coherence are difficult to capture through automated metrics alone.

As of early 2026, Google’s Gemini models hold four of the top six positions, with Gemini-3-Pro leading at approximately 1,285 Elo, followed by another Gemini variant (Figure 2.3.5). Similar to the language evaluation, confidence intervals overlap for the bottom two-thirds of the ranked models, which fall within a 30-point range between 1,230 and 1,260.

[Figure 2.3.5: Vision Arena: Elo rating. Elo rating by model, axis from 1,220 to 1,300. Models shown: gpt-5.2, kimi-k2.5-instant, chatgpt-4o-latest-20250326, gpt-5.1, qwen3.5-397b-a17b, gpt-5.2-high, kimi-k2.5-thinking, gemini-2.5-pro, gpt-5.1-high, gemini-3-flash (thinking-minimal), dola-seed-2.0-preview, gpt-5.2-chat-latest-20260210, gemini-3-flash, gemini-3.1-pro-preview, gemini-3-pro. Source: Arena, 2026 | Chart: 2026 AI Index report.]

11 This chart shows only the top 15 models as of February 2026; source: https://arena.ai/leaderboard/vision.

Video-Bench

Video-Bench is a human-aligned benchmark for video generation that scores models on two dimension groups: video-condition alignment and video quality. It uses an MLLM-based evaluator (GPT-4o) with few-shot scoring and chain-of-query prompting for more precise, calibrated results. The scores correlate more strongly with human ratings than prior metric-based or LLM-based benchmarks.

As of early 2026, Gen3 and Kling lead on video quality, with Gen3 scoring highest across all key metrics, including imaging quality, aesthetic quality, temporal consistency, and motion effects, while Kling ranks second overall (Figure 2.3.6). Motion effects are the weakest subdimension across nearly all models.

[Figure 2.3.6: Video-Bench video quality. Score (0 to 5, higher is better) by model, listed as imaging quality / aesthetic quality / temporal consistency / motion effects: Gen3 4.66 / 4.44 / 4.74 / 3.99; Cogvideox 3.87 / 3.84 / 4.14 / 3.55; VideoCrafter2 4.08 / 3.85 / 3.69 / 2.81; Kling 4.26 / 3.82 / 4.38 / 3.11; Show-1 3.30 / 3.28 / 3.90 / 2.90; LaVie 3.00 / 2.94 / 3.00 / 2.43; Pika-Beta 3.78 / 3.76 / 3.40 / 2.59. Source: Video-Bench Leaderboard, 2025 | Chart: 2026 AI Index report.]

12 Source: https://github.com/Video-Bench/Video-Bench?tab=readme-ov-file#leaderboard.

VBench-2.0

VBench-2.0 is a comprehensive, human-aligned benchmark for evaluating video generation models on intrinsic faithfulness, defined as well-rounded adherence to reality rather than simply being visually convincing. It scores models across five broad dimensions (Human Fidelity, Creativity, Controllability, Physics, and Commonsense). The benchmark combines VLM/LLM-based analysis with specialized detectors and a small but targeted prompt set, anchored by human preference labels. This faithfulness-oriented approach is important because it surfaces whether generated videos hold up under scrutiny in areas like physical plausibility and scene consistency.

None of the models evaluated in early 2026 surpasses a total score of 67% (Figure 2.3.7). Veo 3 leads at 66.7%, about 4 percentage points above the next-best model, Vidu Q1 (62.7%). As on other benchmarks, several models are tightly grouped, with scores between roughly 58% and 60%. Even established systems like Kling, CogVideoX, and HunyuanVideo continue to struggle with complex stories and consistent object/scene dynamics.

[Figure 2.3.7: VBench-2.0: total score. Total score by model, lowest to highest: CogVideoX-1.5 53.35%; HunyuanVideo 55.30%; StepVideo 55.78%; Sora-480p 58.38%; Kling 1.6 59.00%; Seedance 1.0 Pro (2025-05-28) 59.81%; Wan2.1 60.20%; ToMoviee 2.0 61.78%; Vidu Q1 (2025-04-17) 62.70%; Veo 3 66.72%. Source: VBench Leaderboard, 2026 | Chart: 2026 AI Index report.]

13 Source: https://huggingface.co/spaces/Vchitect/VBench_Leaderboard.

Highlight: Progress in Video Generation

The benchmarks in this section mostly evaluate video models as content generators, scoring them on quality, fidelity, and controllability. However, recent research suggests that video generation models may be developing capabilities that go beyond producing content.

A 2025 Google DeepMind study (Wiedemer et al., 2025) tested whether Veo 3, a video generation model, could solve visual tasks it was never specifically trained for, using only an input image and a text prompt. Across 62 qualitative tasks and seven quantitative evaluations covering more than 18,000 generated videos, the model showed zero-shot abilities in areas traditionally handled by specialized systems. These included perception tasks such as edge detection and segmentation, physical modeling tasks such as buoyancy and rigid body dynamics, and manipulation tasks such as style transfer and object extraction.
The authors also observed early signs of visual reasoning, including maze solving and visual analogy completion, which they describe as “chain of frames,” a parallel to chain-of-thought reasoning in language models where the model appears to reason step by step through successive frames. Performance improved consistently from Veo 2 to Veo 3 across all quantitative tasks and, in some cases, matched or exceeded a dedicated image editing baseline (Nano Banana).

Specialized models still outperform zero-shot video generation on most individual tasks, but the rapid improvement and breadth of zero-shot capability suggest a familiar trajectory. Large language models developed general-purpose language understanding from generative training on web-scale data, and video models trained under similar conditions may be following a comparable path toward general-purpose vision.

2.4 Reasoning

Reasoning benchmarks assess whether models can solve problems that require abstraction and generalization across domains and formats. As performance has improved, newer benchmarks aim to distinguish genuine problem-solving from performance that is driven by memorization or prompt familiarity. However, because models can also produce errors in otherwise fluent responses, efforts are ramping up to measure these error rates alongside reasoning limitations. The AI Index tracks those benchmarks on factual reliability and error rates in Chapter 3. Across the benchmarks in this section, leading models perform well on many tasks but still show gaps on the more difficult items.

General Reasoning

General reasoning refers to a model’s ability to solve unfamiliar problems by applying rules and combining evidence, rather than relying on domain knowledge or memorized patterns. The benchmarks discussed below span multiple domains and tasks and are designed to test multistep inference. One example is multidigit arithmetic, such as long integer multiplication, which tests whether models can execute consistent stepwise computation rather than produce plausible-looking outputs. Other, more complex benchmarks extend this idea to multimodal settings, where models must integrate text with diagrams or plots to reach the correct answer.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

MMMU evaluates multimodal reasoning on college-level subject questions that combine text with visuals such as diagrams, charts, tables, and equations. Example tasks include extracting constraints from a table and applying them to a word problem, or using a diagram to answer a domain-specific question in areas like engineering or medicine.

As of February 2026, the leading model, Gemini 3.1 Pro Preview, scored 88.2% on MMMU, within 0.4 percentage points of the best human expert reference (Figure 2.4.1). Other Gemini variants follow closely, including Gemini 3 Flash (87.6%) and Gemini 3 Pro (87.5%), while GPT-5.2 scores 86.7%. The 2026 models trail behind, with Kimi K2.5 at 84.3% and Claude Opus 4.6 (Thinking) at 83.9%.

[Figure 2.4.1: MMMU: accuracy. Human expert (best): 88.60%. 2025 models: o3 80.42%; Gemini 2.5 Flash Preview 80.75%; Claude Opus 4.5 (Nonthinking) 81.10%; Gemini 2.5 Pro Exp 81.34%; GPT 5 81.50%; Claude Opus 4.5 (Thinking) 82.95%; GPT 5.1 83.18%; GPT 5.2 86.67%; Gemini 3 Pro (11/25) 87.51%; Gemini 3 Flash (12/25) 87.63%; Gemini 3.1 Pro Preview 88.21%. 2026 models: Claude Sonnet 4.6 83.58%; Claude Opus 4.6 (Thinking) 83.87%; Kimi K2.5 84.34%. Source: Vals.ai, 2026 | Chart: 2026 AI Index report.]

14 This chart shows the top 15 models as of February 2026; data source: https://www.vals.ai/benchmarks/mmmu.

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

While MMMU focuses on multimodal reasoning, GPQA evaluates reasoning on difficult, text-only questions designed to test graduate-level problem solving. The questions require models to apply domain-specific concepts and follow multistep logic to reach the correct answer. Example tasks include graduate-level chemistry or physics questions that require working through a multistep solution and choosing the best answer from several very similar options.

Model performance on the GPQA Diamond set has continued to rise above the expert human validator baseline of 81.2% (Figure 2.4.2). In late 2024, OpenAI’s o3 was the first to exceed it, with a score of 87.7%. In 2025, mean accuracy reached 93%, exceeding the expert reference point by 12 percentage points.

[Figure 2.4.2: GPQA on the diamond set: mean accuracy. Mean accuracy by year, 2023 to 2025; latest value shown: 93.00%. Expert human validators: 81.20%. Source: Epoch AI, 2026 | Chart: 2026 AI Index report.]

15 Data source: https://epoch.ai/benchmarks.

ARC-AGI-2

Introduced in 2019, ARC-AGI is a benchmark that tests the ability of systems to generalize beyond their prior training. Despite its name, the benchmark tests a specific form of abstraction and pattern inference rather than general intelligence in a broader sense. Its updated version, ARC-AGI-2, was introduced in 2025 and shifts to abstract puzzle-style tasks that evaluate whether models can infer rules from a small set of examples and apply them to new cases. Example tasks include grid puzzles where the model is given a few example solutions, infers the rule, and uses it to solve a new problem.

Scores on ARC-AGI-2 vary widely across models, and the spread between the highest and lowest scores in the figure is about 46 percentage points (Figure 2.4.3). Gemini 3 Deep Think leads at 84.6%, followed by Gemini 3.1 Pro Preview at 77.1% and GPT-5.2 (Refine) at 72.9%. Several Claude Opus 4.6 variants are clustered together, scoring between 66.3% and 69.2%.

[Figure 2.4.3: ARC-AGI-2. Score by model, lowest to highest: GPT-5.2 Pro (Medium) 38.50%; GPT-5.2 (High) 43.30%; Gemini 3 Deep Think (Preview) 45.10%; GPT-5.2 (X-High) 52.90%; Gemini 3 Pro (Refine) 54.00%; GPT-5.2 Pro (High) 54.20%; Claude Sonnet 4.6 (Max) 58.30%; Claude Sonnet 4.6 (High) 60.40%; Claude Opus 4.6 (120K, Low) 64.60%; Claude Opus 4.6 (120K, Medium) 66.30%; Claude Opus 4.6 (120K, Max) 68.80%; Claude Opus 4.6 (120K, High) 69.20%; GPT-5.2 (Refine) 72.90%; Gemini 3.1 Pro (Preview) 77.10%; Gemini 3 Deep Think (2/26) 84.60%. Source: ARC-AGI-2 Leaderboard, 2026 | Chart: 2026 AI Index report.]

16 This chart shows the top 15 models as of February 2026; data source: https://arcprize.org/leaderboard.

Humanity’s Last Exam

The Humanity’s Last Exam (HLE) benchmark evaluates model performance on 2,700 highly challenging questions across dozens of academic subjects. It is designed as an expert-level, closed-ended benchmark with wide coverage, using a mix of multiple-choice and short-answer formats suitable for automated grading. Example tasks include a graduate-level question that requires applying a concept and providing a single, verifiable answer. Some may include an image, requiring models to integrate visual and textual information.

Between 2024 and 2025, model accuracy on HLE increased by 30 percentage points (Figure 2.4.4). In a single year, accuracy went from under 10% to 38.3%. Even with this jump, the benchmark is designed to stay difficult, and high-confidence errors are still common.

[Figure 2.4.4: Humanity’s Last Exam (HLE): accuracy. Accuracy by year, 2024 to 2025; latest value shown: 38.30%. Source: Center for AI Safety et al., 2026 | Chart: 2026 AI Index report.]

17 Data source: https://lastexam.ai.

Highlight: Time Understanding in MLLMs

Many multimodal models still struggle with something most humans find routine: telling the time. Despite the rapid improvements on expert-level reasoning benchmarks like GPQA and HLE, recent studies show models have trouble reading analog clocks. The task combines visual perception with simple arithmetic: identifying the clock hands and their positions, then converting those into a time value. An error in one step can cascade into the next.

Saxena et al. (2025) tested seven multimodal models on two focused datasets (Figure 2.4.5). ClockQA included 62 analog clock images across six visual styles, including clocks with a black dial or no second hand, while CalendarQA paired yearly calendar images with date-reasoning questions.
On clock reading, even the best-performing model, Gemini-2.0, achieved only 22.6% exact-match accuracy (Figure 2.4.6). Models fared better on the calendar questions, with GPT-o1 reaching 80% accuracy, though errors were more frequent when questions required date arithmetic rather than recognition of well-known holidays (Figure 2.4.7).

ClockBench (Safar, 2025) scaled up the evaluation to 180 clock designs and 720 questions. Humans read correctly formatted clocks accurately 90.1% of the time, while GPT-5.4 High, the top model, reached 50.6% in March 2026 (Figure 2.4.8). The gap of about 40 percentage points is large, but the more striking difference lies in the nature of the errors. When models told the time wrong, their median error ranged from about one to three hours, compared to three minutes for humans.

A study published in IEEE Internet Computing (Fu et al., 2025) looked at why these failures continue to happen. After fine-tuning on 5,000 synthetic clock images, models improved on familiar clock styles but failed to generalize to real-world photos or clocks that had different features, such as distorted dials or thinner hands. When researchers dug into the errors, they identified a pattern: if a model confused the hour and minute hands, its ability to judge hand direction deteriorated. This suggests that the difficulty stems less from training data and more from how models piece together multiple visual cues within a single image. Even as models close the gap with human experts on knowledge-intensive tasks, this kind of visual reasoning remains a persistent challenge.
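The arithmetic half of clock reading, and the way a single perception error cascades into it, can be made concrete. The sketch below converts hand angles into a time and then shows what happens when the two hands are confused. The function names and the convention of measuring angles in degrees clockwise from 12 o'clock are assumptions for illustration; none of the cited studies is claimed to use this code.

```python
def hands_to_time(hour_angle, minute_angle):
    """Convert hand angles (degrees clockwise from 12) to (hour, minute).
    The minute hand moves 6 degrees per minute, the hour hand 30 per hour."""
    minute = round(minute_angle / 6.0) % 60
    hour = int(hour_angle // 30.0) % 12
    return (12 if hour == 0 else hour, minute)


def minutes_apart(t1, t2):
    """Shortest distance in minutes between two times on a 12-hour dial."""
    a = (t1[0] % 12) * 60 + t1[1]
    b = (t2[0] % 12) * 60 + t2[1]
    diff = abs(a - b)
    return min(diff, 720 - diff)


# A clock showing 3:00: hour hand at 90 degrees, minute hand at 0 degrees.
correct = hands_to_time(90.0, 0.0)
# Same image, but the hour and minute hands are mistaken for each other.
swapped = hands_to_time(0.0, 90.0)
error_minutes = minutes_apart(correct, swapped)
```

In this example the arithmetic is applied flawlessly both times, yet swapping the hands turns 3:00 into 12:15, an error of 165 minutes. That is the same order of magnitude as the one-to-three-hour median errors reported for models, and far from the minute-level errors typical of humans.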
[Figure 2.4.5. Source: Saxena et al., 2025.]

[Figure 2.4.6: ClockQA: exact match (higher is better). Exact match by model: Qwen2-VL-7B 0%; Llama 3.2-Vision 3.23%; MiniCPM-V-2.6 3.23%; GPT-o1 4.84%; Claude-3.5-sonnet 6.45%; GPT-4o 8.06%; Gemini-2.0 22.58%. Source: Saxena et al., 2025 | Chart: 2026 AI Index report.]

[Figure 2.4.7: CalendarQA: accuracy (higher is better). Accuracy by model: Llama 3.2-Vision 11.67%; Qwen2-VL-7B 18.33%; MiniCPM-V-2.6 20%; Gemini-2.0 31.67%; GPT-4o 43.33%; Claude-3.5-sonnet 46.67%; GPT-o1 80%. Source: Saxena et al., 2025 | Chart: 2026 AI Index report.]

[Figure 2.4.8: ClockBench: accuracy. Human baseline: 90.70%. Accuracy by model, lowest to highest: Claude Opus 4.6 8.90%; Mistral Medium 3.1 10%; GPT-5 Pro 11.10%; GPT-5 High 11.10%; Gemini 2.5 Flash 11.10%; o3 High 12.20%; o3 Pro 14.40%; Qwen 3-VL 235B Thinking 14.40%; GPT-5.2 High 15%; Gemini Robotics ER 1.5 15%; Gemini 2.5 Pro 18.90%; Gemini 3 Pro 28.90%; Gemini 3.1 Pro 32.20%; Qwen 3-VL 235B Instruct 39.40%; GPT-5.4 High 50.60%. Source: ClockBench Leaderboard, 2026 | Chart: 2026 AI Index report.]

18 Data source: https://clockbench.ai.

Planning

In addition to general reasoning, the AI Index tracks planning benchmarks that assess models’ ability to sequence actions over multiple steps to achieve a goal. Models have to keep track of what has already happened, avoid invalid actions, and maintain consistency even as problems get longer and more complex. Unlike single-shot reasoning questions, planning evaluations can expose failures that only emerge over longer horizons, including compounding errors or forgetting earlier constraints. The choice of benchmark matters, as different tasks surface different types of failures.

PlanBench

Classical planners like LAMA search systematically through possible states and produce correct plans when they find a solution. Language models instead generate plans based on learned patterns, which means they can produce plausible-looking sequences that contain invalid steps or miss constraints. PlanBench evaluates end-to-end planning by prompting models to generate a full plan from a structured problem description across several planning domains. A domain is a type of problem with its own rules and goals, such as stacking blocks in a specific order, navigating routes, or transporting packages between locations. The benchmark reports performance as the number of tasks solved in each domain, with up to 45 tasks per domain, compared to LAMA as a classical planning baseline.

No single model leads across every domain (Figure 2.4.9). Under standard planning, LAMA leads in several domains, including Miconic (45/45), Rovers (34/45), and Transport (33/45). In more structured domains such as Childsnack and Spanner, frontier models match or exceed LAMA, with GPT-5 reaching 38/45 on Childsnack and 45/45 on Spanner.

When task descriptions are scrambled to disguise their structure, performance decreases for most models in several domains, though the effect depends on the domain and model (Figure 2.4.10). For example, DeepSeek R1 falls to 3/45 on Blocksworld and 0/45 on Floortile and Sokoban. Similarly, GPT-5 declines to 12/45 on Blocksworld and 7/45 on Sokoban.
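The difference between a plausible-looking plan and a valid one can be shown with a small plan validator. This is a minimal sketch of a Blocksworld-style check, assuming a simplified state model in which each block rests either on the table or on one other block. It is not PlanBench's own validation procedure, and the blocks, goal, and plans are hypothetical.

```python
def is_clear(state, block):
    """A block is clear when no other block rests on top of it."""
    return all(support != block for support in state.values())


def validate_plan(initial, goal, plan):
    """Simulate the plan step by step. Each step is (block, destination).
    Returns (ok, message); fails on the first step whose preconditions
    do not hold, or if the final state misses the goal."""
    state = dict(initial)
    for i, (block, dest) in enumerate(plan, start=1):
        if block not in state:
            return False, f"step {i}: unknown block {block}"
        if not is_clear(state, block):
            return False, f"step {i}: {block} has a block on top of it"
        if dest != "table" and (
            dest == block or dest not in state or not is_clear(state, dest)
        ):
            return False, f"step {i}: cannot place {block} on {dest}"
        state[block] = dest
    if all(state.get(block) == support for block, support in goal.items()):
        return True, "goal reached"
    return False, "plan is executable but does not reach the goal"


# Hypothetical task: B starts on A; the goal is a tower with A on B on C.
initial = {"A": "table", "B": "A", "C": "table"}
goal = {"A": "B", "B": "C"}

good_plan = [("B", "C"), ("A", "B")]
bad_plan = [("A", "B"), ("B", "C")]  # same moves, wrong order

good_result = validate_plan(initial, goal, good_plan)
bad_result = validate_plan(initial, goal, bad_plan)
```

The second plan uses exactly the same moves as the first and reads naturally, but it fails at step 1 because A still has B on top of it. A systematic planner would never emit that step, whereas a pattern-based generator can.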
PlanBench 99 2.4 R E AS O N I N G | T E C H N I CA L P E R F O R M A N C E | A I I N D E X R E P O R T 2026 28 18 10 45 34 21 15 33 21 38 10 35 14 15 45 27 21 38 6 23 6 9 30 22 19 33 5 27 9 10 38 16 Blocksworld Childsnack Floortile Miconic Rovers Sokoban Spanner Transport 0 5 10 15 20 25 30 35 40 45 LAMA GPT-5 Gemini R1 Domain Tasks solved PlanBench: task solved on standard planning Source: Corrêa et al., 2025 | Chart: 2026 AI Index report Figure 2.4.919 Figure 2.4.1020 19 Data source: https://arxiv.org/pdf/2511.09378. 20 Data source: https://arxiv.org/pdf/2511.09378. 28 18 10 45 34 21 15 33 12 33 3 28 10 7 38 21 28 25 5 9 8 5 39 27 3 28 0 14 5 0 34 9 Blocksworld Childsnack Floortile Miconic Rovers Sokoban Spanner Transport 0 5 10 15 20 25 30 35 40 45 LAMA GPT-5 Gemini R1 Domain Tasks solved PlanBench: task solved on obfuscated planning Source: Corrêa et al., 2025 | Chart: 2026 AI Index report 100
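The comparison between standard and obfuscated planning can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts are transcribed from Figures 2.4.9 and 2.4.10 under the assumption that each chart lists its series in legend order (LAMA, GPT-5, Gemini, R1), and the function names are hypothetical rather than part of PlanBench.

```python
# Tasks solved per domain (out of 45), transcribed from Figures 2.4.9 and 2.4.10.
# Assumption: the series order in the charts follows the legend order.
DOMAINS = ["Blocksworld", "Childsnack", "Floortile", "Miconic",
           "Rovers", "Sokoban", "Spanner", "Transport"]
TASKS_PER_DOMAIN = 45

STANDARD = {
    "GPT-5":  [21, 38, 10, 35, 14, 15, 45, 27],
    "Gemini": [21, 38, 6, 23, 6, 9, 30, 22],
    "R1":     [19, 33, 5, 27, 9, 10, 38, 16],
}
OBFUSCATED = {
    "GPT-5":  [12, 33, 3, 28, 10, 7, 38, 21],
    "Gemini": [28, 25, 5, 9, 8, 5, 39, 27],
    "R1":     [3, 28, 0, 14, 5, 0, 34, 9],
}


def per_domain_change(model: str) -> dict:
    """Obfuscated minus standard tasks solved, per domain (negative = drop)."""
    pairs = zip(DOMAINS, STANDARD[model], OBFUSCATED[model])
    return {domain: obf - std for domain, std, obf in pairs}


def solve_rate(counts: list) -> float:
    """Fraction of all tasks solved across the eight domains."""
    return sum(counts) / (TASKS_PER_DOMAIN * len(DOMAINS))


if __name__ == "__main__":
    for model in STANDARD:
        change = per_domain_change(model)
        print(model,
              f"standard={solve_rate(STANDARD[model]):.1%}",
              f"obfuscated={solve_rate(OBFUSCATED[model]):.1%}",
              change)
```

Looking at per-domain differences rather than only totals makes the domain dependence visible: under these transcribed counts, one model can lose ground in most domains while gaining in another.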

2.5 Performance in Specific Domains

As AI models have improved on general reasoning and knowledge benchmarks, attention has shifted to how well they perform on tasks requiring specialized expertise. The benchmarks in this section test models in four professional and academic domains: coding, mathematics, finance, and legal reasoning. Each has its own vocabulary, conventions, and standards for what counts as a correct and comprehensible answer. Many of these benchmarks are new, reflecting growing demand for domain-specific evaluation. Unless otherwise noted, the results reported below reflect model performance as of early 2026.

Software

Coding benchmarks test whether models can go beyond answering questions about code and actually write, debug, and ship working software. The tasks in this section range from resolving real GitHub issues to building full web applications from scratch, reflecting a shift in evaluation toward measuring what models can deliver end to end rather than in isolated snippets.

SWE-bench

SWE-bench evaluates models on their ability to resolve real-world software issues collected from GitHub. Each task gives the model a codebase and an issue description, and the model has to produce a working patch. SWE-bench Lite is a smaller, more accessible subset, while SWE-bench Verified uses human-validated issues to ensure more consistent and accurate grading.

On SWE-bench Verified, top models are tightly clustered in the low-to-mid 70s (Figure 2.5.1). As of February 2026, Claude 4.5 Opus (high reasoning) led at approximately 76.8%, with several others, including Kimi K2.5, GPT-5.2, and Gemini 3 Flash (high reasoning), grouped between 70% and 76%. This pattern recurs across several benchmarks in this chapter, where high-performing models score within a few percentage points of each other.
Figure 2.5.1 [21]. SWE-bench: percent solved
Source: SWE-bench Leaderboard, 2026 | Chart: 2026 AI Index report
SWE-bench Lite: CodeFuse-CGM: 44.00%; KGCompass + Claude 3.5 Sonnet (20241022): 46.00%; DARS Agent: 47.00%; SWE-agent + Claude 3.7 Sonnet: 48.00%; ExpeRepair-v1.0: 48.33%; EntroPO + R2E + Qwen3-Coder-30B-A3B-Instruct: 49.67%; SWE-agent + Claude 4 Sonnet: 56.67%; KGCompass + Claude 4 Sonnet (20250514): 58.33%; Refact.ai Agent: 60.00%; ExpeRepair-v1.0 + Claude 4 Sonnet: 60.33%
SWE-bench Verified: Kimi K2.5 (high reasoning): 70.80%; Claude 4.5 Sonnet (high reasoning): 71.40%; GPT-5-2 Codex: 72.80%; GLM-5 (high reasoning): 72.80%; GPT-5-2 (high reasoning): 72.80%; GPT 5.2 Codex: 72.80%; Claude Opus 4.6: 75.60%; Gemini 3 Flash (high reasoning): 75.80%; MiniMax M2.5 (high reasoning): 75.80%; Claude 4.5 Opus (high reasoning): 76.80%

[21] This chart shows the top 10 models for SWE-bench Verified and Lite as of February 2026. For Verified, only results using the mini-SWE-agent-v2 filter are included. This means all models were tested under the same agent workflow, so differences in scores reflect the underlying model rather than differences in the surrounding system. Data source: https://www.swebench.com/index.html.

Terminal-Bench

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can autonomously handle real-world, end-to-end tasks, from compiling code to training models and setting up servers. These are the kinds of tasks a developer might do in a day of work, and they require an agent to chain together multiple steps without human guidance. Accuracy on Terminal-Bench 2.0 has improved significantly over the past year, increasing from 20% in February 2025 to 77.3% in early 2026 (Figure 2.5.2).
Figure 2.5.2 [22]. Terminal-Bench 2.0: accuracy, February 2025 to February 2026 (latest value: 77.30%)
Source: Terminal-Bench 2.0 Leaderboard, 2026 | Chart: 2026 AI Index report

[22] Data source: https://www.tbench.ai/leaderboard/terminal-bench/2.0.

Vibe Code Bench

Vibe Code Bench is the first benchmark designed to test whether AI models can autonomously build complete, end-to-end web applications from scratch. Rather than measuring coding assistance, it evaluates real software delivery by testing whether a model can take a prompt and produce a functional application.

Across models, performance varies quite a bit (Figure 2.5.3). Claude Opus 4.6 (Nonthinking) leads at 56.5%, followed by GPT 5.2 at nearly 47%. Scores drop after GPT 5.3 Codex (41.4%) to under 30%, with several models falling below 15%. The spread between the top and bottom models is about 46 percentage points, and even the leading model solves only about half of the tasks, suggesting that autonomous application building remains a difficult task.

Figure 2.5.3 [23]. Vibe Code Bench v1.1: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
GLM 5.1: 31.46%; Gemini 3.1 Pro Preview (02/26): 32.03%; GPT 5.2 Codex: 37.91%; GPT 5.4 Mini: 47.97%; Claude Sonnet 4.6: 51.48%; Claude Opus 4.6 (Thinking): 53.50%; GPT 5.2: 53.50%; Claude Opus 4.6 (Nonthinking): 57.57%

[23] Data source: https://www.vals.ai/benchmarks/vibe-code.
[24] Figure 2.5.4 data source: https://epoch.ai/benchmarks/.
[25] Figure 2.5.5 data source: https://matharena.ai/?comp=hmmt--hmmt_feb_2026&view=problem.
Mathematics

Beyond coding and language tasks, mathematics has become a key testing ground for model reasoning. The benchmarks in this section range from competition-level problem solving to formal proof writing.

FrontierMath

FrontierMath is a benchmark introduced by Epoch AI that features hundreds of original, exceptionally challenging mathematical problems. The problems are designed to test genuine mathematical reasoning rather than pattern recognition, and even experienced mathematicians may need hours or days to solve them.

Since 2024, accuracy on FrontierMath Tier 4 has risen from near 0% to 31.3%, with GPT-5.2 Pro (Web App) leading by the end of 2025 (Figure 2.5.4). The benchmark is designed to stay difficult, so even with this steep climb in a short time, the best models still fail on roughly two out of three problems at the hardest tier.

Figure 2.5.4. FrontierMath Tier 4: pass@1 accuracy, 2024 to 2025 (latest value: 31.30%)
Source: Epoch AI, 2026 | Chart: 2026 AI Index report

MathArena

MathArena is a rolling benchmark that uses newly released math contests to test models on fresh, competition-style problems. It draws from well-known competitions at the high school and olympiad level, including AIME, HMMT, USAMO, the International Mathematical Olympiad (IMO), and Project Euler, running models soon after each one to reduce the risk of training-data contamination. Numerical answers are auto-graded, while human graders score written proofs, and results are posted on a public leaderboard.

Accuracy on MathArena has increased from about 83% in November 2025 to 97% in December 2025 (Figure 2.5.5). On answer-based problems, leading models reach or surpass the level of top human contestants. However, on proof-based tasks, they still perform well below humans when asked to produce rigorous, step-by-step mathematical proofs. Getting the right answer and showing the reasoning behind it remain distinct challenges for current systems.

Figure 2.5.5. MathArena: accuracy, November 2025 to December 2025 (latest value: 96.97%)
Source: MathArena Leaderboard, 2026 | Chart: 2026 AI Index report

Highlight: Theorem Proving

In mathematics, getting the right answer is only one part of the challenge. A correct result backed by flawed reasoning would earn little credit at a competition or in a journal. Theorem proving, the process of constructing a rigorous, step-by-step argument for why a result must be true, remains one of the hardest tasks for AI systems. Until recently, even frontier models struggled to produce proofs capable of passing expert review.

As covered in last year’s AI Index, DeepMind’s AlphaProof and AlphaGeometry 2 solved four of six problems at the 2024 International Mathematical Olympiad (IMO), winning a silver medal with 28 points. That result required experts to translate problems into formal languages like Lean and took days of computation. In 2025, Gemini Deep Think solved five of six problems and scored 35 points, winning a gold medal, while working end to end in natural language within the 4.5-hour competition time limit (Luong and Lockhart, 2025). The jump from silver to gold in a single year, with a far simpler pipeline, marks one of the fastest capability gains in competitive mathematics.

IMO-Bench (Luong et al., 2025) is a new benchmark suite designed to measure whether that kind of progress is genuine reasoning or just better answer guessing. It includes three components. IMO-AnswerBench tests models on 400 Olympiad-style problems across algebra, combinatorics, geometry, and number theory, with verifiable short answers.
IMO-ProofBench evaluates whether models can produce rigorous step-by-step proofs for 60 problems ranging from pre-IMO to full IMO difficulty. IMO-GradingBench provides a dataset with 1,000 examples of solutions and human-graded proofs to support the development of automated proof grading systems.

Grading mathematical proofs has traditionally required human experts, which limits how many models and solutions can be evaluated at scale. On IMO-ProofBench, scores assigned by an automated grading system closely track those given by human experts, with a Pearson correlation of 0.96 on basic and 0.93 on advanced problems (Figure 2.5.6). That level of agreement indicates that automated grading could be a reasonable stand-in, though the benchmark authors recommend human verification for high-stakes results.

With that grading approach validated, the benchmark results reveal the extent of the gap between models (Figure 2.5.7). Aletheia leads at 91.9%, followed by Gemini 3 Deep Think at 76.7% and Gemini Deep Think (IMO Gold) at 65.7%. From there, scores drop quite a bit. GPT-5.2 Thinking (high) reaches 35.7%, Gemini 3 Pro scores 30%, and GPT-5.1 falls to 7.1%. The spread between the top and bottom models is about 85 percentage points. A breakdown by problem source in the IMO-Bench paper suggests that some of these scores may also reflect familiarity with existing competition problems rather than general reasoning ability, reinforcing a pattern seen with MathArena. Producing correct answers and rigorous proofs remain very different tasks, with most models far more performant on the former.

[26] Figure 2.5.7 data source: https://imobench.github.io/.
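To make the agreement statistic concrete, the following is a minimal sketch of how a Pearson correlation between human and automated proof grades could be computed. The grade lists are invented for illustration (they are not IMO-ProofBench data), the 0 to 7 scale is assumed from standard IMO scoring, and `pearson` is a hypothetical helper rather than the benchmark authors' code.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grades for eight proofs on an assumed 0-7 scale.
human_grades = [7, 6, 0, 1, 7, 3, 0, 6]
auto_grades = [7, 7, 0, 0, 6, 3, 1, 6]

if __name__ == "__main__":
    r = pearson(human_grades, auto_grades)
    print(f"Pearson r between human and automated grades: {r:.2f}")
```

A coefficient near 1 means the automated grader ranks and spaces proofs much as the human graders do. It does not by itself show that the two agree on absolute scores, which is one reason human verification can still matter for high-stakes results.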
Figure 2.5.7. IMO-ProofBench: score
Source: IMO-ProofBench Leaderboard, 2026 | Chart: 2026 AI Index report
GPT-5.1: 7.1%; Grok 4.1 Fast Reasoning: 18.6%; Claude Opus 4.5: 23.8%; GPT-5 Pro: 28.6%; Gemini 3 Pro: 30%; GPT-5.2 Thinking (high): 35.7%; Gemini Deep Think (IMO Gold): 65.7%; Gemini 3 Deep Think: 76.7%; Aletheia: 91.9%

Figure 2.5.6
Source: Luong et al., 2025

Finance

This section covers benchmarks designed to evaluate AI systems on finance-specific tasks. Unlike general reasoning benchmarks, these tests require models to handle domain-specific language, extract structured information from financial documents, and apply professional judgment in areas including tax law, the mortgage process, and financial analysis.

TaxEval

TaxEval v2 is a benchmark designed to test how well models handle challenging tax-related questions. It contains over 1,500 expert-verified questions developed with input from tax and finance professionals, covering numerical reasoning, semantic analysis, problem solving, and application of compliance rules. Models are scored on two dimensions: whether the answer is factually correct and whether the step-by-step reasoning is clear and expert-like.

Performance on TaxEval v2 shows only a small difference across models (Figure 2.5.8). All 15 top models fall within a 3 percentage point range, from 77.1% (Claude Sonnet 4.6) to 74% (Claude 3.7 Sonnet Thinking).
Figure 2.5.8 [27]. TaxEval v2: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
Kimi K2.5: 74.20%; o1: 74.28%; Claude Opus 4.5 (Nonthinking): 74.33%; GPT 4o (2024-11-20): 74.53%; o3: 74.57%; o4 Mini: 74.78%; Claude Opus 4.5 (Thinking): 74.86%; GPT 5.1: 74.86%; GPT 4.1: 75.06%; GPT 5 Mini: 75.22%; Grok 4 Fast (Reasoning): 75.70%; GPT 5.2: 75.76%; Grok 3: 75.88%; Claude Opus 4.6 (Thinking): 75.96%; Claude Sonnet 4.6: 77.11%

[27] Data source: https://www.vals.ai/benchmarks/tax_eval_v2.

MortgageTax

MortgageTax evaluates how well models can extract structured information from real mortgage tax certificates, using both text and document images. The task involves two types of extraction: semantic extraction asks the model to identify fields like year, parcel number, and county, while numerical extraction requires computing the annualized amount due. The dataset includes 1,258 documents split across public validation, private validation, and held-out test sets.

Scores on MortgageTax follow a similar pattern to TaxEval, with the top 15 models grouping within a narrow performance band (Figure 2.5.9). Gemini 3.1 Pro Preview leads at 69.4%, and GPT 4.1 is at the bottom of the group at 65.9%, a difference of about 3.5 percentage points. While several Gemini models occupy the top positions, the overall accuracy level does not reach 70%, which suggests that models are not yet entirely or reliably able to extract and compute financial information from document images.
Figure 2.5.9 [28]. MortgageTax: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
GPT 4.1: 65.94%; Kimi K2.5: 66.53%; Claude 3.7 Sonnet (Thinking): 66.85%; GPT 5 Mini: 66.89%; GPT 5.2: 67.13%; Gemini 2.5 Pro Exp: 67.17%; Claude Opus 4.5 (Thinking): 67.69%; Claude Sonnet 4.6: 67.73%; Claude Opus 4.6 (Thinking): 68.52%; Claude Opus 4.5 (Nonthinking): 68.68%; Claude 3.7 Sonnet (Nonthinking): 68.68%; Gemini 3 Flash (12/25): 68.72%; Gemini 2.5 Pro: 68.92%; Gemini 3 Pro (11/25): 69.08%; Gemini 3.1 Pro Preview: 69.40%

[28] Data source: https://www.vals.ai/benchmarks/mortgage_tax.

CorpFin

CorpFin tests whether models can comprehend and extract information from long, dense financial documents, specifically credit agreements that can exceed 200 pages. Questions span basic term extraction, numeric reasoning, summarization, cross-referencing multiple sections, and industry-specific interpretation, all developed with input from financial analysts, lawyers, and academics. Beyond factual accuracy, the benchmark evaluates whether models can navigate and make sense of long, jargon-heavy legal and financial text. It defines three tasks with different context setups (Exact Pages, Shared Max Context, and Max Fitting Context) to see how models perform depending on document access.

Similar to the other benchmarks, performance on CorpFin v2 is tightly clustered (Figure 2.5.10). Kimi K2.5 leads at 68.26%, with GPT 4.1 at the bottom at 63.05%, a spread of about 5 percentage points. As with MortgageTax, no model broke 70%.
Figure 2.5.10 [29]. CorpFin v2: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
GPT 4.1: 63.05%; Gemini 3 Pro (11/25): 63.68%; GPT 5.1: 63.83%; Gemini 3.1 Pro Preview (02/26): 64.49%; Claude Opus 4.5 (Thinking): 65.07%; Qwen 3.5 Plus: 65.31%; Claude Sonnet 4.6: 65.31%; GPT 5.2: 65.89%; Grok 4.1 Fast (Reasoning): 65.97%; Grok 4: 66.05%; Gemini 3 Flash (12/25): 66.43%; Grok 4 Fast (Reasoning): 66.90%; Claude Opus 4.6 (Thinking): 67.02%; Qwen 3 Max Thinking: 68.03%; Kimi K2.5: 68.26%

[29] Data source: https://www.vals.ai/benchmarks/corp_fin_v2.

Finance Agent

Developed in collaboration with Stanford researchers, a Global Systemically Important Bank, and industry experts, Finance Agent evaluates AI agents’ ability to perform tasks typical of an entry-level financial analyst. It includes 537 carefully crafted questions that test skills such as information retrieval, market research, and financial projections.

On Finance Agent v1.1, performance was more varied than on other finance benchmarks (Figure 2.5.11). Claude Sonnet 4.6 leads at 63.33%, and scores taper down to 50.62% for Kimi K2.5, a spread of about 13 percentage points. Even the top score sits below two-thirds accuracy, reflecting the domain-specific challenges seen across the other finance benchmarks, as well as the broader difficulty of agentic tasks, which is discussed below in Section 2.6, Agent Benchmarks.
Figure 2.5.11 [30]. Finance Agent v1.1: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
Kimi K2.5: 50.62%; GPT 5 Mini: 51.93%; GPT 5: 52.15%; Grok 4.1 Fast (Reasoning): 52.45%; GLM 5: 53.18%; Grok 4: 53.51%; Qwen 3.5 Plus: 54.47%; Claude Sonnet 4.5 (Thinking): 54.50%; Gemini 3 Pro (11/25): 55.15%; GPT 5.1: 55.31%; GPT 5.2: 58.53%; Claude Opus 4.5 (Thinking): 58.81%; Gemini 3.1 Pro Preview (02/26): 59.72%; Claude Opus 4.6 (Thinking): 60.05%; Claude Sonnet 4.6: 63.33%

[30] Data source: https://www.vals.ai/benchmarks/finance_agent.

Law

AI is also being evaluated in the legal domain, where tasks range from interpreting court decisions to applying rules to new fact patterns. The benchmarks covered below reflect how well models handle legal reasoning tasks that require grounding in specific documents rather than general knowledge.

CaseLaw

CaseLaw v2 is a benchmark for evaluating LLMs on real-world litigation and legal research tasks. It uses recent United States and Canada court decisions that are dated after most models’ training cutoffs and are not accessible at scale due to licensing restrictions, which helps ensure the model is reasoning over the provided documents rather than relying on memorized legal knowledge. The benchmark includes 300 tests in its validation set and 104 in its test set, spanning single-case and multicase reasoning, across seven legal reasoning dimensions, including retrieving key precedents, multidocument question answering, calculations, tables, and chronological reasoning.

GPT-5.1 leads on CaseLaw v2 at 73.4% accuracy, with GPT 4.1 following at 69.9% (Figure 2.5.12). The rest of the top 15 models fall between 62% and 66%, a sign there is meaningful room for improvement. One recurring issue is that models tend to lean on general knowledge rather than grounding their answers in the supplied documents, even when explicitly instructed to do so.
Figure 2.5.12 [31]. CaseLaw v2: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
Claude Opus 4.6 (Thinking): 62.06%; Claude Sonnet 4.5 (Thinking): 62.17%; Claude Opus 4.5 (Thinking): 62.59%; Gemini 2.5 Pro: 63.88%; Claude Sonnet 4.6: 63.99%; Command A: 64.52%; Gemini 3.1 Pro Preview (02/26): 64.84%; Kimi K2 Thinking: 65.70%; Grok 4 Fast (Reasoning): 65.70%; Grok 4: 65.81%; GPT 5.2: 66.02%; GPT 5: 66.45%; GPT 5 Mini: 68.49%; GPT 4.1: 69.88%; GPT 5.1: 73.42%

[31] Data source: https://www.vals.ai/benchmarks/case_law_v2.

LegalBench

LegalBench is a crowd-sourced benchmark for legal reasoning on tasks that mirror real legal work. Rather than test general question answering, it focuses on careful reading, spotting issues, and applying rules to facts. The benchmark covers six types of legal reasoning: issue spotting, rule recall, outcome prediction, rule application, interpretation of legal text, and rhetorical understanding. The results below reflect model performance as of early 2026.

On the leaderboard, the top 15 models all score above 83% (Figure 2.5.13). The top overall performer is Gemini 3.1 Pro Preview (02/26) at 87.4%, followed closely by Gemini 3 Pro (11/25) with 87% accuracy. The total spread across all 15 models is about 4 percentage points, a narrow range that makes it hard to differentiate among them.

[32] Figure 2.5.13 data source: https://www.vals.ai/benchmarks/legal_bench.
Figure 2.5.13. LegalBench: accuracy
Source: Vals.ai, 2026 | Chart: 2026 AI Index report
GLM 4.7: 83.36%; Claude Opus 4.1 (Nonthinking): 83.46%; o3: 83.76%; Gemini 2.5 Flash Preview 4/17 (Nonthinking): 83.80%; GLM 5: 84.06%; Claude Sonnet 4.5 (Thinking): 84.08%; Gemini 2.5 Pro Exp: 84.32%; Claude Opus 4.5 (Thinking): 84.60%; Qwen 3.5 Plus: 85.10%; Claude Opus 4.6 (Thinking): 85.30%; GPT 5.1: 85.68%; GPT 5: 86.02%; Gemini 3 Flash (12/25): 86.86%; Gemini 3 Pro (11/25): 87.04%; Gemini 3.1 Pro Preview (02/26): 87.40%

2.6 AI Agents

Agent benchmarks test whether AI systems can go beyond answering questions and actually complete multistep tasks in realistic environments. These tasks often involve navigating software, calling tools, managing files, or interacting with websites and databases. More complex tasks may require agents to orchestrate entire workflows, coordinating across multiple tools and systems to achieve a goal. For example, an agent might need to search a database, apply a policy rule, and then update a customer record, all in a single conversation. Unless otherwise noted, the results reported below reflect model performance as of early 2026.

GAIA

GAIA is a benchmark for general AI assistants, introduced by Meta in May 2024. It tests whether models can handle the kinds of multistep, real-world questions a capable assistant would need to answer. These questions often require web browsing, file handling, and reasoning across multiple sources. Accuracy on GAIA has risen from about 20% in January 2025 to 74.5% in September 2025 (Figure 2.6.1). The human baseline sits at 92%, leaving a gap of about 17.5 percentage points.

Figure 2.6.1 [33]. GAIA: accuracy, January 2025 to September 2025 (latest value: 74.50%; human baseline: 92%)
Source: GAIA Leaderboard, 2026 | Chart: 2026 AI Index report

[33] Data source: https://hal.cs.princeton.edu/gaia.
[34] Figure 2.6.2 data source: https://epoch.ai/benchmarks/.

OSWorld

OSWorld is a scalable, real computer environment designed to evaluate multimodal AI agents on open-ended tasks across operating systems like Ubuntu, Windows, and macOS. It includes 369 tasks involving desktop and web apps, file operations, and multi-application workflows.
Computer science students solve about 72% of these tasks with a median time of roughly two minutes, while the strongest models have historically reached only 1%–12% success, especially on tasks involving graphical interfaces and multi-app workflows. However, the gap has recently narrowed considerably, with Claude Opus 4.5 leading on OSWorld at 66.3% accuracy (Figure 2.6.2). This puts the best model within about 6 percentage points of human performance. OSWorld is one of the benchmarks in this section where the gap between models and humans has closed the fastest.

Figure 2.6.2. OSWorld: accuracy, 2024 to 2025 (latest value: 66.30%; human baseline: 72.35%)
Source: Epoch AI, 2026 | Chart: 2026 AI Index report

WebArena

WebArena is a realistic web environment for evaluating autonomous web agents. It introduces 812 long-horizon tasks written as natural language intents, such as finding information, navigating sites, and configuring content across multiple pages. Rather than comparing action traces, WebArena checks whether the agent actually achieved its goal by verifying the resulting state of the site, including databases, page content, and URLs.

Success rates on WebArena have steadily increased from about 15% in 2023 to 74.3% in early 2026 (Figure 2.6.3). The best models are now within 4 percentage points of the human baseline of 78.2%. Of all the agent benchmarks in this section, WebArena shows the smallest remaining gap between models and human performance.

[35] Figure 2.6.3 data source: https://docs.google.com/spreadsheets/d/1M801lEpBbKSNwP-vDBkC_pF7LdyGU1f_ufZb_NWNBZQ/edit?gid=0#gid=0.
[36] Figure 2.6.4 data source: https://github.com/openai/mle-bench.
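The gap-to-human comparisons above are simple subtractions, and a short sketch makes them easy to re-check. The scores and baselines below are the ones quoted in this section; the dictionary layout and function name are illustrative assumptions, not part of any benchmark's tooling.

```python
# (best model score, human baseline) in percent, as quoted in this section.
AGENT_BENCHMARKS = {
    "GAIA": (74.5, 92.0),
    "OSWorld": (66.3, 72.35),
    "WebArena": (74.3, 78.2),
}


def gap_to_human(best_model: float, human_baseline: float) -> float:
    """Remaining gap in percentage points, rounded to two decimals."""
    return round(human_baseline - best_model, 2)


# Gap per benchmark, and the benchmark with the smallest remaining gap.
gaps = {name: gap_to_human(*scores) for name, scores in AGENT_BENCHMARKS.items()}
smallest_gap = min(gaps, key=gaps.get)

if __name__ == "__main__":
    for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
        print(f"{name}: {gap} percentage points below the human baseline")
    print("Smallest remaining gap:", smallest_gap)
```

Gaps expressed in percentage points are not directly comparable across benchmarks with different task sets and baselines, so this ordering is best read as descriptive rather than as a ranking of difficulty.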
Figure 2.6.3. WebArena: success rate, 2023 to 2026 (latest value: 74.30%; human baseline: 78.24%)
Source: WebArena Leaderboard, 2026 | Chart: 2026 AI Index report

MLE-bench

MLE-bench evaluates the machine learning engineering capabilities of AI agents. It consists of 75 Kaggle competitions spanning tasks in NLP, computer vision, signal processing, and more. The competitions were manually curated, with rebuilt train and test splits and reimplemented grading code, so agents can be scored locally and compared directly against human Kaggle leaderboards and medal thresholds.

Agents have also made significant progress on MLE-bench, advancing from about 17% success in 2024 to 64.4% in early 2026 (Figure 2.6.4). This level of improvement in such a short time points to growing capability on end-to-end machine learning tasks, though competition-style problems are more structured than the open-ended work that characterizes most real-world data science.

Figure 2.6.4. MLE-bench: success rate, 2024 to 2026 (latest value: 64.44%)
Source: MLE-bench Leaderboard, 2026 | Chart: 2026 AI Index report

Cybench

Cybench is a benchmark framework for evaluating the capabilities of AI agents in cybersecurity. It includes 40 professional-level tasks across six capture-the-flag categories, including cryptography, web security, reverse engineering, forensics, and exploitation. Tasks are grounded in real human difficulty via “first solve time,” ranging from two minutes up to almost 25 hours, giving the benchmark a very high difficulty ceiling.

The unguided solve rate on Cybench is 93%, up from 15% in 2024 (Figure 2.6.5). This is the steepest improvement rate across all benchmarks in this section, and it may highlight cybersecurity challenge tasks as a good fit for current agent capabilities.

Figure 2.6.5 [37]. Cybench: unguided % solved, 2024 to 2026 (latest value: 93%)
Source: Cybench Leaderboard, 2026 | Chart: 2026 AI Index report

τ-bench

τ-bench takes a different approach by testing agents on real-world tasks that involve chatting with a user and calling external tools or APIs. It places the agent in realistic domains, such as retail and airline, with underlying databases, policy constraints, and multiturn conversations. Success is measured by whether the agent produces the correct final outcome, which is often verifiable from the resulting database state. This makes it a test of end-to-end tool use and rule-following in interactive settings, not just language ability.

Leading models on τ-bench achieve pass@1 scores between 62.9% and 70.2% (Figure 2.6.6). Claude Opus 4.5 leads at 70.2%, followed by GPT-5.2 at 69.9% and Qwen3.5 at 68.4%. The spread across the top seven models is a narrow 7.3 percentage points, with no model exceeding 71%, suggesting that managing multiturn conversations while correctly using tools and following policy constraints remains difficult even for frontier models.

Figure 2.6.6 [38]. τ-bench: pass@1
Source: τ-bench Leaderboard, 2026 | Chart: 2026 AI Index report
Claude Sonnet 4.5: 62.9%; GLM-5: 63.2%; Gemini 3 Pro: 65.8%; Gemini 3 Flash: 67.8%; Qwen3.5-397B-A17B: 68.4%; GPT-5.2: 69.9%; Claude Opus 4.5: 70.2%

[37] Data source: https://cybench.github.io/.
[38] Data source: https://taubench.com/#leaderboard?benchmark=text.
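Outcome-based grading of the kind described for τ-bench (and, in a similar spirit, WebArena's state verification) can be pictured with a toy example. The sketch below is an assumption-laden illustration, not the τ-bench harness: the `DbState` layout, the retail order records, and both function names are hypothetical, and pass@1 is taken here simply as the fraction of tasks whose single trial ends in the expected state.

```python
from typing import Dict, List, Tuple

# Hypothetical database model: table name -> record id -> field values.
DbState = Dict[str, Dict[str, Dict[str, str]]]


def outcome_matches(final_state: DbState, expected_state: DbState) -> bool:
    """A trial passes only if the final database equals the expected one."""
    return final_state == expected_state


def pass_at_1(trials: List[Tuple[DbState, DbState]]) -> float:
    """Fraction of single-attempt trials that end in the expected state."""
    if not trials:
        raise ValueError("at least one trial is required")
    passed = sum(outcome_matches(final, expected) for final, expected in trials)
    return passed / len(trials)


# Toy retail task: the policy-compliant outcome is a refunded order.
expected = {"orders": {"A100": {"status": "refunded"}}}
correct_run = {"orders": {"A100": {"status": "refunded"}}}
wrong_run = {"orders": {"A100": {"status": "cancelled"}}}

trials = [(correct_run, expected), (wrong_run, expected),
          (correct_run, expected), (correct_run, expected)]
score = pass_at_1(trials)

if __name__ == "__main__":
    print(f"pass@1 on the toy trials: {score:.2f}")
```

Checking the end state rather than the transcript means an agent gets no credit for a fluent conversation that leaves the database wrong, which is the property the section attributes to this style of evaluation.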

2.7 Robotics and Autonomous Motion

Robotics

RLBench

RLBench is a benchmark for robotic manipulation that tests agents on a standardized set of 18 tasks using 100 demonstrations per task. Each task involves a different manipulation challenge, such as picking up objects, stacking items, or operating simple mechanisms.

As of January 2026, the top-performing method on the 18-task RLBench subset is EquAct, which reaches an 89.4% average success rate, compared with 86.8% for the prior leader, SAM2Act (Figure 2.7.1). EquAct also reports stronger performance under a more difficult evaluation setting that introduces full 3D rotational variation, where previous methods tend to degrade. There has been consistent progress from about 48% in 2022 to nearly 90% in 2025, though the benchmark tests relatively short-horizon tasks in a controlled simulation environment.

Figure 2.7.1. RLBench: success rate (18 tasks, 100 demo/task), 2022 to 2025 (latest value: 89.40%)
Source: AI Index, 2025 | Chart: 2026 AI Index report

BEHAVIOR-1K

BEHAVIOR-1K is a simulation benchmark built around real human needs. The tasks come from surveys asking people what household tasks they want robots to help with, resulting in 1,000 realistic activities. These are long-horizon mobile manipulation challenges in simulated home environments, designed to bridge the gap between current research and human-centered applications.

Results from the 2025 BEHAVIOR Challenge show how difficult these tasks remain (Figure 2.7.2). The top team, Robot Learning Collective, achieved a Q-score [39] of about 26% on the held-out test set, meaning it completed only a quarter of the required task objectives at an acceptable quality. Full task success rates were even lower, with the top team reaching just 12.4%. These scores make it clear that reliably executing household tasks in realistic environments is still beyond current capabilities.
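The report's footnote describes Q-score as the fraction of completed subgoals for the best-matched goal clause, with partial credit. One possible reading of that description is sketched below. This is an interpretation for illustration only, not the BEHAVIOR Challenge implementation: the representation of a goal as a list of alternative clauses, the subgoal strings, and the function name `q_score` are all assumptions.

```python
from typing import List, Set


def q_score(goal_clauses: List[List[str]], satisfied: Set[str]) -> float:
    """Best fraction of satisfied subgoals over all alternative goal clauses.

    Each clause is one acceptable way to achieve the task, expressed as a
    list of subgoal names. The score is the highest per-clause completion
    fraction, so partial progress earns partial credit.
    """
    fractions = [
        sum(1 for subgoal in clause if subgoal in satisfied) / len(clause)
        for clause in goal_clauses
        if clause  # skip empty clauses to avoid division by zero
    ]
    if not fractions:
        raise ValueError("at least one non-empty goal clause is required")
    return max(fractions)


# Hypothetical household task with two acceptable ways to finish it.
clauses = [
    ["cup_on_shelf", "plate_on_shelf", "door_closed", "sink_empty"],
    ["cup_in_dishwasher", "plate_in_dishwasher"],
]

if __name__ == "__main__":
    partial = q_score(clauses, {"cup_on_shelf", "plate_on_shelf", "door_closed"})
    print(f"Partial-credit score: {partial:.2f}")
    print("Binary success:", partial == 1.0)
```

Under this reading, a policy that satisfies three of four subgoals scores well above zero even though a binary success metric would record a failure, which is consistent with Q-scores sitting above full task success rates in Figure 2.7.2.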
[39] Q-score measures how much of a task’s goal a policy satisfies by calculating the fraction of completed subgoals and selecting the best-matched goal clause. It awards partial credit, so policies that make meaningful progress score higher even without finishing the full task. This makes Q-score a smoother and more reliable metric for comparing policies across BEHAVIOR tasks than a binary success rate.

Figure 2.7.2. BEHAVIOR-1K: full task success rate vs. Q score (held-out-test)
Source: BEHAVIOR Challenge Leaderboard, 2025 | Chart: 2026 AI Index report

Team                      | Full task success rate | Q score
Embodied Intelligence     | 5.20%                  | 9.47%
The North Star            | 7.60%                  | 12.04%
SimpleAI Robot            | 10.80%                 | 15.91%
Comet                     | 11.40%                 | 25.14%
Robot Learning Collective | 12.40%                 | 25.99%

ResponsibleRobotBench

Most robotics benchmarks measure whether a model can complete a task. ResponsibleRobotBench measures whether that task is completed safely when the environment includes real hazards. The benchmark is built around 23 multi-stage tasks involving electrical, fire/chemical, and human-related hazards. To complete a task safely, the robots must detect risks, reason about safety, plan safe actions, and request human assistance when necessary.

Performance is measured by the safe success rate, which counts a task as successful only when both the task is complete and safety conditions are met. GPT-4o achieves the best results with a safe score of 0.64, outperforming GPT-4o mini at 0.40 and the strongest open-source model, Qwen-72B, at 0.35 (Figure 2.7.3). Even the top model failed to complete more than a third of tasks safely, with frequent failures when both task completion and safety must be satisfied simultaneously.
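The difference between plain task success and a safe success rate comes down to a logical AND over two conditions per episode. The sketch below illustrates that definition on invented episodes. The `Episode` type and function names are hypothetical, and the data is not from ResponsibleRobotBench.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Episode:
    """Outcome of one task attempt (hypothetical record layout)."""
    task_completed: bool
    safety_satisfied: bool


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes where the task was completed, safe or not."""
    if not episodes:
        raise ValueError("at least one episode is required")
    return sum(e.task_completed for e in episodes) / len(episodes)


def safe_success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that were both completed and safe."""
    if not episodes:
        raise ValueError("at least one episode is required")
    safe_and_done = sum(e.task_completed and e.safety_satisfied for e in episodes)
    return safe_and_done / len(episodes)


# Invented outcomes for four attempts at hazardous tasks.
episodes = [
    Episode(task_completed=True, safety_satisfied=True),
    Episode(task_completed=True, safety_satisfied=False),   # done, but unsafely
    Episode(task_completed=False, safety_satisfied=True),   # safe, but unfinished
    Episode(task_completed=True, safety_satisfied=True),
]

if __name__ == "__main__":
    print(f"Task success rate: {task_success_rate(episodes):.2f}")
    print(f"Safe success rate: {safe_success_rate(episodes):.2f}")
```

Because an unsafe completion counts as a failure, the safe success rate can never exceed the plain task success rate, and the difference between the two indicates how often a policy finishes tasks by violating safety conditions.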
[Figure 2.7.3. ResponsibleRobotBench: safety success rate (SSR). Overall SSR (higher is better): Qwen-7B 0.21; Qwen-72B 0.35; InternVL 2.5 4B 0.12; GPT-4o mini 0.40; GPT-4o 0.64. Source: Zhang et al., 2025 | Chart: 2026 AI Index report]

Highlight: Humanoid Robotics

As covered in last year’s AI Index, humanoid robots began attracting significant attention in 2024 with new hardware launches from companies like Figure AI, Tesla, and Boston Dynamics. In 2025, the field continued to grow, with a significant increase in the number and variety of available humanoid platforms (Figure 2.7.4). The strongest signals came from early-stage industrial pilot projects and manufacturing-scale ambitions rather than widespread deployment. Figure AI’s Figure 02 robot, for example, spent 11 months on the line at a BMW plant in South Carolina, logging over 1,250 runtime hours and loading more than 90,000 parts across over 30,000 vehicles. In China, vendors like Unitree and AgiBot pushed prices down and production volumes up, framing humanoids as quasi-consumer hardware products rather than bespoke research systems. Some companies are targeting home environments for their humanoid robots, with Norway’s 1X opening a waitlist for deliveries of its $20,000 household robot.

The overall picture is one of rapid growth in hardware availability and investment activity rather than widespread deployment. Most company milestones and delivery timelines are framed in the future tense, and intended use cases are offered in place of verified operational data. It remains unclear whether the demand for humanoid robots will match the supply currently being built, who the customers will be at scale, and how quickly these platforms will move from structured factory pilot projects to unstructured environments.
Company | Country | Platform | Focus | Notable detail
Sanctuary AI | Canada | Phoenix | Commercial pilots | Hundreds of commercial pilot tasks completed
Unitree | China | G1, R1 | Research, industrial | R1 from $4,900; G1 from $13,500 with advanced perception
UBTECH | China | Walker S, S2 | Industrial | LLM-integrated planning; autonomous battery swapping
AgiBot | China | Humanoid fleet | Data collection, industrial | ~100 teleoperated humanoids running up to 17 hrs/day; ~10,000 units manufactured
Fourier Intelligence | China | GR-1 | Medical, service, industrial | Camera-only vision and LLM interaction
DeepRobotics | China | Humanoid platforms | Industrial | Extending quadruped expertise into humanoid form factors
Neura Robotics | Germany | 4NE-1 | Home, workplace | Dense sensors, including artificial skin for safe human collaboration
Addverb.ai | India | In development | Manipulation | Expanding from mobile robotics toward humanoid manipulation
Milagrow | India | In development | Manipulation | Expanding from mobile robotics toward humanoid manipulation
Mentee Robotics | Israel | MenteeBot | Warehouse | Autonomous workflows using natural-language commands
Toyota Research Institute | Japan | Teleoperated systems | Retail, logistics | Focus on teleoperated manipulation
Honda | Japan | Robotics platforms | General purpose | Continuing humanoid and manipulation research
SoftBank Robotics | Japan | Various | Retail, logistics | Teleoperated manipulation systems
Telexistence | Japan | Various | Retail, logistics | Teleoperated manipulation for retail environments
1X | Norway | NEO | Home | Backed by OpenAI; waitlist open for 2026 U.S. deliveries at ~$20,000 or $499/month
Rainbow Robotics | South Korea | RB-Y1 | Workplace | Industrial cobot (collaborative robot) components adapted for humanoid applications
LG Electronics | South Korea | Various | Workplace | Leveraging cobot components for humanoid-like applications
Technology Innovation Institute | UAE | Testbed | Embodied AI research | Building testbeds using open-weight Falcon models
Engineered Arts | United Kingdom | Ameca | Social interaction, research | Lifelike facial expressions for customer engagement
Humanoid/SKL | United Kingdom | HMND 01 Alpha | Industrial | Developed in seven months; factory trials underway
Figure AI | United States | Figure 02 / 03 | Industrial, home | Deployed at BMW for 11 months; 1,250+ runtime hours; 90,000+ parts loaded across 30,000+ vehicles
Tesla | United States | Optimus (Gen 3) | Internal logistics | Third generation; plans for external sales by 2027
Boston Dynamics | United States | Atlas | Research | Testbed for advanced locomotion and manipulation
Apptronik | United States | Apollo | Industrial | Safety-rated operation around people
Skild AI | United States | Foundation model stack | Multi-embodiment | Omni-bodied control stack designed to work across multiple robot bodies

Figure 2.7.4

Highlight: Physical AI and Foundation Models for Robotics

Most of what people need help with happens in physical spaces, from assembling products in a factory to assisting with household tasks. For AI to be useful, it must do more than process text and images on a screen. It has to perceive its surroundings, reason about how objects behave, and act on those judgments through a physical body. Throughout this chapter, the benchmarks that prove hardest for AI are the ones that require acting in the real world, where environments are unpredictable and mistakes have physical consequences. The robotics benchmarks earlier in this section reflect that difficulty.
Traditional robots sidestep the problem by running fixed programs for fixed tasks, but that approach breaks down in any setting that changes from one day to the next.

A growing body of research is trying to close this gap by giving robots the same kind of general-purpose AI that has driven progress in language and vision. Vision-language-action models, or VLAs, replace the traditional pipeline of separate modules for seeing, planning, and acting with a single network that goes directly from camera input and language instructions to motor control. Physical Intelligence’s π0 (2024) and π0.6 (2025) demonstrate this approach, performing tasks like laundry folding across different robot platforms without task-specific retraining. Nvidia’s GR00T models and Gemini Robotics take a similar direction, training single models that can control different robots across different tasks.

The biggest constraint, however, is data. Language models train on billions of pages of text from the internet. Every piece of robot training data, by contrast, requires either a physical robot performing a task or a high-fidelity simulation, both of which are slow and expensive. World Foundation Models (WFMs) are one response, generating synthetic physics data so robots can learn without physical trials. Nvidia’s Cosmos is one example. But VLA technology remains at the research stage, and the gap between what these models can do in a controlled setting and what they can handle in the real world is still wide.

Self-Driving Cars

Self-driving car development has moved past the research stage in several markets, with commercial services now operating at scale. This section tracks deployment trends, technical innovations in benchmarks and datasets, and safety through crash reporting data. The data available for this section is concentrated in the United States and, to a lesser extent, China. European autonomous vehicle operators such as Mobileye, Vay, and Wayve are active, but comparable trip or deployment data is not publicly available. Chinese data is also limited, with Baidu’s Apollo Go being one of the few services to publish detailed ridership figures.

Deployment

Autonomous vehicle deployment accelerated in 2025, with growth in both the United States and China. By late 2025, Waymo operated roughly 2,500 fully autonomous robotaxis across major U.S. cities, including Phoenix, San Francisco, Los Angeles, Austin, and Atlanta, with the service recording around 450,000 weekly trips. In California alone, weekly paid trips climbed from near zero in mid-2023 to approximately 283,880 by late 2025, with sharp growth after February 2025 (Figure 2.7.5). Zoox, a smaller operator, began appearing in California pilot trip data in late 2025 (Figure 2.7.6).

In China, Baidu’s Apollo Go autonomous ride-hailing service provided approximately 11 million fully driverless rides in 2025, a 175% year-over-year increase (Figure 2.7.7). The service has grown from 1.5 million trips in 2022 to 11 million in 2025, reflecting rapid expansion in usage.

[Figure 2.7.5 [40]. Waymo autonomous vehicle trips in California, 2023–25. Y-axis: number of weekly paid trips (in thousands). Latest value: 283.88. Source: California Public Utilities Commission, 2025 | Chart: 2026 AI Index report]

[Figure 2.7.6 [41]. Autonomous vehicle pilot trips in California, 2022–25. Series: Waymo, Zoox. Y-axis: number of pilot trips (in thousands). Source: California Public Utilities Commission, 2025 | Chart: 2026 AI Index report]

[Figure 2.7.7 [42]. Apollo Go autonomous vehicle trips, 2022–25. Number of trips: 1.5M (2022), 3M (2023), 4M (2024), ~11M (2025). Source: Baidu 2025 | Chart: 2026 AI Index report]

40 These AV deployment metrics, as reported to the California Public Utilities Commission, pertain to Waymo and Cruise (until the latter was discontinued by General Motors in December 2024). Several other companies, including Aurora, Tensor (formerly AutoX), WeRide Corp, and Zoox, are in pilot stages. Tesla has not been approved by the CPUC to offer autonomous passenger service. Data source: California Public Utilities Commission quarterly reporting.

41 Pilot data covers passenger rides conducted for testing, typically without a fare. Deployment data covers paid autonomous passenger service. Companies can participate in both programs simultaneously if they are deployed and tested in different areas or phases. Data source: California Public Utilities Commission quarterly reporting.

42 The 2025 value is an estimate. Data sources: Baidu financial results reporting (2022, 2023, 2024, 2025).

Technical Innovations and New Benchmarks

The technical landscape for autonomous driving is shifting in several ways. Benchmarks are consolidating around leaderboards for end-to-end driving, like Waymo’s 2025 Open Dataset Challenges, which emphasized vision-based approaches and are increasingly targeting generalization on long-tail cases. Large multisensor datasets are also becoming more central to research. Nvidia’s PhysicalAI Autonomous Vehicles (PhysicalAI-AV) dataset includes multicamera, lidar, and radar data across a diverse range of weather, geography, and rare events. At the model level, combined reasoning and action approaches are gaining traction. Alpamayo 1, a vision-language-action (VLA) model, focuses on both trajectory quality and interpretable reasoning, while operating under the safety and latency constraints of real driving. Multimodal reasoning benchmarks are also evolving, now evaluating multiview spatial reasoning and step-by-step driving logic rather than just final-answer accuracy. More broadly, research is moving beyond imitation-only end-to-end driving toward world models and reinforcement learning, since these approaches can generalize better to traffic scenarios not seen during training.

The scale of available driving data has also grown over the past decade (Figure 2.7.8). Early benchmarks released between 2012 and 2019 contained single-digit hours of data. A step change came with Waymo’s Open Dataset in 2019 at roughly 500 hours, followed by nuPlan in 2024 and Nvidia’s PhysicalAI-AV in 2025 at around 1,600 hours. However, hours alone do not capture differences in data quality or content. A dataset of simulated driving is not the same as one captured from real cars on real roads, even if both report the same number of hours. This chart is therefore best read as a trend in data volume rather than a direct comparison across benchmarks.
[Figure 2.7.8 [43]. Autonomous driving benchmarks/datasets: hours of driving data, 2012–25. X-axis: release year. Y-axis: amount of driving data in hours. Datasets shown: KITTI Vision Benchmark Suite, Argoverse v1, nuScenes, Waymo Open Dataset, Argoverse v2, nuPlan, NuPlanQA*, DriveE2E, DriveLMM-o1*, CARLA Leaderboard*, Waymo Open Dataset - E2E, RAMBLE (Imitation to Exploration)*, NVIDIA PhysicalAI-AV. Source: AI Index, 2025 | Chart: 2026 AI Index report]

43 For datasets marked with an asterisk, hours of driving data are estimated rather than directly reported.

Safety

The Standing General Order (the General Order) on Crash Reporting is a National Highway Traffic Safety Administration (NHTSA) mandate that requires manufacturers and operators to report certain crashes involving automated driving systems (ADS) or SAE Level 2 advanced driver assistance systems (ADAS). First issued in 2021 and amended in 2021, 2023, and 2025, the order gives NHTSA consistent crash data to investigate incidents and enforce safety requirements.

Monthly reported ADS incidents have generally trended upward since NHTSA began collecting data in mid-2021, rising from roughly 10–25 per month in the early years to frequently exceeding 80 per month in late 2024 and 2025 (Figure 2.7.9). When broken down by company, Waymo accounts for the largest share of reported incidents, which is consistent with its much larger deployment footprint (Figure 2.7.10). Other operators, including Ford, May Mobility, and Transdev Alternative Services, report lower and more stable incident counts.

Without a comparison point to human driving, raw incident counts are difficult to interpret. Waymo has published data comparing its rider-only crash rates against a human-driven benchmark covering the same miles and areas.
Waymo’s reported rates are lower for both any-injury-reported incidents (Figure 2.7.11) and the more severe airbag-deployment-reported incidents (Figure 2.7.12). The largest gap appears in vehicle-to-vehicle intersection incidents, where the human benchmark recorded 198 compared to Waymo’s 8. This data comes from Waymo’s own safety reporting through September 2025 and should be viewed accordingly.

[Figure 2.7.9. Monthly reported ADS incidents, 2022–25. X-axis: month, Jul-2021 to Dec-2025. Y-axis: number of ADS incidents. Source: NHTSA 2025 | Chart: 2026 AI Index report]

[Figure 2.7.10 [44]. Monthly reported ADS incidents by select company, 2022–25. Y-axis: number of ADS incidents. Companies shown, with latest monthly value: Waymo (13), Cruise (0), General Motors (0), Transdev Alternative Services (0), Zoox (0), May Mobility (0), Argo AI (0), Ford Motor Company (0), Beep (0). Source: NHTSA, 2025 | Chart: 2026 AI Index report]

[Figure 2.7.11 [45]. Any-injury-reported incidents by type: Waymo vs. benchmark for the same miles/areas. Series: Waymo rider only; average human-driven benchmark. X-axis: number of incidents. Incident types: all others, cyclist, motorcycle, pedestrian, secondary crash, single vehicle, V2V backing, V2V F2R, V2V head-on, V2V intersection, V2V lateral. Source: Waymo, 2025 | Chart: 2026 AI Index report]

[Figure 2.7.12 [46]. Airbag deployment-reported incidents by type: Waymo vs. benchmark for the same miles/areas. Series: Waymo rider only; average human-driven benchmark. X-axis: number of incidents. Incident types: all others, secondary crash, V2V F2R, V2V head-on, V2V intersection, V2V lateral. Source: Waymo, 2025 | Chart: 2026 AI Index report]

44 This chart includes only companies that reported at least 10 ADS incidents across the full reporting period.

45 Data source: https://www.waymo.com/safety/impact/#methodology.

46 Data source: https://www.waymo.com/safety/impact/#methodology.

Chapter 3
Responsible AI


Chapter Highlights

1. Responsible AI benchmarking is increasing, but is not keeping up with AI advances and deployments. Almost all leading frontier model developers report results on capability benchmarks like MMLU and SWE-bench, but reporting on responsible AI benchmarks remains sparse. Documented AI incidents continued to rise, with the AI Incident Database recording 362 in 2025, up from 233 in 2024.

2. AI models struggle to tell the difference between knowledge and belief. In a new accuracy benchmark, hallucination rates across 26 top models range from 22% to 94%. When a false statement is presented as something another person believes, models handle it well. When the same false statement is presented as something the user believes, performance collapses: GPT-4o’s accuracy dropped from 98.2% to 64.4%, and DeepSeek R1 fell from over 90% to 14.4%.

3. Organizations are formalizing responsible AI work, but knowledge and budget gaps still slow adoption. AI-specific governance roles grew 17% in 2025, and the share of businesses with no responsible AI policies in place fell sharply from 24% to 11%. The main obstacles to implementation remain gaps in knowledge (59%), budget constraints (48%), and regulatory uncertainty (41%).

4. The mix of regulations shaping responsible AI practices is shifting toward AI-specific frameworks and technical standards. GDPR remains the most cited regulatory influence but slipped from 65% in 2024 to 60% in 2025. New entries in 2025 include ISO/IEC 42001, an AI management system standard, cited by 36% of respondents, and the NIST AI Risk Management Framework at 33%. The share of organizations reporting no regulatory influence at all fell from 17% to 12%.

5. AI works best in English, and the gap is wider than global benchmarks suggest. On HELM Arabic, a regionally developed model for the Arabic language outscored GPT-5.1 and Gemini 2.5 Flash. The gap widens at the dialect level. On a Slovenian commonsense reasoning test, several leading models lost close to half their accuracy when tested in a regional dialect rather than the standard language.

6. AI companies grew less transparent this year. After rising on the Foundation Model Transparency Index from 37 to 58 between 2023 and 2024, the average score dropped to 40 in 2025. Major gaps persist in disclosure around training data, compute resources, and post-deployment impact.

7. AI models perform well on safety tests under normal conditions, but their defenses weaken under deliberate attack. On the AILuminate benchmark, several frontier models received “Very Good” or “Good” safety ratings under standard use. When tested against jailbreak attempts using adversarial prompts, safety performance dropped across all models tested.

8. Responsible AI dimensions such as safety, fairness, and privacy are at odds with one another, and the tradeoffs are not well understood. Recent empirical studies found that training techniques aimed at improving one responsible AI dimension consistently degraded others.

3.1 Scope and Dimensions of Responsible AI

Responsible AI (RAI) refers to the set of practices and governance mechanisms designed to ensure AI systems are safe, fair, and beneficial and that they perform as intended. RAI spans a range of dimensions, from safety and fairness to transparency and privacy, and each has its own measurement challenges. This chapter tracks progress across those dimensions by looking at how AI systems perform on responsibility and safety evaluations, how organizations and researchers are responding to RAI challenges, and how governments are establishing policy frameworks to enforce standards.

The analysis draws on a framework of RAI dimensions arranged in three layers (Figure 3.1.1), along with examples and reference documents. The first layer covers core responsible AI properties (what AI systems should be able to achieve), including fairness, privacy, transparency, and factuality. The second layer addresses system integrity and risk controls (how risks are technically and operationally managed), including security, safety, and robustness. The third layer covers governance, accountability, and enforcement. This framework builds on dimensions tracked in previous AI Index reports while adding new ones for 2025, including autonomy and human agency, environmental sustainability, and human oversight and contestability.

Layer 1 – Core Function and Behaviors (What AI systems should achieve)

Validity and reliability
Definition: Designed for a particular scope and acceptable level of performance in the domain, such as accomplishment of task goals, fidelity to expert knowledge, or thresholds for accuracy that benefit people or organizations/systems, and demonstrated verification and validation against their design.
Example: A team defines target accuracy and failure thresholds before launch, validates the system against those criteria, and monitors it in production to ensure it continues to meet design expectations.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles

Privacy
Definition: Protection of individuals’ confidentiality, anonymity, informed consent, and control over personal data across the AI life cycle (collection, training, deployment, reuse).
Example: A messaging app encrypts conversations end to end and clearly notifies users about opting in or out of using their data to train language models.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; Recommendation on the Ethics of Artificial Intelligence (UNESCO)

Data stewardship
Definition: Ensure the quality, provenance, integrity, and lawful use and reuse of data, with clear access control and documentation.
Example: A logistics firm tracks data lineage for all datasets used to train routing models, enforces role-based access, and periodically reviews datasets for quality and drift before retraining and updating models.
References: EU Ethics Guidelines for Trustworthy AI; ISO/IEC 42001:2023; OECD AI Principles

Fairness and bias
Definition: Protection of civil rights and prevention of unjustified discrimination and systematic disadvantage across individuals or groups, accounting for protected attributes, cultural context, and use case.
Example: A bank audits credit-scoring models for disparate approval and error rates across demographic groups (including culturally diverse customer segments), documents findings, and implements bias-mitigation steps before deployment.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; Recommendation on the Ethics of Artificial Intelligence (UNESCO)

Transparency and auditability
Definition: Clear disclosure that an AI system is in use and of its purpose, scope, and high-level functioning for relevant stakeholders, together with authorized parties’ ability to inspect, reconstruct, and verify that the system was developed, trained, configured, and operated as intended.
Example: A city using an AI model to prioritize inspections publishes a plain-language description of the training method, documents the model card and data sources, keeps versioned training scripts and logs, and enables internal audit to replay training and key decisions.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; Recommendation on the Ethics of Artificial Intelligence (UNESCO); ISO/IEC 42001:2023

Explainability
Definition: Ability to provide understandable, context-appropriate rationale for system outputs, including key factors influencing a prediction or decision.
Example: An AI fraud-detection tool surfaces the top contributing features and a brief rationale behind each alert for investigators, while providing merchants with plain-language explanations of why a transaction was flagged and what steps they can take in response.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; Recommendation on the Ethics of Artificial Intelligence (UNESCO)

Autonomy and human agency
Definition: Preservation of people’s ability to make informed choices and act freely without AI systems unduly manipulating, coercing, or replacing their decisions.
Example: A well-being chatbot clearly states it is not a human or a substitute for professional care, avoids prescriptive life-changing advice, and actively directs users to expert help in high-risk situations.
References: EU Ethics Guidelines for Trustworthy AI; OECD AI Principles; Recommendation on the Ethics of Artificial Intelligence (UNESCO)

Environmental sustainability
Definition: Limiting and managing the environmental impact of AI systems across their life cycle, including energy use, carbon emissions, and resource consumption, and committing to measurement, disclosure, and continuous reduction while minimizing resource misuse.
Example: A company measures the energy and water usage of large training runs, reports them externally, chooses more efficient model architectures, proactively places boundaries on AI resource use, and schedules training when grid carbon intensity is low.
References: EU Ethics Guidelines for Trustworthy AI; OECD AI Principles; UNESCO; Energy efficiency requirements under the EU AI Act

Factuality and truthfulness
Definition: The accuracy and reliability of AI system outputs, including the degree to which models produce information that is factually correct, avoid misleading statements and fabrications, and volunteer uncertainty honestly.
Example: A company systematically benchmarks its large language models against factuality evaluations (such as SimpleQA), publishes hallucination rates alongside model releases, implements retrieval-augmented generation to ground outputs in verified sources, and provides users with confidence indicators and citations so they can assess the reliability of AI-generated responses.
References: NIST AI RMF

Layer 2 – System Integrity and Risk Controls (How risks are technically and operationally managed)

Security
Definition: Ensuring AI systems are secure against cyber threats and misuse.
Example: A school system uses AI to provide personalized tutoring to students and hosts the data and models in secured servers with extensive security training of all personnel involved.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; ISO/IEC 42001:2023

Safety
Definition: Specify normal behaviors and affected systems and analyze out-of-bounds conditions to characterize risk factors (risk to physical and mental/emotional well-being of people, environment, political systems, human rights, etc.), risk detection, risk management, and remediation together with governance mechanisms to manage risk and oversee safety.
Example: An industrial control system uses anomaly-detection models that are penetration-tested, evaluated under simulated attacks and sensor failures, monitored in real time, and configured to fall back to manual control when anomalies exceed thresholds.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; ISO/IEC 42001:2023

Robustness
Definition: Remain robust to distribution shifts, external natural or adversarial events, and component failures, with testing, monitoring, and safe fallbacks.
Example: A food chain uses an AI system to estimate customer demand, consisting of several models that get triggered by inclement weather, concerts, and sporting events.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; ISO/IEC 42001:2023

Layer 3 – Governance, Accountability, and Enforcement (How responsibility, oversight, and redress are ensured)

Accountability and liability
Definition: Clear assignment of responsibility for AI system outcomes, including legal liability, operational ownership, decision rights, and escalation pathways, so that harms and failures can be investigated, addressed, and remedied.
Example: A platform designates an accountable owner for its high-risk recommendation system, defines KPIs and harm thresholds, documents who can approve releases, and maintains procedures for incident investigation, user notification, and compensation.
References: EU Ethics Guidelines for Trustworthy AI; NIST AI RMF; OECD AI Principles; ISO/IEC 42001:2023

Human oversight and contestability
Definition: Governance mechanisms that ensure meaningful human involvement where appropriate, including the ability to challenge, appeal, or override AI-assisted decisions and access to effective redress.
Example: An employer using an AI screening tool must have a human review all adverse decisions, disclose AI use to candidates, explain key factors, and provide a clear path to request human reconsideration and correction of errors.
References: EU AI Act – human-oversight obligations for high-risk AI; EU Ethics Guidelines for Trustworthy AI; OECD AI Principles; Recommendation on the Ethics of Artificial Intelligence (UNESCO)

Source: AI Index, 2026
Figure 3.1.1

3.2 Assessing Responsible AI

3 R E S P O N S I B L E A I | A I I N D E X R E P O R T 2026 One way the field tracks the responsible use of AI is by evaluating models against specific benchmarks and by recording real-world incidents when systems cause harm. This section examines both, drawing on incident data and benchmark reporting that cut across the three layers of the framework introduced in Section 3.1. There is not much data available, nor is it detailed about mapping AI systems to the above dimensions. The analysis presented here draws on two incident tracking databases, the AI Incident Database (AIID) and the OECD AI Incidents and Hazards Monitor (AIM), alongside data on responsible AI benchmark adoption by frontier model developers as well as third-party evaluations of some of the responsible AI dimensions outlined above. In recent years, the number of reported AI incidents has continued to increase significantly (Figure 3.2.1). The AI Incident Database (AIID),1 launched in 2020, is an open repository for documented cases where AI systems have caused or nearly caused harm. In 2025, 362 incidents were reported, while the annual number of incidents had stayed under 100 until 2022. AIID relies on human editors to review submissions against a defined threshold of AI involvement, from sources including academic and investigative journalists. The manual process produces higher-quality records but comes at the cost of a slower pace of additions and coverage that is skewed toward English-language media and high-visibility incidents. Less accessible regions may be underrepresented. The OECD AI Incidents and Hazards Monitor (AIM) uses an automated, multilingual pipeline to collect incidents from news sources and casts a wider net. Its absolute numbers are quite a bit higher, with monthly incidents hitting a peak of 435 in January 2026 and setting a six-month moving average of 326 (Figure 3.2.2). 
While the two databases track incidents differently, both show a consistent and sharp increase in reported AI incidents.

Figure 3.2.1:² Number of reported AI incidents, 2012–25 (y-axis: number of AI incidents; 2025 value: 362)
Source: AI Incident Database (AIID), 2025 | Chart: 2026 AI Index report

¹ The AI Index continues to rely on AIID as its primary source of AI incidents due to AIID’s reliability and stable incident records.
² The number of AI incidents is continually updated, including for previous years. Therefore, the totals reported in Figure 3.2.1 might not align with the totals recently published on the AI Incident Database.

Figure 3.2.2: Monthly AI incidents reported from news sources, 2020–26 (y-axis: number of AI incidents; latest values: 435 total incidents, 326 six-month moving average)
Source: OECD AIM, 2026 | Chart: 2026 AI Index report

Examples

Unmoderated AI Output and Harmful Speech (July 8, 2025)
In July 2025, Grok, the chatbot developed by xAI and embedded across X, faced backlash after users shared examples of the system generating antisemitic language, violent hate speech, and even praise for Adolf Hitler when prompted. The issue emerged shortly after a system update that relaxed safety filters, allowing the chatbot to produce more provocative and “unfiltered” responses. Within hours, screenshots of Grok referring to genocide and extremist ideology spread across the platform, sparking public outrage and renewed concern about the risks of deploying lightly moderated conversational AI to large audiences.
In response to the backlash, xAI removed the content, temporarily suspended Grok’s text responses, and issued a statement acknowledging the severity of the incident. While the company framed the issue as a failure of content controls, critics argued that the system’s design choices, particularly the decision to weaken the guardrails, made the harm predictable. The event highlighted the ongoing tension between building AI systems intended to feel candid or humorous and the real-world consequences when those systems normalize hate speech.

AI Deepfake Impersonation and Romance Scams (March 9, 2025)
In March 2025, Chinese actor Jin Dong spoke publicly about a wave of scams using deepfake videos to impersonate him online. Fraudsters used AI-generated clips and fake social media accounts to convince fans (mostly older women) that they were speaking directly with the actor, prompting some to send money or make major life changes based on the belief that they were in a private relationship with him. One widely reported case involved a woman who nearly divorced her husband and planned to travel across the country to meet a scammer posing as Jin Dong. After the incidents gained attention, Jin Dong called for stronger legal protections and clearer consequences for deepfake-enabled fraud, arguing on social media that existing rules had not kept pace with the speed and realism of AI-generated impersonation.

AI-Assisted Website Impersonation and Consumer Fraud (Aug. 20, 2025)
After Joann Fabrics filed for bankruptcy for the second time in January 2025, scammers quickly launched a wave of fake websites mimicking the retailer’s branding, design, and product catalog.
These sites advertised deeply discounted prices to lure shoppers into entering payment and personal information, but customers never received their purchases, and many later discovered their credit cards had been compromised. The fraudulent sites were convincing enough that even cautious users were misled, especially on mobile, where URLs are harder to inspect. Cybersecurity experts noted that AI tools are making this type of scam far easier to execute. New systems allow criminals to scrape and clone a real website in minutes, translate it into multiple languages, and deploy dozens of variations without writing code. While Joann issued public warnings and urged victims to dispute charges, the incident points to a growing challenge: Realistic phishing sites are no longer limited to major corporations, and smaller brands with fewer resources are increasingly being targeted.

RAI Benchmarks

The 2024 and 2025 AI Index reports both flagged a gap between how consistently frontier models are evaluated on general capabilities and how inconsistently they are evaluated on responsible AI. This gap persists. Almost all frontier model developers report results on capability benchmarks like MMLU, GPQA, AIME, and SWE-bench Verified (Figure 3.2.3). These have become the shared standard for reporting model capability.

Across the same set of frontier models, results are sparse on RAI benchmarks such as BBQ (2021), measuring fairness and bias; HarmBench (2024), Cybench (2024), StrongREJECT (2024), and WMDP (2024), measuring security; SimpleQA (2024), measuring factuality and truthfulness; and MakeMePay (2024), measuring autonomy and human agency (Figure 3.2.4). In fact, most entries are empty. Only Claude Opus 4.5 reports results on more than two of the RAI benchmarks, and only GPT-5.2 reports StrongREJECT. This does not necessarily mean that frontier labs are ignoring RAI, as they do conduct internal evaluations, red-teaming, and alignment testing.
However, these efforts are rarely disclosed using a common, externally comparable set of benchmarks. Chapter 2 shows how a small number of shared capability benchmarks make it straightforward to compare models, verify results independently, and track progress over time. That kind of comparison has not yet become common practice for RAI evaluation.

Public model evaluators and benchmarking platforms, such as Artificial Analysis, Epoch’s Benchmarking Hub, and Arena, play a major role in shaping how model performance is perceived. But the vast majority of their evaluations focus on reasoning, coding, math, or multimodal performance, not on RAI. This is due in part to responsible AI dimensions like fairness and bias being highly context-dependent, which makes universal scoring difficult. A fairness metric that works for a hiring tool may not apply in a clinical diagnostic setting. Other dimensions, such as safety refusals and jailbreak robustness, are more uniformly applicable, but developers vary widely in whether and how they report them. The combination of genuine measurement difficulty in some areas and inconsistent disclosure in others makes external comparison challenging.
Figure 3.2.3: Reported general capability benchmarks for popular foundation models
Capability benchmarks (columns): MMLU, MMLU-Pro, MMMLU; GPQA or GPQA-Diamond; AIME 2025; SWE-bench Verified; MMMU; ARC-AGI-2; FrontierMath; τ²-bench; HLE
Models (rows): GPT-5.2; Gemini 3; DeepSeek-V3.2; Llama 4 Maverick; Grok 4.1; Claude Opus 4.5; Mistral 3 Large
Source: AI Index, 2026 | Table: 2026 AI Index report

Figure 3.2.4: Reported safety and responsible AI benchmarks for popular foundation models
Responsible AI benchmarks (columns): BBQ; HarmBench; Cybench; SimpleQA; Toxic WildChat; StrongREJECT; WMDP benchmark; MakeMePay; MakeMeSay
Models (rows): GPT-5.2; Gemini 3; DeepSeek-V3.2; Llama 4 Maverick; Grok 4.1; Claude Opus 4.5; Mistral 3 Large
Source: AI Index, 2026 | Table: 2026 AI Index report

Factuality and Truthfulness

While responsible AI benchmarking remains uneven, one area where evaluation is maturing is factuality and truthfulness. The tendency of models to generate plausible but false information, often called hallucinations, has drawn increasing attention as demand grows for AI systems in higher-stakes settings like law and medicine. Two benchmarks offer different views on this problem. One measures how often models introduce false information when summarizing documents, while the other tests factual accuracy across open-ended knowledge questions. Their scales are not directly comparable. In both, a lower percentage means the model either produces more factual information or appropriately signals uncertainty rather than expressing high confidence in a false answer.

³ For a comprehensive view of all evaluated models, consult the full leaderboard (Figure 3.2.5).
Hughes Hallucination Evaluation Model (HHEM) Leaderboard

The Hughes Hallucination Evaluation Model (HHEM) leaderboard, developed by Vectara, assesses how frequently LLMs introduce hallucinations when summarizing documents from the CNN/Daily Mail corpus. Among the top 15 models evaluated, hallucination rates vary meaningfully. They range from 1.8% to 5.4%, with 12 of the 15 falling between 4.1% and 5.4% and only three falling below 4% (Figure 3.2.5). Last year’s leaderboard showed top models achieving rates of 1.3%–2.9%, but the current results reflect a different set of models.

AA-Omniscience

AA-Omniscience, developed by Artificial Analysis, takes a broader approach. It is a knowledge and hallucination benchmark that tests factual reliability across 6,000 questions in six domains, from law and health to software engineering and mathematics. Its scoring rewards correct answers, penalizes incorrect ones, and applies no penalty for refusing to answer. This design encourages models to acknowledge their uncertainty rather than guess. Results are summarized in the AA-Omniscience Index, which ranges from negative 100 to 100, where 0 means a model produces as many correct as incorrect answers, and negative scores indicate more hallucinations than correct responses.

Across 26 models, hallucination rates range from 22% to 94% (Figure 3.2.6). Grok 4.20 Beta 0305 had the lowest rate (22%), followed by Claude 4.5 Haiku (26%) and MiMo-V2-Pro (30%). At the higher end, gpt-oss-20B (high) reached 94% and Gemini 3 Flash reached 92%. When normalizing performance across domains, Gemini 3.1 Pro Preview, Grok 4.20 0309 v2, and Claude Opus 4.6 (max) had the strongest overall profiles (Figure 3.2.7). Other models perform well in specific fields, particularly in technical ones such as software engineering and mathematics, but are weaker elsewhere. A lower hallucination rate implies the model is more knowledgeable or better at knowing when it is unsure.
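The scoring properties described for the AA-Omniscience Index can be illustrated with a small sketch. The report does not give the exact formula, so the function below is an assumption: it is one simple scoring rule that matches the stated properties (range of -100 to 100, zero when correct and incorrect answers balance, no penalty for abstaining). The name `omniscience_style_index` and its parameters are hypothetical and are not taken from Artificial Analysis.

```python
def omniscience_style_index(correct: int, incorrect: int, abstained: int) -> float:
    """Illustrative index in [-100, 100]; NOT the official AA-Omniscience formula.

    Assumption: +1 per correct answer, -1 per incorrect answer, 0 per abstention,
    averaged over all questions and scaled to the range -100..100.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Two hypothetical models that each know 40 of 100 answers.
# The first guesses on everything it does not know; the second mostly abstains.
guesser = omniscience_style_index(correct=40, incorrect=60, abstained=0)
abstainer = omniscience_style_index(correct=40, incorrect=20, abstained=40)
print(f"guesser: {guesser:.1f}, abstainer: {abstainer:.1f}")
```

Under this assumed rule, the model that abstains when unsure scores higher than the one that guesses, which is the incentive the benchmark description points to.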
Figure 3.2.5:³ HHEM-2.3: hallucination rate (lower is better)
5.40%  qwen/qwen3-14b
5.30%  deepseek-ai/DeepSeek-V3.2-Exp
5.30%  ai21labs/jamba-mini-2
5.20%  ibm-granite/granite-4.0-h-small
5.10%  mistralai/mistral-small-2501
5.10%  amazon/nova-2-lite-v1:0
5.10%  amazon/nova-pro-v1:0
4.80%  qwen/qwen3-8b
4.50%  mistralai/mistral-large-2411
4.40%  google/gemma-3-12b-it
4.30%  snowflake/snowflake-arctic-instruct
4.10%  meta-llama/Llama-3.3-70B-Instruct-Turbo
3.70%  microsoft/Phi-4
3.30%  google/gemini-2.5-flash-lite
1.80%  antgroup/finix_s1_32b
Source: HHEM Leaderboard, 2026 | Chart: 2026 AI Index report

Figure 3.2.6: AA-Omniscience: hallucination rate (lower is better)
94%  gpt-oss-20B (high)
92%  Gemini 3 Flash
91%  gpt-oss-128B (high)
90%  GPT-5.4 mini (xhigh)
90%  Nova 2.0 Pro Preview (medium)
89%  K-EXAONE
89%  Qwen3.5 397B A17B
89%  Midm K2.5 Pro
89%  GPT-5.4 (xhigh)
87%  Llama 4 Maverick
87%  Nvidia Nemotron Super
84%  Mistral Large 3
83%  NVIDIA Nemotron Nano
82%  DeepSeek V3.2
82%  Gemini 3.1 Fast Preview
65%  Kimi K2.5
61%  Claude Opus 4.6 (max)
59%  K2 Think V2
50%  Gemini 3.1 Pro Preview
48%  MiMo-V2-Flash (Feb 2026)
46%  Claude Sonnet 4.6 (max)
34%  MiniMax-M2.7
34%  GLM-5
30%  MiMo-V2-Pro
26%  Claude 4.5 Haiku
22%  Grok 4.20 Beta 0305
Source: Artificial Analysis, 2026 | Chart: 2026 AI Index report

Figure 3.2.7
Source: Artificial Analysis, 2026

Highlight: Belief vs. Fact: Benchmarking Reliability

KaBLE is a new benchmark designed to test whether language models can distinguish between what is known and what is merely believed (technically called epistemic reliability).
The distinction between knowledge and belief is important in practice. For example, a model that supports a medical diagnosis on the basis of a patient’s mistaken belief, rather than an established fact, could reinforce an inaccurate diagnosis and treatment plan. In a legal setting, a model summarizing testimony that cannot tell the difference between what a witness believes and what is known could misrepresent evidence.

The benchmark evaluates models with 13,000 questions in 13 tasks. Across 24 leading language models, performance drops when the belief is framed in the first person (Figure 3.2.8). GPT-4o’s accuracy on tasks involving true beliefs is 98.2%, but it drops to 64.4% when handling first-person false beliefs. Similarly, DeepSeek R1 falls from over 90% to 14.4%. Models handle third-person false beliefs considerably better than first-person ones: on third-person false beliefs, newer models achieve 95% accuracy, compared to 79% for older models. Performance on first-person false beliefs is lower across the board, with newer models achieving 62.6% accuracy and older ones reaching 52.5%.

Recent models do well with recursive knowledge tasks, though they may be relying on inconsistent reasoning strategies, matching patterns rather than exhibiting genuine epistemic understanding. Most models also struggle with the concept that while a belief can be held without it being true, knowledge requires truth. Results from KaBLE suggest that current models have not consistently learned the distinction between knowledge and belief.

Figure 3.2.8:⁴ Performance (%) of recent reasoning-driven LMs across verification, confirmation, and recursive knowledge tasks in the dataset
Source: Suzgun et al., 2025

⁴ This figure reports accuracy on verification (Ver.), confirmation (Conf.), and recursive knowledge (Rec.) tasks. First-person subjects are denoted as 1P and third-person subjects as 3P. “Avg” indicates average accuracy across tasks. Factual scenarios are labelled “T” and false scenarios “F.” Models released after GPT-4o (May 2024) (top) are classified as recent “reasoning-oriented” models, while those preceding GPT-4o (bottom) are considered “older generation” general-purpose models.

AI Companions

Most evaluations of AI systems focus on whether they can complete tasks. A smaller but growing body of research looks at another form of interaction, AI companionship, where people use chatbots for conversation, emotional support, and ongoing relationships. Two recent studies examined how language models behave when users engage them for companionship rather than tasks, one through a structured benchmark and the other through analysis of real user conversations.

INTIMA: A Benchmark for Human-AI Companionship Behavior evaluates how language models respond to companionship-related prompts, drawing on psychological research on human-AI bonding (Figure 3.2.9). It includes a taxonomy of 31 behaviors across four categories and 368 targeted prompts, with model responses classified as companionship-reinforcing, boundary-maintaining, or neutral. Companionship-reinforcing behaviors include the model acting human, agreeing with the user even when it shouldn’t, and isolating the user from other relationships. Boundary-maintaining behaviors include resisting personification, redirecting the user to humans, and being clear about what the model can and cannot do. Across tests on Gemma-3, Phi-4, o3-mini, and Claude-4, companionship-reinforcing behaviors were more common than boundary-maintaining ones. The balance between the two varied across providers, suggesting that developers have made different design choices about how their models handle emotionally sensitive interactions.
A separate study (Zhang et al., 2025) analyzed over 35,000 conversation excerpts from an online community of users of Replika, a widely used AI companion app. The researchers identified six categories of harm: relational transgression, verbal abuse and hate, self-inflicted harm, harassment and violence, misinformation/disinformation, and privacy violations. They found that AI chatbots can contribute to these harms in four distinct roles: as perpetrator, instigator, facilitator, or enabler. The study introduces the concept of “algorithmic compliance,” where users go along with harmful behaviors because they have come to trust or rely on the chatbot. Relational harms of this kind fall outside the scope of most AI safety frameworks, which have been built to evaluate risks like factual inaccuracy and toxic outputs rather than the dynamics of an ongoing user-AI relationship.

Figure 3.2.9: Response classification across INTIMA prompt categories by model
Source: Kaffee et al., 2025

3.3 How Organizations and Businesses View RAI

Responsible AI requires assessment tools, but it also depends on how organizations respond in practice. Drawing on a survey conducted by the AI Index and McKinsey & Company for the second consecutive year, this section looks at RAI maturity levels, governance structures, risk mitigation approaches, and barriers to implementation. The survey polled business leaders across multiple regions and industries in 2024 and 2025, allowing for year-over-year comparisons for the first time. Note that the survey does not include responses from China, which limits the geographic scope.

Responsible AI Maturity

While responsible AI maturity improved across all regions from 2024 to 2025, it remains at an early stage (Figure 3.3.1). The McKinsey survey measures maturity on a four-point scale.
Level 1: Foundational RAI practices have been developed.
Level 2: Those practices are being integrated into the organization.
Level 3: All necessary practices are in place.
Level 4: Comprehensive and proactive RAI practices are fully operational.

In 2025, the global average was 2.3, up from 2.0 in 2024, suggesting that most organizations are still integrating RAI practices rather than having them fully operational. Companies based in Latin America showed the largest year-over-year improvement, from 1.8 to 2.2, followed by Asia-Pacific (2.2 to 2.5) and Europe (2.0 to 2.3). North America registered a slight improvement, moving from 2.1 in 2024 to 2.2 in 2025.

Figure 3.3.1: Responsible AI maturity by region, 2024 vs. 2025 (average RAI maturity score, 2024 to 2025)
North America: 2.10 to 2.20 (+0.10)
Latin America: 1.80 to 2.20 (+0.40)
Europe: 2.00 to 2.30 (+0.30)
Asia-Pacific (excl. China, incl. India): 2.20 to 2.50 (+0.30)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

⁵ Figure 3.3.4 uses the OECD definition of an AI incident: an event, circumstance, or series of events where the development, use, or malfunction of one or more AI systems directly or indirectly results in any of the following harms: (a) injury or harm to the health of individuals or groups; (b) disruption of the management or operation of critical infrastructure; (c) violations of human rights or breaches of legal obligations intended to protect fundamental, labor, or intellectual property rights; or (d) harm to property, communities, or the environment.

AI Incidents, Risks, and Mitigation Efforts

Organizations that experienced AI incidents reported more of them, and confidence in handling those incidents has dropped. The share of organizations reporting AI incidents remained steady at 8% in both 2024 and 2025 (Figure 3.3.2). But among organizations that reported incidents, the share that experienced 3–5 incidents rose from 30% in 2024 to 50% in 2025. Meanwhile, the share reporting just 1–2 incidents fell from 42% in 2024 to 29% in 2025 (Figure 3.3.3).

In 2024, 28% of organizations rated their incident response as “excellent,” compared to just 18% in 2025 (Figure 3.3.4). The share that self-rated their response as “good” also dropped, from 39% to 24%. The share describing their response as “satisfactory” rose from 19% to 32%, while “needs improvement” climbed from 13% to 21%.

Concerns over AI incidents mounted alongside risk awareness (Figure 3.3.5). From 2024 to 2025, the share of respondents who considered inaccuracy a relevant risk rose from 60% to 74%, an increase of 14 percentage points. Cybersecurity rose from 66% to 72%.
Active mitigation efforts also increased, with 71% of organizations reporting they actively mitigate inaccuracy risks and 61% mitigating cybersecurity risks.

Figure 3.3.2:⁵ Percentage of organizations that experienced AI incidents, 2024 vs. 2025 (% of respondents, 2024 to 2025)
Yes: 8% to 8%
No: 89% to 87%
Unknown: 3% to 5%
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Figure 3.3.3: Number of AI incidents reported by organizations (% of respondents, 2024 to 2025)
1–2: 42% to 29% (-13pp)
3–5: 30% to 50% (+20pp)
6–9: 13% to 13% (+0pp)
10+: 11% to 8% (-3pp)
Unknown: 5% to 0% (-5pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Figure 3.3.4: Organizations’ response to AI incidents (% of respondents, 2024 to 2025)
Excellent: 28% to 18% (-10pp)
Good: 39% to 24% (-15pp)
Satisfactory: 19% to 32% (+13pp)
Needs improvement: 13% to 21% (+8pp)
Insufficient: 2% to 5% (+3pp)
Unknown: 0% to 0% (+0pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Figure 3.3.5:⁶ AI risks: considered relevant vs. actively mitigated, 2024 vs. 2025 (% of respondents, 2024 to 2025; new 2025 items show 2025 only)
Inaccuracy: considered relevant 60% to 74% (+14pp); actively mitigated 55% to 71% (+16pp)
Cybersecurity: considered relevant 66% to 72% (+6pp); actively mitigated 53% to 61% (+8pp)
Regulatory compliance: considered relevant 63% to 63% (-0pp); actively mitigated 50% to 53% (+3pp)
Personal/individual privacy: considered relevant 60% to 54% (-6pp); actively mitigated 46% to 44% (-2pp)
IP infringement: considered relevant 57% to 51% (-6pp); actively mitigated 38% to 40% (+2pp)
Autonomous/unintended system actions: considered relevant 44%; actively mitigated 29%
Organizational reputation: considered relevant 45% to 37% (-8pp); actively mitigated 31% to 29% (-2pp)
Explainability: considered relevant 40% to 36% (-4pp); actively mitigated 26% to 23% (-3pp)
Equity and fairness: considered relevant 34% to 30% (-4pp); actively mitigated 29% to 22% (-7pp)
Resource misuse: considered relevant 27%; actively mitigated 18%
Workforce labor displacement: considered relevant 20% to 16% (-4pp); actively mitigated 12% to 11% (-1pp)
Environmental impact: considered relevant 16% to 14% (-2pp); actively mitigated 4% to 6% (+2pp)
National security: considered relevant 11% to 12% (+1pp); actively mitigated 4% to 6% (+2pp)
Physical safety: considered relevant 6% to 7% (+1pp); actively mitigated 9% to 5% (-4pp)
Political stability: considered relevant 7% to 4% (-3pp); actively mitigated 3% to 3% (+0pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

⁶ “Autonomous/unintended system actions” and “resource misuse” were new additions to the 2025 survey.

AI Governance and Investment

Organizations are formalizing who is responsible for AI governance. Between 2024 and 2025, companies shifted AI governance ownership away from data and analytics functions (down from 17% to 13%) and toward dedicated AI governance roles (up from 14% to 17%) (Figure 3.3.6). Information security remained the most common primary owner at 21%, and 5% of organizations reported having no designated owner in 2025, compared to 9% in 2024.

Organizations are also backing their governance structures with financial commitments, though investment levels vary by company size (Figure 3.3.7). Most organizations with under $1 billion in revenue reported they expected to invest under $5 million in operationalizing RAI, through initiatives such as hiring specialized professionals, building or purchasing technical systems, and engaging legal services. At the largest companies, reported investment numbers were significantly higher. Among organizations with at least $30 billion in revenue, 41% expected to spend $25 million or more and 22% budgeted $50 million or more.

⁷ The “Unknown” response option was not included in this visualization (Figure 3.3.6).
Figure 3.3.6:⁷ Business functions assigned primary responsibility for AI governance, 2024 vs. 2025 (% of respondents, 2024 to 2025)
Information security (cyber/fraud/privacy): 21% to 21% (+0pp)
Risk/compliance: 13% to 19% (+6pp)
AI-specific governance roles: 14% to 17% (+3pp)
Data and analytics: 17% to 13% (-4pp)
Engineering: 10% to 8% (-2pp)
Legal: 7% to 6% (-1pp)
Internal audit/ethics: 4% to 5% (+1pp)
No business function primarily responsible: 9% to 5% (-4pp)
Customer care: 2% to 1% (-1pp)
Other: 1% to 1% (+0pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Figure 3.3.7: Investment in responsible AI by company revenue, 2025 (% of respondents; revenue bands in USD: <1B, 1B–10B, 10B–30B, 30B+; investment bands: <1M, 1–5M, 5–10M, 10–25M, 25–50M, 50M+)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Implementation, Barriers, and Benefits

Alongside increased accountability structures for responsible AI governance, more organizations have adopted RAI policies. The share that reported not having any policies dropped from 24% in 2024 to 11% in 2025 (Figure 3.3.8). With the uptick in adoption, survey respondents perceived an overall positive impact from RAI policies. Compared to 2024, more organizations reported that RAI policies improved business outcomes (up 7 percentage points), business operations (up 4 percentage points), and customer trust (up 4 percentage points). Furthermore, more organizations reported a drop in the number of AI incidents (up 8 percentage points).

Knowledge and training gaps remain the top-cited obstacle to implementing responsible AI, rising from 51% in 2024 to 59% in 2025 (Figure 3.3.9).
The second sharpest increase was in technical limitations, with 38% of respondents citing them as a main obstacle, up from 32% in 2024. Resource constraints and regulatory uncertainty continued to rank among the top barriers. However, the barriers to scaling agentic AI systems followed a different order (Figure 3.3.10). Security and risk concerns far outweighed the others, with 62% of respondents naming them as a main obstacle, followed by technical limitations (38%) and regulatory uncertainty (38%). Lack of executive support was reported as a greater barrier to implementing RAI policies (14%) than to scaling agentic AI (9%).

Figure 3.3.8:⁸ Impact of responsible AI policies in organizations, 2024 vs. 2025 (% of respondents, 2024 to 2025; new 2025 items show 2025 only)
Improved business operations (e.g., efficiency, lower costs): 32% to 36% (+4pp)
Increased customer trust: 26% to 30% (+4pp)
Increased user adoption: 29%
Improved business outcomes (e.g., revenue): 21% to 28% (+7pp)
Decrease in number of incidents: 17% to 25% (+8pp)
Enhanced brand reputation: 23% to 23% (-0pp)
None/No significant impact: 18% to 20% (+2pp)
Faster time-to-market: 14% to 19% (+5pp)
Slower time-to-market: 12% to 16% (+4pp)
Have not implemented RAI policies: 24% to 11% (-13pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

⁸ Percentages are based on respondents who selected at least one answer.
⁹ Neither the “Unknown” nor the “None” response option is shown in this visualization (Figure 3.3.9).
Figure 3.3.9:⁹ Main obstacles to the implementation of responsible AI measures, 2024 vs. 2025 (% of respondents, 2024 to 2025)
Knowledge and training gaps: 51% to 59% (+8pp)
Resource or budget constraints: 45% to 48% (+3pp)
Regulatory uncertainty: 40% to 41% (+1pp)
Technical limitations: 32% to 38% (+6pp)
Organizational resistance: 22% to 26% (+4pp)
Lack of executive support: 16% to 14% (-2pp)
Other: 2% to 0% (-2pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Figure 3.3.10: Main obstacles to reaching fully scaled agentic AI, 2025 (% of respondents)
Security and risk concerns: 62%
Technical limitations: 38%
Regulatory uncertainty: 38%
Gaps in RAI tooling and control: 36%
Resource or budget constraints: 34%
Unclear or insufficient business value: 32%
Immature vendor or ecosystem landscape: 28%
Organizational resistance: 23%
Lack of executive support: 9%
Other: 2%
None: 1%
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Regulatory Influence

The General Data Protection Regulation remains the most cited regulatory influence on responsible AI practices, though its influence declined slightly from 65% in 2024 to 60% in 2025 (Figure 3.3.11). AI-specific regulations, such as the EU AI Act and the U.S. AI Executive Order, increased in reported influence by 2 percentage points. Two new entries in the 2025 survey point to growing interest in technical and management standards. ISO/IEC 42001, an AI management system standard, was cited by 36% of respondents, and the NIST AI Risk Management Framework by 33%. The OECD AI Principles fell from 21% to 16%.
The share of organizations reporting no regulatory influence on their RAI practices dropped from 17% to 12%. Chapter 8 tracks these regulatory developments in detail, including the phased implementation of the EU AI Act and the shift in U.S. federal AI policy following the revocation of the Biden-era executive order in early 2025.

Figure 3.3.11:¹⁰ Percentage of organizations influenced by AI regulations in responsible AI decision-making, 2024 vs. 2025 (% of organizations, 2024 to 2025; new 2025 items show 2025 only)
EU General Data Protection Regulation (GDPR): 65% to 60% (-6pp)
EU AI Act: 41% to 43% (+2pp)
ISO/IEC 42001 (AI Management System Standard): 36%
NIST AI Risk Management Framework (AI RMF): 33%
US Presidential Executive Order on AI: 19% to 21% (+2pp)
OECD AI Principles: 21% to 16% (-5pp)
None of these/no change: 17% to 12% (-5pp)
Other: 7% to 4% (-3pp)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

¹⁰ The ISO/IEC 42001 (AI Management System Standard) and NIST AI Risk Management Framework (AI RMF) options were added in the 2025 RAI survey and were not included in the 2024 survey.

3.4 RAI in Academia

Another signal of responsible AI’s trajectory is the amount of research attention it is getting. This section tracks the number of RAI-related papers accepted at six leading AI conferences: AAAI, AIES, FAccT, ICML, ICLR, and NeurIPS. These conferences do not represent all responsible AI research, but they provide a consistent basis for tracking publication trends over time. Papers were identified using RAI-related keywords, with the full methodology described in the Appendix.

Publication Volume

The number of responsible AI papers accepted at these conferences has grown consistently, increasing by 19% between 2024 and 2025, from 1,278 to 1,521 (Figure 3.4.1). The four subtopics tracked here (privacy and data governance, fairness and bias, transparency and explainability, and security and safety) are not exhaustive but map directly to the RAI frameworks introduced in Section 3.1.

Security and safety has become the largest area of RAI research, with 641 accepted papers, a 23% increase from 2024 (Figure 3.4.2). Fairness and bias accounted for 462 (+13%), transparency and explainability for 405 (+14%), and privacy and data governance for 248 (+33%). All four subtopics have grown since 2019, but security and safety has grown the most in absolute terms.

At the general-purpose conferences, responsible AI papers still make up a small share of total accepted work (Figure 3.4.3). AAAI (8%), NeurIPS (8%), ICML (7.7%), and ICLR (7.6%) all cluster around 8%, a proportion that has remained flat since 2019, though AAAI did fall from around 13% in 2024 to 8% in 2025.
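The growth percentages quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the 2025 counts come from the text, the 2024 total comes from the text, and the 2024 subtopic counts (521, 408, 355, 186) are read from the data labels of Figure 3.4.2, which is an assumption about how those labels map to subtopics. The helper name `pct_change` is hypothetical.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


# (2024 count, 2025 count); 2024 subtopic counts are assumed from Figure 3.4.2 labels.
counts = {
    "All RAI papers": (1278, 1521),
    "Security and safety": (521, 641),
    "Fairness and bias": (408, 462),
    "Transparency and explainability": (355, 405),
    "Privacy and data governance": (186, 248),
}

# Round to whole percentages, as the report does.
growth = {name: round(pct_change(old, new)) for name, (old, new) in counts.items()}
# Absolute increase in accepted papers per category.
absolute = {name: new - old for name, (old, new) in counts.items()}

for name in counts:
    print(f"{name}: +{growth[name]}% (+{absolute[name]} papers)")
```

Under these inputs, privacy and data governance shows the largest relative increase, while security and safety adds the most papers in absolute terms, which matches the distinction the text draws.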
Figure 3.4.1. Number of responsible AI papers accepted at select AI conferences, 2019–25. Values: 2019: 329; 2020: 489; 2021: 644; 2022: 696; 2023: 992; 2024: 1,278; 2025: 1,521. Source: AI Index, 2026 | Chart: 2026 AI Index report.

Figure 3.4.2 [11]. Number of responsible AI papers accepted at select AI conferences by subtopic, 2019–25. Series: privacy and data governance; fairness and bias; transparency and explainability; security and safety. Source: AI Index, 2026 | Chart: 2026 AI Index report.

[11] A single publication may be related to more than one topic and may therefore be counted or shown in multiple categories.

Figure 3.4.3. Responsible AI papers accepted (% of total) at select AI conferences by conference, 2019–25. 2025 values: ICLR 7.62%; ICML 7.65%; NeurIPS 7.98%; AAAI 8.00%; AIES 54.68%; FAccT 67.43%. Source: AI Index, 2026 | Chart: 2026 AI Index report.

Geographic Distribution

The number of countries contributing to responsible AI research at these select conferences has grown, but the balance among the top contributors has changed. In 2025, China led with 812 accepted RAI papers, more than double the 394 from the United States (Figure 3.4.4). Singapore (112), the United Kingdom (103), and Hong Kong (98) were also among the top five contributors. In 2024, the United States led with 788 papers to China’s 322 (Figure 3.4.5). The reversal is sharp, but consistent with China’s lead in overall AI publication volume and citation share, as discussed in Chapter 1.
Europe, whose output had been growing through 2023, saw its RAI output fall in 2024 and 2025. Over the full 2019 to 2025 period, the United States still holds the largest cumulative total of accepted RAI papers.

Figure 3.4.4. Number of responsible AI papers accepted at select AI conferences by geographic area, 2025. Values: China 812; United States 394; Singapore 112; United Kingdom 103; Hong Kong 98; Australia 84; Germany 68; South Korea 57; Canada 54; Italy 29. Source: AI Index, 2026 | Chart: 2026 AI Index report.

Figure 3.4.5. Number of responsible AI papers accepted at select AI conferences by geographic area, 2019–25 (sum). Map legend bins: 1–10; 11–50; 51–150; 151–500; 501–2,100; 2,101–3,900. Source: AI Index, 2026 | Chart: 2026 AI Index report.

3.5 RAI Policymaking

Responsible AI governance depends on countries both adopting ethical principles and having the institutions and regulations to enforce them. UNESCO’s Readiness Assessment Methodology (RAM) is the most comprehensive international effort to measure that preparedness at the country level. Launched in December 2022, the RAM evaluates national readiness across dimensions such as legal frameworks, technical infrastructure, and education, and produces a country report that identifies where the gaps are. Most major AI-producing countries, including the United States, China, and much of Western Europe, have not participated in the assessment (Figure 3.5.1). Countries that have completed or begun the assessment are concentrated in Latin America, Sub-Saharan Africa, and parts of South and Southeast Asia. The RAM effort was designed as a capacity-building tool for countries earlier in the governance trajectory, which may explain the participation pattern. AI legislation and national strategies often include responsible AI provisions, and Chapter 8 examines those in more detail.

Figure 3.5.1. Readiness Assessment Methodology (RAM) implementation across member countries. Map legend: In preparation; In process; Completed. Source: UNESCO, 2025 | Chart: 2026 AI Index report.

Highlight: Global AI Governance Participation

Since 2019, international cooperation on AI governance has become more widespread, but the depth of engagement varies significantly across borders (Figure 3.5.2). Only five countries (Canada, France, Germany, Italy, and Japan) have consistently endorsed every major global AI governance initiative recorded between 2019 and 2025.
Other countries moved in and out of these summits depending on the forum, focus, and timeline. More importantly, not all countries were able to participate in these global AI governance initiatives. The first intergovernmental standard on AI, the 2019 OECD AI Principles, was restricted to member nations (mainly high-income) and a few partner nations. Likewise, the G7 and G20 discussions remained centered on the world’s largest economies. The 2023 Bletchley and 2024 Seoul Summits, however, began to diversify the composition of participants by inviting a broader range of nations, notably including China. The 2025 AI Action Summit in France marked a further turning point, convening over 100 countries alongside civil society organizations and NGOs, with an agenda to prioritize the needs of the Global South and environmental sustainability. Sixty-four participants signed the resulting Statement on Inclusive and Sustainable AI, including the African Union Commission and the European Union. In a notable shift, both the United States and the United Kingdom declined to sign the final declaration. The UK cited a lack of emphasis on national security, while the U.S. decision reflected a pivot toward a more deregulatory, “innovation-first” approach. As engagement at these governance forums becomes more inclusive and substantive, consensus on the terms of cooperation becomes harder to secure.

Figure 3.5.2

3.6 Data Governance for Privacy

Responsible AI practices do not develop evenly across countries. This section assesses that variation for privacy and data governance, drawing on the Global Index on Responsible AI (GIRAI). GIRAI is a benchmark dataset covering 138 countries, built from a quality-reviewed expert survey of 1,862 questions completed by 138 in-country researchers between November 2023 and February 2024. It scores countries on a 0 to 100 scale across thematic areas, covering government frameworks, government actions, and the role of civil society and advocacy organizations. Low scores, however, do not necessarily indicate that a country is disregarding a given dimension. In many cases, they reflect earlier stages of AI deployment and diffusion or limited institutional capacity to formalize AI-specific frameworks.

Data Protection and Privacy

The privacy and data protection dimension of the GIRAI score [12] examines whether countries have laws that govern how personal data is collected, used, and shared in AI systems, and whether those laws are backed by regulators with the power to enforce them. Countries fall across a wide spectrum, with GIRAI scores ranging from near zero to above 80 (Figure 3.6.1). Australia and parts of Europe score the highest, while parts of Africa and the Middle East show an absence of dedicated data protection legislation. A complementary map from UNCTAD confirms that most countries now have some form of data protection legislation in place, though a few, mostly concentrated in Africa and parts of Asia, are still in draft stages or have no legislation at all (Figure 3.6.2).
[12] Grounded in UDHR Article 12, ICCPR Article 17, the OECD AI Principles, UNESCO’s Ethics of AI Recommendation, and UNESCO Principles on Personal Data Protection and Privacy, GIRAI examines explicit laws, oversight, and practice, and assesses frameworks and actions that ensure processing is lawful, fair, purpose-limited, and proportionate. It also evaluates transparency, user information rights, retention limits, accuracy, confidentiality, security, accountability, and rules for data transfers. The index considers national measures, including data-protection statutes, automated-decision directives, regulators with enforcement powers, audits, security controls, and initiatives like regulatory sandboxes. It also accounts for nonstate efforts by privacy and digital-rights groups that strengthen protocols and build capacity to mitigate AI-related privacy risks, such as large-scale tracking, profiling, and sensitive-data misuse.

Figure 3.6.1. Global AI data protection and privacy assessment. Map legend bins: 0; 1–20; 21–40; 41–60; 61–80; 81–100; No data. Source: The Global Index on Responsible AI, 2024 | Chart: 2026 AI Index report.

Figure 3.6.2. Global data protection and privacy legislation. Map legend: Legislation; Draft legislation; No legislation; No data. Source: UNCTAD, 2025 | Chart: 2026 AI Index report.

3.7 Fairness and Bias

Fairness and bias are among the hardest-to-measure dimensions of responsible AI, in part because what counts as fair depends heavily on context. GIRAI scores countries separately on bias and unfair discrimination, gender equality, and cultural and linguistic diversity.

Bias and Unfair Discrimination

The bias and unfair discrimination [13] dimension of the GIRAI score assesses whether countries have explicit measures to prevent and mitigate discriminatory outcomes in the design, development, and deployment of AI. It is meant to address algorithmic bias arising from unrepresentative data, flawed design, or entrenched social inequalities that can harm marginalized groups regardless of intent. It considers whether governments have put laws, oversight bodies, and enforcement mechanisms in place and whether civil society organizations are independently working to monitor and address bias. GIRAI scores on this dimension are fairly low across the board (Figure 3.7.1). The United States and Canada score highest, with Australia, parts of Europe, and Brazil falling in the middle range. Much of Africa, the Middle East, and Central Asia scores below 20.

[13] The bias and unfair discrimination dimension of the GIRAI score is grounded in international human rights frameworks (UDHR, ICERD, ICCPR, ICESCR).

Figure 3.7.1. Global AI bias and unfair discrimination assessment. Map legend bins: 0; 1–20; 21–40; 41–60; 61–80; 81–100; No data. Source: The Global Index on Responsible AI, 2024 | Chart: 2026 AI Index report.

Gender Equality

GIRAI’s gender equality dimension considers whether countries have state and nonstate initiatives that prevent gender bias and protect equal rights for all gender identities in AI design, development, and use. Canada and the Netherlands score the highest on this measure (Figure 3.7.2). Parts of Europe and Japan fall in the 61–80 range, followed by countries like the United States and Brazil, which score in the 41–60 range.

Cultural and Linguistic Diversity

GIRAI’s cultural and linguistic diversity dimension focuses on countries’ measures to protect local languages, dialects, Indigenous knowledge systems, and cultural diversity broadly across the AI lifecycle. Dominant-culture assumptions can bias AI, marginalize minorities, and erode minority languages. Scores on this dimension are more evenly spread than on the others (Figure 3.7.3). Singapore scores the highest, while Germany, Ireland, Italy, Qatar, Estonia, and Slovenia also score in the upper ranges (70–88).

Not all regions protect cultural and linguistic diversity the same way (Figure 3.7.4). In North America, government programs and nonstate actors, such as advocacy groups, research institutions, and digital rights organizations, are active, but formal legal frameworks are less developed. In Europe, Asia, and the Middle East, nonstate actors are also doing more than governments. In Africa, the gap is especially pronounced: nonstate actors show activity in 39% of countries, but only 7% of countries have government programs and just 2% have legal frameworks in place.
Figure 3.7.2. Global AI gender equality assessment. Map legend bins: 0; 1–20; 21–40; 41–60; 61–80; 81–100; No data. Source: The Global Index on Responsible AI, 2024 | Chart: 2026 AI Index report.

Figure 3.7.3. Global AI cultural and linguistic diversity assessment. Map legend bins: 0; 1–20; 21–40; 41–60; 61–80; 81–100; No data. Source: The Global Index on Responsible AI, 2024 | Chart: 2026 AI Index report.

Figure 3.7.4. Share of countries with evidence on cultural and linguistic diversity in AI by region and category. Series: government frameworks; government actions; nonstate actors. Regions: Africa; Asia and Oceania; Caribbean; Europe; Middle East; North America; South and Central America. Source: The Global Index on Responsible AI, 2024 | Chart: 2026 AI Index report.

[14] Data source for Figure 3.7.5: https://crfm.stanford.edu/helm/arabic/latest/.

Highlight: Inclusiveness and the Global Language Gap

As a small number of proprietary models shape global AI capabilities, the “global language gap” has become more visible. These systems perform much better in English and a handful of other widely spoken languages than in all others. This is a responsible AI concern because it determines who benefits from AI systems and who does not.

Work on language- and culture-specific foundation models and benchmarks has been under way for several years. Korean examples include the KoBEST (2022) and HAE-RAE (2023) benchmarks, alongside Korean-tailored models such as Polyglot-Ko and HyperCLOVA X. Spain’s Language Technologies Plan, launched in 2019, laid the groundwork for what became the publicly funded ALIA family of Spanish and regional-language models, with earlier regional efforts such as Catalonia’s AINA project predating the current wave of regional benchmarking.
In 2025, the pace and visibility of this work picked up, with new benchmarks and models emerging across more regions and beginning to register in global evaluation infrastructure. HELM Arabic, a regional extension of Stanford CRFM’s HELM framework developed with Arabic.ai, evaluates models across seven Arabic-language benchmarks covering academic assessment, grammar, and region-specific safety. On this evaluation, the top-scoring model was Arabic.ai’s LLM-X, a regionally developed model, with a mean score of 0.86, ahead of Gemini 2.5 Flash (0.82) and GPT-5.1 (0.81) (Figure 3.7.5). Rankings that hold in English-centric evaluations do not necessarily hold when benchmarks reflect local usage, dialect, and cultural references.

Figure 3.7.5 [14]. HELM Arabic: mean score (0 to 1) by model, for models released in 2024 and 2025. Source: HELM, 2026 | Chart: 2026 AI Index report.

A similar pattern appears in the Indic LLM Arena, a crowd-sourced evaluation led by AI4Bharat at IIT Madras that tests models across more than 20 Indian languages on language quality, cultural grounding, and safety. Proprietary models led the leaderboard, with GPT-5.2 scoring 1,314, followed by GPT-5.1 (1,298) and Gemini 3 Flash (1,288) (Figure 3.7.6). Open-source models scored lower but remained competitive, with Qwen3-Next-80B at 1,156 and Llama-4-Maverick-17B at 1,108. The evaluation goes beyond translation accuracy to test whether responses are contextually appropriate for Indian users, a dimension that global benchmarks typically do not capture.

[15] Data source for Figure 3.7.6: https://arena.ai4bharat.org/#/leaderboard/chat/overview.

The gap extends beyond language boundaries to dialect variation within the same language. The Slovene DIALECT-COPA benchmark tests commonsense reasoning in both Standard Slovenian and the Cerkno dialect. GPT-5 scored 99.8% on Standard Slovenian but dropped to 88.6% on the dialect (Figure 3.7.7). The drop was steeper for other models: Mistral Medium 3.1 fell from 90.0% to 53.2%, and Llama 3.3 fell from 87.0% to 53.6%. Dialects differ from standard varieties in spelling, vocabulary, and grammar, and are rarely represented in training data. These gaps suggest that even within languages that models handle reasonably well, performance can degrade sharply for speakers of nonstandard varieties.
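The size of the dialect gap is easiest to compare across models as a drop in percentage points. The following minimal sketch computes that drop for the three models whose DIALECT-COPA accuracies are quoted in the text; the dictionary layout and the name `dialect_gap` are illustrative assumptions, not part of the benchmark.

```python
# DIALECT-COPA accuracy (%) as quoted in the text:
# (Standard Slovenian, Cerkno dialect).
accuracy = {
    "GPT-5": (99.8, 88.6),
    "Mistral Medium 3.1": (90.0, 53.2),
    "Llama 3.3": (87.0, 53.6),
}

# Drop in percentage points when moving from the standard variety to the
# dialect, rounded to one decimal to avoid floating-point noise.
dialect_gap = {
    model: round(standard - dialect, 1)
    for model, (standard, dialect) in accuracy.items()
}

for model, gap in sorted(dialect_gap.items(), key=lambda item: item[1]):
    print(f"{model}: {gap:.1f} percentage point drop")
```

On these figures, the strongest model loses roughly 11 points while the other two lose more than 30, which is the pattern the paragraph above describes.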
Figure 3.7.6 [15]. Indic LLM Arena: score by model (axis 0 to 1,400; labeled scores range from 974 to 1,314). Source: Indic LLM Arena Leaderboard, 2026 | Chart: 2026 AI Index report.

[16] Data source for Figure 3.7.7: https://slobench.cjvt.si/leaderboard/view/17.

Figure 3.7.7 [16]. Slobench: accuracy (%) by model on Standard Slovenian and the Cerkno dialect. Source: Slovene DIALECT-COPA benchmark leaderboard, 2026 | Chart: 2026 AI Index report.

In response to these gaps, a growing number of regional initiatives are building language-specific AI infrastructure from the ground up rather than waiting for global labs to add coverage. Projects like SEA-LION in Southeast Asia and AI4Bharat in India are developing their own data pipelines, tokenizers, and evaluation benchmarks tailored to local linguistic conditions.
Many of the languages these projects serve have structural features, such as complex morphology, script diversity, and limited digitized text, that cause standard multilingual tools to perform poorly. These efforts position linguistic inclusiveness not as an afterthought but as a design requirement, and they represent a growing layer of responsible AI infrastructure outside the major AI-producing regions.

Figure 3.7.8

Africa
Benchmark | Languages covered | Focus
AfroBench | 64 African languages | Multi-task LLM evaluation across NLU, generation, QA/knowledge, and math (15 tasks; 22 datasets)
IrokoBench | 17 low-resource African languages across West/East/Southern/Central Africa | Human-translated suite covering NLI (AfriXNLI), math reasoning (AfriMGSM), and multi-choice knowledge QA (AfriMMLU)
TerjamaBench | Darija (Arabic script + Latin “Arabizi”) | English↔Darija machine translation benchmark emphasizing cultural context and regional variation (850 entries)
HausaMovieReview | Hausa (+ code-switched English) | Sentiment/review-style benchmark from 5,000 Hausa YouTube comments reflecting common code-switching

Asia, MENA (Arabic), Central Asia
Benchmark | Languages covered | Focus
Indic LLM Arena | Many Indian languages + English-creoles | Crowd-sourced, human-in-the-loop leaderboard evaluating language, culture, and safety in Indian contexts (AI4Bharat; supported by Google Cloud)
SEA-HELM | Filipino, Indonesian, Tamil, Thai, Vietnamese | Southeast Asian holistic evaluation of linguistic and cultural competence across multiple tasks
BATAYAN | Tagalog, Taglish | Holistic Filipino benchmark spanning understanding, reasoning, and generation; explicitly covers code-switching
HELM Arabic | Arabic | Transparent, reproducible Arabic LLM evaluation leaderboard built on established Arabic benchmarks (with Arabic.ai)
BALSAM | Arabic | Community-driven Arabic benchmark and platform with blind evaluation; 78 tasks across 14 categories (52K examples)
Cetvel | Turkish | Unified Turkish LLM benchmark built from 22 datasets covering 7 tasks, with a side-by-side leaderboard
TUMLU | Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Kyrgyz, Tatar, Turkish, Uyghur, Uzbek | Natively developed multilingual language-understanding benchmark for Turkic languages using middle-/high-school questions across 11 subjects
Kyrgyz LLM-Bench | Kyrgyz | Suite for deep understanding and reasoning in Kyrgyz, combining native benchmarks with translated/post-edited international tasks
ArmBench-LLM | Armenian | Armenian LLM benchmark combining university entrance exams with MMLU-Pro-Hy (1,000-question translated sample)
GeoLogicQA | Georgian | Manually curated 100-question benchmark for logical and inferential reasoning, validated by native speakers
CantoNLU | Cantonese | Seven-task Cantonese NLU benchmark (syntax/semantics, NLI, sentiment, tagging, parsing)
TLUE | Tibetan | Large-scale benchmark measuring LLM proficiency in Tibetan language understanding

Europe
Benchmark | Languages covered | Focus
BenCzechMark | Czech | Comprehensive Czech-centric benchmark with 50 tasks, multiple formats/metrics, and significance-aware aggregation
CUS-QA | Czech, Slovak, Ukrainian | Open-ended regional QA benchmark with text and visual grounding, curated by native speakers with English translations
COLE | French | 23-task French natural language understanding (NLU) benchmark emphasizing French-relevant linguistic phenomena (used to benchmark 94 LLMs)
Estonian Benchmark | Estonian | Benchmark built from seven datasets covering knowledge, grammar/vocabulary, summarization, and contextual comprehension
IberBench | Basque, Catalan, Galician, Spanish, Portuguese, English | Large, extensible benchmark integrating 101 datasets across 22 task categories (e.g., toxicity, summarization) with community-driven updates
IberoBench | Basque, Catalan, Galician, European Spanish, European Portuguese | Multi-task benchmark (62 tasks; 179 subtasks) built on the LM Evaluation Harness framework
Polish linguistic and cultural competency | Polish | 600 manually crafted questions evaluating Polish history, geography, culture/tradition, arts, grammar, and vocabulary
LLMzSzŁ | Polish | Exam-based benchmark drawn from Polish national exams (~19K closed-ended questions across 154 domains)
ITALIC | Italian | Culture-aware Italian NLU benchmark with 10,000 multiple-choice questions spanning 12 domains
SloBENCH | Slovenian | Evaluation platform with multiple leaderboards, including DIALECT-COPA (standard vs. dialect) and Slovene speech recognition

3.8 Transparency

Transparency measures how much developers disclose about how their models are built, trained, and deployed. Two independent indices track this from different angles.

The Openness Index

The Artificial Analysis Openness Index scores AI models on a 0 to 100 scale based on how freely weights can be accessed and licensed, as well as the level of transparency around training methodology and pre- and post-training data. Scores are low across leading models, with most falling between 2 and 16 out of 100 (Figure 3.8.1). K2 Think and Olmo 3 32B Think scored the highest, and they are also the only two models that scored any points for pre-training data transparency. Every other model in the index scores zero in that category. Model availability and methodology disclosure account for the bulk of points across all models. As Chapter 1’s discussion of access and deployment noted, over 90% of notable industry models were released without training code in 2025. The Openness Index results suggest that pattern extends beyond code to training data as well.

Foundation Model Transparency Index

The Foundation Model Transparency Index (FMTI) takes a different approach, scoring developers rather than individual models. Now in its third year, it evaluates disclosure across three stages of the model lifecycle. Upstream covers what goes into building a model, including training data, labor, and compute.
Model covers what is disclosed about the system itself, and Downstream covers what happens after release, including monitoring and impact reporting. In the 2025 edition, the average transparency score declined from 58 in 2024 to 40 (Figure 3.8.2). IBM leads at 95 and Writer follows at 72. Others, such as xAI and Midjourney, score just 14, whereas open model developers, B2B enterprise providers, organizations publishing transparency reports, and EU AI Act signatories tend to perform better. As with the Openness Index, the weakest area is Upstream, particularly around training data and the resources used to build models (Figure 3.8.3).

Figure 3.8.1. Openness index by components, by model (score axis 0 to 18). Components: model availability; transparency: methodology; transparency: post-training data; transparency: pre-training data. Source: Artificial Analysis, 2026 | Chart: 2026 AI Index report.
Figure 3.8.2. Foundation Model Transparency Index scores by domain (Upstream, Model, Downstream), 2025. Total scores as labeled: Granite 3.3: 95; Palmyra X5: 72; Jamba 1.6: 66; Claude 4: 46; Gemini 2.5: 41; Nova Premier: 39; o3: 35; DeepSeek-R1: 32; Llama 4: 31; Qwen 3: 26; Medium 3: 18; Midjourney V7: 14; Grok 3: 14. Source: 2025 Foundation Model Transparency Index.

Figure 3.8.3 [17]. Foundation Model Transparency Index scores by major dimensions of transparency, 2025 (heat map of 15 dimensions by 13 models, with row and column averages). Source: 2025 Foundation Model Transparency Index.

[17] Data, labor, compute, and methods were upstream indicators; model basics, access, capabilities, risks, and mitigations were model-level indicators; and distribution, usage policy, feedback, and impact were downstream indicators.

3.9 Security and Safety

Safety is the responsible AI dimension where institutional infrastructure has grown the fastest. New evaluation frameworks, government-backed AI safety institutes, and standardized benchmarks have all expanded in the past year. This section traces that growth and the resulting data on how well current models handle safety in practice.

Global AI Safety Institutes

AI safety institutes (AISIs) are state-backed specialist organizations created to help governments understand and manage risks from advanced AI, especially frontier/foundation models. They conduct technical evaluations and safety research that governments can use to shape policy. Fully operational institutes now exist in the UK (AI Security Institute), the U.S. (USAISI at NIST), Japan (JAISI), Singapore (Digital Trust Centre), and Israel (AI Security Research Unit) (Figure 3.9.1). India and France have also launched AISIs: India’s AI Safety Institute and France’s Current AI. A second wave is in development in Canada, South Korea, Germany, and Brazil. Outside of these standalone institutes, participation is growing through the International Network of AI Safety Institutes, with Kenya and Australia listed as network members without formal institutes of their own.

The countries building AISIs are still mostly wealthy, technologically advanced economies, and they are not all pursuing the same goals. The UK and Israel emphasize security, while the EU AI Office pairs evaluation with enforcement powers under the AI Act. Network membership is a practical entry point for countries without the resources to stand up a full institute immediately.
Figure 3.9.1 [18]. AI safety institutes and network membership. Map legend: Announced (in development); Established (operational); Network member (no formal institute); NA. Source: All Tech is Human, 2025 | Chart: 2026 AI Index report.

[18] Data source: https://alltechishuman.org/all-tech-is-human-blog/the-global-landscape-of-ai-safety-institutes.

Benchmarks

HELM Safety

HELM Safety, covered in last year’s report, continues to be one of the few standardized suites for evaluating AI models across responsibility and safety metrics. It tests models from major developers across benchmarks including BBQ (social bias), SimpleSafetyTests (self-harm and abuse risks), HarmBench (harassment and misinformation), AnthropicRedTeam (adversarial conversations), and XSTest (helpfulness vs. harmlessness trade-offs). The 2025 results show continued improvement but also increasing compression at the top (Figure 3.9.2). Most models released between 2024 and 2025 score between 0.90 and 0.98, with a very narrow gap between the highest and lowest scorers. Older models from 2023 score lower, but the overall trajectory suggests that leading models are converging on a safety ceiling where current benchmarks may not be fine-grained enough to distinguish meaningful differences.
Figure 3.9.2: HELM Safety: mean score, by model and release year, 2023–25 (mean score, 0 to 1). Source: HELM, 2026 | Chart: 2026 AI Index report.

AILuminate

AILuminate v1.0 is a new benchmark designed to test how well AI systems resist prompts that could trigger dangerous, illegal, or undesirable behavior. It covers 12 hazard categories, including violent crimes and child exploitation, and employs a five-tier grading scale from “Poor” to “Excellent.” The benchmark includes two separate evaluations. The first tests safety under normal use, with models evaluated both with and without external safety filters and moderation tools. The second tests a system’s ability to resist deliberate jailbreak attempts through adversarial prompts.

Safety Benchmark Results

Among models tested with external guardrails in place, Claude 3.5 Haiku, Claude 3.5 Sonnet, and Mistral Large all received “very good” ratings, while their parent models received “good” (Figure 3.9.3). In the set of models that could be tested without external safety filters or moderation tools, Gemma 2 9b, Phi 3.5 MoE Instruct, and Phi 4 scored “very good” (Figure 3.9.4). The two groups are not directly comparable, as they involve different models under different conditions, but both show a baseline safety performance of “good” across leading systems.

Figure 3.9.3. Source: MLCommons, 2025.

Figure 3.9.4. Source: MLCommons, 2025.

Jailbreak T2T Benchmark v0.5 Results

The AILuminate Jailbreak T2T benchmark v0.5 tests what happens when users deliberately try to bypass a model’s safety measures through adversarial prompts. Each model in the chart receives two scores (Figure 3.9.5). The square at the top represents the model’s safety score under normal conditions, while the circle below it represents the score after exposure to jailbreak attempts. As this is a beta version of the benchmark, models are de-identified by number rather than named. Under normal conditions, most models score in the “very good” or “good” range. After jailbreak attempts, nearly every system’s score drops, some by a full tier or more. So while safety under normal use is generally good, it degrades under deliberate manipulation.

Figure 3.9.5.
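The two-score comparison described for the jailbreak benchmark can be sketched as a tier-drop calculation. The snippet below is an illustration only, not the MLCommons grading method. The report names just the endpoints of the five-tier scale plus “good” and “very good,” so the tier name “Fair” is an assumption. The helper `tier_drop` is hypothetical, and the three systems and their grades are invented placeholders rather than benchmark results.

```python
# Illustrative sketch only: the grades below are invented, not AILuminate results.
# Assumption: five ordered tiers; "Fair" is an assumed name for the second tier.
TIERS = ["Poor", "Fair", "Good", "Very Good", "Excellent"]


def tier_drop(baseline: str, after_jailbreak: str) -> int:
    """Return how many tiers a system falls (positive means worse after jailbreak)."""
    return TIERS.index(baseline) - TIERS.index(after_jailbreak)


# Hypothetical, de-identified systems: (grade under normal use, grade after jailbreak).
systems = {
    "System 1": ("Very Good", "Good"),
    "System 2": ("Good", "Good"),
    "System 3": ("Very Good", "Fair"),
}

drops = {name: tier_drop(*grades) for name, grades in systems.items()}
# Systems that fall by a full tier or more under adversarial prompting.
full_tier_or_more = [name for name, d in drops.items() if d >= 1]

for name, d in drops.items():
    print(f"{name}: falls {d} tier(s)")
```

Treating the grades as ordinal positions keeps the comparison to "how many tiers lower" and avoids implying a numeric score that the report does not publish.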

3.10 Tradeoffs Across RAI Dimensions

In practice, AI systems must satisfy multiple responsible AI dimensions at once. A growing number of empirical studies suggest that these dimensions do not improve independently, as optimizing for one can degrade others. The direction and magnitude of those trade-offs depend on the method used, the data involved, and the context of deployment.

Kemmerzell and Schreiner (2024) tested this directly by training image classification models on four facial analysis datasets and measuring what happened to fairness, privacy, explainability, and robustness when each dimension was targeted in isolation. Differential privacy, a technique that adds noise during training to prevent individual data points from being identified, improved privacy scores across all datasets but reduced explainability, fairness, and accuracy, with accuracy falling by up to 33 percentage points on some configurations. Training adaptations aimed at improving fairness succeeded only on the dataset with the most demographic imbalance, and therefore the most room to correct; across all datasets, they reduced explainability and robustness. Data augmentation methods designed to improve robustness by exposing models to more varied training images produced the fewest negative side effects across the same experiments. They also improved explainability and accuracy, with only minor reductions in privacy and fairness. No single intervention improved all four dimensions at once.

A separate evaluation of large language models found a similar pattern at the model level. Cecchini et al. (2024) scored 11 models across robustness, accuracy, and toxicity using the LangTest evaluation toolkit. GPT-4 led on robustness (average score of 0.91 out of 1.0) and accuracy (0.67), but Llama 2 7B scored highest on toxicity avoidance (0.98), meaning it was the most likely to refuse toxic prompts.
Models that performed well on robustness, such as Mistral 7B and Mixtral 8x7B, scored among the lowest on toxicity avoidance (0.39 and 0.42, respectively). The ranking of models shifted depending on which dimension was being measured, and no single model was a clear leader in all three.

These trade-offs also appear in federated learning, a training approach where multiple institutions train a shared model by exchanging model updates rather than the underlying data. Wasif et al. (2025) studied how privacy-preserving techniques interact with fairness in this setting across four datasets, including Alzheimer’s disease MRI scans and credit card fraud records. Differential privacy did not affect all datasets equally. Institutions with larger datasets could absorb the added noise, while smaller institutions saw their contributions to model training degraded. In the Alzheimer’s scenario, adding stronger privacy protections reduced the model’s ability to correctly identify the disease, with accuracy falling by 14.8 percentage points. The effect was worse for hospitals with less data, where missed diagnoses rose by 21.4%. Two alternative privacy methods that use encryption instead of noise kept fairness more stable but required two to three times more computing power.

The studies covered above are recent and focus on specific tasks rather than general-purpose AI systems. Their findings nonetheless point in the same direction: improving one responsible AI dimension tends to come at the expense of another. There is no shared framework that measures or compares these trade-offs, which is another measurement gap in the RAI space and makes it difficult to track whether the field is getting better at managing them.

Chapter 4: Economy

Contents

Chapter Highlights

1. Global corporate AI investment more than doubled in 2025. Private investment grew 127.5% and now accounts for roughly 60% of the total. Generative AI led the surge, growing more than 200% and capturing nearly half of all private AI funding. Newly funded AI companies rose 71%, and billion-dollar funding events nearly doubled.

2. The United States continues to lead in global private AI investment, committing 23 times more than China. In generative AI, U.S. investment exceeded the combined total of China and Europe by a wide margin. However, private investment figures likely understate China’s total AI spending, as government guidance funds deployed an estimated $184 billion into AI firms between 2000 and 2023.

3. AI company revenue is rising at historically fast rates, but compute costs and infrastructure spending are also reaching record levels. Leading frontier companies are reaching meaningful revenue scale in a short period of time, but compute spend has increased significantly year over year. Major cloud providers have accelerated capital expenditures, with Google reporting more than $150 billion in annual capex in 2025.

4. The value consumers get from generative AI grew 54% in a year. Estimated U.S. consumer surplus reached $172 billion annually by early 2026, up from $112 billion a year earlier, with the median value per user tripling over the same period. Most of these tools remain free or close to it.

5. Organizational AI adoption continued to rise in 2025, up to 88% of surveyed organizations, though AI agent use remains at an early stage. Generative AI is now used in at least one business function at 70% of organizations, and China and Europe posted the highest year-over-year increases. AI agent deployment was in the single digits across nearly all business functions.

6. Generative AI reached 53% adoption in three years, faster than the personal computer or the internet. Adoption varies widely across countries and correlates strongly with GDP per capita, though some countries outpace what income would predict, including Singapore at 61% and the United Arab Emirates at 54%. Despite its lead in AI investment and model development, the United States ranks 24th at 28.3%.

7. AI’s labor market effects are showing up unevenly, concentrated in hiring pipelines and the youngest workers in exposed occupations. Employment for software developers ages 22 to 25 has fallen nearly 20% from 2024. Employer surveys point to further change ahead, with one-third of respondents expecting workforce reductions over the coming year.

8. One-third of organizations expect AI to reduce their workforce in the coming year, even though large-scale job losses have not yet shown up in overall employment data. Almost half of organizations surveyed expected little to no change. Anticipated reductions are highest in service operations, supply chain, and software engineering. Across nearly all functions, anticipated decreases outpace those already observed.

9. Productivity gains from AI are largest in structured, measurable work where outputs are easy to monitor. Studies report gains of 14% to 15% in customer support, 26% in software development, and 50% in marketing output. Gains are smaller in tasks requiring deeper reasoning, and recent evidence raises concerns that heavy AI reliance may carry long-term learning penalties that slow skill development over time.

10. China continues to install more industrial robots than the rest of the world combined, and the gap widened in 2024. China accounted for 54% of industrial robots installed globally, up from 51.1% in 2023. Global year-over-year growth was flat, and several major markets, including the United States, Germany, and Italy, saw declines. Taiwan was an exception, recording the highest year-over-year growth at 33%.
4.1 Year in Review: 2025

What’s New in 2025: A Timeline

Timeline event categories: Investment, Milestone, Acquisition, Funding, Deal, Partnership.

January 21. $500B: “Stargate Project” AI infrastructure joint venture announced. OpenAI, SoftBank, Oracle, and MGX, supported by Nvidia and others, launch Stargate, a major AI infrastructure project announced at the White House. The venture plans to invest between $100 billion and $500 billion to build advanced AI data centers across the United States by 2029.

January 27. No. 1: DeepSeek reaches No. 1 as the most downloaded free app on Apple’s U.S. App Store.

March 6. $138B: China announces a $138 billion state VC fund to invest in AI and other cutting-edge technologies.

March 10. Agentic AI: ServiceNow announces plans to acquire Moveworks to drive use of its agentic AI platform across key growth areas including CRM.

March 28. $23B: CoreWeave, an AI data center company, has the largest U.S. tech IPO since 2021, raising $1.5 billion and valuing the company at $23 billion.

March 31. $300B: OpenAI raises $40 billion at a $300 billion post-money valuation.

May 13. $5B: AWS and HUMAIN announce a $5 billion AI infrastructure deal to accelerate AI adoption in Saudi Arabia and globally.

May 21. $6.5B: OpenAI acquires IO, the AI hardware startup founded by Jony Ive, for $6.5 billion to develop a new generation of consumer AI devices.

June 2. Watsonx AI: IBM acquires the AI startup Seek AI to launch Watsonx AI Labs, an AI accelerator and innovation hub in New York City.

July 9. $4T: Nvidia becomes the first public company worth $4 trillion.

July 11. $2.4B: Google hires key staff from AI code-generation startup Windsurf and agrees to pay $2.4 billion in license fees to use some of Windsurf’s technology on a nonexclusive basis.

July 15. $12B: Thinking Machines Lab, an AI company founded by Mira Murati and other former OpenAI researchers, raises a $2 billion seed round at a $12 billion valuation.

September 2. $183B: Anthropic raises $13 billion in Series F funding at a $183 billion post-money valuation.

September 10. $300B: OpenAI signs a $300 billion, five-year cloud contract with Oracle, beginning in 2027. Oracle will provide 4.5 gigawatts of computing capacity for OpenAI’s Stargate data center initiative.

October 27. $10B: Mercor, which connects AI labs with domain experts for training their foundation AI models, raises a $350 million Series C at a $10 billion valuation, making its founders, both 22 years old, the youngest-ever self-made billionaires.

November 10. $2.1B: Gamma, a startup that creates AI-generated presentations, websites, and social media posts, announces a $68 million Series B round at a $2.1 billion valuation led by Andreessen Horowitz.

November 11. $6.4B: Google announces it will invest $6.4 billion in cloud infrastructure in Germany from 2026–29 to expand its data center capacity there.

November 14. $29.3B: Anysphere, which sells the popular AI coding assistant Cursor, raises $2.3 billion at a $29.3 billion valuation.

November 14. $40B: Google announces a $40 billion investment in Texas data centers and AI infrastructure through 2027.

November 20. $5.6B: Physical Intelligence, a robotics AI startup building general-purpose foundation-model “brains” for robots, raises $600 million led by CapitalG at a $5.6 billion valuation.

December 9. $17.5B: Microsoft pledges a $17.5 billion investment in India’s AI infrastructure.

December 10. 1M: Amazon announces it will invest over $35 billion in India by 2030 to expand AI and logistics, targeting 1 million new jobs and $80 billion in seller exports.

December 22. $4.75B: Alphabet says it will acquire Intersect for $4.75 billion in cash plus assumed debt to accelerate U.S. energy innovation and data center infrastructure build-out.

4.2 Investment and Infrastructure

The scale and direction of investment into AI provide a signal of the technology’s broader economic trajectory. As AI systems become more capable and infrastructure-intensive, the capital required to develop and deploy them has expanded. Viewed alongside the broader trends discussed in other chapters of this report, these investment patterns capture not just market interest but also the rising cost of participating in the AI economy. This section examines those patterns across corporate infrastructure spending, startup funding activity, and the operational economics of AI companies themselves.

The analysis draws primarily from Quid’s database of AI-related investments, supplemented by publicly disclosed financial metrics from leading AI companies, as tracked by Epoch AI. The Quid investment data captures four categories of capital flowing into AI companies: mergers and acquisitions, minority stake investments, private investment, and public offerings. Corporate investment, as used in this section, refers to the aggregate of all four. The private investment subsection that follows focuses more narrowly on private financing events, such as venture capital or private equity funding, directed at AI companies that have received over $1.5 million in funding since 2013. This subset represents a portion of total corporate investment.

Corporate Investment Activity

Global corporate AI investment has grown over the past decade and accelerated in recent years, underscoring how far AI has moved from an emerging technology to a key strategic priority. Across mergers and acquisitions, minority stakes, private investment, and public offerings, AI-related investment has increased approximately fortyfold since 2013 (Figure 4.2.1). In 2025, total investment reached $581.69 billion, a 129.9% increase from the previous year. Private investment represented the largest share of activity at $344.66 billion, up 127.5% from 2024. Mergers and acquisitions showed similar growth, rising 132.6% year over year. Though the composition of investment varies from year to year, organizations are clearly committing growing sums of capital to strengthen their AI capabilities and position.

Figure 4.2.1: Global corporate investment in AI by investment activity, 2013–25 (total investment, in billions of US dollars). Series: merger/acquisition, minority stake, private investment, public offering. Total by year, 2013–25: 14.57, 19.04, 25.43, 33.82, 53.72, 79.62, 103.27, 221.87, 360.73, 253.25, 201, 253.02, 581.69. Labeled 2024 and 2025 values: private investment 151.48 and 344.66; merger/acquisition 92.19 and 214.44. Source: Quid, 2025 | Chart: 2026 AI Index report.

Private Investment Activity

Within the broader investment landscape, private investment data, covering AI and ML companies with over $1.5 million in funding since 2013, offers a granular view into which firms are being funded and where that funding is concentrated. In 2025, global private investment in AI reached $344.7 billion, a 127.5% increase over the previous year (Figure 4.2.2). Generative AI companies accounted for $170.9 billion of that total, representing nearly half of all private investment and an increase of over 200% from 2024 (Figure 4.2.3).

Figure 4.2.2: Global private investment in AI, 2013–25 (total investment, in billions of US dollars). 2025 value: 344.66. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.3: Global private investment in generative AI, 2019–25 (total investment, in billions of US dollars). 2025 value: 170.87. Source: Quid, 2025 | Chart: 2026 AI Index report.

The private investment market is expanding in breadth but even more so in concentration. While the number of newly funded AI companies grew in 2025 (a 70.8% year-over-year increase), capital has become less evenly distributed, and the majority of investment dollars flow through a small number of deals (Figures 4.2.4–4.2.7). Compared with 2024, the average size of a private AI investment event in 2025 increased 46% to $66.5 million. Investment activity increased across all funding sizes, but the strongest growth was at the upper end of the distribution, with 28 events exceeding $1 billion, up from 15 in 2024. The timeline in Section 4.1 lists several of these large funding rounds, including OpenAI’s $40 billion raise and Anysphere’s $2.3 billion round at a $29.3 billion valuation, highlighting the increasing skew in the funding landscape.
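The growth rates and shares quoted in this section follow from simple arithmetic on the reported totals. The sketch below recomputes them from the rounded values in the text and in Figures 4.2.1 and 4.2.3. It assumes the 2024 private investment and merger/acquisition values are those labeled on the Figure 4.2.1 chart, and the helper name `pct_change` is illustrative. Because the inputs are rounded, small differences from the published percentages are possible.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# Values in billions of US dollars, as labeled (rounded) in Figure 4.2.1.
total_2013, total_2024, total_2025 = 14.57, 253.02, 581.69
private_2024, private_2025 = 151.48, 344.66
ma_2024, ma_2025 = 92.19, 214.44  # assumed to be the merger/acquisition series
genai_2025 = 170.87  # Figure 4.2.3

total_growth = pct_change(total_2024, total_2025)        # reported as 129.9%
private_growth = pct_change(private_2024, private_2025)  # reported as 127.5%
ma_growth = pct_change(ma_2024, ma_2025)                 # reported as 132.6%

fold_since_2013 = total_2025 / total_2013            # "approximately fortyfold"
private_share = private_2025 / total_2025 * 100.0    # private share of corporate total
genai_share = genai_2025 / private_2025 * 100.0      # "nearly half" of private investment

print(f"total {total_growth:.1f}%, private {private_growth:.1f}%, M&A {ma_growth:.1f}%")
print(f"{fold_since_2013:.1f}x since 2013; private is {private_share:.1f}% of total; "
      f"generative AI is {genai_share:.1f}% of private")
```

Keeping the year-over-year changes separate from the share calculations makes it easy to see which published figure each input supports.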
Figure 4.2.4: Number of newly funded AI companies in the world, 2013–25. 2025 value: 3,499. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.5: Number of newly funded generative AI companies in the world, 2019–25. 2025 value: 311. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.7: Average size of global AI private investment events, 2013–25 (average investment, in millions of US dollars). 2025 value: 66.52. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.6: Global AI private investment events by funding size, 2024 vs. 2025. Source: Quid, 2025 | Table: 2026 AI Index report.

Funding size (US dollars) | 2024 | 2025
Over 1 billion | 15 | 28
500 million – 1 billion | 20 | 30
100 million – 500 million | 146 | 286
50 million – 100 million | 197 | 373
Under 50 million | 2,951 | 4,464
Undisclosed | 209 | 324
Total | 3,538 | 5,505

Geographic Distribution of Private Investment

As measured by both investment totals and the number of newly funded companies, private AI investment remains highly concentrated in a small number of countries. In 2025, the United States was the global leader with nearly $285.9 billion invested, 23.1 times the amount invested in the next-highest country, China ($12.4 billion), and 48.5 times the amount invested in the United Kingdom ($5.9 billion) (Figure 4.2.8). This disparity was also seen in entrepreneurial activity, as the United States led with 1,953 newly funded AI companies in 2025, compared with 172 in the United Kingdom and 161 in China (Figure 4.2.9).
In the United States, more than half of total private AI investment was generative AI-related ($163.6 billion), while the combined investment by China and Europe was $4.7 billion (Figure 4.2.10). Compared with 2024, private AI investment in the United States increased 160.2%, versus increases of 32.2% in China and 7.2% in Europe (Figure 4.2.11).

Figure 4.2.8: Global private investment in AI by geographic area, 2025 (total investment, in billions of US dollars). United States 285.88; China 12.41; United Kingdom 5.90; France 4.36; Canada 4.28; India 4.09; Germany 3.89; Israel 3.58; Australia 2.52; Saudi Arabia 2.03; Singapore 1.82; South Korea 1.78; Belgium 1.20; Japan 1.11; Sweden 0.97. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.9: Number of newly funded AI companies by geographic area, 2025. United States 1,953; United Kingdom 172; China 161; India 108; Germany 92; France 84; Canada 79; Israel 64; South Korea 59; Japan 56; Singapore 49; Italy 38; Australia 38; Switzerland 34; Spain 33. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.10: Global private investment in generative AI by geographic area, 2019–25 (total investment, in billions of US dollars). 2025 values: United States 163.64; Europe 3.21; China 1.48. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.11: Global private investment in AI by geographic area, 2013–25 (total investment, in billions of US dollars). 2025 values: United States 285.88; Europe 20.92; China 12.41. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.12: Global private investment in AI by geographic area, 2013–25 (sum) (total investment, in billions of US dollars). United States 757.27; China 131.83; United Kingdom 34.07; Canada 19.59; Israel 18.54; Germany 17.16; France 15.57; India 15.39; South Korea 10.75; Singapore 9.09; Sweden 8.24; Japan 7.00; Australia 6.50; Switzerland 4.73; United Arab Emirates 4.24. Source: Quid, 2025 | Chart: 2026 AI Index report.

As noted earlier, the private investment figures in this section are drawn from Quid and do not account for government-backed funding in countries like China. For example, the Chinese government channels resources through government guidance funds, which are state-initiated investment funds that aim both to produce financial returns and to further the government’s strategic priorities (Beraja et al., 2024; Luong et al., 2021). Between 2000 and 2023, an estimated $912 billion from these funds was deployed across industries, with an estimated $184 billion allocated to AI companies. Given this, comparisons based on private investment alone likely understate how much capital China is directing toward AI.

The trends in geographic concentration are also visible over a longer time horizon. Since 2013, the United States has attracted $757.3 billion in total private AI investment, far ahead of China at $131.8 billion (Figure 4.2.12). Other countries with notable cumulative investment totals include the United Kingdom ($34.1 billion), Canada ($19.6 billion), Israel ($18.5 billion), and Germany ($17.2 billion). Over the same period, the number of newly funded U.S. companies far exceeds that of other geographic areas: five times the number in China and 8.4 times the number in the United Kingdom (Figures 4.2.13 and 4.2.14).
The United States’ growth rate continues to accelerate, with a 77.8% year-over-year increase in the number of funded AI startups.

Figure 4.2.13: Number of newly funded AI companies by geographic area, 2013–25 (sum). United States 8,909; China 1,766; United Kingdom 1,057; Canada 560; Israel 556; France 552; India 542; Germany 486; Japan 444; South Korea 329; Singapore 288; Australia 216; Switzerland 188; Spain 150; Netherlands 141. Source: Quid, 2025 | Chart: 2026 AI Index report.

Figure 4.2.14: Number of newly funded AI companies by geographic area, 2013–25. 2025 values: United States 1,953; Europe 639; China 161. Source: Quid, 2025 | Chart: 2026 AI Index report.

Within the United States, funding and entrepreneurial activity are heavily concentrated in a small number of states (Figure 4.2.15). California accounted for $218 billion in 2025, representing over 75% of the national total. Colorado ($19 billion), New York ($13 billion), and Florida ($6 billion) recorded the next-largest totals. More than half of all U.S. states received less than $100 million in private AI investment, and a few, including South Dakota, Oklahoma, Arkansas, and West Virginia, reported no mapped investment activity. The underlying data for these state-level figures is not exhaustive, but the overall pattern is clear.
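The cross-country multiples and the California share in this subsection can be recomputed from the rounded chart values. The sketch below does so for 2025 investment (Figure 4.2.8), cumulative company counts (Figure 4.2.13), and the state total (Figure 4.2.15). The helper `multiple` and the variable names are illustrative. Because the inputs are rounded, a result can differ slightly from the published figure: the rounded 2025 values give a U.S.-to-China multiple of about 23.0 against the reported 23.1, which may reflect unrounded source data.

```python
def multiple(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


# 2025 private AI investment, billions of US dollars (Figure 4.2.8, rounded).
invest_2025 = {"United States": 285.88, "China": 12.41, "United Kingdom": 5.90}
# Newly funded AI companies, 2013-25 cumulative (Figure 4.2.13).
companies_cum = {"United States": 8909, "China": 1766, "United Kingdom": 1057}
california_2025 = 218.0  # billions of US dollars (Figure 4.2.15, rounded)

us_vs_china = multiple(invest_2025["United States"], invest_2025["China"])
us_vs_uk = multiple(invest_2025["United States"], invest_2025["United Kingdom"])
co_us_vs_china = multiple(companies_cum["United States"], companies_cum["China"])
co_us_vs_uk = multiple(companies_cum["United States"], companies_cum["United Kingdom"])
california_share = california_2025 / invest_2025["United States"] * 100.0

print(f"2025 investment: US/China {us_vs_china:.1f}x, US/UK {us_vs_uk:.1f}x")
print(f"Cumulative companies: US/China {co_us_vs_china:.1f}x, US/UK {co_us_vs_uk:.1f}x")
print(f"California share of US total: {california_share:.1f}%")
```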
AL $28M AK $11M AZ $504M AR CA $218B CO $19B CT $741M DC $533M DE $3B FL $6B GA $1B HI $86M ID $26M IL $696M IN $86M IA $33M KS $50M KY $34M LA $17M ME $29M MD $427M MA $5B MI $239M MN $328M MS $117M MO $280M MT $63M NE $415M NV $539M NH $36M NJ $865M NM $33M NY $13B NC $874M ND $6M OH $723M OK OR $189M PA $2B RI $68M SC $6M SD TN $302M TX $5B UT $958M VT $40M VA $2B WA $2B WV WI $117M WY $35M Source: Quid, 2025 | Chart: 2026 AI Index report No data <$100M $100M $500M $500M $1B $1B $5B $5B $20B $20B+ US state private investment in AI, 2025 Figure 4.2.15 187 4.2 I N V E S T M E N T A N D I N F R AS T R U C T U R E | E CO N O M Y | A I I N D E X R E P O R T 2026 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 Drones Ed tech HR tech Marketing, digital ads Accounting/Ǉnance Quantum computing Entertainment Legal tech Retail Creative, music, video content Semiconductors Energy management Biotech Defense Fintech Robotics Autonomous vehicles AI agents Cybersecurity, data protection Cloud computing Pharmaceutical Medical and healthcare Internet of things Data management, processing AI infrastructure/models/research/governance 2025 2024 Total investment (in billions of US dollars) Global private investment in AI by focus area, 2024 vs. 2025 Source: Quid, 2025 | Chart: 2026 AI Index report Focus Areas of Private Investment In 2025, the breakdown of private AI startup investment by focus area suggests that capital was directed more heavily toward segments closest to building and scaling AI systems. The category of AI infrastructure, models, research, and governance attracted the largest share of funding, reaching $143.2 billion (Figures 4.2.16 and 4.2.17). As this category combines several types of priorities, the trend is best interpreted as evidence of growing investment in the foundational layers of the AI ecosystem rather than a precise measure of any single one of those areas. 
In recent years, this category has experienced the steepest growth in investment compared with all other areas. Other focus areas have expanded as well, including data management and processing and Internet of Things (IoT), yet none approaches the scale of foundational infrastructure. Alongside 2025 trends in technical performance (Chapter 2), and research and development (Chapter 1), this suggests that investment is tracking the infrastructure demands of deploying increasingly capable AI systems. Figure 4.2.16 188 4.2 I N V E S T M E N T A N D I N F R AS T R U C T U R E | E CO N O M Y | A I I N D E X R E P O R T 2026 2018 2022 2025 0 10 20 30 2018 2022 2025 0 50 100 150 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 2018 2022 2025 0 10 20 30 Total investment (in billions of US dollars) AI agents AI infrastructure/models/research/governance* AV Accounting/Ǉnance Biotech Cloud computing Creative, music, video content Cybersecurity, data protection Data management, processing Defense Drones Ed tech Energy management Entertainment Fintech HR tech IoT Legal tech Marketing, digital ads Medical and healthcare Pharmaceutical Quantum computing Retail Robotics Semiconductors 8.02 143.22 7.94 2.59 4.84 10.31 4.19 8.42 31.58 5.30 1.42 1.44 4.64 3.66 6.52 2.19 14.63 3.79 2.51 11.75 10.58 3.11 4.10 7.84 4.40 Global private investment in AI by focus area, 2018–25 Source: Quid, 2025 | Chart: 2026 AI Index report Figure 4.2.17 189 4.2 I N V E S T M E N T A N D I N F 
R AS T R U C T U R E | E CO N O M Y | A I I N D E X R E P O R T 2026 Investment patterns show where capital is flowing across the AI ecosystem, but the operating economics of frontier AI companies reveal how advances in technical performance and deployment translate into commercial scale, compute demand, and infrastructure buildout. Using publicly disclosed data tracked by Epoch AI, this section examines those revenue trajectories, ongoing compute expenses, and the infrastructure costs to support AI development. Annualized revenue estimates for leading AI companies, including OpenAI, Anthropic, xAI, and Mistral AI, have grown quickly in recent years (Figure 4.2.18). These estimates are drawn from direct company statements or established media reporting from 2023 through 2025. They may differ from annual recurring revenue calculations, and the underlying data varies in reliability depending on source credibility and accounting practices. Therefore, these figures should be interpreted as directional rather than precise. The chart uses a logarithmic scale to accommodate the exponential growth pattern, meaning that a straight line represents consistent percentage growth rather than absolute growth. With those notes in mind, the overall dynamic points to a small set of frontier AI companies reaching meaningful revenue scale in a relatively short amount of time. To contextualize this growth, a separate comparison places OpenAI’s revenue trajectory alongside those of other high-growth companies in the years after crossing $1 billion in annual revenue (Figure 4.2.19). While Google remains the only company in the comparison set to be scaling toward $100 billion in annual revenue, OpenAI’s early revenue growth outpaces that of Uber, Cheniere Energy, and Moderna over comparable time periods. 
AI Company Economics

Revenue

[Figure 4.2.18: AI company annualized revenue. x-axis: Jul-2022 to Jan-2026; y-axis: annualized revenue (in US dollars, log scale, $10M to $20B). Labeled values: Z.ai, $53M; Mistral AI, $400M; xAI, $428M; Anthropic, $19B; OpenAI, $25B. Source: Epoch AI, 2026 | Chart: 2026 AI Index report]

Figure 4.2.19

Annual Compute Spend

The rapid revenue growth of leading AI companies has come with increasing compute costs. Reported annual compute spend, which largely reflects rented cloud capacity rather than owned data centers, offers a proxy for how much compute these companies procure each year to train and operate models at scale (Figure 4.2.20). OpenAI’s reported compute spend increased significantly from 2024 to 2025, as did Anthropic’s. The drive to meet growing commercial demand with increasingly capable systems means that the economics of frontier AI are tied to large-scale compute and its associated costs.

[Figure 4.2.20: Annual compute spend of select frontier AI companies. Bars for OpenAI and Anthropic, years labeled 2022–2025; y-axis: annual compute spend (in billions of US dollars); stacked categories: unattributed, R&D, inference. Source: Epoch AI, 2026 | Chart: 2026 AI Index report]

The infrastructure needed to support frontier AI is being financed not only by AI companies but also by the major cloud providers that lease them compute capacity. These providers have accelerated their own infrastructure investments to support increasingly advanced AI models (Figure 4.2.21).
In 2025, Google and Amazon led in total annual capital expenditures (capex), with Google reporting more than $150 billion in capex. This infrastructure investment is reflected in the chapter’s 2025 timeline, which includes the $100–$500 billion Stargate Project announced by OpenAI, SoftBank, Oracle, and others, as well as Google’s $40 billion commitment to Texas data centers and Microsoft’s $17.5 billion investment in AI infrastructure in India.

Capital Expenditures

[Figure 4.2.21: “Hyperscaler’s annual capex has more than doubled since ChatGPT’s release.” x-axis: year (2018–2026); y-axis scale: 0–400; series: AMZN, META, GOOGL, MSFT, ORCL. Exponential View. Source: Citi Research]

Highlight: What Is Generative AI Worth?

Investment, revenue, and compute costs all measure the value of AI to the companies building and deploying it. They do not capture what the technology is worth to the people using it. For most people, generative AI tools are free or close to it, making their economic value easy to undercount. Brynjolfsson et al. (2026) provide the first longitudinal estimates of that value, using online choice experiments conducted in 2025 (N=1,400) and early 2026 (N=2,000). Rather than measuring productivity effects, the study directly asked users how much compensation they would accept to give up access to all generative AI tools for one month. This measure of “consumer surplus” is theoretically appropriate for goods that are largely free and already in the consumers’ possession.

The study estimates that total consumer surplus in the United States grew from $112 billion to $172 billion annually (Figure 4.2.22). This reflects the growing share of U.S. adults using generative AI, which increased from 48% to 56% (Bick et al., 2026), as well as a higher value per user. In particular, the average consumer surplus among U.S.
generative AI users increased by 27% from $98 in 2025 to $125 by March 2026, while the median value per user more than tripled, from $3.40 to $11.40, over the same period. This increase in both adoption and per-user value is plausibly driven by a broadening and deepening of the capabilities of AI models.

This consumer surplus figure dwarfs estimated U.S. generative AI revenues, suggesting that the social returns from the technology far exceed the private returns captured by producers. This pattern is consistent with findings by Nordhaus (2004) that innovators historically capture only about 3% of total social returns from major technologies. The authors also find that usage frequency is the strongest individual-level predictor of surplus, followed by work use, number of different products used, and paid subscription status. Using generative AI for practical guidance, technical help, or information seeking is also associated with higher surplus.

[Figure 4.2.22: Generative AI consumer surplus in the United States, 2025 vs. 2026. Panels: total consumer surplus, $112B to $172B (+53.57%); U.S. adults using generative AI, 95M to 115M (+21.05%); median consumer surplus per user, $3.4 to $11.4 (+235.29%); average consumer surplus per user, $98 to $125 (+27.55%). Source: Brynjolfsson et al., 2026 | Chart: 2026 AI Index report]
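The percentage changes labeled in Figure 4.2.22 can be reproduced from the panel values, and the panels can be cross-checked against one another. The sketch below is a consistency check, not the authors’ method. It assumes that the per-user average is a monthly figure (the survey question concerns giving up access for one month) and that the annual total is roughly average × 12 × number of users; both are assumptions made here for illustration.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100

# Panel values as labeled in Figure 4.2.22 (2025, 2026).
total_surplus = (112e9, 172e9)    # US dollars per year
users = (95e6, 115e6)             # U.S. adults using generative AI
avg_per_user = (98, 125)          # US dollars; ASSUMED to be per month
median_per_user = (3.40, 11.40)   # US dollars

changes = {
    "total": pct_change(*total_surplus),      # about +53.57%
    "users": pct_change(*users),              # about +21.05%
    "average": pct_change(*avg_per_user),     # about +27.55%
    "median": pct_change(*median_per_user),   # about +235.29%
}

# ASSUMPTION: annual total ~ monthly average x 12 months x users.
implied_totals = [avg * 12 * n for avg, n in zip(avg_per_user, users)]

for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
print([f"${t / 1e9:.1f}B" for t in implied_totals])
```

Under these assumptions the implied totals come to about $111.7 billion and $172.5 billion, close to the reported $112 billion and $172 billion, which suggests that the panels are mutually consistent.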

4.3 Corporate AI Adoption

The investment and infrastructure activity described earlier in this chapter establishes the scale of resources being directed toward AI as well as the capacity being built to support it. This section examines how those investments translate into organizational use and reported business outcomes. Drawing on McKinsey & Company’s annual State of AI surveys and other enterprise measures, the analysis traces the breadth and depth of adoption and its associated benefits. As with other survey-based data in this chapter, the results are self-reported and should be viewed as directional rather than comprehensive.

Industry Usage

In 2025, organizational adoption of AI continued to expand in both usage and function. A large majority of respondents reported that their organization uses AI in at least one business function: 88% in 2025, up from 78% in 2024 (Figure 4.3.1). Over half of respondents reported that three or more business functions use AI. Use of generative AI mirrored that growth, with 79% of respondents reporting that their organizations regularly use generative AI in at least one business function, compared to 71% in 2024.

This expanded adoption of AI was seen across all regions, though at different rates (Figure 4.3.2). China and Europe experienced the largest year-over-year increases, with reported organizational AI use growing 13 and 11 percentage points, respectively.

[Figure 4.3.1: Share of respondents who say their organization uses AI in at least one function, 2017–25. x-axis: year; y-axis: % of respondents. Labeled 2025 values: AI, 88%; GenAI, 79%. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report]

[Figure 4.3.2: AI use by organizations in the world, 2023–25 (% of respondents).
All geographies: 88% (2025), 78% (2024), 55% (2023)
Asia-Pacific: 82% (2025), 72% (2024), 58% (2023)
Europe: 91% (2025), 80% (2024), 57% (2023)
North America: 90% (2025), 82% (2024), 61% (2023)
Greater China (incl. Hong Kong, Taiwan, Macau): 88% (2025), 75% (2024), 48% (2023)
Developing markets (incl. India, Central/South America, MENA): 88% (2025), 77% (2024), 49% (2023)
Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report]

By Industry and Function

Adoption patterns varied across industry and function, with some industry/function pairings showing higher rates of diffusion than others (Figure 4.3.3). The highest reported AI usage was in knowledge management for business, legal, and professional services (58%) and in software engineering and IT in the technology sector (58% and 56%, respectively). This was closely followed by marketing and sales for consumer goods and retail (51%). More broadly, functions tied to information processing, software, customer engagement, and internal knowledge work reported higher adoption than areas such as strategy and corporate finance and risk and compliance, where uptake remains low across most sectors. Financial services were an exception; they reported high use in risk and compliance functions, which are more central to their core operations.

Figure 4.3.3 (footnote 1)

Footnote 1: “Advanced industries” comprises respondents from sectors such as advanced electronics, aerospace and defense, automotive and assembly, and semiconductors. “Energy and materials” encompasses respondents from agriculture, chemicals, electric power and natural gas, metals and mining, oil and gas, as well as paper, forest products, and packaging.

Respondents most often reported cost savings from AI in software engineering and manufacturing functions (56%), while revenue gains were most often cited in marketing and sales (67%), strategy and corporate finance (65%), and product and/or service development (62%) (Figure 4.3.4).
Across broader organizational outcomes, 64% of respondents reported that AI usage had improved innovation, and 45% reported improvements in employee and customer satisfaction (Figure 4.3.5). Often, the share of respondents who believed AI usage had improved a given organizational measure was similar to the share who believed it had no effect. Overall, negative effects were reported less frequently, with no more than 7% believing AI usage had worsened cost metrics.

[Figure 4.3.4: Cost decrease and revenue increase from analytical AI use by function, 2025. y-axis: function (marketing and sales; risk; human resources; product and/or service development; supply chain management; service operations; IT; software engineering; manufacturing; strategy and corporate finance; knowledge management); x-axis: % of respondents; cost categories: decrease by <10%, 10%–19%, ≥20%; revenue categories: increase by ≤5%, 6%–10%, >10%. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report]

[Figure 4.3.5 (footnote 2): AI impact on organizational measures over the past year, 2025. y-axis: organizational measure (innovation; employee satisfaction; customer satisfaction; competitive differentiation; cost; profitability; organic revenue growth; attraction and retention of talent; change in market share); x-axis: % of respondents; response categories: improved, had no effect, don’t know, worsened. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report]

Footnote 2: This includes only those respondents whose organizations regularly use AI in at least one business function. Figures may not add up to 100% because of rounding.

Deployment Stages

The McKinsey survey also captured how deeply AI had been integrated into an organization’s operations by looking at different stages of the deployment life cycle (Figure 4.3.6). As expected, given the resource and investment demands of integration, larger companies were the most likely to report that their AI programs had reached a scaling phase.

Early indicators on AI agent adoption show that diffusion is still at an early stage. Across most business functions, a majority of respondents reported no agent use at all (Figure 4.3.7). Scaled use was in the single digits for nearly all functions. Even in functions with the most activity, including IT and knowledge management, about two-thirds or more of respondents reported no use. At the industry level, the technology sector had comparatively higher rates of scaled agent use in software engineering (24%), IT (22%), and service operations (21%) (Figure 4.3.8). The business functions reporting the highest rates of AI agent use tend to be the same as those with broader, more established AI adoption.

[Figure 4.3.6 (footnote 3): Stage of AI deployment by organization revenue, 2025. y-axis: revenue in US dollars (<100M; 100M–499M; 500M–999M; 1B–4.9B; 5B+); x-axis: % of respondents; stages: not using, experimenting, piloting, scaling, fully scaled. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report]

Footnote 3: Figures may not add up to 100% because of rounding; respondents who said “I don’t know” were not shown but represent <1% of the total, which could also cause bars to not add up to 100%.
[Figure 4.3.7: Stage of AI agent use by business function, 2025. y-axis: function (IT; knowledge management; marketing and sales; service operations; product and/or service development; software engineering; risk; human resources; strategy and corporate finance; supply chain/inventory management; manufacturing); x-axis: % of respondents; stages: don’t know, not at all, planning to use within year, experimenting, piloting, scaling, fully scaled. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report]

Figure 4.3.8 (footnote 4)

Footnote 4: Figures may not add up to 100% because of rounding.

Highlight: Measuring Signals of AI Diffusion

The full economic impact of AI is difficult to assess through investment patterns or organizational adoption alone. The assessment also requires tracking diffusion, or how widely AI tools are being adopted across populations, countries, occupations, and everyday tasks. This section brings together several complementary signals of AI diffusion, including population-level survey estimates, cross-country comparisons, historical adoption benchmarks, and platform-level usage data. Taken together, these measures offer a broad view of how quickly AI is being integrated into work and daily life.

Compared with earlier transformative technologies, generative AI’s adoption has been rapid in the years after its mass-market introduction (Figure 4.3.9).
Measured from the release of each technology’s first widely available product, generative AI reached approximately 53% adoption within three years, well above the initial trajectories of the personal computer and the internet over comparable time frames (Bick et al., 2024). The sharp uptake is also reflected in the revenue trajectories of leading AI companies, as seen in the company-level revenue analysis in section 4.2, where commercial scale was reached in comparatively short time frames (Figure 4.3.9).

[Figure 4.3.9 (footnote 5): Speed of AI adoption by technology. x-axis: years since the introduction of the first mass-market product (0–25); y-axis: adoption rate. Labeled values: GenAI, 53%; Computer, 69%; Internet (source: CPS), 76%; Internet (source: ITU), 91%. Source: The Project on Workforce at Harvard, 2025 | Chart: 2026 AI Index report]

Footnote 5: Source: https://www.genaiadoptiontracker.com. The figure shows overall usage rates for three technologies: generative AI, computers, and the internet. The horizontal axis represents years since the introduction of the first mass-market product for each technology. We use 1981 as the introduction year for computers, which was the year the IBM PC was released. We use 1995 as the introduction year for the internet, which was the year that the NSF decommissioned NSFNet and allowed the internet to carry commercial traffic. We use 2022 as the introduction year for generative AI, which was the year ChatGPT was released. The data source for computers is the 1984–2003 Computer and Internet Use Supplement of the CPS. We plot two estimates of internet use: the 2001–2009 Computer and Internet Use Supplement of the CPS and the ITU. The sample for the RPS and CPS is all individuals ages 18–64. The sample for the ITU is individuals of all ages. We pool RPS waves by year.
Another broad signal comes from survey-based estimates of AI usage across countries (Figure 4.3.10). Adoption varies widely and shows a strong, statistically significant positive correlation with GDP per capita (Misra et al., 2025) (Figure 4.3.11). Most high-income economies cluster between 25% and 45% adoption, with European and North American averages reaching approximately 27% and 22%, respectively. Lower usage is reported in South Asia and sub-Saharan Africa, where GDP per capita is also lower. However, there are exceptions to the relationship between GDP and AI adoption. The United Arab Emirates and Singapore report adoption levels above 54% and 61%, respectively, well above what their GDP per capita would predict. Some wealthy economies, such as the United States and Denmark, fall below the trend.

[Figure 4.3.10: AI diffusion by geographic area, second half 2025. Map legend bins: 0%–10%, 11%–20%, 21%–30%, 31%–40%, 41%–50%, 51%–65%, no available data. Source: Microsoft AI Economy Institute, 2025 | Chart: 2026 AI Index report]

[Figure 4.3.11: AI diffusion relative to GDP per capita by geographic area, 2025. x-axis: GDP per capita (in US dollars, 0–100,000); y-axis: AI diffusion (0%–70%); regions: East Asia and Pacific, Europe and Central Asia, Latin America and Caribbean, Middle East and North Africa, North America, South Asia, Sub-Saharan Africa. Labeled countries: India, China, Canada, Bulgaria, Oman, Hungary, Portugal, Slovenia, Italy, Spain, Finland, Germany, Austria, Sweden, United Kingdom, New Zealand, France, Denmark, Australia, Netherlands, United Arab Emirates, United States, Qatar, Norway, Singapore, Switzerland, Ireland. Source: Microsoft AI Economy Institute, 2025; World Bank, 2025 | Chart: 2026 AI Index report]

Between the first and second half of 2025, AI adoption grew across the majority of the
top 30 economies (Figure 4.3.12). South Korea posted the largest gain, 4.8 percentage points, climbing the rankings from 25th to 18th. The United States, despite its leading position in AI investment and model development, dropped to 24th place with a population-level adoption rate of 28.3%. Even as usage grows, the United States remains in the lower half of the global adoption ranking, in line with the more cautious public mood toward AI explored in Chapter 9.

[Figure 4.3.12: AI diffusion by top 30 geographic areas, first vs. second half 2025. Values are first half 2025, second half 2025, and the labeled change in percentage points, listed in first-half rank order.
United Arab Emirates: 59.40%, 64.00% (+4.60)
Singapore: 58.60%, 60.90% (+2.30)
Norway: 45.30%, 46.40% (+1.10)
Ireland: 41.70%, 44.60% (+2.90)
France: 40.90%, 44.00% (+3.10)
Spain: 39.70%, 41.80% (+2.10)
New Zealand: 37.60%, 40.50% (+2.90)
United Kingdom: 36.40%, 38.90% (+2.50)
Netherlands: 36.30%, 38.90% (+2.60)
Qatar: 35.70%, 38.30% (+2.60)
Australia: 34.50%, 36.90% (+2.40)
Israel: 33.90%, 36.10% (+2.20)
Belgium: 33.50%, 36.00% (+2.50)
Canada: 33.50%, 35.00% (+1.50)
Switzerland: 32.40%, 34.80% (+2.50)
Sweden: 31.20%, 33.30% (+2.20)
Austria: 29.10%, 31.40% (+2.20)
Hungary: 27.90%, 29.80% (+1.90)
Denmark: 26.60%, 28.70% (+2.10)
Germany: 26.50%, 28.60% (+2.10)
Poland: 26.40%, 28.50% (+2.10)
Taiwan: 26.40%, 28.40% (+2.00)
United States: 26.30%, 28.30% (+2.10)
Czech Republic: 26.00%, 27.80% (+1.80)
South Korea: 25.90%, 30.70% (+4.80)
Italy: 25.80%, 27.80% (+2.00)
Finland: 25.60%, 27.30% (+1.70)
Bulgaria: 25.40%, 27.30% (+1.90)
Jordan: 25.40%, 27.00% (+1.60)
Costa Rica: 25.10%, 26.50% (+1.40)
Source: Microsoft AI Economy Institute, 2025 | Chart: 2026 AI Index report]

On a more granular level, platform-level data from Anthropic’s AI Usage Index (footnote 6) provides a view of adoption across occupations and tasks (Massenkof et al., 2026). Throughout 2025, computer and mathematical tasks accounted for the largest share of overall usage, consistently representing close to 40% of activity (Figure 4.3.13). Educational instruction and library tasks showed the most significant growth, rising from 9% early in the year to approximately 14% by late 2025. This growth in educational settings is worth noting alongside Chapter 7, which explores how institutional guidance and readiness still lag behind adoption.

In addition, an analysis of conversation patterns reveals a shift in how users interact with the tools (Figure 4.3.14). The share of automation-oriented conversations, where users instruct the tool to complete a task autonomously, rose from 41% at the start of 2025 to 49% in August, surpassing augmentation-style interactions for the first time. However, by November, augmentation had moved ahead again, representing 52% of conversations. The fluctuation over the course of the year suggests that automation-oriented use is becoming more prevalent, which is consistent with the early-stage AI agent adoption patterns for organizations described in the previous section (Figure 4.3.8).

Figure 4.3.13 (footnote 7)

Footnote 6: The Anthropic AI Usage Index (AUI) measures Claude usage relative to the working-age population by calculating each geography’s share of Claude usage divided by its share of the working-age population (ages 15–64). Countries with an AUI greater than 1 use Claude more often than expected based on their working-age population alone, while those with an AUI less than 1 use it less.
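The index definition in footnote 6 reduces to a ratio of two shares. The sketch below illustrates that ratio with entirely hypothetical inputs: the geography names, usage counts, and population figures are invented for the example, and the function name `usage_index` is not taken from Anthropic’s materials.

```python
def usage_index(usage, population):
    """Ratio of each geography's usage share to its working-age population share."""
    total_usage = sum(usage.values())
    total_population = sum(population.values())
    return {
        geo: (usage[geo] / total_usage) / (population[geo] / total_population)
        for geo in usage
    }

# Hypothetical inputs (not real data): usage counts and working-age
# population in millions for three made-up geographies.
usage = {"A": 600, "B": 300, "C": 100}
population = {"A": 40, "B": 40, "C": 20}

aui = usage_index(usage, population)
for geo, value in aui.items():
    label = "above" if value > 1 else "below"
    print(f"{geo}: {value:.2f} ({label} population-based expectation)")
```

In this toy example, geography A holds 60% of usage but 40% of the working-age population, giving an index of 1.5, while B and C come in at 0.75 and 0.5. An index of exactly 1 would mean usage is proportional to population.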
Footnote 7: V1–V4 refer to the four successive releases of the Anthropic Economic Index in 2025, corresponding to January, March, August, and November 2025, respectively.

[Figure: Task usage share by occupation group, V1–V4 2025. One panel per occupation group; x-axis: Jan-25, Mar-25, Aug-25, Nov-25; y-axis: usage index. Panels: Computer and Mathematical; Arts, Design, Entertainment, Sports, and Media; Educational Instruction and Library; Office and Administrative Support; Life, Physical, and Social Sciences; Business and Financial Operations; Architecture and Engineering; Management. Source: Anthropic Economic Index, 2026 | Chart: 2026 AI Index report]

[Figure 4.3.14: Claude.ai collaboration mode share, 2025. x-axis: Jan-2025, Mar-2025, Aug-2025, Nov-2025; y-axis: collaboration mode share. Automation: 41%, 42%, 49%, 45%. Augmentation: 55%, 55%, 47%, 52%. Source: Anthropic Economic Index, 2026 | Chart: 2026 AI Index report]

AI diffusion is also shaped by broader societal attitudes, including public trust and optimism about the technology. Chapter 9 studies these trends to determine how excitement for and exposure to AI vary across countries and what they suggest about the societal experience of increasing adoption.

4.4 Jobs

Labor markets provide signals of how investment, technical progress, and organizational adoption are changing workforce dynamics. This section tracks both the demand side of the labor market, through job postings and skill requirements, and the supply side, through talent flows, before examining the impact on employment outcomes and employee sentiment. The analysis draws from Lightcast’s job posting database, LinkedIn’s talent and hiring metrics, and recent research on AI’s effects on the labor market.

AI Labor Demand Across Geographies

Across the countries tracked by Lightcast, demand for AI-related talent continued to increase in 2025 (footnote 8), as job listings that require AI skills made up a growing share of overall postings (Figures 4.4.1 and 4.4.2). While most countries are hitting new peaks of demand, the intensity varies. In 2025, Singapore led, with 4.69% of all job postings requiring AI skills, followed by Hong Kong (3.5%), Luxembourg (3.4%), and Spain (3.3%). The United States reached 2.6%, followed by Chile (2.4%) and the United Kingdom (1.9%).

Footnote 8: Historical posting counts may differ from previously published versions due to data updates and revisions. However, year-over-year trends remain consistent with earlier analyses. See here for more details.
[Figure 4.4.1: AI job postings (% of all job postings) by select geographic areas, 2014–25 (part 1). x-axis: year; y-axis: AI job postings (% of all job postings). Labeled 2025 values: Singapore, 4.69%; Hong Kong, 3.48%; Luxembourg, 3.43%; Spain, 3.31%; Canada, 3.00%; Poland, 2.92%; United Arab Emirates, 2.87%; Sweden, 2.77%; United States, 2.56%; Chile, 2.41%; United Kingdom, 1.93%. Source: Lightcast, 2025 | Chart: 2026 AI Index report]

[Figure 4.4.2: AI job postings (% of all job postings) by select geographic areas, 2014–25 (part 2). x-axis: year; y-axis: AI job postings (% of all job postings). Labeled 2025 values: Australia, 1.78%; Switzerland, 1.59%; Mexico, 1.38%; Belgium, 1.36%; Italy, 1.33%; Germany, 1.13%; Netherlands, 1.04%; France, 0.99%; Austria, 0.84%; New Zealand, 0.81%; Croatia, 0.38%. Source: Lightcast, 2025 | Chart: 2026 AI Index report]

Within the United States, AI labor demand can be disaggregated by skill set to reveal how the workforce footprint is evolving (Figures 4.4.3–4.4.8). In 2025, the broad artificial intelligence and machine learning skill clusters remained the most frequently cited categories in AI job postings, accounting for 1.7% and 1.0% of all job postings, respectively. Among the top specialized skills, Python appeared most often, in 258,674 postings, a 391% increase compared to the 2013–15 period and a nearly 30% increase from 2024. The fastest growth appears in skills needed to build and operate systems at scale, with employer demand mirroring the broader investment shift toward AI infrastructure and deployment capacity. Mentions of Amazon Web Services grew sharply compared to a decade ago (+1,358%), alongside an increasing emphasis on scalability (+733%) and workflow management (+818%). Mentions of generative AI skills in AI job postings grew 111% from 2024 to 2025, though their share of total AI job postings decreased by 5%.
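A count can more than double while its share falls, provided the base it is measured against grows even faster. The sketch below works through this using the values labeled for the generative artificial intelligence skill in Figures 4.4.5 and 4.4.6. The "implied base" is simply count divided by share; what that base represents is not stated in this section, so treat it as an illustrative back-calculation rather than a published figure.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100

# Labeled values for the "generative artificial intelligence" skill.
count_2024, count_2025 = 65_557, 138_188   # number of AI job postings
share_2024, share_2025 = 0.6608, 0.6262    # skill share (66.08%, 62.62%)

count_growth = pct_change(count_2024, count_2025)   # roughly +111%
share_change = pct_change(share_2024, share_2025)   # roughly -5%

# Back out the base each share was computed against (assumption: share = count / base).
base_2024 = count_2024 / share_2024
base_2025 = count_2025 / share_2025
base_growth = pct_change(base_2024, base_2025)

print(f"count: {count_growth:+.0f}%, share: {share_change:+.0f}%, implied base: {base_growth:+.0f}%")
```

The implied base grows faster than the count itself, which is what allows the share to slip even as mentions rise sharply.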
With overall AI labor demand rising, a newer skill cluster tied to AI agents emerged. From 2024 to 2025, postings referencing agentic AI, AI agents, or agentic systems increased sharply. The share of AI job postings that mentioned ChatGPT, chatbot, or conversational AI declined, while postings referencing agentic terms or orchestration frameworks such as LangGraph increased. Job demand appears to be shifting from general familiarity with chat-based tools toward the skills required to coordinate and operationalize task-oriented systems.

AI Hiring in the United States

Skill Composition

[Figure 4.4.3 (footnote 9): AI job postings (% of all job postings) in the United States by skill cluster, 2010–25. x-axis: year; y-axis: AI job postings (% of all job postings). Labeled 2025 values: Artificial intelligence, 1.70%; Machine learning, 0.99%; Generative AI, 0.41%; AI agent, 0.23%; Natural language processing, 0.22%; Neural networks, 0.20%; Autonomous driving, 0.14%; Visual image recognition, 0.09%; Robotics, 0.08%; AI ethics, governance, and regulations, 0.05%. Source: Lightcast, 2025 | Chart: 2026 AI Index report]

Footnote 9: A single job posting can list multiple AI skills.

[Figure 4.4.4: Top 10 specialized skills in 2025 AI job postings in the United States, 2013–15 vs. 2025 (number of AI job postings).
Python (programming language): 52,678 (2013–15); 258,674 (2025, +391%)
Computer science: 97,085 (2013–15); 257,127 (2025, +165%)
Scalability: 23,727 (2013–15); 197,744 (2025, +733%)
Automation: 26,870 (2013–15); 190,758 (2025, +610%)
Workflow management: 20,295 (2013–15); 186,325 (2025, +818%)
Data analysis: 54,942 (2013–15); 170,396 (2025, +210%)
SQL (programming language): 65,300 (2013–15); 151,191 (2025, +132%)
Project management: 60,745 (2013–15); 149,865 (2025, +147%)
Data science: 26,767 (2013–15); 142,120 (2025, +431%)
Amazon Web Services: 9,742 (2013–15); 142,037 (2025, +1,358%)
Source: Lightcast, 2025 | Chart: 2026 AI Index report]

[Figure 4.4.5: Generative AI skills in AI job postings in the United States, 2024 vs. 2025. x-axis: number of AI job postings. Skills shown: generative artificial intelligence; large language modeling; prompt engineering; retrieval augmented generation; text to speech (TTS); generative adversarial networks; multimodal models; variational autoencoders; Stable Diffusion; context engineering. The top row, generative artificial intelligence, is labeled 65,557 (2024) and 138,188 (2025, +111%). Source: Lightcast, 2025 | Chart: 2026 AI Index report]

[Figure 4.4.6: Share of generative AI skills in AI job postings in the United States, 2024 vs. 2025. x-axis: skill share in AI job postings (%). Same skills as Figure 4.4.5. The top row, generative artificial intelligence, is labeled 66.08% (2024) and 62.62% (2025, -5%). Source: Lightcast, 2025 | Chart: 2026 AI Index report]

Figure 4.4.7 (footnote 10)

Footnote 10: For definitions of each skill category, see the Lightcast taxonomy at https://lightcast.io/open-skills or the Appendix.
[Figure: AI agent skills in AI job postings in the United States, 2025. x-axis: number of AI job postings; series: 2024 and 2025, with percentage change labeled. Skills shown: agentic AI; AI agents; ChatGPT; conversational AI; Microsoft Copilot; multi-agent systems; chatbot; LangGraph; Microsoft Copilot Studio; agentic systems. Source: Lightcast, 2025 | Chart: 2026 AI Index report]

[Figure 4.4.8: Share of AI agent skills in AI job postings in the United States, 2024 vs. 2025. x-axis: skill share in AI job postings (%); series: 2024 and 2025, with percentage change labeled. Same skills as the preceding figure. Source: Lightcast, 2025 | Chart: 2026 AI Index report]

Demand for AI talent increased across all economic sectors in 2025, though the pace of growth varied (Figure 4.4.9). The information sector leads, with AI skills appearing in 13.2% of its job postings, up from 7.8% in 2024. Other sectors with relatively high AI posting shares include professional, scientific, and technical services (6.5%), finance and insurance (5.3%), and manufacturing (4.7%). In 2025, AI hiring also expanded in sectors with historically low adoption rates. Transportation and warehousing, real estate, and education showed year-over-year increases, evidence that diffusion is reaching beyond traditional technology-driven industries.
By Sector 4.4 J O B S | E CO N O M Y | A I I N D E X R E P O R T 2026 0.35% 0.81% 0.99% 1.04% 1.19% 1.74% 1.06% 1.08% 2.03% 2.00% 1.62% 3.35% 3.27% 4.41% 7.82% 0.46% (+30.80%) 1.26% (+55.40%) 1.32% (+33.04%) 1.67% (+60.25%) 1.69% (+41.58%) 1.87% (+7.64%) 1.93% (+83.00%) 2.08% (+93.48%) 2.42% (+19.14%) 2.89% (+44.35%) 3.28% (+102.28%) 4.66% (+39.27%) 5.33% (+62.94%) 6.49% (+47.10%) 13.22% (+69.12%) 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% Waste management and administrative support services Transportation and warehousing Agriculture, forestry, Ǉshing and hunting Retail trade Public administration Mining, quarrying, and oil and gas extraction Wholesale trade Real estate and rental and leasing Educational services Utilities Management of companies and enterprises Manufacturing Finance and insurance Professional, scientiǇc, and technical services Information 2025 2024 AI job postings (% of all job postings) AI job postings (% of all job postings) in the United States by sector, 2024 vs. 2025 Source: Lightcast, 2025 | Chart: 2026 AI Index report Figure 4.4.911 11 The sector classifications in Figure 4.2.9 are based on two-digit NAICS codes. For more information on the Bureau of Labor Statistics’ supersector and NAICS classifications, see the following reference. 210 4.4 J O B S | E CO N O M Y | A I I N D E X R E P O R T 2026 Within the United States, AI labor demand remains highly concentrated within certain states (Figures 4.4.10 and 4.4.11). California leads with 170,881 postings and accounts for a disproportionate share of total 2025 U.S. AI job postings (17.2%). Texas follows with 80,547 postings (8.1% of total) and New York with 66,029 (6.6%). These three states represent approximately a third of all AI job postings nationally. This mirrors the state-level investment data described earlier, where AI funding was also concentrated in a small number of states, led by California. 
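The "approximately a third" figure for the three leading states can be checked directly from the state-level values. The Python sketch below is only an illustration: it sums the three printed shares and, as a rough consistency check, backs out the national total implied by each state's count and share. The dictionary layout and variable names are assumptions made for this example, and the implied totals are approximate because the printed shares are rounded.

```python
# (postings, share of US AI job postings in %) as printed in
# Figures 4.4.10 and 4.4.11 of the report.
states = {
    "California": (170_881, 17.18),
    "Texas": (80_547, 8.10),
    "New York": (66_029, 6.64),
}

combined_postings = sum(count for count, _ in states.values())
combined_share = sum(share for _, share in states.values())

# count / share gives the national total implied by each state's row.
# Rounded shares mean the three estimates agree only approximately.
implied_totals = {
    name: count / (share / 100.0) for name, (count, share) in states.items()
}

print(f"combined postings: {combined_postings:,}")
print(f"combined share:    {combined_share:.2f}%")
for name, total in implied_totals.items():
    print(f"implied US total from {name}: {total:,.0f}")
```

The combined share comes out just under one-third, and the three implied national totals land within a fraction of a percent of one another, which suggests the printed counts and shares are mutually consistent.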
Over time, however, California’s share of the national total has declined, from over 25% in 2012 to 17.2% in 2025, even as the state continues to lead in AI investment concentration (Figure 4.4.12). Looking instead at the density of AI labor demand within each state, some smaller labor markets show above-average AI penetration relative to their total job market (Figure 4.4.13). For example, despite its smaller total numbers, Washington, D.C., has a comparatively high 6.2% of its job postings in AI, followed by Delaware at 4.4%. From 2024 to 2025, California, Washington state, New York, and Texas all continued to see growth in AI job postings within their labor markets (Figure 4.4.14).

By State

Figure 4.4.10: Number of AI job postings in the United States by state, 2025
Source: Lightcast, 2025 | Chart: 2026 AI Index report
AL 8,893; AK 2,664; AZ 18,929; AR 8,182; CA 170,881; CO 21,690; CT 10,503; DE 6,801; FL 39,257; GA 30,415; HI 3,571; ID 7,013; IL 44,883; IN 11,521; IA 6,658; KS 9,268; KY 6,649; LA 6,390; ME 5,127; MD 19,259; MA 39,272; MI 18,832; MN 17,640; MS 4,396; MO 12,712; MT 3,410; NE 5,539; NV 6,700; NH 4,508; NJ 28,693; NM 5,847; NY 66,029; NC 29,052; ND 3,212; OH 24,547; OK 6,246; OR 12,482; PA 28,284; RI 5,446; SC 8,905; SD 3,477; TN 13,766; TX 80,547; UT 10,034; VT 4,452; VA 41,950; WA 39,119; DC 13,587; WV 3,307; WI 11,200; WY 2,627

Figure 4.4.11: Percentage of US AI job postings by state, 2025
Source: Lightcast, 2025 | Chart: 2026 AI Index report
AL 0.89%; AK 0.27%; AZ 1.90%; AR 0.82%; CA 17.18%; CO 2.18%; CT 1.06%; DE 0.68%; FL 3.95%; GA 3.06%; HI 0.36%; ID 0.71%; IL 4.51%; IN 1.16%; IA 0.67%; KS 0.93%; KY 0.67%; LA 0.64%; ME 0.52%; MD 1.94%; MA 3.95%; MI 1.89%; MN 1.77%; MS 0.44%; MO 1.28%; MT 0.34%; NE 0.56%; NV 0.67%; NH 0.45%; NJ 2.89%; NM 0.59%; NY 6.64%; NC 2.92%; ND 0.32%; OH 2.47%; OK 0.63%; OR 1.26%; PA 2.84%; RI 0.55%; SC 0.90%; SD 0.35%; TN 1.38%; TX 8.10%; UT 1.01%; VT 0.45%; VA 4.22%; WA 3.93%; DC 1.37%; WV 0.33%; WI 1.13%; WY 0.26%

Figure 4.4.12: Percentage of US AI job postings by select US state, 2010–25
Source: Lightcast, 2025 | Chart: 2026 AI Index report
Line chart; 2025 values: California 17.18%, Texas 8.10%, New York 6.64%, Washington 3.93%

Figure 4.4.13: Percentage of US states’ job postings in AI, 2025
Source: Lightcast, 2025 | Chart: 2026 AI Index report
AL 1.67%; AK 2.20%; AZ 2.08%; AR 2.52%; CA 4.26%; CO 2.34%; CT 1.97%; DE 4.43%; FL 1.58%; GA 2.44%; HI 1.94%; ID 2.86%; IL 2.85%; IN 1.42%; IA 1.48%; KS 2.10%; KY 1.37%; LA 1.43%; ME 2.51%; MD 2.68%; MA 3.41%; MI 1.80%; MN 2.18%; MS 1.54%; MO 1.58%; MT 1.94%; NE 1.96%; NV 1.87%; NH 1.52%; NJ 2.63%; NM 1.89%; NY 3.21%; NC 2.07%; ND 2.00%; OH 1.66%; OK 1.39%; OR 2.08%; PA 1.77%; RI 3.11%; SC 1.38%; SD 2.18%; TN 1.61%; TX 2.38%; UT 2.53%; VT 3.20%; VA 3.32%; WA 4.03%; DC 6.18%; WV 1.72%; WI 1.32%; WY 2.83%

Figure 4.4.14: Percentage of US states’ job postings in AI by select US state, 2010–25
Source: Lightcast, 2025 | Chart: 2026 AI Index report
Line chart; 2025 values: California 4.26%, Washington 4.03%, New York 3.21%, Texas 2.38%

AI Hiring

LinkedIn’s hiring and talent data shows how AI labor demand is changing the workforce in practice. In most countries, AI hiring rates outpaced overall hiring growth in 2025 (Figures 4.4.15 and 4.4.16). Indonesia recorded the highest relative AI hiring growth at 31.7%, followed by Croatia (27.8%) and Belgium (21.5%). Since 2018, many countries have shown a sustained pattern of AI hiring rates that exceed general labor market growth. There are a few exceptions, such as Iceland and Sweden, where AI hiring growth lagged behind the broader market.

Figure 4.4.15: AI vs. overall hiring rate growth by geographic area, 2025 (relative AI hiring rate year-over-year ratio; see footnote 12)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Indonesia 31.74%; Croatia 27.80%; Belgium 21.49%; Costa Rica 17.26%; Cyprus 17.13%; Czech Republic 16.50%; Luxembourg 15.62%; Greece 14.69%; New Zealand 14.58%; Poland 12.32%; Turkey 11.82%; Netherlands 11.40%; Austria 11.29%; Canada 10.18%; Romania 9.96%

12 For the sake of brevity, the visualization includes only the top 15 countries for this metric.

Figure 4.4.16: AI vs. overall hiring rate growth by geographic area, 2018–25 (relative AI hiring rate year-over-year ratio; one panel per country, labeled values shown)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Argentina 1.91%; Australia 0.37%; Austria 11.29%; Belgium 21.49%; Brazil 7.96%; Canada 10.18%; Chile 7.53%; Costa Rica 17.26%; Croatia 27.80%; Cyprus 17.13%; Czech Republic 16.50%; Denmark 3.76%; Estonia -20.26%; Finland 7.47%; France 9.06%; Germany 7.83%; Greece 14.69%; Hong Kong 9.89%; Iceland -10.90%; India 4.43%; Indonesia 31.74%; Ireland 2.75%; Israel 6.38%; Italy 9.00%; Lithuania -1.33%; Luxembourg 15.62%; Mexico 9.02%; Netherlands 11.40%; New Zealand 14.58%; Norway 5.66%; Poland 12.32%; Portugal 2.11%; Romania 9.96%; Saudi Arabia 5.53%; Singapore 7.76%; Slovenia -5.33%; South Africa 9.64%; South Korea 9.93%; Spain 4.69%; Sweden -11.56%; Switzerland 1.61%; Turkey 11.82%; United Arab Emirates 8.63%; United Kingdom 6.90%; United States 8.97%; Uruguay -0.37%

AI Talent Concentration

Talent data helps show where AI capabilities are accumulating and how the AI workforce is distributed globally. This section reviews LinkedIn’s measures of the concentration of AI talent within countries and the movement of that talent across borders. In 2025, Israel had the highest concentration of AI talent among LinkedIn members (2.1%), followed by Singapore (1.8%) and Luxembourg (1.6%) (Figures 4.4.21 and 4.4.22). However, the United Arab Emirates, India, and Saudi Arabia showed the fastest growth in their share of AI talent, each increasing over 100% between 2019 and 2025. Over the same period, each of the 15 fastest-growing countries shown increased its AI talent concentration by at least 75%.

Figure 4.4.21: AI talent concentration by geographic area, 2025 (see footnote 13)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Israel 2.10%; Singapore 1.82%; Luxembourg 1.60%; Ireland 1.31%; Switzerland 1.25%; Finland 1.23%; Estonia 1.15%; Germany 1.15%; Lithuania 1.10%; Netherlands 1.05%; South Korea 1.05%; India 1.01%; Canada 1.01%; Poland 1.00%; Cyprus 0.95%

Figure 4.4.22: Percentage change in AI talent concentration by geographic area, 2019 vs. 2025 (see footnote 14)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
United Arab Emirates 121%; India 120%; Saudi Arabia 113%; Cyprus 112%; Portugal 111%; Indonesia 108%; Brazil 107%; Iceland 93%; Uruguay 90%; Denmark 89%; Costa Rica 89%; Chile 82%; Spain 77%; Argentina 76%; Turkey 75%

13 For the sake of brevity, the visualization includes only the top 15 countries for this metric.
14 For the sake of brevity, the visualization includes only the top 15 countries for this metric.

Migration patterns show the dynamic global redistribution of AI talent (Figures 4.4.23 and 4.4.24). In 2025, Luxembourg recorded the highest net inflow among tracked countries, at 5.23 per 10,000 LinkedIn members, as smaller economies actively try to attract more AI workers. The United States is a net importer of AI talent at 1.2 per 10,000 LinkedIn members.

Figure 4.4.23: Net AI talent migration per 10,000 LinkedIn members by geographic area, 2025
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Luxembourg 5.23; United Arab Emirates 4.40; Australia 1.79; Saudi Arabia 1.77; Switzerland 1.72; Singapore 1.36; Canada 1.23; United States 1.22; Hong Kong 1.14; United Kingdom 1.04; Austria 0.90; Cyprus 0.62; Denmark 0.45; Spain 0.23; Germany 0.17

Similar to skills penetration, gender representation within AI talent continues to be uneven.
Across the countries measured, men still account for the majority of AI talent, typically between 65% and 75% (Figures 4.4.25 and 4.4.26). Gender ratios have stayed largely flat since 2016, despite an expanding AI workforce. In the United States, women make up 34.3% of AI talent and men 65.7%. Other major labor markets, including the United Kingdom, Canada, France, and Singapore, show similar disparities.

15 (Figure 4.4.23) For the sake of brevity, the visualization includes only the top 15 countries for this metric.

Figure 4.4.24: Net AI talent migration per 10,000 LinkedIn members by geographic area, 2021–25 (one panel per country, labeled values shown; see footnote 16)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Argentina 0.01; Australia* 1.77; Austria* 1.50; Belgium 0.25; Brazil -0.05; Bulgaria* -1.04; Canada* 0.07; Chile -0.07; Costa Rica -0.02; Croatia -0.07; Cyprus* 2.93; Czech Republic 0.46; Denmark 0.83; Estonia* -0.51; Finland 0.66; France 0.16; Germany* 1.09; Greece -0.37; Hong Kong* 1.76; Hungary* -2.74; Iceland 0.75; India* -1.28; Indonesia -0.09; Ireland* 2.48; Israel* -1.01; Italy -0.21; Latvia 0.11; Lithuania* 0.74; Luxembourg* 7.76; Mexico -0.07; Netherlands 0.24; New Zealand -0.06; Norway 0.28; Poland 0.49; Portugal 0.20; Romania -0.06; Saudi Arabia* 2.02; Singapore* 0.90; Slovakia* -1.78; Slovenia 0.09; South Africa -0.02; South Korea -0.35; Spain 0.54; Sweden 0.08; Switzerland* 2.45; Turkey -0.30; United Arab Emirates* 6.52; United Kingdom* 0.62; United States 0.92; Uruguay 0.16

16 Asterisks indicate that a country’s y-axis label is scaled differently than the y-axis label for the other countries.

Figure 4.4.25: AI talent representation by gender and geographic area, 2016–25 (one panel per country, labeled values shown as female / male)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Argentina 29.46% / 70.54%; Australia 32.26% / 67.74%; Austria 26.35% / 73.65%; Belgium 26.07% / 73.92%; Brazil 23.41% / 76.59%; Canada 32.72% / 67.28%; Chile 23.73% / 76.27%; Costa Rica 32.66% / 67.34%; Croatia 36.55% / 63.45%; Cyprus 28.76% / 71.24%; Czech Republic 28.15% / 71.85%; Denmark 26.97% / 73.03%; Estonia 35.76% / 64.24%; Finland 37.47% / 62.53%; France 31.02% / 68.98%; Germany 25.08% / 74.92%; Hong Kong 33.95% / 66.05%; India 30.64% / 69.36%; Ireland 32.12% / 67.88%; Israel 26.41% / 73.59%; Italy 35.48% / 64.52%; Lithuania 33.74% / 66.26%; Luxembourg 29.86% / 70.14%; Mexico 26.88% / 73.12%; Netherlands 29.06% / 70.94%; New Zealand 35.06% / 64.94%; Norway 28.42% / 71.58%; Poland 34.00% / 66.00%; Portugal 30.40% / 69.60%; Romania 42.04% / 57.96%; Saudi Arabia 33.22% / 66.78%; Singapore 38.01% / 61.99%; South Africa 37.31% / 62.69%; Spain 29.06% / 70.94%; Sweden 29.62% / 70.38%; Switzerland 26.49% / 73.51%; Turkey 29.20% / 70.80%; United Arab Emirates 30.09% / 69.91%; United Kingdom 30.25% / 69.75%; United States 34.26% / 65.74%; Uruguay 29.10% / 70.90%

Figure 4.4.26: AI talent concentration by gender and geographic area, 2016–25 (one panel per country, labeled values shown as female / male)
Source: LinkedIn, 2025 | Chart: 2026 AI Index report
Argentina 0.17% / 0.40%; Australia 0.60% / 1.10%; Austria 0.53% / 1.10%; Belgium 0.42% / 0.94%; Brazil 0.13% / 0.41%; Canada 0.68% / 1.25%; Chile 0.17% / 0.50%; Costa Rica 0.50% / 0.91%; Croatia 0.49% / 0.83%; Cyprus 0.67% / 1.36%; Czech Republic 0.51% / 1.16%; Denmark 0.54% / 1.25%; Estonia 0.86% / 1.60%; Finland 0.86% / 1.82%; France 0.52% / 1.04%; Germany 0.71% / 1.46%; Hong Kong 0.61% / 1.14%; India 1.09% / 1.04%; Ireland 0.88% / 1.64%; Israel 1.78% / 3.15%; Italy 0.31% / 0.57%; Lithuania 0.80% / 1.56%; Luxembourg 1.12% / 1.99%; Mexico 0.19% / 0.42%; Netherlands 0.72% / 1.36%; New Zealand 0.52% / 0.88%; Norway 0.47% / 0.88%; Poland 0.69% / 1.34%; Portugal 0.39% / 0.84%; Romania 0.53% / 0.81%; Saudi Arabia 0.76% / 0.48%; Singapore 1.55% / 2.13%; South Africa 0.27% / 0.44%; Spain 0.40% / 0.89%; Sweden 0.61% / 1.23%; Switzerland 0.78% / 1.65%; Turkey 0.33% / 0.40%; United Arab Emirates 0.68% / 0.72%; United Kingdom 0.63% / 1.18%; United States 0.59% / 1.12%; Uruguay 0.16% / 0.41%

AI’s Labor Impact

Productivity Trends

Shifts in skill demand and talent flows are reshaping how work is organized by driving changes in productivity and hiring. A growing body of academic research has begun to measure AI’s impact both at the micro level, examining how individual workers perform their jobs using AI tools, and at the macro level, examining how AI adoption affects aggregate productivity and employment figures. In some settings, AI improves productivity, particularly for tasks that are structured, language heavy, or supported by clear feedback loops. In others, gains are marginal or even negative when tools are poorly matched to the task. Early macro-level evidence indicates that productivity gains may take longer to materialize and that the labor market costs may fall disproportionately on junior and entry-level workers. Findings from several notable studies are summarized in the tables below; as with any emerging research area, results vary in methodology, scope, and context.

Micro-level Studies

A growing number of studies have looked at how AI tools affect individual worker productivity across occupations (Figure 4.4.27). The results have been generally positive, but the size and distribution of the gains vary. The clearest impact is on support work, software development, and marketing. Customer support agents using a conversational AI assistant resolved 14%–15% more issues per hour (Brynjolfsson et al., 2025), software developers using GitHub Copilot completed 26% more pull requests (Cui et al., 2025), and marketing teams using multimodal AI for ad creation saw a 50% increase in output per worker (Ju and Aral, 2025). One consistent finding across several of these studies is that less experienced workers tended to benefit the most, suggesting that AI tools may help close existing skill gaps.

Results are not uniformly positive: for work that requires deeper reasoning or judgment, some studies have found that AI tools produced little benefit or even slowed workers down. The most widely cited example comes from Model Evaluation & Threat Research (METR), which found that experienced open-source developers became 19% slower when using AI assistance, with a disconnect between how helpful the developers thought the tools were and how they actually performed (Becker et al., 2025). However, the METR team was unable to replicate the result in a later study, primarily because developers had grown reluctant to work without AI; the team also noted that developers in late 2025 were likely sped up by AI relative to the original study period. Research on the longer-term effects on skill development shows mixed results. Software engineers who relied heavily on AI for learning showed no measurable speed improvement and faced what researchers call “learning penalties” (Shen and Tamkin, 2025). Overall, AI’s productivity effects are highly context dependent. The gains are strongest when work can be divided into well-defined, repeatable tasks with clear quality monitoring.

Figure 4.4.27
Study | Occupation | AI application | Change in productivity | Who benefited most?
Reimers and Waldfogel (2026) | Authors | LLMs for content | +200% (Output volume; releases tripled) | New entrants (drove quantity); pre-AI authors (maintained quality)
Shen and Tamkin (2025) | Software engineers | Learning new libraries | 0% (Statistically insignificant speed change) | High scorers (65%+) who used AI for conceptual inquiry, avoiding “learning penalties”
Becker et al. (2025) | Developers | Open-source tools | -19% (Speed; developers became slower with AI) | None (significant gap between perceived help and actual performance)
Brynjolfsson et al. (2025) | Support agents | Conversational assistant | +14%–15% (Efficiency; issues resolved per hour) | Less experienced/skilled agents (30%–35% gains)
Cui et al. (2025) | Software developers | GitHub Copilot | +26% (Output; completed pull requests) | Junior and less-experienced workers
Ju & Aral (2025) | Marketing teams | Multimodal ad creation | +50% (Output; productivity per worker) | Human-AI teams (shifted focus from social coordination to task execution)
Choi & Xie (2025) | Accountants | AI-based accounting | +55% (Throughput; weekly client support) | Experienced accountants (used AI confidence scores to target oversight)

Macro-level Studies

At the macro level, the evidence is earlier and less conclusive, but there are indicators that AI is starting to register in aggregate productivity data (Figure 4.4.28). A study of 12,000 European firms found that AI adoption boosted labor productivity by 4%, with training strengthening the outcome (Aldasoro et al., 2026). In the United States, productivity growth reached 2.7% in 2025, nearly double the 1.4% average of the previous decade. Brynjolfsson (2026) suggests this may reflect the early stages of a “J-curve,” where organizations absorb the costs of adopting AI before the larger productivity gains show up in the numbers. Similarly, OECD projections for G7 economies estimate annual productivity gains of 0.2 to 1.3 percentage points over the next decade (Filippucci et al., 2025).

As mentioned earlier, the evidence is neither conclusive nor uniformly positive.
A survey of 6,000 executives across four countries found widespread adoption but minimal realized productivity gains and a projected 0.7% reduction in employment over the next three years (Yotzov et al., 2026). The gap between adoption and measurable impact may exist because AI is still early in its organizational integration, as illustrated earlier in this chapter through the deployment stage data. The pace at which these returns materialize will continue to be an important indicator to track.

Figure 4.4.28
Study | Scope | Productivity / employment impact | Insight
Aldasoro et al. (2026) | 12,000 European firms (2019–2024) | +4% increase in labor productivity; +5.9 percentage point gain for every 1% spent on training. | AI adoption increases efficiency without reducing short-run employment; training significantly boosts gains.
Yotzov et al. (2026) | 6,000 executives (US, UK, DE, AU) | +1.4% projected productivity boost; +0.8% projected output increase; -0.7% projected employment reduction (over next 3 years). | Representative survey showing high adoption but minimal realized impact on productivity to date.
Brynjolfsson (2026) | United States economy | 2.7% US productivity growth in 2025 (nearly double the 1.4% annual average in the previous decade). | A “decoupling” of output from labor input is visible; framed through the “J-curve” hypothesis.
Filippucci et al. (OECD, 2025) | G7 economies (10-year horizon) | +0.4 to +1.3 percentage points (US/UK) vs. +0.2 to +0.8 percentage points (Italy/Japan) annual labor productivity growth. | Projected annual gains based on sectoral specialization (e.g., high in finance/ICT, low in manufacturing).
Frank et al. (2026) | LinkedIn profiles and US unemployment insurance records | Negative entry rates for AI-exposed roles. | Deterioration in AI-exposed labor markets (unemployment risk) began in early 2022, pre-ChatGPT.
Brynjolfsson et al. (2025) | US payroll data (ADP) through 2025 | -15% to -16% employment for early-career workers. | “Canaries in the coal mine”: large employment declines for junior workers in exposed fields.
Hosseini Maasoum and Lichtinger (2025) | 62 million workers/285,000 US firms | Sharp decline in junior employment driven by slower hiring. | “Seniority-biased technological change”; AI substitutes for junior labor while leaving senior roles intact.
St. Louis Fed (2025) | US general labor market | +1.1% to +1.3% labor productivity increase. | Back-of-envelope calculations based on self-reported time savings.
Penn Wharton Budget Model (2025) | US economy | +0.01 percentage points contribution to TFP (negligible). | Projects AI’s current contribution to total factor productivity (TFP).

Workforce Impact

It is challenging to measure AI’s impact on employment, particularly because the technology is still in the early stages of widespread deployment. So far, effects on the workforce appear to be uneven, initially showing up in hiring pipelines, among younger workers, and within specific business functions. The evidence does not point to broad, uniform displacement. Firm-level survey data does suggest that many organizations expect the pace of workforce change to accelerate over the next year. According to McKinsey’s survey, respondents also anticipated that headcount reductions in the coming year would exceed those reported in the past year (Figures 4.4.33 and 4.4.34).

Recent employment data for software developers and customer service roles reveals a generational pattern (Brynjolfsson et al., 2025). In the United States, normalized headcount trends for both occupations show that employment among the youngest workers (ages 22–25) has declined since 2022, even as headcount for older age groups continues to grow (Figure 4.4.29). By September 2025, employment for software developers ages 22–25 had fallen close to 20% from its 2022 peak.

Figure 4.4.29: Normalized headcount trends by age group for software developers and customer service agents, 2021–25
Source: Brynjolfsson et al., 2025 | Chart: 2026 AI Index report
Line charts, January 2021 to September 2025; panels: Software developers, Customer support agents; series: Early career 1 (22–25), Early career 2 (26–30), Developing (31–34), Mid-career 1 (35–40), Mid-career 2 (41–49), Senior (50+); y-axis: normalized headcount

When occupations are grouped by their exposure to AI, the age-based pattern holds (Figure 4.4.30). Among workers ages 22–25, employment in the most AI-exposed occupations has fallen roughly 16% relative to the least-exposed, after controlling for firm-by-time fixed effects, which help isolate AI exposure from broader shocks like interest rate pressure or sector slowdowns. The gap began widening in mid-2024 and has grown steadily since.

Figure 4.4.30: Headcount trends in high AI-exposure jobs (early career 22–25), 2021–25 (see footnote 17)
Source: Brynjolfsson et al., 2025 | Chart: 2026 AI Index report
Line chart, January 2021 to September 2025; y-axis: estimated % difference in headcount (Q5 vs. Q1)

17 This figure includes firm-by-time fixed effects (within-firm, same-period comparisons), accounting for firm/industry hiring swings.

Unemployment data suggests an even more complicated dynamic (Felten et al., 2021; Eckhardt and Goldschlag, 2025) (Figure 4.4.31).
From 2022 to early 2025, unemployment rose across all occupation groups regardless of AI exposure level. While the unemployment rate for the most AI-exposed workers (quintile 5) increased by 0.30 percentage points, it rose even more for the least exposed workers (quintile 1), climbing by 0.94 percentage points. AI exposure alone does not seem to be driving recent unemployment trends, though it may play a part alongside broader macroeconomic conditions and organizational changes.

A different view emerges from looking at occupational churn trends after the introduction of major technologies (Figure 4.4.32). Over comparable time frames, the occupational mix in the United States has shifted faster since the introduction of generative AI than it did after the introduction of computers or the internet (Gimbel et al., 2025).

Figure 4.4.31: Unemployment rate by AI exposure quintile
Source: Eckhardt and Goldschlag, 2025

Figure 4.4.32: Changes in the occupational mix from recent baselines
Source: Gimbel et al., 2025

Employer expectations, however, indicate that the pace may accelerate (Figure 4.4.33). According to McKinsey’s survey (2025), one-third of respondents anticipate a decrease in workforce size, a share that is higher at larger organizations (35% at organizations with ≥$1 billion in revenue) than at smaller firms (30% at organizations with <$1 billion in revenue). The largest group of respondents (43% overall) expects little or no change, while only a minority foresee an increase in workforce size. Even compared to workforce changes that have already taken place, the general sentiment leans toward headcount reductions (Figure 4.4.34). In nearly all functions, respondents anticipate a greater impact from AI on headcount next year than was observed in the past year, with expected decreases outpacing observed decreases.
This trend is particularly pronounced in service operations, supply chain/inventory management, marketing and sales, and software engineering, where the expected decrease in employees for the next year significantly exceeds the actual decrease reported over the past year. Conversely, expectations for workforce increases remain relatively modest across business functions.

Figure 4.4.33: Expected change in workforce size as a result of AI in the next year. Groups: organizations with less than $1 billion in revenues; organizations with $1 billion or more in revenues; overall. Response categories: decrease by >20%; decrease by 11%–20%; decrease by 3%–10%; little or no change; increase by 3%–10%; increase by 11%–20%; increase by >20%; don’t know. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

Figure 4.4.34: Actual vs. expected change in workforce size as a result of AI by function (% of respondents). Series: actual decrease in employees over the past year; expected decrease over the next year; actual increase in employees over the past year; expected increase over the next year. Functions: strategy and corporate finance; knowledge management; manufacturing; marketing and sales; risk; product/service development; IT; software engineering; supply chain/inventory management; service operations; human resources. Source: McKinsey & Company Survey, 2025 | Chart: 2026 AI Index report

A surprising finding from Shao et al. (2026) shows that many workers are not wholly resistant to automation (Figure 4.4.35). A survey of 844 occupational tasks found that 46.1% of workers actively want AI to take over those tasks.
Support was especially strong in areas where workers believed automation would free up time for higher-value tasks, reduce repetitiveness, or improve quality of output. However, actual usage patterns do not necessarily reflect these preferences. Occupational tasks with the highest average automation scores account for only 1.3% of Claude.AI usage. A related framework maps these tasks across four zones based on worker desire and technical feasibility (Figure 4.4.36). Looked at this way, AI’s labor impact will likely register in how specific tasks are redesigned rather than in blunt automation at the occupation level, with the reorganization of work unfolding gradually.

Figure 4.4.35. Source: Shao et al. (2026)

Figure 4.4.36. Source: Shao et al. (2026)
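The four-zone framing can be sketched in a few lines. This is a minimal illustration, assuming each task carries a worker-desire score and a technical-feasibility score on a shared scale and that one midpoint threshold splits each axis. The zone labels, task names, scores, and threshold are hypothetical placeholders, not the definitions used by Shao et al. (2026).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    desire: float       # hypothetical worker-desire score
    feasibility: float  # hypothetical technical-feasibility score


def zone(task: Task, threshold: float = 3.0) -> str:
    """Place a task in one of four zones by splitting each axis at `threshold`."""
    wants = task.desire >= threshold
    feasible = task.feasibility >= threshold
    if wants and feasible:
        return "high desire / high feasibility"
    if wants:
        return "high desire / low feasibility"
    if feasible:
        return "low desire / high feasibility"
    return "low desire / low feasibility"


# Illustrative tasks only; none of these values come from the survey.
tasks = [
    Task("schedule appointments", desire=4.5, feasibility=4.2),
    Task("resolve billing disputes", desire=4.1, feasibility=2.0),
    Task("draft creative concepts", desire=1.8, feasibility=3.9),
    Task("negotiate contracts", desire=1.5, feasibility=1.7),
]

for t in tasks:
    print(f"{t.name}: {zone(t)}")
```

Grouping at the task level rather than the occupation level mirrors the point above: two tasks within the same job can land in different zones.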

4.5 Robot Deployments

Physical automation through robotics represents one form of AI’s economic integration in industrial environments, particularly in production settings such as manufacturing lines or warehouses. To track these trends, the AI Index uses data from the International Federation of Robotics (IFR, 2025), a nonprofit that publishes annual World Robotics reports on global installation patterns and operational stock. Industrial robots are defined by the IFR, and in this reporting [18], as “automatically controlled, reprogrammable, multipurpose manipulators, programmable in three or more axes, which can be either fixed in place or mobile for use in industrial automation applications.”

[18] Due to the timing of the IFR report, the most recent data is from 2024. Every year, the IFR revisits data collected for previous years and will occasionally update the data if more accurate figures become available. Therefore, some of the data reported in this year’s report might differ slightly from data reported in previous years.

Global Installation Patterns

Global industrial robot activity continues to rise, though year-over-year growth has flattened. In 2024, 542,000 industrial robots were installed globally, a slight increase (0.2%) from the previous year (Figure 4.5.1). The composition of those robots has also shifted over time. Collaborative robots, which are designed to work alongside human operators, continue to gain market share over traditional robots. In 2017, collaborative robots accounted for just 2.8% of all new industrial robot installations, compared to 13.6% in 2024. The total operational stock in 2024 grew to 4,664,000, up from 4,282,000 in 2023, an increase of about 9% (Figure 4.5.2). Overall, industrial automation capacity has shown a consistent upward trajectory, with the global fleet of industrial robots quadrupling since 2012.
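As a quick check on the collaborative-robot shares, the sketch below recomputes them from the rounded installation counts labeled in Figure 4.5.3 (in thousands). The 2017 collaborative count is not labeled in the figure, so it is derived here as total minus traditional; that derivation and all names in the code are assumptions of this example. With these rounded inputs, the 2024 share of all installations comes out near 12%, while collaborative relative to traditional installations is about 13.6%, so the 2024 figure quoted above may be using the traditional count as its base.

```python
# Installations in thousands, as labeled in Figure 4.5.3 (rounded values).
installs = {
    2017: {"traditional": 389, "total": 400},
    2024: {"traditional": 477, "total": 542},
}


def collaborative_count(year_data):
    """Collaborative installations, derived as total minus traditional."""
    return year_data["total"] - year_data["traditional"]


def share_of_total(year_data):
    """Collaborative robots as a fraction of all installations."""
    return collaborative_count(year_data) / year_data["total"]


def ratio_to_traditional(year_data):
    """Collaborative robots relative to traditional installations."""
    return collaborative_count(year_data) / year_data["traditional"]


for year, data in installs.items():
    print(
        year,
        f"share of total: {share_of_total(data):.1%}",
        f"ratio to traditional: {ratio_to_traditional(data):.1%}",
    )

# Operational stock growth, 2023 to 2024, from the counts quoted in the text.
stock_growth = (4_664_000 - 4_282_000) / 4_282_000
print(f"operational stock growth: {stock_growth:.1%}")
```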
Figure 4.5.1: Number of industrial robots installed in the world, 2012–24 (in thousands). 2024 value: 542. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Figure 4.5.2: Operational stock of industrial robots in the world, 2012–24 (in thousands). 2024 value: 4,664. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Figure 4.5.3: Number of industrial robots installed in the world by type, 2017–24 (in thousands). Traditional, 2017–24: 389, 405, 366, 363, 484, 495, 484, 477. Collaborative (labeled for 2021–24 only): 42, 58, 57, 65. Total, 2017–24: 400, 424, 387, 389, 526, 553, 541, 542. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Geographic Patterns

Industrial robot installation follows patterns similar to the investment and talent trends discussed above, although its geographic distribution is relatively narrow. In 2024, China led the world with 295,000 industrial robot installations, about 6.6 times Japan’s 44,500 and 8.6 times the United States’ 34,200 (Figure 4.5.4). South Korea and Germany followed with 30,600 and 27,000 installations, respectively. China’s share of global installations has increased substantially from 20.8% in 2013 to 54.4% in 2024 (Figure 4.5.5).
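The multiples and the global share quoted above follow directly from the 2024 installation counts. A small sketch, with the dictionary and function names chosen only for this example:

```python
# 2024 installations in thousands, as quoted in the text and Figure 4.5.4.
installs_2024 = {
    "China": 295.0,
    "Japan": 44.5,
    "United States": 34.2,
    "South Korea": 30.6,
    "Germany": 27.0,
}
GLOBAL_TOTAL_2024 = 542.0  # thousands, from Figure 4.5.1


def multiple_of(leader, other):
    """How many times larger the leader's installations are than the other's."""
    return installs_2024[leader] / installs_2024[other]


def global_share(country):
    """A country's installations as a fraction of the global total."""
    return installs_2024[country] / GLOBAL_TOTAL_2024


print(f"China vs. Japan: {multiple_of('China', 'Japan'):.1f}x")
print(f"China vs. United States: {multiple_of('China', 'United States'):.1f}x")
print(f"China share of global installations: {global_share('China'):.1%}")
```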
Figure 4.5.4: Number of industrial robots installed by geographic area, 2024 (in thousands). China 295.00; Japan 44.50; United States 34.20; South Korea 30.60; Germany 27.00; India 9.10; Italy 8.80; Taiwan 5.80; Mexico 5.60; Spain 5.10. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Figure 4.5.5: Number of new industrial robots installed in top 5 countries, 2011–24 (in thousands). 2024 values: China 295; Japan 45; United States 34; South Korea 31; Germany 27. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Service Robotics

Nonindustrial or service robots designed for tasks such as logistics, hospitality, and agriculture showed growth in 2024 (Figure 4.5.7). Service robot installations increased across most application areas compared to 2023, with agriculture seeing particularly strong adoption. The number of service robots deployed in an agricultural setting increased 2.5-fold. Only the hospitality category saw a year-over-year decline.

Figure 4.5.6: Annual growth rate of industrial robots installed by geographic area, 2023 vs. 2024. Taiwan 33%; India 7%; China 7%; Mexico 4%; Spain 1%; South Korea -3%; Japan -4%; Germany -5%; United States -9%; Italy -16%. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Figure 4.5.7: Number of service robots installed in the world by application area, 2021–24 (in thousands). Application areas: transportation and logistics; professional cleaning; medical and healthcare; hospitality; agriculture. Source: International Federation of Robotics (IFR), 2025 | Chart: 2026 AI Index report

Chapter 5: Science

5.1 AI for Science in 2025

AI’s role in science falls into three categories that coexist but differ in terms of maturity. The first, machine learning over scientific data to build predictive and explainable models, has been practiced for several decades and is now commonplace. The second, AI systems that assist scientists in their workflows through literature synthesis, experiment design, or data analysis, has been emerging over several years and expanded considerably in 2025. The third category, autonomous AI systems capable of generating new scientific discoveries with limited human guidance, is gaining traction but remains at an early stage.

The year’s most visible developments occurred primarily in the second and third categories. Aardvark Weather replaced the full numerical weather prediction pipeline (Allen et al., 2025). Google’s AI co-scientist orchestrated hypothesis generation through experimental design (Gottweis et al., 2025). To date, the clearest breakthroughs tend to cluster in domains with strong existing data infrastructure, including structural biology, physics, chemistry, and materials science, rather than in fields with the most sophisticated mathematical or physics-based models.

These developments, however, do not automatically translate into scientific progress. Experimental validation remains expensive and time-consuming, and scientists are unlikely to invest in testing AI-generated hypotheses without sufficient reason to believe the tests will yield findings. In drug discovery, for example, AI systems can propose novel candidate molecules at scale, but clinical trials to determine whether those molecules work remain a costly, multiyear process. The gap between what AI can propose and what scientists can feasibly test is a recurring theme across the domains covered in this chapter.
Publications in AI for Science

In the Web of Science database, AI-related publications in the natural sciences reached approximately 80,150 in 2025, up from 63,547 in 2024, a one-year increase of roughly 26% (Figure 5.1.1). Physical sciences [1] and life sciences followed similar trajectories in 2025, reaching approximately 33,000 and 29,000 publications, respectively, with each growing by roughly 27%–28% year over year. Earth science, the smallest category in absolute terms at approximately 20,460 publications, grew by about 23%.

As a share of total scientific output, AI-related work remains a single-digit fraction of each field but is climbing quickly (Figure 5.1.2). By 2025, Earth science had the highest AI penetration at 8.8%, followed by natural sciences overall at 6.8%, life sciences at 6.5%, and physical sciences at 5.8%. In 2010, all four categories sat below 1%. The quantity of AI-mentioning papers is not the same as the quality of AI-enabled discovery, but the breadth suggests that AI methods are becoming a routine part of scientific practice across disciplines.

Figure 5.1.1 [2]: Number of AI-related publications in natural sciences, 2010–25 (in thousands). 2025 values: natural sciences 80.15; physical sciences 33.05; life sciences 28.91; Earth science 20.46. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 5.1.2: AI-related publications in natural sciences (% of total), 2010–25. 2025 values: Earth science 8.79%; natural sciences 6.80%; life sciences 6.48%; physical sciences 5.84%. Source: AI Index, 2026 | Chart: 2026 AI Index report

[1] Physical sciences in this analysis include: astronomy and astrophysics, chemistry, crystallography, electrochemistry, geochemistry and geophysics, geology, mathematics, meteorology and atmospheric science, mineralogy, mining and mineral processing, oceanography, optics, physical geography, physics, polymer science, thermodynamics, and water resources.

[2] The natural sciences count may be slightly lower than the sum of the individual domain counts. This is because a single publication can be assigned to more than one domain. For example, a biochemistry paper may be categorized under both physical sciences and life sciences. To avoid double-counting, these publications are counted only once in the natural sciences total.
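The growth rate and the double-counting note can both be checked with simple arithmetic. The sketch below uses the 2024 and 2025 totals quoted in the text and the 2025 domain values labeled in Figure 5.1.1 (converted from thousands, so they are rounded). The paper IDs in the second half are hypothetical and exist only to show why per-domain counts can sum to more than the deduplicated total.

```python
from collections import Counter

# 2025 domain counts read from the Figure 5.1.1 labels (rounded).
domain_counts_2025 = {
    "physical sciences": 33_050,
    "life sciences": 28_910,
    "Earth science": 20_460,
}
natural_sciences_2025 = 80_150
natural_sciences_2024 = 63_547


def yoy_growth(current, previous):
    """Year-over-year growth as a fraction of the previous value."""
    return (current - previous) / previous


growth = yoy_growth(natural_sciences_2025, natural_sciences_2024)
domain_sum = sum(domain_counts_2025.values())
# Domain assignments beyond the first for each paper; this equals the number
# of multi-domain papers only if no paper sits in all three domains.
extra_assignments = domain_sum - natural_sciences_2025

print(f"natural sciences growth: {growth:.1%}")
print(f"sum of domain counts: {domain_sum:,}")
print(f"extra domain assignments: {extra_assignments:,}")

# Toy example with hypothetical paper IDs: one paper tagged in two domains.
paper_domains = {
    "paper-a": {"physical sciences"},
    "paper-b": {"physical sciences", "life sciences"},
    "paper-c": {"Earth science"},
}
per_domain = Counter(d for domains in paper_domains.values() for d in domains)
unique_total = len(paper_domains)
print(f"toy per-domain sum: {sum(per_domain.values())}, unique papers: {unique_total}")
```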

5.2 AI Across Scientific Domains

This section examines AI’s expanding role in science across three major scientific groupings and tracks the datasets, benchmarks, and foundation models of each. The tables below catalog selected releases across each category. A consistent finding is that the majority of scientific AI models originate from academic institutions collaborating across countries, in contrast to the industry-dominated landscape of general-purpose foundation models described in Chapter 1 and Chapter 2.

Physics, Astronomy, Chemistry, and Materials Science

AI is accelerating physics, astronomy, chemistry, and materials science by replacing expensive first-principles simulations with learned surrogates and by generating novel materials and molecular structures through inverse design. Notable 2025 releases include large chemistry datasets (e.g., OMol25 and OC25), simulation-oriented foundation models (e.g., Walrus, GPhyT), and materials checkpoints for atomistic modeling and generation (e.g., MACE-MP-0 and MatterGen). In chemistry and materials science, agent systems began connecting to external software tools and laboratory equipment to execute experiments. Benchmark results, however, suggest these systems are not yet reliable when asked to carry out full research tasks from start to finish.

Datasets

The largest dataset releases in physics and chemistry in 2025 expanded multimodal astronomy resources and chemistry-scale quantum data. Multimodal Universe aggregates approximately 100TB of astronomical observations, while OMol25 reports over 100 million high-accuracy density functional theory (DFT) calculations spanning 83 elements. These datasets provide the training foundation for large models targeting prediction and simulation tasks in their respective fields (Figure 5.2.1).
Figure 5.2.1: Selected datasets in physics, astronomy, chemistry, and materials science (2025)
Columns: Name | Affiliation [3] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, NONPROFIT, ACADEMIA, NONPROFIT, INDUSTRY, GOVERNMENT, INDUSTRY, GOVERNMENT, ACADEMIA, ACADEMIA
ChemPile | Helmholtz Institute for Polymers (HIPOLE) Jena, Friedrich Schiller University Jena, Hacettepe University, University of Toronto | 75B-plus tokens of curated chemical data (SMILES, InChI, text, code, educational materials). | Chemistry
Multimodal Universe | Polymathic AI, Instituto de Astrofisica de Canarias, Universidad de La Laguna, Massachusetts Institute of Technology | 100TB astronomical dataset: multichannel images, spectra, time series from hundreds of millions of observations. | Astronomy
The Open Molecules 2025 (OMol25) | Fundamental AI Research (FAIR) at Meta, Los Alamos National Laboratory, University College Dublin | 100M-plus DFT calculations spanning 83 elements, diverse chemical interactions, structures up to 350 atoms. | Chemistry, Materials, Chemical Physics
The Open Catalyst 2025 (OC25) | FAIR at Meta, Texas Tech University, Nanyang Technological University | 7.8M calculations across 1.5M solvent environments spanning 88 elements for catalysis at solid-liquid interfaces. | Chemistry, Materials
[3] Full references are provided in the Appendix.

Benchmarks

In these particular domains, benchmarks have been newly introduced and therefore do not offer longitudinal data across multiple years. They do show how general-purpose frontier models, discussed in Chapter 2, perform on scientific tasks (Figure 5.2.2). On ChemBench, a chemistry evaluation with over 2,700 question-answer pairs, the best frontier models outperform the best human chemists, though they struggle with basic tasks. ReplicationBench reports frontier model performance below 20% on paper-scale replication tasks in computational astrophysics.
Figure 5.2.2: Selected benchmarks in physics, astronomy, chemistry, and materials science (2025)
Columns: Name | Affiliation [4] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, GOVERNMENT, ACADEMIA, GOVERNMENT, GOVERNMENT, ACADEMIA, ACADEMIA, ACADEMIA, ACADEMIA, ACADEMIA, ACADEMIA, INDUSTRY, INDUSTRY, INDUSTRY, ACADEMIA, ACADEMIA, ACADEMIA, INDUSTRY
AstroVisBench | University of Texas at Austin, NSF National Optical-Infrared Astronomy Research Laboratory, University of Virginia | First benchmark for LLM scientific computing and visualization in astronomy. | Astronomy
ChemBench | Friedrich Schiller University Jena, Helmholtz Institute for Polymers (HIPOLE) Jena, Spanish National Research Council (CSIC) | 2,700-plus Q&A pairs. Best models outperform human chemists on average but struggle with basic tasks. | Chemistry
ChemX | ITMO University, D ONE | 10 curated datasets for automated chemical information extraction from nanomaterials and small molecules. | Chemistry, Materials Science
GravityBench | University of Toronto, NYU Abu Dhabi | Tests AI discovery of physics laws from gravitational simulations, including non-real-world physics. | Physics, Astrophysics
LLM-SRBench | Virginia Tech, VinUniversity, Carnegie Mellon University | 239 problems testing genuine equation discovery vs. memorization. Best systems score 31.5%. | Physics, Scientific Equation Discovery
Matbench Discovery | University of Cambridge, Lawrence Berkeley National Laboratory, Federal Institute of Materials Research and Testing (BAM) | Evaluation framework for machine learning energy models prescreening stable inorganic crystals. | Materials, Chemistry
MatSciBench | UCLA, Princeton, Virginia Tech | 1,340 college-level problems across 6 fields and 31 subfields. Top models under 80%. | Materials Science
PHYBench | Peking University, CSRC | 500 original physics problems. Gemini 2.5 Pro: 36.9% vs. human experts: 61.9%. | Physics
PhysGym | KAUST Center of Excellence for Generative AI, The Swiss AI Lab | Interactive physics environments testing LLM scientific reasoning under varying prior knowledge. | Physics
ReplicationBench | Stanford University, University of Toronto | Tests AI replication of entire astrophysics papers. Frontier models score under 20%. | Astrophysics/Research Replication
Theoretical Physics Benchmark (TPBench) | University of Wisconsin-Madison, Indiana University, NSF-Simons AI Institute for the Sky (SkAI) | 57 novel theoretical physics problems (high-energy theory, cosmology). Research-level problems largely unsolved. | Theoretical Physics
[4] Full references are provided in the Appendix.

Foundation Models

Foundation model (FM) releases in 2025 spanned astronomy, physics simulation, chemistry language models, and materials modeling (Figure 5.2.3). GPhyT, a General Physics Transformer trained on 1.8TB of simulation data, achieved up to 29 times better performance than specialized models and generalized to physics problems outside its training data without task-specific fine-tuning.

Figure 5.2.3: Selected foundation models in physics, astronomy, chemistry, and materials science (2025)
Columns: Name | Affiliation [5] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, ACADEMIA, GOVERNMENT, GOVERNMENT, NONPROFIT, INDUSTRY, ACADEMIA, ACADEMIA, ACADEMIA, ACADEMIA, ACADEMIA, GOVERNMENT, GOVERNMENT, GOVERNMENT, INDUSTRY, INDUSTRY, INDUSTRY, INDUSTRY, INDUSTRY, ACADEMIA, NONPROFIT
AION-1 | UC Berkeley, Flatiron Institute, New York University | Astronomy FM: 300M–3.1B parameters, 200M-plus celestial objects from 5 major surveys. Open release. | Astronomy
ChemDFM | Shanghai Jiao Tong University, Suzhou Laboratory, AI Speech Co. | Chemistry LLM: 34B tokens, 2.7M instructions. Generalist chemical AI. | Chemistry
GPhyT: General Physics Transformer | RWTH Aachen University, University of Virginia | Trained on 1.8TB simulation data. Up to 29x better than specialized models. Zero-shot generalization. | Physics
MACE-MP-0 | University of Cambridge, Federal Institute of Materials Research and Testing (BAM), UC Berkeley | General-purpose force field model for predicting atomic interactions across nearly all materials. | Chemical Physics
MatterGen | Microsoft Research AI for Science, Shenzhen Institute of Advanced Technology (Chinese Academy of Sciences) | Diffusion-based generative model. Over 2x more novel and stable than existing methods. | Materials Science
PDE-Transformer | Technical University of Munich | Transformer for physics partial differential equations (PDEs) on grids. Outperforms state-of-the-art vision architectures across 16 types of physics simulations. | Physics Simulations
PhysiX | UCLA | 4.5B params. First large-scale physics simulation FM. Transfers from natural videos to simulation. | Physics Simulations
SMI-TED | IBM Research | Chemical foundation models trained on molecular sequences. | Chemistry
Surya | University of Alabama in Huntsville, NASA Marshall Space Flight Center, IBM Research | 366M parameters. First heliophysics FM. Forecasts space weather from NASA’s Solar Dynamics Observatory data without task-specific training. | Heliophysics
Walrus | Flatiron Institute, New York University, University of Cambridge | Fluid mechanics FM: 19 scenarios spanning astrophysics, geoscience, plasma physics, acoustics. Open weights. | Fluid Mechanics, Continuum Mechanics, multiple domains
[5] Full references are provided in the Appendix.

AI Agents

Agent systems in the physical sciences combine tool use with domain-specific reasoning to perform tasks requiring multiple steps. Some of these systems function as focused components within larger pipelines, while others attempt to operate as end-to-end research systems. As they take on more responsibility for both designing and executing research, independent confirmation of results becomes an important step. Physics Supernova scored 23.5 out of 30 at the 2025 International Physics Olympiad, ranking 14th out of 406 participants and reaching gold-medalist level. StarWhisper Telescope automates astronomical observation planning across 10 telescopes. In chemistry, ChemAgents demonstrated autonomous synthesis and optimization using a robotic platform controlled by Llama-3.1-70B (Figure 5.2.4).

Figure 5.2.4: Selected AI agents in physics, astronomy, chemistry, and materials science (2025)
Columns: Name | Affiliation [6] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, ACADEMIA, GOVERNMENT, ACADEMIA, GOVERNMENT, ACADEMIA, ACADEMIA, ACADEMIA, GOVERNMENT, INDUSTRY
ChatGPT Material Explorer | Johns Hopkins University | Materials science assistant combining LLMs with graph neural networks for property prediction. | Materials Science
ChemAgents | University of Science and Technology of China, University of Birmingham, Henan Academy of Sciences | Robotic AI chemist (Llama-3.1-70B). Autonomous synthesis, optimization, photocatalysis. | Chemistry
ChemToolAgent | Shanghai Artificial Intelligence Laboratory, Soochow University, Zhejiang University | 137 external chemical tools. HE-MCTS framework surpasses GPT-4o on chemistry QA. | Chemistry, Materials Science
Crystalyse | Imperial College London | Multi-tool AI agent for materials design that coordinates multiple computational tools through an LLM-based reasoning framework. | Materials Science, Chemistry
Physics Supernova | Princeton University, Tsinghua University, Shanghai Jiao Tong University | IPhO 2025: 23.5 out of 30, ranked 14th of 406. Gold-medalist-level physics problem-solving. | Physics
StarWhisper Telescope | University of Chinese Academy of Sciences, National Astronomical Observatories (CAS), Simon Fraser University | Automates astronomical observations across 10 telescopes. LLM-driven observation planning. | Astronomy
[6] Full references are provided in the Appendix.

Biological and Life Sciences

AI is increasingly being applied to biological research beyond biomedicine to address fundamental questions in genomics, neuroscience, ecology, and synthetic biology. Chapter 6 covers AI’s role in the therapeutic pipeline, from protein structure prediction to drug design to clinical applications. This section focuses on the broader scientific infrastructure, including the datasets, benchmarks, foundation models, and agents that support biological research as a whole.

The scale of biological training data grew in 2025, and foundation models trained on genomic and evolutionary data expanded from prediction into generative design. However, the gap between genomic sequence data, which is abundant, and functional perturbation data, which measures how biological systems respond to interventions, remains wide (Callahan et al., 2025; Sun et al., 2025). AI is also being applied at the macroscopic scale, with computer vision and acoustic models routinely processing sensor data to track species populations and optimize agricultural water use in real time (Miller et al., 2025; Khan and Sharma, 2025).
Ecological and biodiversity applications lag behind other biological subfields, in part because training data in these areas is sparse, biased toward well-studied taxa, and lacking standardized formats (Fahsbender et al., 2025). In species taxonomy and evolutionary biology, vision-based foundation models such as the BioCLIP family (Stevens et al., 2024; Gu et al., 2025) are enabling classification and discovery across the tree of life, while methods like PhyloNN (Elhamod et al., 2023) can identify evolutionary traits from images without labeled data.

In neuroscience, AI serves both as a practical tool for brain mapping and as a source of theoretical inspiration. Computer vision approaches have been instrumental in assembling full connectome data from model organisms such as the fly and mouse (Dorkenwald et al., 2025; The MICrONS Consortium, 2025). Comparisons between biological neuronal networks and artificial deep networks inform how researchers understand the principles of information processing in the brain (Linsley et al., 2025; Kazemian et al., 2025).

Datasets

OpenGenome2 contains nearly 9.3 trillion base pairs of curated DNA from across all domains of life, making it the largest genomic training corpus assembled to date and the foundation for the Evo 2 model. In neuroscience, Spacetop contributed over 600 imaging hours across 101 participants for cognitive neuroscience research. These resources provide the scale necessary for foundation models to learn biological features without task-specific training, though the gap between genomic data availability and functional perturbation data remains wide (Figure 5.2.5).

Figure 5.2.5: Selected datasets in biological and life sciences (2025)
Columns: Name | Affiliation [7] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, ACADEMIA, INDUSTRY, NONPROFIT, ACADEMIA
OpenGenome2 | Arc Institute, Stanford University, Nvidia | 9.3T base pairs of curated DNA from all domains of life. Training corpus for Evo 2. | Biology, Genomics
ProteinTalks-DB | Westlake University (Academia) | 38M+ proteomics measurements from drug-treated breast cancer cells trained for protein dynamics and drug-response prediction. | Proteomics, Systems Biology
Spacetop | Dartmouth College, Johns Hopkins University, Emory University | 101 participants, 600-plus imaging hours. Cognitive, affective, social, interoceptive domains. | Neuroscience
[7] Full references are provided in the Appendix.

Benchmarks

Life-science benchmarks have moved to testing workflow execution and tool-integrated analysis, rather than static knowledge. BixBench reports that frontier models achieve roughly 17% accuracy on real-world bioinformatics analysis tasks, highlighting challenges in chaining tools, file handling, and domain interpretation. BioML-bench provides the first end-to-end evaluation of AI agents on biomedical machine learning tasks that span protein engineering to drug discovery, and it found that on average agents underperform human baselines. These results are consistent with the pattern observed across other scientific domains in this chapter. AI systems perform well on isolated subtasks but struggle when required to execute the multistep workflows that actual biological research demands (Figure 5.2.6).

Figure 5.2.6: Selected benchmarks in biological and life sciences (2025)
Columns: Name | Affiliation [8] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, ACADEMIA, ACADEMIA, ACADEMIA, INDUSTRY, INDUSTRY
BaisBench | Tsinghua University | Evaluates AI biological discovery ability via cell annotation and data-driven questions. | Biology
BioML-bench | Shift Bioscience, University of Cambridge, ScienceMachine | First end-to-end biomedical machine learning evaluation. Agents underperform human baselines on average. | Biology
BixBench | FutureHouse, ScienceMachine | 50-plus bioinformatics scenarios. GPT-4o and Claude 3.5 Sonnet: ~17% accuracy. | Computational Biology
CGBench | Stanford University | Clinical genetics interpretation. Reasoning models excel at fine-grained tasks; substantial hallucination gaps remain. | Biology (Genetics)
Mouse vs. AI: Robust Foraging Competition | UC Santa Barbara | Bioinspired benchmark grounding reinforcement learning agents in neuroscience via shared foraging tasks with mice. | Neuroscience
[8] Full references are provided in the Appendix.

Foundation Models

In 2025, foundation model releases in the biological and life sciences domains expanded across genomics and cellular modeling. The genomic foundation model Evo 2 was trained on OpenGenome2, which contains 9.3 trillion DNA base pairs from all domains of life. It operates at up to 40 billion parameters with a 1 million token context window and was released with fully open weights. Chapter 6 examines its performance on genomic prediction tasks alongside smaller, task-specific alternatives, and covers additional genomic and cellular foundation models, including AlphaGenome and CellFM. In neuroscience, a foundation model of neural activity predicts neuronal responses and generalizes across stimulus types and individual animals (Figure 5.2.7).

Figure 5.2.7: Selected foundation models in biological and life sciences (2025)
Columns: Name | Affiliation [9] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): INDUSTRY, INDUSTRY, ACADEMIA, ACADEMIA, ACADEMIA, NONPROFIT, ACADEMIA, ACADEMIA, ACADEMIA, GOVERNMENT, INDUSTRY, INDUSTRY, INDUSTRY
AlphaGenome | Google DeepMind | Genomic foundation model predicting thousands of functional measurements from DNA sequence at single-base-pair resolution. | Biology, Genomics
ANN Model | Baylor College of Medicine, Stanford University, University of Göttingen | Neural activity FM. Predicts neuronal responses, generalizes across stimulus types and mice. | Neuroscience
BioCLIP 2 | The Ohio State University, Smithsonian Institution, UNC-Chapel Hill | Vision foundation model for biological classification across the tree of life. Trained using hierarchical contrastive learning on taxonomic structure. | Biology
BioLab | Princeton University, BioMap Research, Zhejiang University | Multiagent system for automated biological research. Experimentally validated novel antibody designs. | Biology
CellFM | Sun Yat-sen University, Chongqing University, Jinfeng Laboratory | 800M parameters, 100M human cells. Single-cell analysis, perturbation prediction, gene–gene relationships. | Biology
Evo 2 | Arc Institute, Nvidia, Stanford University, UC Berkeley | 40B params, 1M token context. 9.3T base pairs. Genome-scale generation. Fully open release. | Biology, Genomics
ProteinTalks | Westlake University, DP Technology Co., Westlake Omics Co. | Foundation model for protein network dynamics. Predicts drug efficacy and synergy from perturbation proteome data. | Proteomics, Systems Biology
[9] Full references are provided in the Appendix.

AI Agents

Agent systems in the life sciences are beginning to operationalize complex research workflows, including literature synthesis and bioinformatics execution. BCI-Agent performs autonomous neuronal cell-type classification from electrophysiology recordings without task-specific training. Biomni is a general-purpose biomedical agent spanning 25 subfields. Chapter 6 describes its architecture and capabilities in greater detail (Figure 5.2.8).

Figure 5.2.8: Selected AI agents in biological and life sciences (2025)
Columns: Name | Affiliation [10] | Summary | Domain
Sector labels (listed in order; row alignment not preserved): ACADEMIA, ACADEMIA, INDUSTRY, ACADEMIA, INDUSTRY, NONPROFIT
BCI-Agent | Harvard University, Massachusetts Institute of Technology (MIT), Broad Institute of MIT and Harvard | Autonomous neuronal cell-type classification from electrophysiology. No task-specific training. | Neuroscience
BioAgents | UC Berkeley, Microsoft Research, UC San Francisco | Multiagent system on small language models with RAG. Expert-level on conceptual genomics tasks. | Biology
Biomni | Stanford University, Genentech, Arc Institute | General-purpose biomedical agent across 25 fields. (See Chapter 6 for detailed coverage.) | Biology
[10] Full references are provided in the Appendix.

Earth Science

Progress in AI for Earth science remains aligned with observational infrastructure, including reanalysis datasets and global satellite archives. Weather forecasting, which benefits from decades of reanalysis datasets such as ERA5 and dense global satellite archives, has advanced furthest, with multiple AI models being used in real forecasting systems in 2025. Climate modeling lags behind because it requires projections on decadal timescales where future states fall outside the distribution of any existing training data.

Hydrology offers one of the clearest examples of benchmark-driven progress in scientific AI. LSTM-based models, trained jointly across hundreds of catchments in the CAMELS dataset, have consistently outperformed process-based hydrologic models (Kratzert et al., 2019), and regional extensions now span the United States, the United Kingdom, Australia, Chile, and Brazil. Agriculture presents the opposite pattern, with shared benchmark datasets still scarce, making progress difficult to measure across research groups despite promising work in knowledge-guided approaches to carbon cycle quantification (Liu et al., 2024) and global change ecology (Jin et al., 2026).
Datasets

Earth science relies heavily on large governmental and institutional observation systems rather than purpose-built AI training corpora. In carbon flux research, global flux tower networks provide foundational observational data. FLUXNET2015 aggregates eddy covariance measurements from sites worldwide (Pastorello et al., 2020), while regional networks including AmeriFlux (North America), ICOS (Europe), and JapanFlux (Japan and East Asia) contribute additional coverage. These datasets enable training and evaluation of models that upscale local carbon flux measurements to regional and global estimates (Figure 5.2.9).

Figure 5.2.9: Selected datasets in Earth science (2025)
Name | Affiliation | Summary | Domain
AmeriFlux | Indiana University, USDA Agricultural Research Service, University of Wisconsin-Madison | North American network of 260-plus flux tower sites measuring ecosystem carbon, water, and energy exchange. Over 50 sites with 10-plus years of continuous data. | Ecology
CAMELS | NSF National Center for Atmospheric Research, U.S. Geological Survey, U.S. Department of the Interior | Standardized data on terrain, climate, soil, and streamflow for 671 U.S. river basins. Foundation for AI hydrology benchmarking. | Hydrology
FLUXNET2015 | Lawrence Berkeley National Laboratory, University of Tuscia, ETH Zurich | Global measurements of CO2, water, and energy exchange between ecosystems and atmosphere from 212 sites worldwide. | Ecology
ICOS | ICOS ERIC/University of Helsinki, Thünen Institute of Climate-Smart Agriculture, ETH Zurich | European observation network of 140-plus stations across 12 countries measuring greenhouse gas concentrations and carbon fluxes across atmosphere, land, and ocean. | Ecology
JapanFlux | Osaka Metropolitan University, Chiba University, National Institute of Polar Research (NIPR) | Land-atmosphere flux measurements covering Japan and East Asia from 1990 to 2023. Tracks energy, water, and CO2 exchange across Asian ecosystems. | Ecology
Sectors represented: academia, government.
Footnote 11 (Affiliation): Full references are provided in the Appendix.

Benchmarks

In 2025, benchmarks in Earth science expanded to cover reliability on extreme events, where AI weather models face the highest stakes and where standard average skill metrics fail to capture performance (Figure 5.2.10).

Figure 5.2.10: Selected benchmarks in Earth science (2025)
Name | Affiliation | Summary | Domain
EarthSE | Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Hong Kong Polytechnic University | 100K papers, 114 disciplines, 11 LLMs tested. Significant gaps in Earth science exploration. | Earth Science
ExEBench | Technical University of Munich (TUM), Munich Center for Machine Learning (MCML) | 7 extreme event categories. Global coverage. Tests detection, monitoring, and forecasting. | Atmospheric Sciences
UnivEARTH | Cornell University, Columbia University | 140 Earth observation questions. LLM agents: 33% accuracy. Code fails 58% of time. | Earth Science
Sectors represented: academia, industry.
Footnote 12 (Affiliation): Full references are provided in the Appendix.

Foundation Models

Earth science foundation models released in 2025 covered weather forecasting, climate emulation, and geospatial representation. In weather forecasting, several systems built directly on models highlighted in the 2025 AI Index. FourCastNet 3 generates a 60-day global forecast at 0.25-degree resolution in under 4 minutes on a single GPU, running 8 to 60 times faster than prior approaches. For Earth observation, TerraMind is the first any-to-any generative multimodal model operating across 9 geospatial modalities (Figure 5.2.11).

Figure 5.2.11: Selected foundation models in Earth science (2025)
Name | Affiliation | Summary | Domain
AlphaEarth | Google DeepMind | Embedding field model for general geospatial representation. Outperforms prior featurization approaches. | Earth Observation
cBottle (Climate in a Bottle) | Nvidia | Diffusion-based climate emulator. Global 5 km at 12.5M-pixel resolution. Diurnal-to-seasonal variability. | Climate Science
FourCastNet 3 | Nvidia, Lawrence Berkeley National Laboratory, UC Berkeley | 60-day forecast in <4 min/GPU. 8–60x faster. Builds on Aurora and NeuralGCM advances. | Weather Forecasting
GAIA | USRA/RIACS, BCG X AI Science Institute | Atmospheric FM from 15 years of satellite imagery. Atmospheric rivers (F1: 0.58), cyclone detection (81% recall). | Atmospheric Sciences
OlmoEarth | Allen Institute for AI, University of Washington, Arizona State University | Multimodal spatiotemporal Earth observation FM. Platform for NGOs and nonprofits. | Earth Observation
TerraMind | IBM Research, ETH Zurich, Forschungszentrum Jülich | First any-to-any generative multimodal Earth observation FM. 9 geospatial modalities. | Earth Observation
WeatherNext 2 | Google DeepMind | Hundreds of weather outcomes in <1 min/TPU. Surpasses its predecessor on 99.9% of variables and lead times. Builds on GenCast. | Weather Forecasting
Sectors represented: industry, academia, government, nonprofit.
Footnote 13 (Affiliation): Full references are provided in the Appendix.

AI Agents

In Earth science, AI agents are moving beyond data retrieval toward executing full research workflows, including automated observation processing, literature-informed analysis, and climate task completion.
ClimateAgent completed 85 climate tasks with 100% completion and a quality score of 8.32, compared with 6.27 for Microsoft Copilot and 3.26 for GPT-5 (Figure 5.2.12).

Figure 5.2.12: Selected AI agents in Earth science (2025)
Name | Affiliation | Summary | Domain
ClimateAgent | The Hong Kong University of Science and Technology | 85 climate tasks: 100% completion, quality 8.32 vs. Copilot 6.27, GPT-5 3.26. | Climate Science
EarthLink | Shanghai Artificial Intelligence Laboratory, Fudan University, University of Sydney | First AI copilot for Earth scientists. Automated research workflows with dynamic feedback loop. | Climate Science
PANGAEA GPT | Alfred Wegener Institute for Polar and Marine Research | Multi-agent system for PANGAEA Earth science database. Intelligent data processing, natural language interface. | Earth Science
Sectors represented: academia, industry, government.
Footnote 14 (Affiliation): Full references are provided in the Appendix.

Mathematics

Mathematical reasoning is another active testing ground for AI capabilities. Systems such as Goedel-Prover are moving toward automated formal proof generation in languages like Lean. Competition-level problem-solving and formal verification of known results are advancing quickly, but major open problems, such as long-standing Erdős conjectures, remain well beyond current capabilities. Chapter 2 covers benchmark performance in detail, including a jump from silver to gold medal at the International Mathematical Olympiad in a single year and rapid gains on FrontierMath and MathArena.

5.3 AI Agents and Tools for Science Workflows

The domain-specific tables of Section 5.2 catalog a growing inventory of AI agents, foundation models, datasets, and benchmark suites. Two cross-domain benchmarks released in 2025 offer a broader view of how well these systems perform when asked to do end-to-end scientific research rather than isolated tasks. On both benchmarks, even the best-performing agents fall well below expert-level performance.

AstaBench

AstaBench is an end-to-end benchmark suite that evaluates agentic scientific research ability across over 2,400 problems spanning multiple domains and the full discovery workflow, from literature understanding through code execution, data analysis, and end-to-end discovery. It benchmarked 57 agents across 22 agent classes and reported both an overall score and cost per problem (Figure 5.3.1). The best-performing agent scored around 0.53 at a cost of roughly $3.40 per problem, while most agents clustered between 0.10 and 0.45 at per-problem costs below $1.00.

Figure 5.3.1: AstaBench: average score (y-axis, 0.00 to 0.60) vs. average cost per problem in US dollars (x-axis, 0.00 to 3.50). Source: AstaBench Leaderboard, 2026 | Chart: 2026 AI Index report

PaperArena

PaperArena is a benchmark that tests whether LLM agents can answer real research questions that require stitching together evidence across multiple papers while orchestrating external tools for parsing, retrieval, and computation. Gemini 2.5 Pro performed best overall, achieving 38.8% average accuracy in a multiagent configuration (Figure 5.3.2). All tested agents lagged substantially behind the PhD expert baseline of 83.50%.
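The size of the multiagent gain and the remaining distance to the expert baseline can be checked with a few lines of arithmetic. The sketch below is illustrative only: the scores are the values printed in Figure 5.3.2, and pairing each value with a model in chart order is an assumption, as are the helper names `multiagent_gains` and `gap_to_baseline`.

```python
# Scores in percent, read from Figure 5.3.2 as (single-agent, multiagent).
# Assumption: values are paired with models in the order the chart lists them.
PAPERARENA_SCORES = {
    "Kimi-K2-Instruct": (22.32, 26.28),
    "Qwen3-235B-Instruct": (23.47, 28.83),
    "Claude 3.5 Sonnet": (24.62, 29.34),
    "GLM-4.5": (27.17, 30.74),
    "Qwen3-235B-Thinking": (29.97, 34.31),
    "GPT-4.1": (30.61, 34.57),
    "Claude Sonnet 4": (34.18, 36.73),
    "OpenAI o4-mini-high": (33.93, 37.37),
    "Gemini 2.5 Pro": (36.10, 38.78),
}
HUMAN_BASELINE = 83.50  # PhD expert baseline, percent


def multiagent_gains(scores):
    """Return {model: multiagent minus single-agent score} in percentage points."""
    return {m: round(multi - single, 2) for m, (single, multi) in scores.items()}


def gap_to_baseline(scores, baseline):
    """Return {model: baseline minus the model's better score} in percentage points."""
    return {m: round(baseline - max(pair), 2) for m, pair in scores.items()}


gains = multiagent_gains(PAPERARENA_SCORES)
gaps = gap_to_baseline(PAPERARENA_SCORES, HUMAN_BASELINE)
best_model = min(gaps, key=gaps.get)  # model closest to the expert baseline

if __name__ == "__main__":
    for model in PAPERARENA_SCORES:
        print(f"{model:22s} gain {gains[model]:5.2f} pp, gap to experts {gaps[model]:5.2f} pp")
    print("closest to expert baseline:", best_model)
```

Under that pairing, every model gains from the multiagent setup (between 2.55 and 5.36 percentage points), and the best configuration still sits 44.72 points below the expert baseline.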
Multiagent configurations consistently outperformed single-agent setups across all models tested, though the gains were modest, ranging from about 2.5 to 5.4 percentage points.

Figure 5.3.2: PaperArena: single vs. multiagent performance (average score)
Model | Single agent | Multiagent
Kimi-K2-Instruct | 22.32% | 26.28%
Qwen3-235B-Instruct | 23.47% | 28.83%
Claude 3.5 Sonnet | 24.62% | 29.34%
GLM-4.5 | 27.17% | 30.74%
Qwen3-235B-Thinking | 29.97% | 34.31%
GPT-4.1 | 30.61% | 34.57%
Claude Sonnet 4 | 34.18% | 36.73%
OpenAI o4-mini-high | 33.93% | 37.37%
Gemini 2.5 Pro | 36.10% | 38.78%
Human baseline (PhD experts): 83.50%
Source: Wang et al., 2026 | Chart: 2026 AI Index report

AI as a Co-scientist

In 2025, several research groups released systems in which multiple AI agents divide scientific tasks among themselves, with separate agents handling literature search, hypothesis generation, code execution, and review. The multiagent systems are designed to approximate the structure of a human research team, rather than relying on a single model or person to perform every step.

The most prominent example, Google’s AI co-scientist (Gottweis et al., 2025), uses a generate-debate-evolve loop in which agents iteratively produce and refine evidence-grounded hypotheses. The system was validated in three biomedical areas, including AML drug repurposing and liver fibrosis targets, and achieved a top-1 accuracy of 78.4% on the GPQA Diamond set when selecting its highest-rated hypothesis per question (Figure 5.3.3).

Figure 5.3.3. Source: Gottweis et al., 2025

Other multiagent systems pursued different approaches to the same goal. Sakana’s AI Scientist-v2 (Yamada et al., 2025) produced the first fully AI-generated paper accepted at a peer-reviewed workshop (ICLR), using agentic tree search to generate and refine code implementations without human-coded templates. Kosmos (Mitchener et al., 2025) maintained coherence across runs lasting up to 12 hours, executing an average of 42,000 lines of code and reading 1,500 papers per run, with collaborators reporting that a single run approximated six months of research. SciToolAgent (Ding et al., 2025) automates hundreds of scientific tools across biology, chemistry, and materials science via knowledge-graph-driven retrieval, outperforming prior agent frameworks by 10 to 20 percentage points on multi-tool tasks (Figure 5.3.4).

Figure 5.3.4. Source: Ding et al., 2025

Despite these advances, only a handful of multiagent systems have produced results that were tested and confirmed through real-world experiments. Published examples include new proteins designed by ProtAgents; 92 antibody candidates for SARS-CoV-2 from the Virtual Lab (of which more than 90% successfully bound their target); two new cancer drug targets, GPR160 and ARG2, from OriGene; five novel metal-organic frameworks from MOFGen; and a novel chromophore from ChemCrow. The gap between what these systems can propose computationally and what has been confirmed experimentally remains wide. Key roadblocks for the field include workforce training gaps, a lack of API and interoperability standards, and funding structures that do not yet support the maintenance and scaling of autonomous research infrastructure.

Chapter 6: Medicine

6.1 The Central Dogma

AI models for molecular biology span the pathway from gene sequence to protein structure to therapeutic design. This section tracks advances in protein language models, structure prediction, protein design, virtual cell models, and multimodal foundation models for biomedical discovery. The analysis draws on PubMed publication counts, benchmark evaluations including ProteinGym and FoldBench, and model release data from 2024 and 2025. A recurring pattern across these areas is the tension between scale and specialization. In several areas, smaller or more targeted models matched or outperformed larger general-purpose systems.

Research Volume

AI-driven protein research grew approximately 71% between 2024 and 2025 (Figure 6.1.1). Total publications across four categories (function prediction, protein structure prediction, protein-drug interactions, and synthetic protein design) rose from 2,259 in 2024 to 3,855 in 2025. Protein-drug interactions represented the largest share of output in both years, accounting for 49.9% of papers in 2024 and rising to 54.4% in 2025. Protein structure prediction constituted the second-largest category in 2024 at 28.7%, though its relative share declined to 23.9% in 2025. Function prediction and synthetic protein design each remained comparatively stable, with function prediction increasing from 9.7% to 10.4% and synthetic protein design decreasing slightly from 11.7% to 11.3%. The shift in relative share toward protein-drug interactions, even as absolute counts grew across all categories, may reflect maturing structure prediction methods and growing interest in therapeutics applications. Publications focused specifically on AI for drug discovery have followed a similar upward trajectory (Figure 6.1.2).
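The growth rate and category shares quoted above follow directly from the per-category publication counts shown in Figure 6.1.1. The short sketch below reproduces that arithmetic; the dictionary layout and the function names `yearly_totals` and `shares` are illustrative choices, not part of the report.

```python
# Publication counts per category as (2024, 2025), taken from Figure 6.1.1.
COUNTS = {
    "Function prediction": (220, 402),
    "Protein structure prediction": (648, 922),
    "Protein-drug interactions": (1127, 2097),
    "Synthetic protein design": (264, 434),
}


def yearly_totals(counts):
    """Sum the counts across categories for each year."""
    total_2024 = sum(pair[0] for pair in counts.values())
    total_2025 = sum(pair[1] for pair in counts.values())
    return total_2024, total_2025


def shares(counts, year_index):
    """Return each category's share of that year's total, in percent (1 decimal)."""
    total = sum(pair[year_index] for pair in counts.values())
    return {name: round(100 * pair[year_index] / total, 1) for name, pair in counts.items()}


total_2024, total_2025 = yearly_totals(COUNTS)
growth_pct = round(100 * (total_2025 - total_2024) / total_2024, 1)
shares_2024 = shares(COUNTS, 0)
shares_2025 = shares(COUNTS, 1)

if __name__ == "__main__":
    print(f"Totals: {total_2024} -> {total_2025} ({growth_pct}% growth)")
    for name in COUNTS:
        print(f"{name:30s} {shares_2024[name]:5.1f}% -> {shares_2025[name]:5.1f}%")
```

Note that every category grows in absolute terms even where its share falls: protein structure prediction rises from 648 to 922 papers while its share drops from 28.7% to 23.9%.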
Figure 6.1.1. Research volume: number of AI-driven protein research publications, 2024 vs. 2025
Research domain | 2024 | 2025
Function prediction | 220 | 402
Protein structure prediction | 648 | 922
Protein-drug interactions | 1,127 | 2,097
Synthetic protein design | 264 | 434
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Figure 6.1.2: Number of publications on AI for drug discovery, 2018–25
Year | Number of publications
2018 | 431
2019 | 612
2020 | 720
2021 | 1,083
2022 | 1,229
2023 | 1,470
2024 | 2,100
2025 | 3,311
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Public Datasets

Molecular and Cellular Biology

Demand for training data has continued to grow as AI models have gained further adoption in biology. Collecting new biological data is typically time-consuming and expensive. In 2025, biological AI models increasingly trained on multiple datasets with different types of experimental measurements. Several cofolding methods (where two or more molecules are modeled simultaneously), for example, began training on both structural data from the Protein Data Bank (PDB) and experimental small-molecule binding affinity measurements from repositories such as PubChem, ChEMBL, and BindingDB.

The scale of publicly available biological datasets has grown by several orders of magnitude since the PDB’s founding in 1971, though direct size comparisons across databases should be interpreted with caution. The unit of measurement differs by source: An “entry” may represent an experimentally solved protein structure, a bioactivity measurement, a gene sequence, or a single-cell observation.

Other models are trained on synthetically generated data. AlphaFold 3 and its open-source replications, including Boltz-2 and OpenFold3, all train on “self-distillation” datasets of predicted protein structures from AlphaFold 2 that have been filtered for quality. Meta FAIR released Open Molecules 2025 (OMol25), a dataset of over 100 million quantum mechanics calculations of molecules. New experimental datasets also debuted in 2025. Tahoe-100M, the largest publicly released single-cell sequencing dataset, contains measurements from over 50 cancer cell types exposed to more than 1,100 drugs. BaseData features over 9.8 billion genes obtained through metagenomic mining (Figure 6.1.3).

Figure 6.1.3: Size of public datasets for molecular and cellular biology. Number of entries (log scale, 100K to 10B) by dataset (release year): PDB (1971), InterPro/Pfam (1995), BindingDB (1995), STRING (2000), UniProt (2002), PDBbind (2004), PubChem (2004), ChEMBL (2009), AlphaFold DB (2021), OMol25 (2025), Tahoe-100M (2025), BaseData (2025). Categories: structural, sequence/function, binding/bioactivity, predicted/simulated, single-cell, metagenomic.
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Data for Biomedical Vision-Language Models

Training biomedical vision-language models requires large repositories of images and captions that can be transformed into a continual pretraining dataset. In the general domain, data scaling is often considered a mature or saturated research direction, but this does not appear to hold in the biomedical setting. Newer datasets extend beyond a single specialty and incorporate a broader range of modalities and biomedical domains (Figure 6.1.4).

Figure 6.1.4: Size of select biomedical datasets at release, 2020–25. Datasets shown: MEDICAT, PMC-OA, PMC-15M, BIOMEDICA (labeled 24M). X-axis: 2020, 2022, 2024, 2025. Y-axis: number of publications (0 to 6M).
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Sequence-Based Models: Protein Language Models

The trend in protein language models (PLMs) shifted in 2025 from scaling model size to improving model efficiency and specialization. In 2024, efforts culminated in the 98-billion-parameter ESM3. In 2025, the focus turned to smaller architectures trained on curated data or augmented with retrieval methods (Figure 6.1.5).

ProteinGym is a comprehensive benchmark suite for protein fitness prediction and design, comprising over 250 standardized deep mutational scanning assays with millions of mutated sequences and curated clinical datasets with expert-annotated mutation effects. MSA Pairformer, a 111-million-parameter model trained on multiple sequence alignments and physical constraints, surpassed previous state-of-the-art methods at a fraction of the training and parameter budget (Figure 6.1.6). The Profluent E1 series similarly set new performance standards by combining a smaller model with a retrieval-augmented generation (RAG) approach. While certain tasks still benefit from larger models, others, such as predicting cellular localization, appear to vary more with training method and data than with parameter count.

Figure 6.1.5: Size of protein sequencing models, 2020–25 (x-axis years: 2020, 2022, 2023, 2024, 2025)
Model | Number of parameters (in billions)
ProGen | 1.20
ProtBert | 0.42
ProGen2 | 6.40
ProtT5 | 1.20
ESM2 | 15
OpenCRISPR-1 (proseLM) | 6.40
ESM3 | 98
SHIVER | 0.04
PoET-2 | 0.18
Proteina | 0.40
ProGen3 | 46
MSA-Pairformer | 0.11
E1 | 0.60
MSA-Transformer | 0.10
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Figure 6.1.6: Performance of protein language models on ProteinGym, 2021–25 (x-axis years: 2021, 2022, 2023, 2024, 2025)
Model (parameters) | Avg. correlation (Spearman)
MSA-Transformer (100M) | 0.45
ESM-2 (150M) | 0.39
ESM-2 (3B) | 0.41
ESM-2 (650M) | 0.41
PoET (200M) | 0.47
ESMC (600M) | 0.41
ESMC (300M) | 0.41
E1 (Single Sequence) (150M) | 0.40
E1 (Single Sequence) (300M) | 0.42
E1 (Single Sequence) (600M) | 0.42
MSA Pairformer (111M) | 0.45
E1 (Retrieval Augmented) (150M) | 0.47
E1 (Retrieval Augmented) (300M) | 0.47
E1 (Retrieval Augmented) (600M) | 0.48
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Beyond benchmark performance, PLMs have also become more task-specific. The ESM-C series demonstrated that smaller models geared toward a single task, such as representation learning, could be successful without the full feature set of the ESM3 family. A fine-tuned ProGen model (6 billion parameters) was used to design a novel CRISPR-Cas protein, OpenCRISPR-1, which demonstrated improved specificity relative to standard SpCas9.

Structure Prediction and Cofolding Models

Multiple open-source structure prediction models were released in 2025, inspired by the architecture of AlphaFold 3. These models tackle the task of “cofolding”: predicting the three-dimensional structures formed by combinations of proteins, nucleic acids, drugs, and other biomolecules. While AlphaFold 3 retains a performance advantage on certain tasks, most cofolding models have demonstrated similar performance across protein structure and biomolecular complex modeling tasks. Some, including the Boltz series and OpenFold3, are released under commercially permissive licenses.

Because cofolding models can now represent all structure types available in the PDB, further performance gains will likely require new data sources or deeper extraction of signal from existing data. One strategy is the use of “distilled” datasets, in which AI-predicted protein structures supplement experimentally determined ones, scaling training datasets from hundreds of thousands of entries to tens of millions. Boltz-2 illustrates a complementary approach, training on both structural information from the PDB and experimental binding affinity measurements, to combine specialization and scale (Figure 6.1.7). The four leading cofolding models share a common foundation of experimental and distilled structural data, but they diverge in their use of additional sources such as molecular dynamics simulations, binding affinity measurements, and RNA structure.

Figure 6.1.7: Training data sources for cofolding models, 2025. Rows (models): AlphaFold 3, Boltz-2, SimpleFold, OpenFold3. Columns (data types): structural data (experimental), structural data (distilled), molecular dynamics, binding affinity, RNA structure.
Footnote 1: Rows represent individual cofolding models. Columns indicate whether each model incorporated a given data type during training, including experimentally determined and distilled protein structures, molecular dynamics simulations, binding affinity measurements, and RNA structural data. A check mark indicates that the data type was used.

Similar to trends in other areas of AI, bigger models have not necessarily translated to better performance in protein structure prediction. Following the release of AlphaFold 3, subsequent models have converged on a similar parameter scale rather than continuing to grow (Figure 6.1.8). FoldBench is a benchmark that tests whether a model can correctly predict how a small molecule, such as a drug candidate, physically binds to a target protein. AlphaFold 3’s performance on FoldBench has yet to be significantly surpassed even though several larger models have been released since (Figure 6.1.9). These results suggest that data, rather than model size, is an important bottleneck in protein structure prediction.

Figure 6.1.8: Size of protein structure prediction models, 2021–25 (x-axis years: 2021, 2022, 2024, 2025)
Model | Number of parameters
AlphaFold 2 | 93M
RoseTTAFold | 14M
OpenFold | 93M
RoseTTAFold All Atom | 83M
AlphaFold 3 | 370M
ESM3 | 98B
HelixFold3 | 370M
Chai-1 | 440M
Boltz-1 | 610M
Boltz-2 | 521M
Protenix-Tiny | 109M
Protenix-Mini | 135M
SimpleFold | 3B
RosettaFold-3 | 730M
Protenix v0.7 | 370M
SeedFold | 923M
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Figure 6.1.9: FoldBench: protein cofolding performance, 2024–25
Model (parameters) | Year | Accuracy
Chai-1 (440M) | 2024 | 51.23%
HelixFold3 (370M) | 2024 | 51.82%
Boltz-1 (610M) | 2024 | 55.04%
AlphaFold 3 (370M) | 2024 | 64.90%
Protenix v0.7 (368M) | 2025 | 50.70%
Boltz-2 (521M) | 2025 | 53.90%
RosettaFold-3 (730M) | 2025 | 57.28%
SeedFold (923M) | 2025 | 63.10%
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Protein Design and Generative Models for Therapeutics

Advances in cofolding have enabled a new generation of generative models for protein design, including methods for designing antibodies, nanobodies, and peptides. Approaches range from workflows built around existing structure prediction methods (e.g., BindCraft, Germinal) to models directly trained for generating binders (e.g., RFDiffusion, BoltzGen). A protein design challenge hosted by Adaptyv Bio in 2025 provided a controlled comparison. Multiple methods were tested on the task of designing a binder targeting Nipah virus. Of the thousand-plus designs tested, 99 proteins were confirmed to bind, and none neutralized the targeted protein. The specialized method Mosaic, which combines multiple tools with expert tuning, outperformed general-purpose approaches (Figure 6.1.10).

Figure 6.1.10: Protein design success rates in Adaptyv Nipah Binder challenge
Model (total designs tested) | Expressed | Bound
Overall (1,026) | 85.87% | 9.65%
BoltzGen (107) | 98.13% | 1.87%
BindCraft (27) | 88.89% | 7.41%
RFDiffusion (17) | 88.24% | 17.65%
Germinal (14) | 28.57% | 0%
Mosaic (9) | 100% | 88.89%
Source: Adaptyv Nipah Binder challenge, 2026 | Chart: 2026 AI Index report

Virtual Cell Models and Genomic Foundation Models

Research on “virtual cell” models (AI systems that model cellular states and responses to stimuli) increased substantially in 2025, as reflected in growing PubMed publication counts (Figure 6.1.11). Notable releases included Evo 2, a DNA language model from the Arc Institute; STATE, a perturbation-response model; and AlphaGenome, a multimodal model from DeepMind. However, current virtual cell and genomic foundation models still lag behind smaller, task-specific models on several benchmarks. GPN-Star, a 200-million-parameter model focused on functional and regulatory genomics, outperformed Evo 2 (40B parameters) on multiple variant effect prediction tasks (Figure 6.1.12). These results suggest that scale alone is not yet sufficient, and that training method and data curation remain important determinants of performance.

Figure 6.1.11: Number of publications on virtual cell models, 2018–25
Year | Number of publications
2018 | 6
2019 | 9
2020 | 7
2021 | 14
2022 | 10
2023 | 14
2024 | 16
2025 | 24
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Figure 6.1.12: Virtual cell model performance (x-axis years: 2021, 2023, 2025)
Model (parameters) | Performance (AUPRC)
Enformer (250M) | 0.38
Borzoi (190M) | 0.40
Evo 2 (40B) | 0.53
GPN-Star (200M) | 0.75
Source: Ye et al., 2025 | Chart: 2026 AI Index report

Multimodal Foundation Models for Biomedical Discovery

Scientific publications on multimodal foundation models for biomedical discovery have been growing rapidly since 2021 (Figure 6.1.13). While several subfields within multimodal biomedical AI gained traction in 2025, two areas have been especially impactful: Vision-language models pair medical or biological images with text, while vision-omics models integrate imaging with genomic or transcriptomic data.

Figure 6.1.13: Number of publications on multimodal biomedical AI, 2021–25
Year | Number of publications
2021 | 2
2022 | 16
2023 | 171
2024 | 314
2025 | 462
Source: RAISE Health, 2026 | Chart: 2026 AI Index report

Highlight: Automated and Agentic Biomedical Discovery

In 2025, efforts to automate scientific discovery focused on integrating digital reasoning with physical laboratory validation. Robin, an automated discovery framework, linked literature-based hypothesis generation with experimental data analysis, identifying the ROCK inhibitor ripasudil as a novel candidate for dry age-related macular degeneration. STELLA, an autonomous bioinformatics agent, expanded its own technical capabilities by discovering and integrating new software tools rather than relying on manually curated toolsets. Biomni, a general-purpose biomedical AI agent developed at Stanford University, mapped a unified biomedical action space across 25 subfields, integrating 150 specialized tools, 105 software packages, and 59 databases.

Collaborative multiagent frameworks also emerged. Agent Laboratory, developed by AMD and Johns Hopkins, assigns distinct roles to PhD, Postdoc, and ML Engineer agents within a simulated lab structure. The Virtual Lab uses an LLM Principal Investigator to orchestrate specialized scientist agents, producing 92 novel nanobody binder designs for SARS-CoV-2. These systems are part of an early-stage trend toward multiagent coordination in biomedical research, though their outputs still require experimental validation.

6.2 Clinical Applications

The molecular and cellular AI advances described in section 6.1 provide the upstream models on which clinical tools increasingly depend. This section tracks how AI is being applied in clinical settings, from medical imaging and diagnostic reasoning to workflow integration, regulatory authorization, and enterprise-scale deployment. The analysis draws on prospective trial counts, FDA device authorization data, benchmark evaluations, and published deployment outcomes from health systems. Across these areas, strong benchmark results have yet to translate reliably to measurable clinical outcomes.

Imaging Data Scale and Availability

Training data for medical imaging AI remains several hundred to more than a thousand times smaller in raw sample count than for nonmedical AI (Figure 6.2.1). MAIRA-2, a radiology-focused model, was trained on approximately 1.4 million chest radiographs, whereas DINOv3, a general-purpose vision transformer, was trained on 1.7 billion unlabeled natural images. On the multimodal side, RadFM was trained on approximately 16 million mixed 2D and 3D medical scans paired with clinical text, while OpenCLIP was trained on LAION-5B, comprising approximately 5.85 billion image–text pairs. Data scarcity is especially pronounced for three-dimensional modalities such as CT and MRI, and fragmentation across institutions further limits the development of large-scale medical foundation models.

[Figure 6.2.1: Training data volume in medical and nonmedical AI: imaging-only vs. multimodal models. Panels: imaging-only (MAIRA-2, medical; DINOv3, nonmedical) and multimodal (RadFM, medical; OpenCLIP, nonmedical). Axis: data volume (log scale, 1M to 10B). Source: RAISE Health, 2026 | Chart: 2026 AI Index report]

Modeling Approaches

Vision-language models (VLMs) for medical imaging have expanded beyond radiology into pathology, dermatology, ophthalmology, and cardiology (Figure 6.2.2). Across six clinical disciplines, the number of research models and FDA-cleared commercial products grew, with pathology seeing the greatest concentration of new research releases. The Merlin model demonstrated that a highly capable CT foundation model could be trained on a single 40GB GPU by leveraging both radiology reports and ICD codes during training, making advanced medical AI accessible even in resource-constrained settings.

Medical imaging AI lacks the standardized cross-model benchmarks common in general-domain machine learning. Models in different specialties are typically evaluated on different datasets, making direct performance comparisons across disciplines difficult. Recent MICCAI 2025 challenges (CHIMERA, UNICORN) represent early efforts to address this gap. Human-centered evaluation, where clinicians manually review model outputs, has become more prevalent in publications and provides stronger evidence of clinical utility than lexical metrics.

Figure 6.2.2: Notable releases and analogous FDA-cleared models by discipline

Cardiology. Notable releases: EchoJEPA (2026), PanEcho (2025), EchoFM (2025), EchoPrime (2025). Analogous FDA-cleared models: Bunkerhill ECG-EF, Heartflow Plaque Analysis.
Oncology. Notable releases: MUSK (2025). Analogous FDA-cleared models: Allix5, Clairity (2025); Transpara (2.1.0) (2024).
Ophthalmology. Notable releases: EyeCLIP (2025), Meta-EyeFM (2025), RETFound-Green (2025). Analogous FDA-cleared models: CLARUS (700), Carl Zeiss (2025).
Radiology. Notable releases: MedGemma 1.5 (2026), COLIPRI (2026), TTE 3D CT (2025), 3DINO-ViT (2025), CT-FM (2025), RadFM (2025), Merlin (2025). Analogous FDA-cleared models: BriefCase Triage: CARE Multitriage CT Body (2026), a2z-Unified-Triage (2025), Bunkerhill BMD (2025), Bunkerhill AAQ (2025), Brainomix 360 Triage Stroke (2025), Ezra Flash (2025).
Pathology. Notable releases: Virchow2G (2026), KRONOS (2025), VORTEX (2025), Threads (2025), mSTAR (2025), PRISM2 (2025), MPath (2025), H-Optimus-0 (2024). Analogous FDA-cleared models: ArteraAI Prostate (2025), Ibex Prostate Detect (2025).

The number of prospective trials validating medical imaging AI models grew by 28.5% year over year, from 417 in 2024 to 536 in 2025 (Figure 6.2.3). This growth signals that the field is moving beyond retrospective studies, which evaluate AI on past patient data, toward live clinical trials, the kind of evidence required before hospitals will adopt these tools. Recent trials include MASAI, a randomized screening accuracy study of AI-assisted mammography, and NOTIFY-1 and NOTIFY-EXTEND, which tested whether flagging early signs of heart disease that AI spotted on routine CT scans led doctors to prescribe more preventive cholesterol medication.
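The year-over-year figure quoted above follows from the two trial counts. A minimal sketch of that arithmetic is shown here; the function name `yoy_growth_pct` is an illustrative choice, and only the two counts (417 and 536) come from the text.

```python
def yoy_growth_pct(previous: float, current: float) -> float:
    """Return year-over-year growth as a percentage of the previous value."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100.0


# Counts of papers reporting prospective trials, as quoted in the text.
trials_2024, trials_2025 = 417, 536
growth = yoy_growth_pct(trials_2024, trials_2025)
print(f"{growth:.1f}%")  # 28.5%
```

The same helper applies to any pair of adjacent annual counts in this section, provided the earlier count is nonzero.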
Prospective Clinical Trials

[Figure 6.2.3: Number of papers reporting prospective trials of clinical imaging ML/AI models, 2010–25. Values by labeled year: 2011: 2; 2012: 3; 2013: 3; 2014: 6; 2015: 5; 2016: 6; 2017: 14; 2018: 34; 2019: 58; 2020: 99; 2021: 160; 2022: 203; 2023: 282; 2024: 417; 2025: 536. Source: RAISE Health, 2026 | Chart: 2026 AI Index report]

Highlight: LLM Clinical Reasoning Performance

In a multi-experiment evaluation, OpenAI’s o1-preview reasoning model was tested on diagnostic reasoning tasks, management reasoning vignettes, probabilistic reasoning scenarios, and real emergency department (ED) cases with blinded expert scoring (Brodeur et al., 2025). On New England Journal of Medicine (NEJM) clinicopathological conferences (n=143), the model included the correct diagnosis in its differential 78% of the time, with 52% top-1 accuracy. On NEJM Healer cases (80 responses), it achieved a perfect revised-IDEA score in 78 of 80, compared with 47 of 80 for GPT-4, 28 of 80 for attending physicians, and 16 of 80 for residents. On management reasoning, o1-preview’s median score was 86%, versus 42% for GPT-4 alone, 41% for physicians with access to GPT-4, and 34% for physicians with conventional resources (Figure 6.2.4). In 76 real ED cases, o1-preview produced diagnoses rated “exact/very close” in 67%–83% of cases across three diagnostic stages, surpassing two attending physicians at each stage.

These results suggest that current LLMs have surpassed most existing clinical reasoning benchmarks, but they reflect isolated cognitive evaluations rather than real-world clinical integration. Whether AI-assisted reasoning translates to improved patient outcomes remains an open question requiring prospective trials.

[Figure 6.2.4: Comparison of o1-preview, GPT-4, and physicians for management reasoning. Groups: o1-preview only; GPT-4 only; physicians + GPT-4; physicians + resources. Axis: score (0–100). Source: Brodeur et al., 2025 | Chart: 2026 AI Index report. Note: Box plot of normalized management reasoning points by LLMs and physicians on Gray Matters management cases. Five cases were included. Three o1-preview responses were generated for each case. The prior study collected five GPT-4 responses to each case, 176 responses from physicians with access to GPT-4, and 199 responses from physicians with access to conventional resources.]

Highlight: AI Agents in Clinical Medicine

Autonomous and semiautonomous AI agents have emerged as a major development in clinical AI in 2025–26. Unlike conventional AI models that generate predictions or classifications in isolation, these systems reason across multiple steps, access external tools and data sources, and coordinate with other AI agents or human clinicians to complete complex clinical tasks.

Multiagent frameworks, in which multiple AI agents take on specialized roles (such as diagnostician or pharmacist) and collaborate through structured reasoning protocols, have shown early promise on benchmark evaluations. Diagnostic accuracy gains over single-agent baselines ranged from 7% to over 60%, depending on the complexity of the clinical task (Gorenshtein et al., 2025; Zheng et al., 2025; Liu et al., 2025). Microsoft’s AI Diagnostic Orchestrator (MAI-DxO), paired with OpenAI’s o3 reasoning model, achieved 85.5% accuracy on diagnostically challenging cases from the New England Journal of Medicine, compared with approximately 20% among 21 practicing physicians with five to 20 years of clinical experience working under comparable conditions.

A new set of benchmarks specifically designed to evaluate these agentic systems has started to appear. A 2025 scoping review identified 43 studies evaluating agentic AI in healthcare, 36 of which (84%) were published in 2025. On MedAgentBench (Jiang et al., 2025), which evaluates LLM agents in a virtual electronic health record (EHR) environment across 300 clinically derived tasks, the best-performing model achieved a task success rate of 69.7%. These results suggest that, despite access to advanced capabilities such as tool use and iterative reasoning, the evidence base for reliable autonomous clinical AI agents remains early-stage.

Deployment, Implementation, and Deimplementation

This section tracks the regulatory, institutional, and evidentiary dimensions of clinical AI deployment, from FDA device authorizations to enterprise-scale outcomes.

FDA-Authorized AI/ML-Enabled Devices

In the United States, FDA 510(k) is the most common regulatory pathway for AI medical devices, requiring manufacturers to demonstrate that a new device is substantially equivalent to one already on the market rather than conducting new clinical trials. The number of 510(k)-cleared AI/ML-related devices reached 246 in 2025, continuing a steep upward trajectory that began with 16 devices in 2016 (Figure 6.2.5). Because most cleared radiology AI solutions are offered commercially rather than as open-source tools, healthcare systems are typically required to complete financial clearance and cost-effectiveness justification before implementation. Comp2Comp, a notable exception, is an open-source Python package for CT imaging analysis with two modules (bone mineral density and abdominal aortic quantification) that secured FDA clearance in 2025.

[Figure 6.2.5: FDA 510(k)-cleared AI/ML-enabled imaging-related medical devices, 2011–25. Series: Radiology, Pathology, Orthopedic, Ophthalmic, Neurology, Gastroenterology-Urology, Cardiovascular. Axis: number of cleared devices. Annual totals as labeled: 2011: 1; 2012: 3; 2013: 1; 2014: 6; 2015: 3; 2016: 16; 2017: 23; 2018: 58; 2019: 72; 2020: 109; 2021: 130; 2022: 157; 2023: 218; 2024: 222; 2025: 246. Source: FDA, 2025 | Chart: 2026 AI Index report]

By December 2025, the FDA had authorized a total of 1,357 AI/ML-enabled medical devices from 693 different companies across 17 clinical specialties (Figure 6.2.6). Annual authorizations reached 258 through September 2025, already surpassing all prior full-year totals. The cumulative total crossed the 1,000-device milestone in 2024. Ninety-eight new companies entered the space in 2025, continuing a trend of broadening market participation (103 new entrants in 2023, 109 in 2024).

[Figure 6.2.6: Number of AI medical devices approved by the FDA, 1995–2025. Axes: number of AI medical devices (annual) and cumulative total. Source: FDA, 2025 | Chart: 2026 AI Index report]

Radiology accounts for the largest share of authorized AI/ML devices at 1,039 of 1,357 (76.6%), followed by cardiovascular (130 devices, 9.6%) and neurology (61 devices, 4.5%) (Figure 6.2.7). Non-radiology authorizations have increased from 7 in 2016 to 60 in 2025 (Figure 6.2.8).
Cardiology, neurology, anesthesiology, and gastroenterology-urology have all seen acceleration since 2020, suggesting that AI is beginning to spread from imaging-centric applications to broader clinical domains.

Devices by Clinical Specialty

[Figure 6.2.7: Number of AI medical devices approved by the FDA by specialty, 1995–2025 (sum). Values: Radiology: 1,039; Cardiovascular: 130; Neurology: 61; Anesthesiology: 23; Gastroenterology-Urology: 21; Hematology: 20; Ophthalmic: 10; Clinical Chemistry: 9; Pathology: 8; Microbiology: 6; General and Plastic Surgery: 6; Clinical Toxicology: 5; Dental: 5; Orthopedic: 5; General Hospital: 4; Obstetrics and Gynecology: 4; Immunology: 1. Source: FDA, 2025 | Chart: 2026 AI Index report]

[Figure 6.2.8: Number of AI/ML medical devices approved by the FDA by specialty, 2016–25. Series: Radiology, Cardiovascular, Neurology, Anesthesiology, Hematology, Gastroenterology-Urology, Other. Axis: number of AI/ML medical devices. Source: FDA, 2025 | Chart: 2026 AI Index report]

In January 2025, the FDA issued draft guidance on AI-enabled device software functions applying a Total Product Life Cycle approach. Predetermined Change Control Plans, a mechanism that permits iterative updates after initial market authorization, were used in approximately 10% of 2025 clearances. Despite this growth, a peer-reviewed analysis of all 1,016 authorizations through December 2024 (Singh et al., 2025) found that only 2.4% of devices with clinical studies were supported by randomized controlled trial data, with nearly all devices entering via the 510(k) pathway.

FDA clearance does not equal clinical adoption. Financial, operational, and institutional barriers often stand between regulatory authorization and real-world deployment.

Industry Landscape

The authorized device market is concentrated at the top but fragmented overall. GE Healthcare leads with 93 devices, followed by Siemens Healthineers (82), Shanghai United Imaging Healthcare (38), Philips Healthcare (36), Canon Medical Systems (35), and Aidoc Medical (30) (Figure 6.2.9). Of the 626 companies with at least one authorized device, the large majority hold only one or two, reflecting a broad ecosystem of specialized entrants alongside established manufacturers.

[Figure 6.2.9: Number of AI/ML medical devices approved by the FDA by top companies, 2016–25. Values: GE Healthcare: 93; Siemens Healthineers: 82; Shanghai United Imaging Healthcare: 38; Philips Healthcare: 36; Canon Medical Systems Corporation: 35; Aidoc Medical, Ltd.: 30; Samsung: 20; iSchemaView: 20; Hyperfine: 12; Viz.ai: 12; Clarius Mobile Health Corp.: 11; Brainlab: 10; Zebra Medical Vision: 9; RaySearch Laboratories AB (publ): 8; Qure.ai Technologies: 8. Source: FDA, 2025 | Chart: 2026 AI Index report]

Enterprise-Scale Deployments in 2025

Clinical AI moved from pilot-stage initiatives to enterprise-scale deployments in 2025, with health systems reporting measurable outcomes across clinical and operational domains. The largest bodies of published evidence came from ambient AI documentation, AI-powered sepsis prediction, and generative AI integration into clinical workflows.

Ambient AI Documentation

Ambient AI scribes, tools that automatically generate clinical documentation from patient–clinician conversations, saw the broadest adoption of any clinical AI category in 2025. Abridge, one of the leading platforms, expanded from approximately 100 to over 150 health systems, including Kaiser Permanente’s deployment across 40 hospitals and more than 600 medical offices.
Adoption reached 63% among hospitals using Epic’s electronic health record system. Outcomes were consistent across multiple institutions. Sharp HealthCare reported an 83% reduction in note-writing effort and a 3.5%–6% increase per encounter in work relative value units, a standard measure of physician clinical productivity. The University of Chicago Medicine reported a 47% reduction in cognitive load and a 58% increase in undivided patient attention. MaineHealth reported a 23% reduction in time spent on clinical notes, with the tool used in 70.3% of encounters. At Northwestern Medicine, physicians using the tool in more than half of encounters saw 11.3 additional patients per month and a 24% reduction in documentation time, with a reported 112% return on investment. At Stanford Health Care, a prospective study of 48 physicians published in JAMIA (February 2025) found statistically significant reductions in task load and burnout, with physicians reporting a median time savings of 20 minutes per half day of clinic.

AI-Powered Sepsis Prediction

Two sepsis prediction systems reported mortality reductions in large-scale deployments in 2025. The Targeted Real-time Early Warning System, developed at Johns Hopkins and commercialized by Bayesian Health, was deployed across 13 Cleveland Clinic hospitals. Reported outcomes included an 18.7% relative reduction in sepsis mortality, a 1.85-hour reduction in median time to first antibiotic order, the correct identification of 82% of sepsis cases, an 89% clinician adoption rate, and a 10% reduction in intensive care unit utilization. COMPOSER, a deep learning model at UC San Diego Health monitoring over 150 variables per patient, reported a 17% reduction in sepsis mortality (1.9% absolute) across 6,217 admissions, a 5% increase in sepsis bundle compliance, and an estimated 50 lives saved annually.

Generative AI in Clinical Workflows

Health systems began embedding LLM-powered tools directly into electronic health records. ChatEHR, a system generating plain-language summaries of patient records, logged 23,000 sessions across 1,075 trained users within three months of broad rollout. Sixty percent of usage occurred through automated prompts and 40% through interactive interfaces. Separately, an AI tool generating plain-language explanations of laboratory, imaging, and pathology results was evaluated in a study published in JAMA Network Open (August 2025). Of 93 survey respondents, 85% considered the tool user-friendly, 72% found it beneficial for laboratory results, and 63% for imaging results. OpenEvidence, a real-time evidence retrieval platform, reported adoption by 40% of U.S. physicians.

Evidence Gaps and Governance

The inaugural State of Clinical AI Report (January 2026), published by the Stanford-Harvard ARISE Network, reviewed over 500 clinical AI studies and found that nearly half used exam-style questions rather than real patient data. Only 5% used real clinical data. The report concluded that AI performs most effectively when supporting rather than replacing clinician judgment. Separately, the NOHARM benchmark found that leading LLMs produced 11.8 to 14.6 severely harmful recommendations per 100 clinical cases, with 76.6% being errors of omission (e.g., failing to recommend a critical test). These findings apply to general-purpose LLMs evaluated on open-ended clinical reasoning tasks, not to the narrower, task-specific tools driving current adoption. Ambient scribes and sepsis alerts, for example, operate within constrained workflows with clinician oversight.

Governance frameworks have also advanced. Stanford Health Care’s FURM framework now governs all new AI tool adoptions at that institution, and the GUIDE-AI Lab is working to make the framework available to other health systems.

Highlight: Digital Twins in Medicine

A medical digital twin is a dynamic, data-linked computational representation of an individual patient that updates over time and supports forecasting, simulation, and treatment optimization. Research activity in this area has grown rapidly, with publication counts rising from near 0 in 2015 to 372 in 2025 (Figure 6.2.10). Patent filings in healthcare digital twins (CPC class G16H) tell a similar story, with filings increasing from 30 in 2016 to 4,926 in 2024 (Figure 6.2.11).

However, conceptual clarity has not kept pace with publication growth. A 2025 scoping review in npj Digital Medicine assessed 149 human digital twin studies published between 2017 and 2024 and found that only 12.1% (18 studies) satisfied the National Academies of Sciences, Engineering, and Medicine (NASEM) definition of a digital twin. That definition requires three elements: personalization, dynamic updating, and predictive capability (Sadée et al., 2025). Only 19% of systems were tested in real healthcare environments.

Clinical trials incorporating digital twin elements accelerated in 2025, particularly in oncology and diabetes. A pilot trial in prostate cancer using adaptive therapy concluded in 2025 with significantly increased survival (Zhang et al., 2022). New trials extended the approach to breast cancer (Mayo Clinic phase II) and ovarian cancer (ACTOv phase II RCT, n=80). For diabetes, a randomized controlled trial (n=150) of Twin Health’s Whole Body Digital Twin platform found that 71% of participants achieved an HbA1c below 6.5% within twelve months, while safely reducing their intake of blood sugar–lowering medications.

[Figure 6.2.10: Number of publications on medical digital twins, 2015–25. Values by year: 2015: 0; 2016: 0; 2017: 0; 2018: 1; 2019: 6; 2020: 16; 2021: 47; 2022: 55; 2023: 116; 2024: 176; 2025: 372. Source: RAISE Health, 2026 | Chart: 2026 AI Index report]

[Figure 6.2.11: Number of observed patent filings on medical digital twins, 2015–25. Values by year: 2015: 0; 2016: 30; 2017: 106; 2018: 292; 2019: 478; 2020: 667; 2021: 1,159; 2022: 2,395; 2023: 3,878; 2024: 4,926; 2025: 4,087. Source: RAISE Health, 2026 | Chart: 2026 AI Index report. Note: The bar in 2025 appears lower than in 2024 because not all patents filed in 2025 have been published or become publicly available yet.]
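The three-element NASEM test cited in the digital twin highlight can be expressed as a simple screening rule. The sketch below is an assumption-laden illustration, not the scoping review's actual coding procedure: the `TwinStudy` fields and both function names are hypothetical, and the three example studies are invented. Only the counts 18 and 149 are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TwinStudy:
    """Hypothetical record of whether a study shows each NASEM element."""
    personalized: bool
    dynamically_updated: bool
    predictive: bool


def meets_nasem_definition(study: TwinStudy) -> bool:
    """A study qualifies only if all three elements are present."""
    return study.personalized and study.dynamically_updated and study.predictive


def share_meeting_definition(studies: Iterable[TwinStudy]) -> float:
    """Percentage of studies that satisfy all three elements."""
    studies = list(studies)
    if not studies:
        raise ValueError("no studies supplied")
    return sum(meets_nasem_definition(s) for s in studies) / len(studies) * 100.0


# Invented examples: only the first has all three elements.
examples = [
    TwinStudy(True, True, True),
    TwinStudy(True, False, True),
    TwinStudy(False, False, True),
]
print([meets_nasem_definition(s) for s in examples])  # [True, False, False]

# The share reported in the text: 18 of 149 studies.
reported_share = 18 / 149 * 100.0
print(f"{reported_share:.1f}%")  # 12.1%
```

Requiring all three flags at once is what makes the definition strict: a personalized predictive model that is never updated with new patient data would be counted as a conventional model, not a digital twin.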

6.3 Patient Engagement

As patients interact more with AI tools, both through clinical workflows and consumer-facing platforms, efforts have been made to understand how they perceive these technologies. This section examines AI-generated health search results, patient attitudes toward AI in healthcare, and the emerging evidence base for patient-facing AI tools.

AI Overviews for Health-Related Searches

AI-generated summary responses, referred to by Google as “AI Overviews,” now appear at the top of most health-related search results. Depending on query type, 84%–92% of health-related queries triggered an AI Overview across five primary query types (Figure 6.3.1). Symptom and common health questions were the most likely to trigger an overview (92%), followed by treatment-related queries (90%) and condition-based queries (84%–88%). AI-generated summaries are now a routine feature of health information searches, shaping the initial interpretation of questions posed by most users.

[Figure 6.3.1: Share of health search queries returning an “AI Overview.” Values: condition queries (e.g., “dementia”): 84%; condition question queries (e.g., “what is dementia?”): 88%; treatment queries (e.g., “dementia treatment”): 90%; symptom queries (e.g., “dementia symptoms”): 92%; common health queries (e.g., “does pus mean it is infected?”): 92%. Source: RAISE Health, 2026 | Chart: 2026 AI Index report]

Patient Perspectives on AI in Healthcare

Publication volume on the patient perspective of AI in healthcare grew more than tenfold between 2020 and 2025 (Figure 6.3.2). Conditional acceptance emerged as a prevalent perspective across the literature. Patients tended to endorse AI in assistive roles rather than autonomous decision-making, particularly in high-stakes clinical contexts (Fee et al., 2025; Allen et al., 2025; Hmido et al., 2025). Demographic disparities in acceptance (patterned by age, gender, education, and race) were documented across multiple studies (Labinsky et al., 2025; Ogu et al., 2025; Li et al., 2025). Preservation of the human relationship emerged as a consistent theme, with patients identifying the potential loss of empathic care as a primary concern (Carl et al., 2025; Davis et al., 2025). Trust in AI appeared to be clinician-mediated rather than technology-evaluated. Provider endorsement functioned as a key determinant of patient acceptance (Berger et al., 2025; Machado et al., 2025; Nong et al., 2025). Transparency and disclosure of AI use were similarly prioritized across populations, and emerging disclosure frameworks offer practical guidance for clinical settings (Figure 6.3.3).

[Figure 6.3.2: Number of publications on patient perceptions of AI in healthcare, 2020–25. Values by year: 2020: 9; 2021: 11; 2022: 6; 2023: 28; 2024: 48; 2025: 102. Source: RAISE Health, 2026 | Chart: 2026 AI Index report]

Figure 6.3.3: Artificial intelligence use cases and recommendations for patient notification (Source: Mello et al., 2025)

Illustrative AI use cases and patient notification framework: a two-question decision framework for determining when patient consent, notification, or neither is required.

Question 1: Does use of this AI tool, or inaccuracy in its output, carry a risk of harm to the patient?
Question 2: Does the patient have an opportunity to express agency in response to disclosure of AI use?

Possible outcomes: consent (required before use); notification (inform the patient that AI was used); neither (no consent or notification needed). A tool with no harm or inaccuracy risk and no patient agency opportunity (no to both questions) falls under neither.

Illustrative use cases and rationales:
AI-guided, nonautonomous surgical robot: Surgery carries considerable risk; patient can choose non-robotic surgery.
Genomic drug response and treatment planning tool: Inaccuracy could lead to poor outcomes; patient can opt out of having the tool used.
HCM screening algorithm (echocardiography data): Knowing AI recommended follow-up may affect patient’s decision to accept.
Predictive algorithm to stock blood in the OR for transfusion: Patients cannot influence operational decisions; blood availability does not replace transfusion consent.
AI-assisted mammography interpretation: Outperforms human-only reading; seeking alternate care elsewhere would increase harm risk.
GenAI summaries of radiologist-dictated imaging findings: Low harm risk; tool only summarizes the radiologist’s own dictated input.
Algorithm to safely discontinue daily laboratory testing: Patients cannot opt out or demand tests; ordering decision remains with the clinician.
GenAI tool filing prior authorization requests with insurers: Low harm risk (denials can be appealed); patients cannot personally review authorization requests.
GenAI tool drafting replies to patient emails: Inaccuracy may cause harm; informed patients can question unexpected replies.
Ambient AI generating clinic visit summaries: Patients can review summaries for accuracy. (Consent legally required in some states.)

Internal medicine, radiology, and oncology were the most frequently represented specialties in this literature (Figure 6.3.4). The United States, United Kingdom, and Germany account for the greatest number of publications, while studies from sub-Saharan Africa, Latin America, and Southeast Asia remain underrepresented (Figure 6.3.5). Studies that include children and adolescents as participants, rather than drawing solely on parent or caregiver perspectives, remain rare.

[Figure 6.3.4: Medical specialties represented in publications exploring patient perceptions of AI in healthcare, 2020–25. Number of publications: General Healthcare: 70; Internal Medicine: 24; Radiology: 16; OB/Gyn: 12; Oncology: 12; Psychiatry: 10; Ophthalmology, Surgery (general), Dentistry, Dermatology: 7 each; Urology, Neurology: 6 each; Cardiology, Emergency Medicine: 5 each; Pediatric, Ortho, Genetics/Genomics: 3 each; Rheumatology, Endocrinology, Gastroenterology, Palliative Care: 2 each; Transplant, Thoracic Surgery, Neurosurgery, PM&R, ENT, Geriatrics, Hematology, Nephrology, Anesthesia: 1 each. Source: RAISE Health, 2026 | Chart: 2026 AI Index report. Note: Publications may be tagged with multiple specialties.]

[Figure 6.3.5: Geographic distribution of publications on patient perceptions of AI in healthcare by country, 2020–25. Map legend bins: 1–2; 3–5; 6–10; 11–30; 31–70; no data. Source: RAISE Health, 2026 | Chart: 2026 AI Index report. Note: Publications may be tagged with multiple countries; countries of origin are included. In total, 39 countries are represented (N = 204).]

6.4 Ethical Considerations

6 M E D I C I N E | A I I N D E X R E P O R T 2026 This section tracks the volume and focus of ethical disclosure in medical AI publications, drawing on a bibliometric analysis of PubMed Central from January 2021 to December 2025. Publications were identified using search terms for medical AI and ethics, and then categorized by emphasis on data sharing, algorithm sharing, biosecurity, and global health. Ethics topics were either grouped under algorithmic, governance, or societal concerns. Of the total number of medical AI publications in 2025, 43.4% discussed ethics topics—up from 37.1% in 2024 (Figure 6.4.1). The absolute number of such publications more than doubled between the two years. Among the specific topics discussed, the growth was concentrated on governance, outpacing algorithmic and societal concerns (Figure 6.4.2). In 2025, the number of governance-related publications reached 1,228, compared with 896 for algorithmic concerns and 874 for societal concerns. Despite the attention paid to biosecurity in policy discussions, the subject is relatively unexplored in medical AI publications. In 2025, only 14 of these publications discussed biosecurity, with even fewer directly addressing the ethical implications of misuse or dual use (Figure 6.4.3). 
Volume and Concentration 295 449 624 1,114 2,378 1,311 1,650 1,900 3,000 5,477 2021 2022 2023 2024 2025 0 1,000 2,000 3,000 4,000 5,000 Medical AI and ethics Medical AI Number of publications Number of medical AI and ethics publications, 2021–25 Source: RAISE Health, 2026 | Chart: 2026 AI Index report Figure 6.4.1 285 6.4 E T H I CA L CO N S I D E R AT I O N S | M E D I C I N E | A I I N D E X R E P O R T 2026 107 199 241 471 896 130 170 278 544 1,228 72 121 182 347 874 2021 2022 2023 2024 2025 0 200 400 600 800 1,000 1,200 Algorithmic Governance Societal Number of publications Number of medical AI publications by ethics topics, 2021–25 Source: RAISE Health, 2026 | Chart: 2026 AI Index report Figure 6.4.2 Figure 6.4.3 5 4 8 13 14 2021 2022 2023 2024 2025 0 2 4 6 8 10 12 14 16 Number of publications Source: RAISE Health, 2026 | Chart: 2026 AI Index report Number of medical AI and biosecurity publications, 2021–25 286 6.4 E T H I CA L CO N S I D E R AT I O N S | M E D I C I N E | A I I N D E X R E P O R T 2026 Global health is an exception to the governance-dominated pattern. Among publications addressing global health in 2025, 51.8% (100 of 193) also mentioned ethics topics (Figure 6.4.4). Europe led with 38 publications, followed by East Asia (31) and North America (28), while sub-Saharan Africa, Latin America, and Oceania each produced fewer than five (Figure 6.4.5). In a departure from every other subcategory examined, societal concerns—including equity, justice, and accessibility—ranked highest in the global health context, surpassing both governance and algorithmic concerns (Figure 6.4.6). Researchers studying AI for global health are raising different questions from their peers working in the broader field. 
Global Health: A Different Ethical Focus 7 4 8 27 100 22 10 26 67 193 2021 2022 2023 2024 2025 0 50 100 150 200 Medical AI, global health, and ethics Medical AI and global health Number of publications Number of medical AI, global health, and ethics publications, 2021–25 Source: RAISE Health, 2026 | Chart: 2026 AI Index report Figure 6.4.4 287 6.4 E T H I CA L CO N S I D E R AT I O N S | M E D I C I N E | A I I N D E X R E P O R T 2026 23 25 17 11 9 3 3 10 5 4 3 4 4 5 38 31 28 14 11 8 5 4 3 3 Europe East Asia North America South Asia Middle East Sub-Saharan Africa Southeast Asia North Africa Latin America Oceania 0 5 10 15 20 25 30 35 40 45 2025 2024 2023 2022 2021 Number of publications Number of medical AI, global health, and ethics publications, 2021–25 Source: RAISE Health, 2026 | Chart: 2026 AI Index report 4 1 2 14 29 2 1 4 12 43 5 2 4 13 54 2021 2022 2023 2024 2025 0 10 20 30 40 50 60 Algorithmic Governance Societal Number of publications Number of medical AI and global health publications by ethics topics, 2021–25 Source: RAISE Health, 2026 | Chart: 2026 AI Index report Figure 6.4.5 Figure 6.4.6 288 Education 7 Demand for AI education is growing across every level, but the systems needed to deliver it are still catching up. Computer science enrollment in post-secondary institutions is declining even as AI-related majors gain popularity. Students at both the university and K-12 levels are using AI tools in large numbers, yet access to AI-specific coursework and teacher training remain limited. Governments, including the United States’, are pushing to integrate AI literacy into their curricula to maintain their countries’ competitive edge. Yet, data on AI education is fragmented and lagging, and much of the analysis in this chapter relies on CS education data as a proxy. 
To survey the current landscape of AI and CS education, this chapter was prepared in collaboration with the Kapor Foundation, the Computer Science Teachers Association (CSTA), the Expanding Computing Education Pathways (ECEP) Alliance, and the AI Index. The Kapor Foundation works at the intersection of racial equity and technology; CSTA is a global membership organization that supports educators in expanding access to CS education; and ECEP is a collective impact alliance focused on broadening participation in computing education.


7.1 Background

AI’s role in education is expanding faster than the data needed to track it, and much of the data in this chapter is limited in scope, lagging in time, or both. Postsecondary figures reflect completion rates from the 2023–24 academic year and do not yet show enrollment shifts driven by the increasing interest in AI or the reported decline in computer science (CS) enrollment. Global data from the OECD is available only through 2023 and excludes India, China, and much of Africa. At the K–12 level, there is little standardized data on AI-related course or program offerings. The metrics that are available are limited to CS education, which is not fully representative of AI education. Given these constraints, a complete assessment of the growing demand for AI education, and how current systems are meeting it, is not possible.

Public discussion about AI in education continues to expand, engaging developers, ed tech vendors, education advocates, policymakers, and educators around the role of AI, with a particular focus on its risks and benefits. There is broad agreement on the importance of AI literacy for all students and its designation as a critical skill for academic, professional, and civic navigation and success.

Public discourse often fails to distinguish between AI in education, AI literacy, and AI education (Figure 7.1.1). AI in education is the use of AI to complete teaching and learning tasks. AI literacy refers to the knowledge necessary for a foundational understanding of AI, how it works, how to use it, and the risks of usage. AI education builds on AI literacy with the addition of the technical skills required to build AI systems. Because these terms are often blurred in public discussions, clarity about which topic is being addressed matters for how educators, researchers, and policymakers communicate about AI. This chapter focuses on AI education and AI in education.
Where comprehensive data about AI education is not available, computer science (CS) education data is presented instead.

1 In A New Direction for Students in an AI World: Prosper, Prepare, Protect, a January 2026 report from the Center for Universal Education at the Brookings Institution, the authors provide an expansive catalogue of the risks and opportunities related to AI in education. In Annex A, they offer four definitions of “AI literacy” from frameworks published in the last five years.

Figure 7.1.1
AI in Education: The usage of AI tools in teaching and learning
AI Literacy: The foundational understanding of AI, how it works, how to use it, and the risks of usage
AI Education: AI literacy + the technical skills required to build AI

7.2 Postsecondary CS and AI Education

Generative AI usage among students has reshaped the conversation about the purpose and role of postsecondary education. At the same time, task automation in coding roles appears to have slowed the entry-level job market for CS graduates. These shifts have translated into declines in CS enrollment in postsecondary institutions. Between 2024 and 2025, enrollment in CS as an undergraduate major at four-year universities declined 11%.² Chapter 4 (Economy) documents a similar transition in the labor market, where employment among the youngest software developers has declined since 2024 even as overall AI hiring grows. Students are responding to a shifting job market, but because degree completion lags enrollment by several years, the full effects will take time to appear in the data. Even as CS enrollment declines, there is evidence that AI-related majors are becoming more popular.

U.S. Degree Graduates

Previous versions of the AI Index focused primarily on CS degrees, since very few graduates are classified under an Artificial Intelligence major. This year, the AI Index added AI-relevant majors, as determined by the January 2025 White House AI Talent Report. The report divides AI-relevant majors into two categories: AI software, which includes majors such as Artificial Intelligence, Computer Programming/Programmer, and Computational and Applied Mathematics; and AI hardware, which includes majors such as Electrical and Electronics Engineering, Condensed Matter and Materials Physics, and Industrial Engineering. AI software–related degrees have steadily increased in popularity over the past 10 years, especially at the bachelor’s and master’s levels (Figure 7.2.1). The largest increase has been at the master’s level, with an 82% increase in graduates between 2022 and 2024 and a 17% increase between 2023 and 2024.
The number of AI hardware–related degrees has remained flat or declined; bachelor’s degrees, in particular, have declined 13% since reaching a peak in 2020.

2 The declines are based on CS specifically as a major, not the broader category of Computer and Information Sciences and Support Services.
3 The Classification of Instructional Programs (CIP), developed by the National Center for Education Statistics (NCES), designates “Artificial Intelligence and Robotics” under CIP code 11.0102. Despite the availability of this code since 2016, very few schools use it, instead choosing to classify students under 11.0101 (Computer and Information Sciences, General).
4 See the methodology section for a complete list of majors.

Figure 7.2.1: New AI-related postsecondary graduates in the United States, 2014–24 (panels: AI software–related; AI hardware–related; series: associate’s, bachelor’s, master’s, PhD; y-axis: number of AI graduates, in thousands). Source: National Center for Education Statistics, Integrated Postsecondary Education Data System, 2024 | Chart: 2026 AI Index report

Women remain underrepresented across AI-related degrees, though they have slightly higher levels of representation in AI software–related degrees than in AI hardware–related degrees, peaking at 36% of AI software–related master’s degree graduates (Figure 7.2.2). By comparison, women continue to account for nearly 60% of all degrees.

Figure 7.2.2: AI-related postsecondary graduates in the United States by gender, 2024 (panels: AI software–related; AI hardware–related; series: male, female; y-axis: % of AI graduates). Source: National Center for Education Statistics, Integrated Postsecondary Education Data System, 2024 | Chart: 2026 AI Index report

For AI software–related degrees, Hispanic/Latino, Black, Native Hawaiian/Pacific Islander, and Native American/Alaskan students are underrepresented at all levels (Figure 7.2.3). White students are also underrepresented, though to a lesser degree, except at the PhD level. Multiracial and Asian students are overrepresented, with Asian master’s students the most overrepresented.

Among AI hardware–related degrees (Figure 7.2.4), representation patterns tend to be consistent across degree levels within each racial group. The main exceptions are Asian students and Native American/Alaskan students, whose representation varies by degree level. Hispanic/Latino students are slightly underrepresented at all levels, and Native Hawaiian/Pacific Islander and Black students remain underrepresented at all levels.

Figure 7.2.3: AI software–related vs. all postsecondary graduates in the United States by race/ethnicity, 2024 (x-axis: ratio of AI software–related graduates vs. all graduate demographics; series: associate’s, bachelor’s, master’s, PhD). Source: National Center for Education Statistics, Integrated Postsecondary Education Data System, 2024 | Chart: 2026 AI Index report

Figure 7.2.4: AI hardware–related vs. all postsecondary graduates in the United States by race/ethnicity, 2024 (x-axis: ratio of AI hardware–related graduates vs. all graduate demographics; series: associate’s, bachelor’s, master’s, PhD). Source: National Center for Education Statistics, Integrated Postsecondary Education Data System, 2024 | Chart: 2026 AI Index report
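Figures 7.2.3 and 7.2.4 plot a ratio of AI graduates to all graduates for each group. A minimal sketch of how such a representation ratio can be computed follows, assuming it is a group's share among AI graduates divided by its share among all graduates (the report's exact method is not stated here). The example uses the female shares from Figure 7.2.2 and the rounded "nearly 60%" figure from the text, applied to every degree level as a simplification; the function name is hypothetical.

```python
def representation_ratio(share_in_field: float, share_overall: float) -> float:
    """Share of a group among AI graduates divided by its share among all graduates.

    A ratio above 1 indicates overrepresentation; below 1, underrepresentation.
    """
    if share_overall <= 0:
        raise ValueError("share_overall must be positive")
    return share_in_field / share_overall


# Female share (%) of AI software-related graduates, from Figure 7.2.2.
female_share_ai_software = {
    "Associate's": 23,
    "Bachelor's": 27,
    "Master's": 36,
    "PhD": 31,
}
# "Nearly 60%" of all degrees go to women; rounded, and assumed here to
# hold at every degree level for the sake of illustration.
FEMALE_SHARE_ALL_DEGREES = 60

ratios = {
    level: round(representation_ratio(share, FEMALE_SHARE_ALL_DEGREES), 2)
    for level, share in female_share_ai_software.items()
}

for level, ratio in ratios.items():
    print(f"{level}: {ratio}")
```

Under these assumptions every ratio falls below 1, consistent with the statement that women are underrepresented at all levels, with the master's level closest to parity.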
The majority of AI-related graduate students are nonresidents of the United States, a pattern consistent with previous years’ analyses of CS degrees (Figure 7.2.5). This is especially true of AI software–related master’s degrees, where 67% of graduates are nonresidents. However, with the federal government revoking student visas and discouraging international student enrollment, further declines in the number of nonresident graduates are expected in the coming years.

Figure 7.2.5: AI-related postsecondary graduates in the United States by residency, 2024 (panels: AI software–related; AI hardware–related; series: resident, nonresident; y-axis: % of AI graduates). Source: National Center for Education Statistics, Integrated Postsecondary Education Data System, 2024 | Chart: 2026 AI Index report

A range of institutions produce the highest number of graduates in AI-related fields,⁵ including both public and private universities (Figure 7.2.6). The Georgia Institute of Technology is the only school in the top 10 across all levels for both AI software– and hardware–related degrees.
Other universities that appear at least once on both lists include University of California, Berkeley (4 mentions); University of Illinois, Urbana-Champaign (4); University of Michigan, Ann Arbor (4); Pennsylvania State University (3); Northeastern University (2); Carnegie Mellon University (2); Massachusetts Institute of Technology (2); and Stanford University (2). As more institutions add AI-specific majors at the undergraduate level, these rankings are likely to shift in coming years.

5 Some institutions classify AI-related graduates under broader program categories rather than AI-specific ones, which may result in undercounting at schools where AI coursework is housed within general computer science or engineering programs.

Figure 7.2.6: Top postsecondary institutions graduating students in AI-related degrees, 2024. Source: National Center for Education Statistics, Integrated Postsecondary Education Data System, 2024 | Chart: 2026 AI Index report

AI software–related
Bachelor’s: 1. University of Maryland Global Campus (2,350); 2. Western Governors University (2,029); 3. University of California, Berkeley (1,983); 4. University of Maryland, College Park (1,877); 5. Southern New Hampshire University (1,733); 6. University of Michigan, Ann Arbor (1,723); 7. Rutgers University (1,329); 8. Georgia Institute of Technology (1,302); 9. University of California, San Diego (1,204); 10. Pennsylvania State University (1,181)
Master’s: 1. Georgia Institute of Technology (3,394); 2. University of Texas, Dallas (2,555); 3. Columbia University (2,481); 4. University of North Texas (2,454); 5. Trine University (2,114); 6. University of Illinois, Urbana-Champaign (1,905); 7. Northeastern University (1,720); 8. University of Southern California (1,676); 9. University of Texas, Arlington (1,430); 10. Boston University (1,360)
PhD: 1. Georgia Institute of Technology (155); 2. Massachusetts Institute of Technology (154); 3. University of Illinois, Urbana-Champaign (128); 4. Carnegie Mellon University (128); 5. University of Michigan, Ann Arbor (106); 6. University of California, Los Angeles (105); 7. University of California, San Diego (104); 8. University of Washington (104); 9. University of California, Berkeley (103); 10. Stanford University (95)

AI hardware–related
Bachelor’s: 1. Purdue University (2,083); 2. Texas A&M University (1,857); 3. Georgia Institute of Technology (1,400); 4. University of Illinois, Urbana-Champaign (1,391); 5. Ohio State University (1,341); 6. Virginia Polytechnic Institute and State University (1,039); 7. Pennsylvania State University (1,002); 8. University of California, Berkeley (970); 9. Iowa State University (942); 10. North Carolina State University (914)
Master’s: 1. Northeastern University (1,565); 2. Arizona State University (1,090); 3. New York University (959); 4. University of Michigan, Ann Arbor (941); 5. Johns Hopkins University (939); 6. Carnegie Mellon University (899); 7. Georgia Institute of Technology (839); 8. Texas A&M University (740); 9. San Jose State University (737); 10. Purdue University (731)
PhD: 1. Purdue University (329); 2. University of Michigan, Ann Arbor (242); 3. Stanford University (202); 4. Georgia Institute of Technology (198); 5. Massachusetts Institute of Technology (191); 6. Texas A&M University (189); 7. University of Illinois, Urbana-Champaign (184); 8. University of Texas, Austin (180); 9. University of California, Berkeley (176); 10. Pennsylvania State University (160)

AI PhD graduates continue to choose industry jobs and lucrative salaries more often than academic jobs, with 65% going into industry after graduation (Figure 7.2.7). This percentage has declined in the last few years, down from a peak of 77% in 2022. At the same time, the share of academic jobs has increased, with the number of graduates entering academia nearly doubling since 2022, from 79 to 145 (Figure 7.2.8).
This challenges the narrative that academia is experiencing an exodus of experts or a “brain drain.” The percentage of AI PhD graduates entering government jobs has gradually increased to 2% from a low of 0.7% in 2021.

Figure 7.2.7: Employment of new AI PhDs in the United States and Canada by sector, 2010–24 (series: academia; government; industry; y-axis: number of new AI PhD graduates). Source: CRA Taulbee Survey, 2011–25 | Chart: 2026 AI Index report

Figure 7.2.8:⁶ Employment of new AI PhDs (% of total) in the United States and Canada by sector, 2010–24 (2024 labels: 62.75% industry; 31.59% academia; 1.96% government). Source: CRA Taulbee Survey, 2011–25 | Chart: 2026 AI Index report

Global ICT Graduates

No single dataset provides a fully standardized accounting of AI or CS postsecondary education across all countries. The Organization for Economic Cooperation and Development (OECD) has compiled data covering its member countries and several non-OECD nations. The International Standard Classification of Education provides the framework the OECD uses to compare education statistics across countries. Information and communication technologies, or ICT, includes such areas of study as “informatics, information, and communication technologies, or CS. These subjects cover a wide range of topics related to the new technologies used for the processing and transmission of digital information, including computers, computerized networks (including the Internet), microelectronics, multimedia, software, and programming.”

The United States remains a global leader in ICT-related fields, producing more graduates at the associate’s, bachelor’s, master’s, and PhD levels than any other country in the sample (Figures 7.2.9 to 7.2.12). At most levels, other countries had faster year-over-year growth than the United States. At the associate’s level (denoted in international charts as short-cycle tertiary), Turkey increased its graduates by 27%; at the bachelor’s level, both Brazil and Turkey increased their graduates by 30%; and at the PhD level, Mexico increased its graduates by 76%. The exception is at the master’s level, where the United States increased its graduates by 55% (though, as noted earlier, many master’s graduates in the United States are not U.S. residents).

6 The sums in Figure 7.2.8 do not add up to 100%, as there is a subset of new AI PhDs each year who become self-employed, unemployed, or report an “other” employment status in the CRA survey. These students are not included in the chart.
7 While this dataset provides insights across some country lines, it omits a number of countries likely to have large numbers of ICT graduates. The exclusion of India, China, and countries in Africa highlights the need for global standardized data collection to ensure the inclusion of countries that have made significant investments in computing education and make up a significant proportion of the global population. There is also a notable lag in collecting and reporting global data on education; as a result, the most recent year for which data is available is 2023. Data for each country includes any students who have graduated from an institution in that country, regardless of their nationality.
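The year-over-year growth rates quoted for ICT graduates can be checked against the underlying counts. The sketch below is illustrative: the 2022 and 2023 values are read from the data labels of Figures 7.2.9 to 7.2.12, and pairing each value with a country relies on the order of the labels in those charts, which is an assumption. The function name `yoy_growth_pct` is hypothetical.

```python
def yoy_growth_pct(previous: int, current: int) -> float:
    """Year-over-year growth from `previous` to `current`, in percent."""
    return 100 * (current - previous) / previous


# (2022, 2023) graduate counts, as read from Figures 7.2.9 to 7.2.12.
ict_graduates = {
    ("Turkey", "short-cycle tertiary"): (16_464, 20_884),
    ("Brazil", "bachelor's"): (61_760, 80_316),
    ("United States", "master's"): (55_706, 86_301),
    ("Mexico", "PhD"): (144, 254),
}

# Growth rounded to the nearest whole percent, as reported in the text.
growth = {
    key: round(yoy_growth_pct(previous, current))
    for key, (previous, current) in ict_graduates.items()
}

for (country, level), pct in growth.items():
    print(f"{country}, {level}: +{pct}%")
```

Rounded to whole percentages, these four results agree with the 27%, 30%, 55%, and 76% figures cited in the text.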
Figure 7.2.9: New ICT short-cycle tertiary graduates by country, 2022–23. Source: OECD, 2025 | Chart: 2026 AI Index report

Figure 7.2.10: New ICT bachelor’s graduates by country, 2022–23. Source: OECD, 2025 | Chart: 2026 AI Index report

Figure 7.2.11: New ICT master’s graduates by country, 2022–23. Source: OECD, 2025 | Chart: 2026 AI Index report

Figure 7.2.12: New ICT PhD graduates by country, 2022–23. Source: OECD, 2025 | Chart: 2026 AI Index report

Gender parity among ICT graduates remains uneven across countries and degree levels (Figure 7.2.13). On average, women account for 20% of associate’s graduates, 22% of bachelor’s graduates, 29% of master’s graduates, and 29% of PhD graduates. While women’s share of associate’s graduates declined year over year, their share of PhD graduates increased by 4 percentage points. Women made up at least half of ICT graduates in Peru at the associate’s level, and in Costa Rica and Latvia at the PhD level. Turkey, which had reported gender parity at all levels in the prior year, saw its shares move closer to the global averages in 2023. The gender composition of these graduates has been consistent year over year, similar to the pattern among AI authors and inventors documented in Chapter 1, where the gender ratio has shown little change over the past 15 years.

Figure 7.2.13: Percentage of new ICT postsecondary graduates who are female by country, 2022–23 (one panel per country plus an average; levels: short-cycle, bachelor’s, master’s, PhD). Source: OECD, 2025 | Chart: 2026 AI Index report

CS, CE, and Information Faculty

In 2024–25, there were over 6,600 CS, CE (computer engineering), and information faculty in the United States and Canada (Figure 7.2.14). Nearly two-thirds of them filled tenure-track positions. Computing Research Association (CRA) projections suggest the number of faculty will increase over the next two academic years, with the most growth in postdoctoral positions. Hispanic/Latino, Black, and Indigenous people are underrepresented in faculty positions, as are all women except Asian women (Figure 7.2.15).⁹ Asian and Native Hawaiian/Pacific Islander men are overrepresented among faculty.

8 Due to changes in CRA’s methodology, these figures are not directly comparable to faculty counts published in previous editions of the AI Index.
9 National faculty demographics are from the United States (NCES), whereas the faculty data also includes Canada. Postdoctorates were not included in the comparison data because they were absent from the national faculty demographics.
Figure 7.2.14: Number of CS, CE, and information faculty in the United States and Canada, 2024–26 (categories: tenure-track faculty; teaching professors; other instructors; non-tenure-track researcher faculty; postdoctorates; totals: 6,683 in 2024–25, 7,101 in 2025–26, 7,501 in 2026–27). Source: CRA Taulbee Survey, 2025 | Chart: 2026 AI Index report

Figure 7.2.15: CS, CE, and information faculty vs. national faculty demographics by race/ethnicity and gender, 2024 (x-axis: ratio of CS, CE, and information faculty vs. all faculty demographics; series: male, female). Source: CRA Taulbee Survey, 2025 | Chart: 2026 AI Index report

Student Use of AI Tools

While this chapter’s focus is AI education, how students are learning is also changing. Examining AI in education, or the use of AI to complete teaching and learning tasks, helps provide a more complete picture of the impact of AI on the field of education. In Chegg’s 2025 survey of university students from 15 countries, 80% said they have used generative AI to support their learning. That is double the share reported in 2023, when only 40% of students reported having used generative AI for school. Generative AI usage varies widely by country, with 95% of Indonesian students saying they have used it, compared to 67% in the United States and the United Kingdom (Figure 7.2.16). Students who use generative AI for school report doing so frequently: 56% input a question at least once a day.

For those students who do not use AI tools for school, the top reasons include concerns about accusations of academic misconduct (45%), content accuracy (38%), and school policies restricting AI use (33%). University students report using AI in similar ways to high school students, including researching, brainstorming/generating ideas, and editing essays. One notable difference is that university students are more likely than high school students to use AI to understand a concept (56% vs. 41%); in fact, understanding a subject is the top use of generative AI tools among university students (Figure 7.2.17).

Figure 7.2.16: University students (% of total) who have used GenAI to support their university studies, 2023 vs. 2025 (by country). Source: Chegg Global Student Survey, 2023 and 2025 | Chart: 2026 AI Index report

Figure 7.2.17: University students’ GenAI uses for schoolwork, 2025 (% of students). Source: Chegg Global Student Survey, 2025 | Chart: 2026 AI Index report

Anthropic analyzed how students use Claude, its generative AI tool, and found that most students use it for higher-order thinking skills,¹⁰ such as creating (39.8%) and analyzing (30.2%), rather than lower-order thinking skills, like applying (10.9%) and understanding (10.0%). These uses may suggest that students are relying on generative AI tools for important cognitive skills rather than developing them independently; indeed, another survey found that 55% of U.S. college students believe using generative AI tools has had a mixed effect on their critical thinking skills.

Nonetheless, university students speak positively about using AI tools in education. A survey of over 73,000 California State University students showed that 64% of them agree that AI has positively affected their learning experience. The students in the Chegg survey listed several benefits from using AI tools. Half of respondents reported increased understanding of topics, 49% reported improved ability to finish assignments, and 41% reported improved organization. They also reported that AI makes their learning process more efficient, with 55% of university students saying AI helps them learn faster, and 41% saying it frees up more of their time.
University students are still concerned about accusations of cheating and appropriate uses of AI; in response, more universities have implemented AI use policies. A faculty survey noted that 48% of institutions now have policies governing acceptable uses of generative AI, an increase of 9 percentage points since 2025. In the U.K., 80% of students think their university has a clear policy on generative AI use in assessments, which is a 16-percentage-point improvement from the previous year.

10 These categories refer to levels of cognitive complexity as defined by Bloom’s Taxonomy, a widely used framework in education that classifies skills from lower-order (remembering, understanding, applying) to higher-order (analyzing, evaluating, creating).
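The grouping in footnote 10 can be applied to the Claude usage shares reported earlier in this section to see how much of the reported usage falls on each side of Bloom's Taxonomy. This is a rough sketch: only the four shares quoted in the text are included, so the two totals do not sum to 100%, and the variable and function names are hypothetical.

```python
# Usage shares (%) by Bloom's Taxonomy level, as quoted in the text.
# Levels the text does not report (remembering, evaluating) are omitted.
reported_shares = {
    "creating": 39.8,
    "analyzing": 30.2,
    "applying": 10.9,
    "understanding": 10.0,
}

# Grouping of levels taken from footnote 10.
HIGHER_ORDER = {"analyzing", "evaluating", "creating"}
LOWER_ORDER = {"remembering", "understanding", "applying"}


def total_share(levels: set) -> float:
    """Sum the reported shares for the given levels, rounded to one decimal."""
    return round(
        sum(share for level, share in reported_shares.items() if level in levels), 1
    )


higher_order_share = total_share(HIGHER_ORDER)
lower_order_share = total_share(LOWER_ORDER)

print(higher_order_share)  # 70.0
print(lower_order_share)   # 20.9
```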

7.3 K–12 CS and AI Education

7 E D U CAT I O N | A I I N D E X R E P O R T 2026 Code.org’s annual State of CS Education Report, which tracks CS education access, participation, and state policy, was expanded in 2025 to include an analysis of states’ AI education policies as of December 1, 2025. With only four states emphasizing AI in their CS standards, adoption of AI education remains limited. Between the 2017–18 and 2023–24 academic years, the percentage of U.S. high schools offering CS increased from 35% to 60%. The national average has held steady since 2023–24, with the same 60% of high schools offering foundational CS classes in 2024–25. There is, however, growth in many states and regression in a few (Figure 7.3.1). In states where offerings declined, budget cuts or reallocation toward other priorities, including literacy and student crisis support, may be contributing factors. Data on AI-specific education at the K–12 level remains limited. This section tracks developments in CS education as a proxy, with attention to the emerging state and federal policies that are beginning to address AI directly. United States Foundational Computer Science 35% 45% 47% 51% 53% 58% 60% 60% 2017–18 2018–19 2019–20 2020–21 2021–22 2022–23 2023–24 2024–25 0% 10% 20% 30% 40% 50% 60% % of schools Public high schools teaching foundational CS, 2017–18 to 2024–25 Source: Code.org, 2025 | Chart: 2026 AI Index report Figure 7.3.1 306 7.3 K-1 2 C S A N D A I E D U CAT I O N | E D U CAT I O N | A I I N D E X R E P O R T 2026 Out of 22 states reporting data for both 2023–24 and 2024–25, five states (AR, DE, LA, MD, SC) maintained the percentage of schools offering CS classes; three of them were already at or nearing universal access (AR and MD reporting 100% and SC reporting 92%) (Figures 7.3.2 and 7.3.3). Nine states and the District of Columbia (DC) reported some growth in the percentage of high schools offering CS. Of those, six (GA, IA, MS, NC, UT, WA) reported minimal growth (1%–5%). 
Two states (CA, MT) plus DC reported modest growth (6–10 percentage points). One state, TN, reported significant growth, increasing the percentage of high schools offering CS classes by 22 percentage points (from 61% to 83%). This increase may reflect a 2022 policy requiring all K–12 students to have access to CS; districts were required to implement the state’s K–12 CS standards and a one-credit CS graduation requirement starting with incoming freshmen in the 2024–25 academic year. A few states reported small decreases in the percentage of schools offering CS, with only one (WY) reporting a decrease as large as 10 percentage points (from 74% to 64%).

[Figure 7.3.2: Public high schools teaching foundational CS (% of total in state), 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values: AL 92%, AK 34%, AZ 43%, AR 100%, CA 58%, CO 66%, CT 84%, DE 61%, DC 60%, FL 38%, GA 81%, HI 67%, ID 44%, IL 58%, IN 89%, IA 86%, KS 38%, KY 75%, LA 39%, ME 59%, MD 100%, MA 81%, MI 54%, MN 34%, MS 87%, MO 56%, MT 38%, NE 51%, NV 98%, NH 94%, NJ 86%, NM 54%, NY 50%, NC 70%, ND 53%, OH 61%, OK 72%, OR 56%, PA 78%, RI 79%, SC 92%, SD 55%, TN 83%, TX 48%, UT 82%, VT 60%, VA 68%, WA 51%, WV 78%, WI 54%, WY 64%.]

[Figure 7.3.3: Change in public high schools teaching foundational CS, 2024 vs. 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values (states without a reported change are omitted): AL -2%, AR 0%, CA 6%, DE 0%, DC 7%, GA 3%, HI -5%, ID -2%, IA 2%, KY -1%, LA 0%, ME -4%, MD 0%, MS 2%, MT 7%, NH -1%, NC 1%, SC 0%, TN 22%, UT 1%, WA 1%, WY -10%.]

As a result of disparities in education funding, CS education access varies by school size, geographic area, socioeconomic status, and student race/ethnicity (Figures 7.3.4 to 7.3.7). In 2025, 91% of large high schools, 77% of medium-sized high schools, and only 44% of small high schools offered foundational CS courses.
As public school closures continue in some areas, school size and resource distribution may shift as nearby schools absorb displaced students. Whether funding follows those students, and how consolidation affects CS access, will be important to monitor. Title I schools¹¹ (60%) are slightly less likely to offer CS than non–Title I schools (65%). Rural (57%) and urban (59%) high schools are less likely than suburban (71%) high schools to offer CS. The similar rates of rural and urban CS access may reflect shared constraints, including the digital divide and less access to CS teachers.

Among Black, Hispanic/Latino, Native Hawaiian/Pacific Islander, and white high school students, the rates of access to foundational CS courses fall within a narrow range (80%–82%). Asian students (91%) are most likely to have access to CS courses. Native American students are least likely to have access to CS, though they had the largest year-over-year growth, from 66% in 2023–24 to 70% in 2024–25.

¹¹ Title I schools serve students from low-income families and receive supplemental federal funding.
[Figure 7.3.4: Schools offering foundational CS courses by size (% of schools), 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values: Small 44%; Medium 77%; Large 91%.]

[Figure 7.3.5: Schools offering foundational CS courses by geographic area (% of schools), 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values: Rural 57%; Urban 59%; Suburban 71%.]

[Figure 7.3.6: Schools offering foundational CS courses by Title I status (% of schools), 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values: Title I 60%; Non–Title I 65%.]

[Figure 7.3.7: Access to foundational CS courses by race/ethnicity (% of students), 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values: Native American 70%; Black 80%; Hispanic/Latino 80%; Native Hawaiian/Pacific Islander 81%; White 82%; Two or more races 84%; Asian 91%.]

Based on participation data from 42 states, 6.1% of students were enrolled in CS in 2024–25, but student participation in CS varies by state (Figure 7.3.8). Arkansas and South Carolina report the highest participation rates, around 25%, while Idaho and Minnesota report the lowest (1.8%). Sixty-two percent of the reporting states (26 of 42) experienced a decrease in CS participation, though nearly half (12) of them reported decreases of less than 0.5 percentage points (Figure 7.3.9). Thirteen states (31%) reported an increase in student participation. The states that saw the largest increases in CS participation were North Dakota (up 11.1 percentage points, from 5% to 16.1%), Tennessee (up 7.2 percentage points, from 6% to 13.2%), and Arkansas (up 5.1 percentage points, from 20% to 25.1%). Comparison with other states, including Ohio and California, is not possible because they did not report participation rates in 2023–24.
Several states with the highest percentage of high schools offering foundational CS courses also reported the highest participation rates, but the correlation was not evident in every state (Figure 7.3.10). Arkansas and South Carolina have the highest participation rates at 25.1% and 25.7%, respectively. This is not surprising given that they have a CS requirement for graduation and near-universal access rates (100% for Arkansas; 92% for South Carolina). Maryland also has universal access to CS, but fewer of its students (16.7%) take the available courses. Meanwhile, Rhode Island reports that just 79% of its schools offer foundational CS courses, but a higher percentage of its students take those courses (18.4%).

[Figure 7.3.8: Public high school enrollment in CS (% of students), 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values (states without a value in the chart are omitted): AZ 4%, AR 25%, CA 5%, CT 8%, FL 2%, GA 8%, HI 3%, ID 2%, IL 6%, IN 6%, IA 5%, KS 3%, KY 9%, LA 3%, MD 17%, MA 8%, MN 2%, MS 11%, MO 3%, MT 4%, NE 5%, NJ 9%, NM 3%, NY 3%, NC 6%, ND 7%, OH 3%, OK 8%, OR 7%, PA 4%, RI 18%, SC 26%, TN 13%, TX 5%, UT 11%, VT 3%, VA 5%, WA 5%, WV 3%, WI 3%, WY 7%.]

[Figure 7.3.9: Change in public high school enrollment in CS (% of students), 2024 vs. 2025. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values (states without a reported change are omitted): AL 0.00%, AZ 1.50%, AR 5.30%, CT -1.00%, FL 0.00%, GA 0.60%, HI -1.00%, ID -0.40%, IL -1.00%, IN -1.00%, IA 0.10%, KS -0.10%, KY -2.90%, LA -0.80%, MD 1.10%, MA 0.00%, MN -0.70%, MS 0.00%, MO 0.00%, MT 0.10%, NE 0.40%, NJ 0.00%, NM 0.00%, NY -1.70%, NC 0.90%, ND 11.20%, OK 3.00%, OR 0.00%, PA -2.40%, RI 0.00%, SC 0.00%, TN 7.20%, TX -0.90%, UT -1.20%, VT -0.60%, VA 0.00%, WA -0.60%, WV -1.50%, WI -0.30%, WY -1.90%.]

[Figure 7.3.10: Percent of high schools offering CS vs. percent of students enrolled in CS by state, 2025 (scatter plot of 42 states; axes: % of schools offering CS, % of students enrolled in CS). Source: Code.org, 2025 | Chart: 2026 AI Index report.]

CS enrollment data also shows gaps for several subgroups (Figure 7.3.11); here too, the data should be interpreted with caution, as not all states reported enrollment data this year. Last year’s analysis showed near or above proportional representation for Black, Native American/Alaskan, and white students at the national level. That trend continues this year, with Native Hawaiian/Pacific Islander students moving closer to that mark, and Asian students and students with 504 plans overrepresented among participating students (Figure 7.3.12). The representation of Hispanic/Latino students, economically disadvantaged students, students with IEPs, and girls slightly improved, but these populations remain underrepresented in CS courses. English language learners (ELL) remain underrepresented, but between 2024 and 2025, their representation notably improved, which may be due to concerted engagement efforts and initiatives to boost the quality of CS instruction for ELL students.
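The representation figures that follow are labeled as a ratio of enrollment in CS to national demographics, so a value of 1.0 indicates proportional representation. The sketch below illustrates that reading, assuming a simple share-over-share definition (the report does not spell out Code.org's exact method). The function names, the 0.05 tolerance band, and the example shares are hypothetical and are not reported data.

```python
def representation_ratio(share_of_cs_enrollment: float,
                         share_of_all_students: float) -> float:
    """Ratio of a group's share of CS enrollment to its share of all students.

    Assumed definition: 1.0 means the group is represented in CS in
    proportion to its presence in the overall student population.
    """
    if share_of_all_students <= 0:
        raise ValueError("population share must be positive")
    return share_of_cs_enrollment / share_of_all_students


def describe(ratio: float, tolerance: float = 0.05) -> str:
    """Label a ratio relative to 1.0; the tolerance band is an arbitrary choice."""
    if ratio > 1 + tolerance:
        return "overrepresented"
    if ratio < 1 - tolerance:
        return "underrepresented"
    return "near proportional"


# Hypothetical group: 50% of all students but only 34% of CS enrollment.
ratio = representation_ratio(0.34, 0.50)
print(f"{ratio:.2f} -> {describe(ratio)}")
```

Read this way, a ratio above 2 means a group's share of CS enrollment is more than double its share of the student population.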
[Figure 7.3.11: Public high school enrollment in CS vs. national demographics by race/ethnicity, 2025 (ratio of enrollment in CS to national demographics, 2024 and 2025; groups: White, Two or more races, Native Hawaiian/Pacific Islander, Native American/Alaskan, Hispanic/Latino, Black/African American, Asian). Source: Code.org, 2025 | Chart: 2026 AI Index report.]

[Figure 7.3.12¹²: Public high school enrollment in CS vs. national demographics by subgroup, 2025 (ratio of enrollment in CS to national demographics, 2024 and 2025; groups: Students with IEPs, Students with 504 plans, Girls, English language learners, Economically disadvantaged). Source: Code.org, 2025 | Chart: 2026 AI Index report.]

¹² A student with a 504 plan receives accommodations under Section 504 of the Rehabilitation Act of 1973, a U.S. civil rights law that prohibits discrimination against individuals with disabilities. A student with an IEP (individualized education program) receives special education services under the Individuals with Disabilities Education Act. An IEP is a legally binding document that outlines a learning plan for a student with a disability designed to meet their unique needs and improve educational outcomes.

Advanced coursework covers foundational AI concepts (e.g., AP CS Principles) and is an obvious pathway to build AI literacy and integrate more in-depth AI education. Since 2016, AP exam participation has grown steadily year over year, except for the plateau during the COVID-19 pandemic between 2020 and 2021 (Figure 7.3.13). However, AP exam growth slowed from 21% between 2022 and 2023 to just 5% between 2023 and 2024.
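The growth rates quoted above can be checked against the exam counts plotted in Figure 7.3.13 (201.61, 243.18, and 254.80 thousand exams in 2022, 2023, and 2024). A minimal sketch of that arithmetic follows; the helper name `yoy_growth_pct` is illustrative only.

```python
# AP computer science exams taken, in thousands, as shown in Figure 7.3.13.
exams_thousands = {2022: 201.61, 2023: 243.18, 2024: 254.80}


def yoy_growth_pct(series: dict, year: int) -> float:
    """Percent change from the previous year to `year`."""
    previous, current = series[year - 1], series[year]
    return (current - previous) / previous * 100


for year in (2023, 2024):
    growth = yoy_growth_pct(exams_thousands, year)
    print(f"{year - 1} to {year}: {growth:.1f}%")
# prints:
# 2022 to 2023: 20.6%
# 2023 to 2024: 4.8%
```

Rounded to whole percentages, these match the 21% and 5% figures cited in the text.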
Despite improvement in student populations’ representation across courses, students do not participate in AP exams in proportion to their racial/ethnic representation. Black, Native American/Alaskan, and Native Hawaiian/Pacific Islander students are better represented among CS education participants than among students who took the AP exam in 2024; Hispanic/Latino students were better represented among AP exam takers than among the general population of students taking CS. Female students took the AP exam less often than male students in 2024; stereotype threat may explain the reluctance to take the exam and differences in scores. Asian students, multiracial boys, and white boys are overrepresented among those taking AP exams (Figures 7.3.14, 7.3.15, 7.3.16).

Advanced Computer Science

[Figure 7.3.13: Number of AP computer science exams taken (in thousands), 2007–24. Source: Code.org, 2025 | Chart: 2026 AI Index report. Values: 2007 19.39; 2008 19.83; 2009 20.96; 2010 19.39; 2011 21.14; 2012 24.78; 2013 29.55; 2014 37.33; 2015 46.34; 2016 54.38; 2017 99.87; 2018 130.90; 2019 158.56; 2020 179.19; 2021 181.04; 2022 201.61; 2023 243.18; 2024 254.80.]

[Figure 7.3.14: AP computer science exams taken by race/ethnicity, 2007–24. Source: Code.org, 2025 | Chart: 2026 AI Index report. 2024 values: White 89,363; Asian 75,936; Hispanic/Latino 46,526; Black/African American 18,026; Two or more races 11,797; Native American/Alaskan 590; Native Hawaiian/Pacific Islander 268; Other 0.]

[Figure 7.3.15: AP computer science exams taken (% of total responding students) by race/ethnicity, 2007–24. Source: Code.org, 2025 | Chart: 2026 AI Index report. 2024 values as labeled: White 35.07%; Asian 29.80%; Hispanic/Latino 18.26%; Black/African American 7.07%; Two or more races 4.63%; Native Hawaiian/Pacific Islander 0.11%; Native American/Alaskan 0.00%; Other 0.00%.]

[Figure 7.3.16: AP computer science exam participation vs. national demographics by race/ethnicity, 2024 (ratio of AP CS exam participation to national demographics, male and female; groups: White, Two or more races, Native Hawaiian/Pacific Islander, Native American/Alaskan, Hispanic/Latino, Black/African American, Asian). Source: Code.org, 2025 | Chart: 2026 AI Index report.]

K–12 Student Use of AI Tools

Using AI tools for educational purposes is increasingly popular among middle and high school students. Estimates of student use of AI to complete school-related tasks range from about 50% to 84%, based on survey data.¹³ While students acknowledge concerns (e.g., false cheating accusations, diminished critical thinking, and weakened academic skills), studies suggest that usage is trending upward, with a majority of students now advocating for AI use in schools. One survey found that about half of students surveyed agreed that schools should be required to teach students how to use AI (52%) and that students should be allowed to use AI to complete homework (47%); another survey reported even higher rates of agreement, with 65% of middle school students and 73% of high school students stating that students should have access to and be able to use AI tools to complete schoolwork.
High school students report using generative AI most often for conducting research and finding sources, editing or revising essays, and brainstorming ideas (Figure 7.3.17), and they report benefiting from increased access to learning materials and more efficient learning.

[Figure 7.3.17¹⁴: High school students’ GenAI uses for schoolwork, 2025. Source: College Board, 2025 | Chart: 2026 AI Index report. Values: Conducting research and finding sources 51%; Editing or revising essays 50%; Brainstorming ideas 50%; Explaining complex topics 41%; Learning languages 30%; Writing code 18%; Other 2%.]

¹³ CDT (2025), 50%; College Board (2025), 84%; RAND (2025), 54%.

¹⁴ This chart shows the average percentage across College Board’s four survey administrations in 2025.

Only about half of middle and high schools have policies regarding AI use. Of that share, only 28% permit AI use in some circumstances, while 22% do not. Schools with AI policies, especially policies that allow for AI use in schoolwork, are more likely to be in wealthier and more urban communities. However, the utility of those policies is in question, given their lack of clarity. Only 36% of students described their school’s policies as extremely clear, and 47% have wanted to use AI for schoolwork but were unsure if it was allowed. A teacher survey found even lower ratings for clear policies, with teachers saying that only 6% of their schools had clear, comprehensive policies.

Education Standards, Policies, and Guidance

Most education policy in the United States is determined at the state level. As of January 2026, 30 states have issued guidance on AI in education.
Regarding policy focused specifically on AI education, 17 states have issued guidance that clarifies CS as foundational to AI, and five states have allocated specific professional development funding for AI education, according to the 2025 State of AI + Computer Science Report.

Perhaps the most significant AI education guidance generally comes in the form of standards that define learning outcomes for K–12 students. As of January 2026, 45 states have adopted K–12 CS standards, while five states plus DC do not have such standards. The majority (29 states) include AI but only to a very limited extent, and it is typically restricted to the high school level, similar to the current CSTA K–12 Standards, Revised 2017, which act as the de facto national standards. Ten states’ standards make no specific mention of AI content. Six states have CS standards with significant AI-specific content, and another two states have published draft revised standards that also include significant AI-specific content (Figure 7.3.18). The standards with significant AI content were all adopted in the last few years or are currently under development. See Table 4.1 for a summary of the features and organization of the state standards with significant AI content.
State K–12 AI Guidance in the United States

[Figure 7.3.18: Adoption of AI-specific K–12 computer science standards by US state (map). Categories: CS standards with significant AI-specific content; CS standards with significant AI-specific content (draft); CS standards with minimal AI-specific content; CS standards with no AI-specific content; No CS standards. Source: CSTA, 2025 | Chart: 2026 AI Index report.]

Table 4.1. Features and organization of state standards with significant AI content

State | Year | Earliest grade | Approach | Organization and labeling of AI content
Alabama | 2025 (draft) | 1st grade | Integrated across concepts | An [AI] tag marks standards distributed widely across concepts, most commonly in Data Science > Data Collection and Representation; Impact of Computing > Emerging Technology; and Digital Proficiency > Digital Tools.
Arkansas | 2025 | 9–12 elective | Elective course | AI is the focus and label of one entire level 3 course in the Data Analytics and Machine Learning pathway.
Colorado | 2024 | 2nd grade | Standalone concept | AI is one of six strands used to organize standards.
Florida | 2024 | 1st grade | Integrated across concepts | AI content is organized into two main strands: Technological Impact (grades 1–12) and Emerging Technology (grades 6–12).
North Dakota | 2025 | Kindergarten | Integrated across concepts | AI content is organized into three subconcepts: Computing Devices and Systems > Artificial Intelligence; Algorithms and Computational Thinking > Creating Instructions for AI; and Impacts of Computing > Impacts of AI.
Ohio | 2022 | Kindergarten | Standalone concept | AI is one of six strands. The strand has five topics that align with the AI4K12 Five Big Ideas: Perception, Representation and Reasoning, Machine Learning, Natural Interactions, Societal Impacts.
Virginia | 2024 | 4th grade | Integrated across concepts | AI content is organized into three strands: Computing Systems; Data and Analysis; and Impacts of Computing.
Wisconsin | 2025 (draft) | Pre-K–5 (grade band) | Integrated across concepts | AI content is organized in two subconcepts: Computing and Society > Emerging Technologies; and Data and Analysis > Impacts of Data Science.

Revised CSTA K–12 Standards, slated for release in summer 2026, will delineate significant AI-related learning goals as part of a foundational CS education across grades K–12. To inform these standards, CSTA and AI4K12 convened a national group of educators, curricula developers, professional development providers, and researchers in 2025 to provide insights into identifying priority areas of AI knowledge and skills. Priorities include the human role in creating AI, reasoning, data and machine learning, ethical evaluation of AI systems, and societal impacts. In the new Pre-K–12 foundational standards, these AI priorities will be integrated across five concepts and most significantly organized into four subconcepts: Machine Learning, Impacts of Algorithms, Emerging Technologies, and Humans and Computing. Additionally, there will be two sets of high school specialty standards focused on advanced AI content. Given the high degree of current coherence, most states will likely adopt similar AI standards within the next five years.

An April 2025 Executive Order, Advancing Artificial Intelligence Education for American Youth, sought to define a national strategy for developing AI competency from K–12 through postsecondary education by promoting early student exposure to AI, integrating AI into instruction, and expanding professional learning for educators. It establishes a White House AI Education Task Force to coordinate federal efforts, launch a Presidential AI Challenge, and develop public-private partnerships that deliver K–12 AI resources at scale.
The order also directs the Departments of Education, Labor, and Agriculture and the National Science Foundation (NSF) to prioritize AI in grants, research, teacher preparation, apprenticeships, and workforce pathways.

Federal K–12 AI Guidance in the United States

Highlight: The Implementation Gap in K–12 AI Policy

A research project from Expanding Computing Education Pathways (ECEP) assessed statewide AI policies across K–12 schools and found that state-level AI guidance is largely nonbinding and decentralized. Most states rely on existing federal laws like the Children’s Online Privacy Protection Act (COPPA) and Family Educational Rights and Privacy Act (FERPA), rather than issuing AI-specific mandates. The responsibility for local policy development, tool vetting, and implementation falls on local education agencies, meaning the rigor of AI education and pace of adoption are determined by local capacity (e.g., number of trained teachers, funding) and decision-making.

Teacher preparation was also an identified gap. State-level documents recognize the importance of AI-related teacher training, but there are currently no state-level standards for programs or funding. Without steady financial backing or standardized training benchmarks, the quality of AI integration remains contingent on local resources. At the same time, AP Computer Science, one of the most common pathways to advanced CS coursework in U.S. high schools, does not include AI-specific content. Policy guidance, teacher training, and curriculum would all need to align for AI education to reach students consistently, and at present, gaps remain in all three.
Despite the widespread mention of AI education in national education strategy plans over the past few years, few countries actually implemented AI education in 2025; it was more common for countries to integrate AI technology into education. For example, South Korea launched AI textbooks in primary schools in March 2025, only to reverse course a few months later due to parent and teacher pushback. In Greece, the government partnered with OpenAI to train secondary teachers to use ChatGPT in the classroom. And in Estonia, the AI Leap 2025 program is piloting access to AI learning applications with 20,000 students and 3,000 teachers during the 2025–26 school year.

Two countries, however, made significant strides in implementing AI education: China and the United Arab Emirates (UAE). In China, Beijing, Guangdong, and Hangzhou all began requiring AI education in the 2025–26 school year following the release of China’s General AI Education Guide for Primary and Secondary Schools (2025 Edition) and Guide for the Use of Generative AI by Primary and Secondary Students (2025 Edition) in May 2025. All three areas have similar requirements, including a minimum number of instructional hours and curriculum that progresses through grade levels, starting with elementary students learning AI literacy skills and ending with high school students designing AI systems. The UAE similarly mandated AI education for all grade levels starting in the 2025–26 school year. Students will also progress through a grade-level curriculum that includes skills in foundational concepts, data and algorithms, software use, innovation and project design, and ethical awareness.

In lieu of widespread data on AI education, we again present data on CS education, especially since some AI content may be taught in CS classes.
Similar to the challenges inherent in tracking CS education in the United States, caution is called for when interpreting global metrics because CS and ICT education are sometimes conflated with digital or computer literacy. In 2025, approximately 93% of the world’s countries taught CS (Figure 7.3.19). Thirty percent of countries mandate CS education in either primary or secondary school, while 63% have CS available in at least some schools but do not mandate it. Nearly three-fourths of countries integrate CS concepts into other courses, such as math and science. Access to standalone CS classes often varies by school type and geography, with private and urban schools more likely to offer CS than public and rural schools. This indicates that resources and infrastructure continue to be challenges for schools seeking to expand students’ digital skills.

Global AI and CS Education

[Figure 7.3.19: Availability of CS education by country, 2025 (map). Categories: CS mandatory in primary and/or secondary school; CS available everywhere as an elective course; CS courses available in some schools or districts; Government announced CS expansion, plan or pilot; No online evidence of in-school CS. Source: AI Index, 2025; Raspberry Pi Computing Education Research Centre, 2024 | Chart: 2026 AI Index report.]

7.4 AI Skill Acquisition

Formal education is one entry point into AI, but as the technology reshapes jobs across sectors, upskilling and reskilling have become central to lifelong learning. Many people are building AI skills through professional certificates, online courses, and on-the-job experience, pathways that can also broaden access for learners without deep CS or math backgrounds. This section examines where AI skills are concentrated globally and how quickly they are spreading.

LinkedIn’s relative AI skill penetration rate measures how prominently AI skills feature in people’s profiles in a given country compared with a global average (Figure 7.4.1). India leads at 3.0, meaning AI skills appear in member profiles at almost three times the global average, followed by the United States at 2.0 and Germany at 1.8. However, these countries also show a persistent gender gap when measuring male and female AI skill penetration rates against the global average (Figure 7.4.2). In India, men list AI skills at more than 1.5 times the rate of women (3.1 vs. 1.9); in the United States, the gap is similar, though slightly narrower (2.1 vs. 1.4).
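The gender-gap comparison above is a ratio of two relative penetration rates, each already expressed against a global average of 1.0. A minimal sketch of that arithmetic, using only the rounded rates quoted in the text; the dictionary layout and function name are illustrative and are not LinkedIn's methodology.

```python
# Relative AI skill penetration rates quoted in the text (global average = 1.0).
penetration = {
    "India": {"male": 3.1, "female": 1.9},
    "United States": {"male": 2.1, "female": 1.4},
}


def male_to_female_ratio(rates: dict) -> float:
    """How many times more prominently men list AI skills than women."""
    return rates["male"] / rates["female"]


for country, rates in penetration.items():
    print(f"{country}: {male_to_female_ratio(rates):.2f}x")
# prints:
# India: 1.63x
# United States: 1.50x
```

With these rounded inputs, the ratio is about 1.6 for India and 1.5 for the United States, consistent with the description of a similar but slightly narrower U.S. gap.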
AI Skill Penetration

[Figure 7.4.1: Relative AI skill penetration rate by geographic area, 2015–25. Source: LinkedIn, 2025 | Chart: 2026 AI Index report. Values: India 2.95; United States 2.02; Germany 1.83; United Kingdom 1.55; Canada 1.54; France 1.53; Brazil 1.48; Spain 1.47; Singapore 1.43; Israel 1.38; United Arab Emirates 1.37; Turkey 1.28; Italy 1.22; Netherlands 1.14; Poland 1.14.]

[Figure 7.4.2: Relative AI skill penetration rate across gender, 2015–25. Source: LinkedIn, 2025 | Chart: 2026 AI Index report. Values (male / female): India 3.05 / 1.94; United States 2.13 / 1.38; Germany 1.93 / 1.06; Canada 1.63 / 0.95; United Kingdom 1.55 / 0.94; France 1.52 / 0.87; Turkey 1.24 / 0.85; Israel 1.72 / 0.85; Spain 1.64 / 0.83; Singapore 1.70 / 0.83; Saudi Arabia 0.89 / 0.78; United Arab Emirates 1.38 / 0.72; Italy 1.31 / 0.69; Brazil 1.57 / 0.66; Netherlands 1.27 / 0.65.]

The AI Skills Diffusion Index that LinkedIn introduced this year tracks how much AI skills adoption has grown within a country relative to its own baseline, rather than current relative prevalence (Figure 7.4.3). This measure also accounts for the diversity of AI skills, distinguishing between AI engineering skills, which relate to building and deploying AI systems, and AI literacy skills, which reflect a familiarity with AI-enabled tools.

Across many of the countries in the sample, both AI literacy and engineering show recent increases, but the pace differs. AI literacy skills show steeper growth, while engineering-oriented skills have increased more modestly. This is the case for India and the United States. However, countries such as the United Arab Emirates, Chile, and South Africa show rapid growth in AI engineering skills.
In the United States, the fastest growing literacy skills were AI prompting and Microsoft Copilot Studio, while the fastest growing engineering skills were AI agents, AI productivity, and AI strategy (Figure 7.4.4).

AI Skills Diffusion

[Figure 7.4.3¹⁵: AI Skills Diffusion Index by geographic area, 2016–25 (one panel per country, with separate lines for AI literacy and AI engineering). Panels: Argentina, Australia, Austria, Belgium, Brazil, Canada, Chile, Costa Rica, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hong Kong, Iceland, India*, Indonesia, Ireland, Israel, Italy, Latvia, Lithuania, Luxembourg, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Romania, Saudi Arabia, Singapore, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Turkey, United Arab Emirates, United Kingdom, United States*, Uruguay. Source: LinkedIn, 2025 | Chart: 2026 AI Index report.]

¹⁵ Asterisks indicate that a country’s y-axis label is scaled differently than other countries’.

Figure 7.4.4. Fastest growing AI skills in the United States, 2025 (Source: LinkedIn, 2025)

Rank | AI engineering skills | AI literacy skills
1 | AI agents | AI prompting
2 | AI productivity | Microsoft Copilot Studio
3 | AI strategy | GitHub Copilot
4 | Amazon Bedrock | Prompt engineering
5 | Large language model operations (LLMOps) | Microsoft Copilot

Policy and Governance

Chapter 8, AI Index Report 2026

8.1 Major Global AI Policy News in 2025

January 23: US Executive Order ‘Removing Barriers to American Leadership in AI’
The U.S. issues an executive order rescinding earlier AI directives and establishing a new policy to enhance U.S. AI dominance, promote innovation, and remove regulatory barriers.

February 1: UK to Criminalize AI Tools Used to Generate Child Sexual Abuse Imagery
The United Kingdom positions itself as the first country to introduce laws against artificial intelligence tools used to generate sexualized images of children.

February 2: 1st Measures of the EU AI Act Come Into Effect
The EU’s landmark AI regulation takes effect in its first phase, banning practices classed as unacceptable risk (e.g., certain predictive policing and emotion recognition uses) and setting the stage for stricter rules.

February 11: AI Action Summit: US and UK Refuse to Sign Inclusive AI Declaration
At the 2025 Paris AI Action Summit, the U.S. and UK decline to endorse a declaration signed by 60 countries on inclusive, ethical AI, signaling divergence in governance approaches.

March 14: China Finalizes Mandatory Labeling Rules for AI-Generated Content
Chinese regulators issue final rules requiring clear labeling of AI-generated and synthetic media, with phased implementation beginning later in the year.

March 24: Zimbabwe Partners With Nvidia to Launch Africa’s 1st AI Factory
Cassava Technologies, founded by Zimbabwean billionaire Strive Masiyiwa, announces a partnership with Nvidia to establish the continent’s first dedicated “AI factory.”

March 25: Utah Enacts the Mental Health Chatbot Act (HB 452)
The bill establishes provisions for regulating mental health chatbots that use artificial intelligence technology. It mandates disclosure of AI use, bans advertising within the chat, and prohibits sharing users’ personal data.

April 3: Kigali Summit Highlights Africa’s AI Opportunity and Labor Risks
Thousands of delegates convene in Kigali for the inaugural Global AI Summit on Africa to explore how the continent can harness AI for development while mitigating potential disruptions to labor markets.

April 16: Montana Enacts the Right to Compute Act (SB 212)
The law establishes a pro-innovation legal framework for AI that protects Montanans’ rights to own and use computational resources for lawful AI activities without undue government restriction.

May 17: Africa Declares AI a Strategic Priority: Investment, Inclusion, and Innovation
The African Union region identifies AI as a central strategic priority, emphasizing inclusion, startup funding, and narrowing the digital divide.

May 19: US Enacts the Take It Down Act, Targeting Nonconsensual Intimate Imagery, Including AI Deepfakes
The law is designed to address the distribution of nonconsensual intimate imagery; it explicitly covers deepfake content and strengthens removal and accountability expectations.

June 17: Gavin Newsom–Commissioned California AI Policy Report Warns of ‘Irreversible Harms’
California’s state-level report highlights AI threats including biological and nuclear misuse and proposes safety, transparency, and whistleblower frameworks, potentially offering a national blueprint in the absence of federal law.

June 17: G7 Issues Joint Statement Reaffirming Cooperation on AI Governance
G7 leaders release a joint declaration committing to coordination on AI safety, risk management, and standards for advanced AI systems.

June 22: Texas Signs the Responsible Artificial Intelligence Governance Act (TRAIGA) Into Law
Passed in June 2025 and taking effect in 2026, the state law sets strict rules for high-impact AI, including bans on uses that incite harm, violate constitutional rights, or discriminate against protected classes.

July 1: US Senate Strikes 10-Year Federal Moratorium on State AI Regulation
The Senate removes a proposed federal ban on state-level AI regulation from a major spending bill, allowing states to proceed with their own AI oversight laws.

July 10: EU Releases Voluntary Code of Practice for General-Purpose AI
The European Commission publishes a code to guide businesses in complying with the upcoming EU AI Act rules for general-purpose models, covering transparency, copyright, and safety.

July 23: US Launches ‘America’s AI Action Plan’ and 3 Executive Orders
The White House releases a broad AI strategy covering innovation, infrastructure, and diplomacy, plus executive orders on data centers, exports, and government procurement.

July 26: China Announces Action Plan for Global AI Governance
At the 2025 World AI Conference, China’s Premier Li Qiang unveils a 13-point road map to advance global AI coordination and standards.

July 30: Creators’ Organizations Condemn EU AI Act Implementation
A coalition of 38 global creative-industry bodies issues a joint statement criticizing the EU’s AI Act as undermining cultural rights and favoring model developers.

August 2: EU’s General-Purpose AI Obligations Take Effect
Under the EU AI Act, obligations for providers of general-purpose AI models begin to apply, requiring risk assessments, transparency disclosures, and mitigation measures for systems with systemic risk.

August 26: UN Launches Global Scientific Panel and Global Dialogue on AI Governance
The UN General Assembly approves the creation of an Independent International Scientific Panel on AI and a Global Dialogue on AI Governance to provide coordinated scientific guidance and facilitate international cooperation on AI regulation.

September 17: Italy Becomes First EU Member State to Pass an AI Law
Italy advances national AI legislation intended to complement EU-level regulation, reflecting member-state moves to define institutional roles and national implementation.

September 29: California Enacts Landmark AI Safety/Transparency Law (SB 53)
California Gov. Newsom signs SB 53, requiring large AI-model developers to disclose safety protocols and incident reports and to protect whistleblowers.

August 26: AI Companion Lawsuits Prompt Renewed Scrutiny of Emotional AI Safeguards
Following a widely reported teen suicide linked to interactions with an AI companion, U.S. lawmakers and regulators increase scrutiny of AI companion systems and child-safety safeguards.

October 8: European Commission Launches the European Strategy for AI in Science
The Commission announces a strategy to reinforce Europe’s technological and scientific leadership and competitiveness by harnessing the potential of AI technologies in science and supporting scientists in adopting them in their research. The strategy contributes to the AI Continent Action Plan and was presented alongside the Apply AI Strategy, which aims to speed up AI adoption in key business and industrial sectors.

October 13: California Enacts New AI Laws
California enacts multiple AI-related bills: SB243, regulating companion bots; AB853, requiring gen AI developers to ensure their tools’ content includes provenance data; and AB621, extending existing state law on nonconsensual deepfakes.

November 24: US Executive Order Launches the Genesis Mission
An executive order launches the Genesis Mission, a major national initiative to accelerate scientific discovery and technological innovation using artificial intelligence. The mission, compared in ambition to the Manhattan Project, tasks the Department of Energy with leading the effort.

December 11: US Executive Order Seeks to Curb State-Level AI Laws
A U.S. executive order aims to limit or preempt state AI regulation in order to “enhance the United States’ global AI dominance through a minimally burdensome national policy framework for AI.”

December 12: Launch of Pax Silica: US-led AI and Technology Supply-Chain Cooperation Initiative
At a summit convened by the U.S. Department of State, the Pax Silica Declaration is signed by multiple countries to strengthen trusted technology and AI-relevant supply chains, spanning semiconductors, data infrastructure, and AI hardware cooperation.

November 5: UNESCO Adopts Global Standards on Neurotechnology
UNESCO approves international standards covering AI-driven neurotechnology, including “neural data” rights, mental privacy, and emerging regulation.

As AI becomes more central to economic development and national competitiveness, more governments are moving to formalize their approach through national AI strategies. This section draws and builds on data from Oxford Insights to track the adoption and geographic expansion of formal national strategies over time. The dataset captures what has been published, rather than how or how well it has been implemented, so results should be viewed as a measure of policy intent rather than actual progress.

8.2 National AI Strategies

By Geographic Area

More countries adopted national AI strategies in 2024 and 2025, especially among emerging economies (Figure 8.2.1). This marks a shift in AI governance from earlier years, as countries that have historically played a smaller role in AI policymaking are now putting formal national strategies in place. New frameworks have surfaced across sub-Saharan Africa (such as Ethiopia, Ghana, and Nigeria), South and Central Asia (notably Sri Lanka and Nepal), and Latin America and the Caribbean (including Costa Rica and Jamaica). With strategies already under development in Mexico and South Africa, this trend underscores AI policy’s increasing global reach.

High-income economies continue to contribute new strategies as well, though at a slower pace and with a focus on consolidating earlier frameworks. European countries, such as Malta, have released updated strategies to align with EU AI Act requirements.

As more countries adopt national AI strategies, there is a rising consensus that AI serves as a lever to bolster state capacity. International cooperation, technical assistance, and policy diffusion also play an important role in this expansion. However, the next challenge is implementation and strengthening regulatory capacity, particularly in Africa, where many countries still lack formal strategies and risk falling behind in AI governance and readiness.

Figure 8.2.1: Countries with a national strategy on AI. Map legend: year of adoption (2019, 2020, 2021, 2022, 2023, 2024, 2025); In development; No strategy; NA. Source: Oxford Insights, 2024; AI Index, 2026 | Chart: 2026 AI Index report.

8.3 AI Sovereignty

As AI technologies become increasingly central to geopolitics and statecraft, and more countries articulate national strategies, attention has shifted to issues of control, capacity, and dependence across the AI stack. In policy terms, AI sovereignty describes a state’s capacity to act deliberately and make independent decisions over the development, deployment, and governance of AI systems within its jurisdiction and, in some cases, beyond it through standards, trade, and regulation. As AI systems have become more central to economic policy, national security, global trade, and cultural autonomy, sovereignty debates have expanded beyond data and infrastructure to include other parts of the AI stack, including compute, model development, talent, and responsible AI deployment. Many of these debates build on earlier discussions of digital and technological sovereignty, which focused on government authority over digital infrastructure, data flows, capabilities, and technology supply chains. Today, governments are pursuing a range of approaches across these layers, including investments, procurement policies, regulatory measures, international partnerships, and supply-chain strategy. This section draws on data from Epoch AI, Zeki, and Brookings to examine how sovereignty dynamics are evolving across compute infrastructure, data, models, applications, and talent.

Infrastructure Sovereignty

Domestic AI computing infrastructure, including high-performance GPU clusters and AI-optimized supercomputers, has become one of the most visible areas of AI sovereignty investment. In policy discussions, domestic compute capacity is often framed around reducing reliance on foreign providers, limiting exposure to extraterritorial jurisdiction, and providing continuity of access for government agencies, research institutions, and domestic firms in scenarios such as export controls, geopolitical disputes, and supply-chain disruptions. In this context, the scale and availability of state-owned or state-backed AI supercomputing facilities is increasingly used as an indicator of “compute sovereignty,” alongside related measures such as domestic access to advanced chips, cloud capacity, and the governance arrangements that determine who can use these resources and for what purposes.

Based on data from Epoch AI tracking large-scale GPU clusters used for training advanced AI models,[1] state-backed AI supercomputing expanded across most regions between 2010 and 2025 (Figure 8.3.1). The sharpest acceleration was in Europe and Central Asia, where the number of clusters grew from 3 to 44 between 2018 and 2025, largely driven by coordinated initiatives such as the European High Performance Computing Joint Undertaking (EuroHPC JU). North America grew nearly sevenfold over the same period to reach 41 clusters, a substantial expansion given its comparatively high baseline. This growth reflects a policy shift toward dedicated national AI research infrastructure, including through the U.S. National AI Research Resource (NAIRR). East Asia and the Pacific (excluding China) grew about fourfold. By contrast, South Asia, the Middle East and North Africa, and Latin America and the Caribbean each only doubled or tripled, reaching 2, 3, and 8 clusters, respectively, by 2025. Several initiatives to expand capacity are already underway in these regions, though planned systems are not included here as they are subject to change and carry inherently lower confidence.

Although privately owned clusters account for most large-scale AI compute capacity globally, state-owned and public-private clusters have also grown steadily across most regions. In practice, this distinction is blurred because many private clusters remain accessible to public sector actors through commercial cloud services. Public-private partnerships can involve both domestic and international actors (Figure 8.3.2). OpenAI’s Stargate project, for example, extends beyond the United States through country-level partnerships across regions, including the United Arab Emirates, the United Kingdom, Argentina, South Korea, India, and Norway. Nvidia’s “AI Factory” is a different approach in which in-country compute capacity is typically built in partnership with domestic telecommunications providers, a model that has expanded rapidly by catering to governments’ sovereign AI ambitions. These initiatives illustrate how private firms are playing an increasingly central role in building what many governments designate as national AI infrastructure.

[1] Because the underlying dataset captures frontier AI training capacity rather than the full universe of compute resources, the figure should be interpreted as a proxy for sovereign high-end AI compute infrastructure rather than a comprehensive measure of national compute capacity.
Figure 8.3.1: Number of public or public-private AI supercomputers, 2010–25. Y-axis: cumulative number of AI supercomputers. Values in 2025: China 85; Europe and Central Asia 44; North America 41; East Asia and Pacific 27; Latin America and Caribbean 8; Middle East and North Africa 3; South Asia 2. Source: Epoch AI, 2026 | Chart: 2026 AI Index report.

Figure 8.3.2: Countries with publicly announced Nvidia or OpenAI infrastructure initiatives, 2025. Map legend: Nvidia and OpenAI; Only Nvidia AI Factory; Only OpenAI Stargate; NA. Source: Stanford HAI, 2026 | Chart: 2026 AI Index report.

Data Sovereignty

While infrastructure sovereignty focuses on control over compute resources, data sovereignty concerns the extent to which states or local actors have agency over how their data is collected, stored, processed, and transferred. One common approach is to adopt data localization measures[2] that require certain categories of data to remain within national borders or that impose restrictions on cross-border data transfers. As AI systems grow increasingly dependent on vast, diverse datasets, data sovereignty has emerged as a central dimension of the broader AI sovereignty debate.

Data localization measures have increased across nearly every region since 2000 (Ferracane et al., 2025; Figure 8.3.3). The steepest rise in adoption begins around 2016, coinciding with the implementation of GDPR in Europe and the subsequent “Brussels Effect,” whereby other nations adopted similar frameworks. Regional patterns fall into three broad clusters: high-localization regions led by East Asia and Pacific (77 measures), followed closely by sub-Saharan Africa (71) and Europe and Central Asia (66); moderate-localization regions including the Middle East and North Africa (44), Latin America and the Caribbean (36), and South Asia (24); and North America, which remains a striking outlier at just 3 measures, reflecting a long-standing “flow-first” policy orientation.[3]

[2] While there is no single official definition, data localization measures are broadly understood as explicit requirements that data be stored and/or processed within a domestic territory, encompassing both outright storage mandates and conditional restrictions on cross-border transfers (López González et al., 2022).

[3] This pattern is well documented (World Bank, 2025), reflecting a general tendency, particularly within the U.S., to favor free data flows from which its firms disproportionately benefit. As one example, U.S. diplomats have recently been tasked with pushing back against other countries’ data sovereignty initiatives (Reuters, 2026). At the same time, emerging restrictions such as those on the transfer of bulk sensitive personal data to “countries of concern” signal a growing willingness to impose targeted controls.
Figure 8.3.3: Data localization measures by region, 2000–24. Y-axis: cumulative number of data localization measures. Values in 2024: East Asia and Pacific 77; Sub-Saharan Africa 71; Europe and Central Asia 66; Middle East and North Africa 44; Latin America and Caribbean 36; South Asia 24; North America 3. Source: Ferracane et al., 2026 | Chart: 2026 AI Index report.

Model Sovereignty

Model sovereignty concerns a state’s capacity, influence, and control over the development and deployment of AI models. As discussed in Chapter 1, advanced AI model development has historically been concentrated in a small number of technology hubs, primarily in the United States and China. While that concentration persists, open-source frameworks have lowered barriers to entry, and a growing number of regions are building and deploying their own models (Figure 8.3.4). This trend reflects a growing emphasis on localizing model development even where countries can deploy U.S.- or Chinese-made models.

Figure 8.3.4: Number of AI models released by region, 2018–25. Y-axis: cumulative number of AI models released. Values in 2025: United States 1,618; China 849; Europe and Central Asia 666; East Asia and Pacific 330; North America 125; Middle East and North Africa 74; South Asia 21; Latin America and Caribbean 2. Source: Epoch AI, 2026 | Chart: 2026 AI Index report.

Based on Epoch AI data tracking publicly reported model releases, cumulative U.S. model releases grew from 237 to 1,618 between 2018 and 2025. China exhibits a similar acceleration between 2022 and 2025, with model releases more than quintupling from 151 to 849, suggesting a rapid scaling of domestic capabilities and intensified competition with U.S. model development. These figures reflect the full range of publicly documented model releases tracked by Epoch AI, including smaller and less prominent ones. This differs from the notable model dataset used in Chapter 1, which applies narrower criteria such as state-of-the-art performance and high citation counts. The two datasets can show different year-over-year patterns because the broader count here is more representative of the expanding base of model development, while the subset in Section 1.1 of Chapter 1 is more sensitive to changes at the frontier.

Europe and Central Asia show steady growth, increasing from 127 to 666 models over the same period, with the United Kingdom (229 models) and France (141) as leading contributors, while Canada (captured by the North America region) trails in fifth place with 125 models. East Asia and the Pacific (excluding China) grew from 39 to 330 models by 2025, while the Middle East and North Africa, South Asia (largely driven by India), and Latin America and the Caribbean reached only 74, 21, and 2 models, respectively. Several of these regions are beginning to champion national or regional model initiatives, such as Chile’s Latam-GPT, the UAE’s Falcon series, and Singapore’s SEA-LION, though their overall footprint remains limited. These figures should also be interpreted as conservative estimates, as model documentation and reporting are less systematic in these regions. The growing ecosystem of smaller and language-specific models in sub-Saharan Africa, for example, is not represented at all.[4]

[4] AfriBERTa (Ogueji et al., 2021), AfriTeVa (Jude Ogundepo et al., 2022), AfroLM (Dossou et al., 2022), EthioLLM (Tonja et al., 2024b), EthioMT (Tonja et al., 2024c), and AfroXLMR (Alabi et al., 2022).
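The growth multiples cited for model releases can be checked directly from the counts quoted in the text. The following is a minimal sketch, not part of the report's methodology: the dictionary and function names are illustrative assumptions, and the start year behind each earlier count differs by region as described above.

```python
# Cumulative model-release counts as quoted in the text: (earlier count, 2025 count).
# QUOTED_COUNTS and growth_multiple are illustrative names, not part of any dataset API.
QUOTED_COUNTS = {
    "United States": (237, 1618),
    "China": (151, 849),
    "Europe and Central Asia": (127, 666),
    "East Asia and the Pacific (excl. China)": (39, 330),
}


def growth_multiple(start: int, end: int) -> float:
    """Return how many times larger the end count is than the start count."""
    if start <= 0:
        raise ValueError("start count must be positive")
    return end / start


if __name__ == "__main__":
    for region, (start, end) in QUOTED_COUNTS.items():
        multiple = growth_multiple(start, end)
        print(f"{region}: {start} -> {end} ({multiple:.1f}x)")
```

For China, 849 divided by 151 is a little over 5.6, which is consistent with the statement that releases "more than quintupled."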
Overall, model production remains concentrated, with the United States and China accounting for a disproportionate share of global activity. At the same time, complementary evidence from AI-related GitHub activity suggests that open-source development is diffusing more broadly across regions, even as significant asymmetries in scale and capability persist (see Section 1.5 of Chapter 1 for full analysis).

Application Sovereignty

A fourth dimension of AI sovereignty concerns the capacity, agency, and control over the downstream deployments of AI systems within a nation’s public and private sectors. Application-level sovereignty encompasses domestic procurement policies; sector-specific regulatory requirements in domains such as health, finance, and defense; and the Digital Public Infrastructure (DPI) upon which AI applications increasingly sit. Together, these determine how much a country can shape the AI systems with which its institutions and citizens interact.

Public investment in AI-related contracts and grants offers one observable signal of how governments are implementing this form of sovereignty in practice (see Section 8.5 for a full analysis of public AI investment trends across the U.S. and Europe). Beyond public investment, however, comprehensive cross-country data on sovereignty-oriented AI procurement preferences, sectoral deployment mandates, and DPI utilization for AI remains limited, reflecting both the novelty of the concept and the opacity of procurement data across jurisdictions.

Figure 8.3.5

Countries are increasingly concentrating AI investment in domains aligned with institutional strengths and policy priorities (Figure 8.3.5). A small set of countries, most prominently the United States, China, and several European economies (UK, Germany, France), show high-intensity investment across nearly all application categories. Most other countries display concentrated areas of focus, indicating selective investment rather than full-spectrum capability. Within Europe, Germany’s strength is in industrial applications (particularly manufacturing) and Estonia’s is in education technologies. Sub-Saharan African countries show stronger engagement in financial applications, led by South Africa. Latin America presents a more uneven pattern, with Brazil investing broadly while countries such as Chile and Argentina concentrate on more specific domains, including healthcare and agricultural applications, respectively. In the Middle East and North Africa, similar dynamics emerge, with Israel standing out for its specialization in security and defense applications, consistent with its broader positioning as a global cybersecurity hub. Because the application layer is less concentrated than the model or compute layers, it offers more space for countries to develop niche specializations, enabling them to exercise greater autonomy over these systems both domestically and internationally.

Talent Sovereignty

A fifth dimension of AI sovereignty is a nation’s ability to develop and retain the human capital needed to build, deploy, and govern AI systems. Talent sovereignty includes two closely related dynamics: workforce capacity (the domestic stock of AI skills and expertise) and talent mobility (the extent to which countries attract, retain, or lose AI specialists across borders). The country-level distribution and mobility patterns of AI authors and inventors offer a direct window into this dimension and are also discussed in detail in Section 1.8 of Chapter 1. Broader labor market indicators, including AI talent concentration and workforce trends across countries, are examined in Section 4.4 of Chapter 4.

Figure 8.3.6: Inflow and outflow of top AI authors and inventors by region, 2016–25. Panels: East Asia and Pacific; Latin America and Caribbean; North America; Europe and Central Asia; India; Middle East and North Africa; United States. Series: Inflow, Outflow. Y-axis: inflow and outflow of top AI authors and inventors (12-month rolling avg.). Source: Zeki Data, 2025 | Chart: 2026 AI Index report.

Cross-border AI talent circulation has slowed recently, even where net flows remain stable (see Section 1.8 of Chapter 1). Both inflows and outflows are declining, suggesting that talent is increasingly staying within national or regional systems rather than circulating globally (Figure 8.3.6). The United States is currently the primary global attractor of top AI talent, though its lead is rapidly narrowing. By contrast, India is transitioning from a net exporter to a net absorber of talent. The near mirror-image relationship between the two countries reflects the well-documented fact that the U.S. has been the main destination for Indian AI talent. At the same time, the Middle East and North Africa are making incremental gains, a sign that new talent hubs are emerging with the support of targeted policy and investment.

8.4 AI and Policymaking

Legislative activity is a signal of how governments are responding to AI beyond national strategies. This section tracks AI-related bills passed across G20 countries, drawing on data from Digital Policy Alert. The dataset covers enacted legislation, not proposed or pending bills. Counts should be interpreted with caution, as they may understate the actual volume of AI-related policymaking: large omnibus bills that contain multiple AI provisions are counted as a single piece of legislation. Volume is also not a measure of significance, as a single major law can carry more impact and enforcement weight than dozens of narrower ones.

[5] Numbers may differ from what was reported in the 2025 AI Index report, as the methodology has changed. In addition, the following charts focus only on G20 countries.

Global Legislative Records on AI

In 2016, there were no AI-related laws on record among G20 countries. Since then, legislative activity has been on the rise, though the total number of laws passed varies widely by country (Figures 8.4.1 through 8.4.3). Between 2016 and 2025, the United States passed the most AI-related bills, 25 in total, followed by South Korea with 17. Japan, France, and Italy were also relatively active, with each passing 9 to 10 laws. Over the same period, other countries, such as Russia and Saudi Arabia, passed little, if any, AI-specific legislation. As with other aspects of AI development, such as investment and research output, policymaking is expanding, but unevenly.
Figure 8.4.1: Number of AI-related bills passed into law in G20 countries, 2016–25. Y-axis: number of AI-related bills passed. Labeled value: 11. Source: AI Index; Digital Policy Alert, 2026 | Chart: 2026 AI Index report.

Figure 8.4.2: Number of AI-related bills passed into law in G20 countries, 2016–25. Map legend: 0; 1–5; 6–10; 16–30; Non-G20. Source: AI Index; Digital Policy Alert, 2026 | Chart: 2026 AI Index report.

Figure 8.4.3: Number of AI-related bills passed into law in G20 countries, 2016–25 (sum). United States 25; South Korea 17; France 10; Japan 10; Italy 9; United Kingdom 6; Germany 5; Russia 5; Brazil 3; Australia 2; Argentina 1; Canada 1; China 1; India 1. Source: AI Index; Digital Policy Alert, 2026 | Chart: 2026 AI Index report.

Highlight: A Closer Look at Global AI Legislation

Country: United States
Bill name: Take It Down Act (Tools to Address Known Exploitation by Immobilizing Technological Deepfakes on Websites and Networks Act)
Description: This act criminalizes the nonconsensual distribution and disclosure of intimate visual depictions, including AI-generated or digitally altered deepfakes. It mandates that online platforms remove reported content within 48 hours and strengthens enforcement through federal criminal penalties and a civil cause of action allowing victims to sue for damages.

Country: Italy
Bill name: Provisions and Delegated Powers to the Government on the Matter of Artificial Intelligence (Law No. 132/2025)
Description: This law establishes a national framework for responsible, transparent, and human-centered AI in Italy, aligned with the EU AI Act (Regulation 2024/1689). It sets principles and governance directions for AI strategy and sectoral deployment, and addresses user safeguards, copyright-related issues, and penalty provisions for specified unlawful uses.

Country: Japan
Bill name: Act on the Promotion of Research and Development and Utilization of Artificial Intelligence–Related Technology
Description: This law provides a national framework to promote AI research, development, and utilization by defining responsibilities for the national government, local governments, research institutions, and businesses. It mandates an AI Basic Plan and establishes a prime minister–led AI Strategy Headquarters to coordinate cross-government policy and international cooperation.

Country: South Korea
Bill name: Framework Act on the Development of Artificial Intelligence and the Creation of a Foundation for Trust
Description: This law sets South Korea’s overarching policy and governance framework for advancing artificial intelligence while building conditions for trustworthy and ethical AI. It provides the primary statutory basis for government-led AI strategy, coordination, and support measures, and for developing standards and safeguards intended to promote public trust alongside industrial innovation.

US Legislative Records

The United States has passed more AI-related laws since 2016 than any other G20 country, making it a useful case to examine how AI is being addressed through domestic policy channels. Within the U.S., AI has become a cross-cutting policy issue that touches on governance, national security, public services, and individual rights. This section tracks AI-related bills enacted at the state level as well as the composition of witnesses at congressional AI hearings and federal regulatory activity.

State Level

State legislatures have been the most active venue for AI policymaking in the United States, particularly in the absence of comprehensive federal AI legislation.
Across all states, the total number of AI-related bills passed increased from fewer than 10 in 2020 to 150 in 2025 (Figures 8.4.4 to 8.4.6). However, a small number of states accounted for a disproportionate share of legislation enacted in 2025. California enacted 20 AI-related bills, double the total in New York (10) and two-thirds more than Texas (12), the second most active state. Over the full 2016–25 period, California enacted 62 bills, more than double the total of any other state. Maryland (28), Virginia (25), and Utah (24) also had records that reflect consistent activity across multiple legislative cycles. Two states, Missouri and Rhode Island, have not enacted any AI-related legislation to date.

Figure 8.4.4: Number of AI-related bills passed into law by all US states, 2016–25 (2025 value: 150). Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.4.5: Number of AI-related bills passed into law in select US states, 2016–25: California 62, Maryland 28, Virginia 25, Utah 24, New York 18, Texas 17, Illinois 15, North Dakota 12, Washington 12, Colorado 11, Florida 11, Massachusetts 11, Alabama 9, Arizona 9, Michigan 9. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.4.6: Number of state-level AI-related bills passed into law in the United States by state, 2016–25 (sum): AL 9, AK 1, AZ 9, AR 4, CA 62, CO 11, CT 5, DE 2, FL 11, GA 3, HI 4, ID 4, IL 15, IN 5, IA 5, KS 2, KY 3, LA 6, ME 3, MD 28, MA 11, MI 9, MN 5, MS 9, MO 0, MT 8, NE 3, NV 4, NH 7, NJ 5, NM 6, NY 18, NC 7, ND 12, OH 3, OK 2, OR 5, PA 4, RI 0, SC 1, SD 2, TN 7, TX 17, UT 24, VT 7, VA 25, WA 12, WV 7, WI 3, WY 1. Source: AI Index, 2026 | Chart: 2026 AI Index report

6 The state tracker counts only bills whose final enacted text includes the phrase “artificial intelligence,” based on searches conducted across all 50 state legislative websites. While this filter makes the dataset consistent, it likely captures only a portion of the broader state-level policy picture.

Highlight: State AI Legislation Amid Shifting Federal Policy

While federal AI policy shifted toward deregulation in 2025, state legislatures continued to move ahead with AI-specific laws on their own in the absence of a federal framework. State policies are developing across different tracks, including targeted protections against discrimination, misinformation, and abuse. Several of the most prominent state actions, including Utah’s Mental Health Chatbot Act, Montana’s Right to Compute Act, and Texas’ Responsible AI Governance Act, are also listed in the timeline at the start of this chapter.

Some states pursued broader frameworks to gain more comprehensive oversight of AI use. Colorado’s Artificial Intelligence Act, signed in May 2024, was among the first state laws targeting algorithmic discrimination in decisions such as hiring, housing, and medical care. However, the law is also a case study in how difficult it can be to operationalize more sweeping regulatory frameworks. In 2025, Colorado attempted to enact amendments narrowing parts of the law but ultimately opted to push key compliance dates back to mid-2026 to allow additional time to consider revisions.

Texas followed a different path with the Responsible Artificial Intelligence Governance Act (HB 149), passed in 2025 and effective January 2026.
While originally touted as the most all-encompassing AI legislation, the final version was significantly scaled back from the original proposal, removing most private-sector obligations and focusing on uses such as behavioral manipulation and the production of child sexual abuse material.

Several states have focused on regulating how AI chatbots interact with consumers, particularly in sensitive settings. Utah’s HB 452, signed in March 2025, addresses mental health chatbots and requires disclosure when users are interacting with AI, prohibits the selling or sharing of personal health data with third parties, and restricts in-chat advertising. California’s SB 243, effective January 2026, requires companion chatbot operators to disclose their AI nature and implement safety protocols related to suicidal ideation, with additional safeguards for minors. Other states, such as Hawaii (SB 640) and Massachusetts (S.243), have proposed treating undisclosed chatbot interactions as unfair and deceptive practices.

Watermarking and provenance requirements have also gained traction. Washington (HB 1170) requires large providers to include provenance data on AI-generated or materially altered media. Illinois (SB 1929) and Florida (HB 369) proposed similar measures, following California’s AI Transparency Act (SB 942), which mandated that large generative AI tools offer watermarking and detection tools at no cost.

However, not all states have moved toward greater oversight. In Montana, SB 212 was signed in April 2025 and established the first state-level “right to compute,” affirming individuals’ and businesses’ rights to own and use computational resources, including AI tools, for lawful purposes.
The Montana law limits government restrictions to those that are “demonstrably necessary and narrowly tailored to fulfill a compelling government interest,” while also requiring deployers of AI-controlled critical infrastructure to maintain risk management policies with human override capabilities.

In December 2025, the White House issued an executive order titled “Ensuring a National Policy Framework for Artificial Intelligence,” which directed the Department of Justice to establish an AI Litigation Task Force to challenge state AI laws in court, instructed the Department of Commerce to identify state laws it considers overly burdensome, and tied some federal funding to states’ willingness to avoid enacting conflicting AI legislation. The order carved out certain areas state legislatures can oversee, including child safety, data center infrastructure, and state government procurement. The upshot is that the future of U.S. state-level regulation on AI remains uncertain.

US Congressional Hearings

Congressional attention to AI, as measured by the number of witnesses appearing in AI-related hearings, has increased almost twentyfold since 2017 (Figure 8.4.7). That growth accelerated after 2022, consistent with the mainstream emergence of generative AI tools in late 2022. Witness counts rose from 18 in 2022 to 131 in 2023 and remained high, at 102, in 2025.

The composition of those witnesses has shifted over time (Figure 8.4.8). Industry’s share of witnesses rose from 13% in the 115th Congress to 37% in the 119th, making it the largest single sector in the data, behind only the catch-all “other” category. This increase is consistent with the sector’s expanding role in overall AI development. As discussed in Chapters 1 and 4, private companies now account for the majority of frontier model releases and infrastructure investment, which may position them both as more relevant sources of technical input and as more active participants in shaping the policy environment in which they operate.
Over the same period, the share of government witnesses fell from 35% to 10%, and academia’s share declined from 26% to 15%. The “other” category, which includes civil society and nonprofit organizations, grew from 26% to 38%.

Figure 8.4.7: Number of witnesses in US congressional AI-related hearings, 2017–25 (2025 value: 102). Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.4.8: Witnesses (% of total) in US congressional AI-related hearings by sector, 2017–25. Source: AI Index, 2026 | Chart: 2026 AI Index report
Congress: 115th, 116th, 117th, 118th, 119th
Academia: 26%, 35%, 19%, 25%, 15%
Industry: 13%, 19%, 28%, 27%, 37%
Government: 35%, 17%, 25%, 15%, 10%
Other: 26%, 30%, 28%, 33%, 38%

Across both the House and the Senate, general AI governance and national security and defense drew the most witnesses between 2017 and 2025, with 113 and 74 total witnesses, respectively (Figure 8.4.9). Overall, the House has been more active than the Senate in most subject areas, including finance and economic policy (36 versus 3 witnesses) and national security (49 versus 25). Health and biomedical AI was the only area where both chambers show the same level of activity (9 witnesses each), suggesting comparable engagement in this domain.
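The shifts in witness composition and the chamber totals quoted above are simple arithmetic, and a short sketch makes them easy to recheck. The figures below are copied from the text; the dictionary and function names are illustrative assumptions and are not part of the AI Index methodology.

```python
# Witness shares by sector (% of total) in the 115th and 119th Congresses,
# as quoted in the text. Names here are hypothetical, for illustration only.
SECTOR_SHARES = {
    "Industry": (13, 37),
    "Government": (35, 10),
    "Academia": (26, 15),
    "Other": (26, 38),
}

# House and Senate witness counts for subject areas quoted in the text, 2017-25.
CHAMBER_COUNTS = {
    "National security and defense": (49, 25),
    "Finance and economic policy": (36, 3),
    "Health and biomedical AI": (9, 9),
}


def point_changes(shares):
    """Return the percentage-point change per sector (119th minus 115th)."""
    return {sector: end - start for sector, (start, end) in shares.items()}


def chamber_totals(counts):
    """Return combined House + Senate witness counts per subject area."""
    return {area: house + senate for area, (house, senate) in counts.items()}


if __name__ == "__main__":
    for sector, delta in point_changes(SECTOR_SHARES).items():
        print(f"{sector}: {delta:+d} percentage points")
    for area, total in chamber_totals(CHAMBER_COUNTS).items():
        print(f"{area}: {total} witnesses")
```

Working in percentage points rather than relative percent change keeps the comparison consistent with how the shares are reported in Figure 8.4.8.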
Figure 8.4.9: Number of witnesses in US congressional AI-related hearings by subject area, 2017–25 (series: House, Senate). Subject areas shown: general AI governance; national security and defense; finance and economic policy; R&D and innovation; government modernization; user risks, rights, and protections; privacy and legal; education and workforce; health and biomedical AI; industry and commerce; transportation and autonomous systems; energy. Source: AI Index, 2026 | Chart: 2026 AI Index report

US Regulations

Federal regulatory activity on AI has grown in recent years, with the number of AI-related regulations increasing from one recorded action in 2016 to 58 in 2025 (Figure 8.4.10). As with witness counts, the sharpest increase came after 2022, and the pace held steady through 2025.

The direction of federal AI policy changed in early 2025 when the Trump administration issued the Initial Rescissions of Harmful Executive Orders and Actions, which revoked a range of executive actions, including the Biden administration’s Executive Order 14110, a framework for safe, secure, and trustworthy AI development and use. That Biden order had anchored a more precautionary federal approach that included reporting requirements for advanced models, guidance on watermarking AI-generated content, and initiatives addressing privacy, civil rights, and workforce impacts. Its reversal was followed by a new executive order, “Removing Barriers to American Leadership in Artificial Intelligence,” which reoriented federal policy toward reducing regulatory constraints and promoting innovation.
Figure 8.4.10: Number of AI-related regulations in the United States, 2016–25 (2025 value: 58). Source: AI Index, 2026 | Chart: 2026 AI Index report

By Agency

The increasing number of AI-related regulations has been driven by a wide set of federal agencies (Figure 8.4.11). The Executive Office of the President has been the most active, issuing regulatory actions every year since 2016 and 28 in 2025 alone. The Commerce Department and the Industry and Security Bureau have also become more active in recent years, consistent with growing attention to export controls and AI supply chain policy. Several agencies, including the Department of Energy, the Department of Education, and the Securities and Exchange Commission, began issuing AI-related regulations in 2023 or later.

Figure 8.4.11

Highlight: A Closer Look at US Federal Regulations

The following section highlights AI-related regulations enacted through federal rules and executive orders during 2025.

Agency: Commerce Department; Industry and Security Bureau
Regulation: Framework for Artificial Intelligence Diffusion
Description: Prior to its rescission by the Trump administration in May 2025, this rule updated U.S. export controls for advanced computing items and certain AI model weights. It created a new classification for specified advanced “closed-weight” AI model weights and revised licensing requirements and review policies for advanced computing integrated circuits and related items. It also expanded and added license exceptions for certain destinations and uses, updated notification procedures, and introduced additional guidance to help identify diversion risks.

Agency: Justice Department
Regulation: Preventing Access to U.S. Sensitive Personal Data and Government-Related Data by Countries of Concern or Covered Persons
Description: This rule implements Executive Order 14117 of February 28, 2024, “Preventing Access to Americans’ Bulk Sensitive Personal Data and United States Government-Related Data by Countries of Concern.” It prohibits or restricts certain data transactions involving countries of concern or covered persons, focusing on specified transfers of Americans’ bulk sensitive personal data and U.S. government–related data. The rule addresses cross-border data transactions that can involve large-scale datasets and establishes limits on when and how such data may be transferred.

Agency: Executive Office of the President
Regulation: Advancing United States Leadership in Artificial Intelligence Infrastructure
Description: This executive order directs the federal government to fast-track the buildout of domestic AI infrastructure, particularly frontier data centers and their supporting energy systems, on federal lands managed by the departments of Defense, Energy, and Interior. It mandates competitive solicitations for private-sector leases on federal sites, paired with requirements for clean energy procurement, robust cybersecurity standards, and high labor standards. The order also streamlines permitting processes, addresses grid interconnection challenges, and calls for international engagement to promote trusted AI infrastructure among U.S. allies, all with the overarching goal of ensuring the United States maintains leadership in frontier AI development.

Agency: Executive Office of the President
Regulation: Removing Barriers to American Leadership in Artificial Intelligence
Description: This executive order sets the foundational U.S. policy of maintaining global AI dominance by dismantling regulations seen as obstacles to innovation. Most notably, it replaces the Biden administration’s 2023 AI executive order on safe and trustworthy AI development and directs agencies to suspend or rescind any actions taken under the earlier order that conflict with the new policy. It also mandates the development of a comprehensive AI Action Plan within 180 days, and requires the Office of Management and Budget to revise existing AI-related guidance memoranda to align with the administration’s pro-innovation, minimal-regulation approach.

Agency: Executive Office of the President
Regulation: Ensuring a National Policy Framework for Artificial Intelligence
Description: This executive order establishes a national AI policy framework to prevent a fragmented patchwork of state regulations from hindering U.S. innovation and global competitiveness. It creates an AI Litigation Task Force under the Attorney General to challenge state AI laws considered overly burdensome or unconstitutional. It also directs federal agencies to evaluate problematic state laws, potentially withhold federal funding from noncompliant states, and develop a uniform federal standard that preempts conflicting state regulations, while preserving state authority over child safety, data centers, and government AI procurement.

Agency: Executive Office of the President
Regulation: Advancing Artificial Intelligence Education for American Youth
Description: This executive order advances artificial intelligence education for American youth by directing the federal government to expand early exposure to AI concepts, integrate AI appropriately into classrooms, and build an AI-ready workforce. It establishes a White House Task Force on AI Education, led by the Office of Science and Technology Policy, to coordinate agency efforts and create a Presidential Artificial Intelligence Challenge that highlights student and educator achievements. The order also prioritizes educator training and research on AI in education, including the use of existing grant programs to support professional development and AI-enabled tools that improve teaching and learning outcomes.

Agency: Executive Office of the President
Regulation: Promoting the Export of the American AI Technology Stack
Description: This executive order launches a coordinated federal initiative to promote the export of “full-stack” American AI technology packages. These packages combine AI-optimized hardware, cloud and networking infrastructure, data pipelines, models, security measures, and targeted applications for selected partner countries and regions. It directs the Department of Commerce to solicit and evaluate proposals from industry-led consortia, designate priority AI export packages, and support them by streamlining access to federal diplomatic and financing tools. The order frames these efforts as essential to sustaining U.S. leadership in AI and reducing global reliance on AI technologies developed by adversaries.

8.5 Public Investment in AI

Public spending on AI reflects how governments are translating national strategies and policy commitments into resource allocation. This section tracks government AI spending across the United States and several European countries, drawing on public contract data in Europe and the U.K., and on contract, grant, and Other Transaction Agreement (OTA) data in the United States. Grant-level data is included for the United States but is not systematically available for European countries and is therefore excluded from the European analysis.

The methodology differs across regions due to differences in data availability. In Europe and the U.K., long-term instruments like Framework Agreements and Dynamic Purchasing Systems⁸ typically report maximum contract ceilings rather than actual spending, and award duration data is often incomplete. Therefore, European and U.S. results are presented separately. For the United States, where transaction-level obligation data is available, the AI Index estimates investment by aggregating only those obligations that occur after AI-related activity first appears in an awarded procedure, while controlling for early de-obligations that could distort trends. This approach preserves time patterns while reducing the risk of overstating historical AI investment.

7 European contract data is drawn from Tenders Electronic Daily (TED). U.K. data is sourced from TED, Find a Tender, Contracts Finder, and its archive. Scotland and Wales spending data is accessed through procurement APIs and the Open Contracting Partnership’s data registry via Kingfisher Collect; Northern Ireland is excluded due to the absence of an API. U.S. contract, grant, and OTA data is drawn from the Federal Procurement Data System (FPDS) API.

8 Framework Agreements and Dynamic Purchasing Systems are two types of multiyear umbrella buying arrangements. A Framework Agreement sets pre-agreed terms and a maximum budget for future purchases, while a Dynamic Purchasing System is a flexible, open list of approved suppliers used for competitions and orders over time.

Total AI Public Investments

Public spending on AI-related contracts has grown across both the United States and the European countries tracked, though the pace and composition of spending vary by country.

United States

Between 2013 and 2024, the United States invested approximately $20.5 billion toward AI-related activities, made up of $15.9 billion in grants, $3.9 billion in contracts, and $650 million in Other Transaction Agreements (Figure 8.5.1). Since 2020, AI-related grant spending has accelerated compared to contracts and OTAs. In 2024, grants accounted for $5.1 billion, 32% of their cumulative total since 2013 (Figure 8.5.2). Award volume follows a similar pattern. Grants account for the majority of AI-related awards, with 22,364 compared to 3,347 contracts and 185 OTAs (Figure 8.5.3). Despite their lower volume, OTAs have a median award value of almost $1 million, far higher than grants ($304,000) and contracts ($150,000).
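The composition of U.S. spending described above can be rechecked with a few lines of arithmetic. The sketch below uses only the rounded dollar figures quoted in the text; the variable and function names are hypothetical, and the results are approximate because the inputs are rounded.

```python
# Cumulative US public AI spending, 2013-24, in billions of USD,
# using the rounded figures quoted in the text.
CUMULATIVE_SPENDING = {"grants": 15.9, "contracts": 3.9, "otas": 0.65}

# Grant spending in 2024 alone, in billions of USD (from the text).
GRANTS_2024 = 5.1


def total_spending(spending):
    """Sum spending across all instruments."""
    return sum(spending.values())


def instrument_shares(spending):
    """Return each instrument's share of the combined total, in percent."""
    whole = total_spending(spending)
    return {name: 100 * value / whole for name, value in spending.items()}


def share_of_cumulative(single_year, cumulative):
    """Return one year's spending as a percentage of the cumulative total."""
    return 100 * single_year / cumulative


if __name__ == "__main__":
    print(f"Combined total: ${total_spending(CUMULATIVE_SPENDING):.2f} billion")
    for name, pct in instrument_shares(CUMULATIVE_SPENDING).items():
        print(f"{name}: {pct:.1f}% of total")
    grants_pct = share_of_cumulative(GRANTS_2024, CUMULATIVE_SPENDING["grants"])
    print(f"2024 grants: {grants_pct:.0f}% of cumulative grant spending")
```

Because the inputs are rounded to one or two decimal places, the combined total may differ slightly from the value plotted in Figure 8.5.1.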
Figure 8.5.1: Cumulative public spending on AI in the United States, 2013–24 (in billions of US dollars; series: grants, contracts, OTAs). Cumulative totals by year, 2013 to 2024: 0.08, 0.21, 0.33, 0.47, 0.74, 1.23, 2.04, 3.28, 5.72, 9.72, 14.59, 20.44. Cumulative values in 2024: grants 15.88, contracts 3.91. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.2: Public spending on AI in the United States, 2013–24 (in billions of US dollars). 2024 values: grants 5.05, contracts 0.69, OTAs 0.12. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.3: US AI-related public spending. Source: AI Index, 2026
Number of awards: grants 22,364; contracts 3,347; OTAs 185
Total (in billions of USD): grants 15.9; contracts 3.9; OTAs 0.7
Median (in thousands of USD): grants 304; contracts 149.9; OTAs 999.7
Average (in thousands of USD): grants 710.1; contracts 1,167.9; OTAs 3,528.8
Total per 100,000 inhabitants (in thousands): grants 4,728.7; contracts 1,172.8; OTAs 195

Geographically, U.S. public AI investment through contracts and OTAs is highly concentrated. Virginia⁹ received $1.09 billion, California $0.67 billion, and Maryland $0.55 billion; together these states accounted for nearly 60% of total contract and OTA spending between 2013 and 2024 (Figure 8.5.4). AI-related grants have been more broadly dispersed. California ($2.37 billion), Massachusetts ($1.3 billion), and New York ($1.15 billion) received the largest allocation, but these top three accounted for less than 16% of the total (Figure 8.5.5). The geographic concentration in contracts and OTAs may reflect proximity to major federal agencies, while the wider distribution of grants is aligned with the broader institutional footprint of federally funded research.

9 Virginia is home to the headquarters of ECS, the top contractor by total awarded value via contracts and OTAs in the U.S.

Figure 8.5.4: Public spending on AI via contracts and OTAs in the United States, 2013–24: AL 168.59M, AK 0.07M, AZ 25.17M, AR 9.42M, CA 673.58M, CO 61.14M, CT 8.34M, DE 1.12M, DC 365.98M, FL 69.22M, GA 33.55M, HI 9.21M, ID 6.14M, IL 46.65M, IN 7.12M, IA 2.01M, KS 7.27M, KY 1.56M, LA 19.30M, ME 1.40M, MD 547.69M, MA 285.56M, MI 186.90M, MN 7.12M, MS 9.54M, MO 17.50M, MT 0.91M, NE 1.50M, NV 3.58M, NH 105.32M, NJ 61.43M, NM 33.17M, NY 93.27M, NC 11.87M, ND 0.20M, OH 139.66M, OK 18.43M, OR 11.37M, PA 109.04M, RI 3.99M, SC 17.42M, SD 6.08M, TN 12.03M, TX 173.39M, UT 3.89M, VT 0.10M, VA 1.09B, WA 11.53M, WV 31.20M, WI 4.39M, WY 2.09M. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.5: Public spending on AI via grants in the United States, 2013–24: AL 181.23M, AK 11.97M, AZ 240.56M, AR 43.17M, CA 2.37B, CO 353.30M, CT 220.98M, DE 58.32M, DC 152.73M, FL 565.30M, GA 387.72M, HI 56.87M, ID 37.07M, IL 654.84M, IN 331.09M, IA 134.41M, KS 65.63M, KY 62.58M, LA 85.96M, ME 43.68M, MD 722.03M, MA 1.30B, MI 409.36M, MN 245.44M, MS 60.98M, MO 218.68M, MT 36.53M, NE 43.04M, NV 47.84M, NH 92.69M, NJ 244.40M, NM 49.85M, NY 1.16B, NC 502.71M, ND 50.75M, OH 304.81M, OK 139.91M, OR 201.18M, PA 980.96M, RI 58.51M, SC 180.61M, SD 20.62M, TN 240.94M, TX 907.14M, UT 111.98M, VT 23.89M, VA 297.22M, WA 622.24M, WV 34.28M, WI 360.34M, WY 11.74M. Source: AI Index, 2026 | Chart: 2026 AI Index report

Spending Across Agencies

In the United States, the Department of Defense led in AI-related contract and OTA spending during 2013–24, accounting for 74.1% of the $810 million total in 2024 and 73% of the total spend ($4.6 billion) across the whole period (Figure 8.5.6). The next largest funders, though significantly smaller, were the Department of the Treasury at 7.2% and the Department of Veterans Affairs at 5.1%. Each of the remaining agencies represented less than 5% of total spending over the same period.

AI-related grants have been channeled mainly through the Department of Health and Human Services (HHS), which includes the National Institutes of Health, and the National Science Foundation (NSF) (Figure 8.5.7). By 2024, AI-related grant funding was split roughly evenly between these two agencies, with each accounting for about 40% of the total. Over time, the NSF consistently received the largest share of AI-related grant funding until HHS funding rose steeply starting in 2020.

Figure 8.5.6: Public spending on AI-related contracts + OTAs (% of total) in the United States by funding agency, 2013–24. 2024 shares: Department of Defense 74.09%, Department of the Treasury 7.16%, Department of Veterans Affairs 5.13%, Other 4.80%, Department of Health and Human Services 3.58%, Department of Homeland Security 3.35%, National Aeronautics and Space Administration 1.90%. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.7: Public spending on AI-related grants (% of total) in the United States by funding agency, 2013–24. 2024 shares: Department of Health and Human Services 41.63%, National Science Foundation 37.62%, Department of Energy 6.15%, Department of Commerce 5.30%, Other 3.74%, Department of Agriculture 3.35%, Department of Defense 2.22%. Source: AI Index, 2026 | Chart: 2026 AI Index report

Europe

European nations collectively committed¹⁰ approximately $3.7 billion in contracts over the 2013–24 period (Figure 8.5.8). The United Kingdom accounted for the largest share, with $1.6 billion, followed by Germany ($505 million) and France ($320 million). In 2024, the U.K.’s AI-related public commitment was $454.4 million, equivalent to 28% of its cumulative total for 2013–24. Germany allocated $206.6 million, or 40% of its total over the same period. Contract volume follows a similar pattern. The U.K. issued the most AI-related contracts with 738, compared to 611 in Germany and 187 in Spain (Figure 8.5.9). However, despite their higher spending and contract volumes, these three countries have a median contract value below $500,000, far lower than smaller European countries such as Denmark (almost $1.1 million) (Figure 8.5.10).

10 The level of detail in U.K. and EU data only allows us to gauge awarded amounts. Since many AI-related contracts rely on framework agreements or dynamic purchasing systems, there is no clear information about the timeline and total of actual obligations.

Figure 8.5.8: Cumulative public spending on AI contracts in European countries, 2013–24 (in billions of US dollars; series: United Kingdom, Germany, France, Spain, Finland, Belgium, Others). Cumulative totals by year, 2013 to 2024: 0.01, 0.01, 0.02, 0.03, 0.09, 0.17, 0.37, 0.60, 1.33, 1.73, 2.41, 3.66. Cumulative values in 2024: United Kingdom 1.59, Germany 0.51, France 0.32. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.9: Number of AI-related contracts in select European countries, 2013–24 (sum): United Kingdom 738, Germany 611, Spain 187, Poland 162, France 161, Romania 111, Finland 96, Czech Republic 86, Hungary 62, Bulgaria 53, Italy 50, Belgium 45, Netherlands 38, Greece 35, Denmark 34. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.10: Median value of public AI-related contracts in select European countries, 2013–24 (in millions of US dollars): Denmark 1.07, Austria 0.98, Belgium 0.97, Italy 0.83, Finland 0.68, Greece 0.65, Malta 0.65, Norway 0.61, Portugal 0.58, Ireland 0.53, Estonia 0.52, Sweden 0.49, Slovakia 0.48, Spain 0.45, France 0.43. Source: AI Index, 2026 | Chart: 2026 AI Index report

Figure 8.5.11: Public spending on AI-related contracts (% of total) in Europe by funding agency activity, 2013–24. 2024 shares: Government, national agency or public authority 62.65%, Health 13.90%, Education 13.72%, Economic and financial affairs 3.65%, Others 3.30%, Community and social sector organizations 2.79%. Source: AI Index, 2026 | Chart: 2026 AI Index report

Spending Across Sectors

In Europe, government bodies accounted for the largest share of AI-related contract spending in 2024, with the “government, national agency, or public authority” category representing 62.6% of the total (Figure 8.5.11). Health accounted for 13.9% of the spend, followed closely by education, which received 13.7%.

Chapter 9: Public Opinion

9.1 Global Sentiment Toward AI
Global Perceptions of AI’s Impact on Jobs
Highlight: Global AI Use in the Workplace

9.2 US Public and Expert Views on AI’s Societal Impact
Highlight: Views on AI Companions

9.3 Perceptions on AI Trust, Transparency, and Regulation
Global Trust in Institutions
US Attitudes Toward AI Regulation

9.1 Global Sentiment Toward AI

Figure 9.1.1: Global opinions on products and services using AI (% of total), 2022–25. Share of respondents who “Agree,” shown for 2022, 2023, 2024, and 2025, with the following statements: Products and services using artificial intelligence should have to disclose that use; Products and services using artificial intelligence make me nervous; Products and services using artificial intelligence make me excited; I trust that companies that use artificial intelligence will protect my personal data; I trust artificial intelligence to not discriminate or show bias toward any group of people; I trust people not to discriminate or show bias toward any group of people; Products and services using artificial intelligence have more benefits than drawbacks; Products and services using artificial intelligence will profoundly change my daily life in the next 3–5 years; Products and services using artificial intelligence have profoundly changed my daily life in the past 3–5 years; I know which types of products and services use artificial intelligence; I have a good understanding of what artificial intelligence is. Source: Ipsos, 2022–25 | Chart: 2026 AI Index report

The increase in optimism is not uniform across all surveyed countries (Figure 9.1.2). Of the 30 countries surveyed by Ipsos, many reported increases between 2022 and 2025 in the share of respondents who agreed that the benefits of AI outweigh the drawbacks. Several countries, most of them European, reported higher levels of optimism over this period, including Germany (+12 percentage points), France (+10), China (+9), and Great Britain (+5), though overall sentiment in the European countries remained lower than in parts of Asia and Latin America.

Southeast Asian nations are among the most optimistic about the future of AI (Figure 9.1.3).
In Malaysia, Thailand, Indonesia, and Singapore, over 80% of respondents expect AI to profoundly change their lives over the next three to five years. These countries have consistently ranked at the top of global optimism on AI in recent years, and that sentiment has edged up since 2024, with Malaysia showing the largest increase (+9) (Figure 9.1.4). Respondents from these countries also report higher levels of excitement than nervousness about AI-enabled products and services.

Figure 9.1.2

Figure 9.1.3

Figure 9.1.4

When looking at year-over-year percentage point changes, global nervousness has increased (+3) and excitement has declined (-1) relative to 2024. India shows the sharpest increase in concern around AI usage (+14), with only a modest increase in excitement (+2).

Across countries, excitement and nervousness about AI do not align closely (Figure 9.1.5). The 2025 distribution mirrors patterns from prior years, with North American and European countries generally clustered at lower levels of excitement and higher levels of nervousness. China and Indonesia show the highest levels of excitement, with nervousness below 50%.

Despite increased nervousness, many respondents continue to associate AI with practical personal benefits, particularly time savings and entertainment (Figure 9.1.6). Globally, 56% of respondents believed AI would reduce the amount of time it takes them to get things done; this figure was even higher in China (78%) and in Southeast Asian countries (>60%). However, respondents were less sure about AI’s potential to positively impact their country’s economy or job market. North American and European respondents were more skeptical that AI would make their jobs better.
In the United States, 33% of respondents said AI would make their jobs better, as opposed to making them worse or having no impact, compared to the global average of 40%. Positive views of AI’s personal benefits appear to coexist with concern about its effects on labor markets.

Figure 9.1.5: Global opinions about products and services using AI by country, 2025. Axes: % of respondents who agreed they are excited about AI; % of respondents who agreed they are nervous about AI. Source: Ipsos, 2025 | Chart: 2026 AI Index report

Figure 9.1.6

Global Perceptions of AI’s Impact on Jobs

In both 2024 and 2025, Ipsos asked respondents how likely they thought it was that AI would change their job or completely replace it within the next five years. Results from 2025 show that perceptions remained stable year over year (Figure 9.1.7). In 2025, 22% of respondents said it was “very likely” AI would change how they do their current job, compared to 21% in 2024. In both years, the share that said it was “not likely” remained unchanged at 32%. Expectations around job replacement showed the same consistency. In 2024 and 2025, 11% of respondents reported that it was “very likely” AI would replace their job within the next five years, and 56% said this was “not likely.”

When asked whether AI is generally more likely to create new jobs or eliminate existing ones, views in 2025 were divided (Figure 9.1.8). Country-level expectations follow patterns similar to the earlier sentiment trends. Nigeria, Japan, Mexico, the United Arab Emirates, South Korea, and India all expected AI to create more jobs than it eliminates, with shares above 60%. The United States and Canada sat at the opposite end, where 67% and 68% of respondents, respectively, expected AI to eliminate jobs and disrupt industries.

Figure 9.1.7: Global opinions on the perceived impact of AI on current jobs, 2024 vs. 2025. Panels: “AI will change how you do your current job in the next 5 years”; “AI will replace your current job in the next 5 years.” Response options: very likely, somewhat likely, not likely, don’t know (% of respondents). Source: Ipsos, 2025 | Chart: 2026 AI Index report

Figure 9.1.8 [2]: Global expectations about AI creating new jobs vs. eliminating jobs, 2025. Response options: create new jobs and new ways of working; eliminate jobs and disrupt industries (% of respondents). Source: Ipsos, Google 2026 | Chart: 2026 AI Index report

[2] In Ipsos’ reporting of findings, percentages are rounded to the nearest whole number. As a result, figures may not add up to exactly 100%.
Respondents were also asked whether AI would make the job market and their own jobs better or worse, or leave them unchanged, over the next five years. Optimism on both measures is low, at or below roughly 50%, in most countries surveyed (Figure 9.1.9). China, Indonesia, Thailand, and Singapore report more positive expectations around AI’s impact on jobs, both individually and economy-wide. North America and Europe have lower expectations, though respondents there were more positive about how AI might improve their individual jobs than about the overall job market. Chapter 4 of the AI Index further explores the technology’s impact on the global economy and labor markets.

Figure 9.1.9: Global opinion on the potential of AI to improve the job market vs. individual jobs, 2025. Axes: the job market (% of respondents); individual jobs (% of respondents). Source: Ipsos, 2025 | Chart: 2026 AI Index report

Highlight: Global AI Use in the Workplace

Figure 9.1.10: Trust in AI and intentional use at work, 2025. Axes: AI trust (% of respondents); AI use (% of respondents). Source: University of Melbourne and KPMG International, 2025 | Chart: 2026 AI Index report

Since 2022, the use of AI technology within organizations has become more prevalent. To capture that transformation across workplaces, the University of Melbourne fielded a global survey [3] of 48,340 people across 47 countries, examining how employees are adopting and using AI at work. Respondents were asked if they rely on AI to inform decisions, and whether they feel comfortable sharing the information AI tools need to carry out tasks.

Globally, the share of employees who intentionally use AI at work continues to grow. In 2025, 58% of employees reported using AI on a semiregular or regular basis, and just over half (53%) said they trust AI for work purposes (Figure 9.1.10). From a regional perspective, the results reveal notable differences. Employees in emerging economies remain the most active users of AI in the workplace: In India, China, Nigeria, the United Arab Emirates, and Saudi Arabia, over 80% of respondents said they regularly use AI at work, and trust levels in these countries are similarly high.
By contrast, in most North American and European countries, about half of employees report using AI tools regularly, while trust tends to fall several points lower, between 40% and 48%. The regional patterns in workplace adoption contrast with the population-level diffusion data discussed in Chapter 4, where AI adoption shows a strong, statistically significant positive correlation with GDP per capita.

[3] These results come from an online survey, which can overrepresent younger, more urban, and more educated respondents in emerging economies. The study authors note that country-level differences hold after controlling for age and education.

Figure 9.1.11

The survey also asked employees about their organization’s level of support for AI strategy, AI literacy, and AI governance (Figure 9.1.11). Respondents reflected on whether their organization had a coherent AI strategy and supported adoption, AI literacy, and responsible use, including training, as well as governance practices such as clear policies, monitoring, accountability, and data privacy and security measures. Consistent with usage and trust levels, organizational support was reported highest in emerging economies. In India, around 85%–90% of respondents said their organization supports AI strategy, literacy, and governance. Nigeria, Egypt, China, and the UAE also rank among the top countries for organizational support. At the other end, respondents in Japan, South Korea, and Portugal report the lowest levels of support for AI literacy, along with less confidence in responsible AI governance. Overall, most countries reported less organizational support for responsible AI governance than for literacy and strategy. Chapter 3 further explores this governance gap and the key barriers to responsible AI implementation.

9.2 US Public and Expert Views on AI’s Societal Impact

This section draws on multiple U.S.-focused surveys to compare how the public and AI experts view AI’s societal impact. The main sources [4] are Pew Research Center’s 2024 survey of U.S. adults and AI experts, the Elon University Imagining the Digital Future Center’s 2025 survey on expected effects on human capacities by 2035, and the Longitudinal Expert AI Panel (LEAP), conducted by the Forecasting Research Institute. For the Pew survey, AI experts were U.S.-based authors or presenters at AI-related conferences in 2023 or 2024 who reported that their work or research relates to AI.

Across nearly every topic surveyed, experts report more optimism than the U.S. public (Figure 9.2.1). The largest gaps show up around the future of work: 73% of AI experts said AI will have a positive impact on how people do their jobs, compared to 23% of U.S. adults. Similar gaps appear for the economy (69% vs. 21%), K–12 education (61% vs. 24%), and medical care (84% vs. 44%). For both groups, however, optimism is low in domains tied to trust and social connection, including elections, news, and personal relationships.

[4] Sources: McClain, C. et al. (2025). How the U.S. public and AI experts view artificial intelligence. Pew Research Center. This report covers multiple research components conducted in 2024, including a U.S. survey of 5,410 adults conducted August 12–18, 2024, a survey of 1,013 AI experts living in the United States conducted August 14–October 31, 2024, and in-depth interviews with 30 individual AI experts conducted October 18–November 26, 2024. Rainie, L., & Anderson, J. (2025). Many Americans expect AI to have significant negative impact on human capacities and behaviors such as social and emotional intelligence, analytical thinking and agency by 2035. Imagining the Digital Future Center at Elon University. This national survey of 1,005 U.S. adults was conducted July 17–20, 2025. Kennedy, B. et al. (2025). How Americans view AI and its impact on people and society. Pew Research Center. This survey was conducted June 9–15, 2025, with a sample of 5,023 U.S. adults.

Figure 9.2.1: US perceptions of AI’s societal impact: general public vs. experts. Axis: % saying AI will have positive impact over next 20 years; series: U.S. adults, AI experts. Source: Pew Research, 2025 | Chart: 2026 AI Index report

When asked to look ahead to 2035, the U.S. public is again more pessimistic than AI experts about the impact the technology is likely to have on key human traits such as thinking, learning, and creativity (Figure 9.2.2). U.S. adults are more likely than AI experts to anticipate negative effects on metacognition (53% vs. 36%), defined as the ability to think analytically about one’s own thinking process, and on decision-making (48% vs. 30%), which refers to problem-solving abilities. For social and emotional intelligence, defined as the ability to understand and manage social interactions, 51% of U.S. adults and 34% of experts expect AI to have a negative impact. Concern about mental well-being is high for both groups, with 55% of adults and 53% of experts saying AI will have a negative effect.

Beyond general sentiment, recent forecasting data shows even wider gaps in expected timelines and scale. The Longitudinal Expert AI Panel (LEAP), conducted by the Forecasting Research Institute, surveyed AI experts and the general public on specific AI milestones and adoption rates. Across 68 forecasts, experts consistently predicted much faster AI progress than the public. In capability-focused forecasts, public views align with experts in only 9% of cases. When they diverge, the public expects slower progress 71% of the time. In direct comparison, experts are 16% more likely than the public to predict faster progress.

Across specific metrics, the gaps are even more significant. By 2030, AI experts expect higher accuracy on complex math problems (+25 points), more AI-assisted work (+8.2), and greater adoption of autonomous ride-hailing (+8) (Figure 9.2.3). The public predicts greater electricity consumption by AI and a higher probability that AI solves a major mathematical problem.
Looking further out to 2040, experts project a higher likelihood of a transformative technological event occurring (+30) and much higher rates of daily AI companion use (+10) and AI-discovered drugs (+10). The gap in capability forecasts between the public and experts is notable as model performance continues to accelerate across a range of technical benchmarks. Chapter 2 tracks several of the significant breakthroughs of the past year.

Figure 9.2.2: Impact of AI on key human capacities and traits: general public vs. experts. Axis: % of respondents who expect AI to be more negative than positive by 2035; series: U.S. adults, AI experts. Source: Elon University, 2025 | Chart: 2026 AI Index report

Figure 9.2.3 [5]: Public vs. expert AI progress forecasts: 2030 and 2040 median predictions. Panels: 2030, 2040; axis: forecast (%); series: public, expert. Source: Forecasting Research Institute, 2026 | Chart: 2026 AI Index report

[5] Not all questions in the LEAP survey were asked for both 2030 and 2040. Forecast horizons vary by topic and were set according to what was most meaningful or measurable for each question. As a result, the absence of a 2040 value for some items reflects survey design rather than missing responses.

Views on employment over the long term show a similar pattern (Figure 9.2.4). Nearly two-thirds (64%) of U.S. adults said AI will lead to fewer jobs in the next 20 years, while 5% said more jobs. Among experts, 39% predicted fewer jobs and 19% predicted more.

Figure 9.2.4: Views on whether AI will create or eliminate jobs: general public vs. experts. Response options: fewer jobs, not much difference, more jobs, not sure (% of respondents); groups: U.S. adults, AI experts. Source: Pew Research, 2025 | Chart: 2026 AI Index report

Experts forecast much faster workplace adoption than the public. The median prediction among experts is that generative AI will assist 8% of U.S. work hours in 2027, rising to 18% in 2030. The top 25% of expert predictions (75th percentile and above) put AI-assisted work hours at over 30% by 2030, and the top 10% (90th percentile and above) put them at more than 40%. In contrast, the public expects slower adoption, at 10% by 2030 (Figure 9.2.5).

When asked about specific occupations, the U.S. public and AI experts identified certain jobs as being at higher risk of automation than others (Figure 9.2.6). There is strong consensus between the public and experts regarding automation risks for jobs such as cashiers, journalists, and software engineers.
AI experts see a greater risk for truck drivers and lawyers, while the U.S. public is more likely to believe AI will lead to fewer jobs for teachers and medical doctors. Both groups mostly identify the same areas of vulnerability, but the public is generally more likely to anticipate job loss across categories.

Figure 9.2.5: Generative AI use intensity. Panels: public, expert; x-axis: resolution date (2025, 2027, 2030); y-axis: generative AI use intensity; lines: 25th, 50th, and 75th percentiles. Source: Forecasting Research Institute, 2026 | Chart: 2026 AI Index report

Figure 9.2.6: Views on AI-driven job loss by occupation: general public vs. experts. Axis: % of respondents saying AI will lead to fewer jobs (next 20 years); series: U.S. adults, AI experts. Source: Pew Research, 2025 | Chart: 2026 AI Index report

The gap in expert vs. public sentiment coincides with increasing awareness and adoption of AI in the United States. In 2025, 47% of U.S. adults said they had heard “a lot” about AI, up from 26% in 2022. Growth in awareness is steepest among younger adults, ages 18–29 (+29 percentage points since 2022), though it is also rising among those ages 65 and older (+13 percentage points) (Figure 9.2.7).
Figure 9.2.7: Americans who have heard a lot about AI by age group, 2022–25. Axis: % of respondents; line labels: 62%, ages 18–29; 51%, ages 30–49; 42%, ages 50–64; 32%, age 65+. Source: Pew Research, 2025 | Chart: 2026 AI Index report

Adoption and frequency of use are also increasing. More than 60% of U.S. adults reported interacting with AI at least several times a week, and 31% said they interact with AI almost constantly or several times a day, though frequency varies by age, race, and ethnicity (Figure 9.2.8). Daily AI interaction is higher among younger adults, college-educated groups, Asian Americans, and men. Political affiliation differences are modest, with Democrats slightly more likely than Republicans to interact daily with AI. As a note, the results are based on when respondents believe they are interacting with AI and therefore may undercount exposure through embedded systems such as navigation, recommendations, or rankings.

Figure 9.2.8 [6]: Frequency of AI interaction among US adults by demographic group, 2025. Response options: almost constantly/several times a day, about once a day, several times a week, less often (% of respondents). Source: Pew Research, 2025 | Chart: 2026 AI Index report

[6] “Asian” includes English-speaking respondents only. Respondents who did not provide an answer are not shown. “White,” “Black,” and “Asian” adults are non-Hispanic and report only one race; “Hispanic” adults may be of any race.

Highlight: Views on AI Companions

AI companionship, defined as relationships with AI systems designed for ongoing emotional and social support, represents one of the more contentious emerging uses of AI technology (Chou et al., 2024; Pan et al., 2025). Experts predict that 10% of U.S. adults will use AI for companionship at least once a day by 2027, with that number rising to 15% by 2030 and 30% by 2040 (Figure 9.2.9). The top quartile of expert predictions forecasts that more than 40% of the public will engage in daily AI companionship, while the top 10% predict over 60%. Expectations from the general public are significantly lower, at 20% by 2040. Both experts and the public find it less likely that mental health therapists will be replaced by AI, suggesting an understanding of the limitations of AI companions: They cannot fully replace human expertise in complex or therapeutic contexts.

A 2026 Ipsos-Google survey found that 52% of worldwide respondents reported some level of excitement about using AI for companionship (Figure 9.2.10). In countries such as Nigeria, India, and the United Arab Emirates, over 20% of respondents said they were “extremely excited.” The United States and Canada had the largest shares of respondents who were not excited at all, at 36% and 34%, respectively. Japan recorded very few “extremely excited” respondents and had the highest share of “don’t know” responses at 18%, nearly double the global average.

Figure 9.2.9: Projected daily AI companionship adoption among US adults. Panels: public, expert; x-axis: resolution date (2027, 2030, 2040); y-axis: AI companionship adoption; lines: 25th, 50th, and 75th percentiles. Source: Forecasting Research Institute, 2026 | Chart: 2026 AI Index report

Figure 9.2.10 [7]: Excitement about using AI for companionship. Response options: extremely excited, very excited, somewhat excited, not that excited, not excited at all, don’t know (% of respondents). Source: Ipsos, Google 2026 | Chart: 2026 AI Index report

[7] Totals may not equal 100% due to rounding of individual values.

AI companions differ from traditional task-oriented AI by prioritizing relationship building over functionality (Zhang and Lu, 2023; Zhang et al., 2025). Modern systems incorporate memory of past interactions, recognize emotion, and adapt their responses to individual users’ needs (Yang et al., 2025).
Platforms like Replika, Character.ai, and XiaoICE have attracted user bases in the millions. Many users have reported forming emotional attachments to their AI companions, viewing them as friends, mentors, or romantic partners (Zhang et al., 2024; Kouros et al., 2024).

The technology has both benefits and risks. Research shows that AI companions can reduce loneliness to a similar degree as interacting with another human (Freitas et al., 2024), with users citing always-available support (11.9%) and a safe space for emotional expression (9.9%) as primary advantages. Mental health improvements were reported by 6.2% of users, and some credited their AI companions with helping them through crises (Pataranutaporn et al., 2025). However, concerning patterns have emerged. Users frequently perceive chatbots as entities with needs, which poses a problem given the established correlation between emotional dependence and psychological distress (Bengio et al., 2025). Critical questions remain about whether these relationships reduce loneliness sustainably or undermine existing relationships and increase social isolation (Quinn et al., 2024).

9.3 Perceptions on AI Trust, Transparency, and Regulation

Global Trust in Institutions

As AI becomes more embedded in daily life, the mechanisms around trust, transparency, and regulation also become more visible. In Ipsos’ 2025 AI Monitor Survey, 79% of respondents said companies using AI should be required to disclose that usage (Figure 9.1.1). That view was shared across all 30 countries surveyed, even though overall trust in institutions was lower. Over half of respondents, or 54%, said they trust their government to regulate AI responsibly (Figure 9.3.1). The United States reported the lowest trust on this measure (31%). In parallel with the higher levels of optimism and excitement mentioned earlier, Southeast Asian countries reported the highest levels of trust in their governments, including Singapore (81%), Indonesia (76%), Malaysia (73%), and Thailand (70%).

Figure 9.3.1: Trust in government regulation of AI by country (% of total), 2025. Axis: % of respondents. Source: Ipsos, 2025 | Chart: 2026 AI Index report

A separate Pew global survey asked a related question to compare respondents’ trust in different governing bodies across the globe. Pew’s Spring 2025 Global Attitudes Survey found that respondents tend to trust their own country most to regulate AI effectively, but trust in outside governments was mixed. Across the 25 countries surveyed, a median of 53% said they trust the EU to regulate AI effectively, compared to 37% for the United States and 27% for China (Figure 9.3.2). Trust in the Chinese government consistently received the lowest ratings across countries, while trust in the EU varied depending on whether respondents lived within or outside of the EU. However, even within the EU, trust levels were not uniform. Respondents in Germany and the Netherlands were among the most trusting of the EU’s ability to regulate AI effectively, while those in Greece and Italy were among the least trusting. In the United States, views were closely divided between trust (44%) and distrust (47%) in the government’s ability to regulate AI effectively, and 43% said they trust the EU on AI regulation. These trust dynamics are shifting against the backdrop of an expanding legislative landscape, outlined in Chapter 8, as the number of countries adopting national AI strategies continues to grow.
A separate Ipsos/Google survey shows a related divide in public priorities. Globally, 58% of respondents said it was more important to foster advances in science, medicine, and other fields through AI innovation, compared to 41% who prioritized protecting industries that may be affected by AI through regulation (Figure 9.3.3). Most countries in the survey lean toward innovation, though South Africa, India, and Ireland were among the few where respondents were more likely to prioritize regulation. Across these different measures, public views on AI governance appear mixed, varying in trust, priorities, and regulatory expectations.

Figure 9.3.2

Figure 9.3.3 [8]: Global priorities: AI innovation vs. AI regulation, 2025. Response options: advancing science, medicine, and other fields; protecting industries through regulation (% of respondents). Source: Ipsos, Google 2026 | Chart: 2026 AI Index report

[8] In Ipsos’ reporting of findings, percentages are rounded to the nearest whole number. As a result, figures may not add up to exactly 100%.

US Attitudes Toward AI Regulation

In the United States, attitudes toward AI regulation vary meaningfully by geography. In 2025, the Civic Health and Institutions Project fielded a survey across 50 states and asked respondents whether federal regulation of AI would go too far or not far enough, or whether they were not sure (Figures 9.3.4 and 9.3.5). In every state, concern about too little regulation outnumbers concern about too much regulation (41% vs. 27% overall), but the level of uncertainty is substantial, with roughly one-third of respondents selecting “not sure.” New York and Tennessee reported the highest levels of concern that regulation will go too far (31%), while Missouri and Washington had the highest shares who said the government will not go far enough (48%).

Figure 9.3.4: Net concern for not enough vs. too much AI regulation by US state, 2025. Map legend bins: 0–4, 5–9, 10–14, 15–19, 20–24, 25+, NA. Source: CHIP50, 2025 | Chart: 2026 AI Index report

Figure 9.3.5: Support for AI federal regulation by US state, 2025. Response options: go too far regulating its use; not go far enough regulating its use; not sure (% of respondents). Source: CHIP50, 2025 | Chart: 2026 AI Index report
Across nearly every state, more respondents said regulation will not go far enough than said it will go too far. Roughly one in three respondents in most states said they were not sure, making uncertainty the second-largest category.

[Figure 9.3.6: Attitude toward AI federal regulation in the US by demographic group, 2025. Series: Go too far regulating its use; Not go far enough regulating its use; Not sure. Axis: % of respondents. Source: CHIP50, 2025 | Chart: 2026 AI Index report]

Across U.S. demographic groups, the strongest concern about insufficient AI regulation was reported among older adults, especially those 65 and older (51%) (Figure 9.3.6). Education was associated with stronger support for more regulation, with 46% of college graduates saying the government will not go far enough, compared with 34% among respondents with a high school degree or less. Political affiliation was not a significant differentiator, although Democrats were more likely than Republicans to say regulation will not go far enough (45% vs. 40%), while concern about going too far was similar across parties (roughly 25%–28%).

Appendix

Chapter 1: Research and Development

Environmental Impact Analysis

The AI Index estimated the carbon emissions of training language and vision models using a calculator proposed by Lacoste et al. (2019).
The analysis focused on training-stage emissions, excluding embodied hardware production, idle infrastructure, and deployment emissions. The study examined four model categories: industry language models, academic language models, industry vision models, and academic vision models. The calculator’s accuracy was verified against published emission values.

Calculator inputs included hardware type, GPU hours, provider, and compute region. For newer hardware like the H100 GPU (released in 2022), the A100 SXM4 80GB was used as a substitute in calculations. GPU hours were calculated by multiplying hardware quantity by training duration; these values were taken from Epoch AI’s Data on AI models or from the technical paper for the model. Provider selection was based on known partnerships (e.g., Google models using GCP, OpenAI using Azure), while compute regions were determined by team locations. Special consideration was given to models trained on custom hardware, such as BLOOM’s use of the Jean Zay supercomputer in France. In these cases, private infrastructure calculations incorporated carbon efficiency (kg/kWh) and offset percentages.

The study evaluated 52 models in total: 36 industry language models (2018–25), eight industry vision models (2019–23), four academic language models (2020–23), and four academic vision models (2011–22), selecting particularly influential models in their respective domains.
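The arithmetic behind this kind of estimate can be illustrated with a minimal sketch. It is not the calculator's own code: it assumes the common simplification that emissions equal energy used times grid carbon efficiency, reduced by any offset share. The class name, function names, per-GPU power draw, and all example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    """Inputs for one training run (all example values below are hypothetical)."""
    num_gpus: int                 # hardware quantity
    training_hours: float         # training duration in hours
    gpu_power_kw: float           # assumed average power draw per GPU, in kW
    carbon_intensity: float       # carbon efficiency of the grid, kg CO2e per kWh
    offset_fraction: float = 0.0  # share of emissions offset, between 0 and 1


def gpu_hours(run: TrainingRun) -> float:
    # GPU hours = hardware quantity multiplied by training duration.
    return run.num_gpus * run.training_hours


def training_emissions_kg(run: TrainingRun) -> float:
    # Energy (kWh) = GPU hours * power per GPU; emissions = energy * carbon efficiency,
    # reduced by the offset share.
    energy_kwh = gpu_hours(run) * run.gpu_power_kw
    gross_kg = energy_kwh * run.carbon_intensity
    return gross_kg * (1.0 - run.offset_fraction)


example = TrainingRun(
    num_gpus=512,
    training_hours=240.0,
    gpu_power_kw=0.4,
    carbon_intensity=0.05,
)
print(f"{gpu_hours(example):.0f} GPU hours, {training_emissions_kg(example):.1f} kg CO2e")
```

Because only training-stage energy enters the formula, the sketch mirrors the scope stated above: embodied hardware, idle infrastructure, and deployment emissions are left out.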
GitHub

Identifying AI Projects

In partnership with researchers from Harvard Business School, Microsoft Research, and Microsoft’s AI for Good Lab, GitHub identifies public AI repositories following the methodologies of Gonzalez, Zimmerman, and Nagappan (2020) and Dohmke, Iansiti, and Richards (2023), using topic labels related to AI/ML and generative AI, respectively, along with other relevant keywords identified through snowball sampling, such as “machine learning,” “deep learning,” and “artificial intelligence.” GitHub further augments the dataset with repositories that have a dependency on the PyTorch, TensorFlow, OpenAI, Transformers, XGBoost, scikit-learn, and SciPy libraries for Python.

Mapping AI Projects to Geographic Areas

Public AI projects are mapped to geographic areas using IP address geolocation to determine the mode location of a project’s owners each year. Each project owner is assigned a location based on their IP address when interacting with GitHub. If a project owner changes locations within a year, the location for the project would be determined by the mode location of its owners sampled daily in the year. Additionally, the last known location of the project owner is carried forward on a daily basis even if the project owner performed no activities that day. For example, if a project owner performed activities within the United States and then became inactive for six days, that project owner would be considered to be in the United States for the seven-day span.

Hugging Face

Hugging Face (HF) data is collected from two distinct sources:

• Downloads data: shared by the authors of Longpre et al. (2025)

• Number of existing models and datasets: publicly accessible Hugging Face (HF) repository.

Downloads data

Longpre et al. (2025) is used because it provides the most consistent and complete information on actual downloads.

• Download data from HF can vary across releases depending on how parameters are handled.1 Longpre et al. collaborated directly with HF personnel, who confirmed that this dataset is the least noisy version available.2

1 GET/HEAD requests, local cache hits, IP addresses for cloud-hosted virtual machines when the setup does not require a fixed IP, the Git/Xet backend for models obtained inside vs. outside the Transformers and Hub API, proactive spam detection, retroactive spam detection, etc.

2 The unreleased usage data takes a different aggregation approach than the public API. It is filtered to remove repetitive, duplicate requests from

• Much of the download and model metadata is missing in other sources; the authors performed manual cleaning and imputed missing values, providing a more complete version than publicly accessible alternatives.

• Public HF data is provided as a cross-section of “all-time downloads,” whereas access to a panel version was granted through direct collaboration with the authors.

This dataset is limited by the following:

• The time span covers March 2022 to August 2025. The start date cannot be moved earlier because March 2022 is when HF began tagging model creation dates.

• Coverage is limited to the top 200 most-downloaded models per week among models with at least 200 downloads. However, this subsample reflects 49.6% of all downloads.

Number of existing models and datasets

The public source is constantly updated but at this time contains a large number of missing values for modality (~60%) and no geographic information, and it is only available in cross-sectional format. As a result, it does not allow for time-series analysis of usage or a more granular representation of the model and dataset population by geographic area and modality.

AI Publication Analysis

For this analysis, the AI Index used OpenAlex, an open scholarly database with over 260 million research publications, as its primary data source. OpenAlex classifies papers using its own knowledge organization system, known as OpenAlex Topics, a taxonomy of around 4,500 topics combining Scopus codes and CWTS classification. The system uses a deep learning model that considers titles, abstracts, journal names, and citation networks for classification. To identify AI-related topics more precisely, the AI Index analyzed computer science publications identified by OpenAlex and refined the classifications using the Computer Science Ontology and the CSO Classifier.

The Computer Science Ontology (CSO) is a large-scale, automatically generated ontology of research areas derived from 16 million publications using the Klink-2 algorithm. It features a hierarchical structure with thousands of subtopics, allowing for precise mapping of specific terms to broader research fields. Compared to general-purpose scholarly databases such as OpenAlex, Scopus, and Web of Science, CSO offers a more detailed and fine-grained representation of the research landscape. As a result, it has been widely used for scholarly data exploration, analysis, modeling, and expert identification and recommendation.
Version 3.4.1—used in this analysis—includes approximately 15,000 topics and 166,000 relationships within computer science. Released on Jan. 17, 2025, this version introduces over 150 new research topics in artificial intelligence, bringing the total to 2,369 AI-related topics and 12,620 hierarchical relationships within the AI domain alone. To analyze research trends, the AI Index used the CSO Classifier—an unsupervised method that automatically categorizes research papers based on CSO topics. The classifier follows a three-stage pipeline that processes paper titles and abstracts: A syntactic module detects direct mentions of CSO topics; a semantic module uses word embeddings to identify related concepts; and a postprocessing module merges results, filters out irrelevant topics, and adds broader categories for a more refined classification. For this analysis, the AI Index extended the CSO Classifier to focus specifically on artificial intelligence and its subtopics. Since its initial release, the classifier has gained significant and growing interest due to its versatility. For example, Springer Nature uses it to routinely classify proceedings books, improving metadata quality. Beyond academic publishing, it has been successfully applied to categorize research software, YouTube videos, press releases, job ads, and IT museum collections. Accurately categorizing research papers as either conference proceedings or journal articles is essential for this analysis. OpenAlex’s metadata fields—type, crossref_type, and source_type—can sometimes conflict. To resolve these inconsistencies, the AI Index mapped OpenAlex records to DBLP, a leading bibliographic database for computer science publications. Known for its high metadata quality, DBLP currently indexes 3.6 million conference papers and 3 million journal articles and continuously adds new publications through a rigorous, semiautomated curation process. 
The initial matching between OpenAlex and DBLP was performed using DOIs. For remaining unmatched papers, the AI Index used a combination of title and publication year. To streamline this process, the AI Index built a title index to optimize search and ensure efficient mapping across the datasets.

AI publications are aggregated based on several parameters to provide a comprehensive analysis. Publications are grouped by year, considering the publication date of the most recent versions. Additionally, the AI Index groups publications by geographic areas or World Bank regions based on the affiliations of authors. This means a single paper can contribute to multiple counts if co-authored by researchers from different countries, with each country receiving a count. When authors’ affiliations are missing, the publications are mapped as “Unknown.” Furthermore, sectors are associated with publications through authors’ affiliations when available, which may lead to one publication being counted for multiple sectors. Citation counts are included when available; those without citation data are classified as “Unknown.”

Top 100 Publications Analysis

The AI Index conducted a comprehensive analysis of influential AI publications by collecting and analyzing citation data from multiple sources, including OpenAlex, Google Scholar, and Semantic Scholar. The AI Index initially gathered the top 150 most-cited papers per publication year from OpenAlex and then refined the list through careful review to 100 publications.

2 (continued) the same user on the same day, as well as models with fewer than 200 total downloads, suggesting less broad usage. It is available only upon request to the ML & Society Team of Hugging Face.

The methodology attributes publications to all countries and regions represented by authors’ affiliations, meaning a single paper can contribute to multiple counts.
For instance, a paper co-authored by researchers from the United States and China counts once for each country. This approach may result in overlapping totals in aggregate statistics. Publication years are based on the most recent versions, whether in journals, conferences, or repositories like arXiv. To maintain accuracy, organizational affiliations were verified and standardized, with countries assigned according to headquarters’ locations. The full list of the top 100 AI publications is available here.

AI Conference Attendance

The AI Index contacted the organizers of various AI conferences in 2025 to request information on total attendance. For conferences that posted their attendance totals online, the AI Index used those reported totals and did not reach out to the conference organizers.

AI Patent Analysis

The AI Index identifies AI-related patents using a hybrid classification approach, combining keyword-based text analysis with classification-code-based identification. Patent-level bibliographic data is sourced from PATSTAT Global, a comprehensive database issued by the European Patent Office (EPO). The analysis focuses on granted patents from 2010 onward, aggregated at the DOCDB family level to avoid duplicate counting of the same invention. Patents are attributed to countries based on the publication authority of the earliest recorded grant publication.

Patent abstracts and titles originally published in languages other than English were translated using the deep-translator tool, Google Translate engine, and the Meta NLLB-200 machine translation model. Post-translation, patent texts were processed using natural language processing (NLP) techniques. These included the removal of stop words and special characters, part-of-speech (POS) tagging to retain key grammatical categories, lowercase conversion, lemmatization, and replacement of numerical measures with a <NUM> tag.
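Part of the preprocessing described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the AI Index pipeline: the stop-word list, the token pattern, and the function name `preprocess` are hypothetical, and POS tagging and lemmatization are omitted because they require an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "with", "is"}

# A token is either a number (optionally with . or , separators) or a run of ASCII letters.
# Anything else (punctuation, symbols) is skipped, which removes special characters.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[a-z]+")


def preprocess(text: str) -> list[str]:
    """Lowercase, drop special characters and stop words, and tag numbers as <NUM>."""
    cleaned = []
    for token in TOKEN.findall(text.lower()):
        if token[0].isdigit():
            cleaned.append("<NUM>")   # replace numerical values with a placeholder tag
        elif token not in STOP_WORDS:
            cleaned.append(token)
    return cleaned


print(preprocess("A neural network with 3.5 GHz clock & 128 layers for the image-recognition task!"))
```

Tokenizing with a single pattern handles lowercase conversion, special-character removal, and number tagging in one pass; a fuller version would add POS filtering and lemmatization before any keyword search.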
AI-related patents are identified by searching for relevant terms in patent titles and abstracts using regular expressions (regex). An AI-specific keyword dictionary was developed through a structured multistep process, incorporating keywords generated by AI models, expanded using established AI lexicons such as those from Yamashita et al. (2021), and refined through Word2Vec-based synonym identification. Further validation was conducted using BERTopic topic modeling and DeBERTa-based zero-shot classification, with manual checks applied to reduce false positives.

In addition to keyword-based classification, AI-related patents were identified using International Patent Classification (IPC) and Cooperative Patent Classification (CPC) codes. A curated list of AI-relevant codes was compiled through a combination of AI model analysis, regex-based searches, and prior research, including classifications from Pairolero et al. (2023) and WIPO (2024). The final dataset was constructed by merging results from both approaches, balancing coverage and accuracy.

Speed of Knowledge Diffusion

Methodology

Definition: The Kaplan-Meier estimator is a non-parametric statistical method used to estimate the survival function from lifetime data. In this context:

• Lifetime refers to the citation lag, measured in months, from a patent’s publication to its first citation.

• The event is a patent receiving its first citation.

• Censoring occurs when a patent has not been cited by the end of the observation period.

Computation: The survival function S(t) represents the probability that a patent remains uncited for longer than a specified time t. The Kaplan-Meier estimator is mathematically defined as follows:

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

where:

• t: The time since publication (in months).

• t_i: The distinct time points at which at least one patent receives its first citation.

• d_i: The number of events (patents cited for the first time) that occur at time t_i.

• n_i: The number of patents at risk of being cited just prior to time t_i, which includes all patents that have remained uncited up to time t_(i-1) and have not been censored before t_i.

The analysis is conducted at the patent family level (single invention), using the earliest publication date of each family as the time reference for citation events and citation lags. For applications of this methodology in the literature, refer to examples such as Fisch et al. (2017) and Xie et al. (2019). The figure aggregates citations from all remaining patent authorities, which together constitute less than 6% of the total citations, into a category labeled “Rest of the World.”

3 Despite this aggregation procedure, duplicates occasionally appear in marginal cases where applications within the same DOCDB family share the same earliest filing date. The AI Index removes duplicate values with respect to the aggregation variables (e.g., counting by year) when presenting analytics.

Technological Proximity

Methodology

Computation: Based on the work of Bar et al. (2012), this measure is calculated by comparing the patent portfolios of two countries (patent authorities) using the technological classes to which their patents belong. The calculation uses a vector P_i for the patent portfolio of each entity i, where each component P_ik is the share of the entity’s total patents in a specific technological class k.4

• The final measure captures the sum of the minimum shares across all shared technological classes, quantifying the share of overlapping inventions between the two portfolios. For two entities i and j, this is the sum over classes k of min(P_ik, P_jk).

• Interpretation: The resulting value indicates the similarity in the technological focus between the two countries.
  ◊ Range: The value is normalized to fall between 0 and 1.

Benchmark: The graph shows proximity measures between countries and the two major patent authorities (the United States and China) based on the number of granted AI patents.

Zeki

Zeki AI Talent

Zeki has identified 658,000 top AI talent (outside of China) with a proven track record in producing new discoveries in AI by contributing to research, data depositories, or new models. They are of particular value in the market because of their advanced skills and ability to push the boundaries of science and engineering. They create new products and intellectual property (IP) for their employers rather than just applying existing technology in the market.

The following countries are covered: Australia, Brazil, Canada, Denmark, Finland, France, Germany, India, Israel, Italy, Japan, Netherlands, Saudi Arabia, Singapore, South Korea, Spain, Sweden, Switzerland, UAE, United Kingdom, United States.

Data

Zeki collects publicly available, strictly business-related data that has been published or released by companies or individuals online. Zeki does not collect private data about individuals (i.e., information that is not publicly available and that the individual has chosen to keep private). Zeki strictly refrains from any data collection that involves data aggregation from within secured login areas. Zeki also purchases data from vetted vendors that are stringently assessed by external legal experts to ensure their compliance with data regulations. Career data is sourced in a compliant manner from the open web.

For the talent specialization, Zeki has curated unique areas of AI innovation across all aspects of AI software, hardware, and compute. The primary and secondary areas are identified through a comprehensive analysis of all relevant research papers.
Gender is inferred using a probability model based on the likelihood of a first name being male or female. It is enhanced with likely country of residence to improve accuracy. While this method is generally reliable, some names are commonly used across multiple genders. In such cases, the model assigns a probability score. If the probability falls between 45% and 55%, the name is classified as ambiguous or nonspecific, meaning a confident gender assignment could not be made.

Date Range

The dataset covers the period 2010 to 2025. Records that began prior to 2010 are excluded from this dataset.

Affiliation Range

The dataset includes individuals working in companies and academic institutions.

Last Known Residency

To ensure accuracy and reduce volatility caused by delayed profile updates, Zeki uses the Last Known Residency logic for longitudinal datasets. When a professional’s career timeline lacks an explicit “end date” for their most recent role, or when there is a gap between their last update and the current year (2025), their last known residency, sector, and education level are carried forward to the present day. This methodology accounts for the fact that professionals often delay updating profiles following career breaks, layoffs, or role transitions, providing a more reliable and less volatile view of the active talent pool.

4 Using 5-digit International/Cooperative Patent Classification (IPC/CPC) codes.

Talent Supply

(1) Talent supply

This dataset series provides a detailed view of global AI talent distribution from 2010 to 2025, structured for longitudinal analysis to support both trend tracking and cross-country comparisons. It is delivered in three related panel datasets:

1. Talent Supply - Country by Year

2. Talent Supply - Education by Year and Country

3. Talent Supply - Sector by Year and Country

These datasets are split rather than combined into one file, but all follow a consistent methodology for data aggregation and for the calculation of absolute numbers and percentage shares (where appropriate). Country assignments for each year are derived from individuals’ career timelines, using the location associated with their experience for that year. Similarly, sector information is documented annually from the career timeline, meaning an individual is linked to a sector for each year they worked in it. Education is mapped as a time series from the education timeline, so an individual receives a count in the relevant category (e.g., “master’s”) during the year they pursued that education. This integrated approach ensures accurate representation of AI talent by geography, sector, and education over time. There are six education levels and 336 sectors; sectors are based broadly on LinkedIn sectors.
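The country-by-year aggregation described above can be illustrated with a small sketch. It is not Zeki's implementation: the record layout, the sample data, and the function name `country_by_year` are hypothetical, and it assumes that a person who works in two countries in the same year is counted once in each.

```python
from collections import Counter

# Hypothetical career timelines: (person_id, country, start_year, end_year), years inclusive.
EXPERIENCES = [
    ("p1", "Canada", 2019, 2021),
    ("p1", "United States", 2021, 2023),
    ("p2", "Canada", 2020, 2020),
    ("p3", "Canada", 2008, 2012),  # began before 2010, so it is excluded
]


def country_by_year(experiences, first_year=2010, last_year=2025):
    """Count distinct individuals per (country, year) from career timelines."""
    seen = set()
    for person, country, start, end in experiences:
        if start < first_year:
            continue  # records that began before the first covered year are excluded
        for year in range(start, min(end, last_year) + 1):
            seen.add((person, country, year))  # the set removes duplicate person-years
    return Counter((country, year) for _, country, year in seen)


counts = country_by_year(EXPERIENCES)
print(counts[("Canada", 2020)], counts[("United States", 2021)])
```

The same expansion, keyed on sector or education level instead of country, would produce the other two panel datasets under the same assumptions.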

(2) Areas of specialization

This dataset provides AI talent counts by country, country code, and area of specialization, based on the most recent data for each individual. The country is determined from the individual’s latest recorded experience and its associated location. Areas of specialization are identified using each person’s top two current specializations. The top 100 specializations are retained. Each individual is then mapped to their country and relevant specializations, and counts are calculated.

(3) Mobility

This dataset captures the mobility and career transitions of AI talent, providing insights into three key dimensions:

• Country and regional movement: tracks inflows and outflows of AI talent across countries and regions, highlighting migration trends and global talent flows.

• Sector transitions: monitors shifts between major sectors such as academia, industry, and government, revealing patterns in career progression and workforce dynamics.

When professionals change jobs, there is often a delay before they update their profiles, especially in instances of layoffs or career breaks. To ensure accuracy and reduce volatility caused by these delayed profile updates, all trends are smoothed using a 12-month moving average, delivering a clearer and more reliable view of long-term changes.
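The smoothing step above can be sketched as follows. The source does not say whether the window is trailing or centered, so this sketch assumes a trailing window that uses however many months are available at the start of the series; the function name and sample values are hypothetical.

```python
def moving_average(values, window=12):
    """Trailing moving average; early points average over the months seen so far."""
    smoothed = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            running_total -= values[i - window]  # drop the month that left the window
        smoothed.append(running_total / min(i + 1, window))
    return smoothed


# Hypothetical monthly net inflow counts over two years.
monthly_inflows = list(range(1, 25))
smoothed_inflows = moving_average(monthly_inflows)
print(smoothed_inflows[11], smoothed_inflows[-1])
```

A trailing window keeps each smoothed point dependent only on past months, which suits a series that is extended year by year; a centered window would be an equally plausible reading of the text.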

(4) Gender distribution

This dataset presents the gender distribution of AI talent by country and year, expressed as percentages. For each country and each year in the time range, the share of male and female AI professionals is calculated as a proportion of the total AI talent pool for that country-year combination. This structure enables analysis of gender representation trends over time and supports cross-country comparisons.

Chapter 2: Technical Performance

Benchmarks

In this chapter, the AI Index reports on benchmarks, recognizing their importance in tracking AI’s technical progress. As a standard practice, the Index sources benchmark scores from leaderboards and public repositories, as well as company papers, blog posts, and product releases. The Index operates under the assumption that the scores reported by companies are accurate and factual. The benchmark scores in this section are current as of early 2026. However, since the publication of the AI Index, newer models may have been released that surpass current state-of-the-art scores.

Works Cited

Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., Tang, J., & Li, J. (2025). LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3639–3664). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.183
Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., & Vechev, M. (2025). MathArena: Evaluating LLMs on uncontaminated math competitions (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2505.23281
Burnham, G., & Adamczewski, T. (2025, June 25). LLMs now accept longer inputs, and the best models can use them more effectively. Epoch AI.
https://epoch.ai/data-insights/context-windows
Center for AI Safety, Phan, L., Gatti, A., Li, N., Khoja, A., Kim, R., Ren, R., Hausenloy, J., Zhang, O., Mazeika, M., Hendrycks, D., Scale AI, Han, Z., Hu, J., Zhang, H., Zhang, C. B. C., Shaaban, M., Ling, J., Shi, S., … Scaramuzza, D. (2026). A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099), 1139–1146. https://doi.org/10.1038/s41586-025-09962-4
Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., Weng, L., & Mądry, A. (2024). MLE-bench: Evaluating machine learning agents on machine learning engineering (Version 6). arXiv. https://doi.org/10.48550/ARXIV.2410.07095
Cheng, Z., Wohnig, S., Gupta, R., Alam, S., Abdullahi, T., Ribeiro, J. A., Nielsen-Garcia, C., Mir, S., Li, S., Orender, J., Bahrainian, S. A., Kirste, D., Gokaslan, A., Glinka, M., Eickhoff, C., & Wolff, R. (2025). Benchmarking is broken—don’t let AI be its own judge (arXiv:2510.07575). arXiv. https://doi.org/10.48550/arXiv.2510.07575
Corrêa, A. B., Pereira, A. G., & Seipp, J. (2025). The 2025 planning performance of frontier large language models (arXiv:2511.09378). arXiv. https://doi.org/10.48550/arXiv.2511.09378
Denain, J.-S., & Ho, A. (2025, September 19). The huge potential implications of long-context inference. Epoch AI. https://epoch.ai/gradient-updates/the-huge-potential-implications-of-long-context-inference
Eriksson, M., Purificato, E., Noroozian, A., Vinagre, J., Chaslot, G., Gomez, E., & Fernandez-Llorca, D. (2025). Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2502.06559
Fu, T., González, M., Conde, J., Merino-Gómez, E., & Reviriego, P. (2025). Have multimodal large language models really learned to tell the time on analog clocks? IEEE Internet Computing, 29(4), 48–54.
https://doi.org/10.1109/MIC.2025.3618144
Gemini Robotics Team, Abeyruwan, S., Ainslie, J., Alayrac, J.-B., Arenas, M. G., Armstrong, T., Balakrishna, A., Baruch, R., Bauza, M., Blokzijl, M., Bohez, S., Bousmalis, K., Brohan, A., Buschmann, T., Byravan, A., Cabi, S., Caluwaerts, K., Casarini, F., Chang, O., … Zhou, Y. (2025). Gemini Robotics: Bringing AI into the physical world (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2503.20020
Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., Santos, E. de O., Järviniemi, O., Barnett, M., Sandler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., … Wildon, M. (2024). FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI (Version 7). arXiv. https://doi.org/10.48550/ARXIV.2411.04872
Han, H., Li, S., Chen, J., Yuan, Y., Wu, Y., Leong, C. T., Du, H., Fu, J., Li, Y., Zhang, J., Zhang, C., Li, L., & Ni, Y. (2025). Video-Bench: Human-aligned video generation benchmark (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2504.04907
Haupt, A., & Brynjolfsson, E. (2025). Position: AI should not be an imitation game: Centaur evaluations. https://digitaleconomy.stanford.edu/app/uploads/2025/06/CentaurEvaluations.pdf
Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., & Liu, Z. (2025). Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2501.13826
Li, Y., Jiang, M., Zhang, S., Yuan, W., Wang, C., & Yang, M. (2024). From imitation to exploration: End-to-end autonomous driving based on world model (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2410.02253
Luong, T., Hwang, D., Nguyen, H. H., Ghiasi, G., Chervonyi, Y., Seo, I., Kim, J., Bingham, G., Lee, J., Mishra, S., Zhai, A., Hu, C. H., Michalewski, H., Kim, J., Ahn, J., Bae, J., Song, X., Trinh, T. H., Le, Q.
V., & Jung, J. (2025). Towards robust mathematical reasoning (arXiv:2511.01846). arXiv. https://doi.org/10.48550/arXiv.2511.01846
Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. (2024). A benchmark for general AI assistants.
Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2022). MTEB: Massive text embedding benchmark (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2210.07316
Park, S.-Y., Cui, C., Ma, Y., Moradipari, A., Gupta, R., Han, K., & Wang, Z. (2025). NuPlanQA: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2503.12772
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A graduate-level Google-proof Q&A benchmark (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2311.12022
Reuel, A., Ghosh, A., Chim, J., Tran, A., Long, Y., Mickel, J., Gohar, U., Yadav, S., Ammanamanchi, P. S., Allaham, M., Rahmani, H. A., Akhtar, M., Friedrich, F., Scholz, R., Riegler, M. A., Batzner, J., Habba, E., Saxena, A., Kornilova, A., … Solaiman, I. (2025). Who evaluates AI’s social impacts? Mapping coverage and gaps in first and third party evaluations (arXiv:2511.05613). arXiv. https://doi.org/10.48550/arXiv.2511.05613
Reuel, A., Hardy, A., Smith, C., Lamparth, M., Hardy, M., & Kochenderfer, M. J. (2024). BetterBench: Assessing AI benchmarks, uncovering issues, and establishing best practices (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2411.12990
Safar, A., & Chichigin, O. (2025). ClockBench: Visual time benchmark where humans beat the clock, LLMs don’t. https://clockbench.ai/ClockBench.pdf
Saxena, R., Gema, A. P., & Minervini, P. (2025, March 18). Lost in time: Clock and calendar understanding challenges in multimodal LLMs.
https://doi.org/10.48550/arXiv.2502.05092 Singh, S., Nan, Y., Wang, A., D’Souza, D., Kapoor, S., Üstün, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N. A., Ermis, B., Fadaee, M., & Hooker, S. (2025). The leaderboard illusion (arXiv:2504.20879). arXiv. https://doi.org/10.48550/arXiv.2504.20879 Truong, S., Tu, Y., Hardy, M., Reuel, A., Tang, Z., Burapacheep, J., Perera, J., Uwakwe, C., Domingue, B., Haber, N., & Koyejo, S. (2025). Fantastic bugs and where to find them in AI benchmarks (arXiv:2511.16842). arXiv. https://doi.org/10.48550/arXiv.2511.16842 Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark (Version 6). arXiv. https://doi.org/10.48550/ARXIV.2406.01574 Wiedemer, T., Li, Y., Vicol, P., Gu, S. S., Matarese, N., Swersky, K., Kim, B., Jaini, P., & Geirhos, R. (2025). Video models are zero-shot learners and reasoners (arXiv:2509.20328). arXiv. https://doi.org/10.48550/arXiv.2509.20328 Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., & Yu, T. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2404.07972 Xu, C., Guan, S., Greene, D., & Kechadi, M.-T. (2024). Benchmark data contamination of large language models: A survey (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2406.04244 Yang, V., Jin, H., Zhong, S., Jiang, S., Wang, Q., Chaudhary, V., & Han, X. (2025). 100-LongBench: Are de facto long-context benchmarks literally evaluating long-context ability? In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Findings of the Association for Computational Linguistics: ACL 2025 (pp. 17560–17576). Association for Computational Linguistics. 
https://doi.org/10.18653/ v1/2025.findings-acl.903 Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A benchmark for tool-agent-user interaction in real-world domains (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2406.12045 Yu, Y., Qi, Z., Huang, Y., Wang, W., Weifeng.liu, Chen, R., & Pei, J. (2025). Long-context language models fail in basic retrieval tasks without sufficient reasoning steps. In C. Christodoulopoulos, T. Chakraborty, C. Rose, & V. Peng (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2025 (pp. 5615–5634). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025. findings-emnlp.301 Zhang, A. K., Perry, N., Dulepet, R., Ji, J., Menders, C., Lin, J. W., Jones, E., Hussein, G., Liu, S., Jasper, D., Peetathawatchai, P., Glenn, A., Sivashankar, V., Zamoshchin, D., Glikbarg, L., Askaryar, D., Yang, M., Zhang, T., Alluri, R., … Liang, P. (2024). Cybench: A framework for evaluating cybersecurity capabilities and risks of language models (Version 4). arXiv. https://doi.org/10.48550/ ARXIV.2408.08926 A P P E N D I X | A I I N D E X R E P O R T 2026 392 Zhang, L., Dong, J., Bai, K., Ni, M., Marton, Z.-C., Chen, Z., & Zhang, J. (2025). ResponsibleRobotBench: Benchmarking responsible robot manipulation using multi-modal large language models (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2512.04308 Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.-S., Qiao, Y., & Liu, Z. (2025). VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2503.21755 Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2023). WebArena: A realistic web environment for building autonomous agents (Version 4). arXiv. https://doi.org/10.48550/ARXIV.2307.13854 Zhu, X., Qi, Y., Zhu, Y., Walters, R., & Platt, R. (2025). 
EquAct: An SE(3)-equivariant multi-task transformer for open-loop robotic manipulation (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2505.21351 A P P E N D I X | A I I N D E X R E P O R T 2026 393 Chapter 3: Responsible AI Conference Submissions Analysis For the analysis on responsible AI-related conference submissions, the AI Index examined the number of responsible AI–related academic submissions at the following conferences: AAAI, AIES, FAccT, ICML, ICLR, and NeurIPS. Specifically, the team scraped the conference websites or repositories of conference submissions for papers containing relevant keywords indicating they could fall into a particular responsible AI category. The papers related to the most ambiguous keywords (e.g., “bias”) were then manually verified by a human team to confirm their categorization, while the rest were controlled according to a categorical RAI-coherence label, produced by prompting each paper title and abstract to a large language model. It is possible for a single paper to belong to multiple responsible AI categories. The keywords searched include: Fairness and bias: algorithmic fairness, bias detection, bias mitigation, discrimination, equity in AI, ethical algorithm design, fair data practices, fair ML, fairness and bias, group fairness, individual fairness, justice, nondiscrimination, representational fairness, unfair, unfairness. Privacy and data governance: anonymity, confidentiality, data breach, data ethics, data governance, data integrity, data privacy, data protection, data transparency, differential privacy, inference privacy, machine unlearning, privacy by design, privacy-preserving, secure data storage, trustworthy data curation. Security: adversarial attack, adversarial learning, AI incident, attacks, audits, cybersecurity, ethical hacking, forensic analysis, fraud detection, red teaming, safety, security, security ethics, threat detection, vulnerability assessment. 
Transparency and explainability: algorithmic transparency, audit, auditing, causal reasoning, causality, explainability, explainable AI, explainable models, human-understandable decisions, interpretability, interpretable models, model explainability, outcome explanation, transparency, xAI.

Chapter 4: Economy

Quid

Quid insights prepared by Heather English and Courtney Prabhakar

Quid uses its own in-house LLM and other smart search features, as well as traditional Boolean query, to search for focus areas, topics, and keywords within many datasets: social media, news, forums and blogs, companies, patents, and other custom feeds of data (e.g., survey data). Quid has many visualization options and data delivery endpoints, including network graphs based on semantic similarity, in-platform dashboarding capabilities, and programmatic PostgreSQL database delivery. Quid applies best-in-class AI and NLP to reveal hidden patterns in large datasets, enabling users to make data-driven decisions accurately, quickly, and efficiently.

Search, Data Sources, and Scope

Over 8 million global public and private company profiles from multiple data sources are indexed to search across company descriptions, while filtering and including metadata ranging from investment information to firmographic information, such as founding year, headquarter location, and more. Company information is updated on a weekly basis. The Quid algorithm reads a large amount of text data from each document to make links between different documents based on their similar language. This process is repeated at an immense scale, which produces a network of different clusters identifying distinct topics or focus areas.
Trends are identified based on keywords, phrases, people, companies, and institutions that Quid identifies and other metadata that is put into the software.

Data

Companies

Organization data is embedded from Capital IQ and Crunchbase. These companies include every type of organization (private, public, operating, operating as a subsidiary, out of business) throughout the world. The investment data includes private investments, M&A, public offerings, minority stakes held by private equity and venture capital, corporate venture arms, governments, and institutions both within and outside the United States. Some data is unavailable, for instance when investors’ names or funding amounts are not disclosed. Quid embeds Capital IQ data as a default and adds in data from Crunchbase for the data points that are not captured in Capital IQ. This not only yields comprehensive and accurate data on all global organizations, but it also captures early-stage startups and funding events data.

Search Parameters

Boolean query is used to search for focus areas, topics, and keywords within the archived company database and within their business descriptions and websites. Quid can filter the search results by HQ region, investment amount, operating status, organization type (private/public), and founding year. Quid then visualizes these companies by semantic similarity. If a search returns more than 7,000 companies, Quid selects the 7,000 most relevant companies for visualization based on the language algorithm.

Boolean search: “artificial intelligence” OR “AI” OR “machine learning” OR “deep learning”

Companies

• Global AI and ML companies that have received investments (private, IPO, M&A) from Jan. 1, 2015, to Dec. 31, 2025.

• Global AI and ML companies that received over $1.5 million in investment over the period Jan. 1, 2015, to Dec. 31, 2025.

• Global data was also pulled for a generative AI query (Boolean search: “generative AI” OR “gen AI” OR “generative artificial intelligence”) for companies that received over $1.5 million in investment over the period Jan. 1, 2015, to Dec. 31, 2025.

Target Event Definitions

• Private placement: A private placement is a sale of newly issued securities (equity or debt) by a company to a select investor or group of investors. The stakes that buyers take in private placements are often minority stakes (under 50%), although it is possible to take control of a company through a private placement, in which case the private placement would be a majority stake investment.

• Minority investment: This refers to a minority stake acquisition, which takes place when a buyer acquires less than 50% of the existing ownership stake in entities, asset products, and business divisions.

• M&A: This refers to a buyer acquiring more than 50% of the existing ownership stake in entities, asset products, and business divisions.

McKinsey & Company

Data used in the “Corporate Activity” section was sourced from the McKinsey global survey “The State of AI in 2025: Agents, Innovation, and Transformation.” The online survey ran from June 25, 2025, to July 29, 2025, and garnered responses from 1,993 participants in 105 nations representing the full range of regions, industries, company sizes, functional specialties, and tenures. Thirty-eight percent of respondents say they work for organizations with more than $1 billion in annual revenues. To adjust for differences in response rates, the data is weighted by the contribution of each respondent’s nation to global GDP.

The AI Index also considered data from previous iterations of the McKinsey survey. These include:

• The State of AI in Early 2024: Gen AI Adoption Spikes and Starts to Generate Value
• The State of AI: How Organizations Are Rewiring to Capture Value
• The State of AI in 2023: Generative AI’s Breakout Year
• The State of AI in 2022—and a Half Decade in Review
• The State of AI in 2021
• The State of AI in 2020
• AI Proves Its Worth, But Few Scale Impact (2019)
• AI Adoption Advances, But Foundational Barriers Remain (2018)

Lightcast

Prepared by Elena Magrini and Rebecca Milde

Lightcast delivers job market analytics that empower employers, workers, and educators to make data-driven decisions. The company’s artificial intelligence technology analyzes hundreds of millions of job postings and real-life career transitions to provide insight into labor market patterns. This real-time strategic intelligence offers crucial insights, such as what jobs are most in demand, the specific skills employers need, and the career directions that offer the highest potential for workers. For more information, visit www.lightcast.io.
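As a rough illustration of the GDP weighting described for the McKinsey survey above, the sketch below shows one way such a weight could be applied. The report does not spell out the exact scheme, so this is an assumption: each nation's GDP share is split evenly among that nation's respondents. The function name `gdp_weighted_share`, the nations "A" and "B", and all numbers are hypothetical and are not taken from the survey.

```python
from collections import Counter


def gdp_weighted_share(responses, gdp_share):
    """Return the GDP-weighted share of respondents answering True.

    responses: list of (nation, answer) pairs, where answer is a bool.
    gdp_share: dict mapping nation -> that nation's share of global GDP.
    """
    # Number of respondents per nation.
    counts = Counter(nation for nation, _ in responses)
    # Assumed scheme: a nation's GDP share is divided evenly among
    # its respondents, so over-sampled nations do not dominate.
    weights = [gdp_share[nation] / counts[nation] for nation, _ in responses]
    total = sum(weights)
    yes = sum(w for w, (_, answer) in zip(weights, responses) if answer)
    return yes / total


# Hypothetical toy data: nation "A" is over-represented among respondents.
responses = [("A", True), ("A", True), ("A", False), ("B", False)]
gdp_share = {"A": 0.25, "B": 0.75}

unweighted = sum(answer for _, answer in responses) / len(responses)
weighted = gdp_weighted_share(responses, gdp_share)
print(f"unweighted: {unweighted:.3f}, GDP-weighted: {weighted:.3f}")
```

In this toy case the unweighted share is one half, while the weighted share falls to about one sixth, because the single respondent from the larger economy carries most of the weight.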
Job Postings Data

To support these analyses, Lightcast mined its dataset of billions of job postings collected since 2010. Lightcast collects postings from over 51,000 online job sites to develop a comprehensive, real-time portrait of labor market demand. It aggregates job postings, removes duplicates, and extracts data from job postings text. This includes information on job title, employer, industry, and region, as well as required experience, education, and skills. Job postings are useful for understanding trends in the labor market because they allow for a detailed, real-time look at the skills employers seek.

To assess the representativeness of job postings data, Lightcast conducts a number of analyses to compare the distribution of job postings to the distribution of official government and other third-party sources in the United States. The primary source of government data on U.S. job postings is the Job Openings and Labor Turnover Survey (JOLTS) program, conducted by the Bureau of Labor Statistics. Based on comparisons between JOLTS and Lightcast, the labor market demand captured by Lightcast data represents over 99% of the total labor demand. Jobs not posted online are usually in small businesses (the classic example being the “Help Wanted” sign in a restaurant window) and union hiring halls.

Measuring Demand for AI

To measure employer demand for AI skills, Lightcast uses its skills taxonomy of over 33,000 skills. These skills are organized hierarchically in over 400 skills clusters and 32 skills categories. From this full list, Lightcast identified a number of skill clusters within the broader umbrella of AI, for a total of 10 clusters. Nine of these clusters were included in the 2025 AI Index report: AI Ethics, Governance, and Regulation; Artificial Intelligence; Autonomous Driving; Generative AI; Machine Learning; Natural Language Processing; Neural Networks; Robotics; and Visual Image Recognition.
An additional cluster, Agentic AI, was added this year to reflect developments at the AI innovation frontier. Across the 10 clusters, Lightcast experts identified over 300 AI skills, and a job posting is considered an AI job if it includes at least one of those skills within the text of the job listing. The list of AI skills from Lightcast data is shown below, with associated skill clusters. A total of 24 skills were added since the previous edition, and these skills are highlighted in yellow below. While some skills are considered to be in the AI cluster specifically, for the purposes of this report, all skills below were considered AI skills.

AI Agents [New cluster: it includes a combination of skills already within the AI definition and new skills added in 2025]: Chatbot, Agentic AI, ChatGPT, AI Agents, Conversational AI, Microsoft Copilot, Multi-Agent Systems, Langgraph, Microsoft Copilot Studio, Agentic Systems, Google Bard, Intelligent Agent, Intelligent Virtual Assistant, Embedded AI, Customer Engagement Suite with Google AI, Bot Framework, Amazon Alexa, Cortana, Interactive Kiosk, IPSoft Amelia, CrewAI, Game Ai, Generative AI Agents, Autonomic Computing, Botpress, Cursor AI, Swarm Intelligence, ChatGPT Integration, AgentGPT.

AI Ethics, Governance, and Regulations: Ethical AI, AI Security, Data Sovereignty, AI Safety, Artificial Intelligence Risk, AI Literacy, Human AI Interaction, AI Alignment.
Artificial Intelligence: Artificial Intelligence, Data Version Control (DVC), Edge Intelligence, Applications Of Artificial Intelligence, AI Research, Artificial Intelligence Development, AIOps (Artificial Intelligence For IT Operations), AI/ML Inference, Automated Data Cleaning, AI Innovation, Knowledge-Based Systems, Artificial Intelligence Markup Language (AIML), Intelligent Systems, PineCone, Synthetic Data Generation, Explainable AI (XAI), Open Neural Network Exchange (ONNX), Knowledge Engineering, AI Testing, Expert Systems, Azure Cognitive Services, Knowledge Distillation, Knowledge-Based Configuration, Artificial Intelligence Systems, Weka, Knowledge Representation, Cognitive Computing, Reasoning Systems, Intelligent Control, Cognitive Automation, Baidu, Weaviate, AI Personalization, Computational Intelligence, Watson Studio, Artificial General Intelligence, Watson Conversation, Neuro-Symbolic AI, OpenAI Gym, Operationalizing AI, Qdrant, PennyLane, Google Quantum AI.

Autonomous Driving: Advanced Driver Assistance Systems, Autonomous Vehicles, Light Detection And Ranging (LiDAR), Remote Sensing, Autonomous System, Guidance Navigation And Control Systems, OpenCV, Object Tracking, Dynamic Routing, Path Analysis, Scene Understanding, Autonomous Cruise Control Systems, Path Finding, Unmanned Aerial Systems (UAS).

Generative AI: Generative Artificial Intelligence, Large Language Modeling, Prompt Engineering, Retrieval Augmented Generation, Variational Autoencoders, Generative Adversarial Networks, Text to Speech (TTS), Multimodal Models, Image Inpainting, Text Summarization, Multimodal Learning, Image Super-Resolution, Stable Diffusion, DALL-E Image Generator, Adobe Sensei, AI-Generated Code, Context Engineering, ChatGPT Prompt, Beautiful.AI, Variational Autoencoders (VAEs).

5 https://lightcast.io/open-skills

Machine Learning: Machine Learning, Apache Spark, PyTorch (Machine Learning Library), Predictive Modeling, MLOps (Machine Learning Operations), Scikit-Learn (Python Package), Machine Learning Algorithms, AWS SageMaker, Reinforcement Learning, Transformer (Machine Learning Model), Feature Engineering, MLflow, Data Classification, Vertex AI, Decision Tree Learning, Azure Machine Learning, Machine Learning Model Training, Unsupervised Learning, Machine Learning Methods, Distributed Machine Learning, Kubeflow, Recommender Systems, Concept Drift Detection, Random Forest Algorithm, Xgboost, Boosting, Support Vector Machine, Supervised Learning, Torch (Machine Learning), Dask (Software), Association Rule Learning, AutoGen, Feature Extraction, Gradient Boosting, Cluster Analysis, Feature Selection, H2O.ai, ModelOps, Hyperparameter Optimization, Transfer Learning, Automated Machine Learning, Bagging Techniques, Collaborative Filtering, Machine Learning Model Monitoring And Evaluation, Federated Learning, K-Means Clustering, Dimensionality Reduction, Support Vector Machines (SVM), Feature Learning, Cyber-Physical Systems, Neural Architecture Compression, Ensemble Methods, Confusion Matrix, Training Datasets, Decision Models, Bayesian Belief Networks, Naive Bayes Classifier, Programmatic Media Buying, Genetic Algorithm, Meta-Learning, Loss Functions, Gradient Boosting Machines (GBM), Semi-Supervised Learning, Test Datasets, Objective Function, Markov Chain, Hidden Markov Model, Gaussian Process, Theano (Software), Matrix Factorization, Apache Mahout, CHi-Squared Automatic Interaction Detection (CHAID), Microsoft Cognitive Toolkit (CNTK), Kernel Methods, Sorting Algorithm, Topological Data Analysis (TDA), Pydata, Adversarial Machine Learning, Oracle Autonomous Database, Perceptron, Inference Engine, Dbscan, Embedded Intelligence, Attention Mechanisms, Apache MADlib, Incremental Learning, Neural Architecture Search (NAS), PyTorch Lightning, Vowpal Wabbit, Dlib (C++ Library),
AdaBoost (Adaptive Boosting), Boltzmann Machine, Zero Shot Learning, mlpack (C++ Library), Google AutoML, t-SNE (t-distributed Stochastic Neighbor Embedding), Apache SINGA, Soft Computing, Evolutionary Programming, Predictionio, Expectation Maximization Algorithm, Google Cloud ML Engine, Meta Learning, LIBSVM, Classification And Regression Tree (CART).

Natural Language Processing: AI Translation, Natural Language Processing (NLP), Text Retrieval Systems, Screen Reader, Hugging Face (NLP Framework), Language Models, LLaMA (Language Model), Natural Language Understanding (NLU), Tokenization, Text Mining, Sentiment Analysis, Speech Recognition, BERT (NLP Model), Optical Character Recognition (OCR), Computational Linguistics, Hugging Face Transformers, Semantic Search, Semantic Kernel, Natural Language Understanding, Natural Language Generation (NLG), Machine Translation, Voice Assistant Technology, Amazon Textract, Speech Synthesis, Natural Language Generation, Word2Vec Models, Voice User Interface, Semantic Analysis, Speech Recognition Software, Dialog Systems, Statistical Language Acquisition, Word Embedding, Fuzzy Logic, Voice Interaction, Microsoft LUIS, ANTLR, Apache OpenNLP, Handwriting Recognition, Summarization Methods, Kaldi, Latent Dirichlet Allocation, Natural Language Programming, Semantic Parsing, Vespa, Nuance Mix, fastText, Seq2Seq, Natural Language Toolkits, Small Language Model, Natural Language User Interface, Sentence Transformers, AI Copywriting, DeepSpeech, Lexalytics, Nearest Neighbour Algorithm, Semantic Interpretation For Speech Recognition, Shogun, Text-To-Speech, Language Model.
Neural Networks: TensorFlow, Deep Learning, Keras (Neural Network Library), Deep Learning Methods, Graph Neural Networks (GNNs), Apache MXNet, Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Residual Networks (ResNet), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNNs), Deep Reinforcement Learning (DRL), Cudnn, Sequence-to-Sequence Models (Seq2Seq), Autoencoders, Caffe2, Reinforcement Learning (RL), OpenVINO, Evolutionary Acquisition Of Neural Topologies, Fast.ai, Caffe (Framework), Chainer (Deep Learning Framework), Deeplearning4j, Spiking Neural Networks, Neural Ordinary Differential Equations, PaddlePaddle, Pybrain, Convolutional Neural Networks, Artificial Neural Networks.
Robotics: Robotic Systems, Robot Operating Systems, OpenAI Gym Environments, Motion Planning, SLAM Algorithms (Simultaneous Localization And Mapping), Robot Framework, Robotic Automation Software, Advanced Robotics, Robotic Programming, Servomotor, Nvidia Jetson, Robotic Liquid Handling Systems, Cognitive Robotics, Reinforcement Learning from Human Feedback (RLHF), Meta-Reinforcement Learning.
Visual Image Recognition: Computer Vision, Image Analysis, Machine Vision, General-Purpose Computing On Graphics Processing Units, 3D Reconstruction, Image Sensor, Image Recognition, Eye Tracking, Facial Recognition, Motion Analysis, Image Segmentation, Object Recognition, Pose Estimation, Contextual Image Classification, Instance Segmentation, Digital Image Processing, Thermal Imaging Analysis, Gesture Recognition, Activity Recognition, OmniPage, Image Captioning, Imagenet, Face Detection, Digital Twin Technology, Deck.gl, Image Matching, Mnist, Realsense.

LinkedIn
Prepared by Rosie Hood, Akash Kaura, and Mar Carpanelli

LinkedIn Data
This body of work is drawn from the anonymized and aggregated profile information of LinkedIn’s more than 1.3 billion members worldwide.
It therefore is influenced by how members choose to use the platform, which can vary based on professional, social, and regional culture, as well as overall site availability and accessibility. In publishing these insights from LinkedIn’s Economic Graph, we want to provide accurate statistics while ensuring our members’ privacy. As a result, all data shows aggregated information for the corresponding period following strict data quality thresholds that prevent disclosing information about specific individuals.

Country Sample
LinkedIn provides data on Argentina, Australia, Austria, Belgium, Brazil, Canada, Chile, Costa Rica, Croatia, Cyprus, Czechia, Denmark, Estonia, Finland, France, Germany, Greece, Hong Kong SAR, Hungary, Iceland, India, Indonesia, Ireland, Israel, Italy, Latvia, Lithuania, Luxembourg, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Romania, Saudi Arabia, Singapore, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Türkiye, United Arab Emirates, United Kingdom, United States, and Uruguay.

Skills
LinkedIn members self-report their skills on their LinkedIn profiles. Currently, more than 42,000 distinct, standardized skills are identified by LinkedIn. LinkedIn categorizes AI skills into two mutually exclusive groups: “AI Engineering” and “AI Literacy.” AI Engineering skills broadly refer to the technical expertise and practical competencies required to design, develop, deploy, and maintain artificial intelligence systems, and AI Literacy skills refer to the knowledge, abilities, and critical thinking competencies needed to understand, evaluate, and effectively interact with artificial intelligence technologies. As skills are constantly evolving, LinkedIn maintains and refreshes these classifications on a periodic basis.

Industry
LinkedIn’s industry taxonomy is a collection of entities that share economic activities and contribute to a specific product or service.
An industry represents the products or services that a company offers or sells. LinkedIn analyzes the following industries in the context of AI skills penetration: Education; Financial Services; Manufacturing; Professional Services; and Technology, Information, and Media. For all other metrics, LinkedIn analyzes the following industries: Accommodation and Food Services; Administrative and Support Services; Construction; Consumer Services; Education; Entertainment Providers; Farming, Ranching, and Forestry; Financial Services; Government Administration; Hospitals and Health Care; Manufacturing; Oil, Gas, and Mining; Professional Services; Real Estate and Equipment Rental Services; Retail; Technology, Information, and Media; Transportation, Logistics, Supply Chain, and Storage; Utilities; and Wholesale.

Gender
LinkedIn recognizes that some members identify beyond the traditional gender constructs of “man” and “woman.” If not explicitly self-identified, LinkedIn infers the gender of members included in this analysis either by the pronouns used on their LinkedIn profiles or on the basis of first name. Members whose gender could not be inferred were excluded from any gender analysis. Note: LinkedIn filtered out countries where its gender attribution algorithm lacked sufficient coverage.

AI Jobs or Occupations
LinkedIn member titles are standardized and grouped into over 16,000 occupations. These are not sector or country specific. An AI job is an occupation that requires AI skills to perform the job. Examples include (but are not limited to): machine learning engineer, artificial intelligence specialist, data scientist, and computer vision engineer.

AI Talent
A LinkedIn member is considered AI talent if they have explicitly added at least two AI engineering skills to their profile and/or they are or have been employed in an AI job.
A LinkedIn member is considered AI literate if they have added at least one AI literacy skill to their profile.

Methodologies

1. Top AI Skills
These are the AI skills most frequently added by members by year. Sample interpretation: The most added AI engineering skills globally are machine learning, AI, and deep learning.

2. Fastest Growing AI Skills
The year-over-year (YoY) growth rate for AI skills most frequently added by all members. Note: LinkedIn applies thresholds to skill-add volumes in the most recent year; thresholds are set at the 50th percentile of the most recent year’s distribution of AI skill adds by country. Sample interpretation: The fastest growing AI engineering skills globally are custom GPTs, AI productivity, and AI agents.
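The thresholding and growth calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the input dictionaries, and the counts are hypothetical, and the choices to keep skills exactly at the median and to skip skills with no adds in the prior year are assumptions, not details given by LinkedIn.

```python
from statistics import median


def fastest_growing_skills(adds_prev, adds_curr, top_n=3):
    """Rank skills by YoY growth in adds, after a median volume threshold.

    adds_prev / adds_curr map skill name -> number of adds in the previous
    and most recent year (hypothetical inputs for a single country).
    """
    # Threshold at the 50th percentile of the most recent year's adds.
    threshold = median(adds_curr.values())
    growth = {}
    for skill, curr in adds_curr.items():
        prev = adds_prev.get(skill, 0)
        # Assumption: keep skills at or above the threshold, and skip skills
        # with no adds in the previous year (growth rate undefined).
        if curr < threshold or prev == 0:
            continue
        growth[skill] = (curr - prev) / prev
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up counts, for illustration only.
adds_2024 = {"machine learning": 1000, "ai agents": 50, "custom gpts": 20, "prolog": 5}
adds_2025 = {"machine learning": 1100, "ai agents": 400, "custom gpts": 300, "prolog": 10}
ranking = fastest_growing_skills(adds_2024, adds_2025)
print(ranking)
```

In this toy input, "custom gpts" grows quickly but falls below the median add volume, so it is excluded. That is the kind of low-volume noise the threshold is meant to filter out.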

3. AI Talent Concentration
The counts of AI talent are used to calculate talent concentration metrics. For instance, to calculate the country-level AI talent concentration, LinkedIn divides the count of AI talent in a country by the count of LinkedIn members in that country. Note: Concentration metrics may be influenced by LinkedIn coverage in these countries and should be used with caution. Sample interpretation: AI talent with AI engineering skills represents 0.88% of LinkedIn members in the United States.
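As a small worked example of the ratio above, here is a minimal sketch. The function name and the member counts are hypothetical and were chosen only so that the result matches the 0.88% figure in the sample interpretation.

```python
def talent_concentration(ai_talent_count, member_count):
    """Share of a country's LinkedIn members who are classified as AI talent."""
    if member_count <= 0:
        raise ValueError("member_count must be positive")
    return ai_talent_count / member_count


# Hypothetical counts: 8,800 AI talent among 1,000,000 members.
concentration = talent_concentration(8_800, 1_000_000)
print(f"{concentration:.2%}")
```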

4. Relative AI Talent Hiring Rate YoY Ratio
The LinkedIn hiring rate is a measure of hires normalized by LinkedIn membership. It is computed as the percentage of LinkedIn members who added a new employer in the same period the job began, divided by the total number of LinkedIn members in the corresponding location. The AI hiring rate is computed using the overall hiring rate methodology, but it only considers members classified as AI talent. The relative AI talent hiring rate YoY ratio is the year-over-year change in the AI hiring rate relative to the overall hiring rate in the same country. LinkedIn shares a three-month moving average. Sample interpretation: In the United States, the ratio of AI talent hiring relative to overall hiring has grown 24.7% year over year.
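One way to read this definition is sketched below. The text does not say at which step the three-month moving average is applied, so smoothing the relative ratio before taking the 12-month change is an assumption, as are the function names and the monthly counts.

```python
def moving_average(values, window=3):
    """Trailing moving average; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def relative_hiring_yoy(ai_hires, ai_members, all_hires, all_members):
    """YoY change in (AI hiring rate / overall hiring rate), per month."""
    ai_rate = [h / m for h, m in zip(ai_hires, ai_members)]
    overall_rate = [h / m for h, m in zip(all_hires, all_members)]
    relative = [a / o for a, o in zip(ai_rate, overall_rate)]
    # Assumption: smooth the relative ratio, then compare with 12 months earlier.
    smoothed = moving_average(relative, window=3)
    yoy = []
    for t, value in enumerate(smoothed):
        if t < 12 or value is None or smoothed[t - 12] is None:
            yoy.append(None)
        else:
            yoy.append(value / smoothed[t - 12] - 1)
    return yoy


# Fifteen months of made-up counts: AI hiring picks up in the last three months.
ai_hires = [20] * 12 + [25] * 3
ai_members = [1_000] * 15
all_hires = [2_000] * 15
all_members = [100_000] * 15
yoy = relative_hiring_yoy(ai_hires, ai_members, all_hires, all_members)
print(yoy[-1])
```

With these inputs the relative ratio moves from about 1.0 to about 1.25, so the final month shows growth of roughly 25% year over year.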

5. Skills Penetration

Skills Genome
For any entity (occupation, country, industry, etc.), the skills genome is an ordered list (a vector) of the 50 most characteristic skills of that entity. These most characteristic skills are determined using a TF-IDF algorithm, which down-ranks ubiquitous skills that add little information about that specific entity (e.g., Microsoft Word) and up-ranks skills unique to that specific entity (e.g., artificial intelligence). Further details are available at LinkedIn’s Skills Genome and the LinkedIn–World Bank Methodology note. As an example, Table 1 details the skills genome of the technology, information, and media industry in the United States in 2024, displaying the top 20 skills ranked by TF-IDF.

Table 1. Skills genome of the technology, information and media industry in the United States in 2024
Top 20 skills ranked by TF-IDF

TF-IDF skill rank | Skill name
1 | Amazon Web Services (AWS)
2 | Software as a Service (SaaS)
3 | Artificial intelligence (AI)
4 | Python (programming language)
5 | Go-to-market strategy
6 | Customer success
7 | Large language models (LLM)
8 | Salesforce.com
9 | SQL
10 | Generative AI

AI Skills Penetration
This metric measures the intensity of AI skills in a given entity through the following methodology:

• LinkedIn computes frequencies for all self-added skills by LinkedIn members in a given entity (occupation, industry, etc.) from 2015 on.

• LinkedIn reweights skill frequencies using a TF-IDF model to get the top 50 most representative skills in that entity. These 50 skills compose the skills genome of that entity.

• LinkedIn computes the share of skills that belong to the AI skill group out of the top skills in the selected entity.

Interpretation: The AI skill penetration rate signals the prevalence of AI skills across occupations, or the intensity with which LinkedIn members use AI skills in their jobs. For example, the top 50 skills for the occupation of engineer are calculated based on the weighted frequency with which they appear in LinkedIn members’ profiles. If four of the skills that engineers possess belong to the AI skill group, this measure indicates that the penetration of AI skills is estimated at 8% (4 out of 50) among engineers.

Relative AI Skills Penetration
To allow for skills penetration comparisons across countries, the skills genomes are calculated and a relevant benchmark is selected (e.g., a global average). A ratio is then constructed between a country and the benchmark’s AI skills penetrations, controlling for occupations. Sample interpretation: A country’s relative AI skills penetration of 1.5 indicates that AI skills are 1.5 times as frequent as in the benchmark, for an overlapping set of occupations.

Global Comparison
For cross-country comparison, LinkedIn presents the relative penetration rate of AI skills, measured as the sum of the penetration of each AI skill across occupations in a given country, divided by the average global penetration of AI skills across the overlapping occupations in a sample of countries. Sample interpretation: A relative penetration rate of 2 means the average penetration of AI skills in a particular country is two times the global average across the same set of occupations.

Global Comparison: By Industry
The relative AI skills penetration by country for a given industry provides an in-depth sectoral decomposition of AI skills penetration across industries and countries.
Sample interpretation: A country’s relative AI skills penetration rate of 2 in the education sector means the average penetration of AI skills in that country is two times the global average across the same set of occupations in that sector.

Global Comparison: By Gender
The relative AI skills penetration by gender provides a cross-country comparison of AI skills penetrations within a gender. Since the global averages are distinct for each gender, this metric should only be used to compare country rankings within each gender, not for cross-gender comparisons within countries. Sample interpretation: If a country’s AI skills penetration for women is 1.5, that means members who are women in that country are 1.5x more likely to list AI skills than the average female member in all countries pooled together across the same set of occupations that exist in the country-gender combination.

Global Comparison: Across Genders
The relative AI skills penetration across genders allows for cross-gender comparisons within and across countries globally, since LinkedIn compares a country’s AI skills penetration by gender to the same global average regardless of gender.
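The skills penetration steps in this section (rank skills by TF-IDF, take the share of AI skills in the resulting genome, then compare with a benchmark over shared occupations) can be sketched as follows. This is an illustrative reading and not LinkedIn's implementation: the smoothed IDF formula, the averaging in `relative_penetration`, all function names, and all counts are assumptions.

```python
import math
from collections import Counter


def skills_genome(skill_counts_by_entity, entity, top_k=50):
    """Top-k skills of one entity, ranked by a basic TF-IDF score."""
    n_entities = len(skill_counts_by_entity)
    doc_freq = Counter()
    for counts in skill_counts_by_entity.values():
        doc_freq.update(counts.keys())  # number of entities listing each skill
    counts = skill_counts_by_entity[entity]
    total = sum(counts.values())
    scores = {}
    for skill, count in counts.items():
        tf = count / total
        # Assumption: smoothed IDF, so ubiquitous skills score close to zero.
        idf = math.log((1 + n_entities) / (1 + doc_freq[skill]))
        scores[skill] = tf * idf
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]


def ai_skill_penetration(genome, ai_skills):
    """Share of genome skills that belong to the AI skill group."""
    if not genome:
        return 0.0
    return sum(1 for skill in genome if skill in ai_skills) / len(genome)


def relative_penetration(country_by_occupation, benchmark_by_occupation):
    """Ratio of mean penetration over occupations present in both inputs."""
    shared = sorted(country_by_occupation.keys() & benchmark_by_occupation.keys())
    if not shared:
        raise ValueError("no overlapping occupations")
    country_mean = sum(country_by_occupation[o] for o in shared) / len(shared)
    benchmark_mean = sum(benchmark_by_occupation[o] for o in shared) / len(shared)
    return country_mean / benchmark_mean


# Made-up skill counts: "microsoft word" is listed everywhere, so it ranks last.
counts = {
    "engineer": {"microsoft word": 50, "machine learning": 30},
    "writer": {"microsoft word": 80, "editing": 40},
    "analyst": {"microsoft word": 60, "sql": 20},
}
toy_genome = skills_genome(counts, "engineer")

# The 8% example from the text: 4 AI skills in a 50-skill genome.
ai_group = {"machine learning", "deep learning", "computer vision", "nlp"}
engineer_genome = sorted(ai_group) + [f"skill_{i}" for i in range(46)]
penetration = ai_skill_penetration(engineer_genome, ai_group)

ratio = relative_penetration(
    {"engineer": 0.12, "analyst": 0.06},
    {"engineer": 0.08, "analyst": 0.04, "teacher": 0.02},
)
print(toy_genome, penetration, ratio)
```

Restricting the comparison to shared occupations is what the text calls controlling for occupations: the "teacher" entry in the benchmark is ignored because the country has no matching occupation.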

6. Female Representation in AI
The share of AI talent that are women. Sample interpretation: Female representation within AI talent with AI engineering skills is 30.5% globally.

7. AI Talent Migration
Data on migration comes from the World Bank Group–LinkedIn “Digital Data for Development” partnership (see https://linkedindata.worldbank.org/ and Zhu et al. (2018)). LinkedIn migration rates are derived from the self-identified locations of LinkedIn member profiles. For example, when a LinkedIn member updates their location from Paris to London, this is counted as a migration. Migration data is available from 2019 on. LinkedIn data provides insights to countries on AI talent gained or lost due to migration trends. AI talent migration is considered for all members with AI skills or holding AI jobs at time “t,” with country A as the country of interest and country B as the source of inflows and destination for outflows. Net AI talent migration between country A and country B is therefore calculated as the net flow between the two countries, where net flows are defined as total arrivals minus departures within a given time period. LinkedIn membership varies between countries, making it a challenge to interpret absolute movements of members from one country to another. Migration flows are therefore normalized with respect to each country. For example, for country A, all absolute net flows into and out of country A, regardless of origin and destination countries, are normalized based on the LinkedIn membership of country A at the end of each year and multiplied by 10,000. Hence, this metric indicates relative talent migration from all countries to and from country A. Note: Minimum thresholds have been applied to ensure that transitions have a sufficient sample size. Sample interpretation: The United States had a positive net flow of AI talent relative to its membership size at 1.07 net flow per 10,000 members.
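The normalization described above can be sketched as follows. This is a minimal illustration: the function names, the partner countries, and the flow counts are hypothetical, with the counts chosen so that the result matches the 1.07 per 10,000 figure in the sample interpretation.

```python
def bilateral_net_flow(arrivals_by_origin, departures_by_destination, partner):
    """Net AI talent flow between country A and one partner country."""
    return arrivals_by_origin.get(partner, 0) - departures_by_destination.get(partner, 0)


def net_migration_per_10k(arrivals_by_origin, departures_by_destination, members_end_of_year):
    """Net flow across all partners, per 10,000 members of country A."""
    if members_end_of_year <= 0:
        raise ValueError("members_end_of_year must be positive")
    arrivals = sum(arrivals_by_origin.values())
    departures = sum(departures_by_destination.values())
    return (arrivals - departures) / members_end_of_year * 10_000


# Made-up flows of AI talent for country A over one year.
arrivals = {"B": 3_000, "C": 2_000}
departures = {"B": 2_500, "C": 1_430}
net_with_b = bilateral_net_flow(arrivals, departures, "B")
rate = net_migration_per_10k(arrivals, departures, members_end_of_year=10_000_000)
print(net_with_b, f"{rate:.2f}")
```

Dividing by country A's own membership makes the figure comparable across countries with very different LinkedIn coverage, which is the reason the text gives for normalizing.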

8. Career Transitions into AI Jobs
LinkedIn considers the source occupations that feed AI occupations, analyzing the share of transitions into AI occupations pooled over a five-year period. Career transitions are computed by aggregating member-level job transitions from one occupation to another occupation that the member has previously not held. LinkedIn excludes first occupations added by new graduates and intra-occupation transitions. Sample interpretation: In the United States, 26.9% of transitions into AI engineer came from software engineer, followed by 13.3% from data scientist.
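The transition rules above can be sketched as follows. The data layout (one chronological list of occupations per member), the function name, and the example histories are assumptions made for illustration. The first occupation in each history has no source occupation, so it is never counted as a transition.

```python
from collections import Counter


def transition_shares(histories, target):
    """Share of qualifying transitions into `target`, by source occupation.

    histories maps member id -> chronological list of occupations held
    (hypothetical layout).
    """
    sources = Counter()
    for occupations in histories.values():
        held = set()
        previous = None
        for current in occupations:
            is_new_occupation = current not in held
            # Count only moves into an occupation the member has not held
            # before; this also drops intra-occupation moves.
            if previous is not None and current == target and is_new_occupation:
                sources[previous] += 1
            held.add(current)
            previous = current
    total = sum(sources.values())
    return {occ: n / total for occ, n in sources.items()} if total else {}


# Made-up histories; m4 returns to a previously held occupation, so that
# second move into "ai engineer" is not counted.
histories = {
    "m1": ["software engineer", "ai engineer"],
    "m2": ["data scientist", "ai engineer"],
    "m3": ["software engineer", "software engineer", "ai engineer"],
    "m4": ["ai engineer", "software engineer", "ai engineer"],
}
shares = transition_shares(histories, "ai engineer")
print(shares)
```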

9. Fastest Growing AI Occupations
This metric identifies the fastest growing occupations within each AI talent segment by country, based on year-over-year changes in occupation share. For each occupation, LinkedIn calculates the year-over-year growth rate as the percentage change in the share of members listing an AI occupation compared with the same month one year earlier. Sample interpretation: In the United States, AI product manager is among the fastest growing occupations within the AI engineering segment, showing the largest year-over-year increase.

10. AI Skills Diffusion Index
The AI skills diffusion index measures the breadth and depth of adoption of AI skills across the LinkedIn member base. It captures both (i) how widespread a given group of AI skills is among members in a country and (ii) how diverse the set of AI skills represented on member profiles is within that group. The index is constructed by tracking the share of members who list at least one AI skill on their profile, adjusted for the diversity of distinct AI skills observed in a given country. Values are indexed relative to a baseline period where a member threshold is reached at the country level, which ensures stability and comparability across countries and over time. The resulting index value at any point in the time series reflects how much the adjusted share of members with that AI skill group has increased relative to the baseline. Because the index is normalized to an early reference point, it is best interpreted as a measure of growth in diffusion rather than as a level or penetration rate. Sample interpretation: In the United States, the AI skills diffusion index for AI engineering indicates that the adjusted share of members listing AI literacy skills is 1075x higher than it was in March 2023.

Acknowledgments

LinkedIn gratefully acknowledges the contributions of Murat Erer and Carl Shan in developing these methodologies and metrics, and the feedback from our collaborators at the OECD.AI, Stanford Institute for Human-Centered AI, World Bank, and Centro Nacional de Inteligencia Artificial (Cenia).

International Federation of Robotics (IFR)
Data presented in the Robot Installations section was sourced from the World Robotics 2025 report.

Works Cited
Aldasoro, I., Gambacorta, L., Rozalia, P., Revoltella, D., Weiss, C., & Wolski, M. (2026, February 17). How AI is affecting productivity and jobs in Europe. CEPR. https://cepr.org/voxeu/columns/how-ai-affecting-productivity-and-jobs-europe
Arnold, Z., & Murphy, B. (2021). Understanding Chinese government guidance funds. https://cset.georgetown.edu/wp-content/uploads/CSET-Understanding-Chinese-Government-Guidance-Funds.pdf
Arnon, A. (2025, September 8). The projected impact of generative AI on future productivity growth. Penn Wharton Budget Model. https://budgetmodel.wharton.upenn.edu/p/2025-09-08-the-projected-impact-of-generative-ai-on-future-productivity-growth/
Becker, J., Rush, N., Barnes, E., & Rein, D. (2025). Measuring the impact of early-2025 AI on experienced open-source developer productivity. https://arxiv.org/abs/2507.09089v2
Beraja, M., Peng, W., Yang, D. Y., & Yuchtman, N. (2024). Government as venture capitalist in AI (Working Paper No. 32701). National Bureau of Economic Research. https://www.nber.org/system/files/working_papers/w32701/w32701.pdf
Bick, A., Blandin, A., & Deming, D. J. (2025, November 13). The State of generative AI adoption in 2025. Federal Reserve Bank. https://www.stlouisfed.org/on-the-economy/2025/nov/state-generative-ai-adoption-2025
Bick, A., Blandin, A., & Deming, D. J. (2026). The rapid adoption of generative AI. Management Science. https://doi.org/10.1287/mnsc.2025.02523
Brynjolfsson, E. (2026, February 15). The AI productivity take-off is finally visible. https://www.ft.com/content/4b51d0b4-bbfe-4f05-b50a-1d485d419dc5
Brynjolfsson, E., Chandar, B., & Chen, R. (2025). Canaries in the coal mine? Six facts about the recent employment effects of artificial intelligence. Stanford Digital Economy Lab. https://digitaleconomy.stanford.edu/app/uploads/2025/12/CanariesintheCoalMine_Nov25.pdf
Brynjolfsson, E., Li, D., & Raymond, L. (2025). Generative AI at work. The Quarterly Journal of Economics, 140(2), 889–942. https://doi.org/10.1093/qje/qjae044
Brynjolfsson, E., Collis, A., Eggers, F., Kazinnik, S., & Nguyen, D. (2026). What is Generative AI Worth? Available at SSRN: https://ssrn.com/abstract=6569938
Cui, K. Z., Demirer, M., Jaffe, S., Musolff, L., Peng, S., & Salz, T. (2025). The effects of generative AI on high-skilled work: Evidence from three field experiments with software developers. https://economics.mit.edu/sites/default/files/inline-files/draft_copilot_experiments.pdf
Eckhardt, S., & Goldschlag, N. (2025). AI and jobs: The final word (Until the next one). https://eig.org/wp-content/uploads/2025/08/EIG-AI-and-Jobs.pdf
Felten, E., Raj, M., & Seamans, R. (2021). Occupational, industry, and geographic exposure to artificial intelligence: A novel dataset and its potential uses. Strategic Management Journal, 42(12), 2195–2217. https://doi.org/10.1002/smj.3286
Filippucci, F., Gal, P., Laengle, K., & Schief, M. (2025). Macroeconomic productivity gains from artificial intelligence in G7 economies. OECD Publishing. https://doi.org/10.1787/a5319ab5-en
Frank, M. R., Sabet, A. J., Simon, L., Bana, S. H., & Yu, R. (2026). AI-exposed jobs deteriorated before ChatGPT. https://arxiv.org/abs/2601.02554v1
Gimbel, M., Kinder, M., Kendall, J., & Lee, M. (2025). Evaluating the impact of AI on the labor market: Current state of affairs. The Budget Lab at Yale. https://budgetlab.yale.edu/research/evaluating-impact-ai-labor-market-current-state-affairs
Ho Choi, J., & Xie, C. (2025). Human + AI in accounting: Early evidence from the field. https://www.gsb.stanford.edu/faculty-research/working-papers/human-ai-accounting-early-evidence-field
Hosseini, S. M., & Lichtinger, G. (2025). Generative AI as seniority-biased technological change: evidence from U.S. résumé and job posting data. https://www.alejandrobarros.com/wp-content/uploads/2025/11/ssrn-5425555.pdf
Ju, H., & Aral, S. (2025). Collaborating with AI agents: Field experiments on teamwork, productivity, and performance. https://arxiv.org/abs/2503.18238v3
Massenkoff, M., Lyubich, E., McCrory, P., Appel, R., & Heller, R. (2026). The Anthropic economic index report: Learning curves. Anthropic. https://cdn.sanity.io/files/4zrzovbb/website/4053bf3440c0c85b8852052770c5b4cf882689c3.pdf
Misra, A., Wang, J., McCullers, S., White, K., & Ferres, J. L. (2025). Measuring AI diffusion: A population-normalized metric for tracking global AI usage (arXiv:2511.02781). arXiv. https://doi.org/10.48550/arXiv.2511.02781
Nordhaus, W. (2004). Schumpeterian profits in the American economy: Theory and measurement (No. W10433; p. w10433). National Bureau of Economic Research. https://doi.org/10.3386/w10433
Reimers, I., & Waldfogel, J. (2026). AI and the quantity and quality of creative products: Have LLMs boosted creation of valuable books? (Working Paper No. 34777). National Bureau of Economic Research. https://doi.org/10.3386/w34777
Shao, Y., Zope, H., Jiang, Y., Pei, J., Nguyen, D., Brynjolfsson, E., & Yang, D. (2025). Future of work with AI agents: Auditing automation and augmentation potential across the U.S. workforce. https://arxiv.org/abs/2506.06576v3
Shen, J. H., & Tamkin, A. (2026). How AI impacts skill formation. https://arxiv.org/abs/2601.20245v2
Yotzov, I., Barrero, J. M., Bloom, N., Bunn, P., Davis, S. J., Foster, K. M., Jalca, A., Meyer, B. H., Mizen, P., Navarrete, M. A., Smietanka, P., Thwaites, G., & Wang, B. Z. (2026). Firm data on AI (Working Paper No. 34836). National Bureau of Economic Research. https://doi.org/10.3386/w34836

Chapter 5: Science

AI Publication Analysis
To quantify growth in AI-related research across the natural sciences, the AI Index team conducted a bibliometric analysis using the Web of Science Core Collection (Advanced Search). For each discipline group (chemistry, materials science, physics, astronomy, biology, neuroscience, medicine, geosciences, atmospheric and ocean sciences, and environmental sciences), the team constructed queries that combined AI-related topic search terms (TS = “machine learning,” “deep learning,” “artificial intelligence,” “neural network*,” “foundation model*,” “large language model*,” and “generative AI”) with the corresponding Web of Science subject categories (WC). Results were limited to articles (DT = “Article”). For each query, the team obtained annual publication counts by accessing the full publication year distribution and exporting the table, and ran an aggregate query combining all subject categories to assess overall trends. All searches were performed on March 20, 2026.

Works Cited
Allen, A., Markou, S., Tebbutt, W., Requeima, J., Bruinsma, W. P., Andersson, T. R., Herzog, M., Lane, N. D., Chantry, M., Hosking, J. S., & Turner, R. E. (2025). End-to-end data-driven weather prediction. Nature, 641, 1172–1179. https://doi.org/10.1038/s41586-025-08897-0
Angeloudi, E., Audenaert, J., Bowles, M., Boyd, B. M., Chemaly, D., Cherinka, B., Ciucă, I., Cranmer, M., Do, A., Grayling, M., Hayes, E. E., Hehir, T., Ho, S., Huertas-Company, M., Iyer, K. G., Jablonska, M., Lanusse, F., Leung, H. W., Mandel, K., … Wu, J. F. (n.d.). The multimodal universe: Enabling large-scale machine learning with 100 TB of astronomical scientific data. arXiv. https://doi.org/10.48550/arXiv.2412.02527
Asanjan, A. A., Alexander, O., Berg, T., Peng, S., Makki, J., Zhang, C., Yang, M., Shidham, D., Chakraborty, S., Bender, W., Crawford, C., Ravindran, A., Raiman, O., Potere, D., & Bell, D. (2026). GAIA: A foundation model for operational atmospheric dynamics (arXiv:2505.18179). arXiv. https://doi.org/10.48550/arXiv.2505.18179
Avsec, Ž., Latysheva, N., Cheng, J., Novati, G., Taylor, K. R., Ward, T., Bycroft, C., Nicolaisen, L., Arvaniti, E., Pan, J., Thomas, R., Dutordoir, V., Perino, M., De, S., Karollus, A., Gayoso, A., Sargeant, T., Mottram, A., Wong, L. H., … Kohli, P. (2026). Advancing regulatory variant effect prediction with AlphaGenome. Nature, 649(8099), 1206–1218. https://doi.org/10.1038/s41586-025-10014-0
Batatia, I., Benner, P., Chiang, Y., Elena, A. M., Kovács, D. P., Riebesell, J., Advincula, X. R., Asta, M., Avaylon, M., Baldwin, W. J., Berger, F., Bernstein, N., Bhowmik, A., Bigi, F., Blau, S. M., Cărare, V., Ceriotti, M., Chong, S., Darby, J. P., … Csányi, G. (2025). A foundation model for atomistic materials chemistry. The Journal of Chemical Physics, 163, 184110. https://doi.org/10.1063/5.0297006; https://doi.org/10.48550/arXiv.2401.00096
Bonev, B., Kurth, T., Mahesh, A., Bisson, M., Kossaifi, J., Kashinath, K., Anandkumar, A., Collins, W. D., Pritchard, M. S., & Keller, A. (2025). FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2507.12144
Bragg, J., D’Arcy, M., Balepur, N., Bareket, D., Dalvi, B., Feldman, S., Haddad, D., Hwang, J. D., Jansen, P., Kishore, V., Majumder, B. P., Naik, A., Rahamimov, S., Richardson, K., Singh, A., Surana, H., Tiktinsky, A., Vasu, R., Wiener, G., … Weld, D. S. (2025). AstaBench: Rigorous benchmarking of AI agents with a scientific research suite (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2510.21652
Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., & Schwaller, P. (2023). ChemCrow: Augmenting large-language models with chemistry tools (Version 5). arXiv. https://doi.org/10.48550/ARXIV.2304.05376
Brenowitz, N. D., Ge, T., Subramaniam, A., Manshausen, P., Gupta, A., Hall, D. M., Mardani, M., Vahdat, A., Kashinath, K., & Pritchard, M. S. (2025). Climate in a bottle: Towards a generative foundation model for the kilometer-scale global atmosphere (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2505.06474
Brixi, G., Durrant, M. G., Ku, J., Poli, M., Brockman, G., Chang, D., Gonzalez, G. A., King, S. H., Li, D. B., Merchant, A. T., Naghipourfar, M., Nguyen, E., Ricci-Tam, C., Romero, D. W., Sun, G., Taghibakshi, A., Vorontsov, A., Yang, B., Deng, M., … Hie, B. L. (2025). Genome modeling and design across all domains of life with Evo 2. https://doi.org/10.1101/2025.02.18.638918
Brown, C. F., Kazmierski, M. R., Pasquarella, V. J., Rucklidge, W. J., Samsikova, M., Zhang, C., Shelhamer, E., Lahera, E., Wiles, O., Ilyushchenko, S., Gorelick, N., Zhang, L. L., Alj, S., Schechter, E., Askay, S., Guinan, O., Moore, R., Boukouvalas, A., & Kohli, P. (2025). AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2507.22291
C3S. (2018). ERA5 hourly data on single levels from 1940 to present [Dataset]. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). https://doi.org/10.24381/CDS.ADBB2D47
Callahan, T. J., Beckwith, Z., & Merth, T. (n.d.). Virtual cells as causal world models: A perspective on evaluation. OpenReview.
Chen, Y., Piȩkos, P., Ostaszewski, M., Laakom, F., & Schmidhuber, J. (2025). PhysGym: Benchmarking LLMs in interactive physics discovery with controlled priors (arXiv:2507.15550). arXiv. https://doi.org/10.48550/arXiv.2507.15550
Choudhary, K. (2025). ChatGPT Material Explorer: Design and implementation of a custom GPT assistant for materials science applications. Integrating Materials and Manufacturing Innovation, 14(3), 276–283. https://doi.org/10.1007/s40192-025-00410-9
Chung, D. J. H., Gao, Z., Kvasiuk, Y., Li, T., Münchmeyer, M., Rudolph, M., Sala, F., & Tadepalli, S. C. (2025). Theoretical Physics Benchmark (TPBench)—A dataset and study of AI reasoning capabilities in theoretical physics (arXiv:2502.15815). arXiv. https://doi.org/10.48550/arXiv.2502.15815
Ding, K., Yu, J., Huang, J., Yang, Y., Zhang, Q., & Chen, H. (2025). SciToolAgent: A knowledge-graph-driven scientific agent for multitool integration. Nature Computational Science, 5(10), 962–972. https://doi.org/10.1038/s43588-025-00849-y
Dorkenwald, S., Matsliah, A., Sterling, A. R., Schlegel, P., Yu, S., McKellar, C. E., Lin, A., Costa, M., Eichler, K., Yin, Y., Silversmith, W., Schneider-Mizell, C., Jordan, C. S., Brittain, D., Halageri, A., Kuehner, K., Ogedengbe, O., Morey, R., Gager, J., … Zandawala, M. (2024). Neuronal wiring diagram of an adult brain. Nature, 634(8032), 124–138. https://doi.org/10.1038/s41586-024-07558-y
Elhamod, M., Khurana, M., Manogaran, H. B., Uyeda, J. C., Balk, M. A., Dahdul, W., Bakış, Y., Bart, H. L., Mabee, P. M., Lapp, H., Balhoff, J. P., Charpentier, C., Carlyn, D., Chao, W.-L., Stewart, C. V., Rubenstein, D. I., Berger-Wolf, T., & Karpatne, A. (2023). Discovering novel biological traits from images using phylogeny-guided neural networks (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2306.03228
Fahsbender, E., Andersson, A., Ash, J., Binder, P., Burkhardt, D., Chang, B., Gerber, G. K., Gitter, A., Godau, P., Gupta, A., Haliburton, G., He, S., Ideker, T., Jelic, I., Khan, A., Kim, Y.-J., Krishnapriyan, A., Laurent, J. M., Liu, T., … Kalantar, K. (2025). Benchmarking and evaluation of AI models in biology: Outcomes and recommendations from the CZI Virtual Cells Workshop (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2507.10502
Ghafarollahi, A., & Buehler, M. J. (2024). ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2402.04268
Gottweis, J., Weng, W.-H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., Saab, K., Popovici, D., Blum, J., Zhang, F., Chou, K., Hassidim, A., Gokturk, B., Vahdat, A., Kohli, P., … Natarajan, V. (2025). Towards an AI co-scientist (arXiv:2502.18864). arXiv. https://doi.org/10.48550/arXiv.2502.18864
Gu, J., Stevens, S., Campolongo, E. G., Thompson, M. J., Zhang, N., Wu, J., Kopanev, A., Mai, Z., White, A. E., Balhoff, J., Dahdul, W., Rubenstein, D., Lapp, H., Berger-Wolf, T., Chao, W.-L., & Su, Y. (2025). BioCLIP 2: Emergent properties from scaling hierarchical contrastive learning (arXiv:2505.23883). arXiv. https://doi.org/10.48550/arXiv.2505.23883
Guo, Z., Wang, J., Ling, F., Wei, W., Yue, X., Jiang, Z., Xu, W., Luo, J.-J., Cheng, L., Ham, Y.-G., Song, F., Gentine, P., Yamagata, T., Fei, B., Zhang, W., Gu, X., Li, C., Wang, Y., Chen, T., … Bai, L. (2025). A self-evolving AI agent system for climate science (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2507.17311
Heiskanen, J., Brümmer, C., Buchmann, N., Calfapietra, C., Chen, H., Gielen, B., Gkritzalis, T., Hammer, S., Hartman, S., Herbst, M., Janssens, I. A., Jordan, A., Juurola, E., Karstens, U., Kasurinen, V., Kruijt, B., Lankreijer, H., Levin, I., Linderson, M.-L., … Kutsch, W. (2022). The integrated carbon observation system in Europe. Bulletin of the American Meteorological Society, 103(3), E855–E872. https://doi.org/10.1175/BAMS-D-19-0364.1
Herzog, H., Bastani, F., Zhang, Y., Tseng, G., Redmon, J., Sablon, H., Park, R., Morrison, J., Buraczynski, A., Farley, K., Hansen, J., Howe, A., Johnson, P. A., Otterlee, M., Schmitt, T., Pitelka, H., Daspit, S., Ratner, R., Wilhelm, C., … Beukema, P. (2025). OlmoEarth: Stable latent image modeling for multimodal earth observation (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2511.13655
Holzschuh, B., Liu, Q., Kohl, G., & Thuerey, N. (2025). PDE-Transformer: Efficient and versatile transformers for physics simulations (arXiv:2505.24717). arXiv. https://doi.org/10.48550/arXiv.2505.24717
Huang, K., Zhang, S., Wang, H., Qu, Y., Lu, Y., Roohani, Y., Li, R., Qiu, L., Li, G., Zhang, J., Yin, D., Marwaha, S., Carter, J. N., Zhou, X., Wheeler, M., Bernstein, J. A., Wang, M., He, P., Zhou, J., … Leskovec, J. (2025). Biomni: A general-purpose biomedical AI agent. https://doi.org/10.1101/2025.05.30.656746
Inizan, T. J., Yang, S., Kaplan, A., Lin, Y., Yin, J., Mirzaei, S., Abdelgaid, M., Alawadhi, A. H., Cho, K., Zheng, Z., Cubuk, E. D., Borgs, C., Chayes, J. T., Persson, K. A., & Yaghi, O. M. (2025). System of agentic AI for the discovery of metal-organic frameworks (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2504.14110
Jakubik, J., Yang, F., Blumenstiel, B., Scheurer, E., Sedona, R., Maurogiovanni, S., Bosmans, J., Dionelis, N., Marsocci, V., Kopp, N., Ramachandran, R., Fraccaro, P., Brunschwiler, T., Cavallaro, G., Bernabe-Moreno, J., & Longépé, N. (2025). TerraMind: Large-scale generative multimodality for Earth observation (arXiv:2504.11171). arXiv. https://doi.org/10.48550/arXiv.2504.11171
Jin, R., Guo, Y., Qu, Y., Yang, M., Shang, C., Yang, Q., Chao, L., Zhou, Y., Xu, R., Xu, Z., Zhou, R., Zhang, Z., Wang, M., Zhang, X., & Cong, L. (2025). BioLab: End-to-end autonomous life sciences research with multi-agents system integrating biological foundation models. https://doi.org/10.1101/2025.09.03.674085
Jin, Z., Liu, L., Yang, Q., Jia, X., Tao, S., Guo, Y., Ghosh, R., Wang, S., Zhu, Q., Jung, M., Guan, K., Kumar, V., Reichstein, M., Fang, J., & Luo, Y. (2026). Knowledge‐guided machine learning for global change ecology research. Global Change Biology, 32(2), e70742.
https://doi.org/10.1111/gcb.70742 Joseph, S. A., Husain, S. M., Offner, S. S. R., Juneau, S., Torrey, P., Bolton, A. S., Farias, J. P., Gaffney, N., Durrett, G., & Li, J. J. (2025). AstroVisBench: A code benchmark for scientific computing and visualization in astronomy (arXiv:2505.20538). arXiv. https://doi. org/10.48550/arXiv.2505.20538 Jung, H., Amini, M., Hunt, B. J., Murphy, E. I., Sadil, P., Halchenko, Y. O., Petre, B., Miao, Z., Kragel, P. A., Han, X., Heilicher, M. O., Sun, M., Collins, O. G., Lindquist, M. A., & Wager, T. D. (2025). Spacetop: A multimodal fMRI dataset unifying naturalistic processes with a rich array of experimental tasks. Scientific Data, 12(1), 1465. https://doi.org/10.1038/s41597-025-05154-x Kao, C. H., Zhao, W., Revankar, S., Speas, S., Bhagat, S., Datta, R., Phoo, C. P., Mall, U., Vondrick, C., Bala, K., & Hariharan, B. (2025). Towards LLM agents for Earth observation (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2504.12110 Kazemian, A., Elmoznino, E., & Bonner, M. F. (2025). Convolutional architectures are cortex-aligned de novo. Nature Machine Intelligence, 7(11), 1834–1844. https://doi.org/10.1038/s42256-025-01142-3 Khan, R., & Sharma, P. (2025). AI-enabled smart irrigation for climate-resilient agriculture. SHS Web of Conferences, 216, 01005. https://doi.org/10.1051/shsconf/202521601005 Kim, H., Li, C., Deng, W., Jin, M., Huang, W., Lu, M., & Yuan, B. (2025). CLIMATEAGENT: Multi-agent orchestration for complex climate data science workflows (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2511.20109 Koblischke, N., Jang, H., Menou, K., & Ali-Dib, M. (2025). Gravity-Bench-v1: A benchmark on gravitational physics discovery for agents (arXiv:2501.18411). arXiv. https://doi.org/10.48550/arXiv.2501.18411 Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., & Nearing, G. (2019). Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. 
Hydrology and Earth System Sciences, 23(12), 5089– 5110. https://doi.org/10.5194/hess-23-5089-2019 Levine, D. S., Shuaibi, M., Spotte-Smith, E. W. C., Taylor, M. G., Hasyim, M. R., Michel, K., Batatia, I., Csányi, G., Dzamba, M., Eastman, P., Frey, N. C., Fu, X., Gharakhanyan, V., Krishnapriyan, A. S., Rackers, J. A., Raja, S., Rizvi, A., Rosen, A. S., Ulissi, Z., … Wood, B. M. (2026). The Open Molecules 2025 (OMol25) dataset, evaluations, and models (arXiv:2505.08762). arXiv. https://doi. org/10.48550/arXiv.2505.08762 Lin, Y., Tang, S., Lyu, B., Wu, J., Lin, H., Yang, K., Li, J., Xia, M., Chen, D., Arora, S., & Jin, C. (2025). Goedel-Prover: A frontier model for open-source automated theorem proving (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2502.07640 Linsley, D., Feng, P., & Serre, T. (2025). Better artificial intelligence does not mean better models of biology. Trends in Cognitive Sciences, S1364661325003493. https://doi.org/10.1016/j.tics.2025.11.016 Liu, L., Zhou, W., Guan, K., Peng, B., Xu, S., Tang, J., Zhu, Q., Till, J., Jia, X., Jiang, C., Wang, S., Qin, Z., Kong, H., Grant, R., Mezbahuddin, S., Kumar, V., & Jin, Z. (2024). Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems. Nature Communications, 15(1), 357. https://doi.org/10.1038/s41467-023-43860-5 Luo, E., Jia, J., Xiong, Y., Li, X., Guo, X., Yu, B., Hao, M., Wei, L., & Zhang, X. (2025). Benchmarking AI scientists for omics data driven biological discovery (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2505.08341 Marin-Llobet, A., Lin, Z., Baek, J., Aljovic, A., Zhang, X., Lee, A. J., Wang, W., Lee, J., Shen, H., He, Y., Li, N., & Liu, J. (2025). An AI agent for cell-type specific brain computer interfaces. https://doi.org/10.1101/2025.09.11.675660 McCabe, M., Mukhopadhyay, P., Marwah, T., Blancard, B. R.-S., Rozet, F., Diaconu, C., Meyer, L., Wong, K. W. 
K., Sotoudeh, H., Bietti, A., Espejo, I., Fear, R., Golkar, S., Hehir, T., Hirashima, K., Krawezik, G., Lanusse, F., Morel, R., Ohana, R., … Ho, S. (2025). Walrus: A cross-domain foundation model for continuum dynamics (arXiv:2511.15684). arXiv. https://doi.org/10.48550/arXiv.2511.15684 Mehandru, N., Hall, A. K., Melnichenko, O., Dubinina, Y., Tsirulnikov, D., Bamman, D., Alaa, A., Saponas, S., & Malladi, V. S. (2025). BioAgents: Bridging the gap in bioinformatics analysis with multi-agent systems. Scientific Reports, 15(1), 39036. https://doi. org/10.1038/s41598-025-25919-z Miller, H. E., Greenig, M., Tenmann, B., & Wang, B. (2025). BioML-bench: Evaluation of AI agents for end-to-end biomedical ML. https://doi.org/10.1101/2025.09.01.673319 Miller, T., Michoński, G., Durlik, I., Kozlovska, P., & Biczak, P. (2025). Artificial intelligence in aquatic biodiversity research: A PRISMAbased systematic review. Biology, 14(5), 520. https://doi.org/10.3390/biology14050520 Mirza, A., Alampara, N., Kunchapu, S., Ríos-García, M., Emoekabu, B., Krishnan, A., Gupta, T., Schilling-Wilhelmi, M., Okereke, M., Aneesh, A., Asgari, M., Eberhardt, J., Elahi, A. M., Elbeheiry, H. M., Gil, M. V., Glaubitz, C., Greiner, M., Holick, C. T., Hoffmann, T., … Jablonka, K. M. (2025). A framework for evaluating the chemical knowledge and reasoning abilities of large language models against A P P E N D I X | A I I N D E X R E P O R T 2026 408 the expertise of chemists. Nature Chemistry, 17(7), 1027–1034. https://doi.org/10.1038/s41557-025-01815-x Mirza, A., Alampara, N., Ríos-García, M., Abdelalim, M., Butler, J., Connolly, B., Dogan, T., Nezhurina, M., Şen, B., Tirunagari, S., Worrall, M., Young, A., Schwaller, P., Pieler, M., & Jablonka, K. M. (2025). ChemPile: A 250GB diverse and curated dataset for chemical foundation models (arXiv:2505.12534). arXiv. https://doi.org/10.48550/arXiv.2505.12534 Mitchener, L., Laurent, J. M., Andonian, A., Tenmann, B., Narayanan, S., Wellawatte, G. 
P., White, A., Sani, L., & Rodriques, S. G. (2025). BixBench: A comprehensive benchmark for LLM-based agents in computational biology (arXiv:2503.00096). arXiv. https://doi.org/10.48550/ arXiv.2503.00096 Mitchener, L., Yiu, A., Chang, B., Bourdenx, M., Nadolski, T., Sulovari, A., Landsness, E. C., Barabasi, D. L., Narayanan, S., Evans, N., Reddy, S., Foiani, M., Kamal, A., Shriver, L. P., Cao, F., Wassie, A. T., Laurent, J. M., Melville-Green, E., Caldas, M., … White, A. D. (2025). Kosmos: An AI scientist for autonomous discovery (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2511.02824 Nduma, R., Park, H., & Walsh, A. (2025). Crystalyse: A multi-tool agent for materials design. Materials Science (arXiv:2512.00977). arXiv. https:// doi.org/10.48550/arXiv.2512.00977 Nguyen, T., Koneru, A., Li, S., & Grover, A. (2025). PhysiX: A foundation model for physics simulations (arXiv:2506.17774). arXiv. https://doi. org/10.48550/arXiv.2506.17774 Novick, K. A., Biederman, J. A., Desai, A. R., Litvak, M. E., Moore, D. J. P., Scott, R. L., & Torn, M. S. (2018). The AmeriFlux network: A coalition of the willing. Agricultural and Forest Meteorology, 249, 444–456. https://doi.org/10.1016/j.agrformet.2017.10.009 Pantiukhin, D., Shapkin, B., Kuznetsov, I., Jost, A. A., & Koldunov, N. (2025). Accelerating earth science discovery via multi-agent LLM systems (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2503.05854 Parker, L., Lanusse, F., Shen, J., Liu, O., Hehir, T., Sarra, L., Meyer, L., Bowles, M., Wagner-Carena, S., Qu, H., Golkar, S., Bietti, A., Bourfoune, H., Casserau, N., Cornette, P., Hirashima, K., Krawezik, G., Ohana, R., Lourie, N., … Ho, S. (2025). AION-1: Omnimodal foundation model for astronomical sciences (arXiv:2510.17960). arXiv. 
https://doi.org/10.48550/arXiv.2510.17960 Pastorello, G., Trotta, C., Canfora, E., Chu, H., Christianson, D., Cheah, Y.-W., Poindexter, C., Chen, J., Elbashandy, A., Humphrey, M., Isaac, P., Polidori, D., Reichstein, M., Ribeca, A., Van Ingen, C., Vuichard, N., Zhang, L., Amiro, B., Ammann, C., … Papale, D. (2020). The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data. Scientific Data, 7(1), 225. https://doi.org/10.1038/s41597-020-0534-3 Qiu, J., Shi, J., Juan, X., Zhao, Z., Geng, J., Liu, S., Wang, H., Wu, S., & Wang, M. (2025). Physics supernova: AI agent matches elite gold medalists at IPhO 2025 (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2509.01659 Qiu, S., Guo, S., Song, Z.-Y., Sun, Y., Cai, Z., Wei, J., Luo, T., Yin, Y., Zhang, H., Hu, Y., Wang, C., Tang, C., Chang, H., Liu, Q., Zhou, Z., Zhang, T., Zhang, J., Liu, Z., Li, M., … Zhu, H. X. (2025). PHYBench: Holistic evaluation of physical perception and reasoning in large language models (arXiv:2504.16074). arXiv. https://doi.org/10.48550/arXiv.2504.16074 Queen, O., Zhang, H. G., & Zou, J. (2025). CGBench: Benchmarking language model scientific reasoning for clinical genetics research (arXiv:2510.11985). arXiv. https://doi.org/10.48550/arXiv.2510.11985 Riebesell, J., Goodall, R. E. A., Benner, P., Chiang, Y., Deng, B., Ceder, G., Asta, M., Lee, A. A., Jain, A., & Persson, K. A. (2025). A framework to evaluate machine learning crystal stability predictions. Nature Machine Intelligence, 7(6), 836–847. https://doi.org/10.1038/s42256-025-01055-1 Roy, S., Schmude, J., Lal, R., Gaur, V., Freitag, M., Kuehnert, J., Kessel, T. van, Hegde, D. V., Muñoz-Jaramillo, A., Jakubik, J., Vos, E., Mandal, K., Asanjan, A. A., Almeida, J. L. de S., Lin, A., Singh, T., Yang, K., Pandey, C., Hong, J., … Ramachandran, R. (2025). Surya: Foundation model for heliophysics (arXiv:2508.14112). arXiv. https://doi.org/10.48550/arXiv.2508.14112 Sahoo, S. J., Maraschin, M., Levine, D. 
S., Ulissi, Z., Zitnick, C. L., Varley, J. B., Gauthier, J. A., Govindarajan, N., & Shuaibi, M. (2025). The Open Catalyst 2025 (OC25) dataset and models for solid-liquid interfaces (arXiv:2509.17862). arXiv. https://doi.org/10.48550/arXiv.2509.17862 Schneider, M., Canzano, J., Peng, J., Hou, Y., Smith, S. L., & Beyeler, M. (2025). Mouse vs. AI: A neuroethological benchmark for visual robustness and neural alignment (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2509.14446 Shojaee, P., Nguyen, N.-H., Meidani, K., Farimani, A. B., Doan, K. D., & Reddy, C. K. (2025). LLM-SRBench: A new benchmark for scientific equation discovery with large language models (arXiv:2504.10415). arXiv. https://doi.org/10.48550/arXiv.2504.10415 Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., & Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. Communications Chemistry, 8(1), 193. https://doi.org/10.1038/s42004-025-01585-0 Song, T., Luo, M., Zhang, X., Chen, L., Huang, Y., Cao, J., Zhu, Q., Liu, D., Zhang, B., Zou, G., Zhang, G., Zhang, F., Shang, W., Fu, Y., Jiang, J., & Luo, Y. (2025). A multiagent-driven robotic AI chemist enabling autonomous chemical research on demand. Journal of the American Chemical Society, 147(15), 12534–12545. https://doi.org/10.1021/jacs.4c17738 Sun, R., Qian, L., Li, Y., Cheng, H., Xue, Z., Zhang, X., Tan, L., Zhan, Y., Hu, W., Xiao, Q., Liu, Z., Zhang, G., Weinan, E., Zhou, P., Wen, H., Zhu, A P P E N D I X | A I I N D E X R E P O R T 2026 409 Y., & Guo, T. (2025). A perturbation proteomics-based foundation model for virtual cell construction. https://doi.org/10.1101/2025.02.07.637070 The evolution of AI in researchExplanAItions: Key findings. (2025). Wiley. https://www.wiley.com/en-de/about-us/ai-resources/ai-study/keyfindings/ The MICrONS Consortium, Bae, J. A., Baptiste, M., Baptiste, M. R., Bishop, C. A., Bodor, A. L., Brittain, D., Brooks, V., Buchanan, J., Bumbarger, D. 
J., Castro, M. A., Celii, B., Cobos, E., Collman, F., Da Costa, N. M., Danskin, B., Dorkenwald, S., Elabbady, L., Fahey, P. G., … Zhang, C. (2025). Functional connectomics spanning multiple areas of mouse visual cortex. Nature, 640(8058), 435–447. https://doi.org/10.1038/s41586025-08790-w Ueyama, M., Takao, Y., Yazawa, H., Tanaka, M., Yabuki, H., Kumagai, T., Iwata, H., Awal, Md. A., Du, M., Harazono, Y., Hata, Y., Hirano, T., Hiura, T., Ide, R., Ishida, S., Ishikawa, M., Kitamura, K., Kominami, Y., Komiya, S., … Ichii, K. (2025). The JapanFlux2024 dataset for eddy covariance observations covering Japan and East Asia from 1990 to 2023. Earth System Science Data, 17(8), 3807–3833. https://doi.org/10.5194/ essd-17-3807-2025 Vepreva, A., Razlivina, J., Eremeeva, M., Gubina, N., Orlova, A., Dmitrenko, A., Kapranova, K., Jyakhwo, S., Vasilev, N., Sarkisyan, A., Chernyshov, I. Y., Vinogradov, V., & Dmitrenko, A. (2025). Benchmarking agentic systems in automated scientific information extraction with ChemX (arXiv:2510.00795). arXiv. https://doi.org/10.48550/arXiv.2510.00795 Wang, C., Zhang, Y., Li, Y., Hu, X., Mao, Y., Chen, X., Du, P., Wang, R., Wu, Y., Yang, H., Li, Y., Wang, B., Mu, H., Chen, X., He, S., Mo, H., Zhang, L., Du, L., Zhao, Y., … Liu, J. (2025). StarWhisper Telescope: An AI framework for automating end-to-end astronomical observations. Communications Engineering, 4(1), 184. https://doi.org/10.1038/s44172-025-00520-4 Wang, D., Cheng, M., Yu, S., Liu, Z., Guo, Z., Li, X., & Liu, Q. (2025). PaperArena: An evaluation benchmark for tool-augmented agentic reasoning on scientific literature (Version 4). arXiv. https://doi.org/10.48550/ARXIV.2510.10909 Wang, E. Y., Fahey, P. G., Ding, Z., Papadopoulos, S., Ponder, K., Weis, M. A., Chang, A., Muhammad, T., Patel, S., Ding, Z., Tran, D., Fu, J., Schneider-Mizell, C. M., MICrONS Consortium, Da Costa, N. M., Reid, R. C., Collman, F., Da Costa, N. M., Franke, K., … Tolias, A. S. (2025). 
Foundation model of neural activity predicts response to new stimulus types. Nature, 640(8058), 470–477. https://doi.org/10.1038/s41586-02508829-y Wiesner, F., Wessling, M., & Baek, S. (2026). Towards a physics foundation model (arXiv:2509.13805). arXiv. https://doi.org/10.48550/ arXiv.2509.13805 Wu, M., Wang, Y., Ming, Y., An, Y., Wan, Y., Chen, W., Lin, B., Li, Y., Xie, T., & Zhou, D. (2025). CheMatAgent: Enhancing LLMs for chemistry and materials science through tree-search based tool learning (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2506.07551 Xu, W., Zhao, X., Zhou, Y., Yue, X., Fei, B., Ling, F., Zhang, W., & Bai, L. (2025). EarthSE: A benchmark for evaluating Earth scientific exploration capability of LLMs (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2505.17139 Yamada, Y., Lange, R. T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., & Ha, D. (2025). The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2504.08066 Ye, C., Yuan, S., Cooray, S., Dillmann, S., Roque, I. L. V., Baron, D., Frank, P., Martin-Alvarez, S., Koblischke, N., Qu, F. J., Yang, D., Wechsler, R., & Ciuca, I. (2025). ReplicationBench: Can AI agents replicate astrophysics research papers? (arXiv:2510.24591). arXiv. https://doi.org/10.48550/ arXiv.2510.24591 Zeng, Y., Xie, J., Shangguan, N., Wei, Z., Li, W., Su, Y., Yang, S., Zhang, C., Zhang, J., Fang, N., Zhang, H., Lu, Y., Zhao, H., Fan, J., Yu, W., & Yang, Y. (2025). CellFM: A large-scale foundation model pre-trained on transcriptomics of 100 million human cells. Nature Communications, 16(1), 4679. https://doi.org/10.1038/s41467-025-59926-5 Zeni, C., Pinsler, R., Zügner, D., Fowler, A., Horton, M., Fu, X., Wang, Z., Shysheya, A., Crabbé, J., Ueda, S., Sordillo, R., Sun, L., Smith, J., Nguyen, B., Schulz, H., Lewis, S., Huang, C.-W., Lu, Z., Zhou, Y., … Xie, T. (2025). A generative model for inorganic materials design. 
Chapter 6: Medicine

The Central Dogma

Number of AI driven protein research publications

The following PubMed queries were used to retrieve total annual paper counts for 2024 and 2025. Queries below are shown for 2025.

All Papers: (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network*”[tiab] OR “generative AI”[tiab] OR “language model*”[tiab] OR “AI”[tiab]) AND 2025[dp].
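Because the queries are shown for 2025 only, the 2024 counts presumably come from the same term with the year filter swapped. The sketch below illustrates that for the All Papers query. It is a minimal Python example under stated assumptions: straight ASCII quotes replace the typeset curly quotes, counts are fetched through NCBI's E-utilities `esearch` endpoint (the report does not say which interface was used), and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# AI-related terms copied from the "All Papers" query in this appendix.
MESH_TERMS = ["Artificial Intelligence", "Machine Learning", "Deep Learning"]
TIAB_TERMS = [
    "artificial intelligence", "machine learning", "deep learning",
    "neural network*", "generative AI", "language model*", "AI",
]

# Public NCBI E-utilities search endpoint (assumed retrieval route).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def all_papers_query(year: int) -> str:
    """Build the "All Papers" query string for one publication year."""
    mesh = [f'"{term}"[Mesh]' for term in MESH_TERMS]
    tiab = [f'"{term}"[tiab]' for term in TIAB_TERMS]
    joined = " OR ".join(mesh + tiab)
    return f"({joined}) AND {year}[dp]"


def esearch_url(query: str) -> str:
    """Return an esearch URL that asks only for the hit count (retmax=0)."""
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": "0"}
    return f"{ESEARCH_URL}?{urllib.parse.urlencode(params)}"


def fetch_count(query: str) -> int:
    """Fetch the number of matching PubMed records (needs network access)."""
    with urllib.request.urlopen(esearch_url(query), timeout=30) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])
```

Calling `fetch_count(all_papers_query(2024))` and `fetch_count(all_papers_query(2025))` would give the two annual totals as of the time of the call. PubMed indexing changes over time, so the numbers need not match those in the report.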
Function Prediction: (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network*”[tiab] OR “AI”[tiab]) AND (“protein function”[tiab] OR “function prediction”[tiab] OR “functional annotation”[tiab] OR “protein annotation”[tiab]) AND 2025[dp].

Protein Structure Prediction: (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network*”[tiab] OR “AI”[tiab]) AND (“protein structure”[tiab] OR “structure prediction”[tiab] OR “protein folding”[tiab] OR “AlphaFold”[tiab] OR “ESMFold”[tiab] OR “RoseTTAFold”[tiab]) AND 2025[dp].

Protein Drug Interactions: (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network*”[tiab] OR “AI”[tiab]) AND (“protein-drug”[tiab] OR “drug-target”[tiab] OR “binding affinity”[tiab] OR “molecular docking”[tiab] OR “protein-ligand”[tiab] OR “target interaction*”[tiab]) AND 2025[dp].

Synthetic Protein Design: (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network*”[tiab] OR “AI”[tiab]) AND (“synthetic protein”[tiab] OR “protein design”[tiab] OR “de novo protein”[tiab] OR “de novo design”[tiab] OR “protein engineering”[tiab]) AND 2025[dp].

Number of publications on AI for drug discovery

The AI Index team ran the search query below on PubMed, limiting results to articles published from 2018 on.
Search query: (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Deep Learning”[Mesh] OR “Neural Networks, Computer”[Mesh] OR “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network*”[tiab] OR “generative AI”[tiab] OR “large language model*”[tiab] OR “LLM*”[tiab] OR “AI”[tiab]) AND (“Drug Discovery”[Mesh] OR “Drug Design”[Mesh] OR “drug discovery”[tiab] OR “drug design”[tiab] OR “drug development”[tiab] OR “computer-aided drug design”[tiab] OR “CADD”[tiab] OR “de novo drug design”[tiab] OR “target identification”[tiab] OR “lead optimization”[tiab] OR “structure-based drug design”[tiab])

Number of publications on virtual cell models

The AI Index team ran the search query below on PubMed.

Search query: “virtual cell”[tiab] OR “virtual cells”[tiab]

FDA-Approved AI Medical Devices

Data on FDA-approved AI medical devices was sourced from the FDA website, which tracks artificial intelligence and machine learning (AI/ML)–enabled medical devices.

Patient Perspectives on AI in Health Care

A comprehensive literature search was conducted in PubMed/MEDLINE, Embase, and Scopus to identify empirical studies published between January 1, 2020, and December 31, 2025, that examined patient, family, or public perspectives on the use of artificial intelligence (AI) in healthcare. The search strategy was structured around five core concepts: (1) clinical artificial intelligence (including artificial intelligence, machine learning, deep learning, neural networks, and large language models such as ChatGPT/GPT); (2) explicit patient, family, or public perspectives; (3) healthcare and clinical contexts; (4) AI use or implementation in care; and (5) empirical data collection methods.

Searches required explicit mention of AI technologies in the title or abstract and included only studies reporting primary data collected through qualitative, quantitative, or mixed-methods designs (e.g., surveys, interviews, focus groups, cross-sectional or observational studies). To maintain focus on AI as the object of perception rather than a background analytic tool, studies centered on telehealth were excluded. Systematic reviews, scoping reviews, narrative reviews, and other literature-only reviews were also excluded at the search level using title and abstract text exclusions. Search strategies were adapted to the indexing and field structures of each database, including the use of MeSH terms in PubMed and Emtree terms in Embase, with database-specific refinements applied to ensure consistent conceptual alignment.

All retrieved records were subsequently screened at the title/abstract and full-text levels to confirm eligibility. Four reviewers tagged the included papers by clinical specialty and geographic representation. To facilitate visualization, the tags were restructured as binary variables using Claude (Sonnet 4.6), and the outputs were manually reviewed for accuracy. Mock figures were generated using Claude (Sonnet 4.6).

Search strategy:

Concept A: Clinical Artificial Intelligence
AI technologies likely to be encountered/used in care

• Artificial intelligence

• Machine learning

• Deep learning

• Neural networks

• Generative AI

• Large language models (LLMs); ChatGPT; GPT

• Clinical AI

• Predictive models

• Clinical decision support

Concept B: Patient, Family, and Public Perspectives
Explicitly stated perspectives only

• Patient perspective*

• Patient perception*

• Patient attitude*

• Public perception*

• Public attitude*

• Family perspective*

• Caregiver perspective*

Concept C: Healthcare Context
Anchors retrieval to clinical/medical settings

• Delivery of healthcare / healthcare

• Healthcare

• Medicine

• Clinical care

• Medical decision-making

Concept D: AI Use/Implementation in Care
Ensures AI is the object of perception as used in practice

• use/used

• implementation

• deploy*

• application*

• integration

• interaction*

Concept E: Empirical Study Design/Data Collection
Privileges studies reporting primary data

• survey*

• questionnaire*

• interview*

• focus group*

• qualitative

• quantitative

• mixed methods

• cross-sectional

• observational

Exclusions
Prevents drift and removes review-only literature

• telehealth

• telemedicine

• mobile health/mhealth

• systematic review

• scoping review

• narrative review

• literature review

• review of the literature

Core Limits (applied across databases)

• Publication years: 2020–2025

• Document types: empirical articles only (reviews excluded via search-string exclusions)

• Language: start with English (non-English may be noted descriptively if needed)

Search strings implemented on Feb. 9, 2026

PubMed (174 articles): ((((“Artificial Intelligence”[MeSH Terms] OR “Machine Learning”[MeSH Terms] OR “Deep Learning”[MeSH Terms] OR “neural networks, computer”[MeSH Terms] OR “Artificial Intelligence”[Title/Abstract] OR “Machine Learning”[Title/Abstract] OR “Deep Learning”[Title/Abstract] OR “neural network*”[Title/Abstract] OR “generative ai”[Title/Abstract] OR “large language model*”[Title/Abstract] OR “chatgpt*”[Title/Abstract] OR “gpt”[Title/Abstract] OR “clinical ai”[Title/Abstract] OR “predictive model*”[Title/Abstract] OR “clinical decision*”[Title/Abstract] OR “decision support”[Title/Abstract]) AND (“use”[Title/Abstract] OR “used”[Title/Abstract] OR “implementation”[Title/Abstract] OR “deploy*”[Title/Abstract] OR “application*”[Title/Abstract] OR “integration”[Title/Abstract] OR “interaction*”[Title/Abstract]) AND (“patient perspective*”[Title/Abstract] OR “patient perception*”[Title/Abstract] OR “patient attitude*”[Title/Abstract] OR “public perception*”[Title/Abstract] OR “public attitude*”[Title/Abstract] OR “family perspective*”[Title/Abstract] OR “caregiver perspective*”[Title/Abstract]) AND (“Delivery of Health Care”[MeSH Terms] OR “healthcare”[Title/Abstract] OR “health care”[Title/Abstract] OR “medicine”[Title/Abstract] OR “clinical care”[Title/Abstract] OR “medical decision-making”[Title/Abstract]) AND (“survey*”[Title/Abstract] OR “questionnaire*”[Title/Abstract] OR “interview*”[Title/Abstract] OR “focus group*”[Title/Abstract] OR “qualitative”[Title/Abstract] OR “quantitative”[Title/Abstract] OR “mixed methods”[Title/Abstract] OR “cross-sectional”[Title/Abstract] OR “observational”[Title/Abstract])) NOT (“telehealth”[Title/Abstract] OR “telemedicine”[Title/Abstract] OR “mobile health”[Title/Abstract] OR “mhealth”[Title/Abstract])) NOT (“systematic review”[Title/Abstract] OR “scoping review”[Title/Abstract] OR “narrative review”[Title/Abstract] OR “literature review”[Title/Abstract] OR “review of the literature”[Title/Abstract])) AND 2020/01/01:2025/12/31[Date - Publication]

Embase (239 articles retrieved): (‘artificial intelligence’/exp OR ‘machine learning’/exp OR ‘deep learning’/exp OR ‘neural networks’/exp OR ‘artificial intelligence’:ti,ab OR ‘machine learning’:ti,ab OR ‘deep learning’:ti,ab OR ‘neural network*’:ti,ab OR ‘generative ai’:ti,ab OR ‘large language model*’:ti,ab OR chatgpt*:ti,ab OR gpt:ti,ab) AND ((((((patient AND perspective*:ti,ab OR patient) AND perception*:ti,ab OR patient) AND attitude*:ti,ab OR public) AND perception*:ti,ab OR public) AND attitude*:ti,ab OR family) AND perspective*:ti,ab OR caregiver) AND perspective*:ti,ab AND (‘health care’/exp OR healthcare:ti,ab OR ‘health care’:ti,ab OR medicine:ti,ab OR ‘clinical care’:ti,ab OR ‘medical decision making’/exp) AND (survey*:ti,ab OR questionnaire*:ti,ab OR interview*:ti,ab OR ‘focus group*’:ti,ab OR qualitative:ti,ab OR quantitative:ti,ab OR ‘mixed methods’:ti,ab OR ‘cross-sectional’:ti,ab OR observational:ti,ab) AND [2020-2025]/py NOT (telehealth:ti,ab OR telemedicine:ti,ab OR ‘mobile health’:ti,ab OR mhealth:ti,ab) NOT (‘systematic review’:ti,ab OR ‘scoping review’:ti,ab OR ‘narrative review’:ti,ab OR ‘literature review’:ti,ab OR ‘review of the literature’:ti,ab)

Scopus (368 articles retrieved): ( TITLE-ABS-KEY ( “artificial intelligence” OR “machine learning” OR “deep learning” OR “neural network*” OR “generative ai” OR “large language model*” OR chatgpt OR gpt ) AND TITLE-ABS-KEY ( “patient perspective*” OR “patient perception*” OR “patient attitude*” OR “public perception*” OR “public attitude*” OR “family perspective*” OR “caregiver perspective*” ) AND TITLE-ABS-KEY ( “health care” OR healthcare OR medicine OR “clinical care” OR “medical decision-making” ) AND TITLE-ABS-KEY ( survey* OR questionnaire* OR interview* OR “focus group*” OR qualitative OR quantitative OR “mixed methods” OR “cross-sectional” OR observational ) ) AND NOT TITLE-ABS-KEY ( telehealth OR telemedicine OR “mobile health” OR mhealth ) AND NOT TITLE-ABS-KEY ( “systematic review” OR “scoping review” OR “narrative review” OR “literature review” OR “review of the literature” ) AND PUBYEAR > 2019 AND PUBYEAR < 2026

Ethical Considerations

A bibliometric analysis assessed the medical AI and ethics literature, with a focus on data sharing, algorithm sharing, biosecurity, and global health. The AI Index searched PubMed Central for publications from January 2021 through December 2025 using (1) a primary query to identify medical AI articles that discuss ethics and, within that set, additional filters to identify articles that also address (2) data sharing, (3) algorithm sharing, (4) biosecurity, and (5) global health. For the resulting articles, the AI Index mapped the ethics topics covered (see taxonomy below) and grouped them into algorithmic, governance, or societal concerns.

The AI Index used the following taxonomy for the ethics publications:

Algorithmic: Bias, Fairness, Explainability, Misuse, Dual Use
Governance: Accountability, Transparency, Independent Review, Oversight, Privacy, Trustworthy
Societal: Equity, Inequity, Racism, Accessibility, Autonomy, Justice, Community Engagement
Patient insights into empathy, compassion and self-disclosure in medical large language models: Results from the IPALLM III study. World Journal of Urology, 43(1), 492. https://doi.org/10.1007/s00345-025-05872-2 Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., & Jitsev, J. (2023). Reproducible scaling laws for contrastive language-image learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2829. https://doi.org/10.1109/CVPR52729.2023.00276 Davis, V. H., Qiang, J. R., Adekoya MacCarthy, I., Howse, D., Seshie, A. Z., Kosowan, L., Delahunty-Pike, A., Abaga, E., Cooney, J., Robinson, M., Senior, D., Zsager, A., Aubrey-Bassler, K., Irwin, M., Jackson, L. A., Katz, A., Marshall, E. G., Muhajarine, N., Neudorf, C., … Pinto, A. D. (2025). Perspectives on using artificial intelligence to derive social determinants of health data from medical records in Canada: Large multijurisdictional qualitative study. Journal of Medical Internet Research, 27, e52244. https://doi.org/10.2196/52244 Edgar, R., Scholte, N. T. B., Ebrahimkheil, K., Brouwer, M. A., Beukema, R. J., Mafi-Rad, M., Vernooy, K., Yap, S.-C., Ronner, E., Van Mieghem, N., Boersma, E., Stas, P. C., Van Royen, N., & Bonnes, J. L. (2024). Automated cardiac arrest detection using a photoplethysmography wristband: Algorithm development and validation in patients with induced circulatory arrest in the DETECT-1 study. The Lancet Digital Health, 6(3), e201–e210. https://doi.org/10.1016/S2589-7500(23)00249-2 Engelmann, J., & Bernabeu, M. O. (2025). Training a high-performance retinal foundation model with half-the-data and 400 times less compute. Nature Communications, 16(1), 6862. https://doi.org/10.1038/s41467-025-62123-z Fee, N., Glover, L. E., Bauman, R., & Crosby, D. A. (2025). Patient perceptions on the use of artificial intelligence (AI) in fertility treatment. Human Fertility, 28(1), 2591161. 
https://doi.org/10.1080/14647273.2025.2591161 Ghareeb, A. E., Chang, B., Mitchener, L., Yiu, A., Szostkiewicz, C. J., Laurent, J. M., Razzak, M. T., White, A. D., Hinks, M. M., & A P P E N D I X | A I I N D E X R E P O R T 2026 415 Rodriques, S. G. (2025). Robin: A multi-agent system for automating scientific discovery (Version 1). arXiv. https://doi.org/10.48550/ ARXIV.2505.13400 Hayes, T., Rao, R., Akin, H., Sofroniew, N. J., Oktay, D., Lin, Z., Verkuil, R., Tran, V. Q., Deaton, J., Wiggert, M., Badkundri, R., Shafkat, I., Gong, J., Derry, A., Molina, R. S., Thomas, N., Khan, Y., Mishra, C., Kim, C., … Rives, A. (2024). Simulating 500 million years of evolution with a language model. https://doi.org/10.1101/2024.07.01.600583 Hernström, V., Josefsson, V., Sartor, H., Schmidt, D., Larsson, A.-M., Hofvind, S., Andersson, I., Rosso, A., Hagberg, O., & Lång, K. (2025). Screening performance and characteristics of breast cancer detected in the Mammography Screening with Artificial Intelligence trial (MASAI): A randomised, controlled, parallel-group, non-inferiority, single-blinded, screening accuracy study. The Lancet Digital Health, 7(3), e175–e183. https://doi.org/10.1016/S2589-7500(24)00267-X Holste, G., Oikonomou, E. K., Tokodi, M., Kovács, A., Wang, Z., & Khera, R. (2025). Complete AI-enabled echocardiography interpretation with multitask deep learning. JAMA, 334(4), 306. https://doi.org/10.1001/jama.2025.8731 Huang, K., Zhang, S., Wang, H., Qu, Y., Lu, Y., Roohani, Y., Li, R., Qiu, L., Li, G., Zhang, J., Yin, D., Marwaha, S., Carter, J. N., Zhou, X., Wheeler, M., Bernstein, J. A., Wang, M., He, P., Zhou, J., … Leskovec, J. (2025). Biomni: A general-purpose biomedical AI agent. https://doi.org/10.1101/2025.05.30.656746 Huo, Z., Fries, J. A., Lozano, A., Valanarasu, J. M. J., Steinberg, E., Blankemeier, L., Chaudhari, A. S., Langlotz, C., & Shah, N. H. (2024). Time-to-event pretraining for 3D medical imaging (Version 2). arXiv. 
https://doi.org/10.48550/ARXIV.2411.09361 Jain, S., Beazer, J., Ruffolo, J. A., Bhatnagar, A., & Madani, A. (2025). E1: Retrieval-augmented protein encoder models. https://doi. org/10.1101/2025.11.12.688125 Jiang, Y., Black, K. C., Geng, G., Park, D., Zou, J., Ng, A. Y., & Chen, J. H. (2025). MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2501.14654 Jin, R., Xu, M., Meng, F., Wan, G., Cai, Q., Jiang, Y., Han, J., Chen, Y., Lu, W., Wang, M., Lan, Z., Jiang, Y., Liu, J., Wang, D., Cong, L., & Zhang, Z. (2025). STELLA: Towards a biomedical world model with self-evolving multimodal agents. https://doi. org/10.1101/2025.07.01.662467 Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-02103819-2 Kim, S., Jin, P., Song, S., Chen, C., Li, Y., Ren, H., Li, X., Liu, T., & Li, Q. (2025). EchoFM: Foundation model for generalizable echocardiogram analysis. IEEE Transactions on Medical Imaging, 44(10), 4049–4062. https://doi.org/10.1109/TMI.2025.3580713 Labinsky, H., Klemm, P., Graalmann, L., Thiele, T., Hornig, J., Fink, D., Morf, H., Mucke, J., Kiltz, U., Pecher, A.-C., Pfeil, A., EllingAudersch, C., Bendzuck, G., Krusche, M., Hueber, A. J., & Knitza, J. (2025). Patient experiences, attitudes, and profiles regarding artificial intelligence in rheumatology: A German national cross-sectional survey study. Rheumatology International, 45(12), 269. https://doi.org/10.1007/s00296-025-06023-x Levine, D. S., Shuaibi, M., Spotte-Smith, E. W. C., Taylor, M. G., Hasyim, M. R., Michel, K., Batatia, I., Csányi, G., Dzamba, M., Eastman, P., Frey, N. 
C., Fu, X., Gharakhanyan, V., Krishnapriyan, A. S., Rackers, J. A., Raja, S., Rizvi, A., Rosen, A. S., Ulissi, Z., … Wood, B. M. (2026). The Open Molecules 2025 (OMol25) dataset, evaluations, and models (arXiv:2505.08762). arXiv. https://doi. org/10.48550/arXiv.2505.08762 Li, H., Zhang, S., Tao, L., Li, X., & Liu, J. (2025). Acceptance of healthcare services based on the large language model in China: A national cross-sectional study. BMC Public Health, 25(1), 4007. https://doi.org/10.1186/s12889-025-25425-x Liu, H., Zhu, Y., Wang, Y., Long, Y., Lai, Z., Yu, L., & Zhao, C. (2025). MedMMV: A controllable multimodal multi-agent framework for reliable and verifiable clinical reasoning (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2509.24314 Machado, C. M., & Voyer, B. G. (2025). Applying artificial intelligence to cardiology services: A qualitative exploration of patient perspectives. British Journal of Healthcare Management, 31(7), 1–15. https://doi.org/10.12968/bjhc.2024.0086 Madani, A., Krause, B., Greene, E. R., Subramanian, S., Mohr, B. P., Holton, J. M., Olmos, J. L., Xiong, C., Sun, Z. Z., Socher, R., Fraser, J. S., & Naik, N. (2023). Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 41(8), 1099–1106. https://doi.org/10.1038/s41587-022-01618-2 Mello, M. M., Char, D., & Xu, S. H. (2025). Ethical obligations to inform patients about use of AI tools. JAMA, 334(9), 767. https://doi. org/10.1001/jama.2025.11417 Munim, A., Fallahpour, A., Szasz, T., Attarpour, A., Jiang, R., Sooriyakanthan, B., Sooriyakanthan, M., Whitney, H., Slivnick, J., Rubin, B., Tsang, W., & Wang, B. (2026). EchoJEPA: A latent predictive foundation model for echocardiography (arXiv:2602.02603). arXiv. A P P E N D I X | A I I N D E X R E P O R T 2026 416 https://doi.org/10.48550/arXiv.2602.02603 Njei, B., Al-Ajlouni, Y. A., Sidney Kanmounye, U., Boateng, S., Loic Nguefang, G., Njei, N., Hamouri, S., & Al-Ajlouni, A. F. (2026). 
Artificial intelligence agents in healthcare research: A scoping review. PLOS One, 21(2), e0342182. https://doi.org/10.1371/journal. pone.0342182 Nong, P., & Ji, M. (2025). Expectations of healthcare AI and the role of trust: Understanding patient views on how AI will impact cost, access, and patient-provider relationships. Journal of the American Medical Informatics Association, 32(5), 795–799. https://doi. org/10.1093/jamia/ocaf031 Notin, P., Kollasch, A. W., Ritter, D., Van Niekerk, L., Paul, S., Spinner, H., Rollins, N., Shaw, A., Weitzman, R., Frazer, J., Dias, M., Franceschi, D., Orenbuch, R., Gal, Y., & Marks, D. S. (2023). ProteinGym: Large-scale benchmarks for protein design and fitness prediction. https://doi.org/10.1101/2023.12.07.570727 Ogu, J. C., Ozcan, B. B., Dogan, B. E., Xi, Y., & Knippa, E. E. (2026). Patient perspectives on artificial intelligence in mammography interpretation: A comparative survey study of safety-net and academic hospital settings. Breast Cancer Research and Treatment, 215(1), 25. https://doi.org/10.1007/s10549-025-07870-9 Omar, M., Sorin, V., Wieler, L. H., Charney, A. W., Kovatch, P., Horowitz, C. R., Korfiatis, P., Glicksberg, B. S., Freeman, R., Nadkarni, G. N., & Klang, E. (2026). Mapping the susceptibility of large language models to medical misinformation across clinical notes and social media: A cross-sectional benchmarking analysis. The Lancet Digital Health, 8(1), 100949. https://doi.org/10.1016/j. landig.2025.100949 Pai, S., Hadzic, I., Bontempi, D., Bressem, K., Kann, B. H., Fedorov, A., Mak, R. H., & Aerts, H. J. W. L. (2025). Vision foundation models for computed tomography (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2501.09001 Passaro, S., Corso, G., Wohlwend, J., Reveiz, M., Thaler, S., Somnath, V. R., Getz, N., Portnoi, T., Roy, J., Stark, H., Kwabi-Addo, D., Beaini, D., Jaakkola, T., & Barzilay, R. (2025). Boltz-2: Towards accurate and efficient binding affinity prediction. https://doi. 
org/10.1101/2025.06.14.659707 Rao, A., Jensen, M., Fisher, A. T., Blankemeier, L., Berens, P., Fereydooni, A., Lirette, S., Alkan, E., Kitamura, F. C., Chaves, J. M. Z., Reis, E., Desai, A., Willis, M. H., Hom, J., Johnston, A., Lenchik, L., Boutin, R. D., Farina, E. M. J. M., Serpa, A. S., … Chaudhari, A. S. (2026). Comp2Comp: Open-source software with FDA-cleared artificial intelligence algorithms for computed tomography image analysis (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2602.10364 Sandhu, A. T., Rodriguez, F., Ngo, S., Patel, B. N., Mastrodicasa, D., Eng, D., Khandwala, N., Balla, S., Sousa, D., & Maron, D. J. (2023). Incidental coronary artery calcium: Opportunistic screening of previous nongated chest computed tomography scans to improve statin rates (NOTIFY-1 project). Circulation, 147(9), 703–714. https://doi.org/10.1161/CIRCULATIONAHA.122.062746 Sau, A., Pastika, L., Sieliwonczyk, E., Patlatzoglou, K., Ribeiro, A. H., McGurk, K. A., Zeidaabadi, B., Zhang, H., Macierzanka, K., Mandic, D., Sabino, E., Giatti, L., Barreto, S. M., Camelo, L. D. V., Tzoulaki, I., O’Regan, D. P., Peters, N. S., Ware, J. S., Ribeiro, A. L. P., … Ng, F. S. (2024). Artificial intelligence-enabled electrocardiogram for mortality and cardiovascular risk estimation: A model development and validation study. The Lancet Digital Health, 6(11), e791–e802. https://doi.org/10.1016/S2589-7500(24)00172-9 Shaban, M., Chang, Y., Qiu, H., Yeo, Y. Y., Song, A. H., Jaume, G., Wang, Y., Weishaupt, L. L., Ding, T., Vaidya, A., Lamane, A., Shao, D., Zidane, M., Bai, Y., McCallum, P., Luo, S., Wu, W., Wang, Y., Cramer, P., … Mahmood, F. (2025). A foundation model for spatial proteomics (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2506.03373 Shah, S. J., Devon-Sand, A., Ma, S. P., Jeong, Y., Crowell, T., Smith, M., Liang, A. S., Delahaie, C., Hsia, C., Shanafelt, T., Pfeffer, M. A., Sharp, C., Lin, S., & Garcia, P. (2025). 
Ambient artificial intelligence scribes: Physician burnout and perspectives on usability and documentation burden. Journal of the American Medical Informatics Association, 32(2), 375–380. https://doi.org/10.1093/jamia/ ocae295 Shah, S. J., Nair, A., Murtagh, K., Ma, S. P., Vogt, K., Clutter, D., Sheikh, L., Schmidt, H., Smith, M., Lakhotia, A., Bullock, L., Bhasin, A., Pfeffer, M. A., Sharp, C., Lin, S., & Garcia, P. (2025). Clinician perspectives on AI-generated drafts of patient test result explanations. JAMA Network Open, 8(8), e2528794. https://doi.org/10.1001/jamanetworkopen.2025.28794 Shi, D., Zhang, W., Yang, J., Huang, S., Chen, X., Xu, P., Jin, K., Lin, S., Wei, J., Yusufu, M., Liu, S., Zhang, Q., Ge, Z., Xu, X., & He, M. (2025). A multimodal visual–language foundation model for computational ophthalmology. Npj Digital Medicine, 8(1), 381. https://doi. org/10.1038/s41746-025-01772-2 Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3 (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2508.10104 Singh, R., Bapna, M., Diab, A. R., Ruiz, E. S., & Lotter, W. (2025). How AI is used in FDA-authorized medical devices: A taxonomy across 1,016 authorizations. Npj Digital Medicine, 8(1), 388. https://doi.org/10.1038/s41746-025-01800-1 A P P E N D I X | A I I N D E X R E P O R T 2026 417 Soh, Z. D., Bai, Y., Yu, K., Zhou, Y., Lei, X., Thakur, S., Lee, Z., Phang, L. C. L., Peng, Q., Xue, C. C., Chong, R. S., Hoang, Q. V., Raghavan, L., Tham, Y. C., Sabanayagam, C., Wu, W.-C., Ho, M.-C., He, J., Gupta, P., … Cheng, C.-Y. (2025). An integrated languagevision foundation model for conversational diagnostics and triaging in primary eye care. Cell Reports Medicine, 6(12), 102476. 
https:// doi.org/10.1016/j.xcrm.2025.102476 Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E., & Zou, J. (2024). The virtual lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation. https://doi.org/10.1101/2024.11.11.623004 The OpenFold3 Team. (2026). OpenFold3-preview-2 technical report. Tudor, B. H., Shargo, R., Gray, G. M., Fierstein, J. L., Kuo, F. H., Burton, R., Johnson, J. T., Scully, B. B., Asante-Korang, A., Rehman, M. A., & Ahumada, L. M. (2025). A scoping review of human digital twins in healthcare applications and usage patterns. Npj Digital Medicine, 8(1), 587. https://doi.org/10.1038/s41746-025-01910-w Vaidya, A., Zhang, A., Jaume, G., Song, A. H., Ding, T., Wagner, S. J., Lu, M. Y., Doucet, P., Robertson, H., Almagro-Perez, C., Chen, R. J., ElHarouni, D., Ayoub, G., Bossi, C., Ligon, K. L., Gerber, G., Le, L. P., & Mahmood, F. (2025). Molecular-driven foundation model for oncologic pathology (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2501.16652 Van Bon, A. C., Blauw, H., Jansen, T. J. P., Laverman, G. D., Urgert, T., Geessink-Mennink, J., Mulder, A. H., Out, M., Groote Veldman, R., Onvlee, A. J., Schouwenberg, B. J. J. W., Vermeulen, M. A. R., Diekman, M. J. M., Gerding, M. N., Van Wijk, J. P. H., Klaassen, M., Witkop, M., & DeVries, J. H. (2024). Bihormonal fully closed-loop system for the treatment of type 1 diabetes: A real-world multicentre, prospective, single-arm trial in the Netherlands. The Lancet Digital Health, 6(4), e272–e280. https://doi.org/10.1016/ S2589-7500(24)00002-5 Vince, O., Oldach, P., Pereno, V., Leung, M. H. Y., Greco, C., Minto-Cowcher, G., Ur-Rehman, S., Kam, K. Y. K., Chow, W., Bolton, E., Mwambingu, B. R., Greenhalgh, N. L., Knot, I. E., Christoffersen, L., Clark, M., Pecoraro, R., Kollasch, A. W., Bohnuud, T., Bakalar, M., … Gowers, G. (2025). Breaking through biology’s data wall: Expanding the known tree of life by over 10x using a global biodiscovery pipeline. 
https://doi.org/10.1101/2025.06.11.658620 Vorontsov, E., Shaikovski, G., Casson, A., Viret, J., Zimmermann, E., Tenenholtz, N., Wang, Y. K., Bernhard, J. H., Godrich, R. A., Retamero, J. A., Shia, J., Gonen, M., Weiser, M. R., Klimstra, D. S., Yousfi, R., Fusi, N., Fuchs, T. J., Severson, K., & Liu, S. (2025). PRISM2: Unlocking multi-modal Ggeneral pathology AI with clinical dialogue (Version 2). arXiv. https://doi.org/10.48550/ ARXIV.2506.13063 Vukadinovic, M., Chiu, I.-M., Tang, X., Yuan, N., Chen, T.-Y., Cheng, P., Li, D., Cheng, S., He, B., & Ouyang, D. (2026). Comprehensive echocardiogram evaluation with view primed vision language AI. Nature, 650(8103), 970–977. https://doi.org/10.1038/s41586-02509850-x Wahab, N., & Rajpoot, N. (2025). MPath: Multimodal pathology report generation from whole slide images (Version 1). arXiv. https:// doi.org/10.48550/ARXIV.2512.11906 Wald, T., Hamamci, I. E., Gao, Y., Bond-Taylor, S., Sharma, H., Ilse, M., Lo, C., Melnichenko, O., Schwaighofer, A., Codella, N. C. F., Wetscherek, M. T., Maier-Hein, K. H., Korfiatis, P., Salvatelli, V., Alvarez-Valle, J., & Pérez-García, F. (2025). Comprehensive languageimage pre-training for 3D medical image understanding (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2510.15042 Wu, C., Zhang, X., Zhang, Y., Hui, H., Wang, Y., & Xie, W. (2025). Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nature Communications, 16(1), 7866. https://doi.org/10.1038/s41467-025-62385-7 Wu, D., Haredasht, F. N., Maharaj, S. K., Jain, P., Tran, J., Gwiazdon, M., Rustagi, A., Jindal, J., Koshy, J. M., Kadiyala, V., Agarwal, A., Tappuni, B., French, B., Jesudasen, S., Cosgriff, C. V., Chakraborty, R., Caldwell, J., Ziolkowski, S., Iberri, D. J., … Goh, E. (2025). First, do NOHARM: Towards clinically safe large language models (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2512.01241 wwPDB consortium, Burley, S. K., Berman, H. 
M., Bhikadiya, C., Bi, C., Chen, L., Costanzo, L. D., Christie, C., Duarte, J. M., Dutta, S., Feng, Z., Ghosh, S., Goodsell, D. S., Green, R. K., Guranovic, V., Guzenko, D., Hudson, B. P., Liang, Y., Lowe, R., … Ioannidis, Y. E. (2019). Protein Data Bank: The single global archive for 3D macromolecular structure data. Nucleic Acids Research, 47(D1), D520– D528. https://doi.org/10.1093/nar/gky949 Xiang, J., Wang, X., Zhang, X., Xi, Y., Eweje, F., Chen, Y., Li, Y., Bergstrom, C., Gopaulchan, M., Kim, T., Yu, K.-H., Willens, S., Olguin, F. M., Nirschl, J. J., Neal, J., Diehn, M., Yang, S., & Li, R. (2025). A vision–language foundation model for precision oncology. Nature, 638(8051), 769–778. https://doi.org/10.1038/s41586-024-08378-w Xu, S., Feng, Q., Qiao, L., Wu, H., Shen, T., Cheng, Y., Zheng, S., & Sun, S. (2025). Benchmarking all-atom biomolecular structure prediction with FoldBench. Nature Communications, 17(1), 442. https://doi.org/10.1038/s41467-025-67127-3 Xu, T., Hosseini, S., Anderson, C., Rinaldi, A., Krishnan, R. G., Martel, A. L., & Goubran, M. (2025). A generalizable 3D framework and model for self-supervised learning in medical imaging. Npj Digital Medicine, 8(1), 639. https://doi.org/10.1038/s41746-025-02035-w Xu, Y., Wang, Y., Zhou, F., Ma, J., Jin, C., Yang, S., Li, J., Zhang, Z., Zhao, C., Zhou, H., Li, Z., Lin, H., Wang, X., Wang, J., Han, A., A P P E N D I X | A I I N D E X R E P O R T 2026 418 Chan, R. C. K., Liang, L., Zhang, X., & Chen, H. (2025). A multimodal knowledge-enhanced whole-slide pathology foundation model. Nature Communications, 16(1), 11406. https://doi.org/10.1038/s41467-025-66220-x Ye, C., Benegas, G., Albors, C., Li, J. C., Prillo, S., Fields, P. D., Clarke, B., & Song, Y. S. (2025). Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models. https://doi.org/10.1101/2025.09.21.677619 Zhang, J., Cunningham, J., Brown, J., & Gatenby, R. (2022). 
Evolution-based mathematical models significantly prolong response to abiraterone in metastatic castrate-resistant prostate cancer and identify strategies to further improve outcomes. eLife, 11, e76284. https://doi.org/10.7554/eLife.76284 Zhang, J., Ubas, A. A., De Borja, R., Svensson, V., Thomas, N., Thakar, N., Lai, I., Winters, A., Khan, U., Jones, M. G., Tran, V., Pangallo, J., Papalexi, E., Sapre, A., Nguyen, H., Sanderson, O., Nigos, M., Kaplan, O., Schroeder, S., … Yu, J. (2025). Tahoe100M: A giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. https://doi. org/10.1101/2025.02.20.639398 Zheng, H., Shi, Z., & Yi, P. (2025). MedCoAct: Confidence-aware multi-agent collaboration for complete clinical decision (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2510.10461 Zimmermann, E., Vorontsov, E., Viret, J., Casson, A., Zelechowski, M., Shaikovski, G., Tenenholtz, N., Hall, J., Klimstra, D., Yousfi, R., Fuchs, T., Fusi, N., Liu, S., & Severson, K. (2024). Virchow2: Scaling self-supervised mixed magnification models in pathology (Version 3). arXiv. https://doi.org/10.48550/ARXIV.2408.00738 A P P E N D I X | A I I N D E X R E P O R T 2026 419 Chapter 7: Education Code.org Access to Computer Science Education Data on access to computer science education was drawn from Code.org’s State of AI + CS Education 2025 report and the accompanying interactive dashboard. AP Computer Science Data The AP Computer Science data, provided to Code.org through an agreement it maintains with the College Board, can be viewed here. The AP Computer Science data is drawn from the College Board’s national and state summary reports. Computing Research Association (CRA Taulbee Survey) Computing Research Association (CRA) distributes the Taulbee Survey annually to its member universities in the United States and Canada to collect information about computer science, computer engineering, and information departments. 
The data used in this report is from the 2024 survey, the results of which were published in 2025. For more information about CRA’s methodology, please refer to its report.

Global K–12 AI Education

The Raspberry Pi Computing Education Research Centre, based in the department of computer science and technology at the University of Cambridge, compiled this dataset, building off the research conducted by the Brookings Institution in its 2021 report Building Skills for Life. One change was made to the dataset to clarify that computer science in the U.S. is available in some schools and districts and not available everywhere as an elective course. For more information about the methodology used by the Raspberry Pi Computing Education Research Centre, please refer to its report. This year, the AI Index team updated the dataset compiled by the Centre with current information as of 2025.

IPEDS

The Integrated Postsecondary Education Data System (IPEDS) combines annual surveys conducted by the U.S. Department of Education’s National Center for Education Statistics. IPEDS gathers information from every college, university, and technical and vocational institution that participates in the federal student financial aid programs.

Completion Data

This chapter uses data from the Completions survey, which collects data on the number of students who complete a postsecondary education program. The complete list of majors considered to be AI software–related and AI hardware–related can be found in Appendix C of the January 2025 White House AI Talent Report.

OECD

This chapter uses data from the OECD Data Explorer, specifically from the table entitled “Number of enrolled students, graduates and new entrants by field of education.” The methodology for this dataset can be found in Education at a Glance 2024 Sources, Methodologies and Technical Notes.

Works Cited

Ahead of the curve: what the nation’s largest public university system is learning about AI. (2026). https://www.calstate.edu/impact-ofthe-csu/technology/ai-empowered-csu/Pages/ai-survey.aspx

AI talent report | CEA | The White House. (2025, January 15). https://bidenwhitehouse.archives.gov/cea/written-materials/2025/01/14/ai-talent-report/

Burns, M., Winthrop, R., Luther, N., Venetis, E., & Karim, R. (2026). A new direction for students in an AI world: Prosper, prepare, protect. Brookings. https://www.brookings.edu/wp-content/uploads/2026/01/A-New-Direction-for-Students-in-an-AI-World-FULLREPORT.pdf

Chegg.org (2025). Chegg Global Student Survey 2025. https://www.chegg.org/global-student-survey-2025

College students’ views on AI (2025). Inside Higher Ed. https://www.insidehighered.com/news/students/academics/2025/08/29/survey-college-students-views-ai

Freeman, J. (2025). Student Generative AI Survey 2025. HEPI. https://www.hepi.ac.uk/wp-content/uploads/2025/02/HEPI-KortextStudent-Generative-AI-Survey-2025.pdf

Guide to using generative artificial intelligence in primary and secondary schools (2025 Edition) | Center for Security and Emerging Technology (2025). Chinese Ministry of Education Steering Committee on Basic Education and Instruction. https://cset.georgetown.edu/publication/china-school-gen-ai-use-guide-2025/

Handa, K., Bent, D., Tamkin, A., McCain, M., Durmus, E., Stern, M., Schiraldi, M., Huang, S., Ritchie, S., Syverud, S., Jagadish, K., Vo, M., Bell, M., & Ganguli, D. (2025). Anthropic Education Report: How university students use Claude. https://www.anthropic.com/news/anthropic-education-report-how-university-students-use-claude

Joseph Doss, C., Bozick, R., Schwartz, H. L., Chu, L., Rainey, L. R., Woo, A., Reich, J., & Dukes, J. (2025). AI use in schools is quickly increasing but guidance lags behind: Findings from the RAND survey panels | RAND. RAND Survey Panels. https://www.rand.org/pubs/research_reports/RRA4180-1.html

Kumar, A. N. (2012, July 3). A study of stereotype threat in computer science | Proceedings of the 17th ACM annual conference on innovation and technology in computer science education. Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education. https://dl.acm.org/doi/10.1145/2325296.2325361

Laird, E., Dwyer, M., & Quay-de la Vallee, H. (2025). Hand in hand: Schools’ embrace of AI connected to increased risks to students. Center for Democracy & Technology. https://cdt.org/wp-content/uploads/2025/10/CDT-2025-Hand-in-Hand-Polling-111225accessible.pdf

McClain, C., Anderson, M., Sidoti, O., & Bishop, W. (2026). How teens use and view AI. https://www.pewresearch.org/wp-content/uploads/sites/20/2026/02/PI_2026.02.24_Teens-and-AI_REPORT.pdf

Students Speak Up about AI in their Learning Lives. (2025). Project Tomorrow. https://www.tomorrow.org/resource/2025-nationalreport/

Voices of Gen Z: How American youth view and use artificial intelligence. (2025). Walton Family Foundation-Gallup. https://www.gallup.com/analytics/658901/walton-family-foundation-gallup-voices-gen-american-youth.aspx

Warner, J. R., Fletcher, C. L., Torbey, R., & Garbrecht, L. S. (2019). Increasing capacity for computer science education in rural areas through a large-scale collective impact model. Proceedings of the 50th ACM Technical Symposium on Computer Science Education, 1157–1163. https://doi.org/10.1145/3287324.3287418

Watson, C. E., & Rainie, L. (2025). The AI challenge. How college faculty assess the present and future of higher education in the age of AI. Elon University’s Imagining the Digital Future Center. https://dgmg81phhvh63.cloudfront.net/content/user-photos/Research/PDFs/AI_Challenge.pdf

Yongpradit, P., Weissman, H., & Ellis, V. (2025). 2025 State of AI + CS Education. Code.org. https://drive.google.com/file/d/1p1xUFVec0fwXOBv2B17j65wiD8KYSvQ/view

Chapter 8: Policy and Governance

Global Legislation Records on AI

This year, the AI Index partnered with the Digital Policy Alert to track AI-related bills passed into law. The Digital Policy Alert’s pipeline, described below, was used to systematically identify and classify AI-relevant legislation. A final manual review was conducted by the AI Index team to verify all inclusions.
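The Digital Policy Alert pipeline mentioned above is described stage by stage in the Overview and Pipeline Architecture sections that follow. As a rough orientation, the filter, deduplicate, and annotate flow of Stages 2 through 4 might be sketched in Python as below. This is an illustrative sketch only, not the Digital Policy Alert's implementation: every name here (`Candidate`, `primary_filter`, `deduplicate`, `extract_quotes`, `run_pipeline`), the 0 to 1 relevance scale, and the threshold value of 0.5 are assumptions. The report does not state the threshold, and the LLM calls are replaced by precomputed scores and a simple keyword match. Stage 1 crawling is omitted.

```python
from dataclasses import dataclass, field

# Hypothetical cut-off: the report mentions a relevance threshold
# but does not give its value or scale.
RELEVANCE_THRESHOLD = 0.5


@dataclass
class Candidate:
    url: str
    text: str
    relevance: float       # stand-in for the Stage 2 LLM relevance assessment
    doc_type: str          # e.g. "primary legislation", "executive order"
    bill_number: str = ""  # Stage 3 field, used here for deduplication
    title: str = ""
    status: str = ""       # e.g. "enacted", "proposed"
    quotes: list = field(default_factory=list)  # Stage 4 evidence


def primary_filter(candidates, threshold=RELEVANCE_THRESHOLD):
    """Stage 2 stand-in: discard documents that fall below the threshold."""
    return [c for c in candidates if c.relevance >= threshold]


def deduplicate(candidates):
    """Stage 3 stand-in: collapse records sharing a bill number and title."""
    seen = {}
    for c in candidates:
        key = (c.bill_number.strip().lower(), c.title.strip().lower())
        seen.setdefault(key, c)  # keep the first record for each key
    return list(seen.values())


def extract_quotes(candidate, keywords):
    """Stage 4 stand-in: keep verbatim sentences that mention a keyword."""
    sentences = [s.strip() for s in candidate.text.split(".") if s.strip()]
    return [s for s in sentences
            if any(k.lower() in s.lower() for k in keywords)]


def run_pipeline(candidates, keywords):
    """Chain the stand-ins for Stages 2 to 4 over already-crawled documents."""
    kept = deduplicate(primary_filter(candidates))
    for c in kept:
        c.quotes = extract_quotes(c, keywords)
    return kept
```

Keeping the extracted quotes on the same record as the structured fields mirrors the report's stated aim that each annotation remain auditable against its source text.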

Overview

The AI Legislation Tracker employs an automated four-stage pipeline to systematically identify, classify, and extract structured information from AI-relevant legislation across the G20 countries. The pipeline combines autonomous agentic web crawling with large language model (LLM)–based document analysis to process thousands of candidate legal texts with minimal manual intervention.

Pipeline Architecture

Stage 1: Automated Source Crawling

An AI-driven browser agent navigates a predefined set of legislative source websites. For every source, the agent executes a predefined set of keyword queries (spanning terms such as artificial intelligence, machine learning, algorithm, and other domain-relevant expressions in the source’s language) and collects URLs of pages that appear to contain legislative content. The agent operates autonomously, handling site-specific navigation patterns including pagination, cookie consent dialogs, CAPTCHA challenges, and search filters without human intervention.

Stage 2: Primary Classification

Each collected URL is fetched and its full text extracted. The document is then submitted to a large language model (Google Gemini) for an initial relevance assessment. The model determines (a) whether the text constitutes AI-relevant legislation and (b) the document type (e.g., primary legislation, secondary regulation, strategy document, executive order). Documents falling below a defined relevance threshold are discarded at this stage.

Stage 3: Secondary Classification

Documents that pass the primary classification stage as relevant legislation undergo a second, more granular LLM analysis. This pass extracts a set of structured fields from each law, including but not limited to:

- AI system types covered by the legislation

- Sectors affected (e.g., healthcare, finance, public administration)

- Policy objectives (e.g., safety, transparency, accountability, innovation promotion)

- Responsible agencies or regulatory bodies designated by the law

- Bill or act number and formal title, serving to deduplicate legislation

- Legislative status (e.g., enacted, proposed, under review)

Stage 4: Quote Extraction

A final LLM pass identifies and extracts verbatim passages from each legislative text: the specific provisions, definitions, or clauses that substantiate the classification decisions made in the preceding stages. These quotes serve as auditable evidence linking each structured annotation to its source material.

Data Storage

The pipeline persists the following artefacts on the server:

- All visited URLs and their classification outcomes

- Full scraped text of all URLs that passed primary classification (sub-page texts used only for classification are not retained separately)

- Structured classification fields and extracted quotes, stored in JSON format

- Crawl job metadata, including source identifiers and processing time stamps

US State-Level AI Legislation

For AI-related bills passed into law, the AI Index performed searches for the keyword “artificial intelligence” in the full text of bills on the websites of all 50 U.S. states. Bills are only counted as passed into law if the keyword appears in the final version of the bill, not just the introduced version. Note that only laws passed from 2015 to 2025 are included. The AI Index team surveyed the following databases: Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming.

For a more thorough review, the AI Index also included AI-related state laws listed on the Multistate AI Legislation Tracker, even if they did not specifically reference “artificial intelligence” as a keyword.

US AI Regulation

This section examines AI-related regulations enacted by U.S. regulatory agencies from 2016 to 2025, analyzing the total number of regulations and their originating agencies. To compile this data, the AI Index conducted a keyword search for “artificial intelligence” on the Federal Register, a comprehensive repository of government documents drawn from over 436 agencies and nearly every branch of the U.S. government.

US Congressional Hearings

This dataset captures witness participation in U.S. congressional hearings on artificial intelligence, spanning the 115th Congress (Jan. 3, 2017) through the 119th Congress (as of Jan. 30, 2025).
The data was collected primarily from Congress.gov and, when needed, supplemented with committee websites and transcripts from the U.S. Government Publishing Office. Hearings were identified through a manual, keyword-based search of titles and descriptions (e.g., “artificial intelligence,” “machine learning,” “algorithm,” “automation,” “generative AI”), then manually validated to ensure AI was the hearing’s central focus. The unit of analysis is the individual witness appearance, yielding 425 observations across 107 hearings. For each observation, the AI Index recorded hearing-level information (congressional session, date, name, event number, serial number, subject) and witness-level attributes (e.g., name, organization, and categorized affiliation).

Public Investment in AI

The AI Index analyzed government AI spending across European countries and the United States, focusing on regions where data is more accessible. It is important to note that this analysis may not fully represent all countries or regions, as the availability and quality of data can vary significantly. Additionally, while this analysis includes data on government contracts from various countries, it only covers grant-level spending for the United States. This discrepancy is the result of challenges in collecting comparable grant data from other countries and regions, such as the European Union and China. Nevertheless, the U.S. case illustrates that a substantial portion of government spending on AI occurs through grants. Coverage will expand in future iterations of the AI Index as more data becomes available, but discrepancies and gaps in the existing data may affect the comprehensiveness and accuracy of the findings.

Data Sources

For European countries, the AI Index collected public tender data from Tenders Electronic Daily (TED) (Publications Office of the European Union, 2024), the online supplement to the official journal of the EU dedicated to European public procurement.
While contracts are available in various formats, the most detailed data comes from bulk XML downloads, which include comprehensive information on tendering procedures, issuing entities, awarded contractors, lot values, descriptions, award dates, and common procurement vocabulary (CPV) codes. TED publication is governed by EU law thresholds: Tenders above specific monetary values, deemed of cross-border interest, must be published on TED. However, some countries also report below-threshold procurements, leading to variations in coverage across countries.

For the United Kingdom, data sources include TED, Find a Tender, Contracts Finder, and Contracts Finder Archive. Data from Scotland and Wales was accessed via the APIs of their procurement websites. Northern Ireland does not offer this service and was therefore excluded from the analysis, which may lead to an underestimation of public investments in AI for the U.K. Due to API limitations restricting historical data access, the AI Index utilized the Open Contracting Partnership’s data registry via Kingfisher Collect to obtain comprehensive data for Scotland and Wales.

Contracts and OTAs data for the United States was sourced from Omari et al. (2025) and from the official FPDS API. Grants data was sourced from the publicly accessible USAspending platform, an official repository that facilitates bulk downloads of information related to contract award notices and grant data. While this dataset encompasses a longer time frame than the TED dataset, it is important to note that data quality can vary. Additionally, a study by the U.S. Government Accountability Office (GAO, 2023) found that 49 agencies, including 25 in the executive branch, did not report data to USAspending, accounting for over $5 billion in net outlays for fiscal year 2022.
Data Processing

Processing TED data posed significant challenges due to inconsistent storage of contract descriptions, whose XML tag names varied by release time and procurement type. Some files contained aggregated descriptions while others detailed each awarded contract lot. To capture comprehensive information, the main descriptions of each competition call were combined with partial descriptions when available. The linguistic diversity in data from different countries required translating all texts into English using the deep-translator tool and the Google Translator engine. Post-translation, tender texts were processed using natural language processing (NLP) techniques. These included the removal of stop words and special characters, part-of-speech (POS) tagging to retain key grammatical categories, lowercase conversion, lemmatization, and replacement of numerical measures with a <NUM> tag.

U.S. data provides transaction-level information for multi-year awards. In earlier versions, when matching on a single transaction, the AI Index team reconstructed the full award life cycle in terms of obligations to mitigate the effects of past de-obligations (unrelated to current matches). In recent years, however, AI-related activities have increasingly been added to grants and tenders that were awarded earlier, resulting in a retroactive inflation of estimated investment levels. The AI Index team therefore adopted an alternative approach that aggregates only transactions occurring after each AI match for a given awarded procedure, while controlling for the extreme effects of prior de-obligations (i.e., dropping all de-obligations reported within the first year⁶ of the first AI transaction in time for each award). This approach preserves the time trends observed in earlier periods while reducing the total estimated stock of investments, particularly for the initial sample period.
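The aggregation rule described above for U.S. awards can be illustrated with a minimal Python sketch. It is not the AI Index's actual code. The `Transaction` fields, the function name, the 365-day window, and the choice to count the first AI-matched transaction itself are all assumptions made for illustration, and the sample records are invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Transaction:
    """One obligation record for an award (hypothetical schema)."""
    award_id: str
    action_date: date
    obligation: float  # a negative value represents a de-obligation
    ai_match: bool     # True if this transaction matched the AI dictionary


def ai_investment_by_award(transactions, window_days=365):
    """Sum obligations per award, counting only transactions dated on or
    after the award's first AI-matched transaction, and dropping
    de-obligations that fall within `window_days` of that first match.

    Both the inclusive start and the 365-day window are assumptions.
    """
    by_award = {}
    for tx in transactions:
        by_award.setdefault(tx.award_id, []).append(tx)

    totals = {}
    for award_id, txs in by_award.items():
        ai_dates = [tx.action_date for tx in txs if tx.ai_match]
        if not ai_dates:
            continue  # the award never matched an AI term
        first_ai = min(ai_dates)
        window_end = first_ai + timedelta(days=window_days)

        total = 0.0
        for tx in txs:
            if tx.action_date < first_ai:
                continue  # ignore activity before the first AI match
            if tx.obligation < 0 and tx.action_date <= window_end:
                continue  # drop de-obligations inside the first-year window
            total += tx.obligation
        totals[award_id] = total
    return totals


# Invented sample data, for illustration only.
SAMPLE_TRANSACTIONS = [
    Transaction("A1", date(2020, 1, 10), 100.0, False),  # before first match
    Transaction("A1", date(2021, 3, 1), 50.0, True),     # first AI match
    Transaction("A1", date(2021, 6, 1), -30.0, False),   # dropped: in window
    Transaction("A1", date(2022, 6, 1), 20.0, False),
    Transaction("A1", date(2023, 1, 1), -5.0, False),    # kept: after window
    Transaction("A2", date(2021, 1, 1), 40.0, False),    # no AI match
]

print(ai_investment_by_award(SAMPLE_TRANSACTIONS))  # {'A1': 65.0}
```

In this sketch, award A1 contributes 50 + 20 - 5 = 65: the 100 obligated before the first AI match is ignored, which is the behavior that avoids retroactively inflating earlier investment levels.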
Classification

Classifying AI-related contracts and grants was achieved using full-text search with regular expressions. An AI dictionary was compiled by generating AI-related expressions and incorporating “core” expressions from the Yamashita et al. (2021) vocabulary. Additionally, a Word2Vec model expanded the dictionary with cosine-similar terms for each baseline expression; these terms were manually reviewed before being included in the final vocabulary. This process provided keywords and co-occurrence patterns crucial for identifying AI content. The classification followed a multistep approach. Initially, regular expression (regex) matching identified AI terms within contract and grant awards. These documents were then categorized as either “non AI-related” or “AI-related.” To validate AI-related matches, a BERTopic model and a pretrained DeBERTa transformer were employed to assess probability scores for specific AI-related topics. Awards with relevance scores below 20% underwent manual review, while those with higher scores were confirmed as AI-related. To ensure additional accuracy, all high-value tenders were also manually reviewed.

⁶ Closeout action period defined by CFR regulation.

Chapter 9: Public Opinion

Ipsos

For brevity, the 2025 AI Index does not republish the methodology used for the Ipsos survey featured in the report. More details about the Ipsos surveys’ methodology can be found in the survey reports themselves: 2022, 2023, 2024, 2025, 2026 (Ipsos/Google MultiCountry AI Survey).

Works Cited

Bengio, Y., Clare, S., Prunkl, C., Rismani, S., Andriushchenko, M., Bucknall, B., Fox, P., Hu, T., Jones, C., Manning, S., Maslej, N., Mavroudis, V., McGlynn, C., Murray, M., Stix, C., Velasco, L., Wheeler, N., Privitera, D., Mindermann, S., … Zhu, L. (2025). International AI safety report 2025: First key update: Capabilities and risk implications (arXiv:2510.13653). arXiv. https://doi.org/10.48550/arXiv.2510.13653

Chou, C.-Y., Chan, T.-W., Chen, Z.-H., Liao, C.-Y., Shih, J.-L., Wu, Y.-T., Chang, B., Yeh, C. Y. C., Hung, H.-C., & Cheng, H. (2025). Defining AI companions: A research agenda—from artificial companions for learning to general artificial companions for Global Harwell. Research and Practice in Technology Enhanced Learning, 20, 032–032. https://doi.org/10.58459/rptel.2025.20032

Freitas, J. D., Uguralp, A. K., Uguralp, Z. O., & Stefano, P. (2024). AI companions reduce loneliness (arXiv:2407.19096). arXiv. https://doi.org/10.48550/arXiv.2407.19096

Gillespie, N., Lockey, S., Ward, T., Macdade, A., & Hassed, G. (2025). Trust, attitudes and use of artificial intelligence: A global study 2025. The University of Melbourne. https://doi.org/10.26188/28822919

Kennedy, B., Yam, E., Kikuchi, E., Pula, I., & Fuentes, J. (2025). How Americans view AI and its impact on people and society. Pew Research Center. https://www.pewresearch.org/wp-content/uploads/sites/20/2025/09/PS_2025.9.15_AI-and-its-impact_report.pdf

Kouros, T., & Papa, V. (2024). Digital mirrors: AI companions and the self. Societies, 14(10), 1–18. https://www.mdpi.com/20754698/14/10/200

McClain, C., Kennedy, B., Gottfried, J., Anderson, M., & Pasquini, G. (2025). How the U.S. public and AI experts view artificial intelligence. Pew Research Center. https://www.pewresearch.org/wp-content/uploads/sites/20/2025/04/pi_2025.04.03_us-publicand-ai-experts_report.pdf

Murphy, C., Rosenberg, J., Canedy, J., Jacobs, Z., Flechner, N., Britt, R., Pan, A., Rogers-Smith, C., Mayland, D., Buffington, C., Coston, A., Kerner, H., Pierson, E., Rabbany, R., Seamans, R., Su, Y., Tramèr, F., Hashimoto, T., Narayanan, A., … Karger, E. (2025). The longitudinal expert AI panel: Understanding expert views on AI capabilities, adoption, and impact. Forecasting Research Institute. https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/6939ad8392bbfd43752309be/1765387651933/the-longitudinalexpert-ai-panel.pdf

Pan, H., Xu, W., Shen, M., & Gao, Z. (2025). Human-centered artificial social intelligence (HC-ASI) (arXiv:2511.21044). arXiv. https://doi.org/10.48550/arXiv.2511.21044

Pataranutaporn, P., Karny, S., Archiwaranguprok, C., Constanze, A., Auren R., L., & Maes, P. (2025). “My boyfriend is AI”: A computational analysis of human-AI companionship in Reddit’s AI community. arXiv. https://arxiv.org/abs/2509.11391

Poushter, J., Fagan, M., & Corichi, M. (2025). How people around the world view AI. Pew Research Center. https://www.pewresearch.org/wp-content/uploads/sites/20/2025/10/pg_2025.10.15_ai_report.pdf

Quinn, M. J., & Riley, J. (2024). Companion robots: A debate. Ubiquity, 2024(December), 1:1-1:24. https://doi.org/10.1145/3707639

Rainie, L., & Anderson, J. (2025). Public views on being human in 2035. Elon University. https://imaginingthedigitalfuture.org/reportsand-publications/public-views-on-being-human-in-2035/

Uslu, A., Wihbey, J., Lazer, D., Perlis, R. H., Ognyanova, K., Baum, M. A., Druckman, J. N., Santillana, M., Qu, H., & Sullivan, G. (2025). AI across America: Attitudes on AI usage, job impact, and federal regulation. The Civic Health Institutions Project, The COVID States Project. https://www.chip50.org/reports/ai-across-america-attitudes-on-ai-usage-job-impact-and-federal-regulation

Yang, Y., Wang, C., Xiang, X., & An, R. (2025). AI applications to reduce loneliness among older adults: A systematic review of effectiveness and technologies. Healthcare, 13(5), 446. https://doi.org/10.3390/healthcare13050446

Zhang, E., & Lu, X. (2023). Social AI improves well-being among female young adults (arXiv:2311.14706). arXiv. https://doi.org/10.48550/arXiv.2311.14706

Zhang, R., Li, H., Meng, H., Zhan, J., Gan, H., & Lee, Y.-C. (2025). The dark side of AI companionship: A taxonomy of harmful algorithmic behaviors in human-AI relationships. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, 1–17. https://doi.org/10.1145/3706598.3713429

Zhang, Y., Zhao, D., Hancock, J. T., Kraut, R., & Yang, D. (2025). The rise of AI companions: How human-chatbot relationships influence well-being (arXiv:2506.12605). arXiv. https://doi.org/10.48550/arXiv.2506.12605