Your people completed the training. Do you know if it worked?
Most enterprises measure AI readiness through completion rates, course hours, and certificates. None of these reliably predict whether employees can actually use AI in their work. The distinction between tracking training activity and measuring professional capability is the most consequential gap in enterprise AI adoption today.
Completion rates tell you who finished a course. They tell you nothing about who can use AI.
Enterprises have invested heavily in AI training over the past three years. The programmes are well-intentioned. The completion rates look healthy. The certificates are prominently displayed. And the underlying assumption — that completing training produces competence — goes largely unexamined.
The research does not support that assumption.
The BCG-Harvard study of 758 consultants, published in Organization Science in March 2026, found that AI training alone showed no statistically significant performance advantage over simple tool access. The consultants who received training and those who simply received the tool performed indistinguishably. The variance in outcomes was driven by individual proficiency — a characteristic that training completion cannot detect.
Section AI's 2026 Proficiency Report, which combined surveys with hands-on skill testing across 5,000 knowledge workers, confirmed the pattern at scale: 97% of the workforce are using AI poorly or not at all. Even employees who had completed AI training programmes scored only 40 out of 100 on proficiency assessments, leaving them firmly in the “experimenter” category: capable of basic prompting but unable to reliably evaluate AI output, identify when AI use is inappropriate, or manage the risks of AI-generated content in professional settings.
What completion tracking measures — and what it misses
Training completion metrics record that an employee watched the videos, clicked through the slides, and passed a recall-based quiz. They can tell you who engaged with the content and who did not. This information is not worthless — it reveals who showed up.
What completion metrics cannot tell you is whether the employee can now formulate a prompt that produces usable output on the first attempt, identify a selectively accurate statement in an AI-generated analysis, determine when a task falls outside AI's capability boundary, or navigate the disclosure and privacy obligations of using AI with client data. These are the capabilities that determine whether AI use creates value or creates risk in professional work. And they require a different kind of measurement entirely.
The distinction is not academic. Classic transfer-of-training research estimates that only 10–15% of training effectively transfers to workplace application (Georgenson 1982; Ford et al. 2018). The Association for Talent Development found that only 12% of employees effectively apply new skills on the job. These transfer rates were established long before AI, a domain in which the distance between watching a tutorial and applying judgment under uncertainty is particularly wide.
The gap between completion and proficiency has a price — and someone is paying it
Consider the visibility gap: in 96% of organisations, the executive team cannot see what L&D delivers, which leaves only 4% that can. When the board asks “how AI-ready is our workforce?” and the only available answer is a completion percentage, the L&D function is evaluated on the wrong metric, and the investment is protected by faith rather than evidence.
Meanwhile, the cost of unmeasured AI use accumulates. BetterUp Labs and Stanford's Social Media Lab found that 40% of employees receive AI-generated “workslop” each month — low-quality content that requires an average of 1 hour and 56 minutes to resolve per incident. For a 1,000-person organisation, that translates to over $2.2 million per year in invisible rework. The completion rate of the training programme that was supposed to prevent this looks fine. The spreadsheet doesn't show what completion failed to produce.
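The arithmetic behind an estimate like that is easy to sketch. In the sketch below, the 40% incidence and the 1 hour 56 minutes per incident are the cited study figures; the headcount, the incident frequency, and the fully loaded hourly cost are illustrative assumptions, so the total moves with whatever numbers you substitute.

```python
# Back-of-envelope estimate of annual rework cost from "workslop".
# The 40% incidence and 1h56m resolution time are the cited study figures;
# headcount, incidents per month, and hourly cost are illustrative assumptions.

HEADCOUNT = 1_000                 # organisation size (assumption)
AFFECTED_SHARE = 0.40             # share receiving workslop each month (cited)
HOURS_PER_INCIDENT = 1 + 56 / 60  # 1 hour 56 minutes to resolve (cited)
INCIDENTS_PER_MONTH = 1           # per affected employee (assumption)
LOADED_HOURLY_COST = 100          # fully loaded cost per hour of work, USD (assumption)

rework_hours_per_year = (
    HEADCOUNT * AFFECTED_SHARE * INCIDENTS_PER_MONTH * HOURS_PER_INCIDENT * 12
)
annual_cost = rework_hours_per_year * LOADED_HOURLY_COST

print(f"Rework hours per year: {rework_hours_per_year:,.0f}")
print(f"Estimated annual cost: ${annual_cost:,.0f}")
```

Whatever assumptions you plug in, the total scales linearly with them; the point is that none of this rework ever appears on a completion dashboard.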
What changes when you measure proficiency instead of completion
| Dimension | Completion Tracking | Proficiency Measurement |
|---|---|---|
| What it measures | Who finished the course | Who can apply AI effectively in their role |
| Underlying science | Classical Test Theory at most — percentage-correct scoring | Item Response Theory — the methodology behind major standardised assessments worldwide |
| Question difficulty | All items treated equally | Each item calibrated for difficulty and discrimination — harder questions tell you more |
| Score precision | Percentage correct — no error estimate | Confidence interval on every score — you know how precise the estimate is |
| Comparability | Scores depend on which quiz version was taken | Scores comparable across different forms — because ability is estimated independently of specific items |
| What it detects | Who engaged with content | Who can evaluate AI output, identify errors, exercise judgment, and manage risk |
| Gaming resistance | Low — fixed questions, no adaptation | High — unique forms, adaptive difficulty, response pattern analysis, timing monitoring |
| Growth measurement | Repeated completion measures re-engagement, not growth | Pre/post designed on the same psychometric scale — growth reported only when it exceeds measurement error |
| Board presentation | “87% completed the programme” | “Advisory is at Competent level. Tax is Developing. Here's where to invest next quarter.” |
| The analogy | Counting gym memberships | Measuring fitness levels |
The people with the largest gaps are the least accurate at identifying them
The instinctive response to the completion gap is often “we'll survey employees on their AI confidence.” The evidence shows why this produces worse data, not better.
Aalto University researchers published findings in Computers in Human Behavior (February 2026) that upend the assumption behind self-assessment. In two studies with approximately 500 participants, they found a reverse Dunning-Kruger effect: higher AI literacy correlated with greater overconfidence, not better self-calibration. Participants using ChatGPT overestimated their correct answers by 4 points out of 20 — a gap larger than the actual performance improvement from using AI. Financial incentives for accurate self-assessment did not correct the bias.
The industry data confirms it at scale. 79% of tech workers admit to pretending they know more about AI than they do (Pluralsight 2025). 81% profess confidence in their AI skills, but only 12% have significant hands-on experience. And 64% of workers pass off AI-generated content as their own (Salesforce 2024) — a behaviour that self-assessment by definition will not reveal.
Kruger and Dunning's foundational 1999 research found that bottom-quartile performers rate themselves at the 58th–62nd percentile on average. The gap is structural, not motivational — people lack the very skills needed to recognise their own deficiency. In the context of AI, where outputs appear fluent and authoritative regardless of their accuracy, this metacognitive blind spot is particularly dangerous.
Completion tracking misses the problem. Self-assessment misrepresents it. Standard LMS quizzes — built on Classical Test Theory where all items count equally — lack the precision to detect it. The difference between an LMS quiz and psychometric proficiency measurement is the difference between a pop quiz and a medical board exam: one checks recall, the other measures whether you can practise. Performance-based psychometric assessment measures the capability that matters — with known precision, calibrated difficulty, and scores that mean the same thing regardless of which questions were asked.
Three findings that close the argument
The BCG-Harvard study (Dell'Acqua et al., Organization Science, March 2026) enrolled 758 consultants in a pre-registered randomised controlled trial. AI-proficient workers produced 40% higher-quality output on suitable tasks. Workers who misjudged AI's capability boundary performed 19 percentage points worse than colleagues using no AI at all. And approximately 10% — the “Sleeping Drivers” — passively delegated to AI without exercising judgment, producing the worst outcomes of any group. Training completion could not distinguish proficient users from Sleeping Drivers. Proficiency measurement can.
Gartner's 2025 Strategic Predictions forecast that 75% of hiring processes will include AI proficiency certifications and testing by 2027 — while simultaneously predicting that 50% of organisations will require “AI-free” skills assessments to counter critical-thinking atrophy from generative AI use. Both predictions point in the same direction: the era of treating AI readiness as a training checkbox is ending.
A 2024 systematic review in npj Science of Learning (a Nature Portfolio journal) evaluated 16 AI literacy measurement scales across 22 studies and concluded that no psychometrically validated gold standard for measuring AI literacy exists. Most scales demonstrated adequate structural validity, but very few had been tested for cross-cultural validity, measurement error, or criterion validity. The gap between AI adoption and validated proficiency measurement is documented at the highest levels of academic research.
From tracking activity to measuring capability
AI proficiency measurement applies the same psychometric science that has been trusted for 60 years in the highest-stakes assessments — from graduate admissions to medical licensing to military selection — to the specific question of how effectively professionals use AI in their work.
The approach differs from completion tracking in three fundamental ways. First, it accounts for question difficulty. Answering a hard question correctly reveals more about proficiency than answering an easy one — a principle that percentage-correct scoring ignores entirely. Second, it produces comparable scores across different test forms. Because ability is estimated independently of the specific items asked, two employees who take different versions of the assessment receive scores on the same scale. Third, every score includes a confidence interval — an explicit estimate of how precise the measurement is, preventing managers from over-interpreting small differences that may be noise.
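To make those three properties concrete, here is a minimal sketch of the two-parameter logistic (2PL) model that underpins Item Response Theory. The item parameters and response patterns are invented for illustration, not calibrated values from any real assessment, and the grid-search estimator stands in for the Newton-Raphson or Bayesian estimation used in operational scoring.

```python
import math

# Minimal 2PL IRT sketch. Item parameters (a = discrimination, b = difficulty)
# and response patterns are invented for illustration only.
ITEMS = [
    (1.2, -1.0),  # easy item
    (0.9,  0.0),  # medium item
    (1.5,  1.0),  # hard item
    (1.1,  1.5),  # very hard item
]

def p_correct(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items=ITEMS):
    """Maximum-likelihood ability estimate with its standard error.
    A simple grid search stands in for operational estimators."""
    def log_lik(theta):
        return sum(
            math.log(p_correct(theta, a, b)) if r
            else math.log(1.0 - p_correct(theta, a, b))
            for r, (a, b) in zip(responses, items)
        )
    theta_hat = max((t / 100 for t in range(-400, 401)), key=log_lik)
    # Fisher information at the estimate gives the score's precision.
    info = sum(a * a * p_correct(theta_hat, a, b) * (1.0 - p_correct(theta_hat, a, b))
               for a, b in items)
    return theta_hat, 1.0 / math.sqrt(info)

# Two response patterns with the same raw score (2 of 4 correct) yield
# different ability estimates, because which items were answered correctly
# matters, not just how many.
for pattern in ([1, 1, 0, 0], [0, 0, 1, 1]):
    theta, se = estimate_ability(pattern)
    print(f"{pattern}: theta = {theta:+.2f}, 95% CI ± {1.96 * se:.2f}")
```

Note also how wide the interval is with only four items: precision comes from asking enough well-calibrated questions, and the confidence interval is what makes that precision (or lack of it) visible.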
The result is organisational visibility that completion metrics cannot provide: which teams can deploy AI effectively today, which need targeted development, where the risks of unsupervised AI use are highest, and whether training investments are producing measurable change over time. The board presentation shifts from “87% completed the programme” to “Advisory is at Competent level, Tax is Developing, and here is where the next quarter's investment should go.”
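The last of those questions, whether training investments are producing measurable change, is where the confidence interval earns its keep. A minimal sketch, using invented scores and standard errors on a common scale: a pre/post gain is reported as growth only when it exceeds the combined measurement error of the two scores.

```python
from math import sqrt

def growth_is_reliable(pre_score, pre_se, post_score, post_se, z=1.96):
    """Report growth only when the pre/post difference exceeds the combined
    measurement error of the two scores (independent errors, 95% level)."""
    gain = post_score - pre_score
    se_of_gain = sqrt(pre_se ** 2 + post_se ** 2)
    return gain > z * se_of_gain

# With ~3-point standard errors on each score, a 6-point gain is still
# indistinguishable from measurement noise; a 12-point gain is not.
print(growth_is_reliable(48, 3.0, 54, 3.0))  # False
print(growth_is_reliable(48, 3.0, 60, 3.0))  # True
```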
References
BetterUp Labs & Stanford Social Media Lab. (2025). The hidden toll of AI-generated work. Harvard Business Review, September 2025.
Dell'Acqua, F., McFowland, E., Mollick, E.R., et al. (2026). Navigating the jagged technological frontier. Organization Science. DOI: 10.1287/orsc.2025.21838.
Fernandes, D., Welsch, R., et al. (2026). The effects of AI on metacognitive accuracy. Computers in Human Behavior.
Ford, J.K., Baldwin, T.T., & Prasad, J. (2018). Transfer of training: The known and the unknown. Annual Review of Organizational Psychology and Organizational Behavior, 5, 201–225.
Gartner. (2025). Top Strategic Predictions for 2026 and Beyond. Gartner Research.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it. Journal of Personality and Social Psychology, 77(6), 1121–1134.
Nature. (2024). Systematic review of AI literacy measurement instruments. npj Science of Learning.
Pluralsight. (2025). 2025 AI Skills Report: Mind the Confidence Gap.
Salesforce. (2024). Trends in AI for CRM. Salesforce Research.
Section AI. (2026). 2026 AI Proficiency Report. Section.
See what proficiency measurement looks like in practice
The methodology page explains how psychometric proficiency measurement works in practice — and how it differs from every other approach to assessing AI readiness. The research page presents the full evidence base.