Our Science

Measuring what your workforce can actually do with AI

Your people's AI proficiency, measured with the same psychometric rigour as the world's most trusted standardised assessments — accounting for question difficulty, item discrimination, and measurement uncertainty. Five dimensions. Six formats. A confidence interval on every score.

Expert-reviewed methodology · Confidence intervals on every score
How It Works

Your score depends on three things

Traditional assessments count correct answers. Genplify accounts for three factors that make each response worth more or less.

Whether your answers are correct

Across six different formats — from workplace scenarios to prompt writing to flagging errors in AI-generated content. Each format measures a different type of professional judgment.

How difficult each item is

Getting a hard question right tells us more about your ability than getting an easy one right. The scoring framework accounts for each item's specific difficulty when estimating your proficiency.

How precisely each item discriminates

Some items sharply separate higher and lower proficiency. Others are less precise. The framework gives more weight to items that distinguish between ability levels more reliably.

These three factors are the basis of Item Response Theory (IRT) — the psychometric methodology underlying major standardised assessments worldwide. Because IRT separates person ability from item properties, your people's scores mean the same thing regardless of which specific items they receive.
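
For readers who want the mechanics, here is the two-parameter logistic (2PL) form of this idea as a short Python sketch. The parameter values are illustrative assumptions; this shows the general principle, not Genplify's production model.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model.

    theta -- person ability
    a     -- item discrimination (how sharply the item separates ability levels)
    b     -- item difficulty (the ability at which P(correct) is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A hard item (b = 1.0) answered correctly is strong evidence of ability:
print(round(p_correct(theta=0.0, a=1.8, b=1.0), 2))  # 0.14 for an average person
print(round(p_correct(theta=1.5, a=1.8, b=1.0), 2))  # 0.71 for a stronger person
```

Because ability and item properties enter the model as separate parameters, the same ability can be estimated from any set of calibrated items, which is what makes scores comparable across different test forms.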

What We Measure

Five dimensions of AI proficiency

Each dimension captures a distinct professional capability. Together, they represent what it means to use AI effectively, responsibly, and with calibrated judgment in professional work.

  1. Giving AI clear instructions

    Formulating specific, contextually appropriate prompts that produce useful output on the first attempt or with minimal iteration. In practice: the difference between a prompt that produces a usable client memo and a vague request that produces generic filler.

  2. Refining AI output

    Diagnosing deficiencies in what AI produces — distinguishing accuracy errors from structural problems — and generating specific corrective instructions.

  3. Evaluating what AI produces

    Identifying errors, hallucinations, logical inconsistencies, and selectively accurate content — particularly when the output appears superficially credible. In practice: catching the AI-generated report that cites a real statistic but omits the context that reverses its meaning.

  4. Knowing when to use AI

    Determining when AI use is appropriate, when it is not, and how to integrate AI output with human expertise in a given work context. In practice: recognising that the same tool that accelerates a market analysis could compromise a privileged legal review.

  5. Managing AI responsibly

    Navigating privacy, data protection, ethical considerations, bias awareness, and disclosure obligations when using AI in professional settings.

How We Measure

Six formats, each targeting a different type of professional judgment

Multiple-choice alone cannot measure the full range of your people's AI proficiency. Each format captures a different cognitive demand — from recognition to production to classification.

Workplace scenarios

Read a professional scenario, choose the best response. Measures recognition and applied judgment — the same cognitive operation used in situational judgment tests across professional certification.

Prompt writing

Read a scenario, write the prompt you would give an AI tool. Scored on goal clarity, context, constraints, and audience awareness — with AI scoring calibrated against human expert panels.

Error identification

Read an AI-generated passage, flag statements containing errors. Classify each as fabrication, logical inconsistency, or selective accuracy. Flagging a correct statement has a cost — just as in professional work; a scoring sketch follows these six formats.

Matching and classification

Drag items into the correct categories. Tests multi-element classification — the same cognitive operation used in in-basket exercises for managerial assessment.

Ranking and prioritisation

Put items in the correct order based on a workplace scenario. Tests prioritisation under constraints — sequencing decisions about when and how to apply AI to professional tasks.

Risk triage

For each scenario, make simultaneous yes/no decisions across multiple risk dimensions. Tests the kind of multi-dimensional governance judgment required in compliance and professional oversight.
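
To make the cost of over-flagging in the error-identification format concrete, here is one possible scoring rule, sketched in Python. The credit and penalty weights are illustrative assumptions, not Genplify's production values.

```python
def flagging_score(flags: set[int], true_errors: set[int],
                   hit_credit: float = 1.0, false_flag_cost: float = 0.5) -> float:
    """Score an error-identification response, penalising false flags.

    flags       -- statement indices the respondent flagged as erroneous
    true_errors -- statement indices that actually contain errors
    Weights are illustrative: correct flags earn credit, and flagging an
    accurate statement costs points, so indiscriminate flagging does not pay.
    """
    hits = len(flags & true_errors)
    false_flags = len(flags - true_errors)
    return hits * hit_credit - false_flags * false_flag_cost

# Flagging everything is a losing strategy:
print(flagging_score({1, 2, 3, 4, 5}, {2, 4}))  # 0.5 -- three false flags
print(flagging_score({2, 4}, {2, 4}))           # 2.0 -- precise flagging
```

Under any rule of this shape, the respondent who flags only the genuine errors outscores the one who flags every statement, mirroring the professional cost of crying wolf.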

Workplace Scenario: Evaluating what AI produces

An AI-generated summary states: “Survey results were overwhelmingly positive — 78% rated the service Good or Excellent.” You check the source data: 78% is correct. But the same survey also shows 45% of respondents would not use the service again. What best describes the problem?

This is one of six formats. Each captures a different type of professional judgment — from recognition to production to classification.

Honest Uncertainty

Every score tells you how precise it is

Most assessment tools report a single number. That number without context is worse than no number at all.

Every score your people receive is displayed inside a range band — a visual representation of the confidence interval. The score appears at the centre, and the band shows the range within which their true proficiency most likely falls. No “±” notation. No statistical jargon. Just a clear picture of what you know and how well you know it.

When two employees' range bands overlap, your dashboard says so: “These scores are not meaningfully different.” This prevents managers from over-interpreting small differences that could be measurement noise.

When measuring growth, the same discipline applies. If the improvement between pre- and post-assessment does not exceed the minimum detectable difference — a statistical threshold determined by the measurement precision of both administrations — the system does not claim improvement. It reports the scores, shows the overlap, and lets the data speak honestly.
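
In code, the overlap check described above reduces to a few lines. This sketch assumes each score is reported with a standard error from the IRT estimation; the 1.96 multiplier corresponds to a 95% interval, and the exact threshold Genplify applies may differ.

```python
import math

def meaningfully_different(score_a: float, se_a: float,
                           score_b: float, se_b: float,
                           z: float = 1.96) -> bool:
    """True only if the gap between two scores exceeds the minimum
    detectable difference implied by both standard errors.

    The difference of two independent estimates has standard error
    sqrt(se_a**2 + se_b**2); z = 1.96 corresponds to 95% confidence.
    """
    mdd = z * math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(score_a - score_b) > mdd

# Two employees four points apart -- but the range bands overlap:
print(meaningfully_different(72.0, 3.0, 76.0, 3.0))  # False: gap 4 < MDD ~8.3
print(meaningfully_different(72.0, 3.0, 84.0, 3.0))  # True:  gap 12 > MDD ~8.3
```

The same function covers growth: pass the pre- and post-assessment scores with their standard errors, and improvement is claimed only when the gap clears the minimum detectable difference.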

Validation

A rigorous psychometric foundation

Genplify's assessment methodology is developed in accordance with the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), the SIOP Principles for the Validation and Use of Personnel Selection Procedures, and the ITC Guidelines on Computer-Based and Internet-Delivered Testing.

Content validity has been established through systematic expert review by subject matter experts in AI proficiency and psychometric assessment design, formal job analysis across professional services industries, and cognitive interviews with target-population respondents. Each of the five dimensions was evaluated against four selection criteria: it must be distinct from the other dimensions, observable in workplace behaviour, predictive of professional AI performance, and teachable through structured development.

Our validation programme follows the progressive approach recommended by the AERA/APA/NCME Standards — content validity established first, with construct and criterion validity studies underway as early adopter data accumulates. For your evaluation team: the detailed validation methodology and evidence are documented in our Technical Summary.

Regulatory Context

Built for the regulatory landscape your organisation operates in

Singapore

Personal Data Protection Act

Designed to support your organisation's compliance with Singapore's PDPA, including data minimisation, purpose limitation, and mandatory breach notification.

Singapore

Workplace Fairness Act

Designed to help organisations prepare for Singapore's Workplace Fairness Act. Assessment outputs are traceable, explainable, and grounded in psychometrically defensible evidence.

Hong Kong

Personal Data (Privacy) Ordinance

Designed to support your organisation's compliance with Hong Kong's PDPO, including the six Data Protection Principles and the PCPD's ethical AI guidance.

Global

EU AI Act Preparedness

Designed to support your organisation's preparedness for the EU AI Act. The architecture includes AI scoring transparency, right-to-explanation support, human oversight, and audit trails.

Common Questions

Understanding the methodology

What is Item Response Theory (IRT)?
Item Response Theory is a psychometric framework that models the relationship between a person's underlying ability and their probability of answering each test item correctly. Unlike classical test theory, IRT provides ability estimates independent of the specific questions asked, enabling adaptive testing and precise cross-group comparisons. It is the methodology underlying major standardised assessments worldwide.
Why is adaptive assessment more accurate than traditional tests?
Adaptive assessment accounts for item difficulty and discrimination when estimating proficiency. Two people who answer the same number of items correctly may receive different scores — because the difficulty of the items they answered matters. This produces more precise proficiency estimates, especially at the extremes of the ability range, where traditional percentage-correct scoring is least reliable. A worked sketch appears after these questions.
How does Genplify prevent cheating on the assessment?
Each employee receives a unique, dynamically assembled form. Combined with response pattern analysis, timing monitoring, and multiple detection methods, the system identifies anomalous behaviour without requiring invasive proctoring. Sessions classified as high-risk are queued for review before scores are released.
What AI skills does the assessment measure?
The assessment measures five dimensions: giving AI clear instructions, refining AI output through iteration, evaluating what AI produces for errors and selective accuracy, knowing when AI use is appropriate, and managing AI responsibly — including privacy, ethics, and disclosure obligations in professional settings.
Has the Genplify assessment been validated?
Content validity has been established through systematic expert review by subject matter experts in AI proficiency and psychometric assessment design. Construct and criterion validity studies are in progress, following the progressive validation approach recommended by the AERA/APA/NCME Standards for Educational and Psychological Testing (2014).
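
To make the adaptive-scoring answer above concrete, here is a minimal sketch showing how two people with identical raw scores can receive different ability estimates when their items differ in difficulty. The item parameters and the grid-search estimator are illustrative assumptions, not Genplify's production scoring code.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL response probability (same form as the earlier sketch)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(responses: list[int], items: list[tuple[float, float]]) -> float:
    """Grid-search maximum-likelihood ability estimate.

    responses -- 0/1 outcomes, aligned with items
    items     -- (discrimination a, difficulty b) pairs
    """
    def log_likelihood(theta: float) -> float:
        total = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if r else 1.0 - p)
        return total

    grid = [t / 100.0 for t in range(-400, 401)]
    return max(grid, key=log_likelihood)

# Two respondents each answer 2 of 3 items correctly -- on different forms:
easy_form = [(1.5, -1.5), (1.5, -1.0), (1.5, -0.5)]
hard_form = [(1.5, 0.5), (1.5, 1.0), (1.5, 1.5)]
print(estimate_theta([1, 1, 0], easy_form))  # roughly -0.5
print(estimate_theta([1, 1, 0], hard_form))  # roughly +1.5
```

Same raw score, very different estimates: the harder form provides stronger evidence of proficiency. Percentage-correct scoring would rate both respondents identically.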

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing.

Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates.

International Test Commission. (2005). ITC Guidelines on Computer-Based and Internet-Delivered Testing.

Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–103). Macmillan.

Society for Industrial and Organizational Psychology. (2018). Principles for the Validation and Use of Personnel Selection Procedures (5th ed.).

See the methodology in action

Schedule a consultation to see how Genplify measures AI proficiency in your organisation — or download the technical summary for your evaluation team.