We Don't Have an AI Ethics Problem. We Have an AI Measurement Problem.

8 min read
AI Governance · ISO 42001 · Auditing

This week I spent two days in Glasgow for the AI Standards Hub Summit, a gathering of 250 in-person delegates (plus 400 online) from 31 countries spanning government agencies, industry, academia, and civil society. I walked in expecting another conference heavy on principle statements and light on practical answers. I walked out convinced of something different: the AI governance conversation has quietly turned a corner. The era of “should we regulate AI?” is over. The question now is engineering-grade: how do we measure it, test it, and prove it works?

As someone who works across conformity assessment and software development, that shift matters enormously. Here’s what I took away and why I think the biggest disruption to this space might not come from the standards bodies at all.

1. The Car Dealership Problem: You’re Buying AI Blindfolded

Sebastian Hallensleben’s keynote landed the analogy that stuck with me all week. Imagine walking into a car dealership where every vehicle is hidden under a sheet. The salesperson hands you a card that says “awesome.” That’s how most organisations buy AI today — on hype, promises, and vendor marketing decks. No standardised fuel consumption. No stopping distance. No boot space.

For anyone working in conformity assessment, this is the gap that certification bodies exist to fill. The whole point of third-party assurance is turning subjective trust (“this feels safe”) into objective evidence (“here are the test results”). The problem is that for AI, we’re still arguing about what the test results should look like. Hallensleben’s push for use-case-specific quality metrics — measurable numbers rather than documentation exercises — is exactly the kind of framework that makes certification meaningful rather than performative.

From a software engineering standpoint, I’d add: we already have analogues for this. Software has unit tests, integration tests, code coverage, mean time to recovery, deployment frequency. The AI industry’s refusal to adopt equivalent rigour isn’t a technical limitation — it’s a cultural one. And standards bodies are now forcing the conversation.
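To make that concrete, here is a minimal sketch of what a use-case-specific quality gate could look like: a release check that only passes when measured numbers clear agreed thresholds. The metric names and thresholds are invented for illustration (think of a fraud-detection deployment); the point is the shape of the thing, not the values.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    """Measured, use-case-specific quality metrics for one model release."""
    false_negative_rate: float   # e.g. missed fraud cases on a held-out set
    false_positive_rate: float   # e.g. legitimate transactions wrongly blocked
    p95_latency_ms: float        # responsiveness under production-like load


# Invented thresholds for illustration: the point is that they are numbers
# agreed per use case, not a documentation exercise.
RELEASE_GATE = {
    "false_negative_rate": 0.02,
    "false_positive_rate": 0.05,
    "p95_latency_ms": 300.0,
}


def passes_release_gate(report: EvalReport) -> bool:
    """Return True only if every measured metric clears its threshold."""
    return (
        report.false_negative_rate <= RELEASE_GATE["false_negative_rate"]
        and report.false_positive_rate <= RELEASE_GATE["false_positive_rate"]
        and report.p95_latency_ms <= RELEASE_GATE["p95_latency_ms"]
    )


if __name__ == "__main__":
    # These numbers are made up; in practice they come from an evaluation run.
    candidate = EvalReport(false_negative_rate=0.015,
                           false_positive_rate=0.04,
                           p95_latency_ms=220.0)
    print("ship it" if passes_release_gate(candidate) else "blocked")
```

It is the AI equivalent of a CI pipeline refusing to merge on a failing test: boring, mechanical, and exactly what buying a car without the sheet on it looks like.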

2. 38% of AI Governance Tools Could Cause Harm. Let That Sink In.

The UK’s National Physical Laboratory dropped a statistic during the national testing approaches panel that should be front-page news: 38% of AI governance tools currently in use rely on metrics that could actually result in harm. That’s not a fringe finding — it’s a damning indictment of the current state of play.

NPL’s response — a new AI Measurement Centre announced by the Prime Minister — signals that the UK is positioning metrology (the science of measurement) as the backbone of AI governance. This is a smart play. If you can’t measure it, you can’t certify it. And if you can’t certify it, the entire quality infrastructure — accreditation bodies, testing labs, certification bodies — is building on sand.

The four-nation comparison panel (UK, Canada, US, Singapore) was revealing. Each country is taking a slightly different approach: NIST has the Risk Management Framework, Singapore has its testing toolkits and Global Assurance Sandbox, Canada ran an AI management system pilot before ISO 42001 was even published. But every single one agreed on the same punchline: context-specific testing is non-negotiable, and the data quality problem underpinning it all is, to quote one panellist, “a complete mess.”

3. The Standards Speed Problem — And Why It Opens the Door to Big Tech

The risk management panel on Day 2 surfaced a tension that anyone in software development will recognise: standards development moves at the speed of consensus, while AI moves at the speed of venture capital. ISO/IEC’s GenAI risk management standard (25568) is in development. Their agentic AI roadmap started in October 2025. Benchmarking standards take over a year per paradigm shift. Meanwhile, the models those standards are trying to govern have already been superseded three times.

Google DeepMind’s representative put it bluntly: the gap between evaluation capability and system development is widening exponentially. Pre-deployment testing doesn’t predict real-world behaviour. Model-centric governance misses the system-level, context-dependent risks that actually matter.

Here’s what I keep thinking about: this speed gap is an open invitation for the big AI companies to step in and own the space. Anthropic already publishes detailed model cards, runs red-team evaluations, and has built constitutional AI frameworks. Google has invested heavily in responsible AI tooling through their Model Evaluation and Benchmark initiatives. OpenAI publishes system cards and safety evaluations for every major release.

What if one of these companies decided to productise AI assurance? Not as a side project, but as a core offering. Imagine Anthropic launching an AI audit-as-a-service platform, leveraging Claude to continuously evaluate other AI systems against ISO 42001, the EU AI Act, and NIST frameworks — all in real time. Or Google packaging its internal safety evaluation tools as an enterprise product. They have the models, the compute, the evaluation infrastructure, and the incentive (building trust in the ecosystem benefits everyone selling AI).

For certification bodies, this should be a wake-up call. The traditional model — manual audits conducted annually by human assessors working through paper checklists — cannot compete with an AI-powered assurance engine that runs continuously. The question isn’t whether this will happen. It’s whether CBs will be the ones building it, or whether they’ll be disrupted by it.

4. Human Rights Aren’t a Hurdle. They’re the API Contract.

The UN’s Peggy Hicks gave one of the summit’s strongest keynotes, arguing that human rights should be understood not as an obstacle to AI development but as a universally agreed specification for how AI should behave when it touches people’s lives. The Seoul Declaration (December 2025) and the joint ISO/IEC/ITU statement both reinforce this.

I find the software metaphor useful here. Human rights frameworks are essentially an API contract between technology and society. They define the expected inputs, outputs, and failure modes. Standards are the implementation guide. Conformity assessment is the test suite. When you frame it that way, the conversation stops being about philosophy and starts being about engineering.
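A toy sketch of the metaphor, with entirely hypothetical names: the interface is the contract, the toy model is one implementation, and the conformity checks are the test suite.

```python
from typing import Protocol


class DecisionSystem(Protocol):
    """The 'contract': what any AI system touching people's lives must expose."""

    def decide(self, case: dict) -> dict:
        """Return a decision plus the evidence behind it."""
        ...

    def explain(self, decision_id: str) -> str:
        """A human-readable explanation of a past decision."""
        ...

    def appeal(self, decision_id: str) -> None:
        """A defined failure mode: route the case to human review."""
        ...


class ToyCreditModel:
    """Illustrative stand-in for a real system; the logic is deliberately trivial."""

    def decide(self, case: dict) -> dict:
        approved = case.get("income", 0) > 30_000
        return {"id": "case-1", "approved": approved,
                "evidence": {"income_threshold": 30_000}}

    def explain(self, decision_id: str) -> str:
        return f"Decision {decision_id} was based on the declared income threshold."

    def appeal(self, decision_id: str) -> None:
        print(f"Decision {decision_id} queued for human review.")


def conformity_suite(system: DecisionSystem, sample_case: dict) -> list[str]:
    """The 'test suite': conformity assessment as executable checks."""
    failures = []
    decision = system.decide(sample_case)
    if "evidence" not in decision:
        failures.append("decision returned without supporting evidence")
    if not system.explain(decision.get("id", "")):
        failures.append("no explanation available for the decision")
    return failures


if __name__ == "__main__":
    print(conformity_suite(ToyCreditModel(), {"income": 42_000}) or "all checks passed")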

ETSI has now adopted a human rights checklist for standards development. ISO is running capacity-building sessions for national delegations. These aren’t token gestures — they’re infrastructure changes that will shape how every AI standard is written from here on.

5. The AIQI Consortium and the Quality Infrastructure Supply Chain

The final session of Day 1 featured the AI Quality Infrastructure Consortium (AIQI), founded in September 2024 to bring together the entire quality ecosystem: standards bodies, metrology organisations, academia, industry, testing labs, certification bodies, and accreditation bodies. UKAS’s CEO noted that 6,000 professionals have already completed the ISO 42001 training course. A UK regulatory framework announcement is imminent after 18 months of development involving 100 regulators.

For the conformity assessment world, AIQI represents the formalisation of a supply chain for AI trust. It’s not just about writing standards — it’s about building the accredited ecosystem that can actually assess against them. The shift from process-based standards (like ISO 42001’s management system approach) to metrology-led, measurement-focused assessment is the key evolution.

The cybersecurity session on Day 2 illustrated this perfectly. A healthcare AI workshop — modelling an NHS deployment platform for medical image analysis — showed that even a single security principle, when applied to a real use case, reveals cascading complexity across multiple stakeholders. The lesson: you can’t paper-audit your way through AI assurance. You need interdisciplinary teams, technical depth, and domain-specific testing capability.

6. What If Standards Became Software?

The closing session’s most provocative insight: standards bodies need to fundamentally transform. The vision? Digital standards issued as machine-readable fragments, applied in real time, with iterative feedback loops for continuous improvement.

Let me take that further, because I think the summit undersold the possibility.

What if a standard wasn’t a 200-page PDF that costs £200 to download? What if it was a chatbot? You describe your AI system — its use case, data sources, deployment context, risk profile — and the standard talks back to you. It asks clarifying questions. It identifies which clauses apply. It generates a tailored compliance checklist. It tells you what evidence you need and where the gaps are. It can even run preliminary assessments against your technical documentation in real time.

This isn’t science fiction. The underlying technology exists today. Large language models can already parse regulatory text, map requirements to evidence, and generate structured compliance outputs. BSI, ISO, and ETSI collectively hold thousands of standards documents — a perfect training corpus for a domain-specific AI. The AI Standards Hub already catalogues 1,000+ tools and 100+ technical metrics. Combine that with an LLM interface and you have something genuinely transformative.
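Here is roughly what the mechanics could look like. Everything below is a sketch: the clause excerpts are invented one-line summaries (real clause text is licensed and would have to come from the publisher), and `call_llm` is a placeholder for whichever model client you would actually wire in.

```python
import json


def call_llm(prompt: str) -> str:
    """Placeholder: swap in your model provider's client here."""
    # Canned response so the sketch runs end to end without an API key.
    return json.dumps([
        {"clause": "6.1.2", "applies": True,
         "evidence_needed": "documented risk criteria for the triage use case"},
        {"clause": "8.4", "applies": True,
         "evidence_needed": "production monitoring dashboards and drift alerts"},
    ])


# Invented clause summaries for illustration only; a real tool would load the
# actual standard text under licence from the publisher.
CLAUSES = {
    "6.1.2": "The organisation shall define criteria for AI risk assessment.",
    "8.4":   "The organisation shall monitor AI system performance in operation.",
}


def tailored_checklist(system_description: str) -> list[dict]:
    """Ask the model which clauses apply and what evidence each one needs."""
    prompt = (
        "You are helping scope a conformity assessment.\n"
        f"System under assessment: {system_description}\n"
        "For each clause below, say whether it applies and what evidence an "
        "auditor would expect, returned as a JSON list of objects with the "
        "keys clause, applies, evidence_needed.\n"
        + "\n".join(f"{ref}: {text}" for ref, text in CLAUSES.items())
    )
    return json.loads(call_llm(prompt))


if __name__ == "__main__":
    for item in tailored_checklist("Triage assistant deployed in a hospital referral workflow"):
        print(item)
```

The hard parts are not the plumbing: they are licensing the clause text, validating that the mapping is correct, and deciding who is accountable when the model scopes an assessment wrongly.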

Now go one step further. What about governance-on-chip? The summit mentioned this almost in passing, but it deserves far more attention. Imagine an AI system that ships with an embedded compliance layer — a chip or firmware module that continuously monitors the system’s behaviour against the applicable standards, flags deviations in real time, generates audit-ready logs, and triggers alerts when the system drifts outside its approved operating envelope. The audit doesn’t happen once a year when the assessor shows up. It happens every millisecond.
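A rough sketch of what that embedded compliance layer might do in software, with an invented operating envelope and a toy drift check; a real governance layer would enforce whatever limits the certifier actually signed off.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass

logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_log = logging.getLogger("governance")

# An invented "approved operating envelope": the limits a certifier signed off.
ENVELOPE = {"max_confidence_drop": 0.10, "allowed_classes": {"triage", "info"}}


@dataclass
class Observation:
    """One runtime decision, as the governance layer sees it."""
    timestamp: float
    predicted_class: str
    confidence: float
    baseline_confidence: float


def check(obs: Observation) -> list[str]:
    """Compare one runtime observation against the approved envelope."""
    violations = []
    if obs.predicted_class not in ENVELOPE["allowed_classes"]:
        violations.append(f"class '{obs.predicted_class}' outside approved scope")
    if obs.baseline_confidence - obs.confidence > ENVELOPE["max_confidence_drop"]:
        violations.append("confidence drift beyond approved threshold")
    return violations


def monitor(obs: Observation) -> None:
    """Write an audit-ready record for every decision; alert on violations."""
    violations = check(obs)
    audit_log.info(json.dumps({**asdict(obs), "violations": violations}))
    if violations:
        audit_log.warning(json.dumps({"alert": violations}))


if __name__ == "__main__":
    # Confidence has dropped 0.17 against baseline, so the drift alert fires.
    monitor(Observation(time.time(), "triage", 0.71, 0.88))
```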

For certification bodies, this completely reimagines the business model. Instead of periodic site visits and document reviews, the CB connects to the system’s governance layer and monitors compliance continuously. The role shifts from inspector to oversight architect — designing the rules the chip enforces, validating the monitoring logic, and intervening when the automated system flags something it can’t resolve. Think of it as the difference between a bank teller counting cash by hand and a fraud detection system running 24/7. Both serve the same purpose. One scales.

For software providers, the opportunity is enormous. The companies that build these governance layers — the compliance SDKs, the monitoring APIs, the standards-as-code libraries — will become the plumbing of AI trust. It’s the same pattern we’ve seen with cloud infrastructure: nobody builds their own servers anymore. Soon, nobody will build their own compliance engines either.

7. The Real Question: Who Builds the Trust Infrastructure?

Zooming out, the summit left me with a question that nobody quite asked directly but that hung over every session: who will ultimately own the AI trust infrastructure?

Option A: the traditional quality infrastructure ecosystem evolves fast enough. Standards bodies go digital, CBs build technical depth, accreditation bodies adapt their frameworks, and the whole system modernises from within. This is the path everyone at the summit is working towards, and there’s real momentum — AIQI, NPL’s Measurement Centre, Singapore’s Assurance Sandbox, NIST’s ARIA programme.

Option B: big tech builds it first. Anthropic, Google, or Microsoft launches an AI assurance platform that’s faster, cheaper, and more technically sophisticated than anything a traditional CB can offer. They use their own models to evaluate other models. They embed governance tooling directly into their cloud platforms (Azure, GCP, AWS). Compliance becomes a feature toggle, not an annual engagement. The traditional quality infrastructure becomes a niche service for organisations that can’t or won’t use the tech-native alternative.

Option C: it’s a hybrid. The standards and accreditation framework provides the legitimacy, the legal recognition, and the regulatory alignment. The technology companies provide the tooling, the automation, and the scale. CBs become integrators, combining accredited authority with AI-powered assessment capability. The auditor of 2030 doesn’t replace the clipboard with a laptop. They replace it with a real-time monitoring dashboard backed by an LLM that has internalised every relevant standard.

My bet is on Option C, but only if the traditional ecosystem moves fast. The closing session gave standards bodies five years. I’d give certification bodies three. The technology to automate 80% of a compliance assessment exists right now. The question is whether CBs will harness it, or whether they’ll be standing in the car dealership wondering why the customers stopped showing up.