The Baseline Problem: The Instrument You Can’t Audit, or When the Infrastructure Becomes the Risk

The previous posts in this series have argued that GFMs are unsuitable as baseline-setters for climate and environmental policy, and that a narrower role as independent spatial auditors operating against a legally fixed reference is both more defensible and more genuinely useful. That argument rests on four conditions: a fixed baseline, a spatially explicit question, independent governance of the model, and, as post three in this series concluded, the model sitting beyond the reach of the reporting party.

There is a fifth condition, and it may be the hardest of all to satisfy: the instrument itself must be stable, versioned, and auditable over the compliance period it is used to assess. Right now, for GFMs deployed as a service by commercial platforms, none of those things are guaranteed. And without them, the independent spatial witness isn’t independent. It isn’t even consistent.

The evidence is only as good as the chain. Image courtesy of ChatGPT

The Overhead Problem and What It Produces

In theory, any organisation could build its own GFM. In practice, the overhead is prohibitive for all but a handful of well-capitalised actors. Training at meaningful global scale requires petabytes of curated EO data, substantial compute, and a rare combination of expertise in deep learning and geospatial science. The reality is a small number of dominant platforms providing GFM embeddings as a service, with the rest of the market consuming outputs they did not produce and cannot fully inspect.

That concentration is not inherently disqualifying. Public infrastructure is often best delivered at scale by a small number of capable providers, subject to appropriate oversight. The problem is that the oversight architecture does not yet exist, and the commercial incentives of platform operators run in the opposite direction from the stability and transparency that accountability applications require.

A platform improving its GFM, whether through better training data, a refined architecture, or recalibrated spectral normalisation, will produce different embeddings for identical input imagery. From the platform’s perspective, that is simply progress. From the perspective of a compliance monitoring workflow that consumed version 1.0 embeddings in 2026 and version 2.3 embeddings in 2029, it is an unannounced methodological change introduced silently at the infrastructure layer, below the level where most users are looking.

Silent Updates and the Audit Trail That Isn’t

The consequences of silent model updates in accountability contexts are not theoretical. Consider a member state submitting a LULUCF flexibility claim in 2031, supported by GFM-derived spatial characterisation of affected and comparison areas. The Commission, or a party with standing to challenge, wishes to verify that the characterisation was consistent with what the model would have produced at the time of the original monitoring, say 2027. If the platform has updated its model in the interim, that verification may be impossible. The original embeddings may not be reproducible. The platform may not be able to confirm which version was running at which date. The changelog, if one exists at all, is unlikely to characterise what changed and in which direction for the specific land cover types and spectral conditions relevant to the claim.

Unlike a software update whose changelog describes the functional changes, a GFM update is practically opaque: the differences are often uncharacterisable without systematic parallel testing, that is, running both versions on the same imagery across representative conditions and comparing the outputs. Most users will not do that. Most platforms do not offer the tooling to do it. And the results, if obtained, would rarely be interpretable by the legal or regulatory professionals who need to act on them.
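
As a rough illustration of what that testing involves, the sketch below runs two versions of a pipeline over the same scenes and measures how often they agree. The load_pipeline() loader and the scene-to-land-cover interface are hypothetical stand-ins, since real platforms expose different interfaces, where they expose any at all.

```python
# Minimal sketch of parallel version testing. load_pipeline() is a
# hypothetical stand-in for whatever inference interface a platform
# or local runtime actually provides.
import numpy as np

def load_pipeline(version: str):
    """Hypothetical: return a callable mapping a scene to a land cover map."""
    raise NotImplementedError("depends on the platform or runtime in use")

def parallel_test(scenes, v_old: str, v_new: str) -> dict:
    """Run two model versions over the same scenes and summarise divergence."""
    pipe_old, pipe_new = load_pipeline(v_old), load_pipeline(v_new)
    agreement = []
    for scene in scenes:
        old_map = np.asarray(pipe_old(scene))
        new_map = np.asarray(pipe_new(scene))
        # Fraction of pixels given the same land cover class by both versions.
        agreement.append(float((old_map == new_map).mean()))
    rates = np.array(agreement)
    # Report the tail, not just the mean: a high average can hide systematic
    # divergence on exactly the land cover types a given claim turns on.
    return {"mean": rates.mean(), "p05": np.percentile(rates, 5), "min": rates.min()}
```

Even a sketch this simple makes the cost visible: it presupposes simultaneous access to both versions, representative imagery, and someone to interpret the tail of the distribution, none of which the typical consumer of platform embeddings has.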

The frozen-version response, locking the model version used for a given compliance period, is the obvious answer, and it creates its own problem. A GFM frozen at its 2026 state and applied to 2030 imagery is characterising the world through increasingly stale priors. Sensor calibration drifts. Atmospheric correction approaches improve. Land cover classes that were rare in the training data become common as conditions change. A frozen model is not a stable reference. It is a degrading one, and the degradation is invisible unless you are specifically looking for it.
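
Looking for it need not be elaborate. The sketch below, under the assumption that you retain embeddings the frozen model produced on imagery from a reference year, flags how far its view of current inputs has drifted; the statistic (per-dimension mean shift in units of the reference standard deviation) is one simple choice among many, not a prescribed method.

```python
# Minimal drift check for a frozen model: compare the distribution of its
# embeddings on current imagery with embeddings it produced on imagery
# from a reference period. Both arrays come from the SAME frozen version.
import numpy as np

def embedding_drift(reference: np.ndarray, current: np.ndarray) -> float:
    """reference, current: arrays of shape (n_samples, embedding_dim)."""
    ref_mean = reference.mean(axis=0)
    ref_std = reference.std(axis=0) + 1e-9  # guard against zero variance
    shift = np.abs(current.mean(axis=0) - ref_mean) / ref_std
    # A large maximum shift means the frozen model is seeing inputs unlike
    # those of the reference period: stale priors, sensor drift, or both.
    return float(shift.max())
```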

This Is Not a New Problem

Other high-stakes industries have encountered version instability in analytical tools and developed governance responses. The parallels are instructive precisely because they reveal how far the EO sector still has to travel.

In financial services, model risk management frameworks, such as those of the UK’s PRA and FCA or the EU’s EBA, require that models used for regulatory capital calculation or risk reporting are subject to independent validation, version control, and documented change management. A bank cannot silently update a capital model and apply it retrospectively to its positions. Model changes require approval, parallel running, and documented evidence of equivalence or impact assessment. The model’s lineage is part of the audit record. None of this is currently required of GFM platforms serving environmental accountability functions, despite the structural similarity of the problem.

The gap is not technical. Cryptographic versioning of model weights, reproducible inference environments, and structured changelogs are all achievable. The gap is regulatory will and commercial incentive, and at present both point in the wrong direction.
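
The first of those is a few lines of standard-library code. The sketch below fingerprints a weights file with SHA-256; the surrounding workflow is an illustrative assumption, but the technique itself is commodity cryptography.

```python
# Minimal sketch of cryptographic weight versioning: a SHA-256 digest
# unambiguously identifies a specific weights file, byte for byte.
import hashlib

def fingerprint_weights(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so multi-gigabyte weight files fit in memory.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Recording this digest alongside every inference output lets a later
# auditor confirm the claimed version is byte-identical to the archived one.
```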

The Commoditisation Trap

The concentration of GFM capability in a small number of commercial platforms creates a dependency that is difficult to reverse once established. Regulators, monitoring bodies, and compliance frameworks that build workflows around platform-provided embeddings are, whether they recognise it or not, delegating a public governance function to private infrastructure operating under commercial terms.

Those terms typically include the right to update, modify, or discontinue services; no commitment to version stability beyond a short operational window; no obligation to support retrospective reproducibility; and intellectual property provisions that may prevent independent audit of the model itself. The platforms are not acting in bad faith; these are standard commercial software terms. The problem is that standard commercial software terms are structurally incompatible with the requirements of legal accountability and regulatory compliance over multi-year periods.

The insurance sector is already encountering this. Parametric products written on the basis of one model’s characterisation of flood risk or wildfire exposure face disputes when large claims arrive and the model has since been updated. The methodology that priced the risk is not the methodology that characterises the event. Litigation will eventually clarify some of this, but litigation is a slow, expensive, and retrospective governance mechanism. It does not prevent the problem; it prices it after the fact.

What Needs to Happen, and Why It Probably Won’t

The requirements are not complex to state. GFMs used in regulatory, compliance, or legal accountability contexts should be subject to:

•         Cryptographic versioning of model weights, so that a specific version can be unambiguously identified and independently verified

•         A public or regulator-accessible registry of versions, with structured documentation of material changes between them (a minimal entry is sketched after this list)

•         Commitments to maintain deprecated versions, or validated equivalence testing, for the duration of relevant compliance periods

•         Contractual obligations on platform operators equivalent to those applied in financial model risk management and in other regulated industries

•         Independent technical audit rights, exercisable by regulatory bodies or parties with legal standing, without requiring platform cooperation beyond what is contractually mandated
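
To make the registry requirement concrete, one entry need not be elaborate. The sketch below uses a plain Python dataclass; the field names are illustrative assumptions, not a proposed standard.

```python
# Minimal sketch of one entry in a model version registry. Field names
# are illustrative, not a proposed standard.
from dataclasses import dataclass, field

@dataclass
class ModelVersionRecord:
    version: str                  # e.g. "2.3"
    weights_sha256: str           # cryptographic fingerprint of the weights
    released: str                 # ISO 8601 date the version entered service
    deprecated: str | None        # date withdrawn from service, if ever
    material_changes: list[str] = field(default_factory=list)
    equivalence_report: str | None = None  # link to parallel-test evidence

example = ModelVersionRecord(
    version="2.3",
    weights_sha256="<digest of the archived weights>",
    released="2029-03-01",
    deprecated=None,
    material_changes=[
        "retrained on an expanded imagery archive",
        "recalibrated spectral normalisation",
    ],
)
```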


The EO industry should be advocating for these requirements, not waiting for regulators to impose them. The credibility of EO-derived evidence in legal and regulatory contexts depends on the integrity of the tool chain, and right now that integrity is largely assumed rather than demonstrated. When the first significant legal challenge turns on model version instability, and it will, the reputational damage will extend well beyond the platform involved.

The honest assessment is that none of this is likely to happen quickly. Platform operators have no current commercial incentive to take on the costs of validation infrastructure. Regulators in the environmental space are still developing the technical literacy to understand what they need to specify. And the procurement frameworks through which public bodies commission EO services are not yet equipped to evaluate version stability commitments, let alone enforce them.

If the digital revolution has taught us anything, it is that powerful new technologies get deployed faster than the governance infrastructure can follow, with the credibility costs landing later and unevenly. The difference this time is that the stakes are existential. GFM outputs embedded in climate compliance architecture will be used to determine whether sovereign obligations have been met, whether carbon credits are legitimate, and whether the polluter pays or goes unpunished.

The instrument you can’t audit is not a neutral tool. In the wrong hands, or simply in careless ones, it is a liability that the governance architecture we have built so carefully around the baseline will not be able to contain.


The independent spatial witness only witnesses if it can be cross-examined. We need to build the conditions for that before we need them in court.

Next

The Baseline Problem: Why Climate Policy Can’t Run on Short Memory