The Baseline Problem: Why Climate Policy Can’t Run on Short Memory

or When Foundation Models Meet Carbon Accounting

In my previous piece, I argued that Earth observation needs honest conversations about AI capabilities and constraints. Let me now illustrate why this matters with a specific, high-stakes example: carbon accounting under climate policy frameworks such as REDD+ and LULUCF.

Image: a sunken brick building and survey marker; a measurement system that appears authoritative but is anchored to a compromised starting point (thx ChatGPT)

AlphaEarth Foundations provides annual embeddings from 2017 onwards. There are, of course, other models and embeddings. All technically impressive. Broadly global in coverage, though the tropics and high Arctic fare less well than the temperate zones in between. Freely available, for now and in their current form. And fundamentally incompatible with some of the most economically significant applications in Earth observation.

Here’s why.

The Temporal Baseline Imperative

REDD+ (Reducing Emissions from Deforestation and forest Degradation) and LULUCF (Land Use, Land-Use Change and Forestry) accounting aren’t academic exercises: they underpin billions in carbon finance, sovereign climate commitments, and corporate net-zero strategies. The entire framework depends on establishing credible baselines against which to measure change.

Those baselines typically reference 1990 (under the Kyoto Protocol), the 2000–2005 period, or jurisdiction-specific historical reference levels. You measure deforestation rates, forest degradation trends, and land use transitions over decades, not years. The temporal depth isn’t a nice-to-have; it’s the foundation of the accounting system.

Now consider what happens when someone pitches “AI-powered REDD+ monitoring using state-of-the-art foundation models.” The embeddings start in 2017. Your baseline requirements stretch back 12–27 years before that. The maths doesn’t work.
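
To make the arithmetic concrete, here is a minimal sketch of the coverage check. The reference years are hypothetical illustrations, not a statement of any particular scheme’s rules:

```python
# A minimal sketch of the temporal coverage check.
# Reference years below are illustrative, not actual scheme rules.

EMBEDDING_START = 2017  # e.g. AlphaEarth Foundations annual embeddings

BASELINE_REQUIREMENTS = {  # hypothetical framework -> reference year
    "Kyoto-era LULUCF": 1990,
    "REDD+ reference period (start)": 2000,
    "REDD+ reference period (end)": 2005,
}

for framework, baseline_year in BASELINE_REQUIREMENTS.items():
    gap = EMBEDDING_START - baseline_year
    status = "covered" if gap <= 0 else f"{gap}-year gap before first embedding"
    print(f"{framework}: baseline {baseline_year} -> {status}")
```

Run it and every framework reports a gap of 12 to 27 years: the period the embeddings simply cannot see.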

The Temptation to Fudge

The predictable response goes something like: “We’ll use embeddings for ongoing monitoring and traditional remote sensing for historical baselines.” Sounds reasonable. Creates a verification nightmare. And what is the legal remedy if the baseline is wrong?

You’re now running two parallel systems with different methodologies, different data sources, different error characteristics. When the embedding-based 2024 forest loss estimate differs from what the historical trend would predict, which do you trust? How do you audit the join? What happens when the embedding model gets updated (as AlphaEarth inevitably will) and subtly changes how it characterises forest degradation?
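
One way to see the problem: even the crudest statistical audit of the splice point requires a trend model that both methodologies would have to accept. A sketch, with every number invented and a deliberately naive tolerance, might look like this:

```python
import numpy as np

# Historical annual forest loss (kha) from traditional remote sensing.
# All values are invented for illustration.
hist_years = np.arange(2001, 2017)
rng = np.random.default_rng(0)
hist_loss = 40 + 0.8 * (hist_years - 2001) + rng.normal(0, 2, hist_years.size)

# Fit a linear trend to the historical series; keep the residual spread.
slope, intercept = np.polyfit(hist_years, hist_loss, 1)
resid_sd = np.std(hist_loss - (slope * hist_years + intercept))

def audit_join(year, embedding_estimate, k=2.0):
    """Flag embedding-era estimates that fall outside the historical trend
    envelope (+/- k residual standard deviations, a crude tolerance)."""
    predicted = slope * year + intercept
    deviation = embedding_estimate - predicted
    return predicted, deviation, abs(deviation) > k * resid_sd

pred, dev, flagged = audit_join(2024, embedding_estimate=61.5)
print(f"trend predicts {pred:.1f} kha, embeddings say 61.5 kha, "
      f"deviation {dev:+.1f} kha -> {'REVIEW' if flagged else 'consistent'}")
```

Every choice in that sketch, the trend form, the tolerance k, even which series counts as ground truth, is itself contestable. That is the audit problem in miniature.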

The alternative is worse: “We’ll just reset baselines to 2017 and work forward from there.” This isn’t technical adaptation; it’s policy vandalism. You’re proposing to ignore decades of deforestation, conveniently starting the clock at a point that might make current land use look acceptable by comparison. Try explaining that to indigenous communities whose forests were cleared in 2010, or negotiators who spent years agreeing historical reference levels.

Why This Isn’t Just REDD+

The temporal baseline problem extends across any application requiring historical context:

• Biodiversity net gain assessments need pre-development baselines

• Agricultural subsidy compliance requires multi-year land use verification

• Climate attribution studies need to separate anthropogenic from natural trends

• Insurance risk models require understanding decadal patterns, not just recent snapshots

• Planning enforcement needs to demonstrate when unauthorised change occurred

In each case, a 7–8 year time series (however sophisticated) cannot substitute for the 20–30+ year perspectives these applications require. The foundation model’s technical excellence doesn’t overcome the fundamental constraint: it didn’t exist in 2005, let alone 1990.

The Audit Trail Problem

Even where you can construct hybrid approaches, you face the audit and verification challenge. Carbon credits trade as financial instruments. REDD+ projects require independent verification to internationally agreed standards. When your monitoring system comprises “traditional remote sensing for historical baseline, foundation model embeddings for recent change, and expert interpretation of the bit in the middle,” you’ve created something that’s technically defensible but practically unauditable at scale.

Verification bodies need consistent methodologies. Buyers of carbon credits need confidence that the tonne of CO₂ they’re purchasing has the same evidentiary basis whether it was sequestered in 2008 or 2024. Introducing a methodological discontinuity right in the middle of your time series doesn’t inspire confidence; it raises questions.

What This Means for the Industry

This isn’t an argument against foundation models. It’s an argument for clarity about their applicability envelope. Embeddings are already transformative, and will continue to be, for:

• Current state assessment and near-term monitoring

• Rapid prototyping of classification approaches

• Similarity-based searches for analogous conditions globally

• Applications where 2017-onwards is sufficient temporal depth

They are far less suitable, or simply unsuitable, for:

• Applications requiring pre-2017 baselines (without complementary data)

• Contexts where methodology consistency across decades is critical

• Frameworks with legally mandated historical reference periods

• Any use case where “reset the baseline” isn’t acceptable

The problem isn’t that embeddings can’t solve every problem. The problem is when we pitch them as if they can, land contracts based on that premise, and then discover the temporal constraint mid-implementation.

What This Means Under the Law and for Regulators

The baseline is a legal construct, not a natural fact. Forest reference levels under LULUCF took years to negotiate and remain politically contested. The baseline has always been political. The question is whether wrapping it in AI makes it appear less so. If the baseline is compromised, the accounting is compromised, and any compliance claims built on it are legally vulnerable.

The problem isn’t that GFMs are wrong, but that they answer a different question from the one LULUCF asks. They characterise current land surface state with impressive spatial resolution and scalability. But LULUCF requires a trajectory: what was happening, what changed, against what counterfactual. A model trained on 2016–2018 or 2017–2024 data cannot establish that trajectory. It can only confirm where you are, not where you came from or what you committed to.

Technology can launder political choices into apparently objective ones. A GFM-derived baseline would arrive with the authority of machine learning, satellite data, and scientific process behind it, making it extremely difficult to challenge through normal legal mechanisms because of what lawyers in other contexts call the “black box problem.” When baseline-setting methodology becomes technically complex and opaque, it becomes harder, not easier, to contest.

This is explicitly a governance risk. A GFM-derived baseline adopted into policy would inadvertently amnesty pre-baseline degradation, reward historically poor performers, and create a reference point that looks technically authoritative while being legally close to unassailable once embedded.

Baseline Manipulation Has Form: The REDD+ Lesson

Under REDD+, countries faced a perverse incentive to exaggerate baseline emissions from deforestation: select a high-deforestation period as your reference, and subsequent reductions look impressive against the inflated baseline, increasing revenues from ecosystem services payments. The manipulation happened within the technical framework, not in spite of it. Controversy over the reasonableness of such baseline scenarios threw the nascent market mechanism into some disarray.
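
The incentive is easy to quantify. A toy example, with invented emissions figures for a hypothetical jurisdiction, shows how much the choice of reference window alone is worth:

```python
# Toy quantification of reference-period selection. Annual deforestation
# emissions (MtCO2) are invented for a hypothetical jurisdiction.
emissions = {
    2000: 30, 2001: 32,
    2002: 55, 2003: 58, 2004: 60,  # a high-deforestation spike
    2005: 35, 2006: 33, 2007: 31, 2008: 30, 2009: 29,
}
current = 28  # emissions in the crediting year

def credited_reduction(reference_years):
    """Average the reference period to form the baseline; credit the drop."""
    baseline = sum(emissions[y] for y in reference_years) / len(reference_years)
    return baseline - current

print(credited_reduction(range(2000, 2010)))  # whole decade: ~11.3 MtCO2
print(credited_reduction(range(2002, 2005)))  # spike years only: ~29.7 MtCO2
```

Same jurisdiction, same satellite record, nearly triple the credited reduction, achieved entirely through the choice of reference window.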

A GFM-derived LULUCF baseline would replicate that dynamic at a larger scale and with even greater apparent scientific authority, because the model’s internal workings are far less transparent than a negotiated reference period that at least has documented assumptions.

The 2023 LULUCF amendment (Regulation 2023/839, https://eur-lex.europa.eu/eli/reg/2023/839/oj) tightened several provisions but left significant discretionary space: forest reference levels remain member state-constructed, the natural disturbance flexibility is genuinely exploitable, and cross-regulation flexibility with the Effort Sharing Regulation (https://eur-lex.europa.eu/eli/reg/2023/857/oj) creates accounting arbitrage opportunities. The recent Implementing Regulation 2025/2043 (https://eur-lex.europa.eu/eli/reg/2025/2043/oj) provides for documented evidentiary reporting on climate impacts and organic soil legacy effects, but as with REDD+, the more technical the evidentiary framework, the more vectors it creates for sophisticated actors to construct convenient comparisons.

The regulation creates binding targets but relies on member state inventory submissions as the primary evidence base, with Commission review being technical and iterative rather than adversarial. If a member state adopted an AI-derived baseline methodology that embedded historical degradation as the new normal, the Commission would need to challenge the science to challenge the compliance claim, a formidable obstacle.

The amnesty point links directly to the polluter pays principle. A baseline anchored to a recently-degraded landscape effectively inverts it: the historical polluter is absolved, and future obligations are calibrated to an already-compromised starting point.

So: if a member state adopted a GFM-derived land use baseline that systematically understated historical degradation, what course of action exists, before which forum, and who has standing to bring it? That question remains largely unanswered.

The Path Forward

Honest scoping of foundation model capabilities requires asking temporal baseline questions upfront:

• What historical reference period does this application require?

• Can we construct defensible hybrid approaches if needed?

• Will clients accept methodology changes mid-time-series?

• What are the audit and verification requirements?

• Are there legal or regulatory constraints on baseline periods?

For REDD+, LULUCF and similar frameworks, the answer might be: “Foundation models provide powerful tools for current monitoring and future projection, but historical baseline establishment still requires traditional remote sensing approaches. We can build hybrid systems, but they add complexity rather than reducing it.”

That’s not as compelling a pitch as “AI solves carbon monitoring.” But it’s honest. And in an industry where overselling capabilities has repeatedly damaged credibility, honesty is the competitive advantage we most need to cultivate.

The foundation models are here. They’re powerful. They’re transformative for specific applications. But they’re not time machines. And they are not policy machines or arbitrators. We need to stop pitching them as if they are.

The sophistication of the tool should not be mistaken for the validity of the question it’s being asked to answer, and in climate policy the stakes of getting that wrong are existential.

But the story doesn’t end there. The same temporal constraint that makes GFMs unsuitable as baseline-setters may make them valuable as something quite different: independent auditors of compliance claims, operating against a legally fixed reference. That argument, and a specific regulatory architecture that makes it possible, is the subject of the next post.
