A brand new model
With the introduction of our new Foundation Model (FM) Albatross, it is time to look at how this model compares to the current leading Geospatial Foundation Models. But how do we do this? And how do we even test how good an FM is?
In this article, we compare our latest foundation model with foundation models from TESSERA (University of Cambridge) and AlphaEarth (Google DeepMind), as well as with our previous FM. We start with a brief explanation of the different techniques behind the models and then dive deeper into a systematic comparison of their performance.
Spheer FM Albatross
Our new FM, like the previous one, uses imagery from the Sentinel-2 satellites of the Copernicus programme: two ESA satellites that together capture the same location on the Earth's surface roughly every 3 to 5 days. The pixels, which determine the resolution of the images, measure 10 by 10 metres on the ground. These images are multispectral, meaning they capture more colours than just red, green, and blue. For Spheer FM, we use ten of the spectral bands from Sentinel-2, including several infrared bands.
Our model processes these satellite images pixel by pixel, using a full year of measurements at a time. This means that no complete images are fed into our model; instead, each 10 by 10 metre patch of ground is processed individually. For each patch, what is essentially a 1 by 1 pixel video containing all images from the entire year is converted by the model into an embedding.
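To make the "1 by 1 pixel video" idea concrete, here is a minimal sketch of how a single pixel's time series could be pulled out of a stack of images. The array layout, the helper names `make_toy_stack` and `pixel_time_series`, and the toy values are assumptions for illustration; the embedding model itself is not shown.

```python
from typing import List

# Toy image stack indexed as stack[time][band][row][col].
# Real inputs would hold Sentinel-2 measurements; these values are made up.
Stack = List[List[List[List[float]]]]


def make_toy_stack(n_times: int, n_bands: int, height: int, width: int) -> Stack:
    """Build a stack whose values encode their own (time, band, row, col) index."""
    return [
        [
            [[float(1000 * t + 100 * b + 10 * r + c) for c in range(width)]
             for r in range(height)]
            for b in range(n_bands)
        ]
        for t in range(n_times)
    ]


def pixel_time_series(stack: Stack, row: int, col: int) -> List[List[float]]:
    """Return one band vector per acquisition for a single pixel."""
    return [[band[row][col] for band in image] for image in stack]


stack = make_toy_stack(n_times=3, n_bands=2, height=2, width=2)
series = pixel_time_series(stack, row=1, col=0)
print(series)  # [[10.0, 110.0], [1010.0, 1110.0], [2010.0, 2110.0]]
```

Each inner list is one acquisition of that pixel across all bands; a pixel-level model would turn the whole sequence into a single embedding.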
TESSERA
Like us, TESSERA uses Sentinel-2 data, but they also add data from the Sentinel-1 satellites. These satellites use synthetic aperture radar to collect data that provides information about the texture or structure of the measured surface. From this, properties such as surface roughness, soil moisture, and vegetation cover can be derived.
Whereas we always use the full year of data, TESSERA always takes 40 data points, even when more are available. Like us, TESSERA processes data at the pixel level.
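The difference between using every acquisition and using a fixed number of them can be sketched as follows. This is an illustration of fixed-length subsampling in general, not TESSERA's actual procedure: the function name `sample_fixed_length`, the random selection, and the padding behaviour when fewer than 40 observations exist are all assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_fixed_length(observations: Sequence[T], n_samples: int = 40,
                        seed: int = 0) -> List[T]:
    """Pick exactly n_samples observations, keeping them in time order."""
    rng = random.Random(seed)
    n = len(observations)
    if n == 0:
        raise ValueError("at least one observation is required")
    if n >= n_samples:
        # Enough data: choose distinct time steps and drop the rest.
        indices = rng.sample(range(n), n_samples)
    else:
        # Assumption: with too little data, repeat time steps to reach n_samples.
        indices = rng.choices(range(n), k=n_samples)
    return [observations[i] for i in sorted(indices)]


full_year = list(range(70))  # 70 hypothetical acquisitions in one year
subset = sample_fixed_length(full_year)
print(len(full_year), len(subset))  # 70 40
```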
AlphaEarth
While Spheer FM and TESSERA work at the pixel level, Google's AlphaEarth processes entire satellite images. In addition to Sentinel-1 and Sentinel-2, they also use Landsat-8 and -9 imagery. For all of these satellites, a full year of data is processed in patches of 1.28 by 1.28 kilometres. They also incorporate a large amount of additional data, such as elevation, precipitation, air pressure, and textual descriptions, to support the training process. AlphaEarth converts this information into embeddings at a 10 by 10 metre resolution.
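As a quick check on the numbers above: at the 10 metre embedding resolution, a 1.28 kilometre patch corresponds to 128 by 128 embedding pixels. The sketch below works this out and shows one simple way to cut a larger scene into such patches. The non-overlapping tiling and the helper `tile_origins` are assumptions for illustration, not a description of how AlphaEarth tiles its input.

```python
from typing import Iterator, Tuple

PATCH_SIZE_M = 1280  # 1.28 km patch edge, from the text
PIXEL_SIZE_M = 10    # 10 m embedding resolution, from the text

pixels_per_side = PATCH_SIZE_M // PIXEL_SIZE_M
embeddings_per_patch = pixels_per_side ** 2
print(pixels_per_side, embeddings_per_patch)  # 128 16384


def tile_origins(height_px: int, width_px: int,
                 tile_px: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (row, col) of each full, non-overlapping tile."""
    for row in range(0, height_px - tile_px + 1, tile_px):
        for col in range(0, width_px - tile_px + 1, tile_px):
            yield row, col


# A hypothetical 256 by 384 pixel scene splits into 2 by 3 patches.
origins = list(tile_origins(256, 384, pixels_per_side))
print(len(origins))  # 6
```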
Because Google processes entire satellite images, more spatial context is captured in the embeddings, but this also results in a less detailed representation of the natural environment. This is clearly visible in the image below.

Validation
To determine how well an FM performs, we evaluate results both qualitatively and quantitatively. In addition to the external models, we also include the previous Spheer FM as an internal baseline. This allows us to assess how Spheer FM Albatross compares to external models while also measuring how much improvement it offers over our previous FM.
For the qualitative evaluation, we assess how embeddings and prediction maps look, without knowing in advance which FM produced them. For the quantitative validation, we test the different FMs on several benchmarks consisting of two parts:
- The first part concerns specific properties of the embeddings. For example, we test the stability of embeddings across years by training a downstream model on embeddings from one year and evaluating its performance on embeddings from other years. We also validate the model on challenging cases, such as at the edges of vegetation, forests, or lakes.
- The second part consists of use cases that are representative of how users actually work with our foundation model in Spheer. This data is partly compiled by ourselves and partly consists of labelled data collected by ecologists.
Together, these two parts provide a reasonably complete quantitative picture of model performance.
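The cross-year stability test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our benchmark code: the nearest-centroid classifier, the macro-averaged F1, the two-dimensional toy embeddings, and the class names are all made up for the example.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence


def fit_centroids(embeddings: Sequence[Sequence[float]],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the embeddings of each class into one centroid per class."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(values) / len(vectors) for values in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids: Dict[str, List[float]],
            embeddings: Sequence[Sequence[float]]) -> List[str]:
    """Assign each embedding to the class with the nearest centroid."""
    return [min(centroids, key=lambda label: dist(centroids[label], vector))
            for vector in embeddings]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy embeddings for the same four locations in two different years.
labels = ["forest", "forest", "water", "water"]
year_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
year_b = [[0.85, 0.2], [0.7, 0.25], [0.15, 0.85], [0.3, 0.75]]

centroids = fit_centroids(year_a, labels)      # train on one year
predictions = predict(centroids, year_b)       # evaluate on another year
score = macro_f1(labels, predictions)
print(score)  # 1.0
```

If embeddings drift between years, the centroids learned on one year no longer match the other year and the F1 score drops, which is the effect this benchmark is meant to expose.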
The table below shows the F1 scores for the different benchmarks per FM, with the highest score per benchmark shown in bold.
| Benchmark | Spheer FM Albatross | Spheer FM 2025 | TESSERA | AlphaEarth |
|---|---|---|---|---|
| Stability across years | **0.933** | 0.826 | 0.849 | 0.748 |
| Over long spatial distance | **0.945** | 0.872 | 0.907 | 0.876 |
| Vegetation edges | **0.934** | 0.926 | 0.878 | 0.905 |
| Use case dunes – 1 year | 0.966 | **0.967** | 0.933 | 0.874 |
| Use case dunes – all years | 0.959 | **0.963** | 0.918 | 0.887 |
| Use case salt marshes | **0.929** | 0.916 | 0.911 | 0.899 |
From the table above, we can see that for nature monitoring in the Netherlands, Spheer FM Albatross is the strongest model overall: it scores highest on four of the six benchmarks and stays within 0.004 of the best score on the other two. This holds both for the more theoretical embedding properties and for the concrete use cases employed by our customers.
Spheer FM Albatross is more stable both across years and across distance. Labels collected in one year or region therefore carry over better to other years and regions, so the model performs better when limited labelled data is available. Its higher score on vegetation edges shows up as maps with sharper boundaries between vegetation types.
Our model also performs well in the use cases that rely on data collected (in part) through field work. On the two dunes benchmarks it scores on par with the previous Spheer FM, on salt marshes it scores highest, and in all three it outperforms both TESSERA and AlphaEarth. This indicates that Spheer FM is also the better choice for real-world use cases in the Spheer App.
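The comparisons in this section can be checked directly against the table. The sketch below copies the F1 scores from the table and computes, per benchmark, which model scores highest and how far Spheer FM Albatross is from the best other model. The variable and function names are chosen for this example only.

```python
from collections import Counter
from typing import Dict

ALBATROSS = "Spheer FM Albatross"

# F1 scores copied from the table above.
scores: Dict[str, Dict[str, float]] = {
    "Stability across years": {ALBATROSS: 0.933, "Spheer FM 2025": 0.826, "TESSERA": 0.849, "AlphaEarth": 0.748},
    "Over long spatial distance": {ALBATROSS: 0.945, "Spheer FM 2025": 0.872, "TESSERA": 0.907, "AlphaEarth": 0.876},
    "Vegetation edges": {ALBATROSS: 0.934, "Spheer FM 2025": 0.926, "TESSERA": 0.878, "AlphaEarth": 0.905},
    "Use case dunes, 1 year": {ALBATROSS: 0.966, "Spheer FM 2025": 0.967, "TESSERA": 0.933, "AlphaEarth": 0.874},
    "Use case dunes, all years": {ALBATROSS: 0.959, "Spheer FM 2025": 0.963, "TESSERA": 0.918, "AlphaEarth": 0.887},
    "Use case salt marshes": {ALBATROSS: 0.929, "Spheer FM 2025": 0.916, "TESSERA": 0.911, "AlphaEarth": 0.899},
}


def best_model(row: Dict[str, float]) -> str:
    """Name of the model with the highest score on one benchmark."""
    return max(row, key=row.get)


def albatross_margin(row: Dict[str, float]) -> float:
    """Albatross score minus the best score among the other models."""
    best_other = max(value for name, value in row.items() if name != ALBATROSS)
    return round(row[ALBATROSS] - best_other, 3)


wins = Counter(best_model(row) for row in scores.values())
margins = {benchmark: albatross_margin(row) for benchmark, row in scores.items()}

print(wins[ALBATROSS], wins["Spheer FM 2025"])  # 4 2
for benchmark, margin in margins.items():
    print(f"{benchmark}: {margin:+.3f}")
```

A positive margin means Albatross leads on that benchmark; the two small negative margins are the dunes benchmarks, where the previous Spheer FM is marginally ahead.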
In Closing
Back to the question we started with: is our new model truly better than those from Cambridge and Google? For nature monitoring in the Netherlands, the answer is an emphatic yes. And that is not just a number on a benchmark: it is a gain you can see in every map in the Spheer App, with sharper boundaries and more reliable predictions, even where labelled data is scarce.
Do you have questions about the benchmarks, the foundation models, or any other topic in Spheer? Please reach out to our support team at support@spheer.ai.

