shekelto, sshetty@google.com
Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning
Abstract
Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains—Planet-scale Imagery, Population, and Environment—and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.
1 Introduction
Understanding signals from our planet and reasoning about their effect on our livelihoods has inspired human curiosity and innovation for millennia, from the earliest origins of folklore as a guide to natural wisdom, to the first applications of computer science for weather forecasting [lynch2008origins].
Decades of siloed geospatial data from satellites [donlon2012global, drusch2012sentinel], sensors [gorelick2017google, hersbach2020era5, munoz2021era5], and demographic records [DataCommons2025, DataCommons2025b] have posed a significant challenge to cross-domain analysis. To address this, the field of Geospatial AI (GeoAI) [iyer2025harnessing] has evolved from specialized models to general-purpose Foundation Models for Earth Observation [zhu2024foundationsearthclimatefoundation]. This shift, driven by large-scale datasets [jakubik2023foundation, wang2023skyscript] and benchmarks [mai2024opportunities], has culminated in agentic systems where Large Language Models (LLMs) reason over geospatial data, tested by new frameworks [kao2025towards] and reasoning benchmarks [yerramilli2025geochain, dihan2025mapeval].
Our work builds on these foundations to propose a new paradigm for planetary analysis. We introduce “Earth AI”, an interoperable family of geospatial AI models orchestrated by a customizable geospatial reasoning agent to create a holistic, multi-modal view of the Earth. Using Foundation Models (FMs) and LLM-based reasoning, we build generalizable systems that surpass the limitations of single-purpose models and are capable of generating novel and actionable insights across a wide spectrum of planetary questions (see Figure 1 for an overview).
Our approach leverages three categories of Earth data: Imagery, Population and Environment. For each category, we developed novel foundation Earth AI Models that demonstrate state-of-the-art performance on benchmark tasks.
Imagery incorporates satellite, aerial and ground level imagery, sensor observations and related models [gorelick2017google] of the planet, including mapping of urban and agricultural landscapes [sirko2021continentalscalebuildingdetectionhigh, sirko2024highresolutionbuildingroaddetection, goroshin2023estimatingresidentialsolarpotential, dua2024agriculturallandscapeunderstandingcountryscale], classifying global land cover [brown2022dynamic], discovering temporal urban patterns [deng2025visualchroniclesusingmultimodal], producing multi-source compact embeddings [brown2025alphaearth] and the Remote Sensing Foundations vision backbones and multimodal models.
Population encompasses observation and analysis of humans and their impact on the earth, including maps and data about the built environment [weiss2020global], simulation and optimization of mobility patterns and transportation systems (e.g., traffic, migration) [aktay2020googlecovid19communitymobility, cook2025short, zhang2024traffic, choudhury2024towards, haddad2024quantitative], demographic and socioeconomic patterns [DataCommons2025], public health and vectors of disease transmission [wellenius2021impacts], and our Population Dynamics Foundations integrating human behavior and location [agarwal2024general].
Environment relates to spatiotemporal signals capturing dynamics of the Earth, including observations and models of weather, air quality, and climate [agrawal2025operationaldeeplearningsatellitebased, lam2023learning, kochkov2024neural, price2025probabilistic, google2025weather, google2025airquality], forecasts and tracking of natural disasters, such as cyclones [alet2025skillful], floods [nearing2024global], and wildfires [matias2021realtime], and maps of habitat loss and its underlying drivers [sims2023global].
To address real-world, multimodal challenges, we orchestrate these individual Earth AI Models with a Gemini-powered Geospatial Reasoning agent. By uniting natural language interactions and model connectivity across multiple domains that are usually analyzed separately, we expand the ability for non-expert users to analyze important questions without needing to download intermediate elements of answers and join or cross-reference results manually. This paradigm greatly expands the range of users to include anyone who can formulate a well-structured question and assess the resulting response. In this way, Earth AI allows for holistic, multi-faceted analysis and insight generation at a scale that was previously intractable. We demonstrate the power of this approach using a new benchmark of complex, real-world crisis response scenarios, showing how the synthesis of diverse data sources can unlock critical and time-sensitive insights.
The key findings from our evaluation of the Earth AI approach are summarized below:
- State-of-the-Art Earth AI Models: We demonstrate that our Remote Sensing Foundations achieve state-of-the-art (SOTA) performance on tasks such as open-vocabulary object detection and zero-shot cross-modal retrieval. Concurrently, our Population Dynamics Foundations has been independently validated to improve real-world retail and public health applications, and has been extended to provide temporal embeddings at a monthly granularity.
- Leading-edge Predictive Power Through Model Synergy: We provide strong evidence that the integration of models from different modalities yields superior predictive capabilities. By combining signals from our Imagery, Population and Environment models and datasets, we achieve higher predictive accuracy on real-world classification and forecasting tasks than single-modality analysis.
- Complex Problem-Solving via Agentic Reasoning: We show that the Gemini-powered reasoning agent can effectively deconstruct complex geospatial queries, select the appropriate models and tools in sequence, present transparent reasoning, and synthesize results into a coherent answer. This demonstrates a robust capability to automate and scale complex analysis and insight generation across a number of domains, for example, identifying geographic regions with elevated short-term flood risk and high social vulnerability.
2 Earth AI Capabilities
We summarize the capabilities of Earth AI in this section. First, we introduce the core foundation Earth AI Models trained on specialized geospatial datasets. Next, we demonstrate how these models can be combined to create powerful predictive applications by leveraging their synergistic strengths. Finally, we describe an approach to orchestrating all of these components to solve complex, multi-step queries using agentic Geospatial Reasoning.
2.1 Earth AI Models
Our core Earth AI models are trained on specialized geospatial datasets across three categories of Earth data to analyze distinct aspects of our planet. First, we introduce our Imagery models, trained on remote sensing datasets. Next, we describe our Population model, which captures the dynamics of human behavior in relation to geography. Finally, we detail our suite of Environment models for weather, climate, and natural crisis applications.
2.1.1 Imagery: Remote Sensing Foundations
Existing generalist geospatial capabilities were built through large-scale pre-training on multispectral satellite imagery [jakubik2023foundation] while specialized architectures like SatMAE utilize masked autoencoding for downstream tasks [cong2022satmae]. Our Remote Sensing (RS) Foundation models (Figure 2) address key challenges in Earth observation—such as limited labeled datasets and unique image distributions—to unlock new capabilities in visual understanding. These models provide a roadmap toward scalable, general-purpose RS analysis, bridging the gap between advancements in general computer vision and the specific demands of geospatial data.
This family of models features three core capabilities:
- Vision-Language Understanding: We showcase vision-language models (VLMs) that connect remote sensing imagery with natural language. These models learn to map both visual and textual inputs into a joint embedding space, a process that enables the quantification of semantic similarity between an image and a corresponding text description. This core capability enables the model to perform dynamic, zero-shot image classification and retrieval using natural language prompts, effectively handling labels or descriptions not seen during its training.
- Open-Vocabulary Object Detection: We further present an open-vocabulary object detection (OVD) model that leverages VLM-derived embeddings. This allows the model to detect previously unseen object categories in a zero-shot setting, thus supporting detailed queries over satellite and aerial imagery. A few-shot algorithm can further improve performance using just tens of annotated examples.
- General-Purpose Vision-Transformer (ViT) Backbone: We also feature a comprehensive method for pre-training a vision-transformer encoder on a combination of large-scale, unlabeled remote sensing imagery and smaller-scale labeled datasets. The resulting model was rigorously evaluated across scene classification, object detection, and semantic and instance segmentation tasks, demonstrating strong generalization capacity.
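The zero-shot classification described in the vision-language bullet can be sketched as follows. This is a minimal illustration and not the Remote Sensing Foundations interface: the three-dimensional vectors are made-up toy values standing in for the outputs of image and text encoders, and the function names `cosine_similarity` and `zero_shot_classify` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Score every text prompt against the image and return the best match.

    Because the labels are just text prompts embedded at query time,
    new labels can be added without retraining anything.
    """
    scores = {
        prompt: cosine_similarity(image_embedding, embedding)
        for prompt, embedding in text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy embeddings (assumed values, not real model outputs).
text_embeddings = {
    "an aerial image of a harbor": [0.9, 0.1, 0.0],
    "an aerial image of farmland": [0.1, 0.9, 0.1],
    "an aerial image of a runway": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

Text-based retrieval uses the same similarity scores in the other direction: a text embedding is compared against many image embeddings and the images are ranked by score.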
Our RS foundation models work in concert with AlphaEarth Foundations [brown2025alphaearth] to provide a multi-layered view of the planet. AlphaEarth Foundations summarizes optical satellite images, radar, climate simulations and more, for macro-level analysis, reducing the need for large training datasets or directly handling satellite imagery. This is made available for analysis as a 10-meter resolution annual embedding. The Remote Sensing Foundations complement the offering by providing direct access to models that operate on RGB imagery from diverse sources at fine-grained resolutions (0.1 m to 10 m), and can be fine-tuned for specific tasks. Additionally, the Remote Sensing Foundations models feature native support for natural language queries, enabling non-experts to conduct rapid analysis of imagery that captures specific objects or events.
2.1.2 Population: Population Dynamics Foundations
Our Population Dynamics Foundations model fuses diverse datasets to represent the dynamics of human behavior in a geographic context [agarwal2024general], as presented in Figure 3. It captures the built environment through maps data, the natural environment through weather and air quality, and human behavior through Search Trends and anonymized busyness data. The model relates and encodes these datasets in a graph neural network to produce a unified digital embedding for each region (e.g., administrative region, postal code) while preserving privacy.
Our original Population Dynamics Foundations focused on US data over a single year [agarwal2024general]. This work builds on that with two key expansions for enhanced analytical capabilities:
- Global Spatial Coverage: We have expanded embeddings to 17 countries. The resulting Global Population Dynamics Foundations embeddings are comparable across countries, meaning a downstream model trained on US data can be applied in the United Kingdom or Brazil.
- Dynamic Temporal Analysis: We have created embeddings that evolve over time, represented as monthly embeddings over the last two years. This extension directly addresses the challenge of unpredictable human behavior by integrating temporal behavioral shifts into the embeddings, enhancing nowcasting and forecasting applications.
2.1.3 Environment: Weather & Climate Models
Our Environment models and APIs provide state-of-the-art insights into weather, climate, air quality and natural crises, making complex geospatial information widely available. In this evaluation, we have integrated three distinct, representative environmental signals:
- Weather Forecasting: The Google Maps Platform Weather API incorporates machine learning models such as MetNet [agrawal2025operationaldeeplearningsatellitebased] and provides hourly forecasts up to 240 hours and daily forecasts up to 10 days, covering conditions such as temperature, precipitation, wind, and UV index.
- Flood Forecasting: The Google Flood Forecasting API delivers real-time riverine flood predictions using data from measurement gauges. Forecasts detail the anticipated area of inundation, severity level, and probability. Users can retrieve current predictions or access historical forecast data back to August 1st, 2025.
- Experimental Cyclone Forecasting: Google’s experimental AI-based cyclone model, based on stochastic neural networks [alet2025skillful], predicts a cyclone’s formation, track, intensity, size and shape by generating 50 possible scenarios up to 15 days in advance. Historical data is available back to January 1st, 2022.
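One common way to use a 50-scenario ensemble like the one described in the cyclone bullet is to convert it into a strike probability for a location of interest. The sketch below is illustrative only: the track format (a list of latitude and longitude points per scenario), the `strike_probability` helper, the 100 km radius and the synthetic tracks are all assumptions, not the output format of the cyclone model.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def strike_probability(tracks, lat, lon, radius_km=100.0):
    """Fraction of ensemble scenarios that pass within radius_km of (lat, lon)."""
    hits = sum(
        1 for track in tracks
        if any(haversine_km(lat, lon, t_lat, t_lon) <= radius_km
               for t_lat, t_lon in track)
    )
    return hits / len(tracks)


# Synthetic 50-member ensemble (assumed data): every scenario starts at
# 15N 60E and moves west one degree per step; member i also drifts north
# by 0.1 * i degrees per step, so higher-numbered members diverge more.
tracks = [
    [(15.0 + 0.1 * i * step, 60.0 - 1.0 * step) for step in range(6)]
    for i in range(50)
]

city_lat, city_lon = 15.0, 55.0
prob = strike_probability(tracks, city_lat, city_lon)
print(f"strike probability: {prob:.2f}")
```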
2.2 Combining Earth AI Models: Predictive Applications
While each Earth AI model offers a powerful lens into a specific domain, a holistic understanding of our planet requires us to leverage multiple domains simultaneously. Any single model, whether focused on imagery, human behavior, or climate, is inherently limited by its perspective. Earth AI is designed to overcome this limitation by allowing for the integration of these diverse viewpoints, enabling a more comprehensive analysis than would otherwise be possible. For instance, embeddings like AlphaEarth Foundations and Population Dynamics Foundations provide complementary, location-specific representations: Population Dynamics indexes human-centric signals such as Search Trends, mobility and maps while AlphaEarth Foundations encodes imagery, topography and climate information to give structural and environmental context.
We describe a number of methods for combining Earth AI Models: each model's output is first mapped to the same administrative regions, and the resulting representations are then integrated into geospatial modeling tasks such as extrapolation and forecasting.
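The two-step recipe above (align by region, then feed the combined representation to a downstream task) can be sketched as follows. This is a minimal illustration under stated assumptions: the region identifiers, embedding values and target values are invented, and the k-nearest-neighbour predictor is a stand-in chosen for brevity, not the downstream model used in this work.

```python
import math


def fuse_by_region(*embedding_tables):
    """Concatenate embeddings for regions that appear in every table."""
    shared = set(embedding_tables[0])
    for table in embedding_tables[1:]:
        shared &= set(table)
    return {
        region: [value for table in embedding_tables for value in table[region]]
        for region in sorted(shared)
    }


def knn_predict(fused, labels, query_region, k=2):
    """Extrapolate a target to an unlabeled region from its k nearest
    labeled regions in the fused embedding space."""
    query = fused[query_region]
    neighbours = sorted(
        (math.dist(query, fused[region]), region)
        for region in labels
        if region != query_region and region in fused
    )[:k]
    return sum(labels[region] for _, region in neighbours) / len(neighbours)


# Toy per-region embeddings (assumed values). Region "E" has imagery
# coverage only, so it is dropped when the tables are aligned.
population_emb = {"A": [0.1, 0.9], "B": [0.2, 0.8], "C": [0.9, 0.1], "D": [0.15, 0.85]}
imagery_emb = {"A": [1.0], "B": [0.9], "C": [0.0], "D": [0.95], "E": [0.5]}

# Hypothetical target variable, known for three regions only.
labels = {"A": 10.0, "B": 12.0, "C": 40.0}

fused = fuse_by_region(population_emb, imagery_emb)
prediction = knn_predict(fused, labels, "D", k=2)
print(sorted(fused), prediction)
```

Concatenation keeps each modality's signal intact and leaves it to the downstream model to weigh them, which is one simple way to test whether the modalities are complementary.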
2.3 Orchestration: Solving Complex Queries with Geospatial Reasoning Agents
The ultimate goal of Earth AI is to help users answer complex, real-world questions that require multifaceted reasoning across diverse models and data sources. Such queries can be categorized into a hierarchy of increasing complexity:
1. Descriptive and retrieval queries involving fact-finding (e.g., “What was the highest recorded temperature in New York in August 2020?”).
2. Analytical and relational queries seeking to uncover patterns between different data sources (e.g., “How many hospitals were located in areas experiencing severe storm conditions in the state of Louisiana when Hurricane Katrina came ashore?”).
3. Predictive or inferential queries involving forecasting new information (e.g., “Which Indian cities have the most vulnerable populations at high risk of being impacted by flooding by November 25, 2027?”).
While direct data retrieval is sufficient for simple queries, addressing analytical and predictive questions demands a more sophisticated methodology. To meet this challenge, we developed the Geospatial Reasoning Agent (Figure 4). Its main role is to serve as an intelligent intermediary, both planning tasks and connecting to a nuanced understanding of the world derived from our core Earth Models and datasets. The agent is designed to decompose complex problems, leveraging its specialized capabilities to interpret Earth observation data with relevant Imagery models, infer socio-demographic insights using Population Dynamics Foundations, and generate forecasts with our Environment models.
The Geospatial Reasoning Agent, developed with Google’s Agent Development Kit (ADK) and powered by Gemini, integrates general-purpose capabilities like orchestration, planning, and recovery with specialized geospatial functionalities. These specialized capabilities are domain-specific and are implemented as either simple tools or complex ‘expert’ sub-agents, depending on task complexity. This modularity is intended to facilitate future extensibility and customization. We categorize these capabilities into four primary domains, alongside more general functions:
- Imagery: Utilizes the Remote Sensing Foundations family of models to perform on-demand analysis of satellite imagery, including tasks like classification, object detection and retrieval.
- Population: Leverages our Population Dynamics Foundations models, alongside Google’s Places API and Data Commons, to resolve geographic boundaries and provide dynamic demographic statistics.
- Environment: Accesses and reasons over dynamic Earth processes, using historical atmospheric data and our predictive models for cyclones, floods, and other phenomena.
- Spatiotemporal Model Training: Provides a natural language interface for on-the-fly model training, using pre-trained embeddings from our core models to perform predictive tasks for user-specified variables.
- General Capabilities: The system is augmented with essential tools for geospatial data analysis, code-generation for custom analyses, Google Earth Engine for access to public and private geospatial datasets, Google Search as an additional knowledge source, and key Google Cloud services.
Accessible through a natural language, map-based user interface, the agent interprets a user’s query, breaks it into manageable sub-tasks, delegates each to the appropriate expert agent or tool, and synthesizes the final response. This synergistic relationship, in which capable models distill reality and a powerful agent reasons over that distillation, enables both retrospective investigation and proactive planning for complex scenarios.
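The interpret, decompose, delegate and synthesize loop can be sketched in plain Python. This is a toy illustration and does not use the Agent Development Kit or any Gemini API: the tool names, the region identifiers, the scores and the 0.6 threshold are all hypothetical, and the plan is hard-coded where the real agent would have the language model produce it.

```python
def flood_risk_tool(regions):
    """Stand-in for an Environment tool; returns a toy flood-risk score per region."""
    toy_scores = {"R1": 0.8, "R2": 0.2, "R3": 0.7}
    return {region: toy_scores[region] for region in regions}


def vulnerability_tool(regions):
    """Stand-in for a Population tool; returns a toy social-vulnerability score."""
    toy_scores = {"R1": 0.9, "R2": 0.95, "R3": 0.3}
    return {region: toy_scores[region] for region in regions}


# Registry mapping sub-task names to the tool that handles them.
TOOLS = {
    "environment.flood_risk": flood_risk_tool,
    "population.vulnerability": vulnerability_tool,
}


def run_plan(plan, regions):
    """Delegate each planned sub-task to its registered tool and collect results."""
    results = {}
    for tool_name in plan:
        if tool_name not in TOOLS:
            raise KeyError(f"no tool registered for sub-task {tool_name!r}")
        results[tool_name] = TOOLS[tool_name](regions)
    return results


def synthesize(results, threshold=0.6):
    """Combine sub-task outputs: keep regions that are high on both scores."""
    flood = results["environment.flood_risk"]
    vulnerability = results["population.vulnerability"]
    return sorted(
        region for region in flood
        if flood[region] >= threshold and vulnerability[region] >= threshold
    )


# Hard-coded plan for a query such as "which regions have elevated flood
# risk and high social vulnerability?" (the planning step is assumed).
plan = ["environment.flood_risk", "population.vulnerability"]
answer = synthesize(run_plan(plan, ["R1", "R2", "R3"]))
print(answer)
```

Keeping tools behind a name-keyed registry mirrors the modularity described above: a new capability can be added by registering another entry without changing the execution loop.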
3 Evaluation and Results
In this section, we present a comprehensive evaluation of the Earth AI capabilities. We first validate the performance of our new foundation models for Imagery and Population individually, demonstrating state-of-the-art results on established benchmarks. We then show the synergistic power of combining these models for complex predictive tasks. Finally, we evaluate the capabilities of our Geospatial Reasoning Agent on a new benchmark of real-world analytical and crisis-response queries.
3.1 Earth AI Models
3.1.1 Imagery: Remote Sensing Foundations
The Remote Sensing Foundations are composed of three core capabilities: a contrastively-trained VLM for tasks like zero-shot classification and cross-modal retrieval, an open-vocabulary object detection model for identifying objects without predefined labels, and a versatile pre-trained vision backbone model that can be fine-tuned for various downstream tasks. In this section, we present the evaluation results for each of these components.
Remote Sensing Vision Language Models
We evaluated our VLMs on two key tasks: zero-shot classification and text-based retrieval, using several public remote sensing benchmarks. Evaluations (Table 1 and Table 2) show that our training datasets applied to the SigLIP2 and MaMMUT architectures result in state-of-the-art model performance on the vast majority of benchmarks and are comparable with much larger and more expensive chat-based models like GeoChat (7B) [kuckreja2024geochat] and LHRS-Bot (7B) [muhtar2024lhrs]. Additional details can be found in [barzilai2025recipe].
| Model | | | | | |
|---|---|---|---|---|---|
| SkyScript [wang2023skyscript] | 28.04 | 70.89* | 70.94 | – | – |
| RS-CLIP [li2023rs] | – | 68.84 | 71.35 | 74.28 | 70.51 |
| GeoRSCLIP (VitL) [zhang2024rs5m] | – | – | 71.89 | – | 76.33 |
| RemoteCLIP [liu2024remoteclip] | – | – | 79.84 | – | 91.30 |
| MaMMUT 400M [kuo2023mammut] | 37.58 | 58.66 | 66.93 | 76.52 | 71.46 |
| SigLIP2 400M [tschannen2025siglip] | 41.25 | 65.94 | 72.40 | 81.43 | 75.56 |
| RS-MaMMUT 400M (ours) | 47.24 | 69.46 | 72.31 | 80.29 | 71.96 |
| RS-SigLIP2 400M (ours) | 48.13 | 68.13 | 80.13 | 84.86 | 78.26 |

| Model | RSICD [lu2017exploring] I2T | RSICD T2I | UCM-Captions [qu2016deep] I2T | UCM-Captions T2I | RSITMD [yuan2022exploring] I2T | RSITMD T2I | NWPU [Cheng2022NWPU] I2T | NWPU T2I |
|---|---|---|---|---|---|---|---|---|
| PIR-ITR [pan2024pir] | 24.43 | 25.77 | – | – | 38.64 | 39.85 | – | – |
| SkyScript SkyCLIP-30 [wang2023skyscript] | 23.70 | 19.97 | 72.22 | 59.33 | 30.75 | 30.58 | – | – |
| Geo-RSClip+RS5M [zhang2024rs5m] | 26.41 | 25.96 | – | – | 33.33 | 38.02 | – | – |
| MaMMUT 400M [kuo2023mammut] | 23.88 | 24.17 | 69.21 | 66.50 | 28.83 | 32.70 | 20.33 | 23.18 |
| SigLIP2 400M [tschannen2025siglip] | 27.36 | 27.93 | 70.48 | 68.52 | 24.62 | 25.14 | 31.86 | 36.70 |
| RS-MaMMUT 400M (ours) | 33.33 | 33.59 | 74.76 | 71.79 | 42.63 | 42.58 | 41.44 | 32.28 |
| RS-SigLIP2 400M (ours) | 38.37 | 37.64 | 76.67 | 75.33 | 43.14 | 47.26 | 45.12 | 37.74 |
Remote Sensing Open Vocabulary Detection
We assessed our Remote Sensing (RS) Open-Vocabulary Detection (OVD) model’s zero-shot performance on two commonly-used remote sensing object detection datasets: DOTA [xia2018dota] and DIOR [dior2023rs]. As shown in Table 3 (top), our RS-OWL-ViT-v2 OVD model outperforms the baseline OWL-ViT-v2 on both benchmarks, achieving mean Average Precision (mAP) scores of 31.83% on DOTA and 29.39% on DIOR.
We also evaluated our OVD model when augmented with a few-shot learning technique, following the FLAME approach [refael2025ontheflyovdadaptationflame] (Table 3, bottom). Using just 30 labeled classification examples per category, this approach improved the mAP to 53.96% on DOTA and 53.21% on DIOR, thereby demonstrating its effectiveness in adapting the model with minimal additional data. Notably, this variant also outperformed the recently-proposed Scale-adaptive Intersection over Union (SIoU) [jeune2023SIoU] approach for few-shot object detection, further confirming its utility.
| Model | DOTA | DIOR |
|---|---|---|
| Zero-Shot | | |
| OWL-ViT-v2 [minderer2023scaling] | 13.77% | 14.98% |
| RS-OWL-ViT-v2 (ours) | 31.83% | 29.39% |
| Few-Shot | | |
| SIoU [jeune2023SIoU] | 45.88% | 52.85% |
| FLAME with RS-OWL-ViT-v2 (ours) | 53.96% | 53.21% |
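
To make the size of the improvements in Table 3 easier to read, the short script below computes absolute gains (in mAP points) and ratios from the reported values. The numbers are copied from the table; the `compare` helper and the dictionary layout are only illustrative.

```python
# mAP values (%) copied from Table 3.
TABLE3 = {
    "DOTA": {"OWL-ViT-v2": 13.77, "RS-OWL-ViT-v2": 31.83,
             "SIoU": 45.88, "FLAME": 53.96},
    "DIOR": {"OWL-ViT-v2": 14.98, "RS-OWL-ViT-v2": 29.39,
             "SIoU": 52.85, "FLAME": 53.21},
}


def compare(new, old):
    """Return (absolute gain in mAP points, ratio new / old)."""
    return new - old, new / old


summary = {}
for dataset, m in TABLE3.items():
    summary[dataset] = {
        # RS-specific zero-shot model versus the generic baseline.
        "zero_shot_vs_baseline": compare(m["RS-OWL-ViT-v2"], m["OWL-ViT-v2"]),
        # Few-shot adaptation versus the same model used zero-shot.
        "flame_vs_zero_shot": compare(m["FLAME"], m["RS-OWL-ViT-v2"]),
        # Few-shot adaptation versus the SIoU few-shot approach.
        "flame_vs_siou": compare(m["FLAME"], m["SIoU"]),
    }

for dataset, rows in summary.items():
    for name, (points, ratio) in rows.items():
        print(f"{dataset:5s} {name:22s} +{points:5.2f} points  x{ratio:.2f}")
```

On these numbers the RS-specific model roughly doubles the baseline's zero-shot mAP on both datasets, while the margin of FLAME over SIoU is about 8 points on DOTA and under half a point on DIOR.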
Remote Sensing Pre-trained Backbone Foundation Model
We evaluated the performance of our remote sensing global multi-task pretraining (RS-Global MTP) backbone model on 13 different downstream fine-tuning benchmarks from four categories:
- Image Classification: FMoW [christie2018fmow], Resisc45 [Cheng2017RESISC45], UCM, AID [xia2017aid], SKAI [lee2020assessing]