Unsupervised Classification with Satellite Embedding Dataset

Author: Zhanchao Yang
Weitzman School of Design, University of Pennsylvania

1 Acknowledgements

This tutorial is adapted from the official Google Earth Engine embedding tutorial: https://developers.google.com/earth-engine/tutorials/community/satellite-embedding-02-unsupervised-classification

2 Overview

In this tutorial, we take an unsupervised classification approach, using the Satellite Embedding Dataset to classify land cover types in a study area. We will use the K-Means clustering algorithm to group pixels with similar 64-dimensional embedding vectors into land cover classes.
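Before diving into Earth Engine, it can help to see what K-Means actually does. The toy example below (pure NumPy, not part of the tutorial pipeline) runs Lloyd's algorithm on a few 2-D points; in the tutorial the feature vectors are 64-dimensional embeddings instead, but the mechanics are identical. The seeding and data here are invented for illustration.

```python
import numpy as np

def kmeans(points, centroids, iters=10):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move every centroid to its cluster's mean."""
    centroids = centroids.copy()
    for _ in range(iters):
        # Distance from every point to every centroid -> (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(len(centroids)):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# Two well-separated groups of 2-D points (the tutorial's vectors are 64-D).
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, cents = kmeans(pts, centroids=pts[[0, 3]])  # one seed per group
print(labels)  # -> [0 0 0 1 1 1]
```

Pixels (points) that are close in feature space end up with the same cluster label, which is exactly how the embedding mosaic will be partitioned later in the tutorial.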

2.1 A glance at the Google Satellite Embedding Dataset

We used the leafmap package to quickly visualize random bands from the Satellite Embedding Dataset across Pennsylvania on an interactive map. See the notebook in the GitHub repo for details; thanks to the excellent tutorial by Dr. Qiusheng Wu.

Combination one (3 bands drawn at random from the 64 embedding bands)

Combination two (3 bands drawn at random from the 64 embedding bands)

3 Load Libraries and Google Earth Engine Authentication

Show Code
import ee
import geemap
Show Code
ee.Authenticate()
ee.Initialize(project="ee-zhanchaoyang")

4 Define the study area

Lancaster County in Pennsylvania is one of the most productive agricultural counties in the United States. It is known for its fertile soil and favorable climate, which support a wide variety of crops. The county is particularly famous for corn and soybeans, the two main crops grown in the area. In addition to these staple crops, Lancaster County also produces wheat, barley, oats, and various fruits and vegetables. The county’s agricultural landscape is characterized by a mix of small family farms and larger commercial operations, contributing to its reputation as a leading agricultural region.

Show Code
counties = ee.FeatureCollection("TIGER/2018/Counties")
Show Code
lancaster = counties.filter(ee.Filter.eq("GEOID", "42071")).geometry()
Show Code
m = geemap.Map(center=[40.04, -76.30], zoom=9)
m.addLayer(lancaster, {}, "Lancaster County")
m

5 Loading satellite embedding and training dataset

Show Code
embedding = ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")
Show Code
year = 2022
startdate = ee.Date.fromYMD(year, 1, 1)
enddate = ee.Date.fromYMD(year + 1, 1, 1)
Show Code
study_embeddings = embedding.filter(ee.Filter.date(startdate, enddate)).filter(
    ee.Filter.bounds(lancaster)
)
Show Code
embeddingsImage = study_embeddings.mosaic()

5.1 Loading training data: USDA NASS Cropland Data Layer (CDL)

For our modeling, we need to exclude non-cropland areas. There are many global and regional datasets that can be used to create a crop mask: ESA WorldCover and the GFSAD Global Cropland Extent Product are good choices among global cropland datasets, and a more recent addition is the ESA WorldCereal Active Cropland product, which provides seasonal mapping of active croplands. Since our region is in the US, we can use a more accurate regional dataset, the USDA NASS Cropland Data Layer (CDL), to obtain a crop mask.

Show Code
cdl = (
    ee.ImageCollection("USDA/NASS/CDL")
    .filter(ee.Filter.date("2022-01-01", "2023-01-01"))
    .first()
)
cropland = cdl.select("cropland")
# In the CDL "cultivated" band, 2 = cultivated and 1 = non-cultivated.
cropland_mask = cdl.select("cultivated").eq(2).rename("cropmask")
Show Code
# Add the cropland mask to the existing map (m) created earlier.
m.addLayer(
    cropland_mask.clip(lancaster),
    {"min": 0, "max": 1, "palette": ["white", "green"]},
    "Cropland Mask",
)
m

6 Extract training samples

We apply the cropland mask to the embedding mosaic. We are now left with all the pixels representing cultivated cropland in the county.

Show Code
cluster_image = embeddingsImage.updateMask(cropland_mask).addBands(cropland_mask)
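Conceptually, `updateMask` drops every pixel where the mask is zero while the remaining pixels keep their values. A rough NumPy analog (illustrative only, with made-up values; not the Earth Engine API) looks like this:

```python
import numpy as np

# A tiny 2x3 single-band "image" and a 0/1 crop mask (invented values).
band = np.array([[10.0, 20.0, 30.0],
                 [40.0, 50.0, 60.0]])
mask = np.array([[1, 0, 1],
                 [0, 1, 0]])

# updateMask analog: masked-out pixels become NaN (no data).
masked = np.where(mask == 1, band, np.nan)
print(masked)
```

After this step, only cultivated-cropland pixels carry embedding values; everything else is treated as missing data by later operations.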

We need to take the Satellite Embedding image and obtain random samples to train a clustering model. Since our region of interest contains many masked pixels, a simple random sampling may result in samples with null values. To ensure we can extract the desired number of non-null samples, we use stratified sampling to obtain the desired number of samples in unmasked areas.

Show Code
training = cluster_image.stratifiedSample(
    numPoints=1000,
    classBand="cropmask",
    region=lancaster,
    scale=10,
    tileScale=16,
    seed=100,
    dropNulls=True,
    geometries=True,
)
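The stratified sampler guarantees a fixed number of samples per `cropmask` class, so masked (null) pixels cannot crowd out valid ones. The hypothetical pure-Python sketch below mirrors that idea; the records, the `A00` band name, and the helper function are all invented for illustration, not Earth Engine internals.

```python
import random

def stratified_sample(records, class_key, n_per_class, seed=100):
    """Draw up to n_per_class records from each class,
    skipping records with null values (a dropNulls=True analog)."""
    random.seed(seed)
    by_class = {}
    for rec in records:
        if any(v is None for v in rec.values()):
            continue  # drop null samples
        by_class.setdefault(rec[class_key], []).append(rec)
    sample = []
    for cls, recs in by_class.items():
        sample.extend(random.sample(recs, min(n_per_class, len(recs))))
    return sample

records = (
    [{"cropmask": 1, "A00": 0.1}] * 5
    + [{"cropmask": 0, "A00": 0.3}] * 5
    + [{"cropmask": 1, "A00": None}] * 3  # masked pixels are skipped
)
picked = stratified_sample(records, "cropmask", n_per_class=2)
print(len(picked))  # -> 4 (2 per class)
```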
Show Code
m.addLayer(training.style(**{"color": "red", "pointSize": 3}), {}, "Training Points")
m

7 Perform K-Means Clustering

We can now train a clusterer to group the 64-dimensional embedding vectors into a chosen number of distinct clusters. Performing unsupervised clustering on the Satellite Embedding yields clusters of pixels with similar temporal trajectories and patterns: pixels with similar spectral and spatial characteristics, along with similar phenology, will be grouped in the same cluster.

ee.Clusterer.wekaCascadeKMeans() lets us specify a minimum and maximum number of clusters and finds the optimal number of clusters based on the training data.
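Conceptually, cascade K-Means fits a model for each candidate k and keeps the one that scores best on a cluster-validity criterion. The NumPy sketch below is an illustrative analog, not the Weka implementation: it scores each k with the Calinski-Harabasz index (one common criterion) and uses a deterministic farthest-point initialization so the result is reproducible.

```python
import numpy as np

def farthest_init(X, k):
    """Deterministic seeding: start at X[0], then repeatedly add the
    point farthest from all centroids chosen so far."""
    C = [X[0]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None] - np.array(C)[None], axis=2), axis=1)
        C.append(X[d.argmax()])
    return np.array(C, dtype=float)

def kmeans(X, k, iters=20):
    C = farthest_init(X, k)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                C[c] = X[labels == c].mean(axis=0)
    return labels, C

def calinski_harabasz(X, labels, C):
    """Ratio of between- to within-cluster dispersion; higher is better."""
    n, k = len(X), len(C)
    mu = X.mean(axis=0)
    B = sum((labels == c).sum() * np.sum((C[c] - mu) ** 2) for c in range(k))
    W = sum(np.sum((X[labels == c] - C[c]) ** 2) for c in range(k))
    return (B / (k - 1)) / (W / (n - k))

# Three tight, well-separated blobs: the score should peak at k=3.
X = np.array([[0, 0], [0, .2], [.2, 0], [.2, .2],
              [4, 4], [4, 4.2], [4.2, 4], [4.2, 4.2],
              [10, 0], [10, .2], [10.2, 0], [10.2, .2]], dtype=float)
scores = {k: calinski_harabasz(X, *kmeans(X, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k)  # -> 3
```

This is the same "try each k, keep the best" loop the cascade clusterer performs on the training samples.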

Show Code
mincluster = 4
maxcluster = 5
Show Code
clusterer = ee.Clusterer.wekaCascadeKMeans(
    minClusters=mincluster,
    maxClusters=maxcluster,
).train(features=training, inputProperties=cluster_image.bandNames())

clustered = cluster_image.cluster(clusterer)
Show Code
vis = clustered.randomVisualizer().clip(lancaster)
m.addLayer(vis, {}, "Clustered Image")
m
Show Code
# Per-pixel area in acres (1 acre = 4,046.86 m^2), with the cluster band attached.
area_image = ee.Image.pixelArea().divide(4046.86).addBands(clustered)
Show Code
areas = area_image.reduceRegion(
    reducer=ee.Reducer.sum().group(
        groupField=1,
        groupName="cluster",
    ),
    geometry=lancaster,
    scale=10,
    maxPixels=1e10,
)
Show Code
print(areas.getInfo())
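The grouped reducer is essentially a group-by-sum: each pixel contributes its area (in m², converted to acres by dividing by 4,046.86) to its cluster's total. A plain-Python analog over hypothetical per-pixel records:

```python
from collections import defaultdict

SQM_PER_ACRE = 4046.86  # square meters per acre

# Hypothetical (cluster, pixel_area_m2) pairs; a real 10 m pixel
# covers roughly 100 m2, varying slightly with projection.
pixels = [(0, 100.0), (0, 100.0), (1, 100.0), (1, 98.5), (2, 101.2)]

# Group-by-sum, mimicking ee.Reducer.sum().group() on the area band.
acres = defaultdict(float)
for cluster, area_m2 in pixels:
    acres[cluster] += area_m2 / SQM_PER_ACRE

print({c: round(a, 5) for c, a in sorted(acres.items())})
```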
Show Code
cluster_areas = ee.List(areas.get("groups"))
Show Code
def to_feature(item):
    d = ee.Dictionary(item)
    return ee.Feature(
        None, {"cluster": d.getNumber("cluster").format(), "area": d.getNumber("sum")}
    )


cluster_area_fc = ee.FeatureCollection(cluster_areas.map(to_feature))
Show Code
print(cluster_area_fc.limit(10).getInfo())

Prediction results (in acres):

  • Cluster 1: 25515.6423
  • Cluster 2: 119071.5298
  • Cluster 3: 62110.8848
  • Cluster 4: 56379.1821

7.1 Limitations of K-Means Clustering

  • We need local knowledge to understand the optimal number of clusters to use for our analysis.
  • We also need local knowledge to figure out the land cover types represented by each cluster.

8 Validating classification results

Based on the USDA report, the main crops in Lancaster County are:

  • Corn for grain: 95,549 + 35,988 = 131,537 acres (prediction: 119,071)
  • Forage (hay/haylage), all: 65,142 acres (other crops)
  • Soybeans for beans: 51,695 acres
  • Wheat for grain, all: 24,101 acres

For simplicity, we group the crops into three classes: corn, soybeans, and other crops.

Show Code
cdl = (
    ee.ImageCollection("USDA/NASS/CDL")
    .filter(ee.Filter.date("2022-01-01", "2023-01-01"))
    .first()
)
cropland = cdl.select("cropland")
cropmap = cropland.updateMask(cropland_mask).rename("crops")
Show Code
cropclasses = ee.List.sequence(0, 254)
Show Code
# Map CDL class 1 (corn) to 1 and class 5 (soybeans) to 2; all other classes to 0.
targetclasses = ee.List.repeat(0, 255).set(1, 1).set(5, 2)
Show Code
cropmapreclass = cropmap.remap(cropclasses, targetclasses).rename("crops")
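The remap acts as a lookup table: CDL class 1 (corn) maps to 1, class 5 (soybeans) to 2, and every other class to 0 ("other crops"). The pure-Python sketch below mirrors that logic on a few sample CDL pixel values:

```python
# Build the same lookup the Earth Engine remap uses:
# CDL class 1 (corn) -> 1, class 5 (soybeans) -> 2, everything else -> 0.
crop_classes = list(range(0, 255))
target_classes = [0] * 255
target_classes[1] = 1   # corn
target_classes[5] = 2   # soybeans
lookup = dict(zip(crop_classes, target_classes))

# Apply to a few sample CDL pixel values
# (24 = winter wheat, 36 = alfalfa, 176 = grass/pasture).
sample_pixels = [1, 5, 24, 36, 1, 176]
reclassified = [lookup[v] for v in sample_pixels]
print(reclassified)  # -> [1, 2, 0, 0, 1, 0]
```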
Show Code
crop_vis = {"min": 0, "max": 2, "palette": ["#bdbdbd", "#ffd400", "#267300"]}
m.addLayer(cropmapreclass.clip(lancaster), crop_vis, "Reclassified Crop Map")
m

Validation results from the USDA-NASS CDL 2022:

Unsupervised classification results from the K-Means clustering on the Satellite Embedding Dataset:

We highly recommend running the code cells in the notebook and exploring the interactive map to compare the classification results with the USDA-NASS CDL 2022 data.

9 References and acknowledgements

  • Official Google Earth Engine Satellite Embedding tutorial by Ujaval Gandhi: https://developers.google.com/earth-engine/tutorials/community/satellite-embedding-02-unsupervised-classification
  • Leafmap package by Dr. Qiusheng Wu: https://leafmap.org/
  • geemap package by Dr. Qiusheng Wu: https://geemap.org/
  • Dr. Qiusheng Wu's Google Satellite Embedding video tutorial: https://www.youtube.com/watch?v=EGL7fXyA7-U

Thank you so much to Ujaval Gandhi and Dr. Qiusheng Wu for their amazing tutorials and open-source code!