API enhancement proposal

Issue #412 closed
Andreas Janz created an issue

Dear @Benjamin Jakimow , @Fabian Thiel and @Dirk Pflugmacher , for some time I wanted to overhaul the current HUBDC and HUBFLOW API. After implementing the FORCE4QGIS Plugin I now have started to adopt some of the lessons learned from that project into the EnMAP-Box.

  1. The data model must be more flexible than GDAL Datasets. I want to specify a raster that include bands from different GDAL Datasets.
  2. Raster bands can be subsetted, stacked, reordered and renamed, without creating a VRT first.
  3. Each band can have an additional mask band (beside its internal mask given by the no data value).

Here are some examples that are using the new concept:

Rasterize a vector layer:

# rasterize reference polygons
landcoverRaster = Raster.create(grid=grid, bands=1, gdt=gdal.GDT_Byte)
landcoverRaster.fill(value=0)
landcoverRaster.rasterize(layer=landcover_polygons, burnAttribute='level_2_id')
landcoverRaster.setNoDataValue(value=0)

Stack bands from multiple raster together and give more meaningful names:

enmapRaster = Raster.open(enmap)
hiresRaster = Raster.open(hires)

# stack all bands and give more meaningful names
stack = Raster.stack(
    rasters=[
        landcoverRaster.rename(bandNames=['classId']),
        enmapRaster.rename(bandNames=[f'EnMAP{number}' for number, _ in enumerate(enmapRaster.bands, 1)]),
        hiresRaster.rename(bandNames=[f'HiRes{number}' for number, _ in enumerate(hiresRaster.bands, 1)])
    ]
)

Sample from raster stack and print table

# draw samples from stack and print
sample = stack.readAsSample(graRaster=gdal.GRA_Average, graMask=gdal.GRA_Mode, xMap='x', yMap='y')
print('Size', len(sample))
for name in sample.dtype.names:
    print(name, sample[name][:])

Size 2028
classId [1 3 2 ... 2 2 4]
EnMAP1 [ 485 288 408 ... 577 724 1011]
EnMAP2 [ 497 303 407 ... 579 744 1033]
EnMAP3 [ 491 300 414 ... 590 760 1057]

EnMAP175 [ 717 374 905 ... 1659 1978 3129]
EnMAP176 [ 684 414 873 ... 1647 2039 3191]
EnMAP177 [ 672 487 937 ... 1472 1888 3075]
HiRes1 [115 49 112 ... 106 166 205]
HiRes2 [136 70 139 ... 134 177 208]
HiRes3 [131 58 95 ... 92 150 190]
x [383997.37 384027.37 384057.37 ... 384267.37 384237.37 384267.37]
y [5820117.35 5820117.35 5820117.35 ... 5808657.35 5808627.35 5808627.35]

Fit RFC and predict Classification

# fit and apply RFC
y = sample.classId
X = np.transpose([sample[name] for name in sample.dtype.names[1:-2]])
estimator = RandomForestClassifier(n_estimators=10, n_jobs=8)
estimator.fit(X=X, y=y)
classification = estimatorPredict(raster=stack[1:], estimator=estimator, filename='rfc.bsq')

# set class colors and names
color = lambda hex: Color(*(int(hex[i:i + 2], 16) for i in (0, 2, 4)))
classification.setCategories(categories=[
    Category(id=1, name='impervious', color=color('e60000')),
    Category(id=2, name='low vegetation', color=color('98e600')),
    Category(id=3, name='tree', color=color('267300')),
    Category(id=4, name='soil', color=color('a87000')),
    Category(id=5, name='water', color=color('0064ff')),
])

Comments (13)

  1. Andreas Janz reporter

    Here is the full script.

    from osgeo import gdal
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    
    from enmapboxtestdata import enmap, hires, landcover_polygons
    from hubdsm.algorithm.estimatorpredict import estimatorPredict
    from hubdsm.core.category import Category
    from hubdsm.core.color import Color
    from hubdsm.core.raster import Raster
    
    # prepare features (e.g. raster with different resolutions)
    enmapRaster = Raster.open(enmap)
    hiresRaster = Raster.open(hires)
    
    # select grid for analysis (all date is resampled on-the-fly to this grid)
    grid = enmapRaster.grid
    
    # rasterize reference polygons
    landcoverRaster = Raster.create(grid=grid, bands=1, gdt=gdal.GDT_Byte)
    landcoverRaster.fill(value=0)
    landcoverRaster.rasterize(layer=landcover_polygons, burnAttribute='level_2_id')
    landcoverRaster.setNoDataValue(value=0)
    
    # stack all bands and give more meaningful names
    stack = Raster.stack(
        rasters=[
            landcoverRaster.rename(bandNames=['classId']),
            enmapRaster.rename(bandNames=[f'EnMAP{number}' for number, _ in enumerate(enmapRaster.bands, 1)]),
            hiresRaster.rename(bandNames=[f'HiRes{number}' for number, _ in enumerate(hiresRaster.bands, 1)])
        ]
    )
    
    # draw samples from stack and print
    sample = stack.readAsSample(graRaster=gdal.GRA_Average, graMask=gdal.GRA_Mode, xMap='x', yMap='y')
    print('Size', len(sample))
    for name in sample.dtype.names:
        print(name, sample[name][:])
    
    # fit and apply RFC
    y = sample.classId
    X = np.transpose([sample[name] for name in sample.dtype.names[1:-2]])
    estimator = RandomForestClassifier(n_estimators=10, n_jobs=8)
    estimator.fit(X=X, y=y)
    classification = estimatorPredict(raster=stack[1:], estimator=estimator, filename='rfc.bsq')
    
    # set class colors and names
    color = lambda hex: Color(*(int(hex[i:i + 2], 16) for i in (0, 2, 4)))
    classification.setCategories(categories=[
        Category(id=1, name='impervious', color=color('e60000')),
        Category(id=2, name='low vegetation', color=color('98e600')),
        Category(id=3, name='tree', color=color('267300')),
        Category(id=4, name='soil', color=color('a87000')),
        Category(id=5, name='water', color=color('0064ff')),
    ])
    

  2. Benjamin Jakimow

    Sounds nice and surely it can make some programming parts easier. Otherwise I don’t know if this really adds more functionality compared to what GDAL allows:

    1. The data model must be more flexible than GDAL Datasets. I want to specify a raster that include bands from different GDAL Datasets.

    Including bands from different GDAL Datasets is supported by GDAL VRTs very well (https://gdal.org/drivers/raster/vrt.html)

    1. Raster bands can be subsetted, stacked, reordered and renamed, without creating a VRT first.

    Otherwise VRTs can be used by any other software that uses GDAL I/O, e.g. FORCE, QGIS, ArcGIS, R, …

    1. Each band can have an additional mask band (beside its internal mask given by the no data value).

    GDAL supports mask bands as well (https://gdal.org/development/rfc/rfc15_nodatabitmask.html, https://gdal.org/drivers/raster/vrt.html)

    If would be very nice if theses “virtual” and API only combinations of stacks, subsets, mask etc. can be manifested in a GDAL VRT, which can be read by other software as well.

  3. Andreas Janz reporter

    Regarding the sampling from the above example, I think the unpacking into y and X data for fitting the estimator is a bit clumsy:

    y = sample.classId
    X = np.transpose([sample[name] for name in sample.dtype.names[1:-2]])
    

    Syntax is much prettier when also exposing the readAsSample method to RasterCollection. This way the grouping of features can be descriped more elegantly. Here is an example:

    # build raster collection and give more meaningful names
    stack = RasterCollection(
        rasters=(
            landcoverRaster.withName(name='landcover').rename(bandNames=['classId']),
            enmapRaster,
            hiresRaster
        )
    )
    
    # draw samples from raster collection
    samples = stack.readAsSample(graRaster=gdal.GRA_Average, graMask=gdal.GRA_Mode, xMap='x', yMap='y')
    
    # pretty print sample
    for rasterName, sample in samples.items():
        print(rasterName)
        for fieldName in sample.dtype.names:
            print(f'  {fieldName}: {sample[fieldName]}')
    

    Prints:

    landcover
      classId: [1 3 2 ... 2 2 4]
    enmap_berlin.bsq
      band 8 (0.460000 Micrometers): [ 485  288  408 ...  577  724 1011]
      band 9 (0.465000 Micrometers): [ 497  303  407 ...  579  744 1033]
      band 10 (0.470000 Micrometers): [ 491  300  414 ...  590  760 1057]
      ...
      band 237 (2.393000 Micrometers): [ 717  374  905 ... 1659 1978 3129]
      band 238 (2.401000 Micrometers): [ 684  414  873 ... 1647 2039 3191]
      band 239 (2.409000 Micrometers): [ 672  487  937 ... 1472 1888 3075]
    hires_berlin.bsq
      Red: [115  49 112 ... 106 166 205]
      Green: [136  70 139 ... 134 177 208]
      Blue: [131  58  95 ...  92 150 190]
    _location
      x: [383997.37 384027.37 384057.37 ... 384267.37 384237.37 384267.37]
      y: [5820117.35 5820117.35 5820117.35 ... 5808657.35 5808627.35 5808627.35]
    

  4. Andreas Janz reporter

    @Benjamin Jakimow If would be very nice if theses “virtual” and API only combinations of stacks, subsets, mask etc. can be manifested in a GDAL VRT, which can be read by other software as well.

    This should be possible. So, we use the API for subsetting, stacking, renaming and reordering raster band in a convinient way, and finally, we call something like Raster.saveAsVrt on the resulting raster.

    I am not that familiar with the GDAL mask band feature. Have you ever used it?

  5. Jakub Charyton

    I don’t know if it’s the right place to comment (without committing anything to the source).

    I love the new API implemented in hubdsm. It’s much more understandable to enmapbox fresh programmers. However, I have one huge problem with sampling huge (in terms of single band size and a number of bands) rasters. It runs order of magnitudes slower than hubflow.ClassificationSample for example on the same raster.

    My raster size is 5000x5000x30 and sampling takes more than 15 minutes while hubflow sampling takes about 30 seconds.

    Best regards

  6. Log in to comment