API enhancement proposal
Dear @Benjamin Jakimow , @Fabian Thiel and @Dirk Pflugmacher , for some time I wanted to overhaul the current HUBDC and HUBFLOW API. After implementing the FORCE4QGIS Plugin I now have started to adopt some of the lessons learned from that project into the EnMAP-Box.
- The data model must be more flexible than GDAL Datasets. I want to specify a raster that include bands from different GDAL Datasets.
- Raster bands can be subsetted, stacked, reordered and renamed, without creating a VRT first.
- Each band can have an additional mask band (beside its internal mask given by the no data value).
Here are some examples that are using the new concept:
Rasterize a vector layer:
# rasterize reference polygons
landcoverRaster = Raster.create(grid=grid, bands=1, gdt=gdal.GDT_Byte)
landcoverRaster.fill(value=0)
landcoverRaster.rasterize(layer=landcover_polygons, burnAttribute='level_2_id')
landcoverRaster.setNoDataValue(value=0)
Stack bands from multiple raster together and give more meaningful names:
enmapRaster = Raster.open(enmap)
hiresRaster = Raster.open(hires)
# stack all bands and give more meaningful names
stack = Raster.stack(
rasters=[
landcoverRaster.rename(bandNames=['classId']),
enmapRaster.rename(bandNames=[f'EnMAP{number}' for number, _ in enumerate(enmapRaster.bands, 1)]),
hiresRaster.rename(bandNames=[f'HiRes{number}' for number, _ in enumerate(hiresRaster.bands, 1)])
]
)
Sample from raster stack and print table
# draw samples from stack and print
sample = stack.readAsSample(graRaster=gdal.GRA_Average, graMask=gdal.GRA_Mode, xMap='x', yMap='y')
print('Size', len(sample))
for name in sample.dtype.names:
print(name, sample[name][:])
Size 2028
classId [1 3 2 ... 2 2 4]
EnMAP1 [ 485 288 408 ... 577 724 1011]
EnMAP2 [ 497 303 407 ... 579 744 1033]
EnMAP3 [ 491 300 414 ... 590 760 1057]
…
EnMAP175 [ 717 374 905 ... 1659 1978 3129]
EnMAP176 [ 684 414 873 ... 1647 2039 3191]
EnMAP177 [ 672 487 937 ... 1472 1888 3075]
HiRes1 [115 49 112 ... 106 166 205]
HiRes2 [136 70 139 ... 134 177 208]
HiRes3 [131 58 95 ... 92 150 190]
x [383997.37 384027.37 384057.37 ... 384267.37 384237.37 384267.37]
y [5820117.35 5820117.35 5820117.35 ... 5808657.35 5808627.35 5808627.35]
Fit RFC and predict Classification
# fit and apply RFC
y = sample.classId
X = np.transpose([sample[name] for name in sample.dtype.names[1:-2]])
estimator = RandomForestClassifier(n_estimators=10, n_jobs=8)
estimator.fit(X=X, y=y)
classification = estimatorPredict(raster=stack[1:], estimator=estimator, filename='rfc.bsq')
# set class colors and names
color = lambda hex: Color(*(int(hex[i:i + 2], 16) for i in (0, 2, 4)))
classification.setCategories(categories=[
Category(id=1, name='impervious', color=color('e60000')),
Category(id=2, name='low vegetation', color=color('98e600')),
Category(id=3, name='tree', color=color('267300')),
Category(id=4, name='soil', color=color('a87000')),
Category(id=5, name='water', color=color('0064ff')),
])
Comments (13)
-
reporter -
reporter - edited description
-
Sounds nice and surely it can make some programming parts easier. Otherwise I don’t know if this really adds more functionality compared to what GDAL allows:
- The data model must be more flexible than GDAL Datasets. I want to specify a raster that include bands from different GDAL Datasets.
Including bands from different GDAL Datasets is supported by GDAL VRTs very well (https://gdal.org/drivers/raster/vrt.html)
- Raster bands can be subsetted, stacked, reordered and renamed, without creating a VRT first.
Otherwise VRTs can be used by any other software that uses GDAL I/O, e.g. FORCE, QGIS, ArcGIS, R, …
- Each band can have an additional mask band (beside its internal mask given by the no data value).
GDAL supports mask bands as well (https://gdal.org/development/rfc/rfc15_nodatabitmask.html, https://gdal.org/drivers/raster/vrt.html)
If would be very nice if theses “virtual” and API only combinations of stacks, subsets, mask etc. can be manifested in a GDAL VRT, which can be read by other software as well.
-
reporter Regarding the sampling from the above example, I think the unpacking into y and X data for fitting the estimator is a bit clumsy:
y = sample.classId X = np.transpose([sample[name] for name in sample.dtype.names[1:-2]])
Syntax is much prettier when also exposing the readAsSample method to RasterCollection. This way the grouping of features can be descriped more elegantly. Here is an example:
# build raster collection and give more meaningful names stack = RasterCollection( rasters=( landcoverRaster.withName(name='landcover').rename(bandNames=['classId']), enmapRaster, hiresRaster ) ) # draw samples from raster collection samples = stack.readAsSample(graRaster=gdal.GRA_Average, graMask=gdal.GRA_Mode, xMap='x', yMap='y') # pretty print sample for rasterName, sample in samples.items(): print(rasterName) for fieldName in sample.dtype.names: print(f' {fieldName}: {sample[fieldName]}')
Prints:
landcover classId: [1 3 2 ... 2 2 4] enmap_berlin.bsq band 8 (0.460000 Micrometers): [ 485 288 408 ... 577 724 1011] band 9 (0.465000 Micrometers): [ 497 303 407 ... 579 744 1033] band 10 (0.470000 Micrometers): [ 491 300 414 ... 590 760 1057] ... band 237 (2.393000 Micrometers): [ 717 374 905 ... 1659 1978 3129] band 238 (2.401000 Micrometers): [ 684 414 873 ... 1647 2039 3191] band 239 (2.409000 Micrometers): [ 672 487 937 ... 1472 1888 3075] hires_berlin.bsq Red: [115 49 112 ... 106 166 205] Green: [136 70 139 ... 134 177 208] Blue: [131 58 95 ... 92 150 190] _location x: [383997.37 384027.37 384057.37 ... 384267.37 384237.37 384267.37] y: [5820117.35 5820117.35 5820117.35 ... 5808657.35 5808627.35 5808627.35]
-
reporter @Benjamin Jakimow If would be very nice if theses “virtual” and API only combinations of stacks, subsets, mask etc. can be manifested in a GDAL VRT, which can be read by other software as well.
This should be possible. So, we use the API for subsetting, stacking, renaming and reordering raster band in a convinient way, and finally, we call something like Raster.saveAsVrt on the resulting raster.
I am not that familiar with the GDAL mask band feature. Have you ever used it?
-
Yes I already did.
-
addresses
#412refactoring, EnMAP-Box loads *.pkl files of unspecified type (simple dump als object tree)→ <<cset 38730aea65fd>>
-
addresses
#412(selected updates from overhaul_processing)→ <<cset daea2c38a82e>>
-
addresses
#412(selected updates from overhaul_processing)→ <<cset 30bfa844ce51>>
-
reporter addresses
#412refactoring, EnMAP-Box loads *.pkl files of unspecified type (simple dump als object tree)→ <<cset 38730aea65fd>>
-
I don’t know if it’s the right place to comment (without committing anything to the source).
I love the new API implemented in hubdsm. It’s much more understandable to enmapbox fresh programmers. However, I have one huge problem with sampling huge (in terms of single band size and a number of bands) rasters. It runs order of magnitudes slower than hubflow.ClassificationSample for example on the same raster.
My raster size is 5000x5000x30 and sampling takes more than 15 minutes while hubflow sampling takes about 30 seconds.
Best regards
-
reporter Hi @Jakub Charyton ,
it’s best to share your data and code snippet, so I can check. Send everything to andreas.rabe@geo.hu-berlin.de.
Regards
Andreas -
reporter - changed status to closed
that baby died :-)
- Log in to comment
Here is the full script.