[GH-2824] Add NetCDF metadata extraction to sedonainfo #2829
Draft
jiayuasu wants to merge 10 commits into apache:master
Add a new Spark DataSourceV2 that returns GeoTIFF file metadata
without decoding pixel data, similar to gdalinfo.
Usage: spark.read.format("sedonainfo").load("/path/to/*.tif")
Returns one row per file with: path, driver, fileSize, width,
height, numBands, srid, crs, geoTransform, cornerCoordinates,
bands (array with dataType, noData, blockSize, colorInterpretation),
overviews, metadata, isTiled, and compression.
Supports glob patterns, directory recursion, LIMIT pushdown,
and column pruning.
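A minimal PySpark sketch of the usage above, assuming a session with Sedona's data sources on the classpath. The helper name `summarize_rasters` and its `limit` parameter are hypothetical; the format name and column names come from the commit message.

```python
def summarize_rasters(spark, path_glob, limit=10):
    """Return headline metadata, one row per raster file (hypothetical helper)."""
    # "sedonainfo" reads file headers only; pixel data is not decoded.
    df = spark.read.format("sedonainfo").load(path_glob)
    # Selecting a few columns gives column pruning something to act on,
    # and the trailing limit is a candidate for LIMIT pushdown.
    return df.select("path", "width", "height", "numBands", "srid").limit(limit)
```

Called as `summarize_rasters(spark, "/path/to/*.tif").show()`, this should print one row per matching file.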
… logic, add docs
- Rename package from io.geotiffmetadata to io.sedonainfo
- Extract RasterFileMetadataExtractor trait for format-agnostic design
- Move GeoTIFF-specific logic into GeoTiffMetadataExtractor
- SedonaInfoPartitionReader delegates to format extractors via canHandle() dispatch, making it easy to add new formats
- Add documentation page for the sedonainfo data source
- Register in mkdocs.yml navigation
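The canHandle() dispatch can be sketched in Python as follows. This is an illustration of the pattern, not the PR's Scala code: the `extract` method, its return shape, the `pick_extractor` function, and the extension checks are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class RasterFileMetadataExtractor(ABC):
    """Format-agnostic interface; only the can_handle idea comes from the PR."""

    @abstractmethod
    def can_handle(self, path: str) -> bool: ...

    @abstractmethod
    def extract(self, path: str) -> dict: ...


class GeoTiffMetadataExtractor(RasterFileMetadataExtractor):
    def can_handle(self, path):
        return path.lower().endswith((".tif", ".tiff"))

    def extract(self, path):
        # Placeholder result: a real extractor would read the file header here.
        return {"path": path, "extractor": type(self).__name__}


class NetCdfMetadataExtractor(RasterFileMetadataExtractor):
    def can_handle(self, path):
        return path.lower().endswith((".nc", ".nc4", ".netcdf"))

    def extract(self, path):
        return {"path": path, "extractor": type(self).__name__}


def pick_extractor(path, extractors):
    """Return the first registered extractor that claims the file."""
    for extractor in extractors:
        if extractor.can_handle(path):
            return extractor
    raise ValueError(f"no extractor registered for {path}")
```

Under this design, supporting a new format means adding one class and appending it to the list; the reader itself does not change.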
Generate COG files on-the-fly using RS_AsCOG and verify that sedonainfo correctly reports isTiled=true, non-empty overviews with proper level/width/height, and blockSize matching the requested tile size.
…ection
- Replace all inexact assertions (>0, !=0) with exact value matches for test1.tiff: width=512, height=517, srid=3857, fileSize=174803, band type=UNSIGNED_8BITS, blockSize=256x256, etc.
- Fix overview detection to use DatasetLayout.getNumInternalOverviews() instead of getResolutionLevels(), which returns synthetic tile-based levels even for non-COG files
- Add COG test that generates a COG on-the-fly via RS_AsCOG and verifies isTiled=true, 2 overviews, blockSize=256x256
…portsWrite, docs
- Fix isTiled: read TIFF TileWidth tag (322) from IIO metadata instead of RenderedImage tile size, which reports strips as tiles
- Fix colorInterpretation: derive from TIFF Photometric Interpretation tag (262) instead of copying band description. Maps to gdalinfo values: Gray, Red, Green, Blue, Alpha, Palette, Undefined
- Fix SupportsWrite: remove mixin, throw UnsupportedOperationException in newWriteBuilder since sedonainfo is read-only
- Fix docs: remove false claim about column pruning skipping extraction
- Fix compression: read from TIFF tag 259 description attribute for human-readable names (e.g., "LZW", "Deflate")
- Extract TIFF IIO metadata before reader.read() to avoid stream state issues
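To illustrate why tags 322, 262, and 259 are enough to answer these questions without touching pixels, here is a small Python sketch that parses the first IFD of a classic (non-BigTIFF) TIFF directly. It is not how the PR does it (the PR reads Java IIO metadata). The function names, the name tables, and the per-band colour mapping are assumptions for the example, and Alpha detection via ExtraSamples is deliberately left out.

```python
import struct

# Illustrative name tables keyed by the standard TIFF tag values.
COMPRESSION_NAMES = {1: "None", 5: "LZW", 7: "JPEG", 8: "Deflate", 32773: "PackBits"}


def read_first_ifd_tags(data: bytes) -> dict:
    """Return {tag_id: value} for single SHORT/LONG entries of the first IFD."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("only classic TIFF (magic 42) is handled")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, n = struct.unpack(order + "HHI", data[start:start + 8])
        raw = data[start + 8:start + 12]
        if n == 1 and field_type == 3:    # SHORT, left-justified in the value field
            tags[tag] = struct.unpack(order + "H", raw[:2])[0]
        elif n == 1 and field_type == 4:  # LONG
            tags[tag] = struct.unpack(order + "I", raw)[0]
    return tags


def color_interpretation(photometric, band_index):
    """Assumed mapping from tag 262 to a gdalinfo-style name for one band."""
    if photometric in (0, 1):             # WhiteIsZero / BlackIsZero
        return "Gray"
    if photometric == 2:                  # RGB
        return ("Red", "Green", "Blue")[band_index] if band_index < 3 else "Undefined"
    if photometric == 3:                  # palette colour
        return "Palette"
    return "Undefined"


def summarize_layout(tags):
    return {
        "isTiled": 322 in tags,  # TileWidth is present only in tiled files
        # Per the TIFF spec, a missing Compression tag means 1 (uncompressed).
        "compression": COMPRESSION_NAMES.get(tags.get(259, 1), "Unknown"),
        "firstBandColor": color_interpretation(tags.get(262), 0),
    }
```

A strip-organised file has no tag 322 at all, so checking for the tag avoids the strips-reported-as-tiles problem described above.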
…adata case classes
Make RasterFileMetadata consistent: all nested structures (bands, overviews, geoTransform, cornerCoordinates) use dedicated case classes.
- Add NetCdfMetadataExtractor implementing RasterFileMetadataExtractor
- Opens NetCDF files via UCAR cdm-core, extracts metadata without reading data arrays (only lat/lon coordinate arrays for spatial info)
- Maps data variables to bands (numBands = number of record variables)
- Reports dimensions and variables in metadata map
- Supports .nc/.nc4/.netcdf extensions
- Update glob patterns in SedonaInfoDataSource to include NetCDF files
- Add 7 exact-match tests using test.nc (O3/NO2 variables, 80x48 grid)
Pass requiredFields from Spark's readDataSchema to extractors so they can skip expensive work (bands, overviews, metadata, compression, CRS WKT) when those columns are not selected in the query.
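The idea of passing requiredFields down can be sketched in Python like this. The names `extract_row` and `producers` are hypothetical and stand in for whatever the extractor does per column; the point is only that a column's work runs when that column is requested.

```python
def extract_row(path, required_fields, producers):
    """Build one output row, invoking only the producers for requested columns.

    `producers` maps a column name to a function of the file path; expensive
    ones (bands, overviews, CRS WKT) are simply never called when pruned.
    """
    wanted = set(required_fields)
    return {name: produce(path) for name, produce in producers.items() if name in wanted}
```

With `required_fields=["path", "width"]`, a producer registered under `"bands"` is never invoked, which is where the saving comes from.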
Move NetCDF metadata extraction to a follow-up PR. This PR focuses on GeoTIFF metadata extraction via the sedonainfo data source.
- Add NetCdfMetadataExtractor implementing RasterFileMetadataExtractor
- Opens NetCDF files via UCAR cdm-core, extracts metadata without reading data arrays (only lat/lon coordinate arrays for spatial info)
- Maps data variables to bands (numBands = number of record variables)
- Reports dimensions and variables in metadata map
- Supports .nc/.nc4/.netcdf extensions
- Update glob patterns in SedonaInfoDataSource to include NetCDF files
- Add 7 exact-match tests using test.nc (O3/NO2 variables, 80x48 grid)
Summary
- NetCdfMetadataExtractor implementing RasterFileMetadataExtractor
- .nc/.nc4/.netcdf extensions
- test.nc
Depends on
Test plan
RS_FromNetCDF