eo-grow

Earth observation framework for scaled-up processing in Python.

Analyzing Earth Observation (EO) data is complex, and solutions often require custom-tailored algorithms. In the EO domain, most problems come with an additional challenge: how do we apply the solution at a larger scale?

The eo-learn package makes working with EO data easy, while the eo-grow package takes care of running solutions at a large scale. In eo-grow, an EOWorkflow-based solution is wrapped in a pipeline object, which takes care of parametrization, logging, storage, multiprocessing, EOPatch management, and more. However, pipelines are not necessarily bound to EOWorkflow execution and can be used for other tasks, such as training ML models.

Features of eo-grow include:

  • Direct use of EOWorkflow procedures
  • Parametrization of workflows through validated configuration files, making executions easy to reproduce and adjust
  • Easy use of both local and S3 storage, with no code adaptation required
  • Workflows that can be run single-process, multi-process, or even on multiple machines (using Ray clusters)
  • A collection of basic pipelines, with methods that can be overridden to suit a wide range of use cases
  • Execution reports and customizable logging
  • Options for skipping already-processed data when re-running a pipeline
  • A CLI for running pipelines, validating configuration files, and generating templates

General Structure Overview

The core object of eo-grow is the Pipeline. Each pipeline has a run_procedure method, which is executed after the pipeline is set up. By default, the run_procedure executes an EOWorkflow built by the (user-defined) build_workflow method.
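For illustration, here is a minimal sketch of a custom pipeline. This is a hedged example: the eogrow.core.pipeline import path and the exact base-class interface are assumptions based on the description above, while LoadTask/SaveTask and the workflow classes come from eo-learn.

from eolearn.core import EONode, EOWorkflow, LoadTask, SaveTask

from eogrow.core.pipeline import Pipeline  # assumed import path


class MyPipeline(Pipeline):
    """A hypothetical pipeline that loads EOPatches and saves them again."""

    def build_workflow(self) -> EOWorkflow:
        # The default run_procedure executes the workflow returned here.
        load_node = EONode(LoadTask("input_folder"))
        save_node = EONode(SaveTask("output_folder"), inputs=[load_node])
        return EOWorkflow([load_node, save_node])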

Each pipeline is linked to so-called managers:

  • StorageManager handles loading and saving of files
  • AreaManager defines the area of interest and how it should be split into EOPatches
  • EOPatchManager takes care of listing EOPatches and handling their storage details
  • LoggingManager provides control over logging

(Diagram: eo-grow structure, linking pipelines to their managers)

Managers and pipelines usually require a large number of parameters (setting storage paths, configuring log parameters, etc.), which are provided in .json configuration files. Each eo-grow object contains a special Schema class, which is a pydantic model describing the parameters of the object. Config files are validated before execution to catch issues early. Templates for config files can be generated with the eogrow-template CLI command.
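As a rough sketch, parameter declaration and validation with pydantic works like this (the Schema fields below are hypothetical, not actual eo-grow parameters):

from pydantic import BaseModel, Field, ValidationError


class Schema(BaseModel):  # in eo-grow, such a class is nested inside the object it configures
    """Hypothetical parameter schema, validated before anything is executed."""

    folder_key: str = Field(description="Storage folder to read from")
    workers: int = 1


try:
    Schema(folder_key="data", workers="not-a-number")
except ValidationError as error:
    print(error)  # the bad type is caught early, before the pipeline runs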

To make config files easier to write, eo-grow uses a simple config language that supports importing other configs, variables, and more.
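For example, a config can pull shared settings from another file with a "**" import and a ${config_path} variable (both visible in the issue config quoted further below), and it can be resolved in Python with the interpret functions mentioned in the v1.1.0 changelog. The module path and pipeline path below are assumptions:

# Hypothetical export_maps.json:
# {
#   "pipeline": "eogrow.pipelines.export_maps.ExportMapsPipeline",
#   "**storage": "${config_path}/common_storage.json"
# }

from eogrow.core.config import interpret_config_from_path  # assumed import path

config = interpret_config_from_path("export_maps.json")  # imports and variables resolved here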

Installation

PyPI distribution

Unavailable until the eo-learn 1.0.0 release.

The eo-grow package requires Python version >= 3.8 and can be installed with

pip install eo-grow

Command Line Interface

The easiest way to run pipelines is through the CLI provided by eo-grow. For all options, use the --help flag with each command.

  • eogrow <config> executes the pipeline defined in the <config> file
  • eogrow-validate <config> only performs validation of the <config> file
  • eogrow-test <config> initializes the pipeline/object but does not run it. Useful for testing whether managers are set correctly or for generating area-split grids
  • eogrow-ray <cluster-yaml> <config> executes the pipeline defined in <config> on the active Ray cluster defined by the <cluster-yaml> file
  • eogrow-template <import-path> <template-file> generates a template config for the object specified by the <import-path> and saves it to the <template-file> (or outputs it directly if no file is provided)

Documentation

Explanatory examples can be found here.

More details on the config language used by eo-grow can be found here.

Questions and Issues

Feel free to ask questions about the package and its use cases on the Sentinel Hub forum, or raise an issue on GitHub.

License

See LICENSE.

Comments
  • Make export pipeline logs more readable

    Silences the output of gdal calls in favor of tqdm, making the logs much more readable.

    In the logs there was a constant warning:

    Warning 1: General options of gdal_translate make the COPY_SRC_OVERVIEWS creation option ineffective as they hide the overviews
    

    I have removed this option in this MR, but it should be investigated whether that is really the way to go. Link to cogification docs

    opened by zigaLuksic 7
  • [BUG] Issues running the batch_to_eopatch pipeline

    Question

    I have successfully run the batch download pipeline and would like to convert the batch tiles to eopatches. After locally fixing #12 I've managed to run the batch_to_eopatch pipeline, but I get the following exception in the logs:

    Summary of exceptions
    
        LoadUserDataTask (LoadUserDataTask-29825b248e7b11ecbc3b-f57730fc0853):
            14 times:
    
            TypeError: execute() missing 1 required positional argument: 'eopatch'
    

    Which is weird, because the LoadUserDataTask is the first Task and no eopatch arguments should be expected.

    Here is my config:

    {
      "pipeline": "eogrow.pipelines.batch_to_eopatch.BatchToEOPatchPipeline",
      "folder_key": "data",
      "mapping": [
        {"batch_files": ["B01.tif"], "feature_type": "data", "feature_name": "B01", "multiply_factor": 1e-4},
        {"batch_files": ["B02.tif"], "feature_type": "data", "feature_name": "B02", "multiply_factor": 1e-4},
        {"batch_files": ["B03.tif"], "feature_type": "data", "feature_name": "B03", "multiply_factor": 1e-4},
        {"batch_files": ["B04.tif"], "feature_type": "data", "feature_name": "B04", "multiply_factor": 1e-4},
        {"batch_files": ["B05.tif"], "feature_type": "data", "feature_name": "B05", "multiply_factor": 1e-4},
        {"batch_files": ["B06.tif"], "feature_type": "data", "feature_name": "B06", "multiply_factor": 1e-4},
        {"batch_files": ["B07.tif"], "feature_type": "data", "feature_name": "B07", "multiply_factor": 1e-4},
        {"batch_files": ["B08.tif"], "feature_type": "data", "feature_name": "B08", "multiply_factor": 1e-4},
        {"batch_files": ["B8A.tif"], "feature_type": "data", "feature_name": "B8A", "multiply_factor": 1e-4},
        {"batch_files": ["B09.tif"], "feature_type": "data", "feature_name": "B09", "multiply_factor": 1e-4},
        {"batch_files": ["B10.tif"], "feature_type": "data", "feature_name": "B10", "multiply_factor": 1e-4},
        {"batch_files": ["B11.tif"], "feature_type": "data", "feature_name": "B11", "multiply_factor": 1e-4},
        {"batch_files": ["B12.tif"], "feature_type": "data", "feature_name": "B12", "multiply_factor": 1e-4},
        {"batch_files": ["CLP.tif"], "feature_type": "data", "feature_name": "CLP", "multiply_factor": 0.00392156862745098},
        {"batch_files": ["CLM.tif"], "feature_type": "mask", "feature_name": "CLM"},
        {"batch_files": ["dataMask.tif"], "feature_type": "mask", "feature_name": "dataMask"}
      ],
      "userdata_feature_name": "BATCH_INFO",
      "userdata_timestamp_reader": "eogrow.utils.batch.read_timestamps_from_orbits",
      "**global_settings": "${config_path}/sentinel2_l1c_batch_config.json"
    }
    

    Let me know if you need to see what sentinel2_l1c_batch_config.json looks like.

    bug 
    opened by mlubej 5
  • Add raster_shape param to rasterize pipeline

    Exposes another parameter of the rasterization task.

    I also noticed a common pattern of validators, which I managed to extract into an ensure_exactly_one_defined function. It was tested locally.

    opened by zigaLuksic 4
  • Hardcode the compression when saving

    The parameter is never set to anything other than 1, except by mistake. With this we get rid of some code complexity and inconsistency.

    But I'm not entirely sure if this is a step in the right direction :/

    opened by zigaLuksic 3
  • Add warp resampling when merging tiffs.

    By switching from gdal_merge to gdalwarp, we can now specify how to resample tiffs that are warped. This is a possible improvement for pixel misalignment.

    Another benefit is that gdal_merge loads all files into memory, while gdalwarp is much more conservative with memory.
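    For reference, a warp-based merge with explicit resampling through GDAL's Python bindings looks roughly like this (a sketch, not necessarily the exact call this PR makes):

    from osgeo import gdal

    # Merge tiles into a single tiff with an explicit resampling algorithm;
    # gdalwarp processes inputs in chunks instead of loading them whole.
    gdal.Warp("merged.tif", ["tile_1.tif", "tile_2.tif"], resampleAlg="bilinear")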

    opened by zigaLuksic 3
  • Batch area manager rework

    1. Extracts AOI handling into a BaseSplitterAreaManager (since it's common to both splitter-based managers)
    2. Implements the new batch area manager
    3. Adds tests for it (with some mocking)
    4. Fixes some old tests as well; I learned some new tricks and wanted to apply them before I forgot
    opened by zigaLuksic 2
  • [FEAT] Make EONode construction more user friendly

    What is the problem? Please describe.

    Imagine a scenario where you are researching a workflow of nodes that form an acyclic graph. You write a task and add it to a node. You mess around, change things, explore, like researchers do. In the end you use the nodes to construct the workflow and run it.

    What can happen (speaking from experience):

    • you create a task but forget to use the same task in the related node (the old one is used)
    • you link the tasks wrong, potentially missing out on a branch of the workflow
    • it is hard to keep track of a list of all the nodes; first you have to define the node objects and then add them to a list

    Alternatives

    It would be helpful if this were somehow better managed, offering the user an easier way to construct a list of nodes with fewer potential mistakes.

    The first idea I had was an additional method on EOTask, which you would call like this:

    nodes_list = []

    my_created_task = MyCreatedTask(*args, **kwargs)
    my_created_node = my_created_task.get_node(input_nodes=[], nodes_list=nodes_list)

    my_next_created_task = MyNextCreatedTask(*args, **kwargs)
    my_next_created_node = my_next_created_task.get_node(input_nodes=[my_created_node], nodes_list=nodes_list)

    ...

    The nodes my_*_created_node are created and automatically added to the nodes_list object.

    For simple linear graphs, input_nodes could default to [nodes_list[-1]], which points to the last node added to the list.

    Again, this is just the first thing that came to mind, and I'm not sure it's the best. I also thought about using decorators, but didn't manage to find a way where they could be used.
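    For concreteness, the proposed behaviour could be approximated with a small helper on top of the existing EONode (a sketch, not an existing eo-learn API):

    from typing import List, Optional, Sequence

    from eolearn.core import EONode, EOTask


    def make_node(task: EOTask, nodes_list: List[EONode], input_nodes: Optional[Sequence[EONode]] = None) -> EONode:
        """Create a node, append it to nodes_list, and return it."""
        if input_nodes is None:
            # default for simple linear graphs: link to the last added node
            input_nodes = [nodes_list[-1]] if nodes_list else []
        node = EONode(task, inputs=input_nodes)
        nodes_list.append(node)
        return node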

    enhancement 
    opened by mlubej 2
  • ZipMapPipeline

    Adds a new ZipMapPipeline and deprecates the MappingPipeline, since it is subsumed by its successor.

    The MappingPipeline is not yet removed; it just emits a warning on use, but it is no longer part of the test suite.

    opened by zigaLuksic 2
  • Byoc friendly export of temporal features

    Adds an option to temporally split maps when exporting. This will make it much simpler to do BYOC ingestion.

    I also resolved some path-juggling by enforcing the rule that all paths are relative to the filesystem used (storage or tempfs), and system paths are used only when calling the gdal functions.

    Each map is suffixed with the timestamp in a near-ISO format (the : character is problematic on Windows, so I tried to avoid it).
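    For example, such a Windows-safe suffix can be produced like this (an illustration, not necessarily the exact format used in the PR):

    from datetime import datetime

    suffix = datetime(2022, 5, 1, 10, 30).isoformat().replace(":", "-")  # '2022-05-01T10-30-00'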

    The output result of the test is rather huge (3k+ lines) due to around 30 timestamps... I would want it to be run as part of the test chain that is done on GitHub, and for that I cannot give it less data. But perhaps we could just run a smaller pipeline in the chain test and keep this one only as part of the large test suite? Then I could switch it over to some smaller data as well.

    opened by zigaLuksic 2
  • Fix setting of nodata in export2tiff pipeline, round 2

    The ExportToTiff pipeline didn't behave as expected. When merging tiffs, I wanted to set the empty space to the no-data value; it turned out there were issues because of:

    • not setting the -init param
    • not passing the values as strings (- was understood as a parameter?)

    Thanks @batic for the help.

    bug 
    opened by mlubej 2
  • Handling of nonexistent aws_profile

    This PR changes the behavior so that, if the given AWS profile doesn't exist, a warning is shown instead of an error. It also adds a test that checks this.

    The reason for this is purely practical: if you run a pipeline on an AWS instance with a role that already allows access to S3 buckets, you don't need AWS profiles. But if you run a pipeline locally, you need to specify aws_profile in order to access S3 buckets. So, to avoid constantly changing the aws_profile parameter, it seems easier to just give a warning instead of an error. Although I'm not 100% sure this really justifies the change. :thinking:

    opened by AleksMat 2
  • The great switch

    Ported all the pipelines to the new managers. Removed only what needed to be removed from the old ones (utm_area and eopatch were causing tests to fail, so they were already removed); the rest will be removed later.

    It went surprisingly smoothly. We really managed to follow:

    1. Make the change easy (this is hard)
    2. Make the easy change
    opened by zigaLuksic 1
Releases (v1.3.3)
  • v1.3.3 (Nov 17, 2022)

    Changelog:

    • Added ImportTiffPipeline for importing a tiff file into EOPatches.
    • ExportMapsPipeline now runs in parallel (single-machine only).
    • Fixed issue where ExportMapsPipeline consumed increasing amounts of storage space.
    • Area and eopatch managers for batch grids now warn the user if not linked correctly.
    • Added pyogrio as a possible geopandas backend for IO (experimental).
    • Added support for geopandas version 0.12.
    • Improved types after mypy version 0.990.
    • Removed utils.enum and old style of templating due to non-use.
    • Other various improvements and clean-ups.
  • v1.3.2 (Oct 24, 2022)

    Changelog:

    • Greatly improved ExportMapsPipeline and IngestByocTilesPipeline, which are now also able to export and ingest temporal BYOC collections
    • Improved test suite for exporting maps and ingesting BYOC collections
    • Fixed code according to newly exposed eolearn.core types
    • Fixed broken github links in documentation
    • Improvements to CI, added pre-commit hooks to the repository
  • v1.3.1 (Aug 31, 2022)

    Changelog:

    • BYOC ingestion pipeline is better at handling CRS objects
    • Because pydantic now type-checks default factories, two custom factories list_factory and dict_factory have been added, since using just list currently clashes with fields of the kind List[int].
  • v1.3.0 (Aug 30, 2022)

    Changelog:

    • Added IngestByocTiles pipeline, which creates or updates a BYOC collection from maps exported via ExportMapsPipeline.
    • Greatly improved DataCollection parser, which can now parse DataCollectionSchema objects instead of just names.
    • Added tests for validator utility functions.
    • New general validators ensure_defined_together and ensure_exactly_one_defined for verifying optional parameters.
    • Documentation of Schema objects is now much more verbose.
    • ExportMapsPipeline now saves maps into subfolders (per UTM zone).
    • Fixed an issue where ExportMapsPipeline ignored dtype and nodata when merging.
    • Improved handling of aws_profile parameter in storage managers.
    • RasterizePipeline now has an additional raster_shape parameter.
  • v1.2.0 (Jul 27, 2022)

    Changelog:

    • Fixed a bug in BatchToEOPatchPipeline where the temporal dimension of some imported features could be reversed. Memory-optimization functionalities have been reverted.
    • Improved the way filesystem object is passed to EOTasks in EOWorkflows. These changes are a consequence of changes in eo-learn==1.2.0.
    • Added support for aws_acl parameter into Storage schema.
    • Download pipelines now support an optional size parameter.
    • Official support for Python 3.10.
    • Large changes in testing utilities. Statistics produced by ContentTester have been changed and are now more descriptive.
    • Improvements in code-style checkers and CI.
  • v1.1.1 (Jun 14, 2022)

    Changelog:

    • Support session sharing in download pipelines.
    • Improved BatchAreaManager bounding boxes.
    • Improved memory footprint of various pipelines.
    • Disabled skip_existing and eopatch_list at validation time for pipelines that do not support filtration.
    • Support for rasterization of temporal vector features from files.
    • Docs are now built automatically and the type annotations are included in parameter descriptions, resulting in better readability.
    • Many minor improvements and fixes in code, tests, and documentation.
  • v1.1.0 (May 3, 2022)

    Changelog:

    • Large changes in config objects and schemas:

      • replaced Config object with config utility functions collect_configs_from_path, interpret_config_from_dict, and interpret_config_from_path,
      • pipeline and manager config objects are now pydantic schema classes, which are fully typed objects,
      • removed ${env:variable} from the config language.
    • Changes in area managers:

      • added AreaManager.cache_grid method,
      • improved functionalities of BatchAreaManager; instead of tile_buffer it now uses tile_buffer_x and tile_buffer_y config parameters (code-breaking),
      • improved UtmZoneAreaManager, replaced patch_buffer config parameter with patch_buffer_x and patch_buffer_y which now work with absolute instead of relative buffers (code-breaking),
      • implemented grid transformation methods for UtmZoneAreaManager and BatchAreaManager.
    • Other core improvements:

      • added EOGrowObject.from_raw_config and EOGrowObject.from_path methods,
      • fixed an issue in EOPatchManager,
      • improvements of pipeline logging, logging handlers, and filters.
    • Pipeline improvements:

      • Implemented SwitchGridPipeline for converting data between tiling grids.
      • Large updates of BatchDownloadPipeline with restructured config schema and additional functionalities.
      • BatchToEOPatchPipeline now works with input_folder_key and output_folder_key instead of folder_key and has an option not to delete input data. A few issues in the pipeline were fixed and unit tests were added.
      • Minor improvements of config parameters in MergeSamplesPipeline and prediction pipelines.
      • Implemented DummyDataPipeline for generating data for unit tests.
    • New tasks:

      • SpatialJoinTask and SpatialSliceTask for spatial operations on EOPatches,
      • DummyRasterFeatureTask and DummyTimestampFeatureTask for creating EOPatches with dummy data.
    • Updates in utilities:

      • added utilities for spatial operations and grid transformations,
      • implemented eogrow.utils.fs.LocalFolder abstraction,
      • renamed get_patches_without_all_features to get_patches_with_missing_features in eogrow.utils.filter (code-breaking),
      • updated eogrow.utils.testing.run_and_test_pipeline to work with a list of pipeline configs.
    • Created the eo-grow package documentation page.

    • eo-grow is now a fully typed package. Added mypy and isort code checking to CI.

    • Updated tutorial notebooks to work with the latest code.

    • Many minor improvements and fixes in code, tests, and documentation.

  • v1.0.0 (Feb 10, 2022)

Owner

Sentinel Hub (Sentinel Hub services by Sinergise)