PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the documentation for the vanilla Python bindings (h3-py).

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*

Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following Spark dataframe, in which the column of H3 indices is created by passing the lat/lng pairs and the resolution to h3_pyspark.geo_to_h3(lat, lng, resolution). However, I encountered the following error when I tried to check whether the index column contains any nulls. It is not only isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

    dataframe: image

    errors: image

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings now say "superset", but at least it should be complete.

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Currently, for all UDFs, if any row in your dataframe has a null value, the entire build will fail.

    This type of behavior would be better and more resilient:

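    from pyspark.sql import functions as F, types as T

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        # Return null for a null geometry instead of failing the whole job
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)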
    
    opened by kevinschaich 1
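
The null check proposed in the issue above could be applied to every UDF through a shared decorator. The polyfill snippet further down this page uses a decorator named handle_nulls, but its implementation is not shown here, so the version below is only an assumption about how such a wrapper might look. It is plain Python with no Spark dependency, and describe_cell is a hypothetical stand-in for a real H3 call.

```python
import functools


def handle_nulls(func):
    """Return None when any positional argument is None; otherwise call func.

    Assumed behavior for illustration only, not the library's actual code.
    """
    @functools.wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@handle_nulls
def describe_cell(lat, lng, resolution):
    # Hypothetical stand-in for a function that would call into h3
    return f"{lat:.4f},{lng:.4f}@{resolution}"


print(describe_cell(37.769377, -122.388903, 9))
print(describe_cell(None, -122.388903, 9))
```

With a wrapper like this placed beneath the UDF decorator, a row with a missing coordinate would produce a null result instead of raising "TypeError: must be real number, not NoneType".
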
  • Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Screenshots (not reproduced here) show the previous and new behavior for a problematic line, and the previous and new behavior for a problematic polygon.

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Reported by @rwaldman: in the worst case, we can miss several hexes if a line's start and end points run east-to-west and lie towards the north or south edge:

    image

    The proposed solution is to interpolate several points along long line segments (length ≥ s, where s is the hex side length), based on the selected resolution, so that we catch the hexes in between:

    image
    opened by kevinschaich 0
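
The interpolation idea proposed in the issue above can be sketched as follows. This is a minimal illustration and not the library's actual fix: densify_segment and max_step are hypothetical names, max_step stands in for the hex side length at the chosen resolution, and points are interpolated linearly in plain coordinate space rather than along a great circle.

```python
import math


def densify_segment(start, end, max_step):
    """Return points from start to end spaced at most max_step apart.

    Points are (x, y) tuples; interpolation is linear in coordinate space.
    """
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    # Number of sub-segments needed so that each is at most max_step long
    steps = max(1, math.ceil(length / max_step))
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        for i in range(steps + 1)
    ]


points = densify_segment((0.0, 0.0), (1.0, 0.0), max_step=0.25)
print(points)
```

Each returned point could then be indexed on its own and the resulting cells deduplicated, so that hexes lying between the two endpoints are not skipped.
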
  • polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided. This is expected behavior when using the native h3 library.

    However, I thought it would be helpful if this library could accept multipolygons. Could I get permission to push a PR?

    implementation in src/h3_pyspark/__init__.py

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from the default h3 binding:
        # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    test in tests/test_core.py

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
            h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
            print(h3_pyspark_test_args)
            integer = 12
            data = {
                "res": integer,
                "geo_json_conformant": True,
                "geojson": multipolygon,
            }
            df = spark.createDataFrame([data])
            actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
            actual = actual.collect()[0]["actual"]
            print(actual)
            expected = []
            for i in json.loads(multipolygon)["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                expected.extend(list(h3.polyfill(_polygon, integer, True)))
            expected = sanitize_types(expected)
            assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
Releases (1.2.6)
  • 1.2.6 (Mar 10, 2022)

  • 1.2.4 (Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

  • 1.2.3 (Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

  • 1.1.0 (Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.