PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

H3 Logo

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PyPI version PyPI downloads conda version

Tests

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding documentation at:

Installation

From PyPI:

pip install h3-pyspark

From conda

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>> >>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution')) >>> df.show() +---------+-----------+----------+---------------+ | lat| lng|resolution| h3_9| +---------+-----------+----------+---------------+ |37.769377|-122.388903| 9|89283082e73ffff| +---------+-----------+----------+---------------+ ">
>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*
Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following spark dataframe and the column of h3 indices is created by applying the lat, lng pairs and the resolution to h3_pypark.geo_to_h3(lat, lng, resolution) function. However I encountered the following error when I tried to check if there's any null in the index column. And it's not only isNull() not working but also any other subsetting operations which all throw me the same error, could anyone provide some insights on what might be the issue and how to fix it? Thanks in advance!

    dataframe: image

    errors: image

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Fix indexing for polygons and lines

    Catches some edge cases where h3_line and polyfill would miss. Could be overbroad, which is why the docstrings are changed to say superset, but at least it should be complete

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Better error handling when null values are passed in

    Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire build will fail.

    This type behavior would be better/more resilient:

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)
    
    opened by kevinschaich 1
  • Fix bug in index_shape function which missed hexes for long line segments

    Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Previous behavior for problematic line:

    Screen Shot 2022-02-24 at 3 40 36 PM

    New behavior for same line:

    Screen Shot 2022-02-24 at 4 02 47 PM

    Previous behavior for problematic polygon:

    Screen Shot 2022-02-24 at 4 34 59 PM

    New behavior for same polygon:

    Screen Shot 2022-02-24 at 4 35 46 PM

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Bug in index_shape function which misses several hexes

    Reported by @rwaldman – we can miss several hexes in the worst case if a line's start and endpoints are east-to-west and towards the north or south edge:

    image

    Proposed solution is for long line segments (≥ s where s = hex side length) to interpolate several points along the line based on the selected resolution, so that we catch the ones in between:

    image
    opened by kevinschaich 0
  • polyfill fails with valid multipolygon geojson

    polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid multipolygon geojson is provided this is expected behavior when utilizing the h3 native library.

    however, i thought it would be helpful if this library is able to accept multipolygons. could I get permission to push a PR?

    implementation in src/h3_pyspark/__init__.py

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from default
        # h3-pyspark expect `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    test in tests/test_core.py

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
            h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
            print(h3_pyspark_test_args)
            integer = 12
            data = {
                "res": integer,
                "geo_json_conformant": True,
                "geojson": multipolygon,
            }
            df = spark.createDataFrame([data])
            actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
            actual = actual.collect()[0]["actual"]
            print(actual)
            expected = []
            for i in json.loads(multipolygon)["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                expected.extend(list(h3.polyfill(_polygon, integer, True)))
            expected = sanitize_types(expected)
            assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
Releases(1.2.6)
  • 1.2.6(Mar 10, 2022)

  • 1.2.4(Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

    Source code(tar.gz)
    Source code(zip)
  • 1.2.3(Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

    Source code(tar.gz)
    Source code(zip)
  • 1.1.0(Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

    Source code(tar.gz)
    Source code(zip)
Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.
Kevin Schaich
Intake is a lightweight package for finding, investigating, loading and disseminating data.

Intake: A general interface for loading data Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps

Intake 851 Jan 01, 2023
This mini project showcase how to build and debug Apache Spark application using Python

Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark a

Denny Imanuel 1 Dec 29, 2021
A python package which can be pip installed to perform statistics and visualize binomial and gaussian distributions of the dataset

GBiStat package A python package to assist programmers with data analysis. This package could be used to plot : Binomial Distribution of the dataset p

Rishikesh S 4 Oct 17, 2022
A fast, flexible, and performant feature selection package for python.

linselect A fast, flexible, and performant feature selection package for python. Package in a nutshell It's built on stepwise linear regression When p

88 Dec 06, 2022
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

HoloViz 2.9k Jan 06, 2023
Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Numerics Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production Use procedure: Initialise a new i

George Whittle 1 Nov 13, 2021
Catalogue data - A Python Scripts to prepare catalogue data

catalogue_data Scripts to prepare catalogue data. Setup Clone this repo. Install

BigScience Workshop 3 Mar 03, 2022
Statsmodels: statistical modeling and econometrics in Python

About statsmodels statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics an

statsmodels 8k Dec 29, 2022
A forecasting system dedicated to smart city data

smart-city-predictions System prognostyczny dedykowany dla danych inteligentnych miast Praca inżynierska realizowana przez Michała Stawikowskiego and

Kevin Lai 1 Nov 08, 2021
Tkinter Izhikevich Neuron Model With Python

TKINTER IZHIKEVICH NEURON MODEL WITH PYTHON Hodgkin-Huxley Model It is a mathematical model for the generation and transmission of action potentials i

Rabia KOÇ 8 Jul 16, 2022
Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
A collection of robust and fast processing tools for parsing and analyzing web archive data.

ChatNoir Resiliparse A collection of robust and fast processing tools for parsing and analyzing web archive data. Resiliparse is part of the ChatNoir

ChatNoir 24 Nov 29, 2022
🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

🧪📈 🐍. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python a

Marc Skov Madsen 97 Dec 08, 2022
Hydrogen (or other pure gas phase species) depressurization calculations

HydDown Hydrogen (or other pure gas phase species) depressurization calculations This code is published under an MIT license. Install as simple as: pi

Anders Andreasen 13 Nov 26, 2022
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

Bell Eapen 14 Jan 02, 2023
A set of procedures that can realize covid19 virus detection based on blood.

A set of procedures that can realize covid19 virus detection based on blood.

Nuyoah-xlh 3 Mar 07, 2022
BioMASS - A Python Framework for Modeling and Analysis of Signaling Systems

Mathematical modeling is a powerful method for the analysis of complex biological systems. Although there are many researches devoted on produ

BioMASS 22 Dec 27, 2022
Jupyter notebooks for the book "The Elements of Statistical Learning".

This repository contains Jupyter notebooks implementing the algorithms found in the book and summary of the textbook.

Madiyar 369 Dec 30, 2022
Implementation in Python of the reliability measures such as Omega.

OmegaPy Summary Simple implementation in Python of the reliability measures: Omega Total, Omega Hierarchical and Omega Hierarchical Total. Name Link O

Rafael Valero Fernández 2 Apr 27, 2022