Pandas and Dask test helper methods with beautiful error messages.

Last update: Nov 28, 2022

beavis

test helpers

These test helper methods are meant to be used in test suites. They provide descriptive error messages to allow for a seamless development workflow.

The test helpers are inspired by chispa and spark-fast-tests, popular test helper libraries for the Spark ecosystem.

Pandas ships built-in testing methods that can also be used, but their error messages are harder to parse. The following sections compare the built-in Pandas output with the Beavis output, so you can choose for yourself.

Column comparisons

The built-in assert_series_equal method does not make it easy to decipher the rows that are equal and the rows that are different, so quickly fixing your tests and maintaining flow is hard.

Here's the built-in error message when comparing series that are not equal.

import pandas as pd

df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])

>   ???
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [index]: [0, 1, 2, 3]
E   [left]:  [1042, 2, 9, 6]
E   [right]: [5, 2, 7, 6]

When the same columns are compared with beavis, the error message aligns the rows and highlights the mismatches in red.

import beavis

beavis.assert_pd_column_equality(df, "col1", "col2")
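
To give a feel for what a row-aligned report looks like, here is a minimal plain-Python sketch. It is not beavis's implementation: `aligned_mismatch_report` is a hypothetical helper name, and the layout is only an assumption about what "aligned rows" could mean.

```python
def aligned_mismatch_report(left, right, left_name="left", right_name="right"):
    """Build a row-by-row comparison report (hypothetical helper, not part of beavis)."""
    left, right = list(left), list(right)
    if len(left) != len(right):
        raise ValueError("inputs must have the same length")

    # Header row, then one line per row so equal and unequal rows sit side by side.
    lines = [f"{'row':>3} | {left_name:>6} | {right_name:>6} | status"]
    mismatches = 0
    for i, (a, b) in enumerate(zip(left, right)):
        equal = a == b
        if not equal:
            mismatches += 1
        status = "ok" if equal else "MISMATCH"
        lines.append(f"{i:>3} | {a!s:>6} | {b!s:>6} | {status}")
    return mismatches, "\n".join(lines)


# Same values as the col1 / col2 example above.
mismatches, report = aligned_mismatch_report(
    [1042, 2, 9, 6], [5, 2, 7, 6], "col1", "col2"
)
print(report)
```

The helper accepts any two equal-length iterables, so passing `df["col1"]` and `df["col2"]` should work as well.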

You can also compare columns in a Dask DataFrame.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
beavis.assert_dd_column_equality(ddf, "col1", "col2")

The assert_dd_column_equality error message is similarly descriptive.

DataFrame comparisons

The built-in pandas.testing.assert_frame_equal method doesn't output an error message that's easy to understand, as this example shows.

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
pd.testing.assert_frame_equal(df1, df2)

E   AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
E
E   DataFrame.iloc[:, 0] (column name="col1") values are different (50.0 %)
E   [index]: [0, 1]
E   [left]:  [1, 2]
E   [right]: [5, 2]

beavis provides a nicer error message.

beavis.assert_pd_equality(df1, df2)

DataFrame comparison options:

check_index (default True)
check_dtype (default True)
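
As a rough illustration of what each check guards against, the sketch below uses only the built-in pandas assertion. Whether beavis accepts these options as keyword arguments (for example `beavis.assert_pd_equality(df1, df2, check_dtype=False)`) is an assumption here, and resetting the index is just one way to approximate an index-insensitive comparison.

```python
import pandas as pd

int_df = pd.DataFrame({"col1": [1, 2]})
float_df = pd.DataFrame({"col1": [1.0, 2.0]})

# Same values, different dtypes (integer vs float): the strict comparison fails.
try:
    pd.testing.assert_frame_equal(int_df, float_df)
    strict_dtype_passed = True
except AssertionError:
    strict_dtype_passed = False

# Relaxing the dtype check lets the value-only comparison pass.
pd.testing.assert_frame_equal(int_df, float_df, check_dtype=False)

# Same rows, different index labels: the default comparison fails as well.
relabeled = int_df.copy()
relabeled.index = [10, 11]
try:
    pd.testing.assert_frame_equal(int_df, relabeled)
    strict_index_passed = True
except AssertionError:
    strict_index_passed = False

# Dropping the index on both sides compares the rows by position only.
pd.testing.assert_frame_equal(
    int_df.reset_index(drop=True), relabeled.reset_index(drop=True)
)
```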

Let's convert the Pandas DataFrames to Dask DataFrames and use the assert_dd_equality function to check they're equal.

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
beavis.assert_dd_equality(ddf1, ddf2)

These DataFrames aren't equal, so we'll get a good error message that's easy to debug.

Development

Install Poetry and run poetry install to create a virtual environment with all the Beavis dependencies on your machine.

Other useful commands:

poetry run pytest tests runs the test suite
poetry run black . formats the code
poetry build packages the library in a wheel file
poetry publish releases the library to PyPI (requires valid credentials)

Owner

Matthew Powers
