Combines Bayesian analyses from many datasets.

Overview

PosteriorStacker

Introduction

Fitting a model to a data set gives posterior probability distributions for a parameter of interest. But how do you combine such probability distributions if you have many datasets?

This question arises frequently in astronomy when analysing samples, and trying to infer sample distributions of some quantity.

PosteriorStacker derives sample distributions from the posterior distributions of a number of objects.

Method

The method is described in Appendix A of Baronchelli, Nandra & Buchner (2020).

hbm.png

The inputs are posterior samples of a single parameter, for a number of objects. These need to come from pre-existing analyses, under a flat parameter prior.
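As a minimal sketch of what such an input file could look like, the snippet below writes made-up posterior samples to a text file with one object per line, as the program's help text describes. The file name, the number of objects and samples, and the use of whitespace-separated columns with the same number of samples per object are assumptions for illustration; tutorial/gendata.py is the reference for the demo inputs.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(1)

# Made-up numbers for illustration: 5 objects, 1000 posterior samples each.
n_objects, n_samples = 5, 1000
centres = rng.normal(0.0, 10.0, size=n_objects)
widths = rng.uniform(5.0, 20.0, size=n_objects)

# Row i holds the posterior samples of object i (here simply Gaussian draws).
samples = rng.normal(centres[:, None], widths[:, None], size=(n_objects, n_samples))

# One object per line, samples separated by whitespace (assumed layout).
path = os.path.join(tempfile.mkdtemp(), "my_posteriorsamples.txt")
np.savetxt(path, samples)

# Reading the file back returns one row per object.
loaded = np.loadtxt(path)
print(loaded.shape)  # (5, 1000)
```

In a real analysis, each row would instead hold the posterior samples produced by the fit of that object.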

The hierarchical Bayesian model (illustrated above) models the sample distribution as a Gaussian with unknown mean and standard deviation. The per-object parameters are also unknown, but integrated out numerically using the posterior samples.
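The numerical integration can be sketched as follows. Because the per-object posteriors were obtained under a flat prior, the likelihood of the population parameters can be approximated, for each object, by averaging the population density over that object's posterior samples: L(mean, std) = prod_i (1/N) sum_j Normal(x_ij | mean, std). The function name and array layout below are assumptions for illustration, not the code in posteriorstacker.py.

```python
import numpy as np


def gaussian_population_loglike(samples, mean, std):
    """Log-likelihood of a Gaussian sample distribution.

    samples: array of shape (n_objects, n_samples) holding per-object
    posterior samples obtained under a flat prior.
    """
    samples = np.asarray(samples, dtype=float)
    # log Normal(x_ij | mean, std) for every sample j of every object i
    logpdf = -0.5 * ((samples - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))
    # Monte Carlo average over the samples of each object, in log space
    log_per_object = np.logaddexp.reduce(logpdf, axis=1) - np.log(samples.shape[1])
    # Objects are independent, so their log-likelihoods add up
    return log_per_object.sum()


# Usage with made-up data: 20 objects scattered around 0 with spread 10,
# each measured with an uncertainty of 5.
rng = np.random.default_rng(0)
truths = rng.normal(0.0, 10.0, size=20)
demo_samples = rng.normal(truths[:, None], 5.0, size=(20, 500))
print(gaussian_population_loglike(demo_samples, mean=0.0, std=10.0))
print(gaussian_population_loglike(demo_samples, mean=50.0, std=10.0))
```

The second call places the population far from the data and therefore returns a much lower log-likelihood. A nested sampler such as UltraNest would explore mean and std with a function of this kind.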

In addition to the Gaussian model (as in the paper), a histogram model (using a flat Dirichlet prior distribution) is computed, which is non-parametric and more flexible. Both models are inferred using UltraNest.
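A histogram model of this kind can be sketched in the same way. The population density is piecewise constant: bin k has weight w_k, the weights sum to one, and the density inside a bin is w_k divided by the bin width. Averaging this density over an object's posterior samples reduces to a weighted sum of the fraction of samples in each bin. The function names are hypothetical, and the mapping from unit-cube coordinates to flat-Dirichlet weights uses the standard normalised-exponential construction; the actual script may differ in its details.

```python
import numpy as np


def bin_fractions(samples, low, high, nbins):
    """Fraction of each object's posterior samples falling into each bin."""
    samples = np.asarray(samples, dtype=float)
    edges = np.linspace(low, high, nbins + 1)
    counts = np.array([np.histogram(row, bins=edges)[0] for row in samples])
    # Samples outside [low, high] are not counted in any bin
    return edges, counts / samples.shape[1]


def histogram_population_loglike(fractions, weights, bin_width):
    """Log-likelihood of bin weights, given precomputed per-object bin fractions."""
    # Average population density over each object's samples
    density = fractions @ (np.asarray(weights, dtype=float) / bin_width)
    # Tiny offset guards against log(0) for objects with no support
    return np.log(density + 1e-300).sum()


def flat_dirichlet_from_unit_cube(u):
    """Map independent Uniform(0, 1) numbers to flat-Dirichlet weights."""
    g = -np.log(np.asarray(u, dtype=float))  # exponential variates
    return g / g.sum()


# Toy usage: two objects, two bins spanning 0 to 2.
edges, fractions = bin_fractions([[0.5, 1.5], [0.5, 0.5]], 0.0, 2.0, 2)
weights = flat_dirichlet_from_unit_cube([0.3, 0.3])
print(weights, histogram_population_loglike(fractions, weights, edges[1] - edges[0]))
```

Precomputing the bin fractions once keeps each likelihood evaluation cheap, since it is then a single matrix-vector product.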

The output is visualised in a publication-ready plot.

Synopsis of the program:

$ python3 posteriorstacker.py --help
usage: posteriorstacker.py [-h] [--verbose VERBOSE] [--name NAME]
                           filename low high nbins

Posterior stacking tool.

Johannes Buchner (C) 2020-2021

Given posterior distributions of some parameter from many objects,
computes the sample distribution, using a simple hierarchical model.

The method is described in Baronchelli, Nandra & Buchner (2020)
https://ui.adsabs.harvard.edu/abs/2020MNRAS.498.5284B/abstract
Two computations are performed with this tool:

- Gaussian model (as in the paper)
- Histogram model (using a Dirichlet prior distribution)

The histogram model is non-parametric and more flexible.
Both models are computed using UltraNest.
The output is plotted.

positional arguments:
  filename           Filename containing posterior samples, one object per line
  low                Lower end of the distribution
  high               Upper end of the distribution
  nbins              Number of histogram bins

optional arguments:
  -h, --help         show this help message and exit
  --verbose VERBOSE  Show progress
  --name NAME        Parameter name (for plot)

Johannes Buchner (C) 2020-2021 

Licence

AGPLv3 (see COPYING file). Contact me if you need a different licence.

Install

Clone or download this repository. You need to install the ultranest Python package (e.g., with pip).

Tutorial

In this tutorial you will learn:

  • How to find an intrinsic distribution from data with asymmetric error bars and upper limits
  • How to use PosteriorStacker

Let's say we want to find the intrinsic velocity dispersion given some noisy data points.

Our data are velocity measurements of a few globular clusters in a dwarf galaxy, fitted with some model.

Preparing the inputs

For generating the demo input files and plots, run:

$ python3 tutorial/gendata.py

Visualise the data

Let's plot the data first to see what is going on:

example.png

Caveat on language: These are not actually "the data" (which are counts on a CCD). Instead, this is an intermediate representation of a posterior/likelihood, assuming flat priors on velocity.

Data properties

This scatter plot shows:

  • large, sometimes asymmetric error bars
  • intrinsic scatter

Resampling the data

We could also represent each data point by a cloud of samples. Each sample represents a possible true value for that object.

example-samples.png
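One way to turn a measurement with asymmetric error bars into such a cloud is to draw from a split normal (two half-Gaussians joined at the best-fit value). This is only an illustrative assumption: the function name and numbers are hypothetical, and tutorial/gendata.py may generate its samples differently.

```python
import numpy as np


def split_normal_samples(value, err_lo, err_hi, size, rng):
    """Draw samples from a split normal centred on value.

    err_lo and err_hi are the lower and upper 1-sigma error bars.
    """
    # The probability mass below the mode is err_lo / (err_lo + err_hi)
    below = rng.random(size) < err_lo / (err_lo + err_hi)
    # Absolute value of a standard normal: a half-Gaussian deviate
    half = np.abs(rng.standard_normal(size))
    return np.where(below, value - err_lo * half, value + err_hi * half)


# Made-up measurement: 10 km/s with error bars of -5 and +15 km/s.
rng = np.random.default_rng(0)
cloud = split_normal_samples(10.0, 5.0, 15.0, 20000, rng)
print(np.percentile(cloud, [16, 50, 84]))
```

When the original analysis already produced posterior samples, those should be used directly rather than an approximation of this kind.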

Running PosteriorStacker

We run the script over the range -80 to +80 km/s, with 11 histogram bins:

$ python3 posteriorstacker.py posteriorsamples.txt -80 +80 11 --name="Velocity [km/s]"
fitting histogram model...
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=-1e+01
[ultranest] Likelihood function evaluations: 114176
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest]   logZ = -20.68 +- 0.06865
[ultranest] Effective samples strategy satisfied (ESS = 684.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.08 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.14, need <0.5)
[ultranest]   logZ error budget: single: 0.07 bs:0.07 tail:0.41 total:0.41 required:<0.50
[ultranest] done iterating.

logZ = -20.677 +- 0.424
  single instance: logZ = -20.677 +- 0.074
  bootstrapped   : logZ = -20.676 +- 0.123
  tail           : logZ = +- 0.405
insert order U test : converged: False correlation: 377.0 iterations

    bin1                0.051 +- 0.046
    bin2                0.052 +- 0.051
    bin3                0.065 +- 0.058
    bin4                0.062 +- 0.057
    bin5                0.108 +- 0.085
    bin6                0.31 +- 0.14
    bin7                0.16 +- 0.10
    bin8                0.051 +- 0.050
    bin9                0.047 +- 0.044
    bin10               0.048 +- 0.047
    bin11               0.047 +- 0.045
fitting gaussian model...
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=-4e+01
[ultranest] Likelihood function evaluations: 4544
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest]   logZ = -47.33 +- 0.07996
[ultranest] Effective samples strategy satisfied (ESS = 1011.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.17, need <0.5)
[ultranest]   logZ error budget: single: 0.13 bs:0.08 tail:0.41 total:0.41 required:<0.50
[ultranest] done iterating.

logZ = -47.341 +- 0.440
  single instance: logZ = -47.341 +- 0.126
  bootstrapped   : logZ = -47.331 +- 0.173
  tail           : logZ = +- 0.405
insert order U test : converged: False correlation: 13.0 iterations

    mean                -0.3 +- 4.7
    std                 11.6 +- 5.2

Vary the number of samples to check numerical stability!
plotting results ...

Notice the parameters of the fitted Gaussian distribution above. The standard deviation is quite small (which was the point of the original paper). A corner plot is at posteriorsamples.txt_out_gauss/plots/corner.pdf

Visualising the results

Here is the output plot, converted to png for this tutorial with:

$ convert -density 100 posteriorsamples.txt_out.pdf out.png

out.png

In black, we see the non-parametric fit. The red curve shows the Gaussian model.

The histogram model indicates that a heavier-tailed distribution may be better.

The error bars in gray are the result of naively averaging the posteriors. This is not a statistically meaningful procedure, but it can give you ideas about which models you may want to try for the sample distribution.
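For reference, a naive average of this kind can be sketched as below: each object's posterior samples are histogrammed on a common grid, and the histograms are averaged across objects. The function name is hypothetical and the script may compute its gray error bars differently; the sketch only shows the kind of procedure meant here.

```python
import numpy as np


def naive_stack(samples, low, high, nbins):
    """Average the per-object posterior histograms on a common grid."""
    samples = np.asarray(samples, dtype=float)
    edges = np.linspace(low, high, nbins + 1)
    width = edges[1] - edges[0]
    counts = np.array([np.histogram(row, bins=edges)[0] for row in samples])
    # Per-object density in each bin (samples outside the range are dropped)
    per_object = counts / (samples.shape[1] * width)
    # Mean across objects, and the object-to-object scatter
    return edges, per_object.mean(axis=0), per_object.std(axis=0)


# Toy usage: two objects, two bins spanning 0 to 2.
edges, mean_density, scatter = naive_stack([[0.5, 1.5], [0.5, 0.5]], 0.0, 2.0, 2)
print(mean_density)  # [0.75 0.25]
```

Such an average is broadened by the measurement uncertainties of the individual objects, which is why it is not a substitute for the hierarchical models above.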

Output files

  • posteriorsamples.txt_out.pdf contains a plot.
  • posteriorsamples.txt_out_gauss contains the UltraNest analysis output assuming a Gaussian distribution.
  • posteriorsamples.txt_out_flexN contains the UltraNest analysis output assuming a histogram model.
  • The directories include diagnostic plots, corner plots and posterior samples of the distribution parameters.

With these output files, you can:

  • plot the sample parameter distribution
  • report the mean and spread, and their uncertainties
  • split the sample by some parameter, and plot the sample mean as a function of that parameter.

If you want to adjust the plot, just edit the script.

If you want to try a different distribution, adapt the script. It uses UltraNest for the inference.

Take-aways

  • PosteriorStacker computed an intrinsic distribution from a set of uncertain measurements.
  • This tool can combine arbitrary pre-existing analyses.
  • No assumptions about the posterior shapes were necessary: multi-modal and asymmetric posteriors work fine.