Select, weight and analyze complex sample data

Overview

In large-scale surveys, samples are often selected using complex random mechanisms, and estimates derived from such samples must reflect those mechanisms. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four subpackages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
  • Probability proportional to size (PPS): systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse.
  • Weight poststratification, calibration, and normalization.
  • Weight replication, i.e. bootstrap, balanced repeated replication (BRR), and jackknife.

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based methods, also called linearization methods.
  • Replication-based methods, i.e. bootstrap, BRR, and jackknife.
  • Regression-based methods, e.g. generalized regression (GREG).

Small Area Estimation (SAE). When the sample size is not large enough to produce reliable and stable domain-level estimates, SAE techniques can be used to model the variable of interest and produce domain-level estimates. This subpackage provides area-level and unit-level SAE methods.

For more details, visit https://samplics.readthedocs.io/en/latest/

Usage

Let's assume that we have a population from which we would like to select a sample. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half-width of the confidence interval) of 0.10.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion")
sample_size.calculate(target=0.80, half_ci=0.10)
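
Assuming a 95% confidence level and the Wald (normal-approximation) method, the resulting size can be reproduced by hand. This is a from-scratch sketch of the formula, not the samplics API:

import math
from scipy.stats import norm

# Wald sample size for a proportion: n = z^2 * p * (1 - p) / e^2
z = norm.ppf(0.975)  # two-sided 95% confidence (assumed)
p, e = 0.80, 0.10
n = math.ceil(z**2 * p * (1 - p) / e**2)  # 62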

Furthermore, the population is located in four natural regions, i.e. North, South, East, and West. We may be interested in calculating sample sizes based on region-specific requirements, e.g. expected proportions, desired precisions, and associated design effects.

from samplics.sampling import SampleSize

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1.0, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size = SampleSize(parameter="proportion", method="fleiss", stratification=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)
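
For reference, the Wald sizes implied by the same inputs (now scaled by the design effects) can be checked with the formula above; the Fleiss method adds a continuity correction on top of this. A sketch reusing the dictionaries defined in the snippet, not the samplics API:

import math
from scipy.stats import norm

z = norm.ppf(0.975)  # two-sided 95% confidence (assumed)
wald_sizes = {
    s: math.ceil(deff[s] * z**2 * p * (1 - p) / half_ci[s] ** 2)
    for s, p in expected_proportions.items()
}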

To select a sample of primary sampling units using a PPS method, we can use code similar to the snippets below. Note that we first use the datasets module to load the example dataset.

# First we import the example dataset
from samplics.datasets import load_psu_frame
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
    method="pps-sys",
    stratification=True,
    with_replacement=False,
)

psu_frame["psu_prob"] = pps_design.inclusion_probs(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
)
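
The idea behind systematic PPS selection can be sketched in a few lines of numpy: lay the units end to end along their cumulated measure of size, then take equally spaced hits from a random start. This is an illustration of the technique only, not the samplics implementation:

import numpy as np

def pps_systematic(sizes, n, rng=None):
    # Cumulate the measures of size, draw n equally spaced points from a
    # random start, and return the index of the unit each point falls into
    rng = np.random.default_rng() if rng is None else rng
    cum = np.cumsum(np.asarray(sizes, dtype=float))
    step = cum[-1] / n
    points = rng.uniform(0, step) + step * np.arange(n)
    return np.searchsorted(cum, points)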

The initial weighting step is to obtain the design sample weights. Here we show a simple two-stage sampling design.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

To adjust the design sample weight for nonresponse, we can use code similar to:

import numpy as np

from samplics.weighting import SampleWeight

# Simulate response
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)
# Map the generic samplics status codes to the custom response statuses
status_mapping = {
    "in": "ineligible",
    "rr": "respondent",
    "nr": "non-respondent",
    "uk": "unknown",
}

# Adjust the sample weights for nonresponse within each region
full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adjust_class=full_sample["region"],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)
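
Conceptually, the adjustment inflates the weights of respondents within each adjustment class so that they also carry the weight of the eligible non-respondents, while ineligible units keep their design weights. A simplified by-hand version for a single class, ignoring the unknown-eligibility cases that samplics also handles:

# Simplified sketch for one adjustment class (not the samplics implementation)
north = full_sample[full_sample["region"] == "North"]
eligible = north["response_status"].isin(["respondent", "non-respondent"])
respondent = north["response_status"] == "respondent"
adj_factor = (
    north.loc[eligible, "design_weight"].sum()
    / north.loc[respondent, "design_weight"].sum()
)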

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

# Taylor-based
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator("mean")
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)
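
For intuition, the linearization variance of a weighted mean can be sketched with the ultimate-cluster estimator: replace each observation by its linearized score, total the scores by PSU, and sum the between-PSU variances across strata. An illustration of the technique, not the samplics internals:

import numpy as np
import pandas as pd

def taylor_mean_variance(y, w, stratum, psu):
    df = pd.DataFrame({"y": y, "w": w, "s": stratum, "p": psu}).dropna()
    ybar = np.average(df["y"], weights=df["w"])
    df["z"] = df["w"] * (df["y"] - ybar) / df["w"].sum()  # linearized scores
    var = 0.0
    for _, h in df.groupby("s"):
        totals = h.groupby("p")["z"].sum()  # PSU totals within the stratum
        if len(totals) > 1:
            var += len(totals) / (len(totals) - 1) * ((totals - totals.mean()) ** 2).sum()
    return ybar, var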

# Replicate-based
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator("brr", "ratio").estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
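
The replicate logic itself is simple: re-estimate the parameter with each set of replicate weights, then average the squared deviations from the full-sample estimate. A minimal sketch for BRR without Fay's adjustment (which would rescale this quantity):

import numpy as np

def brr_variance(theta_full, theta_reps):
    # Average squared deviation of the replicate estimates from the
    # full-sample estimate
    theta_reps = np.asarray(theta_reps, dtype=float)
    return np.mean((theta_reps - theta_full) ** 2)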

To predict small-area parameters, we can use code similar to:

import numpy as np
import pandas as pd

# Area-level basic method
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)
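
The fitted Fay-Herriot model shrinks each direct estimate toward its regression-synthetic counterpart, with more shrinkage where the sampling error is large relative to the between-area variance. A sketch of that combination, not the samplics internals:

import numpy as np

def fh_eblup(direct, synthetic, sigma2_u, sigma2_e):
    # gamma is close to 1 where the direct estimate is precise, so the
    # EBLUP then stays close to the direct estimate
    gamma = sigma2_u / (sigma2_u + np.asarray(sigma2_e, dtype=float))
    return gamma * np.asarray(direct) + (1 - gamma) * np.asarray(synthetic)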

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]
# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    countycrop["corn_area"],
    countycrop[["corn_pixel", "soybeans_pixel"]],
    countycrop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=countycrop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
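
Under the nested-error (Battese-Harter-Fuller) model used here, the predicted area mean is the regression prediction at the area-level covariate means plus a shrunken area residual. A sketch of that prediction for a single area, not the samplics internals:

import numpy as np

def bhf_area_mean(xmean_d, beta, ybar_d, xbar_d, sigma2_u, sigma2_e, n_d):
    # gamma grows with the area sample size n_d, so well-sampled areas
    # lean more on their own data
    gamma_d = sigma2_u / (sigma2_u + sigma2_e / n_d)
    return xmean_d @ beta + gamma_d * (ybar_d - xbar_d @ beta)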

Installation

pip install samplics

Python 3.7 or newer is required, and the main dependencies are numpy, pandas, scipy, and statsmodels.

Contribution

If you would like to contribute to the project, please read contributing to samplics.

License

MIT

Contact

Created by Mamadou S. Diallo - feel free to contact me!
