A Lightweight Experiment & Resource Monitoring Tool 📺

Last update: Dec 28, 2022

Overview

"Did I already run this experiment before? How many resources are currently available on my cluster?" If these are common questions you encounter during your daily life as a researcher, then mle-monitor is made for you. It provides a lightweight API for tracking your experiments using a pickle protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). Furthermore, it comes with built-in resource monitoring on Slurm/Grid Engine clusters and local machines/servers.

mle-monitor provides three core functionalities:

MLEProtocol: A composable protocol database API for ML experiments.
MLEResource: A tool for obtaining server/cluster usage statistics.
MLEDashboard: A dashboard visualizing resource usage & experiment protocol.

To get started I recommend checking out the colab notebook and an example workflow.
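To make the "did I already run this experiment?" question concrete, here is a toy sketch of a pickle-backed experiment store. It is not how `MLEProtocol` is implemented; the helper names (`load_db`, `save_db`, `already_run`, `add_experiment`) and the file name are hypothetical and only illustrate the idea of persisting experiment meta data with pickle.

```python
import os
import pickle
import tempfile


def load_db(path):
    """Load the experiment dict from disk, or start empty if no file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def save_db(db, path):
    """Write the whole experiment dict back to disk."""
    with open(path, "wb") as f:
        pickle.dump(db, f)


def already_run(db, meta_data):
    """Return True if an entry with identical meta data is already stored."""
    return any(entry == meta_data for entry in db.values())


def add_experiment(db, meta_data):
    """Store a copy of the meta data under the next integer id."""
    new_id = len(db) + 1
    db[new_id] = dict(meta_data)
    return new_id


# Toy usage in a temporary directory (hypothetical file name)
db_path = os.path.join(tempfile.mkdtemp(), "toy_protocol.pkl")
db = load_db(db_path)
meta = {"purpose": "Grid search", "project_name": "MNIST"}
seen_before = already_run(db, meta)
exp_id = add_experiment(db, meta)
save_db(db, db_path)
seen_after = already_run(load_db(db_path), meta)
print(seen_before, exp_id, seen_after)
```

The actual library adds summaries, progress tracking and optional cloud synchronization on top of this basic idea, as described in the sections that follow.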

`MLEProtocol`: Keeping Track of Your Experiments 📝

from mle_monitor import MLEProtocol

# Load protocol database or create new one -> print summary
protocol_db = MLEProtocol("mle_protocol.db", verbose=False)
protocol_db.summary(tail=10, verbose=True)

# Draft data to store in protocol & add it to the protocol
meta_data = {
    "purpose": "Grid search",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "experiment_type": "hyperparameter-search",  # Type of experiment
    "experiment_dir": "experiments/logs",  # Experiment directory
    "num_total_jobs": 10,  # Number of total jobs to run
    # ... (further keys, see the table of meta data keys)
}
new_experiment_id = protocol_db.add(meta_data)

# ... train your 10 (pseudo) networks/complete respective jobs
for i in range(10):
    protocol_db.update_progress_bar(new_experiment_id)

# Wrap up an experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)

The meta data can contain the following keys:

Key	Description	Default
`purpose`	Purpose of experiment	`'None provided'`
`project_name`	Project name of experiment	`'default'`
`exec_resource`	Resource jobs are run on	`'local'`
`experiment_dir`	Experiment log storage directory	`'experiments'`
`experiment_type`	Type of experiment to run	`'single'`
`base_fname`	Main code script to execute	`'main.py'`
`config_fname`	Config file path of experiment	`'base_config.yaml'`
`num_seeds`	Number of evaluation seeds	1
`num_total_jobs`	Number of total jobs to run	1
`num_job_batches`	Number of sequential job batches	1
`num_jobs_per_batch`	Number of jobs in a single batch	1
`time_per_job`	Expected duration per job (days:hours:minutes)	`'00:01:00'`
`num_cpus`	Number of CPUs used in job	1
`num_gpus`	Number of GPUs used in job	0
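The following sketch shows how a partial meta data dictionary relates to the defaults listed in the table. The default values are copied from the table; the `fill_defaults` helper is hypothetical and not part of `mle-monitor`, which may handle missing keys differently.

```python
# Defaults copied from the table of meta data keys.
DEFAULT_META = {
    "purpose": "None provided",
    "project_name": "default",
    "exec_resource": "local",
    "experiment_dir": "experiments",
    "experiment_type": "single",
    "base_fname": "main.py",
    "config_fname": "base_config.yaml",
    "num_seeds": 1,
    "num_total_jobs": 1,
    "num_job_batches": 1,
    "num_jobs_per_batch": 1,
    "time_per_job": "00:01:00",
    "num_cpus": 1,
    "num_gpus": 0,
}


def fill_defaults(meta_data: dict) -> dict:
    """Hypothetical helper: return a copy with missing keys set to defaults."""
    return {**DEFAULT_META, **meta_data}


full_meta = fill_defaults({"purpose": "Grid search", "num_total_jobs": 10})
print(full_meta["purpose"], full_meta["num_total_jobs"], full_meta["num_gpus"])
```

Keys supplied by the user take precedence, so only the entries that differ from the defaults need to be written out.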

Additionally, you can synchronize the protocol with a Google Cloud Storage (GCS) bucket by providing cloud_settings. In this case, the results stored in experiment_dir are also uploaded to the GCS bucket when you call protocol_db.complete().

# Define GCS settings - requires 'GOOGLE_APPLICATION_CREDENTIALS' env var.
cloud_settings = {
    "project_name": "mle-toolbox",  # GCP project name
    "bucket_name": "mle-protocol",  # GCS bucket name
    "use_protocol_sync": True,  # Whether to sync the protocol to GCS
    "use_results_storage": True,  # Whether to sync experiment_dir to GCS
}
protocol_db = MLEProtocol("mle_protocol.db", cloud_settings, verbose=True)
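Since the GCS settings rely on the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, it can help to check for it before enabling synchronization. The sketch below does this with the standard library only; the helper names `gcs_credentials_available` and `build_cloud_settings` are hypothetical and not part of `mle-monitor`.

```python
import os


def gcs_credentials_available(env=os.environ) -> bool:
    """Check that GOOGLE_APPLICATION_CREDENTIALS points to an existing file."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(path) and os.path.isfile(path)


def build_cloud_settings(project_name: str, bucket_name: str, env=os.environ):
    """Return a cloud_settings dict, or None when credentials are missing."""
    if not gcs_credentials_available(env):
        return None
    return {
        "project_name": project_name,  # GCP project name
        "bucket_name": bucket_name,  # GCS bucket name
        "use_protocol_sync": True,  # Whether to sync the protocol to GCS
        "use_results_storage": True,  # Whether to sync experiment_dir to GCS
    }


cloud_settings = build_cloud_settings("mle-toolbox", "mle-protocol")
print("GCS sync enabled" if cloud_settings else "GCS sync disabled")
```

When the helper returns `None`, one option is to fall back to the local-only protocol from the first example.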

The `MLEResource`: Keeping Track of Your Resources 📉

On Your Local Machine

from mle_monitor import MLEResource

# Instantiate local resource and get usage data
resource = MLEResource(resource_name="local")
resource_data = resource.monitor()

On a Slurm Cluster

resource = MLEResource(
    resource_name="slurm-cluster",
    monitor_config={"partitions": ["<partition-1>", "<partition-2>"]},
)

On a Grid Engine Cluster

resource = MLEResource(
    resource_name="sge-cluster",
    monitor_config={"queues": ["<queue-1>", "<queue-2>"]}
)
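Which of the three resource names applies depends on the machine the code runs on. As a rough sketch, one could guess it from the scheduler client commands found on the `PATH`. The helper `detect_resource_name` is hypothetical and not part of `mle-monitor`; note that `qstat` also exists on other schedulers, so treat the result as a hint only.

```python
import shutil


def detect_resource_name(which=shutil.which) -> str:
    """Guess the resource name from scheduler commands found on PATH."""
    if which("squeue"):  # Slurm client command
        return "slurm-cluster"
    if which("qstat"):  # Grid Engine client command (also used elsewhere)
        return "sge-cluster"
    return "local"


print(detect_resource_name())
```

The returned string could then be passed as `resource_name`, together with a matching `monitor_config` for the cluster cases.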

The `MLEDashboard`: Dashboard Visualization 🎞️

from mle_monitor import MLEDashboard

# Instantiate dashboard with protocol and resource
dashboard = MLEDashboard(protocol_db, resource)

# Get a static snapshot of the protocol & resource utilisation printed in console
dashboard.snapshot()

# Run the live dashboard, which keeps monitoring in a while loop
dashboard.live()

Installation ⏳

A PyPI installation is available via:

pip install mle-monitor

Alternatively, you can clone this repository and afterwards 'manually' install it:

git clone https://github.com/mle-infrastructure/mle-monitor.git
cd mle-monitor
pip install -e .

Development & Milestones for Next Release

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗.

Comments

Is the dashboard polling squeue?

Hey, Thanks for publishing the library, the dashboard looks great!

However, I was a bit concerned to see you are using squeue since the official documentation says

"Executing squeue sends a remote procedure call to slurmctld. If enough calls from squeue or other Slurm client commands that send remote procedure calls to the slurmctld daemon come in at once, it can result in a degradation of performance of the slurmctld daemon, possibly resulting in a denial of service.

Do not run squeue or other Slurm client commands that send remote procedure calls to slurmctld from loops in shell scripts or other programs. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather."

Do you poll squeue or is there some other, smarter management of it that I missed?

Thanks, Eliahu

opened by eliahuhorwitz
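The source does not show how the dashboard queries Slurm, so the following is not the library's implementation. It is a generic sketch of one way to follow the quoted advice: cache a command's output and re-run it at most once per interval. The class name `ThrottledCommand` and its parameters are hypothetical; the demo uses a fake runner and a fake clock so that it runs without a Slurm installation.

```python
import subprocess
import time


class ThrottledCommand:
    """Cache a command's output and re-run it at most once per interval."""

    def __init__(self, cmd, min_interval=60.0, runner=None, clock=time.monotonic):
        self.cmd = cmd
        self.min_interval = min_interval  # seconds between real calls
        self.runner = runner or self._run
        self.clock = clock
        self._last_time = None
        self._last_output = None

    @staticmethod
    def _run(cmd):
        # Real execution path: run the command and return its stdout
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    def get(self):
        """Return cached output unless min_interval has elapsed."""
        now = self.clock()
        if self._last_time is None or now - self._last_time >= self.min_interval:
            self._last_output = self.runner(self.cmd)
            self._last_time = now
        return self._last_output


# Demo with a fake runner and a fake clock (no squeue call is made)
calls = []
fake_time = [0.0]


def fake_runner(cmd):
    calls.append(cmd)
    return f"call {len(calls)}"


cached = ThrottledCommand(
    ["squeue"], min_interval=60.0, runner=fake_runner, clock=lambda: fake_time[0]
)
first = cached.get()  # triggers the runner
fake_time[0] = 30.0
second = cached.get()  # served from the cache
fake_time[0] = 61.0
third = cached.get()  # interval elapsed, runner is called again
print(first, second, third, len(calls))
```

With the default runner, `ThrottledCommand(["squeue"], min_interval=60.0).get()` would invoke `squeue` at most once per minute regardless of how often a refresh loop asks for the data.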

Releases(v0.0.1)

v0.0.1(Dec 9, 2021)

Basic API for MLEProtocol, MLEResource & MLEDashboard.
