Repository for the paper titled: "When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer"

Overview

When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer

This repository contains code for our paper titled "When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer". [arXiv]

Table of contents

  1. Paper in a nutshell
  2. Installation
  3. Data and models
  4. Repository usage
  5. Links to experiments and results
  6. Citation

Paper in a nutshell

While recent work on multilingual language models has demonstrated their capacity for cross-lingual zero-shot transfer on downstream tasks, there is a lack of consensus in the community as to what shared properties between languages enable such transfer. Analyses involving pairs of natural languages are often inconclusive and contradictory since languages simultaneously differ in many linguistic aspects. In this paper, we perform a large-scale empirical study to isolate the effects of various linguistic properties by measuring zero-shot transfer between four diverse natural languages and their counterparts constructed by modifying aspects such as the script, word order, and syntax. Among other things, our experiments show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order, and there is a strong correlation between transfer performance and word embedding alignment between languages (e.g., Spearman's R=0.94 on the task of NLI). Our results call for focus in multilingual models on explicitly improving word embedding alignment between languages rather than relying on its implicit emergence.

Installation instructions

  1. Step 1: Install from the conda .yml file.
conda env create -f installation/multilingual.yml
  1. Step 2: Install transformers in an editable way.
pip install -e transformers/
pip install -r transformers/examples/language-modeling/requirements.txt
pip install -r transformers/examples/token-classification/requirements.txt

Repository usage

For the commands we used to get the reported numbers in the paper, click here. This file contains common instructions used. This file can automatically generate commands for your use case.

Bilingual pre-training

  1. For bilingual pre-training on original and derived language pairs, use the flag --invert_word_order for the Inversion transformation, --permute_words for Permutation and --one_to_one_mapping for Transliteration. Example command for bilingual pre-training for English with Inversion transformation to create the derived language pair.
nohup  python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/en/roberta_8/config.json --tokenizer_name config/en/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/train.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/inverted_order_500K/mlm --run_name inverted_en_500K_mlm --invert_word_order --word_modification add &
  1. For Syntax transformations, the train file used in the following command ([email protected][email protected]) means that it is the concatenation of French corpus with French modified to English verb and noun order ([email protected][email protected]).
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/fr/roberta_8/config.json --tokenizer_name config/fr/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/fr/synthetic/[email protected][email protected] --validation_file ../../bucket/pretrain_data/fr/synthetic/[email protected][email protected] --output_dir ../../bucket/model_outputs/fr/syntax_modif_en/mlm --run_name fr_syntax_modif_en_500K_mlm &
  1. For composed transformations, apply multiple transformations by using multiple flags, e.g., --one_to_one_mapping --invert_word_order.
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/en/roberta_8/config.json --tokenizer_name config/en/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/train.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/one_to_one_inverted/mlm --run_name en_one_to_one_inverted --one_to_one_mapping --invert_word_order --word_modification add &
  1. Using different domains for the original and derived language.
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm_synthetic_transitive.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps -1 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_type roberta --config_name config/en/roberta_8/config.json --tokenizer_name config/en/roberta_8/ --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/train_split_1.txt --transitive_file ../../bucket/pretrain_data/en/train_split_2.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/one_to_one_diff_source_100_more_steps/mlm --run_name en_one_to_one_diff_source_100_more_steps --one_to_one_mapping --word_modification add &

Fine-tuning and evaluation

This directory contains scripts used for downstream fine-tuning and evaluation.

  1. Transliteration, Inversion, and Permutation
  2. Syntax
  3. Composed transformations
  4. Using different domains for original and derived languages

Embedding alignment

Use this script to calculate embedding alignment for any model which uses Transliteration as one of the transformations.

Data and models

All the data used for our experiments, hosted on Google Cloud Bucket.

  1. Pre-training data - pretrain_data
  2. Downstream data - supervised_data
  3. Model files - model_outputs

Links to experiments and results

  1. Spreadsheets with run descriptions, commands, and weights and biases link
  2. Spreadsheet with all results
  3. Links to pre-training runs
  4. Link to fine-tuning and analysis

Citation

Please consider citing if you used our paper in your work!

To be updated soon!
Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
The MATH Dataset

Measuring Mathematical Problem Solving With the MATH Dataset This is the repository for Measuring Mathematical Problem Solving With the MATH Dataset b

Dan Hendrycks 267 Dec 26, 2022
OMAMO: orthology-based model organism selection

OMAMO: orthology-based model organism selection OMAMO is a tool that suggests the best model organism to study a biological process based on orthologo

Dessimoz Lab 5 Apr 22, 2022
A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion

A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion This repo intends to release code for our work: Zhaoyang Lyu*, Zhifeng

Zhaoyang Lyu 68 Jan 03, 2023
Code for "Diffusion is All You Need for Learning on Surfaces"

Source code for "Diffusion is All You Need for Learning on Surfaces", by Nicholas Sharp Souhaib Attaiki Keenan Crane Maks Ovsjanikov NOTE: the linked

Nick Sharp 247 Dec 28, 2022
BARF: Bundle-Adjusting Neural Radiance Fields 🤮 (ICCV 2021 oral)

BARF 🤮 : Bundle-Adjusting Neural Radiance Fields Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey IEEE International Conference on Comp

Chen-Hsuan Lin 539 Dec 28, 2022
Determined: Deep Learning Training Platform

Determined: Deep Learning Training Platform Determined is an open-source deep learning training platform that makes building models fast and easy. Det

Determined AI 2k Dec 31, 2022
Fiddle is a Python-first configuration library particularly well suited to ML applications.

Fiddle Fiddle is a Python-first configuration library particularly well suited to ML applications. Fiddle enables deep configurability of parameters i

Google 227 Dec 26, 2022
Corruption Invariant Learning for Re-identification

Corruption Invariant Learning for Re-identification The official repository for Benchmarks for Corruption Invariant Person Re-identification (NeurIPS

Minghui Chen 73 Dec 08, 2022
HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022

HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR. CVPR 2022 [Project page | Video] Getting sta

51 Nov 29, 2022
an implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch

This work has now been superseded by: https://github.com/sniklaus/revisiting-sepconv sepconv-slomo This is a reference implementation of Video Frame I

Simon Niklaus 985 Jan 08, 2023
[CVPR 2022 Oral] Versatile Multi-Modal Pre-Training for Human-Centric Perception

Versatile Multi-Modal Pre-Training for Human-Centric Perception Fangzhou Hong1  Liang Pan1  Zhongang Cai1,2,3  Ziwei Liu1* 1S-Lab, Nanyang Technologic

Fangzhou Hong 96 Jan 03, 2023
HyperDict - Self linked dictionary in Python

Hyper Dictionary Advanced python dictionary(hash-table), which can link it-self

8 Feb 06, 2022
Pseudo lidar - (CVPR 2019) Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving

Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving This paper has been accpeted by Conference o

Yan Wang 881 Dec 27, 2022
PyTorch implementation of "Efficient Neural Architecture Search via Parameters Sharing"

Efficient Neural Architecture Search (ENAS) in PyTorch PyTorch implementation of Efficient Neural Architecture Search via Parameters Sharing. ENAS red

Taehoon Kim 2.6k Dec 31, 2022
Rax is a Learning-to-Rank library written in JAX

🦖 Rax: Composable Learning to Rank using JAX Rax is a Learning-to-Rank library written in JAX. Rax provides off-the-shelf implementations of ranking

Google 247 Dec 27, 2022
Voila - Voilà turns Jupyter notebooks into standalone web applications

Rendering of live Jupyter notebooks with interactive widgets. Introduction Voilà turns Jupyter notebooks into standalone web applications. Unlike the

Voilà Dashboards 4.5k Jan 03, 2023
Revisiting, benchmarking, and refining Heterogeneous Graph Neural Networks.

Heterogeneous Graph Benchmark Revisiting, benchmarking, and refining Heterogeneous Graph Neural Networks. Roadmap We organize our repo by task, and on

THUDM 176 Dec 17, 2022
💡 Type hints for Numpy

Type hints with dynamic checks for Numpy! (❒) Installation pip install nptyping (❒) Usage (❒) NDArray nptyping.NDArray lets you define the shape and

Ramon Hagenaars 377 Dec 28, 2022
Show Me the Whole World: Towards Entire Item Space Exploration for Interactive Personalized Recommendations

HierarchicyBandit Introduction This is the implementation of WSDM 2022 paper : Show Me the Whole World: Towards Entire Item Space Exploration for Inte

yu song 5 Sep 09, 2022
Code associated with the paper "Deep Optics for Single-shot High-dynamic-range Imaging"

Deep Optics for Single-shot High-dynamic-range Imaging Code associated with the paper "Deep Optics for Single-shot High-dynamic-range Imaging" CVPR, 2

Stanford Computational Imaging Lab 40 Dec 12, 2022