The RWKV Language Model

Last update: Jan 05, 2023

Related tags

Overview

RWKV-LM

We propose the RWKV language model, with alternating time-mix and channel-mix layers:

$\begin{align*} \text{Time-mix :} && \text{TM}_{t,c} &&=&&\text{sigmoid}(\text{R}_{t,c}) &&\cdot&& &&\textstyle\sum_{u} &&\textbf{W}_{t,u,c} &&\cdot&& \text{softmax}_t(\text{K}_{u,c}) &&\cdot&& \text{V}_{u,c}\\ \text{Channel-mix :} && \text{CM}_{t,c} &&=&&\text{sigmoid}(\text{R}_{t,c}) &&\cdot&& &&\textstyle\sum_d &&\textbf{W}_{c,d} &&\cdot&& \text{gelu}(\text{K}_{t,d}) &&\cdot&& \text{V}_{t,d} \end{align*}$

The R, K, V are generated by linear transforms of input, and W is parameter. The idea of RWKV is to decompose attention into R(target) * W(src, target) * K(src). So we can call R "receptance", and sigmoid means it's in 0~1 range.
The Time-mix is similar to AFT (https://arxiv.org/abs/2105.14103). There are two differences.

(1) We changed the normalization (denominator). For masked language models, we define:

$\text{softmax}_t(\text{K}_{u,c}) = \frac{\exp(\text{K}_{u,c})}{\sum_{v \leq t}\exp(\text{K}_{v,c})}$

(2) We decompose W_{t,u,c} and introduce multi-head W (here h is the corresponding head of c):

$W_{t,u,c}=f_h(t-u)\cdot \alpha_h(u) \cdot \beta_h(t)$

(3) You don't need LayerNorm for Time-mix. In fact, the model converges faster when LayerNorm is removed.

Moreover we multiply the final output of Time-mix layer by γ(t). The reason for the α β γ factors, is because the context size is smaller when t is small, and this can be compensated using the α β γ factors.

The Channel-mix is similar to GeGLU (https://arxiv.org/abs/2002.05202) with an extra R factor.
Finally, we add extra time-mixing as in (https://github.com/BlinkDL/minGPT-tuned). You can try reducing the amt of time-mixing in upper layers of deep models.

We also propose a new sampling method (as in src/utils.py):

(1) Find the max probability p_max after softmax.

(2) Remove all entries whose probability is lower than 0.02 * pow(p_max, 2)

(3) Feel free to tune the 0.02 and 2 factor.

Training loss, RWKV vs MHA+Rotary+GeGLU:

(this is character-level loss with simplebooks-92 dataset https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip)

Comments

Sequence to Sequence?

Hey @BlinkDL! Awesome project!

I was wondering if you have performed any Seq-2-Seq experiments with it? Any reason for going with GPT model in the first place as opposed to something like T5 (standard Transformer)? Any direction on what changes will be required to make a standard encoder-decoder architecture with RWKV?

Also, is there any report on in-context-learning/FSL capability of the latest trained model?

opened by SushantDaga 2
v4 model.py vs model_run.py

Hi, Thanks for this awesome repo! I'm trying to understand the code and found that in the v4 folder, there's this model.py and model_run.py, which contains GPT and RWKV_GPT respectively which all uses different initialization methods. Could you elaborate on when should which one be used? Thanks in advnace!

opened by jingweiz 3
RWKV-4 169m/430m in browser with ORT Web / TF.js / tfjs-tflite?

Hi, really exciting project! I'm wondering if you've published the model conversion script that you used to create the js_models files from the .pth model file? It would be awesome to see how the larger and newer models like RWKV-4 169m/430m perform in the browser! I think the inference speed of RWKV opens up many new possibilities for language models on the web.

opened by josephrocca 32

CUDA compilation error with Ctx Length>2000

Hello, I am trying out RWKV with audio modality and when I set T_MAX>>1000, it throws this error:

Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/timex/build.ninja...
Building extension module timex...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
FAILED: timex_cuda.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=timex -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/surya-env/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/surya-env/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' --use_fast_math --extra-device-vectorization -DTmax=10000 -DBF=8 -DBB=2 -std=c++14 -c cuda/timex_cuda.cu -o timex_cuda.cuda.o 
ptxas error   : Entry function '_Z15kernel_backwardIfEvPKT_S2_S2_PS0_S3_iii' uses too much shared data (0x30d40 bytes, 0xc000 max)
ptxas error   : Entry function '_Z14kernel_forwardIfEvPKT_S2_PS0_S0_iii' uses too much shared data (0x57e40 bytes, 0xc000 max)
ninja: build stopped: subcommand failed.

GPU: A100, VRAM: 42GB, CUDA 11.6

I am okay if the training takes a bit long. But I need this to work. Don't know any CUDA. Can you suggest some workarounds?

Thanks for the incredible work btw!

opened by ojus1 8

关于调用模型做分类任务

你好作者！我对此工作很感兴趣，因为我现在在用基于transformer的模型做分类任务，transformer或者RNN在分类任务里通常采用最后一个模块的每个通道的最后一个元素作为输出，并通过全连接层映射到几个类别。请问你觉得RWKV原理类似吗？依旧提取最后一个元素作为输出是否稳妥呢？希望您能给出一些建议，我将很感激！

opened by louisinhit 2

Releases(4.00)

4.00(Dec 6, 2022)

Just a stable release.
Source code(tar.gz)
Source code(zip)
2.00(Mar 25, 2022)

Attached model : ctx1024-layer6-emb512 on enwik8 with 1.65 dev perplexity (0.72 BPC)
Source code(tar.gz)
Source code(zip)
enwik8-ppl1.65-6064-1024-RWKV-6-512-2022-03-25-21-05-13.zip(95.42 MB)
0.02(Aug 25, 2021)

v0.02 with RWKV, MHA_shift, MHA_rotary, MHA_pro, and time-shift mixing.
Source code(tar.gz)
Source code(zip)
0.01(Aug 13, 2021)

first release with RWKV, MHA_rotary, MHA_pro, and time-shift mixing.
Source code(tar.gz)
Source code(zip)

Owner

PENG Bo

http://zhihu.com/people/bopengbopeng

GitHub Repository

A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.

Basic-UI-for-GPT-J-6B-with-low-vram A repository to run GPT-J-6B on low vram systems by using both ram, vram and pinned memory. There seem to be some

90 Dec 25, 2022

A demo of chinese asr

chinese_asr_demo 一个端到端的中文语音识别模型训练、测试框架具备数据预处理、模型训练、解码、计算wer等等功能训练数据训练数据采用thchs_30，

4 Dec 09, 2021

ZUNIT - Toward Zero-Shot Unsupervised Image-to-Image Translation

ZUNIT Dependencies you can install all the dependencies by pip install -r requirements.txt Datasets Download CUB dataset. Unzip the birds.zip at ./da

9 Jun 24, 2022

In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

Transformers are all you need In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a

8 Apr 13, 2022

This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals.It sends out the most recent news at random!

Nepali-news-notifier This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular in

1 Feb 11, 2022

The RWKV Language Model

Related tags

Overview

RWKV-LM

Comments

Sequence to Sequence?

v4 model.py vs model_run.py

RWKV-4 169m/430m in browser with ORT Web / TF.js / tfjs-tflite?

CUDA compilation error with Ctx Length>2000

关于调用模型做分类任务

Releases(4.00)

4.00(Dec 6, 2022)

2.00(Mar 25, 2022)

0.02(Aug 25, 2021)

0.01(Aug 13, 2021)

Owner

PENG Bo

A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.

A demo of chinese asr

ZUNIT - Toward Zero-Shot Unsupervised Image-to-Image Translation

In this workshop we will be exploring NLP state of the art transformers, with SOTA models like T5 and BERT, then build a model using HugginFace transformers framework.

This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals.It sends out the most recent news at random!

Knowledge Oriented Programming Language

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

ReCoin - Restoring our environment and businesses in parallel

Must-read papers on improving efficiency for pre-trained language models.

Random Directed Acyclic Graph Generator

A Paper List for Speech Translation

Tool to check whether a GCP bucket is public or not.

skweak: A software toolkit for weak supervision applied to NLP tasks

Generate vector graphics from a textual caption

Dual languaged (rus+eng) tool for packing and unpacking archives of Silky Engine.

A CSRankings-like index for speech researchers

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

Translation to python of Chris Sims' optimization function

Spert NLP Relation Extraction API deployed with torchserve for inference

Twitter Sentiment Analysis using #tag, words and username