📖 Technical report | 🤗 Model | 🐰 Demo
VisualRWKV is the visually enhanced version of the RWKV language model, enabling RWKV to handle a variety of visual tasks. A loosely coupled adapter design adds visual capabilities while preserving the performance of the underlying RWKV language model, so the visual components can be integrated or swapped without changing RWKV's core functionality.
- 2024.05.11 🔥 VisualRWKV-6.0 is released!
- 2024.03.25 🔥 VisualRWKV-5.0 is released!
The latest stable version is VisualRWKV-v6/v6.0. Please cd to the VisualRWKV-v6/v6.0 directory before running the code.
VisualRWKV training consists of two stages:
- (1) Pre-training stage: use the pretraining dataset to train a projection layer that connects the frozen pretrained vision encoder to the frozen RWKV model;
- (2) Fine-tuning stage: use visual instruction data to teach the model to follow visual instructions.
You can download the LLaVA-Pretrain dataset for the pre-training stage.
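After downloading, it can be worth checking that every image referenced by the annotation file is actually present on disk before starting a long run. The following is a minimal sketch, not part of the VisualRWKV code base. It assumes the annotation file is a JSON list of records whose optional `image` field holds a path relative to the image folder, as in the LLaVA data releases; the helper name `find_missing_images` is hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(data_file, image_folder):
    """Return image paths referenced in data_file that are absent from image_folder.

    Assumption: data_file is a JSON list of dicts, and records that use an
    image store its relative path under the "image" key (LLaVA-style layout).
    """
    with open(data_file, encoding="utf-8") as f:
        records = json.load(f)

    root = Path(image_folder)
    missing = []
    for record in records:
        image = record.get("image")  # text-only records may have no "image" key
        if image is not None and not (root / image).is_file():
            missing.append(image)
    return missing
```

Calling `find_missing_images` with the same values you plan to pass to `--data_file` and `--image_folder` should return an empty list when the download is complete.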
If you want to run pre-training yourself, you can download the RWKV checkpoints listed in the table below.
VisualRWKV Version | RWKV 1B6 | RWKV 3B | RWKV 7B |
---|---|---|---|
VisualRWKV-v6 | RWKV-x060-World-1B6 | RWKV-x060-World-3B | RWKV-x060-World-7B |
You can use the following command as a reference for pretraining the VisualRWKV-v6.0 model. See also the scripts in the scripts/train directory.
# Example: pretrain a 1B6 RWKV model on 4 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
python train.py --load_model /path/to/rwkv/checkpoint \
--wandb "" --proj_dir path/to/output/ \
--data_file /path/to/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
--data_type "json" --vocab_size 65536 \
--ctx_len 1024 --epoch_steps 1000 --epoch_count 9 --epoch_begin 0 --epoch_save 0 \
--micro_bsz 16 --accumulate_grad_batches 2 --n_layer 24 --n_embd 2048 --pre_ffn 0 \
--lr_init 1e-3 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 4 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 0 \
--image_folder /path/to/LLaVA-Pretrain/images/ \
--vision_tower_name /path/to/openai/clip-vit-large-patch14-336 \
--freeze_rwkv 24 --detail low --grid_size -1 --image_position first \
--enable_progress_bar True
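When adapting this command to a different number of GPUs, it helps to keep the effective batch size per optimizer step in mind. The sketch below assumes plain data-parallel training, where each optimizer step consumes `micro_bsz` samples per device on every device, repeated `accumulate_grad_batches` times. The function name and the `num_nodes` parameter are illustrative and not part of the training script.

```python
def effective_batch_size(micro_bsz, devices, accumulate_grad_batches, num_nodes=1):
    """Samples consumed per optimizer step under data-parallel training.

    Assumption: every device processes micro_bsz samples per forward/backward
    pass, and gradients are accumulated over accumulate_grad_batches passes.
    """
    return micro_bsz * devices * accumulate_grad_batches * num_nodes


# Values taken from the pretraining command: --micro_bsz 16 --devices 4
# --accumulate_grad_batches 2
pretrain_batch = effective_batch_size(micro_bsz=16, devices=4, accumulate_grad_batches=2)
print(pretrain_batch)  # 128
```

Under the same assumption, halving `--devices` to 2 while doubling `--accumulate_grad_batches` to 4 would keep the effective batch size at 128.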
Please refer to the LLaVA project for visual instruction data.
You can use the following command as a reference for fine-tuning the VisualRWKV-v6.0 model. See also the scripts in the scripts/train directory.
# Example: fine-tune a 1B6 RWKV model on 8 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_path path/to/pretrained-visualrwkv \
--wandb "" --proj_dir out/rwkv1b5-v060_mix665k \
--data_file /path/to/LLaVA-Instruct-150K/shuffled_llava_v1_5_mix665k.json \
--data_type "json" --vocab_size 65536 \
--ctx_len 2048 --epoch_steps 1000 --epoch_count 20 --epoch_begin 0 --epoch_save 5 \
--micro_bsz 8 --accumulate_grad_batches 2 --n_layer 24 --n_embd 2048 --pre_ffn 0 \
--lr_init 2e-5 --lr_final 2e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 \
--accelerator gpu --devices 8 --precision bf16 --strategy deepspeed_stage_1 --grad_cp 0 \
--image_folder /path/to/LLaVA-Instruct-150K/images/ \
--vision_tower_name /path/to/openai/clip-vit-large-patch14-336 \
--freeze_rwkv 0 --freeze_proj 0 --detail low --grid_size -1 --image_position middle \
--enable_progress_bar True
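The pretraining and fine-tuning commands share most of their arguments, so the settings that change between stages are easy to miss. As a rough illustration, the sketch below parses excerpts of the two commands with the standard-library `shlex` module and reports the flags whose values differ. The helpers `parse_flags` and `diff_flags` are hypothetical, and only a subset of the flags is reproduced here.

```python
import shlex


def parse_flags(command):
    """Map each --flag in a command string to its value (None if it has none)."""
    tokens = shlex.split(command)
    flags = {}
    for i, token in enumerate(tokens):
        if token.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[token[2:]] = tokens[i + 1] if has_value else None
    return flags


def diff_flags(first, second):
    """Return {flag: (value_in_first, value_in_second)} for flags that differ."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys if first.get(k) != second.get(k)}


# Excerpts of the two commands in this README (not the full argument lists).
PRETRAIN = """
--ctx_len 1024 --epoch_count 9 --epoch_save 0 --micro_bsz 16
--lr_init 1e-3 --lr_final 1e-5 --devices 4 --precision bf16
--freeze_rwkv 24 --grid_size -1 --image_position first
"""
FINETUNE = """
--ctx_len 2048 --epoch_count 20 --epoch_save 5 --micro_bsz 8
--lr_init 2e-5 --lr_final 2e-5 --devices 8 --precision bf16
--freeze_rwkv 0 --freeze_proj 0 --grid_size -1 --image_position middle
"""

changes = diff_flags(parse_flags(PRETRAIN), parse_flags(FINETUNE))
for flag, (pre, fine) in changes.items():
    print(f"--{flag}: pretrain={pre} fine-tune={fine}")
```

Among the differences, `--freeze_rwkv` goes from 24 to 0 and the learning rate drops from `1e-3` to `2e-5`, which is consistent with the two-stage description: the first stage keeps RWKV frozen, while the second appears to update it with a much smaller learning rate.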