Fine-Tuning Yi-34B

Original article · last modified 2024-01-12 18:06:28

This content is licensed under CC BY-SA 4.0.

Tags: #llama #nlp

First published 2024-01-02 11:33:51

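This article walks through installing and configuring LLaMA-Factory in a specific environment, covering CUDA version selection, the HuggingFace tooling, and DeepSpeed. It then shows how to fine-tune the model, what happens with quantized training, and how memory is managed, including the ZeRO-1, ZeRO-2, and ZeRO-3 partitioning strategies.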

Environment Setup

Base Environment

# Create the environment
conda create -n llama_factory python=3.10
conda activate llama_factory
# Pick the CUDA build that matches your setup (cu121 is used here)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install the fine-tuning toolkit
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt
# Install the distributed training acceleration library
pip install deepspeed
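
Before moving on, it can help to confirm that the packages installed above are importable and that PyTorch can see the GPUs. The following is a minimal sketch, not part of LLaMA-Factory: the helper names `missing_packages` and `cuda_summary` are made up for illustration, and the package list is an assumption about what `requirements.txt` and the commands above install.

```python
import importlib.util

# Assumed package list; adjust it to whatever your requirements.txt pulls in.
REQUIRED = ("torch", "transformers", "peft", "deepspeed")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_summary():
    """Describe what PyTorch can see, or say why it cannot be checked."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: no CUDA device visible"
    return (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"{torch.cuda.device_count()} GPU(s)"
    )


if __name__ == "__main__":
    print("missing packages:", missing_packages() or "none")
    print(cuda_summary())
```

Run it inside the `llama_factory` conda environment; an empty "missing packages" list and a non-zero GPU count suggest the base setup is usable.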

Download the Model

# Mirror site: https://hf-mirror.com/
conda activate base
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat
# If an interrupted download leaves files that fail verification, re-download just those files with --include or --exclude; multiple files are separated by spaces
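
To re-fetch only some files after an interrupted download, the `--include` patterns can be appended to the same command. The sketch below only builds and prints the command line; `build_download_cmd` is a hypothetical helper, and the glob patterns are placeholders rather than the repository's actual file names.

```python
import shlex


def build_download_cmd(repo_id, local_dir, include=()):
    """Build the huggingface-cli command used above, optionally limited to some files."""
    cmd = [
        "huggingface-cli", "download",
        "--resume-download",
        "--local-dir-use-symlinks", "False",
        repo_id,
        "--local-dir", local_dir,
    ]
    if include:
        # Several patterns follow a single --include, separated by spaces.
        cmd += ["--include", *include]
    return cmd


cmd = build_download_cmd(
    "01-ai/Yi-34B-Chat",
    "Yi-34B-Chat",
    include=["*.json", "*.safetensors"],  # placeholder patterns
)
# shlex.join quotes the globs so the shell does not expand them locally.
print(shlex.join(cmd))
```

The printed command can be pasted into a shell where `HF_ENDPOINT` is exported as above, or the list can be passed to `subprocess.run`.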

Training

Configuration File

# Reference: https://github.com/hiyouga/LLaMA-Factory/issues/256
vi ds_config_lora.json
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Training Script

deepspeed --include localhost:0,1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"
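
The batch-size flags in the command combine with the number of GPUs into the global batch size per optimizer step, which is also the value that gets filled in for `train_batch_size` when the DeepSpeed config says `"auto"`. A small worked calculation, assuming all eight GPUs listed in `--include` are used for data parallelism (`effective_batch_size` is an illustrative helper):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step across all data-parallel GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


# Values taken from the command above: 4 per device, 4 accumulation steps, 8 GPUs.
global_batch = effective_batch_size(4, 4, 8)
print(global_batch)  # 128
```

Halving `--per_device_train_batch_size` while doubling `--gradient_accumulation_steps` keeps this number unchanged and lowers per-GPU activation memory, at the cost of more forward/backward passes per step.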

ZeRO-1 partitions the optimizer states across the data-parallel processes; ZeRO-2 partitions the optimizer states and gradients; ZeRO-3 partitions the optimizer states, gradients, and parameters.
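
As a rough illustration of what each stage saves, per-GPU memory for model states can be estimated with the accounting from the ZeRO paper (https://arxiv.org/pdf/1910.02054.pdf). The sketch assumes full-parameter mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states. Activations and temporary buffers are ignored, the 34e9 parameter count is nominal, and `model_state_gb` is an illustrative helper. A LoRA run trains far fewer parameters, so its gradient and optimizer terms would be much smaller than shown here.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Estimate per-GPU model-state memory in GB for ZeRO stage 0, 1, 2, or 3."""
    params = 2 * num_params      # fp16 weights
    grads = 2 * num_params       # fp16 gradients
    optim = 12 * num_params      # fp32 weights, momentum, and variance (Adam)
    if stage >= 1:
        optim /= num_gpus        # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus        # ZeRO-2 also partitions gradients
    if stage >= 3:
        params /= num_gpus       # ZeRO-3 also partitions parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: {model_state_gb(34e9, 8, stage):.1f} GB per GPU")
```

Under these assumptions a nominal 34B model on 8 GPUs needs about 544 GB, 187 GB, 127.5 GB, and 68 GB per GPU for stages 0 to 3 respectively.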

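Quantized Training

--quantization_bit 4
# Adding this flag while using the ZeRO-3 config above fails with:
# ValueError: DeepSpeed ZeRO-3 is incompatible with quantization.
# Note: do not leave comment lines between the argument lines of the training command. A comment breaks the backslash line continuation, so the arguments after it do not take effect.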
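
Since the error comes from combining ZeRO-3 with quantization, one possible workaround is to run the quantized job with a ZeRO-2 config instead. The sketch below derives such a config from a ZeRO-3 one by switching the stage and dropping the stage-3-only entries. `to_zero2` is a hypothetical helper, the dict is an abbreviated stand-in for `ds_config_lora.json`, and the resulting config is an assumption that has not been checked against this exact setup.

```python
import copy
import json


def to_zero2(ds_config):
    """Return a copy of a DeepSpeed config with ZeRO stage 2 and no stage-3-only keys."""
    cfg = copy.deepcopy(ds_config)
    zero = cfg["zero_optimization"]
    zero["stage"] = 2
    # Parameter offload is a ZeRO-3 feature, so drop it for stage 2.
    zero.pop("offload_param", None)
    # Remove the stage3_* tuning knobs, which only apply to stage 3.
    for key in [k for k in zero if k.startswith("stage3_")]:
        del zero[key]
    return cfg


# Abbreviated stand-in for the ZeRO-3 config shown earlier.
zero3_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
    },
    "train_micro_batch_size_per_gpu": "auto",
}

zero2_config = to_zero2(zero3_config)
print(json.dumps(zero2_config, indent=4))
```

In practice the input would come from `json.load` on the existing file, and the result would be written to a separate file (for example a hypothetical `ds_config_zero2.json`) that is then passed to `--deepspeed`.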

Common Issues

Very High Host Memory Usage

since you are offloading both parameters and optimizer state to CPU you would need roughly 18 bytes per model parameter. That means for 7B model you would need ~126GB of CPU memory. Please see page 3 of https://arxiv.org/pdf/1910.02054.pdf for a discussion of the memory breakdown.

Reference: How to calculate the cpu memory required for DeepSpeedZeRoOffload initialization? · Issue #3606 · microsoft/DeepSpeed · GitHub
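
Applying the quoted rule of thumb to this article's model gives a sense of scale. The calculation below assumes the 18 bytes per parameter figure from the issue and nominal parameter counts (7e9 and 34e9). `offload_cpu_memory_gb` is an illustrative helper, and the estimate covers full-parameter training with both offloads enabled, so a LoRA run may need less.

```python
BYTES_PER_PARAM = 18  # figure quoted from the DeepSpeed issue


def offload_cpu_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Rough CPU memory (GB) when parameters and optimizer state are offloaded."""
    return num_params * bytes_per_param / 1e9


for name, num_params in (("7B", 7e9), ("34B", 34e9)):
    print(f"{name}: ~{offload_cpu_memory_gb(num_params):.0f} GB of CPU memory")
```

This reproduces the roughly 126 GB quoted for a 7B model and gives about 612 GB for a nominal 34B model, which is why host memory can run out when both `offload_optimizer` and `offload_param` point to the CPU.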

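Enabling DeepSpeed ZeRO-3 lets the model be sharded and loaded directly into GPU memory, which keeps host memory usage low throughout initialization and training.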