Fine-tuning LLaMA with MindSpore on the OpenI Platform

Clone the pretrained model

Clone the open_llama_7b model repository, which downloads the model weights split into several shard files.

git lfs install
git clone https://huggingface.co/openlm-research/open_llama_7b

Prepare the environment

Install transformers

pip install transformers

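Run the conversion script

python mindformers/models/glm/convert_weight.py \
  --pt_ckpt_path /home/ma-user/work/models/mindspore/pt_glm_6b.pth \
  --ms_ckpt_path ../models/mindspore/ms_glm_6b.ckpt

The script writes the converted checkpoint to ms_glm_6b.ckpt.

Note: the conversion may fail with an error raised while loading the libgomp library bundled with torch.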

Solution:

export LD_PRELOAD=$LD_PRELOAD:/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/torch/lib/libgomp-d22c30c5.so.1

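How it works: locate libgomp-d22c30c5.so.1 inside the torch installation and append it to the LD_PRELOAD environment variable, so the dynamic loader loads it when the process starts. This error seems to occur only on ARM platforms.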
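The path in the export line above is tied to one conda environment and one torch build, and the hash in the file name may differ between torch builds. A minimal sketch, assuming torch is installed in the active environment, that finds the bundled libgomp and prints a matching export line; find_torch_libgomp, append_ld_preload and suggest_export are hypothetical helpers, not part of torch or MindSpore.

```python
import glob
import importlib.util
import os
from typing import Optional


def find_torch_libgomp(torch_dir: str) -> Optional[str]:
    """Return the first libgomp shared library under <torch_dir>/lib, or None."""
    # torch wheels may bundle a hash-suffixed copy such as libgomp-d22c30c5.so.1
    matches = sorted(glob.glob(os.path.join(torch_dir, "lib", "libgomp*.so*")))
    return matches[0] if matches else None


def append_ld_preload(current: str, lib_path: str) -> str:
    """Same effect as `LD_PRELOAD=$LD_PRELOAD:<lib>`, minus the stray ':' when it was empty."""
    return f"{current}:{lib_path}" if current else lib_path


def suggest_export() -> Optional[str]:
    """Build an export line for the active environment, or None if torch/libgomp is missing."""
    spec = importlib.util.find_spec("torch")  # locates the package without importing it
    if spec is None or not spec.submodule_search_locations:
        return None
    lib = find_torch_libgomp(next(iter(spec.submodule_search_locations)))
    if lib is None:
        return None
    return "export LD_PRELOAD=" + append_ld_preload(os.environ.get("LD_PRELOAD", ""), lib)


if __name__ == "__main__":
    # Only print the line: LD_PRELOAD is read when a process starts, so it must be
    # exported in the shell before running convert_weight.py.
    print(suggest_export() or "# torch or its bundled libgomp was not found")
```

The sketch prints the command instead of setting os.environ because changing LD_PRELOAD inside an already-running Python process does not affect the libraries that process has loaded.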

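Prepare the fine-tuning dataset

Fine-tuning method: LoRA

Preprocessing scripts for the alpaca dataset are currently provided for both full-parameter and LoRA fine-tuning tasks.

Dataset: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

Sample records in the raw alpaca format: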

# alpaca examples:
{
  "instruction": "Describe a time when you had to make a difficult decision.",
  "input": "",
  "output": "I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client\u2019s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team\u2019s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client\u2019s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities."
},
{
  "instruction": "Identify the odd one out.",
  "input": "Twitter, Instagram, Telegram",
  "output": "Telegram"
},
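alpaca_data.json is a single JSON array whose records each carry instruction, input (often empty) and output. A minimal sketch of a sanity check to run before conversion; load_alpaca is a hypothetical helper, not part of the alpaca or MindFormers tooling.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("instruction", "input", "output")


def load_alpaca(path: str) -> List[Dict[str, str]]:
    """Load alpaca_data.json (one JSON array) and check that every record has string fields."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # also decodes escapes such as \u2019
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            raise ValueError(f"record {i} is not a JSON object")
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
        if not all(isinstance(rec[k], str) for k in REQUIRED_KEYS):
            raise ValueError(f"record {i} has a non-string field")
    return records
```

For example, `records = load_alpaca("/home/ma-user/work/data/alpaca_data.json")` followed by `sum(1 for r in records if r["input"])` counts how many records use the optional input field.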

Run alpaca_converter.py, which uses fastchat to add prompt templates and converts the raw dataset into a multi-turn conversation format.

# Script path: tools/dataset_preprocess/llama/alpaca_converter.py
# Run the conversion script
python alpaca_converter.py \
  --data_path /home/ma-user/work/data/alpaca_data.json \
  --output_path /home/ma-user/work/data/alpaca-data-conversation.json

Parameters

data_path: path to the raw alpaca data
output_path: path for the converted conversation-format data

Sample after conversion:

{
  "id": "1",
  "conversations": [
    {
      "from": "human",
      "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
    },
    {
      "from": "gpt",
      "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    }
  ]
},
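The mapping from a raw record to this conversation format can be sketched as follows. This is an illustration, not the code of alpaca_converter.py: the no-input prompt is copied from the sample above, while the with-input prompt is an assumption taken from the Stanford Alpaca template. to_conversation is a hypothetical helper.

```python
from typing import Any, Dict

# Copied from the converted sample above (record with an empty "input").
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
# Assumption: records with a non-empty "input" use the Stanford Alpaca variant.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def to_conversation(idx: int, record: Dict[str, str]) -> Dict[str, Any]:
    """Wrap one alpaca record as a human/gpt exchange with a 1-based string id."""
    template = PROMPT_WITH_INPUT if record["input"] else PROMPT_NO_INPUT
    prompt = template.format(instruction=record["instruction"], input=record["input"])
    return {
        "id": str(idx),
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ],
    }
```

Applied over a whole file, this would be `[to_conversation(i + 1, r) for i, r in enumerate(records)]`, written out with `json.dump`.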

Run llama_preprocess.py to preprocess the data and generate MindRecord files, converting the prompt-templated conversations into MindRecord format.

Install the dependencies:

pip install "fschat[model_worker,webui]"

Run the script

# Script path: tools/dataset_preprocess/llama/llama_preprocess.py
# This tool depends on the fschat package to parse the prompt template; install fschat >= 0.2.13 (Python 3.9) beforehand
python llama_preprocess.py \
  --dataset_type qa \
  --input_glob /home/ma-user/work/data/alpaca-data-conversation.json \
  --model_file /home/ma-user/work/models/open_llama_7b/tokenizer.model \
  --seq_length 2048 \
  --output_file /home/ma-user/work/models/alpaca-fastchat2048.mindrecord
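The MindRecord file feeds the two columns listed under input_columns in the fine-tuning config: input_ids and labels. The sketch below is an assumption about the general shape of that step for one human/gpt turn, not the actual code of llama_preprocess.py. Prompt tokens stay in input_ids but are replaced by ignore_token_id (-100) in labels, so the loss covers only the response. Both sequences are then truncated or padded to seq_length using pad_token_id 0, as set in the model config. The token IDs are made up; real ones would come from the tokenizer.model passed via --model_file. build_example is a hypothetical helper.

```python
from typing import Dict, List

IGNORE_TOKEN_ID = -100  # ignore_token_id in run_llama_7b_lora.yaml
PAD_TOKEN_ID = 0        # pad_token_id in run_llama_7b_lora.yaml


def build_example(prompt_ids: List[int], response_ids: List[int], seq_length: int) -> Dict[str, List[int]]:
    """Concatenate prompt + response, mask the prompt in labels, then truncate/pad to seq_length."""
    input_ids = (prompt_ids + response_ids)[:seq_length]
    labels = ([IGNORE_TOKEN_ID] * len(prompt_ids) + response_ids)[:seq_length]
    pad = seq_length - len(input_ids)
    input_ids += [PAD_TOKEN_ID] * pad
    labels += [IGNORE_TOKEN_ID] * pad  # padding positions do not contribute to the loss
    return {"input_ids": input_ids, "labels": labels}
```

For instance, `build_example([1, 5, 6], [7, 2], 8)` gives input_ids `[1, 5, 6, 7, 2, 0, 0, 0]` and labels `[-100, -100, -100, 7, 2, -100, -100, -100]`.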

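LoRA fine-tuning

LoRA fine-tuning currently supports the llama_7b model, and a default config file is provided at configs/llama/run_llama_7b_lora.yaml.

Step 1. Edit the config file: as with full-parameter fine-tuning, set the training dataset path (train_dataset.data_loader.dataset_dir) and the pretrained weight path (load_checkpoint).
Step 2. Launch the LoRA fine-tuning task.
Note: the llama_7b_lora model can be launched on a single device; to do so, set use_parallel to False in the config file.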
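Step 1 and the single-device note amount to changing three keys in the YAML. A minimal sketch, assuming the file has already been loaded into a Python dict (for example with PyYAML's yaml.safe_load); set_by_path and apply_lora_overrides are hypothetical helpers, and the key names are the ones used in run_llama_7b_lora.yaml.

```python
from typing import Any, Dict


def set_by_path(cfg: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested key such as 'train_dataset.data_loader.dataset_dir', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def apply_lora_overrides(cfg: Dict[str, Any], dataset_dir: str, checkpoint: str) -> Dict[str, Any]:
    """Step 1 (dataset and weight paths) plus the single-device note (use_parallel: False)."""
    set_by_path(cfg, "train_dataset.data_loader.dataset_dir", dataset_dir)
    set_by_path(cfg, "load_checkpoint", checkpoint)
    set_by_path(cfg, "use_parallel", False)
    return cfg


# Usage (assuming PyYAML is installed):
#   import yaml
#   with open("configs/llama/run_llama_7b_lora.yaml") as f:
#       cfg = yaml.safe_load(f)
#   apply_lora_overrides(cfg,
#                        "/home/ma-user/work/models/alpaca-fastchat2048.mindrecord",
#                        "/home/ma-user/work/models/mindspore/open_llama_7b_ms.ckpt")
```

When PyYAML loads the file, the &train_dataset anchor and the *train_dataset alias become one shared dict, so the new dataset_dir is also visible through train_dataset_task.dataset_config. yaml.safe_dump does not keep comments, so editing the file by hand may be preferable if you want to preserve them.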

Launch with the script

python run_mindformer.py --config=./configs/llama/run_llama_7b_lora.yaml --use_parallel=False --run_mode=finetune

run_llama_7b_lora.yaml

seed: 0
output_dir: './output' # custom values are not supported yet; do not change this default
load_checkpoint: '/home/ma-user/work/models/mindspore/open_llama_7b_ms.ckpt'
src_strategy_path_or_dir: ''
auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
only_save_strategy: False
resume_training: False
run_mode: 'finetune'

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'llama_7b_lora'

# runner config
runner_config:
  epochs: 1
  batch_size: 2
  sink_mode: True
  sink_size: 2

# optimizer
optimizer:
  type: FP32StateAdamWeightDecay
  beta1: 0.9
  beta2: 0.95
  eps: 1.e-8
  learning_rate: 1.e-4

# lr schedule
lr_schedule:
  type: CosineWithWarmUpLR
  learning_rate: 1.e-4
  warmup_ratio: 0.03
  total_steps: -1 # -1 means it will load the total steps of the dataset

# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/home/ma-user/work/models/alpaca-fastchat2048.mindrecord"
    shuffle: True
  input_columns: ["input_ids", "labels"]  # "input_ids", "labels" , labels are used in instruction finetune.
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: True
  batch_size: 2
  repeat: 1
  numa_enable: False
  prefetch_size: 1
train_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *train_dataset
# if True, do evaluate during the training process. if false, do nothing.
# note that the task trainer should support _evaluate_in_training function.
do_eval: False

# eval dataset
eval_dataset: &eval_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/home/ma-user/work/models/alpaca-fastchat2048.mindrecord"
    shuffle: False
  input_columns: ["input_ids", "labels"]
  num_parallel_workers: 8
  python_multiprocessing: False
  drop_remainder: False
  repeat: 1
  numa_enable: False
  prefetch_size: 1
eval_dataset_task:
  type: CausalLanguageModelDataset
  dataset_config: *eval_dataset

use_parallel: False
# parallel context config
parallel:
  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
  gradients_mean: False
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: False
  strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
  parallel_optimizer_config:
    gradient_accumulation_shard: False
    parallel_optimizer_threshold: 64
# default parallel of device num = 8 910A
parallel_config:
  data_parallel: 8
  model_parallel: 1
  pipeline_stage: 1
  use_seq_parallel: False
  optimizer_shard: False
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

# recompute config
recompute_config:
  recompute: True
  select_recompute: False
  parallel_optimizer_comm_recompute: False
  mp_comm_recompute: True
  recompute_slice_activation: True

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "llama_7b_lora"
    save_checkpoint_steps: 20000
    integrated_save: False
    async_save: False
  - type: ObsMonitor

# mindspore context init config
context:
  mode: 0 #0--Graph Mode; 1--Pynative Mode
  device_target: "Ascend"
  enable_graph_kernel: False
  graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
  max_call_depth: 10000
  max_device_memory: "31GB"
  save_graphs: False
  save_graphs_path: "./graph"
  device_id: 0

# model config
model:
  model_config:
    type: LlamaConfig
    batch_size: 1 # add for increase predict
    seq_length: 2048
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 32000
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 0
    ignore_token_id: -100
    compute_dtype: "float16"
    layernorm_compute_dtype: "float32"
    softmax_compute_dtype: "float16"
    rotary_dtype: "float16"
    param_init_type: "float16"
    use_past: False
    pretrain_seqlen: 2048 # seqlen of the pretrain checkpoint: 2048 for llama and 4096 for llama2
    extend_method: "None" # support "None", "PI", "NTK"
    compute_in_2d: False
    use_flash_attention: False
    offset: 0
    use_past_shard: False
    checkpoint_name_or_path: "llama_7b_lora"
    repetition_penalty: 1
    max_decode_length: 512
    top_k: 3
    top_p: 1
    do_sample: False
    pet_config:
      pet_type: lora
      # configuration of lora
      in_channels: 4096
      out_channels: 4096
      lora_rank: 16
      lora_alpha: 16
      lora_dropout: 0.05
  arch:
    type: LlamaForCausalLMWithLora

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '<pad>'
    type: LlamaTokenizer

# metric
metric:
  type: PerplexityMetric

# wrapper cell config
runner_wrapper:
  type: MFTrainOneStepCell
  scale_sense:
    type: DynamicLossScaleUpdateCell
    loss_scale_value: 4294967296
    scale_factor: 2
    scale_window: 1000
  use_clip_grad: True

eval_callbacks:
  - type: ObsMonitor

auto_tune: False
filepath_prefix: './autotune'
autotune_per_step: 10

profile: False
profile_start_step: 1
profile_stop_step: 10
init_start_profile: False
profile_communication: False
profile_memory: True
layer_scale: False
layer_decay: 0.65
lr_scale_factor: 256

# cfts init config
remote_save_url: "Please input obs url on AICC platform."
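The pet_config block sets in_channels and out_channels to 4096, with lora_rank and lora_alpha both 16. A small worked calculation, assuming MindFormers follows the usual LoRA formulation (frozen W plus a low-rank update scaled by lora_alpha / lora_rank). It gives the extra trainable parameters per adapted 4096×4096 projection. The config does not say which projections are adapted, so only the per-layer figure is computed. lora_stats is a hypothetical helper.

```python
from typing import Tuple


def lora_stats(in_channels: int, out_channels: int, rank: int, alpha: float) -> Tuple[int, int, float]:
    """Extra trainable parameters, frozen parameters, and scaling for one LoRA-adapted linear layer."""
    # LoRA keeps W (out x in) frozen and learns A (rank x in) and B (out x rank);
    # the layer output becomes W x + (alpha / rank) * B A x.
    lora_params = rank * in_channels + out_channels * rank
    frozen_params = in_channels * out_channels
    scaling = alpha / rank
    return lora_params, frozen_params, scaling


if __name__ == "__main__":
    # Values from pet_config in run_llama_7b_lora.yaml
    extra, frozen, scale = lora_stats(4096, 4096, rank=16, alpha=16)
    print(extra, frozen, f"{extra / frozen:.2%}", scale)
```

With these values, lora_stats returns (131072, 16777216, 1.0): each adapted projection trains about 0.78% as many parameters as its frozen weight, and since lora_alpha equals lora_rank the low-rank update is added unscaled.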
