2. Config Explanation
2.1. ppo_trainer.yaml for FSDP Backend
2.1.1. Data
data:
  tokenizer: null
  train_files: ~/data/rlhf/gsm8k/train.parquet
  val_files: ~/data/rlhf/gsm8k/test.parquet
  prompt_key: prompt
  max_prompt_length: 512
  max_response_length: 512
  train_batch_size: 1024
  val_batch_size: 1312
  return_raw_input_ids: False  # This should be set to true when the tokenizer between policy and rm differs
  return_raw_chat: False
data.train_files: Training set parquet. Can be a list or a single file. The program reads all files into memory, so the data can’t be too large (< 100GB). The path can be either a local path or an HDFS path. For an HDFS path, we provide utils to download it to DRAM and convert the HDFS path to a local path.
data.val_files: Validation parquet. Can be a list or a single file.
data.prompt_key: The field in the dataset where the prompt is located. Default is ‘prompt’.
data.max_prompt_length: Maximum prompt length. All prompts will be left-padded to this length. An error will be reported if a prompt is longer than this length.
data.max_response_length: Maximum response length. Rollout in RL algorithms (e.g. PPO) generates up to this length.
data.train_batch_size: Batch size sampled for one training iteration of different RL algorithms.
data.val_batch_size: Batch size sampled for one validation iteration.
data.return_raw_input_ids: Whether to return the original input_ids without adding the chat template. This is mainly used when the reward model’s chat template differs from the policy’s: the input_ids need to be decoded first, and then the RM’s chat template is applied. If using a model-based RM, and the policy and RM chat_templates are different, this flag needs to be set.
data.return_raw_chat:
data.truncation: Truncate the input_ids or prompt if they exceed max_prompt_length. Default is ‘error’, which does not allow exceeding max_prompt_length. Users should increase max_prompt_length if this error is thrown.
V1:
data.prompt_id_key: The field in the dataset where the prompt_id is located
data.max_prompt_id_length: In data processing, the prompt_id will be tokenized using the tokenizer and packaged with the prompt. This specifies the maximum length of the tokenized prompt_id. An error will be reported if it’s not long enough
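The interaction between data.max_prompt_length and data.truncation described above can be sketched in a few lines of Python. This is only an illustration, not the library's implementation: the function name fit_prompt and the ‘left’ and ‘right’ truncation modes are hypothetical, since the text above only names the ‘error’ mode.

```python
def fit_prompt(input_ids, max_prompt_length, pad_token_id, truncation="error"):
    """Left-pad (or truncate) a tokenized prompt to exactly max_prompt_length.

    'error' mirrors the documented default. 'left' and 'right' are
    hypothetical modes added here purely for illustration.
    """
    ids = list(input_ids)
    if len(ids) > max_prompt_length:
        if truncation == "error":
            raise ValueError(
                f"prompt length {len(ids)} exceeds max_prompt_length {max_prompt_length}"
            )
        elif truncation == "left":
            ids = ids[-max_prompt_length:]   # drop the oldest tokens
        elif truncation == "right":
            ids = ids[:max_prompt_length]    # drop the newest tokens
        else:
            raise ValueError(f"unknown truncation mode: {truncation}")

    # Left padding keeps the real tokens adjacent to the generated response.
    num_pad = max_prompt_length - len(ids)
    padded = [pad_token_id] * num_pad + ids
    attention_mask = [0] * num_pad + [1] * len(ids)
    return padded, attention_mask


padded, mask = fit_prompt([5, 6, 7], max_prompt_length=5, pad_token_id=0)
print(padded, mask)
```

With the default ‘error’ mode, a prompt longer than max_prompt_length raises instead of being silently shortened, which is why the text recommends raising max_prompt_length when the error appears.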
2.1.2. Actor/Rollout/Reference Policy
actor_rollout_ref:
  hybrid_engine: True
  model:
    path: ~/models/deepseek-llm-7b-chat
    external_lib: null
    override_config: {}
    enable_gradient_checkpointing: False
  actor:
    strategy: fsdp  # This is for backward-compatibility
    ppo_mini_batch_size: 256
    ppo_micro_batch_size: 64
    grad_clip: 1.0
    clip_ratio: 0.2
    entropy_coeff: 0.001
    ppo_epochs: 1
    shuffle: True
    optim:
      lr: 1e-6
      lr_warmup_steps_ratio: 0.  # the total steps will be injected during runtime
      min_lr_ratio: null  # only useful for warmup with cosine
      warmup_style: constant  # select from constant/cosine
      total_training_steps: -1  # must be override by program
    fsdp_config:
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
      param_offload: False
      grad_offload: False
      optimizer_offload: False
  ref:
    fsdp_config:
      param_offload: False
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
    log_prob_micro_batch_size: 128
  rollout:
    name: vllm
    temperature: 1.0
    top_k: -1  # 0 for hf rollout, -1 for vllm rollout
    top_p: 1
    response_length: ${data.max_response_length}
    # for vllm rollout
    dtype: bfloat16  # should align with FSDP
    gpu_memory_utilization: 0.5
    ignore_eos: False
    enforce_eager: True
    free_cache_engine: True
    load_format: dummy_dtensor  # or dummy_hf or dummy_megatron
    tensor_model_parallel_size: 2
    max_num_batched_tokens: 8192
    max_num_seqs: 1024
    log_prob_micro_batch_size: 128
    # for vllm and hf rollout
    do_sample: True
Common config for actor, rollout and reference model
actor_rollout_ref.hybrid_engine: Whether to use a hybrid engine. Currently only the hybrid engine is supported.
actor_rollout_ref.model.path: Huggingface model path. This can be either a local path or an HDFS path. For an HDFS path, we provide utils to download it to DRAM and convert the HDFS path to a local path.
actor_rollout_ref.model.external_lib: Additional Python packages that need to be imported. Used to register models or tokenizers into the Huggingface system.
actor_rollout_ref.model.override_config: Used to override some of the model’s original configurations, mainly dropout.
actor_rollout_ref.model.enable_gradient_checkpointing: Whether to enable gradient checkpointing for the actor.
Actor model
actor_rollout_ref.actor.strategy: fsdp or megatron. In this example, we use the fsdp backend.
actor_rollout_ref.actor.ppo_mini_batch_size: One sampled batch is split into multiple sub-batches with batch_size=ppo_mini_batch_size for PPO updates.
actor_rollout_ref.actor.ppo_micro_batch_size: Similar to gradient accumulation, the micro_batch_size for one forward pass, trading speed for GPU memory.
actor_rollout_ref.actor.grad_clip: Gradient clipping for actor updates.
actor_rollout_ref.actor.clip_ratio: PPO clip ratio.
actor_rollout_ref.actor.entropy_coeff: The weight of entropy when calculating the PPO loss.
actor_rollout_ref.actor.ppo_epochs: Number of epochs for PPO updates on one set of sampled data.
actor_rollout_ref.actor.shuffle: Whether to shuffle data when there are multiple epochs.
actor_rollout_ref.actor.optim: Actor’s optimizer parameters.
actor_rollout_ref.actor.fsdp_config: FSDP config for actor training.
  wrap_policy: FSDP wrap policy. By default, it uses Huggingface’s wrap policy, i.e., wrapping by DecoderLayer. There is no need to set transformer_layer_cls_to_wrap, so it is commented out.
  *_offload: Whether to enable parameter, gradient and optimizer offload. Trades speed for GPU memory.
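To make the relationship between data.train_batch_size, ppo_mini_batch_size, ppo_micro_batch_size and ppo_epochs concrete, the following sketch computes how many optimizer steps and forward/backward passes one training iteration implies. It is a simplified illustration: the function name ppo_update_schedule is hypothetical, and the batch sizes are treated as global values, ignoring how they are divided across GPUs.

```python
def ppo_update_schedule(train_batch_size, ppo_mini_batch_size,
                        ppo_micro_batch_size, ppo_epochs):
    """Count the updates implied by the batch-size settings (illustrative only)."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by ppo_mini_batch_size")
    if ppo_mini_batch_size % ppo_micro_batch_size != 0:
        raise ValueError("ppo_mini_batch_size must be divisible by ppo_micro_batch_size")

    # Each mini-batch produces one optimizer step.
    mini_batches_per_epoch = train_batch_size // ppo_mini_batch_size
    # Each mini-batch is processed as several micro-batches whose gradients
    # are accumulated before the optimizer step.
    grad_accum_steps = ppo_mini_batch_size // ppo_micro_batch_size

    optimizer_steps = mini_batches_per_epoch * ppo_epochs
    return {
        "mini_batches_per_epoch": mini_batches_per_epoch,
        "grad_accum_steps": grad_accum_steps,
        "optimizer_steps": optimizer_steps,
        "forward_backward_passes": optimizer_steps * grad_accum_steps,
    }


# Values taken from the example config above.
print(ppo_update_schedule(1024, 256, 64, ppo_epochs=1))
```

Lowering ppo_micro_batch_size increases the number of forward/backward passes without changing the number of optimizer steps, which is the speed-for-memory trade described above.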
Reference Model
actor_rollout_ref.ref: FSDP config same as actor. For models larger than 7B, it’s recommended to turn on offload for ref by default.
actor_rollout_ref.ref.log_prob_micro_batch_size: The batch size for one forward pass in the computation of ref_log_prob.
Rollout Model
actor_rollout_ref.rollout.name: hf/vllm. We use vLLM by default because it’s much more efficient and our hybrid engine is implemented with vLLM.
Rollout (auto-regressive) parameters. The key should be equal to the property name in vLLM’s SamplingParams.
  temperature, top_k, top_p and others: Sampling parameters in SamplingParams.
dtype: Rollout model parameter type. This should be aligned with the actor model parameter type in the FSDP/Megatron backend.
gpu_memory_utilization: The proportion of the remaining GPU memory allocated for the KV cache after other models have been initialized when using vLLM.
tensor_model_parallel_size: TP size for rollout. Only effective for vLLM.
log_prob_micro_batch_size: Micro batch size (the batch size for one forward pass) for recalculating log_prob.
do_sample: Whether to sample. If set to False, the rollout model will perform greedy decoding. We disable do_sample during validation.
actor_rollout_ref.rollout.ignore_eos: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
actor_rollout_ref.rollout.free_cache_engine: Offload the KVCache after the rollout generation stage. Default is True. When set to True, we need to disable the usage of CUDAGraph (set enforce_eager to True).
actor_rollout_ref.rollout.enforce_eager: Whether to disable CUDAGraph in vLLM generation. Default is True, which disables CUDAGraph.
actor_rollout_ref.rollout.load_format: Which weight loader to use to load the actor model weights into the rollout model.
  auto: Use Megatron weight loader.
  megatron: Use Megatron weight loader. Deployed with the Megatron backend. The input model state_dict() is already partitioned along the TP dimension and already gathered along the PP dimension. This weight loader requires that the parameter shapes and names of the rollout model and the actor model be identical.
  dtensor: Default solution when using the Huggingface weight loader. Deployed with the FSDP backend, where the state_dict_type is StateDictType.SHARDED_STATE_DICT. This weight loader is recommended.
  hf: Use Huggingface weight loader. Deployed with the FSDP backend, where the state_dict_type is StateDictType.FULL_STATE_DICT. This solution doesn’t need to rewrite the weight loader for each model implemented in vLLM, but it results in larger peak memory usage.
  dummy_hf, dummy_megatron, dummy_dtensor: Random initialization.
Note
In this config field, users only need to select from dummy_megatron, dummy_dtensor and dummy_hf for rollout initialization. Our hybrid engine will select the corresponding weight loader (i.e., megatron, dtensor, hf) during actor/rollout weight synchronization.
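The correspondence described in the note can be written down as a small lookup. This is a sketch of the mapping only, not the hybrid engine's actual selection code; the names SYNC_WEIGHT_LOADER and sync_weight_loader are hypothetical.

```python
# Mapping from the rollout initialization format chosen in the config to the
# weight loader used during actor/rollout weight synchronization, as stated
# in the note above. Both names below are hypothetical, for illustration.
SYNC_WEIGHT_LOADER = {
    "dummy_megatron": "megatron",
    "dummy_dtensor": "dtensor",
    "dummy_hf": "hf",
}


def sync_weight_loader(load_format):
    """Return the weight loader that corresponds to a dummy_* load_format."""
    try:
        return SYNC_WEIGHT_LOADER[load_format]
    except KeyError:
        raise ValueError(
            f"expected one of {sorted(SYNC_WEIGHT_LOADER)}, got {load_format!r}"
        ) from None


print(sync_weight_loader("dummy_dtensor"))
```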
2.1.3. Critic Model
Most parameters for Critic are similar to Actor Model.
2.1.4. Reward Model
reward_model:
  enable: False
  model:
    input_tokenizer: ${actor_rollout_ref.model.path}  # set this to null if the chat template is identical
    path: ~/models/Anomy-RM-v0.1
    external_lib: ${actor_rollout_ref.model.external_lib}
    fsdp_config:
      min_num_params: 0
      param_offload: False
  micro_batch_size: 64
  max_length: null
reward_model.enable: Whether to enable the reward model. If False, we compute the reward only with the user-defined reward functions. In the GSM8K and Math examples, we disable the reward model. For the RLHF alignment example using full_hh_rlhf, we use the reward model to assess the responses. If False, the following parameters have no effect.
reward_model.model
  input_tokenizer: Input tokenizer. If the reward model’s chat template is inconsistent with the policy’s, we need to first decode to plaintext, then apply the RM’s chat_template, and then score with the RM. If the chat_templates are consistent, it can be set to null.
  path: RM’s HDFS path or local path. Note that the RM only supports AutoModelForSequenceClassification. Other model types need to define their own RewardModelWorker and pass it from the code.
2.1.5. Algorithm
algorithm:
  gamma: 1.0
  lam: 1.0
  adv_estimator: gae
  kl_penalty: kl  # how to estimate kl divergence
  kl_ctrl:
    type: fixed
    kl_coef: 0.005
gamma: Discount factor.
lam: Trade-off between bias and variance in the GAE estimator.
adv_estimator: gae. Currently only supports gae; GRPO will be supported in the future.
kl_penalty: Supports kl, abs, mse and full. How to calculate the KL divergence between the actor and the reference policy. For specific options, refer to core_algos.py.
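As a rough illustration of how gamma, lam and kl_penalty enter the computation, the sketch below implements textbook GAE for a single trajectory plus three common per-token KL estimators. It is not copied from core_algos.py: the function names are hypothetical, the formulas chosen for kl, abs and mse are assumptions, and the full option is omitted because it needs the complete token distributions.

```python
def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Textbook GAE over one trajectory (plain Python lists, illustrative).

    rewards[t] and values[t] are per-token; the value after the last
    token is assumed to be 0.
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def kl_penalty(logprob, ref_logprob, kind="kl"):
    """Per-token KL estimate between actor and reference (assumed formulas)."""
    diff = logprob - ref_logprob
    if kind == "kl":
        return diff
    if kind == "abs":
        return abs(diff)
    if kind == "mse":
        return 0.5 * diff ** 2
    raise ValueError(f"unsupported kl_penalty in this sketch: {kind}")


adv, ret = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
print(adv, ret)
```

With gamma = 1.0 and lam = 1.0, as in the config above, each advantage reduces to the sum of future rewards minus the value estimate at that token, which is the low-bias, high-variance end of the trade-off that lam controls.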
2.1.6. Trainer
trainer:
  total_epochs: 30
  project_name: verl_examples
  experiment_name: gsm8k
  logger: ['console', 'tracking']
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  test_freq: 2
  critic_warmup: 0
  default_hdfs_dir: ~/experiments/gsm8k/ppo/${trainer.experiment_name}  # hdfs checkpoint path
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}  # local checkpoint path
trainer.total_epochs: Number of epochs in training.
trainer.project_name: For wandb.
trainer.experiment_name: For wandb.
trainer.logger: Supports console and tracking. For tracking, we will initialize wandb.
trainer.nnodes: Number of nodes used in the training.
trainer.n_gpus_per_node: Number of GPUs per node.
trainer.save_freq: The frequency (by iteration) to save checkpoints of the actor and critic models.
trainer.test_freq: The validation frequency (by iteration).
trainer.critic_warmup: The number of iterations to train the critic model before actual policy learning.
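The frequency settings above can be read as per-iteration decisions. The sketch below is one plausible interpretation, not the trainer's actual loop: the function name actions_for_iteration is hypothetical, iterations are assumed to be counted from 0, and a non-positive frequency is assumed to disable the corresponding action (the example config uses save_freq: -1).

```python
def actions_for_iteration(iteration, save_freq, test_freq, critic_warmup):
    """Decide what happens at a 0-indexed training iteration (illustrative)."""

    def due(freq):
        # Assumption: non-positive frequencies disable the action; otherwise
        # it fires after every `freq` completed iterations.
        return freq > 0 and (iteration + 1) % freq == 0

    return {
        # The critic trains alone for the first `critic_warmup` iterations.
        "update_actor": iteration >= critic_warmup,
        "validate": due(test_freq),
        "save": due(save_freq),
    }


# Values taken from the example config above.
for it in range(4):
    print(it, actions_for_iteration(it, save_freq=-1, test_freq=2, critic_warmup=0))
```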