Align your LLM with a technique that is more memory- and speed-efficient than DPO.
Aligning LLMs for optimal performance typically begins with Supervised Fine-Tuning (SFT). The usual practice is to load the model in 4-bit mode and apply a configuration for LoRA (Low-Rank Adaptation) training. Direct Preference Optimization (DPO) is another prominent technique for optimizing models at lower cost. SFT and DPO are commonly coupled to further improve model performance, but this can be expensive. Odds Ratio Preference Optimization (ORPO) collapses SFT+DPO into a single step with further improved performance by adding an odds ratio-based penalty to the conventional negative log-likelihood (NLL) loss, which differentiates between favored and disfavored generations.
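As a reminder of what that penalty looks like (notation paraphrased from the ORPO paper, not quoted from it), the ORPO objective adds a λ-weighted odds-ratio term to the SFT loss:

\mathcal{L}_{ORPO} = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\,\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\,\right]
\mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right), \qquad \text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}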
Another technique for more stable training and improved performance is CPO-SimPO. It aims to counter SFT's dependence on training data quality, DPO's memory and speed inefficiency (it has to handle both a parametrized and a reference policy), and the generation of long but low-quality sequences. In this blog, I will introduce this technique in detail and train Phi3-Mini-4K-Instruct with CPO-SimPO.
CPO-SimPO is a combination of two preference optimization methods: CPO and SimPO.
Introduced by Haoran Xu et al., 2024, the CPO objective is an approximation of the DPO objective, obtained by discarding the ideal (reference) policy in the original DPO loss. In addition, a behavior cloning (BC) regularizer is incorporated to ensure the model does not deviate from the preferred data distribution.
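Written out (paraphrasing the CPO paper's notation), the loss is a reference-free preference term plus an NLL term on the chosen responses, which acts as the BC regularizer:

\mathcal{L}_{prefer} = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\big(\beta \log \pi_\theta(y_w \mid x) - \beta \log \pi_\theta(y_l \mid x)\big)\right]
\mathcal{L}_{NLL} = -\mathbb{E}_{(x,\,y_w)\sim\mathcal{D}}\left[\log \pi_\theta(y_w \mid x)\right]
\mathcal{L}_{CPO} = \mathcal{L}_{prefer} + \mathcal{L}_{NLL}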
CPO requires a high-quality but not flawless preference dataset (format: prompt, chosen, rejected): the idea is to push the model toward near-perfect outputs and away from responses with even minor flaws.
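For concreteness, a single record in this format could look like the following (a made-up, hypothetical example; real UltraFeedback-style datasets typically store chosen/rejected as full chat message lists):

# Hypothetical preference record in (prompt, chosen, rejected) format.
example = {
    "prompt": "Explain what gradient checkpointing does in one sentence.",
    "chosen": "Gradient checkpointing saves memory by recomputing intermediate activations during the backward pass instead of storing them all.",
    "rejected": "It makes training faster by skipping the backward pass.",
}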
Introduced by Yu Meng et al., 2024, SimPO eliminates the need for a reference model, unlike standard DPO: it uses a length-normalized reward, the average log probability of all tokens of a response under the policy model itself, instead of DPO's reward, which is defined relative to a reference model. Second, it introduces a target reward margin γ to ensure that the reward difference between chosen and rejected responses exceeds this margin.
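In equation form (again paraphrasing the SimPO paper), the implicit reward and the resulting loss are:

r_{SimPO}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
\mathcal{L}_{SimPO} = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]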
Because it does not use a reference model, SimPO is more memory- and compute-efficient than DPO, and it also avoids generating longer but lower-quality sequences; it outperforms DPO on AlpacaEval 2 and Arena-Hard.
Combining both objectives gives the CPO-SimPO loss, which exploits the advantages of the two preference optimization methods together.
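Under my reading of the CPO_SIMPO code and TRL's CPOTrainer (so treat this as a paraphrase, not an official formula), the combined loss is simply the SimPO preference loss plus an α-weighted NLL term on the chosen responses:

\mathcal{L}_{CPO\text{-}SimPO} = \mathcal{L}_{SimPO} + \alpha \cdot \mathcal{L}_{NLL}

In the training config below, this corresponds to loss_type: simpo together with a non-zero cpo_alpha (α) and a target margin simpo_gamma (γ).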
We will perform CPO-SimPO training of any Hugging Face model using the official GitHub repository.
We need to create a Python environment using conda, so if you don't have conda installed, here's how to install it:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
You will need to open a new terminal for the changes to take effect. Now create a Python virtual environment:
conda create -n handbook python=3.10 && conda activate handbook
Then you will need to install PyTorch v2.2.2; the installation method depends on your system.
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia
Since this codebase is built on top of the alignment-handbook repo, we will install its package dependencies first:
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
cd ..
You will also need Flash Attention 2 installed:
python -m pip install flash-attn --no-build-isolation
Now let's clone the CPO_SimPO repository:
git clone https://github.com/fe1ixxu/CPO_SIMPO.git
cd CPO_SIMPO
You need to create a .yaml config file to specify the training arguments. Modify per_device_train_batch_size and max_length based on your GPU specs. Note that we set loss_type: simpo and cpo_alpha to a non-zero value:
# Model arguments
model_name_or_path: microsoft/Phi-3-mini-4k-instruct
torch_dtype: null
use_flash_attention_2: false
# Data training arguments
dataset_mixer:
  princeton-nlp/llama3-ultrafeedback: 1.0
dataset_splits:
- train
- test
preprocessing_num_workers: 12
# CPOTrainer arguments
bf16: true
beta: 10
simpo_gamma: 5.4
cpo_alpha: 0.05
loss_type: simpo
do_eval: true
evaluation_strategy: steps
eval_steps: 400
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
hub_model_id: cpo-simpo-exps
learning_rate: 1.0e-6
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
num_train_epochs: 1
optim: adamw_torch
output_dir: outputs/phi3mini4k-cpo-simpo
run_name: phi3mini4k-cpo-simpo
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
push_to_hub: false
save_strategy: "steps"
save_steps: 1000000
report_to:
- none
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
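To make the mapping from these YAML keys to code more concrete, here is a rough sketch of how the same arguments would look if you called TRL's CPOTrainer directly. The repo's scripts/run_cpo.py does the real work (chat templating, dataset mixing, evaluation), so treat this only as an illustration; the exact keyword names can differ across TRL versions.

# Minimal sketch (my assumption): mapping the YAML arguments onto TRL's CPOConfig / CPOTrainer.
# Requires a recent TRL release; older versions take `tokenizer=` instead of `processing_class=`.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Preference dataset with prompt / chosen / rejected columns.
train_dataset = load_dataset("princeton-nlp/llama3-ultrafeedback", split="train")

args = CPOConfig(
    output_dir="outputs/phi3mini4k-cpo-simpo",
    loss_type="simpo",       # use the SimPO (length-normalized, margin-based) preference loss
    cpo_alpha=0.05,          # non-zero alpha keeps the NLL / behavior-cloning term from CPO
    simpo_gamma=5.4,         # target reward margin gamma
    beta=10,
    max_length=2048,
    max_prompt_length=1800,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1.0e-6,
    num_train_epochs=1,
    bf16=True,
)

trainer = CPOTrainer(model=model, args=args, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()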
Next, we need to specify the hardware configuration. We will use the deepspeed_zero3.yaml config provided in the repository under the accelerate_configs directory. Set num_processes to the number of GPUs you have available. You may need A100 GPUs to avoid CUDA errors.
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: principal
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Provide the paths to your training and accelerate config files and start training. Your final model will be available in the output_dir specified in the training arguments.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_cpo.py training_configs/phi3-mini4k-instruct-cpo-simpo.yaml
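If you want the checkpoint on the Hub (the inference snippet below assumes it has been uploaded), one simple way is to push it from the output directory. The repo id here is just a placeholder; use your own account and run huggingface-cli login first.

# Hypothetical upload step: push the trained checkpoint and tokenizer to the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

output_dir = "outputs/phi3mini4k-cpo-simpo"
repo_id = "your-username/Phi-3-mini-4K-instruct-cpo-simpo"  # placeholder repo id

model = AutoModelForCausalLM.from_pretrained(output_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)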
After uploading the model to your Hugging Face account, you can perform inference in the following way:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
"abideen/Phi-3-mini-4K-instruct-cpo-simpo",
device_map="cuda",
torch_dtype="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("abideen/Phi-3-mini-4K-instruct-cpo-simpo")
messages = [
{"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
{"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
{"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
generation_args = {
"max_new_tokens": 500,
"return_full_text": False,
"temperature": 0.0,
"do_sample": False,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])