Logging and Loading a QLoRA Model with MLflow
This is a minimal example of how to log a QLoRA model with MLflow. It does not show any actual model training or data processing, only the basic process of saving and loading the model.
Install Dependencies
#+begin_src sh
%pip install --upgrade torch
%pip install --upgrade transformers accelerate peft bitsandbytes mlflow pynvml packaging ninja
#+end_src
#+begin_src sh
%sh
cd /databricks/driver/
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install . --no-build-isolation
#+end_src
(See this earlier note for more information on installing flash attention)
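To confirm the build landed in the active Python environment, a quick (optional) check is to import the package and print its version:

#+begin_src python
import flash_attn

# If this import fails, the build above did not install into the current environment.
print(flash_attn.__version__)
#+end_src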
Set up assets/cache directories
#+begin_src python
# Some Environment Setup
ASSETS_DIR = "<assets_dir>"
OUTPUT_DIR = ASSETS_DIR + "/results/mistral_qlora_min/"  # the path to the output directory; where model checkpoints will be saved
LOG_DIR = ASSETS_DIR + "/logs/mistral_qlora_min/"        # the path to the log directory; where logs will be saved
CACHE_DIR = ASSETS_DIR + "/cache/mistral_qlora_min/"     # the path to the cache directory; where cache files will be saved
#+end_src
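None of these directories are guaranteed to exist yet; a small optional sketch to create them up front:

#+begin_src python
import os

# Create the output, log, and cache directories if they are missing.
for d in (OUTPUT_DIR, LOG_DIR, CACHE_DIR):
    os.makedirs(d, exist_ok=True)
#+end_src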
Skip Data Processing
We are not preparing or using any training data in this example, so this step is skipped entirely.
Load the model, tokenizer, etc.
Tokenizer
#+begin_src python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
#+end_src
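One detail worth noting if you do go on to train or batch inputs: the Mistral tokenizer ships without a pad token. A common workaround (not needed for anything else in this note) is to reuse the EOS token:

#+begin_src python
# Only relevant for batched training/evaluation; the rest of this note does not rely on it.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
#+end_src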
LoRA config
#+begin_src python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=64,
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
    lora_alpha=32,
    lora_dropout=0.05,
)
#+end_src
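For reference, in the standard LoRA formulation each targeted linear layer gets a low-rank update scaled by lora_alpha / r, so with the values above the adapter's contribution is scaled by 32 / 64 = 0.5:

\[ h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r}, \; A \in \mathbb{R}^{r \times d_{\text{in}}} \]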
Model
#+begin_src python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    trust_remote_code=True,
    cache_dir=CACHE_DIR,
    device_map="auto",
    quantization_config=bnb_config,
)
#+end_src
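To confirm the 4-bit quantization actually took effect, one quick check (get_memory_footprint is a standard transformers method; the exact number depends on your environment) is:

#+begin_src python
# Should be roughly a quarter of the ~14-15 GB a bf16 Mistral-7B would occupy.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
#+end_src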
Set up the PEFT model
#+begin_src python
from peft import get_peft_model

# This wraps the base model and returns a PeftModelForCausalLM:
model = get_peft_model(model, lora_config)

# This would attach the adapter without changing the model's type:
# model.add_adapter(lora_config)
#+end_src
~model = get_peft_model(model, lora_config)~ and ~model.add_adapter(lora_config)~ yield different results: the former changes the type of the model to ~PeftModelForCausalLM~; the latter does not.
#+begin_example
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (rotary_emb): MistralRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=14336, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm()
            (post_attention_layernorm): MistralRMSNorm()
          )
        )
        (norm): MistralRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
#+end_example
Though the ~add_adapter~ approach does, in fact, add the adapter, it does not change the model's type. It is not clear to me what the significance of this is for training, inference, or MLflow handling.
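A quick way to poke at the wrapped model from above: the attribute path mirrors the printed structure, and ~print_trainable_parameters~ is a method on the PeftModel wrapper. A minimal sketch:

#+begin_src python
# After get_peft_model above:
print(type(model).__name__)                    # PeftModelForCausalLM
print(type(model.base_model.model).__name__)   # MistralForCausalLM, the wrapped base model
model.print_trainable_parameters()             # only the LoRA adapter weights are trainable
#+end_src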
Skip Training
Again, we are not actually training the model.
Log to MLflow
(Not all of this is strictly necessary: the signature is optional, but it documents the inputs, outputs, and inference parameters the logged model expects.)
#+begin_src python
from mlflow.models import infer_signature

prompt_template = """<|im_start|>system
You are a helpful assistant and an expert at making coffee.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""

# Define the sample input/output
sample_input = "What is two plus two?"
sample_output = prompt_template.format(prompt=sample_input) + "four<|im_end|>\n<|endoftext|>"

# Define the sample parameters
sample_params = {
    "max_new_tokens": 512,
    "repetition_penalty": 1.1,
}

# MLflow infers the schema from the provided sample input/output/params
signature = infer_signature(
    model_input=sample_input,
    model_output=sample_output,
    params=sample_params,
)
print(signature)
#+end_src
#+RESULTS:
: inputs: [string (required)]
: outputs: [string (required)]
: params: ['max_new_tokens': long (default: 512), 'repetition_penalty': double (default: 1.1)]
#+begin_src python
import mlflow

with mlflow.start_run():
    mlflow.log_params(lora_config.to_dict())
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        signature=signature,
        artifact_path="model",  # relative path where the model files are saved within the MLflow run
        extra_pip_requirements=["bitsandbytes", "peft"],
    )
#+end_src
Note the message printed at this step:
INFO mlflow.transformers: Overriding save_pretrained to False for PEFT models, following the Transformers behavior. The PEFT adaptor and config will be saved, but the base model weights will not and reference to the HuggingFace Hub repository will be logged instead.
This is expected for PEFT models: only the adapter weights and config are logged, and the base model is pulled from the Hugging Face Hub again when the logged model is loaded.
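You can verify this by listing what was actually stored under the run's model artifact path; a sketch, assuming ~run_id~ holds the ID of the run above (e.g. captured from ~run.info.run_id~ inside the ~with~ block):

#+begin_src python
from mlflow import MlflowClient

client = MlflowClient()
# For a PEFT model, expect the adapter files and config here,
# not multi-gigabyte base model weights.
for artifact in client.list_artifacts(run_id, "model"):
    print(artifact.path)
#+end_src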
Load the MLflow model
#+begin_src python
import mlflow

run_id = "<model_id>"  # the ID of the MLflow run that logged the model
mlflow_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
#+end_src
This will load the model. We can then use its predict method.
#+begin_src python
mlflow_model.predict("Classify the following as postive, negative, or neutral: 'I had a rotten day!'")
#+end_src
Which returns:
"Classify the following as postive, negative, or neutral: 'I had a rotten day!'\n* 10.24 Classify the following as postive, negative, or neutral: 'I'm so happy to see you!'\n* 10.25 Classify the following as postive, negative, or neutral: 'I'm so sorry I was late.'\n* 10.26 Classify the following as postive, negative, or neutral: 'I'm so glad you came.'\n* 10.27 Classify the following as postive, negative, or neutral: 'I'm so sorry I didn't call.'\n* 10.28 Classify the following as postive, negative, or neutral: 'I'm so glad you called.'\n* [...]
The model simply continues the text rather than answering, because we did not actually fine-tune it to follow any of our instructions.
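Since the signature declared ~max_new_tokens~ and ~repetition_penalty~ as parameters, they can also be overridden per call (supported by pyfunc ~predict~ in recent MLflow versions). A sketch:

#+begin_src python
mlflow_model.predict(
    "Classify the following as positive, negative, or neutral: 'What a great cup of coffee!'",
    params={"max_new_tokens": 64, "repetition_penalty": 1.1},
)
#+end_src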
Summary
This note showed the basics of how to log and load a PEFT model with MLflow.