LLM Fine-Tuning Techniques: A Technical Overview

Fine-tuning adapts a pretrained large language model to specific tasks or domains by continuing training on a smaller, targeted dataset. This write-up covers the major approaches, their mechanics, and practical considerations.


Full Fine-Tuning

The most straightforward approach updates all model parameters during training. You initialize with pretrained weights and perform gradient descent on your task-specific dataset.

Mechanics: Given a pretrained model with parameters $\theta$, you minimize a task-specific loss $\mathcal{L}$ over your dataset $\mathcal{D}$:

$$\theta^* = \arg\min_\theta \sum_{(x,y) \in \mathcal{D}} \mathcal{L}(f(x; \theta), y)$$

For language models, this is typically cross-entropy loss over next-token predictions.
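
As a concrete illustration, here is a minimal PyTorch-style sketch of one full fine-tuning step, assuming the model maps token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size); the function and variable names are illustrative, not from any particular library.

```python
import torch.nn.functional as F

def full_finetune_step(model, input_ids, optimizer):
    """One full fine-tuning step: next-token cross-entropy, all parameters trainable.

    Assumes model(input_ids) returns logits of shape (batch, seq_len, vocab_size).
    """
    logits = model(input_ids)                          # (B, T, V)
    shift_logits = logits[:, :-1, :]                   # position t predicts token t+1
    shift_labels = input_ids[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```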

Considerations: Full fine-tuning offers maximum expressiveness but requires substantial compute and memory—you need to store optimizer states for every parameter. For a 7B parameter model with AdamW, that's roughly 56GB just for optimizer states (two fp32 values per parameter). There's also significant risk of catastrophic forgetting, where the model loses general capabilities while specializing.
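
The 56GB figure follows directly from the parameter count, counting only the two fp32 moment estimates and ignoring weights, gradients, and activations:

$$
7 \times 10^9 \ \text{params} \times 2 \ \text{states/param} \times 4 \ \text{bytes/state} = 5.6 \times 10^{10}\ \text{bytes} \approx 56\ \text{GB}
$$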


Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods freeze most pretrained weights and train only a small subset of parameters, dramatically reducing memory requirements while often matching full fine-tuning performance.

LoRA (Low-Rank Adaptation)

LoRA decomposes weight updates into low-rank matrices. Instead of updating a weight matrix $W$ directly, you learn two smaller matrices $A$ and $B$ such that the effective update is $BA$.

For a weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA adds:

$$W' = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.

The rank $r$ is typically 4-64, far smaller than $d$ or $k$ (which might be thousands). This reduces trainable parameters from $d \times k$ to $r \times (d + k)$.
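
For a concrete sense of scale (the dimensions here are illustrative): with $d = k = 4096$ and $r = 8$,

$$
d \times k = 4096 \times 4096 \approx 16.8\text{M} \quad \text{vs.} \quad r \times (d + k) = 8 \times 8192 = 65{,}536,
$$

roughly a 256x reduction in trainable parameters per adapted matrix.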

Implementation details: LoRA is typically applied to attention projection matrices (Q, K, V, and output projections). The original weights remain frozen; only $A$ and $B$ are trained. At inference, you can merge $BA$ into $W$ for zero additional latency.
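
A minimal sketch of a LoRA-wrapped linear layer in PyTorch; the $\alpha / r$ scaling follows the convention of the original LoRA paper, and the zero initialization of $B$ means the layer starts out identical to the pretrained one.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # W'x = Wx + scaling * (B A) x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```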

QLoRA extends this by quantizing the base model to 4-bit precision, enabling fine-tuning of 65B+ models on a single consumer GPU. It uses a novel NF4 (Normal Float 4-bit) data type and double quantization to minimize accuracy loss.
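
As a rough sketch, this is how such a setup is commonly configured with the Hugging Face transformers / peft / bitsandbytes stack; "model-name" is a placeholder, the target modules assume a Llama-style architecture, and exact argument names may vary across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 data type
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("model-name", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
```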

Adapters

Adapters insert small bottleneck modules between transformer layers. A typical adapter consists of:

  1. Down-projection: $\mathbb{R}^d \rightarrow \mathbb{R}^r$ where $r \ll d$
  2. Nonlinearity (ReLU or GELU)
  3. Up-projection: $\mathbb{R}^r \rightarrow \mathbb{R}^d$
  4. Residual connection

Only the adapter parameters are trained while the original model stays frozen. This adds slight inference latency because the adapter forward pass cannot be folded into the existing weights.
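
A minimal sketch of such a bottleneck module in PyTorch (the bottleneck width and initialization choices here are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, d_model: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, r)     # R^d -> R^r
        self.act = nn.GELU()
        self.up = nn.Linear(r, d_model)       # R^r -> R^d
        nn.init.zeros_(self.up.weight)        # start near the identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))   # residual connection
```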

Prefix Tuning and Prompt Tuning

These methods prepend learnable continuous vectors to the input or hidden states.

Prompt tuning learns a sequence of soft tokens $[P_1, P_2, \ldots, P_n]$ prepended to the input embedding. Only these vectors ($n \times d_{\text{embed}}$ parameters) are trained.

Prefix tuning is more expressive—it prepends learnable key-value pairs to every attention layer, allowing the model to attend to task-specific "virtual tokens" throughout its depth.

Both methods are extremely parameter-efficient but can underperform other PEFT methods on complex tasks.
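
A sketch of the prompt tuning variant in PyTorch, prepending $n$ learnable vectors to the input embeddings while the base model stays frozen (shapes and initialization are illustrative):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt tuning: n trainable embeddings prepended to the input embeddings."""

    def __init__(self, n_tokens: int, d_embed: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_embed) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, d_embed) -> (batch, n_tokens + seq_len, d_embed)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```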


Instruction Fine-Tuning

This approach trains models to follow natural language instructions by creating datasets of (instruction, input, output) tuples.

Dataset format example:

Instruction: Summarize the following text in one sentence.
Input: [paragraph about climate change]
Output: Climate change poses significant risks to global ecosystems...

The key insight is that diverse instruction-following data generalizes surprisingly well—models trained on a few thousand high-quality examples can follow novel instructions they've never seen.
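
In practice, each tuple is rendered into a single training string with a fixed template. The Alpaca-style headers below are one common convention rather than a standard; what matters is using the same template at training and inference.

```python
def format_example(instruction: str, input_text: str, output: str) -> str:
    """Render an (instruction, input, output) tuple into a training string."""
    if input_text:
        prompt = (f"### Instruction:\n{instruction}\n\n"
                  f"### Input:\n{input_text}\n\n"
                  f"### Response:\n")
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt + output
```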

Self-Instruct and Evol-Instruct (used in WizardLM) are techniques for automatically generating and evolving instruction datasets using the model itself, reducing manual annotation costs.


Alignment Fine-Tuning

These techniques align model behavior with human preferences, going beyond simple instruction following.

RLHF (Reinforcement Learning from Human Feedback)

The classic three-stage pipeline:

  1. Supervised fine-tuning (SFT): Train on demonstration data of desired behavior
  2. Reward model training: Train a model to predict human preferences between response pairs
  3. RL optimization: Use PPO (Proximal Policy Optimization) to maximize the reward model's score while staying close to the SFT policy via KL penalty

The objective balances reward maximization with distribution shift prevention:

$$
J(\pi) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\left[R(x, y)\right] - \beta \cdot D_{\text{KL}}\left(\pi \,\|\, \pi_{\text{ref}}\right)
$$

RLHF is powerful but complex—it requires maintaining multiple models simultaneously and is sensitive to reward hacking.
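
In common implementations, the per-sample reward optimized during the PPO stage combines the reward model's score with the KL term. A minimal sketch under that assumption (the beta value is illustrative):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Reward signal for PPO: reward model score minus a KL penalty.

    rm_score: scalar from the reward model for a sampled response;
    logp_policy / logp_ref: summed log-probs of that response under the
    current policy and the frozen SFT reference model.
    """
    kl_estimate = logp_policy - logp_ref   # per-sample estimate of KL(pi || pi_ref)
    return rm_score - beta * kl_estimate
```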

DPO (Direct Preference Optimization)

DPO sidesteps reward modeling entirely by deriving a closed-form solution to the RLHF objective. Given preference pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$
\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \left[\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right]\right)
$$

This directly increases the likelihood of preferred responses relative to dispreferred ones, scaled by how much each deviates from the reference policy. DPO is simpler to implement, more stable, and often matches RLHF performance.
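
A minimal sketch of the loss in PyTorch, assuming sequence-level log-probabilities have already been computed for each preference pair under both the policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities (tensors of shape (batch,))."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and dispreferred log-ratios
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```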

ORPO, KTO, and Variants

ORPO (Odds Ratio Preference Optimization) combines SFT and preference alignment into a single stage by adding an odds ratio penalty term to the standard cross-entropy loss:

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}
$$

KTO (Kahneman-Tversky Optimization) works with unpaired preference data—you only need to know whether individual responses are good or bad, not which of two responses is better. This dramatically simplifies data collection. The loss function is inspired by prospect theory:

$$
\mathcal{L}_{\text{KTO}} = \mathbb{E}\left[\lambda_y \cdot \sigma\left(\beta \cdot \text{sign}(y) \cdot \left[\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} - z_{\text{ref}}\right]\right)\right]
$$

where $\text{sign}(y) = 1$ for desirable outputs and $-1$ for undesirable ones.


GRPO (Group Relative Policy Optimization)

GRPO, introduced by DeepSeek, eliminates the need for a separate critic/value model by using group-based advantage estimation. Instead of training a value function to estimate baselines, GRPO samples multiple outputs for each prompt and computes advantages relative to the group.

How it works: For each prompt $x$, sample a group of $G$ outputs $\{y_1, y_2, \ldots, y_G\}$ from the current policy. Compute rewards $\{r_1, r_2, \ldots, r_G\}$ for each output, then normalize within the group:

$$
\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}
$$

The policy gradient objective becomes:

$$
\mathcal{L}_{\text{GRPO}} = -\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i, \; \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\hat{A}_i\right) + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})
$$

where $\rho_i = \frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}$ is the importance sampling ratio.

Key advantages:

  • No critic model: Reduces memory footprint significantly—you don't need to load and train a separate value network
  • Simpler implementation: Removes the complexity of GAE (Generalized Advantage Estimation) and value function fitting
  • Effective for reasoning: Particularly well-suited for math and code tasks where reward signals (correctness) are clear and group comparisons are meaningful

Practical considerations: GRPO requires generating multiple completions per prompt during training, which increases compute per step but often converges faster in terms of wall-clock time due to better gradient signal. Group sizes of 4-16 are typical. The method works best when you have a reliable reward signal (verifiable tasks like math, code, or factual QA).
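
A sketch of the two core pieces, group-relative advantage computation and the clipped surrogate loss, in PyTorch (the KL penalty against the reference policy is omitted for brevity; tensor shapes are illustrative):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, G) -- G sampled completions per prompt.
    Returns advantages normalized within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate over a flat batch of sampled completions.

    logp_new / logp_old: summed log-probs of each completion under the current
    and sampling policies; advantages: matching group-relative advantages.
    """
    ratio = torch.exp(logp_new - logp_old)                       # rho_i
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```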


Continued Pretraining vs. Fine-Tuning

Worth distinguishing: continued pretraining extends the base model's knowledge by training on domain-specific corpora (medical literature, legal documents, code) with the standard language modeling objective:

$$
\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t}; \theta)
$$

This happens before instruction fine-tuning and uses much larger datasets (billions of tokens vs. thousands of examples).

A typical modern pipeline:

  1. Base pretraining (web-scale data)
  2. Continued pretraining (domain data)
  3. Instruction fine-tuning (task demonstrations)
  4. Alignment (RLHF/DPO on preferences)

Practical Considerations

Learning rate: Fine-tuning uses much smaller learning rates than pretraining—typically $1 \times 10^{-5}$ to $5 \times 10^{-5}$ for full fine-tuning, potentially higher ($1 \times 10^{-4}$ to $3 \times 10^{-4}$) for PEFT methods where you're training fewer parameters.

Dataset size: Quality matters more than quantity. A few thousand high-quality examples often outperform millions of noisy ones for instruction tuning. For domain adaptation via continued pretraining, you need substantially more data.

Evaluation: Track both task-specific metrics and general capability benchmarks. Improvements on your target task that come with significant capability regression may indicate overfitting.

Mixing data: Many practitioners mix fine-tuning data with some pretraining data (or instruction data from other tasks) to prevent catastrophic forgetting. A common ratio is 5-10% replay data.

Hyperparameter sensitivity: LoRA rank $r$, which layers to adapt, learning rate schedules, and batch size all interact in non-obvious ways. Systematic sweeps are valuable when compute permits.


Summary Table

Method             Trainable Params   Memory      Complexity   Best For
Full Fine-Tuning   100%               Very High   Low          Maximum performance, sufficient compute
LoRA               ~0.1-1%            Low         Low          General-purpose PEFT
QLoRA              ~0.1-1%            Very Low    Medium       Large models on consumer hardware
Adapters           ~1-5%              Low         Low          Multi-task scenarios
Prompt Tuning      <0.1%              Very Low    Low          Simple task adaptation
DPO                Varies             Medium      Low          Alignment without RL complexity
RLHF               Varies             High        High         Maximum alignment control

This covers the major techniques, though the field moves quickly. Recent work explores mixture-of-experts fine-tuning, multi-task approaches like FLAN, and methods for merging multiple fine-tuned models (TIES, DARE). The practical choice depends heavily on your compute budget, dataset characteristics, and whether you need to preserve general capabilities alongside task specialization.