<aside>
CLIP (Contrastive Language-Image Pretraining) maps visual and textual modalities into a shared latent space, enabling generalized open-vocabulary image recognition without task-specific fine-tuning (Radford et al., 2021, arXiv:2103.00020). It shifts the paradigm away from predicting a fixed set of categorical labels toward a proxy task: predicting which text snippet correctly pairs with an image across a massive, noisy dataset of web-scraped pairs.

</aside>
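The open-vocabulary workflow this enables can be sketched with a public CLIP checkpoint. As an illustrative (not prescribed) setup, the snippet below assumes the HuggingFace `transformers` implementation, a local placeholder image `cat.jpg`, and an arbitrary set of candidate prompts:

```python
# Zero-shot classification sketch: any text snippets can serve as the "label set".
# Assumes: pip install torch transformers pillow; "cat.jpg" is a placeholder image path.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a truck"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image   # (1, num_prompts): scaled image-text similarities
probs = logits.softmax(dim=-1)              # the highest-probability prompt is the prediction
print(dict(zip(prompts, probs[0].tolist())))
```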
⚠️ Limitations & Caveats
$$
\begin{aligned}
& \text{[Cosine Similarity]: } \text{sim}(I_i, T_j) = \frac{I_i \cdot T_j}{\|I_i\|_2 \, \|T_j\|_2} \\
& \text{[Image-to-Text Loss]: } \mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)} \\
& \text{[Text-to-Image Loss]: } \mathcal{L}_{T \to I} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(T_i, I_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(T_j, I_i) / \tau)} \\
& \text{[Total InfoNCE Loss]: } \mathcal{L}_{CLIP} = \frac{\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}}{2} \\
& \text{[SigLIP Objective]: } \mathcal{L}_{SigLIP} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\!\left(z_{ij} \left( \text{sim}(I_i, T_j) / \tau + b \right)\right), \quad z_{ij} = \begin{cases} 1 & i=j \\ -1 & i \neq j \end{cases}
\end{aligned}
$$
The InfoNCE contrastive loss relies heavily on "in-batch negatives" to form its decision boundaries. Because the model learns by contrasting the correct pair against all incorrect pairs in the batch, a small batch size fails to provide enough "hard negatives"—examples that are subtly similar and force the model to learn fine-grained distinctions. To achieve state-of-the-art performance, the original CLIP model required an exceptionally large batch size of 32,768, necessitating complex infrastructure like gradient caching and multi-node synchronization.
The temperature parameter (τ) rescales the cosine similarities before they are passed through the softmax. Since cosine similarity is strictly bounded between −1 and 1, the raw logits lack the dynamic range to produce sharp probability distributions. Dividing by a small, learnable τ (which in the original CLIP implementation is clamped so that the logit scale never exceeds 100, i.e., τ ≈ 0.01) amplifies the differences between logits, so the hardest in-batch negatives dominate the gradient.
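A minimal PyTorch sketch of the symmetric InfoNCE objective above, showing how the in-batch negatives and the learnable temperature enter the computation; the batch size, embedding width, and the τ initialization of 0.07 are illustrative choices, not requirements:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, log_tau: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over one batch. image_emb, text_emb: (N, D); log_tau: learnable scalar."""
    image_emb = F.normalize(image_emb, dim=-1)        # unit norm -> dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / log_tau.exp()  # (N, N) similarities scaled by 1/tau
    targets = torch.arange(logits.size(0), device=logits.device)  # correct pair is on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # each image vs. all N texts (in-batch negatives)
    loss_t2i = F.cross_entropy(logits.t(), targets)    # each text vs. all N images
    return (loss_i2t + loss_t2i) / 2

# Toy usage: 8 random pairs, tau initialized near 0.07 (a common starting point).
img, txt = torch.randn(8, 512), torch.randn(8, 512)
log_tau = torch.tensor(0.07).log().requires_grad_()
print(clip_loss(img, txt, log_tau))
```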
⚠️ Limitations & Caveats
Standard CLIP uses a softmax-based loss function, which normalizes the similarity scores across the entire batch. This global normalization creates a hard dependency: every image and text embedding must be gathered (e.g., via an all-gather across GPUs) and compared against every other embedding before the loss and its gradients can be computed.
SigLIP (Sigmoid Loss for Language Image Pre-Training) solves this by replacing the softmax with a simple, pairwise sigmoid classification loss (Zhai et al., 2023, arXiv:2303.15343). It treats every possible image-text pairing in the N×N grid as an independent binary classification task—predicting 1 for matching pairs and 0 for non-matching ones. This decouples the loss from the global batch dimension, allowing chunked processing and stable training at much smaller batch sizes without sacrificing zero-shot accuracy.
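The corresponding pairwise sigmoid objective fits in a few lines. The initial values for τ and the bias b below follow the initialization reported in the SigLIP paper (logit scale 10, b = −10), while the shapes are again illustrative:

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, log_tau, bias):
    """Pairwise sigmoid loss: every (i, j) cell is an independent binary classification."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / log_tau.exp() + bias            # (N, N)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1    # +1 on diagonal, -1 elsewhere
    # -log sigmoid(z_ij * logit_ij): no softmax, hence no global normalization over the batch.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# Because each term touches only one (image, text) pair, the N x N grid can be
# computed in chunks (e.g., per device) without gathering all embeddings first.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
log_tau = torch.tensor(0.1).log().requires_grad_()      # tau = 0.1, i.e., logit scale of 10
bias = torch.tensor(-10.0, requires_grad=True)
print(siglip_loss(img, txt, log_tau, bias))
```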
| Dimension | Standard CLIP (InfoNCE / Softmax) | SigLIP (Sigmoid Loss) |
|---|---|---|
| Computational Cost (Memory) | High (Requires gathering all embeddings globally) | Low (Pairwise operations can be heavily chunked) |
| Batch Size Dependency | Extreme (Performance degrades < 16k) | Minimal (Stable even at small batch sizes) |
| Data Requirements | Standard noisy image-text pairs | Standard noisy image-text pairs |
| Known Failure Modes | OOM errors during distributed training | Can underperform in dense retrieval tasks |
<aside>
Quantization maps high-precision floating-point parameters (e.g., FP32 or FP16) to lower-precision representations (e.g., INT8 or INT4). In the context of Large Language Models (LLMs), this primarily targets the memory-bandwidth bottleneck during autoregressive decoding, where inference is often constrained by the rate at which weights and KV-cache data can be fetched from HBM rather than by raw compute throughput.
By compressing model weights, quantization reduces the number of bytes transferred per memory access, thereby lowering memory traffic per token generation step.
The process provides several practical advantages for inference deployment (a worked arithmetic sketch follows the list):
Memory Bandwidth Reduction:
Lower-bit representations reduce memory transfer volume approximately proportional to bit-width (e.g., INT4 vs FP16 yields ~4× reduction in weight bandwidth). This can improve tokens-per-second in memory-bound decoding regimes by reducing HBM pressure, rather than simply “preventing compute idling,” which is workload-dependent.
VRAM Efficiency:
Reducing precision decreases model storage footprint. For example, a 70B parameter model requires ~140GB in FP16. Under ideal INT4 weight-only quantization, this can be reduced to ~35GB. However, end-to-end memory usage is higher in practice due to KV cache (typically FP16/BF16), activation buffers, and runtime metadata.
Compute Acceleration:
Many modern accelerators (e.g., Tensor Cores) support lower-precision matrix multiplication (INT8/FP16/BF16). While lower precision can increase theoretical throughput, real-world speedups depend on kernel fusion, memory movement overhead, and whether computation or memory is the dominant bottleneck.
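To make the bandwidth and VRAM arithmetic above concrete, here is a rough back-of-the-envelope sketch. The parameter count, the ~2 TB/s bandwidth figure, and the bit widths are illustrative assumptions; it deliberately ignores KV cache, activations, and compute overlap:

```python
# Rough, weight-only estimate of decode-time limits for a hypothetical 70B-parameter model.
# All numbers below (bandwidth, bit widths) are illustrative assumptions, not measurements.
PARAMS = 70e9
HBM_BANDWIDTH_GBPS = 2000        # e.g., an accelerator with roughly 2 TB/s of HBM bandwidth

def weight_bytes(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_bytes(bits) / 1e9
    # Upper bound: every weight must be streamed from HBM once per generated token,
    # so tokens/sec <= bandwidth / bytes-per-token for memory-bound decoding.
    max_tps = HBM_BANDWIDTH_GBPS / gb
    print(f"{name}: ~{gb:.0f} GB of weights, <= ~{max_tps:.1f} tokens/s per weight stream")
```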
⚠️ Limitations & Caveats
Information loss is unavoidable:
Quantization introduces rounding error due to finite representational capacity. This error can accumulate across layers, potentially affecting calibration-sensitive behaviors such as long-horizon reasoning and instruction-following robustness. However, degradation is highly method-dependent (e.g., naive PTQ vs GPTQ/AWQ differ substantially).
Dequantization and mixed-precision overhead:
Even in INT8/INT4 kernels, many operations (especially nonlinearities and accumulations) are executed in higher precision (FP16/FP32). This introduces partial dequantization overhead and reduces the idealized “pure integer pipeline” assumption.
$$
\begin{aligned}
& \text{[Affine Mapping]: } X_{int} = \text{clip}\!\left(\text{round}\!\left(\frac{X_{float}}{s}\right) + z, \, q_{min}, \, q_{max}\right) \\
& \text{[Dequantization]: } \tilde{X}_{float} = s \cdot (X_{int} - z) \\
& \text{[Symmetric Scaling Factor]: } s = \frac{\max(|X_{float}|)}{2^{b-1} - 1}, \quad z = 0 \\
& \text{[Quantization Error]: } \mathcal{E} = \|X_{float} - \tilde{X}_{float}\|_2^2 \\
& \text{[Straight-Through Estimator]: } \frac{\partial \mathcal{L}}{\partial X_{float}} \approx \frac{\partial \mathcal{L}}{\partial X_{int}} \\
& \text{[SmoothQuant Migration]: } \tilde{W}_{ij} = W_{ij} \cdot \alpha_j, \quad \tilde{X}_{ij} = X_{ij} / \alpha_j
\end{aligned}
$$
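The affine mapping, symmetric scale, dequantization, and error terms above translate almost line-for-line into tensor code; this is a minimal per-tensor sketch with an illustrative 8-bit width:

```python
import torch

def affine_quantize(x: torch.Tensor, s: float, z: int, qmin: int, qmax: int) -> torch.Tensor:
    """X_int = clip(round(X_float / s) + z, q_min, q_max)."""
    return torch.clamp(torch.round(x / s) + z, qmin, qmax)

def dequantize(x_int: torch.Tensor, s: float, z: int) -> torch.Tensor:
    """X~_float = s * (X_int - z)."""
    return s * (x_int - z)

def symmetric_scale(x: torch.Tensor, bits: int = 8) -> float:
    """s = max(|X|) / (2^(b-1) - 1), with zero-point z = 0."""
    return x.abs().max().item() / (2 ** (bits - 1) - 1)

x = torch.randn(4, 4)
s = symmetric_scale(x, bits=8)
x_int = affine_quantize(x, s, z=0, qmin=-127, qmax=127)
x_hat = dequantize(x_int, s, z=0)
error = (x - x_hat).pow(2).sum()        # the squared quantization error E from the equations
print(s, error.item())
```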
Post-Training Quantization (PTQ) applies quantization after full-precision training is completed. It typically relies on a calibration dataset to estimate activation statistics (e.g., min/max, or histogram-based estimators) in order to determine scaling factors s and zero-points z. Importantly, PTQ does not modify model weights via gradient-based optimization.
Modern PTQ methods for LLMs (e.g., GPTQ, AWQ) go beyond naive min–max scaling by incorporating second-order or activation-aware approximations, significantly improving low-bit robustness compared to earlier PTQ approaches.
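A naive min-max calibration pass in the PTQ style described here can be sketched with forward hooks; `model`, `calibration_loader`, and `layer_name` are hypothetical stand-ins, and methods like GPTQ or AWQ would replace this simple statistic with their own weight-aware procedures:

```python
import torch

@torch.no_grad()
def calibrate_activation_range(model, calibration_loader, layer_name: str, bits: int = 8):
    """Track min/max of one layer's output over a calibration set (naive PTQ statistics)."""
    stats = {"min": float("inf"), "max": float("-inf")}

    def hook(_module, _inputs, output):
        stats["min"] = min(stats["min"], output.min().item())
        stats["max"] = max(stats["max"], output.max().item())

    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    for batch in calibration_loader:          # a few hundred samples is typical
        model(batch)
    handle.remove()

    # Asymmetric (affine) parameters for an unsigned b-bit range [0, 2^b - 1].
    qmax = 2 ** bits - 1
    s = (stats["max"] - stats["min"]) / qmax
    z = round(-stats["min"] / s)
    return s, z
```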
Quantization-Aware Training (QAT) incorporates quantization effects during training. Since rounding operations are non-differentiable, QAT typically employs a Straight-Through Estimator (STE) in the backward pass to approximate gradients through discrete operations. This allows the optimizer to adapt weights to compensate for quantization-induced error.
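A minimal fake-quantization sketch of the STE idea: the forward pass sees the rounded values, while autograd treats the round-and-clip as an identity function (the symmetric 8-bit scheme and tensor shapes are illustrative):

```python
import torch

def fake_quantize_ste(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass; pass gradients straight through."""
    qmax = 2 ** (bits - 1) - 1
    s = x.detach().abs().max() / qmax                    # symmetric per-tensor scale
    x_q = torch.clamp(torch.round(x / s), -qmax, qmax) * s
    # STE trick: the forward value is x_q, but autograd only sees the identity term `x`,
    # so d(loss)/dx is computed as if no rounding had happened.
    return x + (x_q - x).detach()

w = torch.randn(16, 16, requires_grad=True)
loss = fake_quantize_ste(w).pow(2).sum()
loss.backward()                                          # gradients reach w despite the rounding
print(w.grad.abs().mean())
```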
⚠️ Limitations & Caveats
Weight-only quantization (e.g., W4A16) reduces memory footprint but does not fully unlock compute acceleration: integer matrix units can only be used when both weights and activations are quantized (e.g., W8A8). However, LLM activations exhibit heavy-tailed distributions in which a small subset of channels dominates the magnitude (activation outliers), making uniform quantization of activations lossy and inefficient.
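The SmoothQuant migration listed in the equation block earlier targets exactly this outlier problem by moving part of the dynamic range from activations into the weights. A minimal per-channel sketch with synthetic tensors (the migration strength of 0.5 is the commonly cited default):

```python
import torch

def smooth_scales(x: torch.Tensor, w: torch.Tensor, migration_strength: float = 0.5) -> torch.Tensor:
    """Per-input-channel factors alpha_j balancing activation vs. weight ranges (SmoothQuant-style)."""
    act_range = x.abs().amax(dim=0)        # max |X_:,j| over tokens, one value per input channel
    wgt_range = w.abs().amax(dim=1)        # max |W_j,:| over output features
    return (act_range ** migration_strength) / (wgt_range ** (1 - migration_strength)).clamp(min=1e-5)

# Synthetic activations with a few outlier channels, as described above.
x = torch.randn(128, 64)
x[:, :4] *= 50                             # channels 0-3 dominate the dynamic range
w = torch.randn(64, 256)

alpha = smooth_scales(x, w)
x_smooth = x / alpha                       # X~_:,j = X_:,j / alpha_j -> activation outliers flattened
w_smooth = w * alpha.unsqueeze(1)          # W~_j,: = alpha_j * W_j,: -> X~ @ W~ equals X @ W
print(x.abs().max().item(), x_smooth.abs().max().item())
```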
</aside>