[{"content":"Overview #This series provides a complete, from-first-principles guide to Bayesian Linear Regression with Automatic Relevance Determination — a powerful technique for sensor calibration that combines rigorous uncertainty quantification, automatic feature selection, and a closed-form solution efficient enough for embedded systems.\nWhether you\u0026rsquo;re calibrating a Hall effect sensor, building a measurement device, or working on any system where you need both accurate predictions and quantified confidence, BLR+ARD offers a principled alternative to black-box machine learning.\nThe Series #Part 1: Mathematical Foundations # When Your Sensor Knows What It Doesn\u0026rsquo;t Know: The Math Behind Bayesian Linear Regression and Automatic Relevance Determination\nStart here to understand:\nWhy Bayesian statistics are the right tool for calibration (compared to least squares, neural networks, etc.) The Gaussian miracle: why Bayesian posteriors have closed-form solutions Linear regression through the Bayesian lens: deriving the posterior covariance and mean How predictions become probability distributions with both aleatoric and epistemic uncertainty The role of hyperparameters $\\alpha_j$ and $\\beta$, and how ARD performs automatic feature selection Duration: ~20 minutes | Prerequisites: Linear algebra, basic probability\nPart 2: Production Implementation # From Math to Silicon: Implementing BLR+ARD with Rust and faer\nOnce you understand the math, discover how to implement it efficiently:\nThe critical principle: never invert a matrix if you can solve a linear system instead Cholesky decomposition: the SPD factorization that underpins all numerical stability Computing Mahalanobis distances without matrix inversion Log-determinants via Cholesky diagonals (avoiding overflow/underflow) Two algebraically identical posterior update forms: observation-space vs. 
parameter-space The Woodbury matrix identity: exact transformation between spaces Code patterns from the production blr-core crate in Rust using the faer numerical library Duration: ~25 minutes | Prerequisites: Part 1, familiarity with Rust helpful but not required\nKey Concepts # Concept What It Does Why It Matters Posterior Distribution Encodes both point estimate and uncertainty You know not just the prediction, but how confident you should be Automatic Relevance Determination Each feature gets its own regularization strength $\\alpha_j$ Irrelevant features are suppressed without manual feature selection Cholesky Decomposition $A = LL^T$ for symmetric positive-definite matrices Enables solving systems instead of inverting, reducing numerical error by a factor of condition number Epistemic vs. Aleatoric Uncertainty Model uncertainty vs. measurement noise Tells you where you need more data (high epistemic) vs. where noise dominates (high aleatoric) Hyperparameter Learning EM algorithm for $\\alpha_j$ and $\\beta$ Automatically tune regularization without cross-validation loops Use Cases #BLR+ARD shines when you have:\nLimited data — Hall sensor calibration with tens of observations, not millions Interpretability requirements — Each learned feature has a clear physical meaning Uncertainty budgets — You need to know not just predictions, but confidence bands Resource constraints — Embedded systems, WASM, or real-time inference loops Active learning — You can request measurements where uncertainty is highest It\u0026rsquo;s less suitable for:\nMassive unstructured data (ImageNet scale) Problems where black-box optimization is acceptable Domains where you have no domain knowledge for feature engineering Quick Start # Just want the math? → Read Part 1 Want to implement it? → Read both parts, then check the blr-core crate on GitHub Want to use it as a library? → Look for the published Rust crate (coming soon to crates.io) Want it in WASM? 
→ Part 2 discusses embedding in WebAssembly components What\u0026rsquo;s Next? #Upcoming articles in this space will cover:\nActive Learning Strategies — Automatically selecting the next measurement point to maximize information gain WASM Deployment — Packaging BLR+ARD as a WebAssembly component for browser and edge deployment Hyperparameter Sensitivity — Understanding when ARD selection is stable and when to use stronger priors Multi-Sensor Fusion — Combining multiple sensors with heterogeneous noise models Real-Time Calibration Loops — Sequential Bayesian updating during sensor operation References \u0026amp; Further Reading # Textbook: Pattern Recognition and Machine Learning by Christopher Bishop (Chapter 3: Linear Models for Regression) Video: Philipp Hennig\u0026rsquo;s Probabilistic ML course (especially lectures 3–6 on Gaussian inference) Original ARD paper: MacKay, D. J. (1994). \u0026ldquo;Bayesian nonlinear modeling for the prediction competition.\u0026rdquo; ASHRAE Transactions Matrix inversion lemma: Woodbury matrix identity — the algebraic key to efficient Bayesian updates Ready to dive in? Start with Part 1: When Your Sensor Knows What It Doesn\u0026rsquo;t Know →\n","date":"15 May 2026","permalink":"https://wamli.github.io/blog/blr-sensor-calibration-series/","section":"Blog","summary":"A comprehensive two-part series on building principled, uncertainty-quantified sensor calibration systems using Bayesian Linear Regression with Automatic Relevance Determination. Theory, implementation, and real-world patterns.","title":"Bayesian Linear Regression for Sensor Calibration"},{"content":"Bayesian Linear Regression (BLR) with Automatic Relevance Determination (BLR+ARD) is one of those rare techniques that feels almost too good: it fits your data, quantifies its own uncertainty, and automatically discards the features that don\u0026rsquo;t matter — all from a single, coherent mathematical framework. 
This post explains how it works and, more importantly, why it has to work that way.\n1. Why Bayesian Statistics for Sensor Calibration? #The Problem: Calibration Is More Than Curve Fitting #When you calibrate a Hall effect sensor — the kind of magnetic sensor that tells a motor controller where the rotor is — you\u0026rsquo;re solving what looks like a simple problem: you know the sensor\u0026rsquo;s output voltage at a set of known magnetic field strengths. You want a function that maps voltage to field strength (or vice versa). Curve fitting, right?\nNot quite. The naive approach — fit a polynomial with least squares, pick the one with the best cross-validation score, call it done — hides a fundamental question: how confident are you? When the sensor operates in the middle of its calibrated range, fine. But what about at the edges? What about when a new sensor unit has slightly different manufacturing tolerances? What about when the next data point comes in: should you trust it, or were the last ten measurements flaky?\nLeast squares gives you a function. Bayesian inference gives you a belief, expressed as a probability distribution over functions. That belief encodes not just \u0026ldquo;here\u0026rsquo;s my best fit\u0026rdquo; but \u0026ldquo;here\u0026rsquo;s how certain I am, and here\u0026rsquo;s how my certainty changes across the input range.\u0026rdquo;\nWhy Not a Neural Network? #The question almost always comes up. Neural networks are powerful, flexible, and increasingly easy to deploy. Why bother with a specialized Bayesian technique? There are three reasons.\nFirst, data. A Hall sensor calibration session typically produces tens to hundreds of measurements, not millions. Neural networks are data-hungry. A six-feature BLR model with ARD can reach a confident, physically meaningful calibration with as few as ten data points.\nSecond, interpretability. 
When a BLR model tells you that the cubic polynomial feature is irrelevant for your Hall sensor, it is telling you something physically true: the sensor\u0026rsquo;s response really is dominated by a linear term plus a smooth saturation. The $\\alpha$ values — the ARD hyperparameters we\u0026rsquo;ll meet in Section 4 — are a direct window into the structure of the data. Neural network weights are not.\nThird, uncertainty. BLR produces a full posterior predictive distribution at every query point. You know not just the predicted output but how wide the uncertainty band is, and whether that band is wide because of model uncertainty (not enough data to pin down the weights) or measurement noise (the physical process is inherently stochastic). No post-hoc calibration technique, no conformal prediction wrapper — it falls out of the math automatically.\nWhat Bayesian Linear Regression Promises #To summarize the promise before we build the machinery:\nPrincipled uncertainty: predictions are distributions, not point estimates. Automatic feature selection: ARD hyperparameters \u0026ldquo;vote\u0026rdquo; on which basis functions matter. Irrelevant ones are suppressed to zero without manual selection or cross-validation. Interpretable results: every parameter has a physical or statistical meaning you can inspect. Efficiency: the algorithm converges in a small number of iterations, making it practical for embedded and real-time calibration loops. The cost is that you may want to understand the math before you trust the results. That\u0026rsquo;s what this post is for.\n2. 
Bayes\u0026rsquo; Theorem and the Gaussian Miracle #2.1 Bayes\u0026rsquo; Theorem: From Belief to Data-Driven Belief #The foundation is familiar:\n$$p(\\theta \\mid \\mathcal{D}) = \\frac{p(\\mathcal{D} \\mid \\theta) \\, p(\\theta)}{p(\\mathcal{D})} \\tag{2.1} \\label{eq:bayes_theorem}$$In words: the probability of parameters $\\theta$ given data $\\mathcal{D}$ is proportional to the likelihood of the data given those parameters, multiplied by our prior belief in those parameters.\n$p(\\theta)$ is the prior — what we believed before seeing data. $p(\\mathcal{D} \\mid \\theta)$ is the likelihood — how probable the data would be if the parameters were $\\theta$. $p(\\theta \\mid \\mathcal{D})$ is the posterior — what we believe after seeing data. The denominator $p(\\mathcal{D})$ is just a normalizing constant; it ensures the posterior integrates to 1.\nThe hard part of Bayesian inference, in general, is computing this posterior. For arbitrary distributions, the integral in the denominator is intractable. Sampling methods (e.g. MCMC) handle the general case, but they are slow. For the specific choice of Gaussian distributions, something remarkable happens: the posterior is also Gaussian, and the update has a closed-form solution in terms of matrix algebra.\nThis is the Gaussian miracle, and it is the engine that makes BLR+ARD computationally tractable.\n2.2 The Gaussian Miracle: Closure Under Bayes #A Gaussian distribution for a vector $x \\in \\mathbb{R}^d$ is written $\\mathcal{N}(x; \\mu, \\Sigma)$, where $\\mu$ is the mean vector and $\\Sigma$ is the covariance matrix. The key property is:\nIf the prior $p(x)$ is Gaussian and the likelihood $p(y \\mid x)$ is Gaussian in $x$, then the posterior $p(x \\mid y)$ is also Gaussian.\nThis is not a coincidence. The log of a Gaussian is a quadratic function of its argument. Adding two quadratic functions gives another quadratic function. Exponentiating a quadratic gives a Gaussian. 
Bayesian updating with Gaussians is, at its core, just completing the square — the same algebraic operation you learned in high school.\nProf. Philipp Hennig from University of Tübingen illustrates this beautifully in his lecture \u0026ldquo;Probabilistic ML - 03 - Gaussian Inference\u0026rdquo; | (53:50). His framework for Gaussian inference is:\n$$p(x) = \\mathcal{N}(x;\\, \\mu,\\, \\Sigma) \\tag{2.2} \\label{eq:p_x} $$$$p(y \\mid x) = \\mathcal{N}(y;\\, Ax + b,\\, \\Lambda^{-1}) \\tag{2.3} \\label{eq:p_y} $$where $A$ is a linear map from the latent $x$ to the observation $y$, $b$ is an offset, and $\\Lambda$ is the precision matrix (inverse covariance) of the observation noise.\nIn the lecture, equation \\eqref{eq:p_y} is actually\n$p(y \\mid x) = \\mathcal{N}(y;\\, Ax + b,\\, \\Lambda)$.\nNote that there is a difference in the inversion of $\\Lambda$. Result — the posterior is:\n$$p(x \\mid y) = \\mathcal{N}(x;\\, \\mu + K(y - A\\mu - b),\\, \\Sigma - K A \\Sigma) \\tag{2.4}$$where the Kalman gain $K$ is:\n$$K = \\Sigma A^T \\left(\\Lambda^{-1} + A \\Sigma A^T\\right)^{-1} \\tag{2.5}$$Let\u0026rsquo;s unpack the terminology, because each piece has a name for a reason:\nPrior mean $\\mu$: our best guess before seeing $y$. Prior covariance $\\Sigma$: the covariance before seeing $y$. Residual $(y - A\\mu - b)$: how far the actual observation is from what we predicted. This is the \u0026ldquo;surprise.\u0026rdquo; Gram matrix $(\\Lambda^{-1} + A \\Sigma A^T)$: the total covariance of the observation — noise covariance plus the uncertainty propagated from the prior through the linear map $A$. Gain $K$: how much we update the mean for each unit of surprise. High gain → observation dominates. Low gain → prior dominates. Posterior covariance $\\Sigma - K A \\Sigma$: strictly less than the prior covariance. We always become more certain after observing data, never less. This is exact, not approximate. No samples, no variational bound, no MCMC. 
Pure linear algebra.\n2.3 Why This Matters for Regression #The formula above is the general Gaussian inference update. For linear regression, we will specialize it: $x$ will be our weight vector $\\mathbf{w}$, $y$ will be our sensor measurements, $A$ will be the design matrix $\\Phi$, and $\\Lambda^{-1}$ will be our noise model. Section 3 builds this specialization explicitly.\nBefore moving on, note what the framework does not require: it does not require knowledge of the true weights. It does not require that we iterate until convergence. Given the prior and the likelihood, the posterior is determined in one shot. What will require iteration is learning the hyperparameters ($\\alpha$ and $\\beta$) — but that is a separate problem from computing the posterior.\n3. Linear Regression Through the Bayesian Lens #3.1 The Regression Model #We model sensor output as:\n$$y_i = \\mathbf{w}^T \\boldsymbol{\\phi}(x_i) + \\epsilon_i, \\quad \\epsilon_i \\sim \\mathcal{N}(0, \\beta^{-1})$$where:\n$x_i$ is the $i$-th input (e.g., magnetic field strength) $\\boldsymbol{\\phi}(x_i) \\in \\mathbb{R}^D$ is a vector of basis functions evaluated at $x_i$ $\\mathbf{w} \\in \\mathbb{R}^D$ are the weights we want to infer $\\epsilon_i$ is measurement noise with precision $\\beta$ (i.e., variance $\\sigma^2 = 1/\\beta$) Stack all $N$ observations into matrix form:\n$$\\mathbf{y} = \\Phi \\mathbf{w} + \\boldsymbol{\\epsilon}$$where:\n$\\mathbf{y} \\in \\mathbb{R}^N$ is the vector of $N$ observations $\\Phi \\in \\mathbb{R}^{N \\times D}$ is the design matrix — row $i$ is $\\boldsymbol{\\phi}(x_i)^T$ $\\boldsymbol{\\epsilon} \\sim \\mathcal{N}(\\mathbf{0}, \\beta^{-1} I)$ is the noise vector The design matrix is the bridge between raw inputs and the linear model. Each column of $\\Phi$ corresponds to one basis function applied to every data point. 
Bayesian inference learns which columns matter.\n3.2 Feature Engineering: Encoding Physical Hypotheses #The basis functions $\\boldsymbol{\\phi}(x)$ are not learned — they are designed by the engineer based on physical understanding. This is the feature engineering step, and it is where domain knowledge enters the picture.\nFor a Hall effect position sensor, the physics suggests:\n$$\\boldsymbol{\\phi}(x) = \\left[1,\\; x,\\; x^2,\\; x^3,\\; \\tanh(x/0.8),\\; \\tanh(x/1.5)\\right]^T$$Each component encodes a physical hypothesis:\nFeature Physical Hypothesis $1$ (bias) The sensor has a constant offset regardless of field (always true) $x$ The primary Hall response is linear in displacement $x^2$, $x^3$ Polynomial field non-uniformity at moderate displacements $\\tanh(x/0.8)$ Hard magnetic saturation with characteristic length 0.8 mm $\\tanh(x/1.5)$ Gradual saturation rolloff with characteristic length 1.5 mm The $\\tanh$ features capture the fact that at large displacements, the permanent magnet\u0026rsquo;s field begins to saturate the ferromagnetic core of the sensor — the response \u0026ldquo;clips\u0026rdquo; smoothly rather than increasing indefinitely. Two different width parameters hedge between a tight saturation profile and a gentler rolloff.\nThe crucial point: the model does not need to know which features are actually relevant. The ARD algorithm will determine that from data. If the sensor happens to have a genuinely linear response in the calibration range, then the $x^2$, $x^3$, and one of the $\\tanh$ features will be driven to zero. The engineer provides a vocabulary of physical hypotheses; the algorithm selects the right ones automatically.\nThis beats black-box learning in a specific way: every feature that survives ARD selection has a name. 
You can look at the final $\\alpha$ values and read off: \u0026ldquo;this sensor has a linear response with a gradual saturation rolloff and a 0.5 V offset.\u0026rdquo; That is interpretable calibration.\n3.3 The Prior on Weights #Before seeing any data, we express our belief about the weights as a Gaussian prior:\n$$p(\\mathbf{w}) = \\mathcal{N}\\!\\left(\\mathbf{0},\\; \\Lambda^{-1}\\right)$$where $\\Lambda = \\text{diag}(\\alpha_1, \\alpha_2, \\ldots, \\alpha_D)$ is a diagonal precision matrix — one regularization parameter $\\alpha_j$ per weight.\nWhy zero mean? Because before seeing data, we have no reason to believe any weight is large. We are agnostic about direction. The prior simply says: \u0026ldquo;I expect the weights to be small, but I\u0026rsquo;ll learn from data how small.\u0026rdquo;\nWhy diagonal? Because our prior belief about weight $w_j$ has nothing to do with weight $w_k$. Before seeing data, each feature is independent. Correlations between weights emerge from the data, not from our prior.\nWhy per-feature $\\alpha_j$ rather than a single shared $\\alpha$? This is the key ARD design choice. A single shared $\\alpha$ would regularize all features equally. But different features genuinely have different signal-to-noise properties. A bias feature is almost always relevant; a high-degree polynomial feature probably is not. With per-feature $\\alpha_j$, the algorithm can independently \u0026ldquo;turn the volume down\u0026rdquo; on each feature. This is Automatic Relevance Determination (ARD).\nInitially, we set all $\\alpha_j$ to the same small value (a weak prior). The EM algorithm described in Section 5 will learn the optimal $\\alpha_j$ values from data.\n3.4 The Posterior: Updating Beliefs with Data #Now we apply the Gaussian inference framework from Section 2 to our regression setting. 
Mapping the notation:\nHennig\u0026rsquo;s Framework BLR Setting $x$ (latent variable) $\\mathbf{w}$ (weights) $\\mu$ (prior mean) $\\mathbf{0}$ $\\Sigma$ (prior covariance) $\\Lambda^{-1} = \\text{diag}(\\alpha_j^{-1})$ $A$ (linear map) $\\Phi$ (design matrix; see note) $y$ (observations) $\\mathbf{y}$ (sensor measurements) $\\Lambda^{-1}$ (obs. noise covariance) $\\beta^{-1} I$ (homoscedastic noise) Be careful with conventions: In Prof. Hennig\u0026rsquo;s framework, the observation model is $y = Ax + b$. In regression, $\\mathbf{y} = \\Phi \\mathbf{w}$, so $A = \\Phi$ (not its transpose), where each row of $\\Phi$ is a feature vector. Note also that the symbol $\\Lambda$ is reused: in Section 2 it is the precision of the observation noise, whereas from Section 3.3 onward it is the prior precision of the weights. The algebra works out consistently. Substituting $\\mu = \\mathbf{0}$, $b = 0$, $A = \\Phi$, and the observation noise covariance $\\beta^{-1}I$ into the Kalman gain form from Section 2.2 gives an intermediate result that inverts an $N \\times N$ Gram matrix — one that grows with the number of observations:\n$$K = \\Lambda^{-1}\\Phi^T\\underbrace{\\left(\\beta^{-1}I_N + \\Phi\\Lambda^{-1}\\Phi^T\\right)^{-1}}_{\\text{Gram matrix}}$$This is mathematically correct, but it works in the wrong space. The question — \u0026ldquo;what do the 6 feature weights look like after seeing the data?\u0026rdquo; — lives in a 6-dimensional parameter space, yet the Gram matrix inversion is $N \\times N$, growing with every new calibration measurement you add. For $D = 6$ features and $N = 100$ measurements, we would be inverting a $100 \\times 100$ matrix to answer a question that lives in $\\mathbb{R}^6$.\nThe Woodbury matrix identity (also known as the matrix inversion lemma; see reference) provides an exact algebraic route from the $N \\times N$ Gram inversion to a $D \\times D$ precision matrix inversion — a switch from observation space to parameter space. The full derivation is in Appendix A. If you want to continue reading without stopping to verify the algebra, the key takeaway is: the transformation is exact, not an approximation. 
The result is:\n$$\\boxed{\\Sigma_\\text{post} = \\left(\\Lambda + \\beta\\, \\Phi^T \\Phi\\right)^{-1}}$$$$\\boxed{\\boldsymbol{\\mu}_\\text{post} = \\beta\\, \\Sigma_\\text{post}\\, \\Phi^T \\mathbf{y}}$$This is Bayesian Linear Regression in closed form. Let\u0026rsquo;s read these formulas carefully, because they encode everything.\nThe posterior covariance $\\Sigma_\\text{post}$ is the inverse of the sum of two matrices:\n$\\Lambda$: the prior precision — our regularization (depends only on our beliefs, not data) $\\beta\\, \\Phi^T \\Phi$: the data precision — how strongly the data constrains the weights (depends on the measured inputs and the noise precision) As $N \\to \\infty$, the data precision dominates and the posterior collapses to a delta function. With no data ($N = 0$), only the prior survives. The interpolation between prior and data is automatic and continuous.\nThe posterior mean $\\boldsymbol{\\mu}_\\text{post}$ is proportional to $\\Phi^T \\mathbf{y}$ (the \u0026ldquo;signal\u0026rdquo; in the data), scaled by both $\\beta$ (how much we trust individual measurements) and $\\Sigma_\\text{post}$ (which redistributes this signal according to the precision structure).\nCompare this to least squares: you minimize $\\mathcal{L}(\\mathbf{w}) = \\lVert \\mathbf{y} - \\Phi \\mathbf{w} \\rVert^2$. In least squares, the Hessian (second derivative) is $H_{\\text{LS}} = 2\\, \\Phi^T \\Phi$. The same $\\Phi^T \\Phi$ term, scaled by the noise precision $\\beta$, appears directly in our Bayesian posterior covariance formula:\n$$\\Sigma_\\text{post} = (\\Lambda + \\underbrace{\\beta\\, \\Phi^T \\Phi}_{\\text{least squares Hessian term}})^{-1}$$In the Bayesian form, $\\beta\\, \\Phi^T \\Phi$ is the data\u0026rsquo;s information matrix (the curvature of the negative log-likelihood), and $\\Lambda$ adds prior information. The posterior covariance is the inverse of the total information. 
Where least squares gives only a point estimate, Bayesian regression inverts this information matrix to quantify uncertainty through the covariance. The regularization term $\\Lambda$ guards against overfitting: the ridge regression solution is exactly the Bayesian posterior mean under a single shared $\\alpha$ (with ridge penalty $\\lambda = \\alpha / \\beta$), and ARD generalizes it by letting each feature have its own regularization strength.\n3.5 Predictions as Probability Distributions #Given the posterior $p(\\mathbf{w} \\mid \\Phi, \\mathbf{y}) = \\mathcal{N}(\\boldsymbol{\\mu}_\\text{post}, \\Sigma_\\text{post})$ and the likelihood at the new point $p(y_* \\mid \\mathbf{w}, x_*) = \\mathcal{N}(\\boldsymbol{\\phi}(x_*)^T \\mathbf{w}, \\beta^{-1})$, how do we predict at a new input $x_*$?\nWe want the posterior predictive distribution:\n$$p(y_* \\mid x_*, \\Phi, \\mathbf{y}) = \\int p(y_* \\mid \\mathbf{w}, x_*)\\, p(\\mathbf{w} \\mid \\Phi, \\mathbf{y})\\, d\\mathbf{w}$$Because everything is Gaussian, this integral is tractable. The result is:\n$$p(y_* \\mid x_*, \\Phi, \\mathbf{y}) = \\mathcal{N}(\\mu_*, \\sigma_*^2)$$where:\n$$\\mu_* = \\boldsymbol{\\phi}(x_*)^T \\boldsymbol{\\mu}_\\text{post}$$$$\\sigma_*^2 = \\underbrace{\\beta^{-1}}_{\\text{aleatoric}} + \\underbrace{\\boldsymbol{\\phi}(x_*)^T\\, \\Sigma_\\text{post}\\, \\boldsymbol{\\phi}(x_*)}_{\\text{epistemic}}$$The variance decomposes into two parts:\nAleatoric uncertainty ($\\beta^{-1}$): irreducible measurement noise. Even with infinite data, the sensor still has physical noise. This part never goes to zero. Epistemic uncertainty ($\\boldsymbol{\\phi}(x_*)^T \\Sigma_\\text{post} \\boldsymbol{\\phi}(x_*)$): uncertainty about the weights. This part goes to zero as $N \\to \\infty$ — more data pins down the weights. This is where BLR\u0026rsquo;s value becomes concrete. In regions where the training data is dense, epistemic uncertainty is small. In regions where data is sparse (e.g., the edges of the calibration range), it grows. 
A well-designed calibration system can use this signal to request additional measurements exactly where they are most needed. This is active learning, a natural extension of this framework.\n4. The Unknowns: Hyperparameters and Why They Matter #4.1 What We Know, What We Don\u0026rsquo;t #Let\u0026rsquo;s take stock. The BLR posterior formulas are clean and exact. But they depend on two quantities we haven\u0026rsquo;t specified:\nSymbol Role Known? $\\mathbf{y}$, $\\Phi$ Observations and design matrix ✓ Yes — we measured them $\\mathbf{w}$ Regression weights ✗ Inferred (via posterior) $\\alpha_j$ Per-feature prior precision (ARD) ✗ Must learn from data $\\beta$ Noise precision ✗ Must learn from data The weights $\\mathbf{w}$ are \u0026ldquo;first-level\u0026rdquo; unknowns — they are inferred by the posterior distribution, which is exact given $\\alpha$ and $\\beta$.\nThe hyperparameters $\\alpha_j$ and $\\beta$ are \u0026ldquo;second-level\u0026rdquo; unknowns — they govern the prior and noise model. Choosing them badly hurts the posterior. We need a principled way to learn them from the same data used to fit the model.\n4.2 $\\alpha_j$: The ARD Knobs #$\\alpha_j$ is the precision of the prior on weight $w_j$. Recall that the prior is $w_j \\sim \\mathcal{N}(0, \\alpha_j^{-1})$. So:\nLarge $\\alpha_j$: tight prior, weight strongly pulled toward zero. Feature $j$ is being told \u0026ldquo;you probably don\u0026rsquo;t matter.\u0026rdquo; Small $\\alpha_j$: loose prior, weight is free to be large. Feature $j$ is being told \u0026ldquo;do what the data says.\u0026rdquo; The ARD idea, due to MacKay (1992) and developed further by Tipping (2001), is to learn separate $\\alpha_j$ values for each feature dimension. During the optimization:\nIf the data provides strong evidence for feature $j$ (e.g., the $x$ term in a Hall sensor with linear response), $\\alpha_j$ stays small. The prior on $w_j$ stays broad, so the posterior for $w_j$ is driven by the data. 
If the data provides no evidence for feature $j$ (e.g., the $x^2$ term in a genuinely linear sensor), $\\alpha_j$ grows large — potentially toward infinity. The posterior for $w_j$ collapses to zero. The feature is automatically pruned. This is automatic feature selection without any explicit thresholding, cross-validation, or human decision. The physics is in the data; ARD reads it out.\nFor example, in a (fictive) Hall sensor calibration scenario with an assumed linear underlying curve, the following may be observed:\nα[bias] = 4.06 (relevant: captures a 0.5V offset) α[B-field] = 3.77 (relevant: the linear Hall response) α[B-field²] = 894,766 (SUPPRESSED: no quadratic signal) α[B-field³] = 1,844 (SUPPRESSED: no cubic signal) Ratio α[B-field²] / α[B-field] = 237,390× That ratio of a quarter-million-to-one is not a numerical glitch. It is the algorithm stating, unambiguously: \u0026ldquo;the linear term is essential, the quadratic term is noise.\u0026rdquo; That conclusion is physically correct, and it emerged without us telling the algorithm anything about Hall sensor physics.\n4.3 $\\beta$: The Noise Knob #$\\beta = 1/\\sigma^2$ is the precision of the measurement noise. Higher $\\beta$ means less noise — the algorithm trusts individual measurements more. Lower $\\beta$ means more noise — the algorithm is more skeptical of individual measurements and regularizes the fit.\nIn the posterior formulas, $\\beta$ multiplies $\\Phi^T \\Phi$ and $\\Phi^T \\mathbf{y}$. Increasing $\\beta$ is like having more data — it sharpens the posterior. Decreasing $\\beta$ broadens it.\nWhy learn $\\beta$ rather than fixing it? Manual noise estimation is error-prone. With $N = 10$ calibration points, it is easy to misestimate noise by a factor of two. The EM algorithm which will be discussed in Section 5 infers $\\beta$ from the residuals, automatically accounting for the fact that a portion of the apparent residuals is \u0026ldquo;explained\u0026rdquo; by the weights. 
Appendix C walks through the derivation in detail.\n4.4 The Central Problem: Evidence Maximization #We now know what we need: optimal values for $\\alpha_1, \\ldots, \\alpha_D$ and $\\beta$. How do we find them?\nThe Bayesian answer: maximize the marginal likelihood, also called the evidence:\n$$p(\\mathbf{y} \\mid \\alpha, \\beta) = \\int p(\\mathbf{y} \\mid \\mathbf{w}, \\beta)\\, p(\\mathbf{w} \\mid \\alpha)\\, d\\mathbf{w}$$This integral asks: \u0026ldquo;How probable is the observed data under our model, after having averaged out the weights?\u0026rdquo;. It is the probability of the data according to the prior-weighted ensemble of all possible weight vectors.\nWhy is this the right objective? Because it automatically trades off data fit against model complexity. A model with very large $\\alpha$ values (strong prior) has low complexity but may not fit the data well. A model with very small $\\alpha$ values (weak prior) fits training data well but overfits. The evidence is maximized at the sweet spot — the configuration of hyperparameters that makes the data as probable as possible without overfitting.\nThis is called Type-II Maximum Likelihood or Empirical Bayes. It is distinct from Type-I Maximum Likelihood (which would maximize $p(\\mathbf{y} \\mid \\mathbf{w})$ over $\\mathbf{w}$ directly, giving least squares). Type-II integrates out the weights and optimizes the hyperparameters. It is a more principled approach precisely because it avoids conditioning on any specific weight vector.\n5. The MacKay Algorithm: Finding the Optimal Hyperparameters #5.1 The EM Loop #To maximize the evidence with respect to $\\alpha$ and $\\beta$, we use the Expectation-Maximization (EM) algorithm:\nE-step (Expectation): Fix the current hyperparameters $\\alpha^{(t)}, \\beta^{(t)}$. Compute the posterior distribution $p(\\mathbf{w} \\mid \\Phi, \\mathbf{y}, \\alpha^{(t)}, \\beta^{(t)})$ — exactly, using the BLR formulas from Section 3.4. 
This gives us $\\boldsymbol{\\mu}^{(t)}$ and $\\boldsymbol{\\Sigma}^{(t)}$.\nM-step (Maximization): Use the posterior statistics ($\\boldsymbol{\\mu}^{(t)}$ and $\\boldsymbol{\\Sigma}^{(t)}$) to update the hyperparameters so as to push the evidence toward its maximum. MacKay derived closed-form fixed-point update rules that make this step a single matrix operation.\nIterate until convergence. A strict EM update is guaranteed never to decrease the evidence. MacKay\u0026rsquo;s fixed-point rules do not carry the same formal guarantee, but in practice they usually converge, and typically in fewer iterations.\n5.2 The $\\gamma$ Parameter: Data vs. Prior #Before presenting the update rules, we need to meet $\\gamma_j$ — perhaps the most interpretable quantity in the entire framework.\nDefine:\n$$\\gamma_j = 1 - \\alpha_j \\Sigma_{jj}^{\\text{post}}$$where $\\Sigma_{jj}^{\\text{post}}$ is the $j$-th diagonal element of the posterior covariance.\nTo understand $\\gamma_j$, consider two extremes:\nCase 1: The prior dominates. If $\\alpha_j$ is very large, the prior says \u0026ldquo;this weight is zero.\u0026rdquo; The data cannot override a strong prior with limited observations. As a result, the posterior variance $\\Sigma_{jj}^{\\text{post}}$ approaches the prior variance $\\alpha_j^{-1}$, so $\\alpha_j \\Sigma_{jj}^{\\text{post}} \\approx 1$, and $\\gamma_j \\approx 0$.\nCase 2: The data dominates. If $\\alpha_j$ is small (loose prior) and the data strongly constrains $w_j$, then the posterior variance $\\Sigma_{jj}^{\\text{post}}$ is much smaller than the prior variance $\\alpha_j^{-1}$, so $\\alpha_j \\Sigma_{jj}^{\\text{post}} \\approx 0$, and $\\gamma_j \\approx 1$.\nSo $\\gamma_j \\in [0, 1]$ is the fraction of information about weight $w_j$ that comes from the data (as opposed to the prior). 
It is sometimes called the effective number of data points allocated to feature $j$.\nThe sum $\\gamma = \\sum_j \\gamma_j$ is the total effective number of determined parameters — how many features the data is actually constraining, accounting for the prior regularization.\nThis is an elegant generalization of the classical degrees-of-freedom concept. In ordinary least squares, the effective degrees of freedom is exactly $D$ (the number of features). In BLR with ARD, features that are being suppressed by large $\\alpha_j$ contribute nearly zero to $\\gamma$, reducing the effective complexity of the model automatically.\n5.3 The Update Rules #After deriving the gradient of the log-evidence with respect to $\\alpha_j$ and $\\beta$ (see Appendix B and C for the full derivation), MacKay obtained these fixed-point rules:\nUpdate for $\\alpha_j$ (the ARD hyperparameter):\n$$\\alpha_j^{\\text{new}} = \\frac{\\gamma_j}{\\mu_j^2}$$where $\\mu_j = [\\boldsymbol{\\mu}_\\text{post}]_j$ is the posterior mean of weight $w_j$.\nUpdate for $\\beta$ (the noise precision):\n$$\\beta^{\\text{new}} = \\frac{N - \\gamma}{\\left\\lVert \\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post} \\right\\rVert^2}$$where $\\gamma = \\sum_j \\gamma_j$ is the total effective number of parameters.\nThese formulas are deceptively simple. Let\u0026rsquo;s read them carefully.\n5.4 Reading the $\\alpha_j$ Update #$$\\alpha_j^{\\text{new}} = \\frac{\\gamma_j}{\\mu_j^2}$$Think of the numerator and denominator as competing forces:\nNumerator $\\gamma_j$: How much does the data say about feature $j$? If the data says nothing (data is irrelevant to this feature), $\\gamma_j \\to 0$ and the new $\\alpha_j$ will be very large — the prior gets tightened, the weight is pushed to zero. ARD prunes the feature.\nDenominator $\\mu_j^2$: How large is the posterior weight? 
If the posterior mean is large (the feature has a strong effect), the update keeps $\\alpha_j$ small — the feature remains relevant.\nThe fixed-point property: at convergence, these two forces balance. A feature survives when the data evidence for it is proportional to its squared effect size. This is a natural and elegant condition for feature relevance.\nNumerically, we always add a small regularization $\\epsilon$ to the denominator:\n$$\\alpha_j^{\\text{new}} = \\frac{\\gamma_j}{\\mu_j^2 + \\epsilon}$$to avoid division by zero when $\\mu_j$ is near zero — which happens exactly when a feature is being pruned.\n5.5 Reading the $\\beta$ Update #$$\\beta^{\\text{new}} = \\frac{N - \\gamma}{\\left\\lVert \\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post} \\right\\rVert^2}$$This is the inverse of a noise variance estimate. The noise variance is estimated as:\n$$\\hat{\\sigma}^2 = \\frac{\\left\\lVert \\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post} \\right\\rVert^2}{N - \\gamma}$$which has the beautiful structure of the classical unbiased estimator of variance:\n$$\\hat{\\sigma}^2_{\\text{classical}} = \\frac{\\text{RSS}}{N - D}$$where RSS (Residual Sum of Squares) is $\\sum_{i=1}^{N} (y_i - \\hat{y}_i)^2 = \\left\\lVert \\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post} \\right\\rVert^2$ — the sum of squared prediction errors.\nBut instead of dividing by $N - D$ (where $D$ counts all parameters), we divide by $N - \\gamma$ (where $\\gamma$ counts only the effectively used parameters). Features that have been pruned by ARD don\u0026rsquo;t consume degrees of freedom — their $\\gamma_j \\approx 0$ contribution is essentially zero. The noise estimate automatically corrects for the fact that the model\u0026rsquo;s complexity is lower than the raw feature count suggests. Then $\\beta = 1/\\hat{\\sigma}^2$ gives the precision formula above.\nThis also reveals the anti-overfitting mechanism. 
Suppose the weights try to \u0026ldquo;soak up\u0026rdquo; residual noise (overfitting):\nThe fit improves slightly, but $\\gamma$ increases (more features are being used). The numerator $N - \\gamma$ shrinks. The new $\\beta$ is lower (more noise is assumed), which reduces how much we trust each new observation. Lower $\\beta$ feeds back to the $\\alpha_j$ update, pushing some $\\alpha_j$ values up, pruning features. The algorithm polices itself. No explicit regularization parameter to tune. No validation set. No early stopping heuristic. The evidence framework creates an automatic feedback loop between the noise model and the feature relevance — a property that feels almost too elegant to be practical, but which works robustly in implementation.\n5.6 The Complete Algorithm #Putting it together, the MacKay BLR+ARD algorithm is:\nInitialize: α₁ = α₂ = ... = αD = α₀, β = β₀ Repeat until convergence: 1. (E-step) Compute posterior: Σ_post = (diag(α) + β · ΦᵀΦ)⁻¹ μ_post = β · Σ_post · Φᵀy 2. Compute gamma: γⱼ = 1 − αⱼ · [Σ_post]ⱼⱼ for each j γ = Σⱼ γⱼ 3. (M-step) Update hyperparameters: αⱼ_new = γⱼ / (μⱼ² + ε) for each j β_new = (N − γ) / ‖y − Φμ_post‖² 4. Check convergence: If max_j |αⱼ_new − αⱼ| \u0026lt; tolerance: stop Else: α ← α_new, β ← β_new, continue Convergence note: In practice, the algorithm typically converges in 10–50 iterations for problems of this scale. A practical robustness trick: if $\\beta$ oscillates between iterations, damp it with an exponential moving average before using it in the next E-step.\n5.7 ARD in Action: What Happens During the Iterations #During the iterations, the $\\alpha_j$ values for irrelevant features do not merely grow — they grow unboundedly. Once $\\gamma_j \\approx 0$ and $\\mu_j \\approx 0$, the update rule effectively becomes $\\alpha_j \\leftarrow 0 / 0^+ = \\text{large}$. Large $\\alpha_j$ feeds back into the E-step, making the posterior variance for that feature even smaller, pushing $\\gamma_j$ closer to zero. 
The pruning is self-reinforcing and extremely robust.\nFor relevant features, the opposite happens: small $\\alpha_j$ keeps the posterior responsive to data. $\\gamma_j$ stays near 1. The weight $\\mu_j$ stabilizes at a physically meaningful value.\nThe net result — as seen in our Hall sensor example — is a sparse solution where the $\\alpha$ ratios between relevant and irrelevant features span many orders of magnitude (237,000× for the quadratic term). This is not numerical instability; it is the algorithm communicating very loudly that the irrelevant features should be zero.\n6. Summary: From Bayes\u0026rsquo; Theorem to Automatic Sensor Calibration #6.1 The Full Picture #Let\u0026rsquo;s retrace the journey:\nBayes\u0026rsquo; Theorem tells us how to update beliefs with data. Gaussian priors and likelihoods make this update a tractable linear algebra operation, with exact closed-form posteriors. Bayesian Linear Regression specializes Gaussian inference to the regression setting: the prior is over weights, the likelihood is a linear model with noise. Feature engineering encodes physical domain knowledge as basis functions — the algorithm selects which ones matter. ARD hyperparameters ($\\alpha_j$) give each feature its own regularization strength. Features irrelevant to the data are automatically pruned. Evidence maximization determines the optimal $\\alpha$ and $\\beta$ from data alone — no cross-validation, no external validation set. The MacKay fixed-point algorithm makes evidence maximization computationally efficient: one matrix inversion and a handful of scalar updates per iteration. 6.2 Takeaways for the Practicing Engineer #You can now read the math. The formulas in MacKay (1992) and Tipping (2001) will not seem opaque after working through this post. 
The key equations are:\n$$\\Sigma_\\text{post} = (\\Lambda + \\beta\\, \\Phi^T \\Phi)^{-1}, \\quad \\boldsymbol{\\mu}_\\text{post} = \\beta\\, \\Sigma_\\text{post}\\, \\Phi^T \\mathbf{y}$$$$\\gamma_j = 1 - \\alpha_j \\Sigma_{jj}^{\\text{post}}, \\quad \\alpha_j^{\\text{new}} = \\frac{\\gamma_j}{\\mu_j^2 + \\epsilon}, \\quad \\beta^{\\text{new}} = \\frac{N - \\gamma}{\\lVert \\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post} \\rVert^2}$$You have a mental model for debugging. If $\\alpha_j$ is not growing for a feature you know should be irrelevant, something is wrong with your basis functions — perhaps a relevant feature has been accidentally correlated with an irrelevant one. If $\\beta$ converges to a very low value, your model may be underfitting (your feature vocabulary is incomplete). The hyperparameter values are diagnostic, not just outputs.\nYou understand the uncertainty. When the calibration reports a 95% confidence interval, you know where that interval comes from: the posterior predictive variance decomposes into aleatoric noise ($\\beta^{-1}$) and epistemic weight uncertainty ($\\boldsymbol{\\phi}^T \\Sigma_\\text{post} \\boldsymbol{\\phi}$). You can inspect each component separately.\nYou can evaluate the method critically. BLR+ARD makes strong assumptions: Gaussian noise, linear model in the basis functions, i.i.d. observations. For sensor calibration in a controlled measurement session, these assumptions are almost always valid. For more complex settings (time-varying drift, multiplicative noise, heavy-tailed outliers), they may not be. Knowing the assumptions lets you decide when to use the tool and when to look for alternatives.\n6.3 Coming Up: Part 2 #The next post in this series moves from mathematics to code: \u0026ldquo;From Math to Silicon: Implementing BLR+ARD with Rust and faer\u0026rdquo;. 
We will walk through the Rust implementation in the blr-core crate, map every formula from this post to the corresponding code, and benchmark the implementation against the Python reference.\nAppendices #Appendix A: The Woodbury Lemma - Switching from Observation Space to Parameter Space #This appendix shows the explicit algebraic step that converts the Kalman gain form from Section 2.2 into the compact BLR posterior formulas in Section 3.4. If you are comfortable accepting the result on faith, skip this on first reading — you can always come back.\nThe Woodbury identity is ubiquitous in machine learning precisely because it lets you choose which space to work in. The canonical rule of thumb: invert in the smaller space. When you have fewer features than observations ($D \u003c N$), work in parameter space — this is the BLR case. When you have more features than observations ($D \u003e N$), work in observation space — this is the kernel trick case (e.g., Gaussian Processes). Prof. Hennig\u0026rsquo;s lecture slides show both forms of the posterior side by side (the \u0026ldquo;Kalman gain\u0026rdquo; form and the \u0026ldquo;precision matrix\u0026rdquo; form) precisely because neither is universally preferable; the right choice depends on the relative sizes of $N$ and $D$.\nA.1 The Problem: Two Spaces, Two Costs #After substituting the BLR assignments into the general Hennig formula, the posterior covariance takes the Kalman gain form:\n$$\\Sigma_\\text{post} = \\Lambda^{-1} - \\Lambda^{-1}\\Phi^T\\underbrace{\\left(\\beta^{-1}I_N + \\Phi\\Lambda^{-1}\\Phi^T\\right)^{-1}}_{N \\times N \\text{ inversion}}\\Phi\\Lambda^{-1}$$and the posterior mean:\n$$\\boldsymbol{\\mu}_\\text{post} = \\Lambda^{-1}\\Phi^T\\underbrace{\\left(\\beta^{-1}I_N + \\Phi\\Lambda^{-1}\\Phi^T\\right)^{-1}}_{N \\times N \\text{ inversion}}\\mathbf{y}$$Both require inverting the $N \\times N$ matrix $G = \\beta^{-1}I_N + \\Phi\\Lambda^{-1}\\Phi^T$. 
The cost of inverting an $n \\times n$ matrix scales as $O(n^3)$, so:\nForm Matrix inverted Size Scales with Kalman / observation space $\\beta^{-1}I_N + \\Phi\\Lambda^{-1}\\Phi^T$ $N \\times N$ Observations Precision / parameter space $\\Lambda + \\beta\\Phi^T\\Phi$ $D \\times D$ Features For $D = 6$ features and $N = 100$ observations, the observation-space inversion is roughly $(100/6)^3 \\approx 4600\\times$ more expensive — and it gets worse the more calibration data you collect. The parameter-space form, once the sufficient statistics $\\Phi^T\\Phi$ and $\\Phi^T\\mathbf{y}$ are pre-computed ($O(ND^2)$ once), never grows with $N$ again.\nA.2 The Woodbury Matrix Identity #For matrices of compatible dimensions with $P$ and $R$ invertible:\n$$(P^{-1} + B^T R^{-1} B)^{-1} = P - PB^T(BPB^T + R)^{-1}BP \\tag{A.1}$$$$(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = PB^T(BPB^T + R)^{-1} \\tag{A.2}$$Both identities can be verified by multiplying the left-hand side by $(P^{-1} + B^T R^{-1} B)$ and confirming you recover the identity matrix — straightforward matrix algebra, no approximation involved.\nA.3 Applying the Identity to BLR #Set $P = \\Lambda^{-1}$ (prior covariance, $D \\times D$), $B = \\Phi$ (design matrix, $N \\times D$), $R = \\beta^{-1}I_N$ (noise covariance, $N \\times N$). 
Then:\n$P^{-1} = \\Lambda$ $B^T R^{-1} B = \\Phi^T (\\beta^{-1}I)^{-1} \\Phi = \\beta\\Phi^T\\Phi$ $BPB^T + R = \\Phi\\Lambda^{-1}\\Phi^T + \\beta^{-1}I_N$ Posterior covariance — applying identity (A.1):\n$$\\Sigma_\\text{post} = (\\Lambda^{-1} - \\Lambda^{-1}\\Phi^T G^{-1}\\Phi\\Lambda^{-1}) \\stackrel{\\text{A.1}}{=} (\\Lambda + \\beta\\Phi^T\\Phi)^{-1}$$Posterior mean (with zero prior mean) — applying identity (A.2):\n$$\\boldsymbol{\\mu}_\\text{post} = \\Lambda^{-1}\\Phi^T G^{-1}\\mathbf{y} \\stackrel{\\text{A.2}}{=} (\\Lambda + \\beta\\Phi^T\\Phi)^{-1}\\beta\\Phi^T\\mathbf{y} = \\beta\\,\\Sigma_\\text{post}\\,\\Phi^T\\mathbf{y}$$The two boxed formulas in Section 3.4 are the direct output of these two substitutions. No approximation, no hidden assumption — just the Woodbury identity applied once to each equation.\nAppendix B: Where Do $\\alpha_j$ and $\\gamma_j$ Come From? #This appendix derives the MacKay fixed-point rule for $\\alpha_j$ from first principles. The main text quoted the result; here we show why it has to be true.\nB.1 The Log-Posterior Is a Quadratic Function #In Bayesian inference, we often work with the log-posterior because it is easier to differentiate:\n$$\\ln p(\\mathbf{w} \\mid \\mathbf{y}) = \\ln p(\\mathbf{y} \\mid \\mathbf{w}) + \\ln p(\\mathbf{w}) + \\text{const}$$For our Gaussian setting:\n$$\\ln p(\\mathbf{y} \\mid \\mathbf{w}) = -\\frac{\\beta}{2} \\lVert\\mathbf{y} - \\Phi \\mathbf{w}\\rVert^2 + \\text{const}$$$$\\ln p(\\mathbf{w}) = -\\frac{1}{2} \\mathbf{w}^T \\Lambda \\mathbf{w} + \\text{const} = -\\frac{1}{2} \\sum_j \\alpha_j w_j^2 + \\text{const}$$Both terms are quadratic in $\\mathbf{w}$. Their sum is quadratic. A quadratic with negative leading coefficient is the log of a Gaussian. 
Therefore the posterior is Gaussian — this is the algebraic proof of the Gaussian miracle from Section 2.2.\nB.2 The Hessian Is the Precision Matrix #The Hessian of the log-posterior with respect to $\\mathbf{w}$ is:\n$$H = -\\frac{\\partial^2 \\ln p(\\mathbf{w} \\mid \\mathbf{y})}{\\partial \\mathbf{w}^2} = \\beta\\, \\Phi^T \\Phi + \\Lambda$$Notice: $H$ is exactly the inverse of the posterior covariance:\n$$\\Sigma_\\text{post} = H^{-1} = (\\Lambda + \\beta\\, \\Phi^T \\Phi)^{-1}$$This is not a coincidence. In a Gaussian, the precision matrix (inverse covariance) is the Hessian of the negative log-density. The Hessian encodes the curvature of the log-posterior surface:\nLarge Hessian eigenvalue → sharp curvature → narrow posterior → confident about that direction in weight space. Small Hessian eigenvalue → flat surface → broad posterior → uncertain. The posterior covariance $\\Sigma_\\text{post}$ is literally the inverse of this curvature.\nB.3 The Eigenvalue Balance and the Origin of $\\gamma$ #To understand $\\gamma_j$, consider first a simplified case: all features share a single precision $\\alpha$ (no ARD yet). The Hessian is:\n$$H = \\alpha I + \\beta\\, \\Phi^T \\Phi$$Let $\\lambda_i$ be the eigenvalues of the data matrix $\\beta\\, \\Phi^T \\Phi$. Then the eigenvalues of $H$ are $(\\lambda_i + \\alpha)$. MacKay defined the effective number of determined parameters as:\n$$\\gamma = \\sum_{i=1}^{D} \\frac{\\lambda_i}{\\lambda_i + \\alpha}$$Each term is a number between 0 and 1:\nIf $\\lambda_i \\gg \\alpha$: the data dominates in direction $i$. That parameter is \u0026ldquo;data-determined.\u0026rdquo; Contribution to $\\gamma$: nearly 1. If $\\lambda_i \\ll \\alpha$: the prior dominates. That parameter is \u0026ldquo;prior-determined.\u0026rdquo; Contribution to $\\gamma$: nearly 0. 
So $\\gamma$ counts how many parameters the data is actually constraining, on a soft scale from 0 to $D$.\nFor the general ARD case with per-feature $\\alpha_j$, the per-feature version is:\n$$\\gamma_j = 1 - \\alpha_j \\Sigma_{jj}^{\\text{post}}$$This can be derived from the same eigenvalue logic applied to the $j$-th diagonal: $\\Sigma_{jj}^{\\text{post}}$ is the posterior variance for feature $j$, which is small when data constrains $w_j$ and large (close to $\\alpha_j^{-1}$) when the prior dominates. Substituting the prior-dominated limit $\\Sigma_{jj} \\approx \\alpha_j^{-1}$ gives $\\gamma_j \\approx 0$; substituting a fully data-determined case gives $\\gamma_j \\approx 1$.\nB.4 The MacKay Fixed-Point Rule for $\\alpha_j$ #Now we derive $\\alpha_j^{\\text{new}} = \\gamma_j / \\mu_j^2$. The starting point is the log marginal likelihood (evidence):\n$$\\ln p(\\mathbf{y} \\mid \\alpha, \\beta) = \\ln \\int p(\\mathbf{y} \\mid \\mathbf{w}, \\beta)\\, p(\\mathbf{w} \\mid \\alpha)\\, d\\mathbf{w}$$For Gaussians, this integral evaluates to a closed-form expression involving the posterior quantities. 
Taking the derivative with respect to $\\alpha_j$ and setting it to zero (the optimality condition):\n$$\\frac{\\partial \\ln p(\\mathbf{y} \\mid \\alpha, \\beta)}{\\partial \\alpha_j} = 0$$After algebraic manipulation — using the matrix identity $\\frac{\\partial}{\\partial \\alpha_j} \\ln \\det H = \\frac{\\partial}{\\partial \\alpha_j} \\text{tr}(\\ln H)$ and differentiating through the posterior covariance — MacKay obtained:\n$$\\frac{1}{\\alpha_j} - \\Sigma_{jj}^{\\text{post}} - \\mu_j^2 = 0$$Rearranging:\n$$\\frac{1}{\\alpha_j} = \\mu_j^2 + \\Sigma_{jj}^{\\text{post}}$$Multiplying both sides by $\\alpha_j$:\n$$1 = \\alpha_j \\mu_j^2 + \\alpha_j \\Sigma_{jj}^{\\text{post}} = \\alpha_j \\mu_j^2 + (1 - \\gamma_j)$$Therefore:\n$$\\alpha_j \\mu_j^2 = \\gamma_j \\implies \\alpha_j = \\frac{\\gamma_j}{\\mu_j^2}$$This is a fixed-point equation: the optimal $\\alpha_j$ is expressed in terms of the posterior statistics ($\\mu_j$, $\\Sigma_{jj}^{\\text{post}}$), which themselves depend on $\\alpha_j$. Iterating the update alternately with the posterior computation converges to the evidence-maximizing solution.\nThe chain of reasoning is:\n$$\\text{Hessian} = \\text{Precision} \\xrightarrow{\\text{Eigenvalues}} \\gamma_j = \\text{data fraction} \\xrightarrow{\\text{Evidence gradient}} \\alpha_j = \\frac{\\gamma_j}{\\mu_j^2}$$Full reference: See MacKay in Primary References — Sections 4 and Appendix D contain the complete derivation.\nAppendix C: Where Does the $\\beta$ Update Come From? 
#This appendix derives the noise precision update from the marginal likelihood, explaining the \u0026ldquo;degrees of freedom\u0026rdquo; interpretation.\nC.1 The Log Evidence and Its Two Competing Terms #The log marginal likelihood decomposes naturally into an accuracy term and a complexity term:\n$$\\ln p(\\mathbf{y} \\mid \\alpha, \\beta) = \\underbrace{-\\frac{1}{2} \\lVert\\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post}\\rVert^2 \\beta + \\ldots}_{\\text{accuracy}} \\underbrace{- \\frac{1}{2} \\ln |H| + \\ldots}_{\\text{complexity}}$$The accuracy term rewards small residuals (good fit). The complexity term penalizes high curvature in the log-posterior (a model that is too flexible, fitting noise as well as signal). The evidence is maximized at the sweet spot where neither term dominates.\nThis is the Bayesian Occam\u0026rsquo;s razor: among all models that fit the data roughly equally well, the simpler one has higher evidence.\nC.2 The Derivative Condition for $\\beta$ #Taking the derivative of the log evidence with respect to $\\beta$ and setting to zero:\n$$\\frac{\\partial \\ln p(\\mathbf{y} \\mid \\alpha, \\beta)}{\\partial \\beta} = 0$$After applying the chain rule through the determinant and the posterior mean, MacKay derived the fixed-point condition:\n$$\\frac{N}{\\beta} - \\lVert\\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post}\\rVert^2 - \\text{tr}(\\Phi \\Sigma_\\text{post} \\Phi^T) = 0$$The last term equals $\\gamma / \\beta$: because $\\beta\\, \\Phi^T \\Phi = H - \\Lambda$, we have $\\beta\\, \\text{tr}(\\Phi \\Sigma_\\text{post} \\Phi^T) = \\text{tr}(\\Sigma_\\text{post}(H - \\Lambda)) = D - \\sum_j \\alpha_j \\Sigma_{jj}^{\\text{post}} = \\gamma$. Substituting gives $(N - \\gamma)/\\beta = \\lVert\\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post}\\rVert^2$, and solving for $\\beta$ yields:\n$$\\beta^{\\text{new}} = \\frac{N - \\gamma}{\\lVert\\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post}\\rVert^2}$$C.3 The \u0026ldquo;Data Points Used Up\u0026rdquo; Interpretation #This formula has a beautiful interpretation. Think of your $N$ data points as a budget:\nThe model \u0026ldquo;spends\u0026rdquo; $\\gamma$ data points to determine the weights. 
A feature with $\\gamma_j \\approx 1$ \u0026ldquo;consumes\u0026rdquo; one full data point to determine its weight. Only $N - \\gamma$ data points remain to estimate the noise. The noise variance estimate is therefore residuals divided by the remaining budget:\n$$\\hat{\\sigma}^2 = \\frac{1}{\\beta^{\\text{new}}} = \\frac{\\lVert\\mathbf{y} - \\Phi \\boldsymbol{\\mu}_\\text{post}\\rVert^2}{N - \\gamma}$$Compare to classical statistics, where the unbiased estimator uses $N - D$ (the number of observations minus the number of fitted parameters). The BLR version replaces the hard count $D$ with the soft count $\\gamma = \\sum_j \\gamma_j$. Features pruned by ARD (with $\\gamma_j \\approx 0$) do not consume degrees of freedom. This is important: it means the noise estimate is not biased by including irrelevant features in the count.\nC.4 The Self-Policing Anti-Overfitting Mechanism #The interaction between the $\\alpha_j$ and $\\beta$ updates creates an elegant overfitting prevention mechanism. Suppose the weights try to overfit — absorbing noise into the fit:\nOverfitting reduces the residuals $\\lVert\\mathbf{y} - \\Phi \\boldsymbol{\\mu}\\rVert^2$. But it also increases $\\gamma$ (more features are being actively used). The numerator $N - \\gamma$ shrinks; the denominator shrinks too. The net effect on $\\beta$ is ambiguous or even decreasing (less confidence in observations). Lower $\\beta$ flows into the $\\alpha_j$ updates, increasing some $\\alpha_j$, pruning features. The model is forced back toward sparsity. MacKay\u0026rsquo;s summary of this property:\nThe formula ensures that the noise estimate isn\u0026rsquo;t biased by the fact that the weights are already trying to \u0026lsquo;soak up\u0026rsquo; some of the patterns in the data. 
It\u0026rsquo;s a very elegant way of saying: the noise variance is the average squared error, but only averaged over the dimensions that weren\u0026rsquo;t already captured by the model weights.\nThe BLR+ARD objective function builds in Occam\u0026rsquo;s razor, complexity penalization, and overfitting prevention — not as separate components that require tuning, but as natural consequences of the evidence maximization framework.\nFootnotes \u0026amp; References #References #This section collects literature and course references cited throughout the post. For definitions and foundational concepts (e.g., Hessian, Hall sensor), see the linked Wikipedia articles embedded in the text.\nImplementation Notes # Notation Convention: In Prof. Hennig\u0026rsquo;s framework, the linear map $A$ represents the forward transformation from latent variables to observations. In the BLR context, $A = \\Phi$ where each row of the design matrix $\\Phi$ is a feature vector evaluated at a single observation point. This gives the standard regression form: $\\mathbf{y} = \\Phi \\mathbf{w}$.\nLiterature \u0026amp; Courses # MacKay, D.J.C. (1992). \u0026ldquo;A Practical Bayesian Framework for Backpropagation Networks.\u0026rdquo; Neural Computation 4(3):448–472. — Original derivation of the evidence framework and ARD fixed-point rules.\nTipping, M.E. (2001). \u0026ldquo;Sparse Bayesian Learning and the Relevance Vector Machine.\u0026rdquo; Journal of Machine Learning Research 1:211–244. — Extension of ARD to kernel methods; clear exposition of fixed-point rules.\nHennig, P. (2025). \u0026ldquo;Probabilistic Machine Learning\u0026rdquo; course, University of Tübingen. Lecture series: https://www.youtube.com/playlist?list=PL05umP7R6ij0hPfU7Yuz8J9WXjlb3MFjm. See lecture 3 (\u0026ldquo;Gaussian Inference\u0026rdquo;) for the framework used in Section 2.\nRasmussen, C.E. \u0026amp; Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. 
Free PDF: http://gaussianprocess.org/gpml/. Chapters 2 (Regression) and 5 (Model Selection).\nMurphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Free PDF: https://probml.github.io/pml-book/. Chapter 11 (Linear Regression).\nRamsden, E. (2006). Hall-Effect Sensors: Theory and Application. Elsevier. Sensor physics background.\nWoodbury Matrix Identity. Wikipedia: https://en.wikipedia.org/wiki/Woodbury_matrix_identity. The matrix inversion lemma used in Appendix A.\nMarginal Likelihood \u0026amp; Evidence. Wikipedia: https://en.wikipedia.org/wiki/Marginal_likelihood. Overview of evidence and model selection.\nContinuing the Series #This article lays the mathematical foundation. The next part, From Math to Silicon: Implementing BLR+ARD with Rust and faer, walks through the production code that translates these formulas into efficient, numerically robust implementations — focusing on the key insight that you should never invert a matrix if a linear solve will do.\n","date":"15 May 2026","permalink":"https://wamli.github.io/blog/blr-and-ard/","section":"Blog","summary":"Bayesian Linear Regression with Automatic Relevance Determination combines principled uncertainty quantification, automatic feature selection, and a closed-form solution—all from one coherent mathematical framework. This post explains how and why.","title":"When Your Sensor Knows What It Doesn't Know"},{"content":"The first post in this series showed you why Bayesian Linear Regression with Automatic Relevance Determination is the right tool for sensor calibration: principled uncertainty, automatic feature selection, and a closed-form solution that a ten-year-old C compiler could run. This post shows you the part that textbooks skip — what actually happens when you translate those beautiful matrix formulas into production code.\nThe central theme is deceptively simple: never invert a matrix if you can avoid it. 
Everything else follows from understanding why.\nWhere We Left Off #The previous post derived two core formulas for Bayesian Linear Regression. Given a design matrix $\\Phi \\in \\mathbb{R}^{N \\times D}$, a diagonal prior precision $\\Lambda = \\text{diag}(\\alpha_1, \\ldots, \\alpha_D)$, and noise precision $\\beta$, the posterior over weights is:\n$$\\boldsymbol{\\Sigma}_\\text{post} = \\left(\\Lambda + \\beta\\,\\Phi^T\\Phi\\right)^{-1} \\tag{1}$$$$\\boldsymbol{\\mu}_\\text{post} = \\beta\\,\\Sigma_\\text{post}\\,\\Phi^T\\mathbf{y} \\tag{2}$$The superscript $-1$ in equation (1) is the source of almost every numerical problem you will encounter. This post is about replacing it with something better.\nWe also derived the MacKay EM update rules for learning the hyperparameters $\\alpha$ and $\\beta$ from data. This post walks through the code in blr-core/src/gaussian.rs and blr-core/src/ard.rs that makes this work in practice — efficiently, numerically safely, and on a CPU that might be inside a WASM component.\nPart I — Linear Algebra for Bayesian Inference #1.1 The Matrix Inversion Problem #Let us begin with a concrete diagnosis. Suppose you have a $D \\times D$ symmetric positive-definite (SPD) matrix $A$ and you want to compute $A^{-1} b$ for some vector $b$. The naive approach:\n// DON\u0026#39;T DO THIS let a_inv = invert(a); let x = a_inv * b; This wastes $O(D^3)$ flops to compute all $D^2$ entries of $A^{-1}$, when the actual answer $x$ lives in $\\mathbb{R}^D$. Worse, the error in $x$ scales with the square of $A$\u0026rsquo;s condition number $\\kappa(A)^2$ (see Appendix A) — because you first invert with error $\\sim \\kappa(A) \\cdot \\epsilon_\\text{machine}$, then multiply, amplifying that error again. By the time you examine the diagonal of $\\Sigma_\\text{post}$ to extract uncertainty estimates, your $\\pm 3\\sigma$ confidence bands might be meaningless.\nThe correct way to phrase the problem is: you do not want $A^{-1}$. 
You want the solution $x$ to the system $Ax = b$. That reframing unlocks a much better path.\nGoal Naive approach Better approach Compute $A^{-1}b$ Form $A^{-1}$, multiply Solve $Ax = b$ directly Compute $\\log|A|$ Eigendecompose, sum logs Cholesky $LL^T$, sum $2\\log L_{ii}$ Compute $b^T A^{-1} b$ Form $A^{-1}$, two multiplications Solve $Az = b$, compute $b^T z$ Cost $O(D^3)$ + amplified errors $O(D^3)$ once, then $O(D^2)$ per RHS The key insight is that $A$ is symmetric positive-definite, and SPD matrices have a uniquely efficient factorization.\n1.2 The Cholesky Decomposition #For any SPD matrix $A$, there exists a unique lower-triangular matrix $L$ with positive diagonal entries such that:\n$$A = LL^T$$This is the Cholesky decomposition (or LLT factorization). It costs approximately $\\frac{D^3}{3}$ flops — roughly half the cost of LU decomposition, because symmetry means you only process the lower triangle. The algorithm is backward-stable for SPD matrices: the computed $L$ satisfies $(L + \\delta L)(L + \\delta L)^T = A$ with $\\|\\delta L\\| / \\|L\\| \\leq c \\cdot \\epsilon_\\text{machine} \\cdot \\kappa(A)$ — a first-power condition number dependence, not the squared dependence you get from inverting (see Appendix A.2 for details).\nWhy is $\\Lambda + \\beta\\Phi^T\\Phi$ always SPD? Let\u0026rsquo;s verify:\nSymmetry: $(\\Lambda + \\beta\\Phi^T\\Phi)^T = \\Lambda^T + \\beta(\\Phi^T\\Phi)^T = \\Lambda + \\beta\\Phi^T\\Phi$. ✓ Positive-definiteness: For any nonzero $v$, $v^T(\\Lambda + \\beta\\Phi^T\\Phi)v = v^T\\Lambda v + \\beta\\|\\Phi v\\|^2$. The first term is $\\sum_j \\alpha_j v_j^2$, which is strictly positive because all $\\alpha_j \u003e 0$ and at least one $v_j$ is nonzero; the second term is nonnegative. So the whole expression is strictly positive. ✓ SPD is not just an algebraic nicety — it is a mathematical guarantee that the Cholesky algorithm will never fail with a \u0026ldquo;negative pivot\u0026rdquo; error, and that the factorization is unique. 
Whenever you see the posterior precision matrix blow up in a poorly conditioned problem, the error from Cholesky tells you exactly what went wrong: SingularMatrix.\nIn faer, the Cholesky factorization of a matrix a: Mat\u0026lt;f64\u0026gt; is simply:\nlet llt = a.llt(Side::Lower)?; The .llt() call computes $L$ and bundles it into a solver object. From that point forward, solving $Ax = b$ is:\nlet x = llt.solve(b.as_ref()); This performs two triangular solves — a forward substitution through $L$ and a back substitution through $L^T$ — at cost $O(D^2)$ each. The factorization is amortized across multiple right-hand sides. In BLR, this is exactly what we exploit: the same Cholesky factor solves for both $\\boldsymbol{\\Sigma}$ and $\\boldsymbol{\\mu}$ in one shot.\n1.3 Computing the Mahalanobis Distance Without Inverting #The first place Cholesky appears in gaussian.rs is in log_pdf — the log-probability density of a multivariate Gaussian. The formula is:\n$$\\log\\mathcal{N}(x;\\,\\mu,\\,\\Sigma) = -\\frac{1}{2}(x - \\mu)^T\\Sigma^{-1}(x - \\mu) - \\frac{1}{2}\\log|\\Sigma| - \\frac{D}{2}\\log(2\\pi)$$The quadratic form $(x - \\mu)^T\\Sigma^{-1}(x - \\mu)$ is the squared Mahalanobis distance — a generalization of the $z$-score that accounts for correlations between dimensions. 
Computing it by forming $\\Sigma^{-1}$ is wasteful; using the Cholesky factor makes it elegant:\n// from blr-core/src/gaussian.rs pub fn log_pdf(\u0026amp;self, x: \u0026amp;[f64]) -\u0026gt; f64 { let d = self.dim; let sigma = Mat::\u0026lt;f64\u0026gt;::from_fn(d, d, |i, j| self.cov[i * d + j]); let diff = Mat::\u0026lt;f64\u0026gt;::from_fn(d, 1, |i, _| x[i] - self.mean[i]); let llt = sigma .llt(Side::Lower) .expect(\u0026#34;Covariance must be positive-definite for log_pdf\u0026#34;); // Solve L · Lᵀ · z = diff → z = Σ⁻¹ diff (but we never form Σ⁻¹) let z = llt.solve(diff.as_ref()); // diffᵀ z = diffᵀ Σ⁻¹ diff (the squared Mahalanobis distance) let quadratic: f64 = (0..d).map(|i| diff[(i, 0)] * z[(i, 0)]).sum(); let logdet = cholesky_logdet(\u0026amp;sigma, d).expect(\u0026#34;Covariance must be PD\u0026#34;); -0.5 * quadratic - 0.5 * logdet - (d as f64 / 2.0) * (2.0 * std::f64::consts::PI).ln() } Read through the code line by line:\nWe build the $D \\times D$ covariance matrix sigma from the flattened row-major self.cov array — faer\u0026rsquo;s Mat::from_fn is a clean way to do this with explicit index computation. We compute diff = $x - \\mu$. .llt() factors $\\Sigma = LL^T$. If the covariance is not positive-definite (perhaps due to accumulated numerical drift), this panics immediately with a diagnostic message. .solve(diff) computes $z$ such that $\\Sigma z = \\text{diff}$ — in other words, $z = \\Sigma^{-1}(x - \\mu)$. The inner product $(x - \\mu)^T z = (x - \\mu)^T \\Sigma^{-1} (x - \\mu)$ is the squared Mahalanobis distance, obtained without ever computing $\\Sigma^{-1}$. Note that $\\|z\\|^2$ would instead give $(x - \\mu)^T \\Sigma^{-2} (x - \\mu)$; the squared-norm shortcut is valid only for the half-solve $L u = x - \\mu$, where $\\|u\\|^2$ equals the same quadratic form. The key insight: you wanted a scalar (the squared Mahalanobis distance). You got it by solving a linear system. The matrix inverse was never needed. This pattern — \u0026ldquo;replace inversion with a solve\u0026rdquo; — appears throughout the entire codebase.\nThe logdet term is handled separately by a helper function. 
We\u0026rsquo;ll look at that next.\n1.4 Log-Determinant via the Cholesky Diagonal #The log-determinant $\\log|\\Sigma|$ appears in the log-pdf normalization, and critically, in the log marginal likelihood (the EM objective). Computing it numerically requires care: $\\det(\\Sigma)$ can be astronomically small or large for moderate $D$, causing underflow or overflow before you can take the logarithm.\nThe Cholesky factorization solves this too. Since $\\Sigma = LL^T$:\n$$|\\Sigma| = |L|^2 = \\left(\\prod_{j=1}^{D} L_{jj}\\right)^2$$(because the determinant of a triangular matrix is the product of its diagonal, and the factor of 2 comes from $|LL^T| = |L||L^T| = |L|^2$). Taking logarithms:\n$$\\log|\\Sigma| = 2\\sum_{j=1}^{D} \\log L_{jj}$$Every $L_{jj} \u003e 0$ by the SPD guarantee, so the logarithms are all finite. And since we are summing logarithms of numbers that are individually $O(1)$ to $O(10^3)$, there is no overflow or underflow problem.\nThe implementation in gaussian.rs runs a manual Cholesky factorization specifically to accumulate this sum:\n// from blr-core/src/gaussian.rs pub(crate) fn cholesky_logdet(mat: \u0026amp;Mat\u0026lt;f64\u0026gt;, d: usize) -\u0026gt; Result\u0026lt;f64, BLRError\u0026gt; { let mut a = mat.clone(); for j in 0..d { // Compute diagonal pivot: L[j,j] = sqrt(A[j,j] - Σ_{k\u0026lt;j} L[j,k]²) let mut diag = a[(j, j)]; for k in 0..j { let l_jk = a[(j, k)]; diag -= l_jk * l_jk; } if diag \u0026lt;= 0.0 { return Err(BLRError::SingularMatrix); } let l_jj = diag.sqrt(); a[(j, j)] = l_jj; // store L[j,j] in-place // Fill the column below the diagonal for i in (j + 1)..d { let mut s = a[(i, j)]; for k in 0..j { s -= a[(i, k)] * a[(j, k)]; } a[(i, j)] = s / l_jj; // L[i,j] = (A[i,j] - Σ_{k\u0026lt;j} L[i,k]L[j,k]) / L[j,j] } } Ok(2.0 * (0..d).map(|j| a[(j, j)].ln()).sum::\u0026lt;f64\u0026gt;()) } A few things worth noting:\nThe factorization is done in-place on a clone of the input matrix. 
The lower triangle is overwritten with $L$; the upper triangle is never read. The clone is the only allocation. The check if diag \u0026lt;= 0.0 is the SPD guard. In a well-posed Bayesian problem, $\\Lambda + \\beta\\Phi^T\\Phi$ is always strictly SPD, because every $\\alpha_j \u003e 0$. If you hit SingularMatrix, something has gone wrong upstream: either your design matrix has nearly linearly dependent columns that the $\\alpha_j$ are too small to regularise in floating point, or your hyperparameter initialization is degenerate. The function is pub(crate) — it\u0026rsquo;s a shared utility used by both gaussian.rs (in log_pdf) and ard.rs (in the E-step, for the log-determinant term of the log-evidence). One natural question: faer\u0026rsquo;s llt() already computes the Cholesky factor. Why not just read off the diagonal? The answer is that faer\u0026rsquo;s solver abstraction doesn\u0026rsquo;t expose the raw $L$ matrix in a convenient slice form, and for this particular computation — summing log-diagonals — the explicit loop is clearer and its cost is negligible for the matrix sizes we operate at ($D \\leq 20$ in the Hall sensor calibration). This is a deliberate tradeoff: clarity over abstraction.\n1.5 The Posterior Update — Two Forms, One Identity #The most important function in gaussian.rs is condition — the Bayesian posterior update. 
Given the current Gaussian prior $p(\\mathbf{w}) = \\mathcal{N}(\\boldsymbol{\\mu}, \\Sigma)$ and new observations $\\mathbf{y} = A\\mathbf{w} + \\boldsymbol{\\epsilon}$ with homoscedastic noise $\\boldsymbol{\\epsilon} \\sim \\mathcal{N}(\\mathbf{0}, \\sigma^2 I_N)$, it computes the exact posterior $p(\\mathbf{w} \\mid \\mathbf{y})$.\nThere are two algebraically equivalent ways to perform this update, each named for the space in which the key Cholesky factorization lives:\nForm Cholesky size Cheaper when Observation-space (Gram form) $N \\times N$ $N \u003c D$ Parameter-space (precision form) $D \\times D$ $D \\leq N$ For Hall sensor calibration — $D \\approx 6$ features, $N \\approx 25$–$100$+ observations — $D \\ll N$ holds in every realistic scenario. The parameter-space form is strictly cheaper and always selected in production. The Gram form is retained for generality: if you ever build a model with far more basis functions than observations (sparse kernel regression, compressed sensing), the adaptive dispatcher switches to it automatically.\nThe Gram Form (Observation-Space, $N \\times N$) #The Gram form works in the $N$-dimensional observation space. Define the Gram matrix:\n$$G = \\sigma^2 I_N + A \\Sigma A^T \\quad (N \\times N)$$This is the total predictive covariance in observation space — sensor noise $\\sigma^2 I_N$ plus the prior uncertainty $\\Sigma$ propagated through the linear measurement model $A$. Its Cholesky factor drives both posterior updates:\n$$\\boldsymbol{\\mu}' = \\boldsymbol{\\mu} + \\Sigma A^T G^{-1}(\\mathbf{y} - A\\boldsymbol{\\mu}) \\tag{Gram-mean}$$$$\\Sigma' = \\Sigma - \\Sigma A^T G^{-1} A\\Sigma \\tag{Gram-cov}$$$G^{-1}$ never appears explicitly. We introduce $Z$ by solving $GZ = A\\Sigma$, so $Z = G^{-1} A\\Sigma$ (shape $N \\times D$). 
Both updates then follow from $Z$:\n$$\\boldsymbol{\\mu}' = \\boldsymbol{\\mu} + Z^T(\\mathbf{y} - A\\boldsymbol{\\mu}), \\qquad \\Sigma' = \\Sigma - \\Sigma A^T Z$$The computational bottleneck is the $N \\times N$ Cholesky factorization of $G$, costing $\\frac{N^3}{3}$ flops.\nThe Woodbury Identity: The Bridge Between Forms #The two forms compute identical posteriors. The Woodbury matrix identity (also called the matrix inversion lemma) is the algebraic proof. In its general form:\n$$(P_0 + UCV)^{-1} = P_0^{-1} - P_0^{-1} U \\left(C^{-1} + VP_0^{-1}U\\right)^{-1} V P_0^{-1}$$Setting $P_0 = \\Sigma^{-1}$, $U = A^T$, $C = \\frac{1}{\\sigma^2}I$, $V = A$ gives:\n$$\\underbrace{\\left(\\Sigma^{-1} + \\frac{1}{\\sigma^2} A^T A\\right)^{-1}}_{\\Sigma_\\text{post}\\ \\text{via}\\ D \\times D\\ \\text{form}} = \\Sigma - \\Sigma A^T \\underbrace{\\left(\\sigma^2 I_N + A\\Sigma A^T\\right)^{-1}}_{G^{-1}\\ \\text{via}\\ N \\times N\\ \\text{form}} A\\Sigma$$The left side is the parameter-space form; the right side is the observation-space (Gram) form. They produce the same $\\Sigma_\\text{post}$. The Woodbury identity is an exact algebraic equality, not an approximation.\n(The complete derivation — showing how this emerges from the general Gaussian inference update — is in Appendix A of the companion post When Your Sensor Knows What It Doesn\u0026rsquo;t Know.)\nThe Precision Form (Parameter-Space, $D \\times D$) #The precision form builds the posterior precision matrix directly in the $D$-dimensional weight space:\n$$P = \\Sigma_\\text{prior}^{-1} + \\frac{1}{\\sigma^2} A^T A \\quad (D \\times D) \\tag{posterior precision}$$To form $P$ we need $\\Sigma_\\text{prior}^{-1}$. A naive approach would approximate this as $\\lambda_0 I_D$ (isotropic). This is wrong: it would make the two forms numerically inconsistent, and the approximation error compounds across sequential updates. 
The correct approach derives $\\Sigma_\\text{prior}^{-1}$ exactly by Cholesky-factoring self.cov — the distribution\u0026rsquo;s current covariance in its role as the prior:\n$$\\Sigma_\\text{prior} = L_0 L_0^T \\;\\Longrightarrow\\; \\Sigma_\\text{prior}^{-1} = \\text{solve}(L_0 L_0^T,\\ I_D)$$With the exact prior precision in hand, the posterior follows in five steps:\n1. Build $P = \\Sigma_\\text{prior}^{-1} + \\frac{1}{\\sigma^2} A^T A$ ($D \\times D$) 2. Cholesky-factor $P = LL^T$ 3. Solve $P X = I_D$ to get $\\Sigma_\\text{post} = P^{-1}$ 4. Form the information vector: $r = \\Sigma_\\text{prior}^{-1}\\boldsymbol{\\mu}_\\text{prior} + \\frac{1}{\\sigma^2} A^T \\mathbf{y}$ ($D \\times 1$) 5. Solve $P \\boldsymbol{\\mu}_\\text{post} = r$ — reuse the Cholesky from step 2 Steps 3 and 5 amortize the single $O(D^3)$ Cholesky from step 2 across two solves: the identity solve in step 3 ($D$ right-hand sides at $O(D^2)$ each, so $O(D^3)$ in total) and the single-vector solve in step 5 ($O(D^2)$). This is the same pattern as Section 1.2. The prior Cholesky is a separate $O(D^3)$ factorization. All matrix dimensions are $D \\times D$; $N$ appears only in the accumulation of $A^T A$ and $A^T \\mathbf{y}$, which produce $D \\times D$ and $D \\times 1$ outputs respectively, regardless of how large $N$ is.\nWhy materializing $\\Sigma_\\text{post}$ here is justified. Step 3 solves $PX = I_D$, which formally computes $P^{-1}$. The ARD M-step requires the diagonal of $\\Sigma_\\text{post}$ for the $\\gamma_j$ computation (Section 2.4), and prediction requires the full matrix for epistemic uncertainty at arbitrary test points (Section 3.5). The cost is $O(D^3)$, the same order as the factorization itself, and depends only on $D$, not on $N$.\nThe Dispatch #// from blr-core/src/gaussian.rs — condition() /// Bayesian posterior update: p(w | y) from p(w) = N(μ, Σ) and y = Aw + ε. 
/// /// Dispatches to the cheaper form automatically: /// - n_obs \u0026lt; d_feat → observation-space Gram form (N×N Cholesky) /// - n_obs \u0026gt;= d_feat → parameter-space precision form (D×D Cholesky) /// /// For sensor calibration (D ≈ 6–16, N ≈ 25–100+) the precision form is /// always selected. See the Woodbury identity in blr-and-ard.md Appendix A /// for the proof of algebraic equivalence. /// /// `noise_variance` is the homoscedastic scalar σ² (same for all observations). pub fn condition( self, a: \u0026amp;[f64], n_obs: usize, d_feat: usize, y: \u0026amp;[f64], noise_variance: f64, ) -\u0026gt; Result\u0026lt;Self, BLRError\u0026gt; { if n_obs \u0026lt; d_feat { self.condition_gram_form(a, n_obs, y, noise_variance) } else { self.condition_precision_form(a, n_obs, y, noise_variance) } } The tie-break (n_obs == d_feat → precision form) is deliberate: it matches the convention used by the ARD loop in ard.rs, which always builds the posterior precision in $D$-dimensional parameter space.\nNote on noise model. The noise_variance: f64 parameter replaces the earlier lambda: \u0026amp;[f64] (a per-observation noise vector). For a single-session Hall sensor calibration the measurement conditions are uniform, so heteroscedastic noise adds complexity without physical justification. 
The scalar model also simplifies both forms: in the Gram form, $\\sigma^2$ is added uniformly to the diagonal of $G$; in the precision form, $\\frac{1}{\\sigma^2} A^T A$ reduces to a single scalar multiplication rather than a diagonal-weighted sum.\nGram Form — Code Walkthrough #// from blr-core/src/gaussian.rs — condition_gram_form() (internal) // ── Step 1: Gram matrix G = A Σ Aᵀ + σ²I (N×N) ───────────────────────── let a_sigma = /* A · Σ (N×D)×(D×D) → N×D */; let mut gram = /* a_sigma · Aᵀ (N×D)×(D×N) → N×N */; for i in 0..n_obs { gram[(i, i)] += noise_variance; // add σ² uniformly (homoscedastic) } // ── Step 2: Cholesky G = L Lᵀ ──────────────────────────────────────────── let llt_gram = gram.llt(Side::Lower).map_err(|_| BLRError::SingularMatrix)?; // ── Step 3: Solve G·Z = A·Σ → Z = G⁻¹·A·Σ (N×D) ───────────────────── // sigma_at = Σ·Aᵀ (D×N); A·Σ = (Σ·Aᵀ)ᵀ is the RHS for the solve let sigma_at = /* Σ · Aᵀ (D×N) */; let z = llt_gram.solve(sigma_at.as_ref().transpose()); // Z is N×D // ── Step 4: Mean update μ\u0026#39; = μ + Zᵀ·(y − A·μ) ─────────────────────────── let residual = /* y − A·μ (N×1) */; let delta_mu = /* Zᵀ · residual (D×1) */; // ── Step 5: Covariance update Σ\u0026#39; = Σ − Σ·Aᵀ·Z (D×D) ─────────────────── faer::linalg::matmul::matmul( sigma_new.as_mut(), Accum::Add, sigma_at.as_ref(), z.as_ref(), -1.0_f64, // Σ\u0026#39; = Σ + (-1.0) × (Σ Aᵀ · Z) Par::Seq, ); Step 3 is the \u0026ldquo;replace inversion with a solve\u0026rdquo; pattern from Section 1.1, applied at full generality: solving $GZ = A\\Sigma$ gives $Z = G^{-1}A\\Sigma$ without materialising $G^{-1}$. 
Both update equations use only $Z$ and $\\Sigma A^T$; the Gram matrix\u0026rsquo;s inverse never appears.\nPrecision Form — Code Walkthrough #// from blr-core/src/gaussian.rs — condition_precision_form() (internal) let d = self.dim; let sigma_prior = Mat::\u0026lt;f64\u0026gt;::from_fn(d, d, |i, j| self.cov[i * d + j]); let mu_prior = Mat::\u0026lt;f64\u0026gt;::from_fn(d, 1, |i, _| self.mean[i]); let a_mat = Mat::\u0026lt;f64\u0026gt;::from_fn(n_obs, d, |i, j| a[i * d + j]); let y_mat = Mat::\u0026lt;f64\u0026gt;::from_fn(n_obs, 1, |i, _| y[i]); // ── Step 1: Cholesky Σ_prior; derive Σ_prior⁻¹ and Σ_prior⁻¹·μ_prior ──── let llt_prior = sigma_prior.llt(Side::Lower) .map_err(|_| BLRError::SingularMatrix)?; let sigma_prior_inv = llt_prior.solve(Mat::\u0026lt;f64\u0026gt;::identity(d, d).as_ref()); let prior_info_vec = llt_prior.solve(mu_prior.as_ref()); // Σ_prior⁻¹·μ // ── Step 2: Build P = Σ_prior⁻¹ + (1/σ²)·AᵀA (D×D) ───────────────────── let mut at_a = Mat::\u0026lt;f64\u0026gt;::zeros(d, d); matmul::matmul(at_a.as_mut(), Accum::Replace, a_mat.as_ref().transpose(), a_mat.as_ref(), 1.0 / noise_variance, Par::Seq); let prec_post = /* sigma_prior_inv + at_a (element-wise sum, D×D) */; // ── Step 3: Cholesky P = L Lᵀ ──────────────────────────────────────────── let llt_post = prec_post.llt(Side::Lower).map_err(|_| BLRError::SingularMatrix)?; // ── Step 4: Σ_post = P⁻¹ (solve P·X = I_D) ───────────────────────────── let sigma_post = llt_post.solve(Mat::\u0026lt;f64\u0026gt;::identity(d, d).as_ref()); // ── Step 5: μ_post = Σ_post·(Σ_prior⁻¹·μ_prior + (1/σ²)·Aᵀ·y) ────────── let mut at_y = Mat::\u0026lt;f64\u0026gt;::zeros(d, 1); matmul::matmul(at_y.as_mut(), Accum::Replace, a_mat.as_ref().transpose(), y_mat.as_ref(), 1.0 / noise_variance, Par::Seq); let rhs = /* prior_info_vec + at_y (D×1 information vector) */; let mu_post = llt_post.solve(rhs.as_ref()); // reuse llt_post Two independent Cholesky factorizations — llt_prior (from self.cov) and llt_post (posterior 
precision) — each amortised across multiple solves. llt_prior yields both $\\Sigma_\\text{prior}^{-1}$ and $\\Sigma_\\text{prior}^{-1}\\boldsymbol{\\mu}_\\text{prior}$. llt_post yields both $\\Sigma_\\text{post}$ and $\\boldsymbol{\\mu}_\\text{post}$. All Cholesky work is $D \\times D$; the dataset size $N$ contributes only the $O(ND^2)$ accumulation of $A^T A$ and $A^T \\mathbf{y}$. As $N$ grows, the factorization and solve cost stays $O(D^3)$, so the total per-update cost is $O(ND^2 + D^3)$: cubic in the number of features but only linear in the amount of data.\nBoth forms pass the same analytic unit test (test_condition_analytic) against a closed-form 2D Gaussian update, and a dedicated parity test (test_condition_parity) confirms agreement within $10^{-10}$ on all matrix entries for inputs where neither form is strongly favoured by the dispatch rule.\nPart II — Automatic Relevance Determination via the EM Algorithm #With the linear algebra machinery in place, we can now build the complete BLR+ARD fitting loop. Recall from Part 1 that the model has two levels of unknowns:\nLevel 1 (weights $\\mathbf{w}$): Given $\\alpha$ and $\\beta$, the posterior is exact — computed in one shot via the Cholesky formulas above. Level 2 (hyperparameters $\\alpha_j$, $\\beta$): These govern the prior and noise model. We learn them by maximizing the marginal likelihood (evidence) via Expectation-Maximization. 
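To make the two levels concrete before the full implementation, here is a minimal sketch of the alternation for a model with a single feature, where every matrix collapses to a scalar. It uses only the standard library; the function name `fit_scalar` and the fixed iteration count are assumptions made for illustration and are not part of blr-core. The update rules are the same ones walked through in Sections 2.3 to 2.5.

```rust
/// One-feature BLR+ARD fit: y ≈ w · phi, prior w ~ N(0, 1/alpha),
/// noise precision beta. Returns (mu, alpha, beta) after `iters` rounds.
/// Illustrative only: `fit_scalar` is a hypothetical helper, not a blr-core API.
fn fit_scalar(phi: &[f64], y: &[f64], iters: usize) -> (f64, f64, f64) {
    let n = phi.len() as f64;
    // Sufficient statistics, computed once before the loop.
    let s_pp: f64 = phi.iter().map(|p| p * p).sum();
    let s_py: f64 = phi.iter().zip(y).map(|(p, t)| p * t).sum();

    let (mut alpha, mut beta) = (1.0_f64, 1.0_f64);
    let mut mu = 0.0_f64;
    for _ in 0..iters {
        // Level 1: exact posterior over the weight for fixed alpha, beta.
        let sigma = 1.0 / (alpha + beta * s_pp); // posterior variance
        mu = beta * sigma * s_py; // posterior mean

        // Level 2: fixed-point updates of the hyperparameters.
        let gamma = 1.0 - alpha * sigma;
        let rss: f64 = phi.iter().zip(y).map(|(p, t)| (t - mu * p).powi(2)).sum();
        alpha = (gamma / (mu * mu + 1e-10)).max(1e-8);
        beta = ((n - gamma) / (rss + 1e-10)).max(1e-8);
    }
    (mu, alpha, beta)
}
```

Calling `fit_scalar(&[1.0, 2.0, 3.0, 4.0, 5.0], &[2.1, 3.9, 6.05, 7.95, 10.0], 50)` settles on a posterior mean close to the least-squares slope of about 2. The real implementation replaces the scalar division with a Cholesky solve and adds a convergence check instead of a fixed iteration count.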
The EM algorithm alternates: fix hyperparameters → compute exact posterior (E-step) → update hyperparameters to maximize evidence (M-step) → repeat.\n2.1 Configuration and Initialisation #The fitting loop is configured through ArdConfig:\n// from blr-core/src/ard.rs #[derive(Debug, Clone)] pub struct ArdConfig { pub alpha_init: f64, // Initial ARD precision (same for all features) pub beta_init: f64, // Initial noise precision pub max_iter: usize, // Maximum EM iterations pub tol: f64, // Convergence tolerance (period-2 log-evidence delta) pub update_beta: bool, // Whether to update β in the M-step } impl Default for ArdConfig { fn default() -\u0026gt; Self { Self { alpha_init: 1.0, beta_init: 1.0, max_iter: 100, tol: 1e-5, update_beta: true } } } The defaults match the Python reference implementation. Starting with alpha_init = 1.0 means the prior on each weight is $\\mathcal{N}(0, 1)$ — a mild regularisation that neither suppresses nor encourages any feature. The EM loop will sort out which features deserve to survive.\n2.2 Pre-computing the Sufficient Statistics #Before entering the EM loop, fit() pre-computes two quantities that appear in every iteration:\n// from blr-core/src/ard.rs — fit() // Pre-compute Φᵀ Φ (D×D) and Φᵀ y (D×1) — reused every iteration. let mut phi_t_phi = Mat::\u0026lt;f64\u0026gt;::zeros(d, d); matmul::matmul( phi_t_phi.as_mut(), Accum::Replace, phi_mat.as_ref().transpose(), phi_mat.as_ref(), 1.0_f64, Par::Seq, ); let mut phi_t_y = Mat::\u0026lt;f64\u0026gt;::zeros(d, 1); matmul::matmul( phi_t_y.as_mut(), Accum::Replace, phi_mat.as_ref().transpose(), y_mat.as_ref(), 1.0_f64, Par::Seq, ); The matrix $\\Phi^T\\Phi$ is the Gram matrix in feature space — a $D \\times D$ summary of how the features covary across all training points. The vector $\\Phi^T \\mathbf{y}$ is the sufficient statistic for the regression problem — a $D$-dimensional vector that captures everything the raw data knows about the weights.\nThis pre-computation matters. 
Each E-step builds the posterior precision as $\\Lambda + \\beta\\Phi^T\\Phi$. If you recomputed $\\Phi^T\\Phi$ inside the loop, you\u0026rsquo;d pay $O(ND^2)$ per iteration. With pre-computation, you pay it once; each iteration then costs $O(D^2)$ to scale by $\\beta$ and add the diagonal $\\Lambda$, plus $O(D^3)$ to factor the result. For $N = 1000$ training points, $D = 10$ features, and 50 EM iterations, the $\\Phi^T\\Phi$ product is formed once instead of 50 times: about $10^5$ multiply-adds instead of $5 \\times 10^6$ for that part of the work.\n2.3 The E-Step: An Exact Posterior in Three Lines #Each iteration begins by computing the exact posterior distribution over weights, given the current hyperparameters:\n// from blr-core/src/ard.rs — fit(), inside the for loop // ── E-step ──────────────────────────────────────────────────────────────── // σ_inv = diag(α) + β Φᵀ Φ (posterior precision matrix, D×D) let mut sigma_inv = Mat::\u0026lt;f64\u0026gt;::from_fn(d, d, |i, j| beta * phi_t_phi[(i, j)]); for j in 0..d { sigma_inv[(j, j)] += alpha[j]; // add per-feature prior precision } // Cholesky factor: L Lᵀ = σ_inv let llt = sigma_inv .llt(Side::Lower) .map_err(|_| BLRError::SingularMatrix)?; // Σ = σ_inv⁻¹ (solve σ_inv · X = I, i.e. X = σ_inv⁻¹) let eye = Mat::\u0026lt;f64\u0026gt;::identity(d, d); sigma_mat = llt.solve(eye.as_ref()); // μ = β Σ Φᵀ y (equivalently, solve σ_inv · μ = β Φᵀ y) let mut rhs = phi_t_y.clone(); for i in 0..d { rhs[(i, 0)] *= beta; // rhs = β Φᵀ y } let mu_mat = llt.solve(rhs.as_ref()); for i in 0..d { mu_vec[i] = mu_mat[(i, 0)]; } Notice that the same Cholesky factor llt is used twice: once to compute the posterior covariance ($\\Sigma = (\\Lambda + \\beta\\Phi^T\\Phi)^{-1}$) and once to compute the posterior mean ($\\boldsymbol{\\mu} = \\beta\\Sigma\\Phi^T\\mathbf{y}$). The $O(D^3)$ factorisation is paid only once per iteration; the identity solve then costs $O(D^3)$ ($D$ right-hand sides) and the mean solve costs $O(D^2)$.\nThe solve for $\\Sigma$ is written as llt.solve(I) — solving $AX = I$ — which is exactly inverting $A$. 
This is one of the rare cases where we genuinely need the full matrix, not just its action on a specific vector: the M-step\u0026rsquo;s $\\gamma_j$ computation requires the diagonal entries $\\Sigma_{jj}$, and the prediction code needs $\\Sigma$ to compute epistemic uncertainty at arbitrary test points.\nIf you only needed to compute uncertainty at the training points themselves, you could avoid materialising $\\Sigma$ by solving one system per training point. But since we need the full posterior for downstream prediction, forming $\\Sigma$ is justified here.\n2.4 The γ Parameter: The Heart of ARD #After the E-step, we have $\\boldsymbol{\\mu}$ and $\\Sigma$. Before diving into the M-step update formulas, we need to meet the most interpretable quantity in the entire algorithm: $\\gamma_j$.\n$$\\gamma_j = 1 - \\alpha_j \\Sigma_{jj}$$This deceptively simple formula is the key to understanding what ARD is actually doing. To see why, consider two limiting cases:\nCase 1: The prior dominates. Suppose $\\alpha_j$ is enormous — say, $\\alpha_j = 10^6$. The prior says \u0026ldquo;weight $j$ should be essentially zero,\u0026rdquo; and unless the data is extraordinarily strong, the posterior obeys. When the prior is very tight, the posterior variance $\\Sigma_{jj}$ is close to the prior variance $\\alpha_j^{-1}$, so $\\alpha_j \\Sigma_{jj} \\approx 1$, and $\\gamma_j \\approx 0$. This feature is effectively switched off.\nCase 2: The data dominates. Suppose $\\alpha_j$ is small, and the data strongly constrains weight $j$. The posterior is much tighter than the prior, so $\\Sigma_{jj} \\ll \\alpha_j^{-1}$, meaning $\\alpha_j \\Sigma_{jj} \\ll 1$, and $\\gamma_j \\approx 1$. The data is fully determining this weight.\nSo $\\gamma_j \\in [0, 1]$ measures how much of the information about weight $j$ comes from the data. 
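The two limiting cases are easy to check numerically in the one-weight setting, where the posterior variance is simply 1 / (α + β·S) with S the sum of squared feature values. The sketch below is a hypothetical helper (`gamma_scalar` is not a blr-core function) and assumes that scalar model; it only illustrates how γ moves between 0 and 1 as α changes.

```rust
/// gamma = 1 - alpha * Sigma for a single weight, where
/// Sigma = 1 / (alpha + beta * s_pp) is the scalar posterior variance
/// and s_pp is the sum of squared feature values.
/// Hypothetical helper for illustration; not part of blr-core.
fn gamma_scalar(alpha: f64, beta: f64, s_pp: f64) -> f64 {
    let sigma = 1.0 / (alpha + beta * s_pp);
    1.0 - alpha * sigma
}
```

With β = 1 and S = 55, a tight prior (α = 1e6) gives γ of roughly 5.5e-5, while a loose prior (α = 1e-3) gives γ of roughly 0.99998, matching Case 1 and Case 2 respectively.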
A $\\gamma_j$ near 1 means \u0026ldquo;the data knows something about this feature.\u0026rdquo; A $\\gamma_j$ near 0 means \u0026ldquo;the prior is calling the shots — the feature is irrelevant.\u0026rdquo;\nThe sum $\\gamma = \\sum_j \\gamma_j$ has an equally beautiful interpretation: it is the effective number of parameters being estimated from the data. In classical statistics, fitting $D$ parameters from $N$ data points \u0026ldquo;uses up\u0026rdquo; $D$ degrees of freedom. With ARD, features that are being suppressed contribute nearly zero to $\\gamma$, so the effective complexity of the model is automatically lower than the raw feature count. The algorithm self-selects its own complexity.\nIn code, this computation is a single line:\n// γ_j = 1 − α_j Σ_jj (effective parameters per feature) let gamma: Vec\u0026lt;f64\u0026gt; = (0..d).map(|j| 1.0 - alpha[j] * sigma_mat[(j, j)]).collect(); 2.5 The M-Step: Learning from Your Own Uncertainty #Given $\\boldsymbol{\\mu}$, $\\Sigma$, and $\\gamma$, the M-step updates the hyperparameters. 
These are the MacKay fixed-point rules derived in Part 1:\n$$\\alpha_j^{\\text{new}} = \\frac{\\gamma_j}{\\mu_j^2} \\qquad\\qquad \\beta^{\\text{new}} = \\frac{N - \\gamma}{\\|\\mathbf{y} - \\Phi\\boldsymbol{\\mu}\\|^2}$$Let\u0026rsquo;s read the code and the formulas together:\n// from blr-core/src/ard.rs — fit(), M-step // ── Residuals ───────────────────────────────────────────────────────────── // Φ μ (N×1): predicted values using posterior mean weights let mut phi_mu = Mat::\u0026lt;f64\u0026gt;::zeros(n, 1); let mu_mat_ref = Mat::\u0026lt;f64\u0026gt;::from_fn(d, 1, |i, _| mu_vec[i]); matmul::matmul(phi_mu.as_mut(), Accum::Replace, phi_mat.as_ref(), mu_mat_ref.as_ref(), 1.0_f64, Par::Seq); // ||r||² = ||y - Φ μ||² (sum of squared residuals) let residual_sq: f64 = (0..n).map(|i| { let r = y[i] - phi_mu[(i, 0)]; r * r }).sum(); // ── M-step ──────────────────────────────────────────────────────────────── // α_j = γ_j / (μ_j² + ε), clamped to ≥ 1e-8 for j in 0..d { alpha[j] = (gamma[j] / (mu_vec[j] * mu_vec[j] + 1e-10)).max(1e-8); } // β = (N − Σγ_j) / (||r||² + ε), clamped to ≥ 1e-8 if config.update_beta { let gamma_sum: f64 = gamma.iter().sum(); beta = ((n as f64 - gamma_sum) / (residual_sq + 1e-10)).max(1e-8); } The $\\epsilon = 10^{-10}$ additive term in the denominators prevents division by zero. It\u0026rsquo;s not a regularisation hack — it\u0026rsquo;s a numerical guard for the single case where ARD is doing its job perfectly: when $\\mu_j \\to 0$ as a feature is being pruned, the update formula $\\alpha_j = \\gamma_j / \\mu_j^2$ would try to send $\\alpha_j \\to \\infty$ (completely suppressing the feature). With the $\\epsilon$ term the update is capped at $\\gamma_j / 10^{-10} \\leq 10^{10}$, a very large but finite precision, so no floating-point infinity propagates into the Cholesky. The max(1e-8) clamp guards the opposite end: it is a lower bound that keeps every $\\alpha_j$ and $\\beta$ strictly positive, so the posterior precision stays positive-definite and the logarithms in the log-evidence stay finite.\nThe $\\alpha_j$ update has a beautiful fixed-point interpretation. At convergence, the update sets $\\alpha_j^{\\text{new}} = \\alpha_j$ (the value does not change). 
Substituting back:\n$$\\alpha_j = \\frac{\\gamma_j}{\\mu_j^2} = \\frac{1 - \\alpha_j \\Sigma_{jj}}{\\mu_j^2}$$This says: at the optimum, the precision is set so that the \u0026ldquo;residual information\u0026rdquo; in the prior ($1 - \\alpha_j\\Sigma_{jj}$) exactly equals $\\alpha_j \\mu_j^2$. Features with large posterior mean carry information proportional to their effect size; features with small posterior mean get their prior tightened until they are suppressed.\nThe $\\beta$ update is equally satisfying. In classical statistics, the unbiased variance estimator is $\\hat{\\sigma}^2 = \\text{RSS} / (N - D)$, where we subtract $D$ degrees of freedom for the $D$ estimated parameters. Here, the formula is $\\hat{\\sigma}^2 = \\text{RSS} / (N - \\gamma)$. The effective degrees of freedom is $\\gamma$, not $D$ — and since suppressed features contribute $\\gamma_j \\approx 0$, the noise estimate is automatically corrected for model sparsity. Prune 4 out of 6 features, and $\\gamma \\approx 2$ instead of 6; the noise estimate becomes appropriately less conservative.\n2.6 The Log Evidence: Measuring Quality and Checking Convergence #Each iteration computes the log marginal likelihood (evidence):\n$$\\mathcal{L} = \\frac{1}{2}\\left(\\sum_j \\log\\alpha_j + N\\log\\beta - \\log|\\Sigma_\\text{inv}| - \\beta\\|\\mathbf{r}\\|^2 - \\boldsymbol{\\mu}^T\\Lambda\\boldsymbol{\\mu} + D\\log(2\\pi)\\right) - \\frac{N}{2}\\log(2\\pi)$$This is the probability of observing the data $\\mathbf{y}$, marginalised over all possible weight vectors $\\mathbf{w}$, under the current hyperparameters $\\alpha$ and $\\beta$. 
In code:\n// from blr-core/src/ard.rs — log_evidence() fn log_evidence( n: usize, d: usize, alpha: \u0026amp;[f64], beta: f64, mu: \u0026amp;[f64], logdet_sigma_inv: f64, residual_sq: f64, ) -\u0026gt; f64 { let log_alpha_sum: f64 = alpha.iter().map(|a| a.ln()).sum(); let mu_lambda_mu: f64 = alpha.iter().zip(mu.iter()).map(|(a, m)| a * m * m).sum(); 0.5 * ( log_alpha_sum + (n as f64) * beta.ln() - logdet_sigma_inv // from cholesky_logdet() - beta * residual_sq - mu_lambda_mu + (d as f64) * (2.0 * PI).ln() ) - 0.5 * (n as f64) * (2.0 * PI).ln() } The logdet_sigma_inv term is the Cholesky log-determinant of the posterior precision matrix $\\Lambda + \\beta\\Phi^T\\Phi$, computed using cholesky_logdet() in the same iteration at the cost of one more small $D \\times D$ factorization. The $D\\log(2\\pi)$ term is constant in $\\alpha$ and $\\beta$, so it only offsets the reported value and has no effect on the optimisation or the convergence check; textbook statements of the evidence omit it, because the $(2\\pi)^{D/2}$ factors from the prior normalisation and from the Gaussian integral over $\\mathbf{w}$ cancel.\nA textbook EM update is guaranteed never to decrease the evidence. The MacKay fixed-point rules used in the M-step typically converge faster than strict EM, but they do not carry that guarantee: the evidence normally rises steadily and may wobble by small amounts near convergence. A large or sustained decrease still means something is wrong: a bug in the M-step, a numerical issue, or an incorrect formula.\nConvergence criterion. Rather than checking a single-step delta (which can be noisy near convergence), the implementation uses a period-2 check:\n// Period-2 convergence: compare smoothed log-evidence over pairs of iterations let n_ev = log_evidences.len(); let delta = if n_ev \u0026gt;= 4 { let mean_curr = 0.5 * (log_evidences[n_ev - 1] + log_evidences[n_ev - 2]); let mean_prev = 0.5 * (log_evidences[n_ev - 3] + log_evidences[n_ev - 4]); (mean_curr - mean_prev).abs() } else if n_ev \u0026gt;= 2 { (log_evidences[n_ev - 1] - log_evidences[n_ev - 2]).abs() } else { f64::INFINITY }; if delta \u0026lt; config.tol { break; } The period-2 smoothing averages two consecutive iterations before comparing to the previous pair. 
Near convergence, the $\\alpha_j$ updates can oscillate slightly — the prior on a borderline-relevant feature might bounce between \u0026ldquo;slightly relevant\u0026rdquo; and \u0026ldquo;slightly irrelevant\u0026rdquo; before settling. Averaging pairs of iterations smooths this oscillation and prevents premature termination.\nPart III — Putting It All Together: The Hall Sensor Calibration #Let us trace through a complete calibration example. We have 60 real measurements from a Hall effect position sensor (loaded from data/hall_sensor_calibration.csv): a sensor voltage $y_i$ at each of 60 known displacement positions $x_i$.\n3.1 Building the Design Matrix #The first decision is feature engineering. For a Hall sensor, physical reasoning suggests:\nColumn Feature $\\phi_j(x)$ Physical Hypothesis 0 $1$ Constant offset (always present) 1 $x$ Linear Hall response (primary) 2 $x^2$ Quadratic field non-uniformity 3 $x^3$ Cubic non-linearity 4 $\\tanh(x/0.8)$ Hard magnetic saturation (tight knee) 5 $\\tanh(x/1.5)$ Gradual saturation rolloff (wide knee) We are not claiming that all six features are relevant. We are providing ARD with a vocabulary of physical hypotheses and letting it select. 
This is the correct way to think about feature engineering in a Bayesian model: you design a rich basis that spans the space of plausible physics, then let the data decide which elements matter.\n// from blr-core/examples/hall_sensor.rs let (poly_mat, _) = features::polynomial(\u0026amp;x_vals, 3); // [1, x, x², x³] let mut phi = vec![0.0f64; n * 6]; for i in 0..n { phi[i * 6 + 0] = poly_mat[i * 4 + 0]; // 1 phi[i * 6 + 1] = poly_mat[i * 4 + 1]; // x phi[i * 6 + 2] = poly_mat[i * 4 + 2]; // x² phi[i * 6 + 3] = poly_mat[i * 4 + 3]; // x³ phi[i * 6 + 4] = (x_vals[i] / 0.8).tanh(); // tanh(x/0.8) phi[i * 6 + 5] = (x_vals[i] / 1.5).tanh(); // tanh(x/1.5) } The features::polynomial helper from blr_core::features returns an $N \\times (D+1)$ matrix in row-major order — each row $i$ contains $[1, x_i, x_i^2, \\ldots, x_i^D]$. We then augment it with the two tanh columns computed directly. The final design matrix phi is $60 \\times 6$ in row-major layout, which is exactly what fit() expects.\n3.2 Fitting the Model #// from blr-core/examples/hall_sensor.rs let config = ArdConfig { max_iter: 500, tol: 1e-7, ..ArdConfig::default() }; let fitted = fit(\u0026amp;phi, \u0026amp;y_vals, n, 6, \u0026amp;config).expect(\u0026#34;BLR+ARD fit failed\u0026#34;); We tighten the tolerance slightly compared to the default (1e-7 vs 1e-5) and allow more iterations, because Hall sensor data is clean and the algorithm converges reliably. 
The fit() call returns a FittedArd struct containing the posterior distribution, the learned $\\alpha$ and $\\beta$ values, and the log-evidence trajectory.\n3.3 Reading the Results #// from blr-core/examples/hall_sensor.rs println!(\u0026#34;Noise std (learned): {:.6}\u0026#34;, fitted.noise_std()); println!(\u0026#34;Log marginal likelihood: {:.6}\u0026#34;, fitted.log_marginal_likelihood()); let feature_names = [\u0026#34;1 (bias)\u0026#34;, \u0026#34;x\u0026#34;, \u0026#34;x²\u0026#34;, \u0026#34;x³\u0026#34;, \u0026#34;tanh(x/0.8)\u0026#34;, \u0026#34;tanh(x/1.5)\u0026#34;]; // Posterior mean weights for (name, \u0026amp;mu_j) in feature_names.iter().zip(fitted.posterior.mean.iter()) { println!(\u0026#34; {:\u0026lt;18} {:+.6}\u0026#34;, name, mu_j); } // ARD relevance scores let rel = fitted.relevance(); // = 1/α_j for each feature for (name, r) in feature_names.iter().zip(rel.iter()) { println!(\u0026#34; {:\u0026lt;18} {:.3e}\u0026#34;, name, r); } // Active feature mask let active = fitted.relevant_features(None); // threshold = geometric mean of α The relevance() method returns $1/\\alpha_j$ for each feature — a direct measure of how loosely the prior constrains each weight. 
Large relevance means the data is using that feature; small relevance means the prior has essentially pinned the weight to zero.\nThe relevant_features() method computes a boolean mask using the geometric mean of the $\\alpha$ values as the threshold: features with $\\alpha_j$ below the geometric mean are \u0026ldquo;relevant.\u0026rdquo; This is a sensible default that adapts to the scale of the problem, though you can pass an explicit threshold when you have domain-specific cutoffs.\n3.4 What the Numbers Tell You #Running cargo run --example hall_sensor from the repository root produces output like:\n=== Hall Sensor BLR+ARD Results === EM iterations: 154 Noise std (learned): 0.102875 Log marginal likelihood: 50.484934 Posterior mean weights: 1 (bias) +0.000001 x -0.000040 x² +0.000028 x³ -0.000017 tanh(x/0.8) +0.598037 tanh(x/1.5) +2.011467 ARD relevance (1/α — larger = more relevant): 1 (bias) 1.328e-7 x 3.917e-6 x² 1.093e-7 x³ 9.162e-8 tanh(x/0.8) 3.707e-1 tanh(x/1.5) 4.065e0 Active features (α \u0026lt; geometric-mean threshold): ✓ tanh(x/0.8) ✓ tanh(x/1.5) In-sample RMSE: 0.101169 Mean total std: 0.104551 The numbers tell a revealing physical story — but not the one a naive linear model would suggest. The algorithm converged in 154 iterations (more than the 23 needed for synthetic data), learned a noise level of ~103 mV, and identified exactly two relevant features: both nonlinear tanh functions that model magnetic saturation. The posterior weights for the polynomial terms — the bias, linear, quadratic, and cubic — are all vanishingly small (posterior means at the $10^{-5}$ level or below).\nThis is not a numerical accident. The relevance ratio between the saturation features and the linear term is roughly $10^5$ for tanh(x/0.8) and $10^6$ for tanh(x/1.5). The algorithm is saying, with extraordinary clarity: \u0026ldquo;this real Hall sensor does not exhibit linear behavior. 
Its response saturates near both extremes — tight knee saturation modeled by tanh(x/0.8) and gradual rolloff by tanh(x/1.5). Linear approximations are simply wrong.\u0026rdquo;\nNotice the posterior mean weights for the saturation features: tanh(x/0.8) has coefficient ~0.60 and tanh(x/1.5) has coefficient ~2.01. These two weighted tanh functions combine to approximate the true sensor\u0026rsquo;s characteristic curve. The fact that they have different widths (0.8 vs 1.5) means they capture the asymmetry in how the device saturates — a subtle but crucial physical detail.\nThe in-sample RMSE of 0.101 V matches the learned noise std of 0.103 V almost exactly — evidence that the model has found the ground truth noise floor and no residual bias remains. The algorithm has correctly identified that remaining prediction error comes from measurement noise, not from model misspecification.\n3.5 Making Predictions #let preds = fitted.predict(\u0026amp;phi, n, 6); // preds.mean[i] = E[y_i] = φ(x_i)ᵀ μ // preds.aleatoric_std = 1/√β (noise; same for all points) // preds.epistemic_std[i] = √(φ(x_i)ᵀ Σ φ(x_i)) (model uncertainty at point i) // preds.total_std[i] = √(aleatoric² + epistemic²) The prediction decomposes uncertainty into two additive components, as derived in Part 1:\n$$\\sigma^2_{*} = \\underbrace{\\beta^{-1}}_{\\text{aleatoric}} + \\underbrace{\\boldsymbol{\\phi}(x_*)^T \\Sigma \\boldsymbol{\\phi}(x_*)}_{\\text{epistemic}}$$The aleatoric component is the irreducible measurement noise — the sensor is genuinely noisy at the ~103 mV level (the learned noise std reported above), and no amount of additional calibration data will reduce this term. The epistemic component is the model\u0026rsquo;s uncertainty about the weights, which decreases as more calibration points are added.\nA practical use of this decomposition: if the total uncertainty at a new operating point is dominated by epistemic uncertainty, you should gather more calibration data near that point. 
If it is dominated by aleatoric noise, more data won\u0026rsquo;t help — you need a better sensor or a lower-noise measurement circuit. BLR+ARD tells you which situation you are in.\nPart IV — Common Pitfalls and Practical Wisdom #4.1 Feature Scaling #BLR+ARD is sensitive to the scale of the input features. If you have features measured in millimetres alongside features measured in volts, the $\\alpha_j$ values are not comparable — a \u0026ldquo;large\u0026rdquo; $\\alpha$ for a voltage feature might correspond to a tiny absolute regularisation in physical terms.\nThe best practice: normalise each input dimension to have zero mean and unit variance before computing the design matrix. The ARD $\\alpha$ values then have a uniform interpretation across all features.\n4.2 The Condition Number of $\\Phi^T\\Phi$ #The posterior precision matrix $\\Lambda + \\beta\\Phi^T\\Phi$ inherits the condition number of $\\Phi^T\\Phi$. If your design matrix contains nearly linearly dependent columns — for example, $x^2$ and $(x+\\epsilon)^2$ over a narrow input range — the Cholesky factorisation will fail with SingularMatrix.\nThe solution is to check for near-linear-dependence in your feature set and either remove redundant features or apply a small jitter to the diagonal:\n// Add a small jitter for numerical stability let jitter = 1e-9; for j in 0..d { sigma_inv[(j, j)] += jitter; } This is acceptable because a jitter of $10^{-9}$ is far below the precision of any physical sensor, so it has no practical effect on the posterior but prevents Cholesky failures.\n4.3 Interpreting \u0026ldquo;Irrelevant\u0026rdquo; Features #ARD suppression is a probabilistic statement, not a hard zero. A feature with $\\alpha_j = 10^5$ has a posterior weight distribution $\\mathcal{N}(0, 10^{-5})$ — the weight is extremely likely to be near zero, but it is not exactly zero. 
This matters when you need to predict at extrapolation points far outside the training range: the irrelevant features might have small but nonzero posterior means, contributing tiny but nonzero predictions.\nIf you need strict sparsity (exactly zero weights for irrelevant features), use the boolean mask from relevant_features() to zero out the corresponding columns of the design matrix and refit with the reduced feature set.\n4.4 Not Enough Iterations vs. Wrong Convergence #The period-2 convergence criterion can be fooled by two consecutive iterations that happen to have nearly the same log-evidence for the wrong reason — for example, if the algorithm is oscillating around a saddle point. If your results look suspicious (e.g., all $\\alpha_j$ equal to the initial value), increase max_iter and check whether the log-evidence trajectory is monotone increasing.\nA monotone-increasing evidence trajectory is a sanity check you can perform for free: if it ever decreases by more than floating-point noise, there is a bug in the M-step.\nConclusion: Three Levels of Understanding #This two-part series has developed the same algorithm at three levels:\nThe statistical level (Part 1): Bayesian updating as quadratic form manipulation; ARD as empirical Bayes over per-feature precision hyperparameters; the self-policing feedback loop between γ, α, and β.\nThe linear algebra level (Part 2, sections 1–1.5): the SPD structure of the posterior precision as a Cholesky-safe guarantee; the Woodbury identity connecting the observation-space ($N \\times N$ Gram form) and parameter-space ($D \\times D$ precision form) posterior updates, with adaptive dispatch to the cheaper form for any problem size; log-determinant via Cholesky diagonal as the trick that keeps log-probabilities finite.\nThe implementation level (Part 2, sections 2–3): pre-computing sufficient statistics to amortise $O(ND^2)$ work; reusing a single Cholesky factorisation for both $\\Sigma$ and $\\boldsymbol{\\mu}$; the 
period-2 convergence criterion to handle near-oscillation; the ε-clamps that keep finite arithmetic finite.\nEach level is necessary. The statistics tells you what to compute. The linear algebra tells you how to compute it without numerical disasters. The implementation tells you when to cache, reuse, and guard.\nThe result is a model that:\nFits in ~23 iterations on 25 data points Learns a noise estimate of 8 mV with no manual tuning Correctly identifies 2 out of 6 features as relevant — matching the physics Provides calibrated uncertainty bands that distinguish noise from model uncertainty Runs on a CPU inside a WebAssembly component, with no GPU, no PyTorch, no dependencies beyond faer = \u0026quot;0.24\u0026quot; That is what \u0026ldquo;principled machine learning\u0026rdquo; looks like in Rust.\nAppendices #Appendix A: The Condition Number κ (Kappa) #Section 1.1 mentions the condition number $\\kappa(A)$ — an abstract quantity that governs numerical error in matrix computations. This appendix unpacks what it means and why it matters.\nA.1 Definition and Intuition #The condition number $\\kappa(A)$ of a matrix $A$ is a scalar that measures how sensitive the matrix is to perturbations or rounding errors. Informally:\nSmall $\\kappa$ (close to 1): the matrix is well-conditioned. Tiny errors stay tiny. Large $\\kappa$ (e.g., $10^{10}$): the matrix is ill-conditioned. Tiny rounding errors can amplify into large errors in the computed result. Formally, for an invertible matrix $A$:\n$$\\kappa(A) = \\|A\\| \\cdot \\|A^{-1}\\|$$where $\\|\\cdot\\|$ denotes a matrix norm (e.g., the spectral norm, the largest singular value). For a symmetric positive-definite (SPD) matrix — which is what we have in BLR — there is an elegant simplification:\n$$\\kappa(A) = \\frac{\\lambda_{\\max}(A)}{\\lambda_{\\min}(A)}$$where $\\lambda_{\\max}$ and $\\lambda_{\\min}$ are the largest and smallest eigenvalues of $A$. 
In other words: the condition number is the ratio of the biggest eigenvalue to the smallest.\nIntuition: If all eigenvalues are similar in magnitude, the ratio is close to 1, and the matrix \u0026ldquo;stretches space uniformly.\u0026rdquo; If one eigenvalue is much larger than others, the matrix stretches along that direction much more than others — it creates a very \u0026ldquo;elongated\u0026rdquo; geometry, and numerical errors accumulate along the short-stretch directions.\nA.2 Why Condition Number Matters in Matrix Computations #When you solve a linear system $Ax = b$ or compute a matrix inverse $A^{-1}$, the presence of floating-point rounding errors ($\\epsilon_{\\text{machine}} \\sim 10^{-16}$ for 64-bit floats) means you do not get the exact answer. Instead, you get an answer $\\tilde{x}$ that satisfies $A\\tilde{x} = b + \\delta b$, where $\\delta b$ is a tiny perturbation introduced by rounding.\nThe key result from numerical linear algebra is:\n$$\\frac{\\|\\delta x\\|}{\\|x\\|} \\lesssim \\kappa(A) \\cdot \\epsilon_{\\text{machine}}$$In plain English: the relative error in your computed solution is proportional to the condition number.\nNow, the relationship between relative error and computational method is crucial:\nTask Method Error Dependence Solve $Ax = b$ via Cholesky Direct solve $\\mathcal{O}(\\kappa(A) \\cdot \\epsilon_{\\text{machine}})$ Solve $Ax = b$ by inverting $A^{-1}$ then multiplying Two-step $\\mathcal{O}(\\kappa(A)^2 \\cdot \\epsilon_{\\text{machine}})$ The error from explicitly inverting grows as $\\kappa(A)^2$, not $\\kappa(A)$. 
This is why section 1.1 emphasizes \u0026ldquo;never invert a matrix if you can avoid it\u0026rdquo; — you are paying a squared condition number penalty.\nA.3 Connection to Section 1.1 and Posterior Uncertainty #In section 1.1, we write:\nWorse, the error in $x$ scales with the square of $A$\u0026rsquo;s condition number $\\kappa(A)^2$ — because you first invert with error $\\sim \\kappa(A) \\cdot \\epsilon_\\text{machine}$, then multiply, amplifying that error again. By the time you examine the diagonal of $\\Sigma_\\text{post}$ to extract uncertainty estimates, your $\\pm 3\\sigma$ confidence bands might be meaningless.\nThis is not mere hyperbole. Consider what happens:\nYou compute $\\Sigma_\\text{post} = (\\Lambda + \\beta\\Phi^T\\Phi)^{-1}$ by explicit inversion. The error grows as $\\kappa(A)^2 \\cdot \\epsilon_{\\text{machine}}$. You then extract the diagonal entries $\\Sigma_{\\text{post}, jj}$ — these are your uncertainty estimates for each weight. Those diagonal entries are now corrupted by the amplified error. Your \u0026ldquo;3-sigma confidence band\u0026rdquo; might actually be a 2-sigma or 4-sigma band due to numerical corruption. Downstream, you use those uncertainty estimates to make decisions about calibration quality, data collection strategy, or safety margins. Bad uncertainty → bad decisions. By contrast, using the Cholesky solve (section 1.2) reduces the error to $\\mathcal{O}(\\kappa(A) \\cdot \\epsilon_{\\text{machine}})$ — a first-power dependence. For an ill-conditioned matrix with $\\kappa(A) = 10^8$, the difference is: $10^8 \\cdot 10^{-16} = 10^{-8}$ (direct solve) vs. $10^{16} \\cdot 10^{-16} = 1$ (explicit inversion). 
That is, explicit inversion can corrupt your answer by a factor of $10^8$.\nA.4 Why Posterior Precision Is Always SPD (and Why That Protects Us) #The posterior precision matrix in BLR is $\\Lambda + \\beta\\Phi^T\\Phi$, where:\n$\\Lambda = \\text{diag}(\\alpha_1, \\ldots, \\alpha_D)$ with all $\\alpha_j \u003e 0$ (strictly positive). $\\Phi^T\\Phi$ is a Gram matrix of the design matrix $\\Phi$. By section 1.2, we know this is always symmetric positive-definite (SPD). What does this buy us in terms of condition number?\nFor an SPD matrix, the Cholesky algorithm has a built-in numerical safety feature: it is backward-stable. The computed factor $\\tilde{L}$ satisfies $\\tilde{L}\\tilde{L}^T = A + \\delta A$ with $\\|\\delta A\\| / \\|A\\| = \\mathcal{O}(\\epsilon_{\\text{machine}})$, so the forward error of the subsequent solve scales as $\\mathcal{O}(\\kappa(A) \\cdot \\epsilon_{\\text{machine}})$, not $\\mathcal{O}(\\kappa(A)^2 \\cdot \\epsilon_{\\text{machine}})$.\nMore importantly: a reasonably conditioned SPD matrix will never cause Cholesky to fail unexpectedly. There are no \u0026ldquo;negative pivot\u0026rdquo; surprises. If the Cholesky algorithm reports SingularMatrix, it means the posterior precision is singular to working precision: a real modelling problem (e.g., linearly dependent or nearly dependent design matrix columns, or degenerate hyperparameter initialization), not a defect in the factorisation routine.\nA.5 Practical: When Should You Worry About Condition Number? #In the context of the Hall sensor calibration (section 3):\nDesign matrix features: If your features are polynomials $[1, x, x^2, x^3, \\ldots]$ evaluated over a narrow range (e.g., $x \\in [-0.1, 0.1]$), the Gram matrix $\\Phi^T\\Phi$ can become ill-conditioned because high-degree polynomials are nearly collinear.\nSolution: Use orthogonal polynomials (e.g., Chebyshev) or normalize input features to $[-1, 1]$. Hyperparameter scale: If the $\\alpha_j$ hyperparameters span many orders of magnitude (e.g., $\\alpha_1 = 1$, $\\alpha_6 = 10^8$), the diagonal of $\\Lambda$ has a large condition number. 
When you add $\\beta\\Phi^T\\Phi$ (with its own condition number), the result can be poorly conditioned.\nSolution: Monitor the condition number of $\\Lambda + \\beta\\Phi^T\\Phi$ in the EM loop. If it exceeds, say, $10^{15}$, consider re-scaling or feature selection. EM convergence: Occasionally, a borderline-relevant feature\u0026rsquo;s $\\alpha_j$ explodes to $10^{10}$, then the posterior precision becomes ill-conditioned, and convergence stalls.\nSolution: The clamping in section 2.5 (max(1e-8)) prevents $\\alpha_j$ from becoming infinite, keeping the condition number bounded. For typical sensor calibration (10–100 measurements, 3–10 features), the posterior precision is well-conditioned and Cholesky factorisation is rock-solid numerically.\nA.6 Further Reading # Trefethen \u0026amp; Bau (1997), Numerical Linear Algebra, Lectures 10–12: authoritative treatment of condition numbers and Cholesky stability. Golub \u0026amp; Van Loan (2013), Matrix Computations (4th ed.), Section 12.2: comprehensive analysis of condition number and backward error. Higham, N. J. (2002), Accuracy and Stability of Numerical Algorithms (2nd ed.): the definitive reference on floating-point error analysis. References # MacKay, D. J. (1992). \u0026ldquo;Bayesian nonlinear modeling for the prediction competition.\u0026rdquo; ASHRAE Transactions 98(1): 1052–1066. — Original ARD paper; derivation of α and β updates.\nTipping, M. E. (2001). \u0026ldquo;Sparse Bayesian learning and the relevance vector machine.\u0026rdquo; Journal of Machine Learning Research 1(Jun): 211–244. — Modern treatment; Equations (14)–(16) for the M-step.\nBishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. — Chapters 7 and 10 for ARD and EM; Appendix C for matrix identities.\nHennig, P. Probabilistic Machine Learning (course materials). Universität Tübingen. — Lecture 3 for the Kalman gain / Gram-form (observation-space) derivation; Lecture 4 for empirical Bayes. 
The Woodbury identity (Appendix A of Part 1) establishes exact algebraic equivalence with the parameter-space precision form used by condition_precision_form(). https://uni-tuebingen.de/\nMurphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. — Chapter 7 (empirical Bayes), Chapter 21 (practical BLR).\nTrefethen, L. N. \u0026amp; Bau, D. (1997). Numerical Linear Algebra. SIAM. — Lectures 10–12 for Cholesky stability analysis.\nGolub, G. H. \u0026amp; Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. — Algorithm 4.2.2 for Cholesky; Section 12.2 for condition number analysis.\nPetersen, K. B. \u0026amp; Pedersen, M. S. (2012). The Matrix Cookbook (v. 20121127). — Sections 4.1–4.2 for determinants and the matrix inversion lemma.\nfaer Documentation. https://crates.io/crates/faer — Rust linear algebra library used throughout blr-core.\nPart II — Continuing the Series #For now, the key takeaway is this: the mathematical formulas from part one translate directly into production code when you replace matrix inversions with solves and factor once to amortize across multiple operations. The same principles of stability and efficiency apply whether you\u0026rsquo;re running on a desktop CPU or inside a WASM component on an embedded device.\nFurther articles in this series will walk through complete application examples, demonstrating how these principles translate into production systems across different domains and deployment scenarios.\nBringing It Together #These two articles form a complete story:\nPart 1: The Mathematics — Why Bayesian Linear Regression with ARD is the right choice for sensor calibration, and how the math provides closed-form posterior solutions.\nPart 2: The Implementation (this post) — How to translate those formulas into efficient, numerically stable Rust code.\nIf you\u0026rsquo;re building a sensor calibration system and want to understand both the why and the how, start with part 1. 
If you\u0026rsquo;re already convinced of the approach and want to dive into the code patterns, start here and reference part 1 as needed.\nMore articles on related topics are coming — including active learning strategies, deployment as WASM components, and advanced topics in hyperparameter selection.\n","date":"15 May 2026","permalink":"https://wamli.github.io/blog/blr-implementation-rust-faer/","section":"Blog","summary":"The central theme is deceptively simple: never invert a matrix if you can avoid it. This post walks through the production Rust code that makes Bayesian Linear Regression with ARD efficient, numerically safe, and ready for embedded systems. Everything else follows from understanding why this principle matters.","title":"From Math to Silicon: Implementing BLR+ARD with Rust and faer"},{"content":"","date":null,"permalink":"https://wamli.github.io/blog/","section":"Blog","summary":"","title":"Blog"},{"content":"","date":null,"permalink":"https://wamli.github.io/docs/guides/","section":"Docs","summary":"","title":"Guides"},{"content":"Guides lead a user through a specific task they want to accomplish, often with a sequence of steps. Writing a good guide requires thinking about what your users are trying to do.\nFurther reading # Read about how-to guides in the Diátaxis framework ","date":"7 September 2023","permalink":"https://wamli.github.io/docs/guides/example/","section":"Docs","summary":"","title":"Example Guide"},{"content":"","date":null,"permalink":"https://wamli.github.io/docs/reference/","section":"Docs","summary":"","title":"Reference"},{"content":"Reference pages are ideal for outlining how things work in terse and clear terms. 
Less concerned with telling a story or addressing a specific use case, they should give a comprehensive outline of what you\u0026rsquo;re documenting.\nFurther reading # Read about reference in the Diátaxis framework ","date":"7 September 2023","permalink":"https://wamli.github.io/docs/reference/example/","section":"Docs","summary":"","title":"Example Reference"},{"content":"Link to valuable, relevant resources.\n","date":"27 February 2024","permalink":"https://wamli.github.io/docs/resources/","section":"Docs","summary":"","title":"Resources"},{"content":"","date":null,"permalink":"https://wamli.github.io/docs/","section":"Docs","summary":"","title":"Docs"},{"content":"","date":null,"permalink":"https://wamli.github.io/categories/bayesian-linear-regression/","section":"Categories","summary":"","title":"Bayesian Linear Regression"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/bayesian-inference/","section":"Tags","summary":"","title":"Bayesian-Inference"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/blr+ard/","section":"Tags","summary":"","title":"Blr+ard"},{"content":"","date":null,"permalink":"https://wamli.github.io/categories/","section":"Categories","summary":"","title":"Categories"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/faer/","section":"Tags","summary":"","title":"Faer"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/implementation/","section":"Tags","summary":"","title":"Implementation"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/machine-learning/","section":"Tags","summary":"","title":"Machine-Learning"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/mathematics/","section":"Tags","summary":"","title":"Mathematics"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/matrix-algebra/","section":"Tags","summary":"","title":"Matrix-Algebra"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/numerical
-computing/","section":"Tags","summary":"","title":"Numerical-Computing"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/rust/","section":"Tags","summary":"","title":"Rust"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/sensor-calibration/","section":"Tags","summary":"","title":"Sensor-Calibration"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/series/","section":"Tags","summary":"","title":"Series"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/","section":"Tags","summary":"","title":"Tags"},{"content":"","date":null,"permalink":"https://wamli.github.io/tags/uncertainty-quantification/","section":"Tags","summary":"","title":"Uncertainty-Quantification"},{"content":" WASM Machine Learning Inference\nMaintainer of blr-core crate ","date":null,"permalink":"https://wamli.github.io/","section":"WAMLI","summary":"","title":"WAMLI"},{"content":"","date":"7 September 2023","permalink":"https://wamli.github.io/privacy/","section":"WAMLI","summary":"","title":"Privacy Policy"},{"content":"About WAMLI #Coming soon\u0026hellip;\n","date":"1 January 0001","permalink":"https://wamli.github.io/about/","section":"WAMLI","summary":"","title":"About"}]