Perpendicular Distance Least-Squares Fitting for Arbitrary Lines
The mathematician does not study pure mathematics because it is useful; he studies it because he delights in it and he delights in it because it is beautiful.
Henri Poincaré
I’m not a mathematician, but I understand the mathematician’s delight. When terms cancel, substitutions simplify, and something previously hidden is exposed to the mind for the first time, the intellect is entertained.
I can still recall one of my earliest moments of “mathematician’s delight”. I was sitting in Mr. Pontifex’s grade 12 physics class as he derived the formulas for obtaining the best-fit line for a set of points. We had already used the equations often to find the answer our lab experiments were designed to produce, but the equations were simply prescribed to us, with directions on how to use them. To finally see how the equations were derived, with the help of the calculus we had just learned, was far more interesting to me than the physics we were using them for.
For reference, here are the equations that find the line, \(y = mx + b\), that “best” fits a set of \(n\) points \((x_i, y_i)\):
\begin{equation}
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
\end{equation}
\begin{equation}
b = \frac{\sum y_i - m \sum x_i}{n}
\end{equation}
The standard derivation for this can be found on MathWorld.
These equations, though mathematically correct, are often misapplied. Implicit in the equations is a definition of “best” that isn’t universal. The error being minimized is the sum of the squares of the vertical distances between the data points and the fitted line. This is often best when your data points correspond to experimental observations that measure \(y\) for some independent variable \(x\). But if both coordinates of your data points are independent, the classic formulation is not only inappropriate but also mathematically undefined when the best line is vertical, since such a line cannot be expressed as \(y = mx + b\).
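To make the degenerate case concrete, here is a small Python sketch of the classic formulation (the function name and structure are mine): it minimizes vertical offsets only, and its denominator vanishes exactly when all \(x_i\) are equal, i.e. when the best line is vertical.

```python
# A sketch of the classic vertical-offset least-squares fit (names mine).
def classic_fit(points):
    """Return (m, b) for y = m*x + b using the textbook formulas."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:  # all x_i equal: the best line is vertical
        raise ValueError("vertical line; y = mx + b is undefined")
    m = (n * sxy - sx * sy) / denom
    b = (sy - m * sx) / n
    return m, b

print(classic_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]))  # (2.0, 1.0)
# classic_fit([(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]) raises ValueError
```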
What is often the better solution is to find the line that minimizes the perpendicular distances between the data points and the line.
A solution that uses perpendicular distances can also be found on MathWorld. In that solution, the author notes the derivation takes “a fair bit of algebra” and produces an “unwieldy form of the best-fit parameters”. Furthermore, it still uses \(y = mx + b\) as its basis, so it fails to address the case when the best line is vertical.
While developing an early version of Music Lens I was guilty of misapplying the traditional least-squares line-fitting equations. Music Lens is a program that employs computer vision algorithms to “read” sheet music. Music notation is dominated by a lot of lines (staff lines, barlines, stems, beams, etc.) so line-finding is naturally a major component. One of the intermediate results inherent in my line-finding algorithm was knowing whether the line was more horizontal or vertical. This allowed me to swap the \(x\) and \(y\) variables when the line was “more vertical”, thus avoiding the degenerate case of a vertical line. With this hack, the classic least-squares fitting algorithm worked in practice, but I recognized it was inelegant and mathematically questionable. More than two decades after implementing this hack, I finally found a better solution in the chapter “The Best Least-Squares Line Fit” from the book Graphics Gems V (GGV).
The solution in GGV uses perpendicular distances and works for lines at any orientation.
There were several factors that compelled me to write this up:

It’s April, and April is Math Awareness Month^{✻}. Seems like a good time to share one of my favourite derivations.

This solution is not well known and deserves broader dissemination. The only description I was able to find was in GGV, and this book, first published in 1995, is not easy to find in local bookstores and is expensive on Amazon (or here for my fellow Canadians).

My derivation differs slightly from the one in GGV because I use a different line parameterization. Both GGV and my derivation use a \(\rho\)-\(\theta\) parameterization, but our definitions of \(\theta\) differ. Those familiar with the Hough Transform will find the same line parameterization used below.

My derivation is not quite complete, and I’m hoping someone reading can fill in the loose ends.

Did I mention it’s Math Awareness Month?
Without further ado, here is the derivation for the best least-squares line fit using perpendicular distances. All credit for this goes to GGV; the derivation below is based entirely on that work. The only difference is the slightly different line parameterization used. I also show more steps in the derivation, as I found some of the steps in GGV large and non-obvious; it took me many hours to figure out some of the intermediate steps.
In case any readers have access to GGV, I want to point out a newsgroup posting that claims there are errors in the GGV derivation and source code.
Derivation
The vital aspect of this derivation is the relatively uncommon \(\rho\)-\(\theta\) parameterization of a line, also known as the Hessian normal form. The equation of a line is given as:
\begin{equation}
\label{eq:line}
x \cos \theta + y \sin \theta = \rho
\end{equation}
The geometric interpretations of \(\rho\) and \(\theta\) are shown in Figure 3.
The distance between a point \(\mathbf{p} = (x, y)\) and the line \(L_1\) with \((\rho_1, \theta_1)\) can be found by constructing a new line \(L_2\) that is parallel to \(L_1\) and passes through \(\mathbf{p}\). Because \(L_2\) is parallel to \(L_1\), it follows that \(\theta_2 = \theta_1\), and \(\rho_2 = x \cos \theta_1 + y \sin \theta_1\). The distance between \(\mathbf{p}\) and \(L_1\) is then just the difference between \(\rho_1\) and \(\rho_2\):
\begin{equation}
D(\mathbf{p}, L_1) = \rho_2 - \rho_1 = x \cos \theta_1 + y \sin \theta_1 - \rho_1
\end{equation}
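The distance formula translates directly into code. A minimal Python sketch (names are mine), with the line given in the Hessian normal form \(x \cos \theta + y \sin \theta = \rho\):

```python
import math

# Signed perpendicular distance from a point to the line (rho, theta).
def distance(point, rho, theta):
    x, y = point
    return x * math.cos(theta) + y * math.sin(theta) - rho

# The x-axis has theta = pi/2, rho = 0; the point (3, 2) lies 2 above it.
print(distance((3.0, 2.0), 0.0, math.pi / 2))  # 2.0 (up to floating-point error)
```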
To find the best-fit line to a set of \(n\) points \(\mathbf{p}_i = (x_i, y_i)\) we need to find the \((\rho, \theta)\) that minimizes the sum of \(D(\mathbf{p}_i, L)^2\):
\begin{equation}
\label{eq:error}
E = \sum_{i=1}^{n} D(\mathbf{p}_i, L)^2 = \sum_{i=1}^{n} \left( x_i \cos \theta + y_i \sin \theta - \rho \right)^2
\end{equation}
To minimize equation (\ref{eq:error}), the partial derivatives of \(E\) must be zero:
Let’s examine \(\partial E/\partial \rho\) first:
\begin{equation}
\frac{\partial E}{\partial \rho} = -2 \sum_{i=1}^{n} \left( x_i \cos \theta + y_i \sin \theta - \rho \right) = 0
\end{equation}
Dividing through by \(-2n\), and writing \(\overline{x}\) and \(\overline{y}\) for the means of the \(x_i\) and \(y_i\), gives:
\begin{equation}
\label{eq:centroid}
\overline{x} \cos \theta + \overline{y} \sin \theta = \rho
\end{equation}
Since (\ref{eq:centroid}) is in the same form as (\ref{eq:line}), the point \((\overline{x}, \overline{y})\), which is the centroid of the points, lies on the best-fit line. If we translate all the points by \((\overline{x}, \overline{y})\) and find the best-fit line to the translated points, it’s intuitively clear from the geometry that this line has the same \(\theta\) as the best-fit line to the original points.
To avoid introducing extra notation, I’m going to skip the formalism of introducing new variables to represent the translated points and just pretend that the centroid of the original points is at the origin. So, \(\overline{x} = \overline{y} = 0\). Consequently, \(\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i = 0\). From equation (\ref{eq:centroid}) it follows that \(\rho = 0\). This greatly simplifies the solution for finding \(\theta\) as any terms with \(\sum_{i=1}^{n} x_i\), \(\sum_{i=1}^{n} y_i\), or \(\rho\) can be dropped. (Once we’ve found the solution for \(\theta\), we’ll disregard this convenient masquerade and have \(\overline{x}\) and \(\overline{y}\) take on their real values to determine \(\rho\) for the original, untranslated points.)
Let’s look at \(\partial E/\partial \theta\):
\begin{equation}
\frac{\partial E}{\partial \theta} = 2 \sum_{i=1}^{n} \left( x_i \cos \theta + y_i \sin \theta \right) \left( y_i \cos \theta - x_i \sin \theta \right) = 0
\end{equation}
Expanding the product and collecting terms, this becomes:
\begin{equation}
\label{eq:theta}
a \sin \theta \cos \theta + b \left( \cos^2 \theta - \sin^2 \theta \right) = 0
\end{equation}
where
\begin{equation}
\label{eq:a}
a = \sum_{i=1}^{n} \left( y_i^2 - x_i^2 \right)
\end{equation}
\begin{equation}
\label{eq:b}
b = \sum_{i=1}^{n} x_i y_i
\end{equation}
Proceed by dividing (\ref{eq:theta}) by \(\cos^2 \theta\) and rearranging into a quadratic in \(\tan \theta\):
\begin{equation}
\label{eq:theta2}
b \tan^2 \theta - a \tan \theta - b = 0
\end{equation}
Using the quadratic formula, solving the quadratic (\ref{eq:theta2}) yields the following solutions:
\begin{equation}
\label{eq:quadratic}
\theta_1 = \tan^{-1}\left( \frac{a - \sqrt{a^2 + 4b^2}}{2b} \right), \qquad
\theta_2 = \tan^{-1}\left( \frac{a + \sqrt{a^2 + 4b^2}}{2b} \right)
\end{equation}
For those only interested in how to compute the best-fit line, we’re done. Of the two solutions, \(\theta_1\) is the correct value for \(\theta\), and \(\rho\) is given by equation (\ref{eq:centroid}). So, we have:
\begin{equation}
\theta = \tan^{-1}\left( \frac{a - \sqrt{a^2 + 4b^2}}{2b} \right)
\end{equation}
\begin{equation}
\label{eq:rho}
\rho = \overline{x} \cos \theta + \overline{y} \sin \theta
\end{equation}
Recall that earlier we pretended the original points were translated by \((\overline{x}, \overline{y})\) so we could assume \(\overline{x} = \overline{y} = 0\) and take advantage of some convenient simplifications. In equation (\ref{eq:rho}), we stop pretending, so \(\overline{x}\) and \(\overline{y}\) describe the centroid of the original points.
When implementing this in software, the case \(b = 0\) is not problematic if \(\theta\) is computed with the atan2(y, x) function, as \(\theta = \operatorname{atan2}\!\left( a - \sqrt{a^2 + 4b^2},\ 2b \right)\).
A Swift implementation of this is available as a GitHub gist.
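For readers without Swift handy, here is a sketch of the final formulas in Python. The function name and structure are mine; the Swift gist remains the author’s reference implementation.

```python
import math

# Best-fit line minimizing perpendicular distances, following the
# derivation above. Returns (rho, theta) for the line
# x*cos(theta) + y*sin(theta) = rho.
def perpendicular_fit(points):
    n = len(points)
    cx = sum(x for x, _ in points) / n      # centroid x
    cy = sum(y for _, y in points) / n      # centroid y
    # a and b from the derivation, computed on centroid-translated points
    a = sum((y - cy) ** 2 - (x - cx) ** 2 for x, y in points)
    b = sum((x - cx) * (y - cy) for x, y in points)
    # theta_1 via atan2, so b = 0 needs no special-casing
    theta = math.atan2(a - math.hypot(a, 2 * b), 2 * b)
    rho = cx * math.cos(theta) + cy * math.sin(theta)
    return rho, theta

# Points on a vertical line, the case y = mx + b cannot express:
print(perpendicular_fit([(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]))  # (1.0, 0.0): the line x = 1
```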
To complete this derivation, we need to show that \(\theta_1\) is the solution for the minimum and not the maximum (which would give the worst-fitting line). To prove that \(\theta_1\) is the minimum, the second derivative test can be used: it must be shown that the second derivative of \(E\) at \(\theta_1\) is positive.
If we take the second partial derivative of \(E\) with respect to \(\theta\) and substitute \(\rho = 0\) and the summation terms with equations (\ref{eq:a}) and (\ref{eq:b}), we get:
\begin{equation}
\label{eq:partial}
\frac{\partial^2 E}{\partial \theta^2} = 2a \left( \cos^2 \theta - \sin^2 \theta \right) - 8b \sin \theta \cos \theta
\end{equation}
It takes some algebra to get the above, but it’s straightforward. The next steps are less obvious (to me, anyway). From equation (\ref{eq:quadratic}) we can express the solution for \(\theta_1\) (I’ll drop the subscript from here on) as:
\begin{equation}
\label{eq:ratio}
\frac{\sin \theta}{\cos \theta} = \frac{a - \sqrt{a^2 + 4b^2}}{2b}
\end{equation}
If we introduce the following substitutions:
\begin{equation}
\alpha = a, \qquad \beta = 2b, \qquad \gamma = \sqrt{\alpha^2 + \beta^2} = \sqrt{a^2 + 4b^2}
\end{equation}
we can rewrite (\ref{eq:ratio}) as:
\begin{equation}
\frac{\sin \theta}{\cos \theta} = \frac{\alpha - \gamma}{\beta}
\end{equation}
which we can rewrite as two equations by introducing a constant \(t\):
\begin{equation}
\sin \theta = t \left( \alpha - \gamma \right), \qquad \cos \theta = t \beta
\end{equation}
Now we rewrite the partial derivative (\ref{eq:partial}), replacing instances of \(a\), \(b\), \(\cos \theta\), and \(\sin \theta\) with their values expressed in terms of \(\alpha\), \(\beta\), \(\gamma\), and \(t\):
\begin{equation}
\frac{\partial^2 E}{\partial \theta^2} = 2 t^2 \left( \alpha \beta^2 - \alpha \left( \alpha - \gamma \right)^2 - 2 \beta^2 \left( \alpha - \gamma \right) \right)
\end{equation}
We’ll eliminate \(\beta^2\) by noting that \(\beta^2 = \gamma^2 - \alpha^2\). This gives us:
\begin{equation}
\label{eq:partial2}
\frac{\partial^2 E}{\partial \theta^2} = -4 t^2 \gamma^2 \left( \alpha - \gamma \right)
\end{equation}
From the definition of \(\gamma\) it follows that \(\gamma \gt 0\) and that \(\gamma \gt \alpha\) (if \(b \ne 0\)). This implies that \(\alpha  \gamma \lt 0\). Also \(t^2\) and \(\gamma^2\) are clearly both positive. Since there are two negative factors and two positive factors in (\ref{eq:partial2}), it must be that \(\partial^2E/\partial \theta^2 > 0\). Therefore, \(E\) has a minimum at \(\theta_1\).
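As a numerical sanity check on this conclusion (my own addition, not part of GGV), we can compare the closed-form \(\theta_1\) against a brute-force scan of \(E\) over \(\theta\) for random centroid-translated points, and confirm it lands on the minimum rather than the maximum:

```python
import math
import random

# Random points, translated so their centroid sits at the origin
# (matching the rho = 0 simplification used in the derivation).
random.seed(1)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
cx = sum(x for x, _ in pts) / len(pts)
cy = sum(y for _, y in pts) / len(pts)
pts = [(x - cx, y - cy) for x, y in pts]

def E(t):
    """Sum of squared perpendicular distances for the line with rho = 0."""
    return sum((x * math.cos(t) + y * math.sin(t)) ** 2 for x, y in pts)

a = sum(y * y - x * x for x, y in pts)
b = sum(x * y for x, y in pts)
theta1 = math.atan2(a - math.hypot(a, 2 * b), 2 * b)
theta2 = math.atan2(a + math.hypot(a, 2 * b), 2 * b)

brute_min = min(E(k * math.pi / 10000) for k in range(-10000, 10000))
assert E(theta1) <= brute_min + 1e-9    # theta_1 attains the minimum
assert E(theta2) > E(theta1)            # theta_2 is the worse extremum
print("theta_1 is the minimum")
```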
We could follow an almost identical sequence of steps to show that the other solution to the quadratic (\ref{eq:quadratic}), \(\theta_2\), is the maximum of \(E\).
There are a few loose ends that I’ve failed to cover. To get (\ref{eq:theta2}), I assumed that \(\cos \theta \ne 0\), and thus that it was safe to divide by. In deducing that equation (\ref{eq:partial2}) is positive, it was assumed that \(b \ne 0\). I’ve done several tests of my Swift implementation with points that exercise the case \(b = 0\), and the solution seems to work in all cases. But a proof that \(\theta_1\) is the minimum when \(b = 0\) eludes me. I’ve tried the higher-order derivative test for an extremum, but I’ve been unable to show that a requisite higher-order derivative is positive. If any reader can fill in the missing bits, I’d love to hear from you.
✻ As of 2017 it’s formally known as “Mathematics and Statistics Awareness Month”, but that’s a mouthful and doesn’t sound as good IMO. And I wonder: why does statistics, just a branch of mathematics, get such prominence in the name?