Multivariable optimization problems are ubiquitous in applied math and data science, because a common path to achieving desirable results for real-world problems is to specify a value which captures some notion of badness and use optimization methods to make it as small as possible. Let's begin with a concrete example of this pattern.
The line of best fit
Suppose we want to use a line to describe the relationship between the -values and -values for a collection of points in the -plane. Some lines clearly do a very poor job of aligning with the points, while others are better. Try adjusting the sliders below to see how the resulting graph changes.
To identify which line which does the "best" job, we need to quantify how well a given line fits the points. The most common way to do this is to measure the sum of the squares of the vertical distances from each point to the line. We seek to make that sum of squared distances as small as possible.
If the points are , then the value we're trying to minimize is
The letter is used in this context in reference to the word loss, which is a synonym for badness.
Here's the key insight: at any point where is a nonzero vector, there will be directions in which the pair may be perturbed to increase 's value (specifically, any directions which make an
Taking the partial derivatives of and setting them equal to zero, we get the system of equations
We have two equations and two unknowns ( and ), so we can solve using standard techniques for systems of linear equations. For example, solving the second equation for gives
and substituting into the first equation and solving for yields
We can substitute back into the equation for to find the only pair at which is equal to the zero vector. If we trust our intuition that at least one minimizing pair must exist, then this is the one. In fact, this intuition is correct, and the formula above is indeed the formula for the line of best fit.
Run the code below to confirm visually that the formula we derived above fits the sampled points reasonably well. (Note that you have to run the cell twice to get the plot to show.)
import numpy as np import matplotlib.pyplot as plt n = 10 x = np.random.random_sample(n) y = 2 * x + 3 + 0.2 * np.random.standard_normal(n) m = ((np.sum(x*y) - np.sum(x)*np.sum(y)/n) / (np.sum(x**2) - np.sum(x)**2/n)) b = (np.sum(y) - m * np.sum(x))/n yfit = m*x + b plt.scatter(x,y) plt.plot(x,yfit)
Let's consider the problem of optimizing a differentiable function from a subset of to .
As we observed in the previous section, if is nonzero at a point away from the boundary of , then there are directions in which decreases as you move away from that point and other directions where it increases. Therefore, any minima or maxima of away from the boundary of must occur at points where the gradient of is zero. We call points where the gradient of is zero or where is non-differentiable critical points of . If a function has a minimum or a maximum at a point, then either that point is a critical point, or it is on the
Now let's consider the possibility that has a maximum or minimum on the boundary of its domain . For simplicity, we'll assume that is a subset of the plane. Because of the
If the boundary of is specified as a level set of a function , then being perpendicular to the tangent line is the same as being parallel to the gradient of . Two vectors are parallel if one is a multiple of the other, so we arrive at the equation
for some . The variable is called a Lagrange multiplier.
Although we framed the discussion assuming that the domain is a subset of the plane, the same analysis works in any dimension, and the resulting equation is still .
The extreme value theorem confirms the intuition that a continuous function defined on a closed and bounded subset of must have a maximum somewhere and a minimum somewhere. Let's combine all of these points into a single theorem statement:
Theorem (Extreme value theorem and Lagrange multipliers)
Suppose that is a continuous function defined on a closed and bounded subset of . Then
- realizes an absolute maximum and absolute minimum on (the extreme value theorem)
- any point where realizes an extremum is either a critical point—meaning that or is non-differentiable at that point—or at a point on the boundary.
- if realizes an extremum at a point on a portion of the boundary which is the level set of a differentiable function with non-vanishing gradient , then either is non-differentiable at that point or the equation is satisfied at that point, for some .
This theorem suggests a program for optimizing a function on a domain whose boundary is a level set of a function .
- Identify any points where
- is undefined,
- the boundary of is not smooth, or
- for some
- Calculate at all of these points and select the largest and smallest values.
Find the maximum and minimum values of on the rectangle .
Solution. The system of equations tells us that and . Substituting, we find , which can be rearranged and factored (by repeated applications of difference of squares) to get
The only value of in the interior of the rectangle which satisfies this equation is . Substituting into , we find the critical point in the interior of the rectangle. The value of the function this point is .
Along the bottom edge of the rectangle, the value of the function at is which ranges monotonically from 0 to 81. Along the top edge, we have . This has a critical point at .
Along the left edge, the function is equal to , which has no interior critical points. And finally, along the right edge, we have , which has a critical point at .
So all together, checking critical points and corners gives
Therefore, the absolute maximum is and the absolute minimum is .
Find the point closest to the origin in the region .
Solution. We can rule out the possibility that the point nearest the origin is in the interior of the region, on geometric grounds. To find the boundary critical points, we set the objective function
and the constraint function describing the boundary of the given set. Solving the Lagrange equation
together with the constraint equation , we get , which implies . So we get a single critical point , and since there's only one, the minimum must occur here.
We can also use the Lagrange multiplier idea in the reverse direction. Let's look at an example from statistics where Lagrange multipliers give rise to a meaningful insight.
Lasso and ridge regression
Consider the problem of fitting a plane of the form to a set of points in . As we did for the line of best fit above, we will look for the plane which minimizes the sum of squared distances in the -direction from the points to the plane.
Sometimes this approach gives undesirable results because the resulting values for and are considered too large for a given application. If we want to force them to be smaller, we can add a term to our objective function which penalizes largeness of and . We choose some positive value and propose the objective function
This is called the ridge objective function. Alternatively, we could use
This is called the lasso objective function.
It turns out in practice that the lasso minimizer frequently has either or equal to zero, while the ridge minimizer almost never does. Zero coefficients can be helpful, because if then we are expressing a simpler relationship between just the -value and -value of each point (and similarly if is zero instead). Being able to ignore some of the coordinates is especially helpful if we have hundreds of coordinates rather than just two, as is often the case in statistical applications.
Use the method of Lagrange multipliers to explain why the lasso minimizer tends to have solutions on the coordinate axes in the plane, while the ridge minimizer does not.
Solution. Let's define and . The function is minimized at a point where . In other words, the minimizer must be a point where and are
By the method of Lagrange multipliers, this is the same equation we get if consider the optimization problem of minimizing subject to the constraint that , where is any constant. Therefore, the ridge minimizer will occur at a point where has a minimum on some set of the form . Since is a disk in the -plane, we would expect the typical solution to be at a point on the boundary of the circle where neither nor is zero.
Applying the same analysis to the lasso minimizer, we get an optimization problem on the squared-shaped set . Since the square has corners which always contribute candidate points for minimization of (since there is no tangent line at those points), it's reasonable to expect the lasso minimizer to occur at these points.
Many airlines require that the sum of the length, width and height of a checked baggage cannot exceed 62 inches. Find the dimensions of the rectangular baggage that has the greatest possible volume under this regulation.
Solution. The objective function is the volume formula for a box, namely . The constraint equation is . The Lagrange equation tells us that
Since we also have the equation , we have four equations and four variables. Multiplying the three equations above, we get , which means . Dividing this equation by each of the original three equations gives . Substituting into the last equation, we find . Thus the maximum-volume luggage shape is a cube.