We’ve been talking about fully observable, deterministic, static, known environments so far…
Also where the solution is just a sequence of actions.
Today we’re going to start relaxing those constraints.
In our Travel Agent, both the goal and the path were relevant. That’s not always the case.
IC design, factory floor layout, automatic programming, crop planning, etc. are strictly goal-oriented
Local search starts from an initial state and moves to neighboring states, without keeping track of the path or of previously visited states.
Yes, they’re not systematic. But what does that get us?
Local search is also good for optimization problems where the goal is to find the best state according to some objective function.
We can think about this in terms of a state-space landscape where we’re trying to find a global maximum through hill climbing (alternatively we could find the global minimum, called gradient descent)
1D Hill-Climbing
Hill-Climbing Pseudocode
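As a rough sketch of what that pseudocode amounts to (assuming a `successors(state)` generator and an objective `value(state)` we want to maximize, neither of which is defined on the slide):

```python
def hill_climb(initial, successors, value):
    """Steepest-ascent hill climbing: move to the best neighbor
    until no neighbor improves on the current state."""
    current = initial
    while True:
        neighbors = list(successors(current))
        if not neighbors:
            return current
        best = max(neighbors, key=value)
        if value(best) <= value(current):
            return current  # local maximum (or plateau)
        current = best
```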
Let’s demonstrate with the 8-queens problem
We’ll use a complete-state formulation, where every state already contains all the components of a solution (just maybe not arranged correctly).
We get the initial state at random, and we can produce new states by moving a single queen at a time.
How many states does that get us?
Does our initial state have to be truly random?
We can organize the queens into columns (or rows, I suppose), so the only valid moves for a queen are to the other squares in its column.
So our number of successor states is…
\[ 8 \times 7 = 56 \]
What should be our heuristic cost?
Let’s let \(h\) be “pairs of queens that are attacking each other”
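To make that concrete, here’s one possible encoding (a sketch, not necessarily the slide’s): a state is a tuple of 8 row indices, one queen per column, and \(h\) counts attacking pairs:

```python
import random

def random_state(n=8):
    """One queen per column, each in a random row (complete-state formulation)."""
    return tuple(random.randrange(n) for _ in range(n))

def h(state):
    """Heuristic cost: number of pairs of queens attacking each other."""
    pairs = 0
    for i in range(len(state)):
        for j in range(i + 1, len(state)):
            same_row = state[i] == state[j]
            same_diagonal = abs(state[i] - state[j]) == j - i
            if same_row or same_diagonal:
                pairs += 1
    return pairs

def successors(state):
    """Move one queen to another square in its own column: 8 x 7 = 56 successors."""
    for col in range(len(state)):
        for row in range(len(state)):
            if row != state[col]:
                yield state[:col] + (row,) + state[col + 1:]
```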
8-Queens
Does this strategy look familiar?
And it works pretty well, because often it’s easy to improve a very bad state.
Local Maxima
Ridges
Plateaus (or shoulders)
Given a random state (8-queens), a steepest-ascent strategy gets stuck 86% of the time (solving only 14% of initial states).
But it’s fast! Four steps (average) to succeed and three (average) to get stuck.
Reminder: this is a search space with \(8^8 \approx 17M\) states
Sideways moves may let us escape from shoulders (how can this go wrong?)
On a flat plateau, we’ll wander forever.
We could limit sideways moves (e.g., to 100), which boosts our problems solved from 14% to 94% (wow!)
This comes at a cost: ~21 steps for success, ~64 for failure.
Stochastic hill climbing, where the choice among uphill moves is weighted by steepness (slower, but may find better solutions)
First-choice hill climbing, where we generate successors randomly until one is better than the current state. When would we use this?
Random-restart hill climbing: “Just try again”
This will, given enough time, always work. Why?
If each attempt has the probability of success \(p\) then our estimated restarts are \(1/p\)
For 8-queens (no sideways moves), \(p \approx 0.14\), so we need ~7 iterations (6 failures and 1 success).
Our expected number of steps is the cost of one successful iteration plus \((1-p)/p\) times the cost of failure (roughly 22 steps altogether).
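As a quick check, plugging in the averages above (4 steps per success, 3 per failure, \(p \approx 0.14\)):
\[ 1 \times 4 + \frac{1-0.14}{0.14} \times 3 \approx 4 + 18 \approx 22 \]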
If we allow sideways moves, then \(1/0.94 \approx 1.06\) iterations are needed on average, so we can expect a cost of:
\[ (1 \times 21) + (0.06/0.94)\times 64 \approx 25 \]
This works pretty well for 8-queens! Even for 3 million queens, this approach solves the problem in seconds.
When would random-restart struggle?
Consider two approaches:
Any metallurgists in the crowd?
Let’s consider gradient descent for a moment.
The purpose of simulated annealing is to knock the ball out of local minima without knocking it out of global minima.
Annealing Pseudocode
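A minimal sketch of the idea in code (the `successors` and `value` functions and the geometric cooling schedule are assumptions, not part of the slide; `value` is maximized):

```python
import math
import random

def simulated_annealing(initial, successors, value,
                        t0=1.0, cooling=0.995, t_min=1e-3):
    """Pick a random neighbor; always accept improvements, and accept worse
    moves with probability exp(delta / T), where T shrinks over time."""
    current, t = initial, t0
    while t > t_min:
        neighbors = list(successors(current))
        if not neighbors:
            break
        candidate = random.choice(neighbors)
        delta = value(candidate) - value(current)
        if delta > 0 or random.random() < math.exp(delta / t):
            current = candidate
        t *= cooling  # geometric cooling schedule (an assumption)
    return current
```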
A really nice property of this (the Boltzmann distribution) is that if we lower the “temperature” slowly enough, all of the probability will be concentrated around the global minimum.
Remind me: what is the difference between discrete and continuous spaces?
(we might say that one has an infinite branching factor)
How often do you think this shows up in the real world?
Consider a problem where we want to place three new airports in Romania, where the sum of squared (straight-line) distances between each city and its nearest airport is minimized.
How many dimensions do we have to consider here?
Our space requires one dimension per coordinate of each airport’s \((x,y)\) position, six in total. This makes our objective function:
\[ f(\mathbf{x})=f(x_1,y_1,x_2,y_2,x_3,y_3)=\sum^3_{i=1}\sum_{c\in C_i}(x_i-x_c)^2+(y_i-y_c)^2 \]
Where \(C_i\) is the set of cities whose nearest airport (in the state \(\mathbf{x}\)) is airport \(i\)
(this only works if we constrain the changes to \((x,y)\) to be small enough that each \(C_i\) stays fixed; otherwise we’d have to recompute \(C_i\))
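A sketch of how this objective might be coded, with cities and airports as lists of \((x, y)\) tuples (the names and representation are assumptions):

```python
def nearest_airport(city, airports):
    """Index of the airport with the smallest squared straight-line distance to a city."""
    xc, yc = city
    return min(range(len(airports)),
               key=lambda i: (airports[i][0] - xc) ** 2 + (airports[i][1] - yc) ** 2)

def objective(airports, cities):
    """f(x): sum of squared distances from each city to its nearest airport."""
    total = 0.0
    for city in cities:
        xi, yi = airports[nearest_airport(city, airports)]
        total += (xi - city[0]) ** 2 + (yi - city[1]) ** 2
    return total
```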
So, how can we deal with this problem?
We could discretize the problem, which makes this behave in a similar way to the prior examples.
Or… when I think continuous… I think… what?
If we measure progress by the change in value of the objective function, then we call the method an empirical gradient method
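A sketch of the empirical-gradient idea, estimating each partial derivative by nudging one coordinate at a time (here `f` takes a flat list of coordinates; the step size `eps` is an assumption):

```python
def empirical_gradient(f, x, eps=1e-4):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        hi = list(x); hi[i] += eps
        lo = list(x); lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad
```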
If you’re mathematically minded, you may have thought of a way to game the objective function here…
\[ \nabla f= \left(\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial y_1},\frac{\partial f}{\partial x_2},\frac{\partial f}{\partial y_2},\frac{\partial f}{\partial x_3},\frac{\partial f}{\partial y_3}\right) \]
Taking the partial derivative of our objective function with respect to our dimensions gives us a vector to follow!
In some cases we can even find our maximum by solving \(\nabla f=0\) (for example in the case of a single airport)
So using:
\[ \frac{\partial f}{\partial x_1} = 2 \sum_{c \in C_1}(x_1-x_c) \]
We can update our state like so:
\[ \mathbf{x}\leftarrow\mathbf{x}+\alpha\nabla f(\mathbf{x}) \]
Where \(\alpha\) is some small scalar value called the step size. As you might imagine, finding the correct \(\alpha\) can be tricky. Why might this be?
Line search repeatedly extends \(\alpha\) until \(f\) starts to decrease again, then selects that state.
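Putting the analytic gradient and the update rule together for the airport problem, a sketch (it reuses the `nearest_airport` helper from the earlier sketch; \(\alpha\) is a made-up value, and we subtract the gradient because here we’re minimizing \(f\)):

```python
def airport_gradient(airports, cities):
    """Analytic gradient: df/dx_i = 2 * sum_{c in C_i} (x_i - x_c), likewise for y_i."""
    grad = [[0.0, 0.0] for _ in airports]
    for (xc, yc) in cities:
        i = nearest_airport((xc, yc), airports)
        grad[i][0] += 2 * (airports[i][0] - xc)
        grad[i][1] += 2 * (airports[i][1] - yc)
    return grad

def gradient_descent_step(airports, cities, alpha=0.01):
    """One update x <- x - alpha * grad f(x) (minus because we minimize f)."""
    grad = airport_gradient(airports, cities)
    return [(x - alpha * gx, y - alpha * gy)
            for (x, y), (gx, gy) in zip(airports, grad)]
```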
For many applications, an effective algorithm is the Newton-Raphson method (a general way to find roots of functions, that is, \(g(\mathbf{x})=0\))
\[ x \leftarrow x - g(x)/g'(x) \]
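For instance, a sketch of the 1D version (finding \(\sqrt{2}\) as the positive root of \(g(x) = x^2 - 2\); the function names are made up):

```python
def newton_raphson(g, g_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of g by iterating x <- x - g(x) / g'(x)."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton_raphson(lambda x: x ** 2 - 2, lambda x: 2 * x, x0=1.0)  # ~1.41421356
```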
If we want the maximum (or minimum) of \(f\), we need an \(\mathbf{x}\) such that our gradient is a zero vector, so our \(g(x)\) becomes \(\nabla f(\mathbf{x})\) and can be written in matrix-vector form as:
\[ \mathbf{x}\leftarrow\mathbf{x}-\mathbf{H}^{-1}_f(\mathbf{x})\nabla f(\mathbf{x}) \]
Where \(\mathbf{H}_f(\mathbf{x})\) is the Hessian matrix of second derivatives, whose elements \(H_{ij}\) are given by \(\partial^2 f/\partial x_i \partial x_j\).
Our Hessian for the airport problem has all off-diagonal elements equal to 0, and the diagonal elements for airport \(i\) equal \(2 \times |C_i|\)
What do you think will happen when we calculate this?
If we actually do this, it just plops the airport right at the centroid of all the cities in \(C_i\)
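To see why, here’s the Newton step for the \(x\)-coordinate of airport 1, using the diagonal Hessian entry \(2|C_1|\) from above:
\[ x_1 \leftarrow x_1 - \frac{1}{2|C_1|}\cdot 2\sum_{c\in C_1}(x_1-x_c) = x_1 - x_1 + \frac{1}{|C_1|}\sum_{c\in C_1}x_c = \frac{1}{|C_1|}\sum_{c\in C_1}x_c \]
which is exactly the mean (centroid) of the cities’ \(x\)-coordinates.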
For high-dimensional problems, this gets expensive as we have to calculate \(n^2\) entries and invert an \(n \times n\) matrix.
Also, all the prior problems in discrete spaces are also problems in continuous spaces.
Random-restarts and simulated annealing are helpful, but in high-dimensional spaces, it’s easy to get lost.
As one might imagine, it’s not valid to place an airport just anywhere (examples?)
How this affects the difficulty is very domain-dependent.
Linear programming is perhaps the best-known version of this. The constraints must be linear inequalities forming a convex set, and \(f\) must also be linear. This is solvable in time polynomial in the number of variables.
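As a tiny illustration (this uses `scipy.optimize.linprog`, which minimizes \(c^\top x\) subject to \(A_{ub}x \le b_{ub}\); the particular numbers are made up):

```python
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective coefficients.
result = linprog(c=[-3, -2],
                 A_ub=[[1, 1], [1, 0]],
                 b_ub=[4, 3],
                 bounds=[(0, None), (0, None)])
print(result.x, -result.fun)  # expect roughly x=3, y=1, objective 11
```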
Convex optimization is even more general, and captures many important problems (control theory, machine learning).