Hyperparameter Optimization Fun

You all know that datarootsians are excellent data athletes. Olympic athletes train with weights; we evolved past that mere display of physical strength and started training the weights themselves. In the same way, we Machine Learning Engineers train our models to achieve optimal performance on any task given to us. That’s how we shine!

Behind the shine and bling-bling, we have to admit that the performance of our models depends largely on what our model looks like, call it model infrastructure. This infrastructure is governed by a set of ‘hyperparameters’. Think about the depth of a tree, the number of layers and nodes in a neural network, weight regularization terms and even distributions, activation functions, and dropout probabilities. All of these choices need to be made before we can even start training our weights, and they cannot be learned by the model itself. Therefore, choosing the best hyperparameters is one of the key challenges for many Machine Learning Engineers (PS: we do a lot more cool stuff, but we’ll talk about that in the next Homemade Muffin-meeting).

In the days of yore, we either skipped hyperparameter tuning altogether and relied on vanilla models, or used Thibauld’s darts-at-office skills to pick the best hyperparameters. Even though Thibauld’s skills remain uncontested, we can do better when picking hyperparameters.

By now, we are quite familiar with the good ol’ grid search and random search, and some of us might have heard about evolutionary algorithms using heuristics (based on birds and ants) or genetic approaches (based on NSFW). The problem with these algorithms is that (1) they take a friggin’ long time, as a whole new model needs to be trained for every combination of hyperparameters we try, (2) they do not ensure optimal hyperparameters, as we can only hope to randomly pick the best, and (3) they do not really learn from previous trials, as we can only take blind steps in the direction where we hope the grass is greener.
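
To make problem (1) concrete, here is a minimal sketch of how grid search and random search pick their trials. The search space, its values, and the function names are made up for illustration; every configuration produced stands for one full training run.

```python
import itertools
import random

# Hypothetical search space: names and values are illustrative only.
search_space = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.001, 0.01, 0.1],
    "dropout": [0.0, 0.25, 0.5],
}


def grid_search_configs(space):
    """Yield every combination of values (one full training run each)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_search_configs(space, n_trials, seed=0):
    """Yield n_trials configurations, each sampled independently at random."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {name: rng.choice(values) for name, values in space.items()}


grid = list(grid_search_configs(search_space))
sampled = list(random_search_configs(search_space, n_trials=10))

print(len(grid))     # 4 * 3 * 3 = 36 full training runs
print(len(sampled))  # 10 runs, none of which looked at an earlier result
```

Add a fourth hyperparameter with five values and the grid jumps to 180 runs, while random search stays at whatever budget you give it but never uses what earlier trials found.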

Before we can prance through green meadows with flowers in our hair, we need a flexible and robust approach that effectively and efficiently finds the best hyperparameters in the configuration space… in a fast and preferably cheap way. When confronted with a seemingly impossible problem, we all know that one guy who says: “Let’s go Bayesian”.

Honoring the true Bayesian spirit, allow me to make things easy for you:

First, we just have to try training the full model a few times, each time using distinct hyperparameters. We log the hyperparameters and the resulting loss metric of our trials. We call this 'evaluation of the objective function'. Luckily for us, about 3 runs will do the trick.
After this, we infer a surrogate model (call it a meta-model) which maps the hyperparameters to our loss metric (measured on a held-out validation set, so the test set stays untouched). A good surrogate model predicts how our loss is expected to evolve when we change our hyperparameters, and it incorporates our uncertainty about that prediction. The key takeaways for a surrogate model are:
- It is very easy to build and evaluate, which means that we can do thorough and fast optimization!
- You can make this model as kick-ass as you would like, but currently 'they' explored (i) Gaussian processes, (ii) random forests, (iii) Tree-structured Parzen Estimators and (iv) Bayesian Neural Nets.
- After every trial, we can update this model (learn better and reduce uncertainty).
We also define an acquisition function to help us choose the next hyperparameters to try. This acquisition function balances exploitation (stay close to the current best) and exploration (probe regions we are still uncertain about) of our configuration space. We would be wise to move in the direction where we expect the next trial to improve the most on the best result found so far; call this Expected Improvement.
Repeating this process a few times leads you to the best set of hyperparameters quite fast and with as few iterations as possible.
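
For the curious, the loop above can be sketched in plain Python. Everything here is an assumption made for illustration: `objective` is a toy stand-in for "train the full model and return the validation loss", there is a single hyperparameter (a log learning rate), the surrogate is a small hand-rolled Gaussian process with an RBF kernel, and the acquisition function is Expected Improvement evaluated on a fixed grid of candidates. It is a sketch of the idea, not how any particular library does it.

```python
import math
import random


def objective(log_lr):
    """Hypothetical stand-in for a full training run returning a validation loss."""
    return (log_lr + 3.0) ** 2 / 4.0 + 0.1 * math.sin(5.0 * log_lr)


def rbf(a, b, length_scale=1.0):
    """RBF kernel: nearby hyperparameters are assumed to give similar losses."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Surrogate model: posterior mean and std of the loss at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)  # constant prior mean = average observed loss
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Acquisition function: expected amount by which we beat `best` (minimization)."""
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mean) * cdf + std * pdf


def bayesian_optimization(n_init=3, n_iter=10, low=-6.0, high=0.0, seed=0):
    rng = random.Random(seed)
    # Step 1: a few full evaluations of the objective with distinct hyperparameters.
    xs = [rng.uniform(low, high) for _ in range(n_init)]
    ys = [objective(x) for x in xs]
    candidates = [low + (high - low) * i / 200 for i in range(201)]
    for _ in range(n_iter):
        best = min(ys)
        # Steps 2 and 3: refit the surrogate, then score candidates with EI.
        scores = [expected_improvement(*gp_posterior(xs, ys, c), best)
                  for c in candidates]
        x_next = candidates[scores.index(max(scores))]
        # Step 4: run the expensive trial and log it for the next round.
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best], xs, ys


if __name__ == "__main__":
    best_x, best_y, xs, ys = bayesian_optimization()
    print(f"best log_lr={best_x:.3f}, loss={best_y:.4f}, trials={len(xs)}")
```

Note how cheap the surrogate is compared to the real thing: each iteration scores 201 candidates without training a single model, and only the winner gets a full (expensive) trial.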

Starts to sound insanely -Bayesian-y- simple, doesn’t it?

Remember I started off by talking about athletics (if not, scroll up)? Well, in hyperparameter optimization, we are allowed to play dirty! We can actually put our meta-models on steroids, which resulted in the creation of our good friend BOHB (Bayesian Optimization and HyperBand), the current state-of-the-art in hyperparameter optimization. This is basically what most AutoML libraries, both commercial and open-source, thrive on.

I’ll explain this, and much more, in the upcoming Rootlabs@Lunch. Make sure to tune in and spread the word!