# What is the difference between parameters and hyperparameters?

### aporras

#### 13/05/2020

Although mathematically both are considered parameters, in the context of Machine Learning, parameters and hyperparameters are separate concepts. Let’s see why this distinction, which occasionally causes confusion, is needed.

## Hyperparameter vs Parameter

Let’s go back to basics and recall what a parameter is and how it differs from a variable:

Mathematical functions have one or more variables as arguments, and sometimes they also contain parameters. The existence of parameters means that the function in fact represents a whole family of functions, one for every valid set of values of the parameters. For example, the expression for the linear function is f(x) = a · x + b, where a and b are parameters and x is the variable. A particular pair of fixed values for the parameters (a, b) determines a particular line.
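As a minimal sketch of this idea, the family f(x) = a · x + b can be written as a function that takes the parameters and returns one concrete line. The name `make_line` is just an illustrative choice, not something from any library.

```python
from typing import Callable


def make_line(a: float, b: float) -> Callable[[float], float]:
    """Return the member of the family f(x) = a * x + b fixed by (a, b)."""
    def f(x: float) -> float:
        return a * x + b
    return f


# Two different parameter pairs give two different lines over the same variable x.
steep = make_line(a=2.0, b=1.0)
flat = make_line(a=0.0, b=3.0)

print(steep(4.0))  # 2.0 * 4.0 + 1.0 = 9.0
print(flat(4.0))   # 0.0 * 4.0 + 3.0 = 3.0
```

Here `a` and `b` select the function, while `x` remains free: that is the parameter/variable split in code form.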

A trained model is a particular mathematical function belonging to the family of functions defined by a certain machine learning algorithm, and it is determined by a particular tuple of parameters. The parameters that customize the function are the model parameters, or simply parameters, and they are exactly what the machine learns from data, that is, from the training feature set. Given some training data, the model parameters are fitted automatically. The features are the variables of this trained model.

Nevertheless, building a trained model requires additional parameters that define how the ML algorithm is going to do it. In ML, we use the term hyperparameters for this specific type of parameter. Hyperparameters can’t be learned by the algorithm that needs them; they must be set before the training stage, manually or automatically. They are also known as meta parameters, free parameters, or tuning parameters.

While hyperparameters are part of the input that we supply to the ML algorithm, parameters are the output as a result of fitting during training.
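To make the input/output split concrete, here is a small sketch that fits the line f(x) = a · x + b by gradient descent. The function name `fit_line`, the toy data, and the chosen values for `learning_rate` and `n_epochs` are all assumptions made for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, n_epochs=2000):
    """Fit f(x) = a * x + b by batch gradient descent on the mean squared error.

    learning_rate and n_epochs are hyperparameters: we choose them beforehand.
    The returned (a, b) are the parameters: they are learned from the data.
    """
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        # Gradients of mean((a * x + b - y) ** 2) with respect to a and b.
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys, learning_rate=0.05, n_epochs=2000)
print(round(a, 3), round(b, 3))  # should land very close to 2.0 and 1.0
```

Note that `learning_rate` and `n_epochs` never appear in the result: only `a` and `b` come out of the training.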

There are two main reasons for not including hyperparameters in the training process. In certain circumstances, it is better not to fit their values on the training data, in order to favour generalisation over overfitting. More typically, though, the reason is that there is no optimizable analytical formula for them, which prevents including them as part of the training.

## Hyperparameter types

When you decide to work with a specific ML algorithm, you need to tailor its configuration by setting the hyperparameters. Some of them are related to the architecture or specification, that is, the definition of the model itself. Examples are the number of layers for Neural Networks, the kernel selection for Gaussian Processes, or the number of neighbours K in K-Nearest Neighbours. This sort of hyperparameter determines the shape of the model that is going to be trained, i.e., the shape of the parameter tuple to optimize. Besides these, there are others that control the learning process, for example, the learning rate in several cases such as Boosting algorithms.
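The way an architecture hyperparameter fixes the shape of the parameter tuple can be sketched for a fully connected neural network. The helper names `parameter_shapes` and `count_parameters` are hypothetical, and the layer sizes are arbitrary example values.

```python
def parameter_shapes(layer_sizes):
    """List the weight and bias shapes of a fully connected network.

    layer_sizes is a hyperparameter, e.g. [4, 8, 3] means 4 inputs,
    one hidden layer of 8 units, and 3 outputs.
    """
    shapes = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        shapes.append({"weights": (n_in, n_out), "biases": (n_out,)})
    return shapes


def count_parameters(layer_sizes):
    """Total number of trainable values implied by the architecture."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(parameter_shapes([4, 8, 3]))
print(count_parameters([4, 8, 3]))     # 4*8 + 8 + 8*3 + 3 = 67
print(count_parameters([4, 8, 8, 3]))  # adding a hidden layer: 40 + 72 + 27 = 139
```

Changing the number of layers changes how many parameters exist at all, before any training has happened.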

Once all the hyperparameters are ready, the training stage can take place. As a result, the model parameters are obtained: those that will be applied to produce predictions when new features need to be processed. These tuples of parameters will be the centroids of the clusters for K-Means, the split and leaf nodes for tree-based algorithms, or the support vector coefficients for Support Vector Machines. Hyperparameters won’t be present in the prediction stage.
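Taking K-Means as the example, the following is a deliberately tiny one-dimensional sketch, not a production implementation. The function names, the naive initialisation, and the toy points are assumptions of this example.

```python
def fit_kmeans_1d(points, k, n_iterations=10):
    """Tiny 1-D K-Means. k and n_iterations are hyperparameters;
    the returned centroids are the learned parameters."""
    # Simplistic initialisation (an assumption of this sketch):
    # spread the initial centroids over the sorted data.
    ordered = sorted(points)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(n_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def predict_cluster(centroids, x):
    """Prediction uses only the fitted centroids, not n_iterations."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids = fit_kmeans_1d(points, k=2)
print(centroids)
print(predict_cluster(centroids, 2.0), predict_cluster(centroids, 8.0))
```

The prediction function receives the centroids alone; the settings that steered the training are no longer needed.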

The required hyperparameters vary widely depending on the ML algorithm. A few algorithms even require none at all, as is the case for Linear Regression.

Certain hyperparameters can be fixed without a doubt because the problem itself dictates them. The distance metric used by a distance-based algorithm such as K-Nearest Neighbours, for example, is usually derived directly from our problem. Moreover, there will be several less decisive hyperparameters that we can also fix easily, such as the tolerance for a stopping criterion. Although any hyperparameter could impact the final model, only a few of them are likely to affect it strongly.

Some people prefer to reserve the term hyperparameters for those related to training. Other experts, on the other hand, don’t count as hyperparameters those that are directly or easily fixed, only those that will be optimized. It is an open debate. Personally, I use hyperparameter to refer to every piece you need to configure outside the ML algorithm, as they are all potentially subject to optimization.

I even consider the loss function one more hyperparameter, that is, part of the algorithm configuration. Although selecting the loss function is, in general, pretty straightforward, there is often reasonable doubt about which loss function will lead to a better solution. In that case, the loss function is itself potentially optimizable.

Actually, the relevant difference rests on what is inside the final model and what stays outside of it as configuration parameters.

And… What about the ML algorithm itself? Could it be considered one more hyperparameter or parameter? Yes, it could. Automated Machine Learning consists of automating the process of applying machine learning, including algorithm selection as a part of the learning pipeline to optimize.

## Hyperparameter tuning

Choosing appropriate hyperparameters is an essential task when applying ML. Hyperparameters can affect both the speed and the accuracy of the final model. Hyperparameter optimization finds the tuple of hyperparameters that leads to the model that best solves the problem.

A general approach works this way:

As previously mentioned, we should first identify the decisive hyperparameters, so that we avoid an immense tuple of hyperparameters to optimize. Then, we should select a range of candidate values for each of them. A general hyperparameter optimization consists of evaluating the performance of several models, one for each combination of values tried within these ranges. The performance metric is evaluated on the holdout samples obtained through cross-validation. The best performance determines the final hyperparameters to use. Finally, we build the definitive model by training on the complete data with the optimal hyperparameters.
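The procedure just described can be sketched with a hand-written K-Nearest Neighbours regressor, where the number of neighbours k is the hyperparameter being tuned. Everything here (the function names, the interleaved fold split, the toy data, and the candidate list) is an assumption chosen to keep the example short.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_val_error(data, k, n_folds):
    """Mean squared error of k-NN measured on the held-out folds."""
    errors = []
    for fold in range(n_folds):
        holdout = data[fold::n_folds]  # every n_folds-th sample
        train = [pair for i, pair in enumerate(data) if i % n_folds != fold]
        errors.extend((knn_predict(train, x, k) - y) ** 2 for x, y in holdout)
    return sum(errors) / len(errors)


def tune_k(data, candidates, n_folds=4):
    """Return the candidate k with the lowest cross-validated error."""
    scores = {k: cross_val_error(data, k, n_folds) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy data: y = x ** 2 on a regular grid (noise-free, for reproducibility).
data = [(x / 2, (x / 2) ** 2) for x in range(20)]

best_k, scores = tune_k(data, candidates=[1, 2, 3, 5, 8], n_folds=4)
print(best_k, scores)

# Definitive model: k-NN over the complete data with the selected k.
print(knn_predict(data, 3.1, best_k))
```

The last line mirrors the final step of the approach: once the best k is known, all the data is used again to build the model that will make predictions.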

Note that the hyperparameters we wanted to optimize in the first place now become the parameters of the tuning process, as they are the subject of the optimization carried out by these hyper-optimization algorithms. Besides, a new higher-level set of hyper-hyperparameters will be necessary to run the tuning. The number of folds needed for cross-validation is a good example of a hyper-hyperparameter. As you can see, the hierarchy of hyperparameters can pile up as we add further layers of optimization, and the learning process can quickly become an endless chain.

Here is a list of the three most widespread algorithms for hyperparameter optimization:

1. Grid search: It performs an exhaustive search by evaluating every combination of the candidate values. Obviously, this can result in an unfeasible computing cost, so grid search is an option only when the number of candidates is limited enough.
2. Random search: Providing a cheaper alternative, random search tests only as many tuples as you choose. The values to evaluate are selected completely at random, so the required time decreases significantly. Apart from speed, random search benefits from randomization in the case of continuous hyperparameters, which must be discretized when optimized by grid search.
3. Bayesian optimization: Contrary to grid and random search, Bayesian optimization uses previous iterations to guide the next ones. It consists of building a distribution over functions (typically a Gaussian Process) that best describes the function to optimize. In the case of hyperparameter optimization, the function to optimize is the one that, given the hyperparameters, returns the performance of the trained model they would lead to. After every step, this distribution is updated and the algorithm detects which regions of the hyperparameter space are more interesting to explore and which are not. After a defined number of iterations, the algorithm stops and returns the best tuple found. Bayesian optimization is thus usually a more efficient method for exploring the possibilities.
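The first two strategies in the list are simple enough to sketch with the standard library. The `score` function below is a hypothetical stand-in for "train a model with these hyperparameters and return its validation performance", and the hyperparameter names and ranges are invented for the example. Bayesian optimization is left out because it normally relies on a dedicated library.

```python
import itertools
import random


def score(learning_rate, depth):
    """Hypothetical stand-in for training a model and returning its validation score.
    Higher is better; the optimum is at learning_rate=0.1, depth=4."""
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2


def grid_search(grid):
    """Evaluate every combination of the candidate values."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
    return max(combos, key=lambda params: score(**params)), len(combos)


def random_search(n_trials, seed=0):
    """Evaluate a fixed budget of randomly drawn tuples."""
    rng = random.Random(seed)
    trials = [{"learning_rate": rng.uniform(0.001, 1.0),  # sampled continuously
               "depth": rng.randint(1, 10)}
              for _ in range(n_trials)]
    return max(trials, key=lambda params: score(**params)), len(trials)


grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]}
best_grid, n_grid = grid_search(grid)
best_random, n_random = random_search(n_trials=8)

print(best_grid, n_grid)      # best of all 4 * 4 = 16 combinations
print(best_random, n_random)  # best of only 8 sampled tuples
```

Grid search pays for every combination, while random search spends exactly the budget you give it and can draw the continuous `learning_rate` from anywhere in its range instead of from a fixed list.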