- Hands-On Ensemble Learning with R
- Prabhanjan Narayanachar Tattar
- 2025-04-04 16:30:55
Statistical/machine learning models
The previous section introduced a host of problems through real datasets, and we will now discuss some standard model variants that are useful for dealing with such problems. First, we set up the required mathematical framework.
Suppose that we have $n$ independent pairs of observations, $(Y_i, \mathbf{x}_i),\ i = 1, \ldots, n$, where $Y_i$ denotes the random variable of interest, also known as the dependent variable, regressand, endogenous variable, and so on, and $\mathbf{x}_i$ is the associated vector of explanatory variables, or independent/exogenous variables. The explanatory vector will consist of $k$ elements, that is, $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$. The data realized is of the form $(y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)$, where $y_i$ is the realized value (data) of the random variable $Y_i$. A convention will be adopted throughout the book that $x_{i1} = 1$ for all $i$, and this will take care of the intercept term. We assume that the observations are from the true distribution $F$, which is not completely known. The general regression model, including the classification model as well as the regression model, is specified by:
$$Y = f(\mathbf{x}, \boldsymbol{\beta}) + \epsilon$$
Here, the function $f$ is an unknown function and $\boldsymbol{\beta}$ is the regression parameter, which captures the influence of $\mathbf{x}$ on $Y$. The error $\epsilon$ is the associated unobservable error term. Diverse methods can be applied to model the relationship between the $Y$s and the $\mathbf{x}$s. The statistical regression model focuses on the complete specification of the distribution of the error $\epsilon$, and in general the functional form would be linear, as in $f(\mathbf{x}, \boldsymbol{\beta}) = g(\mathbf{x}^\top \boldsymbol{\beta})$. The function $g$ is the link function in the class of generalized linear models. Nonparametric and semiparametric regression models are more flexible, as we don't place a restriction on the error's probability distribution. This flexibility comes at a price, though: we need a much higher number of observations to make a valid inference, although that number is unspecified and is often subjective.
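To make the framework concrete, here is a minimal sketch in base R. It builds an explanatory matrix whose first column is fixed at 1 (the intercept convention), then fits a linear model and a logistic generalized linear model. The data are simulated, and the sample size, the coefficient values, and every variable name (`X`, `beta`, `y_lin`, `y_bin`, and so on) are assumptions chosen for illustration rather than anything taken from the book's datasets.

```r
# Simulated data: all names and parameter values are illustrative assumptions.
set.seed(123)
n  <- 200
x2 <- rnorm(n)
x3 <- runif(n)

# Explanatory matrix with a leading column of ones for the intercept, so k = 3.
X    <- cbind(x1 = 1, x2 = x2, x3 = x3)
beta <- c(1, 2, -1)

# Linear predictor x'beta for every observation.
eta <- drop(X %*% beta)

# Linear model: Y = x'beta + epsilon, with normally distributed errors.
y_lin  <- eta + rnorm(n, sd = 0.5)
# "- 1" drops R's automatic intercept because X already carries the ones column.
fit_lm  <- lm(y_lin ~ X - 1)
coef_lm <- coef(fit_lm)

# Binary response: P(Y = 1 | x) = plogis(x'beta), fitted with a logit link.
y_bin    <- rbinom(n, size = 1, prob = plogis(eta))
fit_glm  <- glm(y_bin ~ X - 1, family = binomial(link = "logit"))
coef_glm <- coef(fit_glm)

# Estimated regression parameters from both fits.
print(round(coef_lm, 2))
print(round(coef_glm, 2))
```

In everyday use one would normally write `lm(y ~ x2 + x3)` and let R add the intercept; the explicit column of ones is kept here only to mirror the notation in the text.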
The machine learning paradigm includes some black box methods, and we have a healthy overlap between this paradigm and non- and semi-parametric models. The reader is also cautioned that black box does not mean unscientific in any sense. The methods have a firm mathematical foundation and are reproducible every time. Next, we quickly review some of the most important statistical and machine learning models, and illustrate them through the datasets discussed earlier.
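The two points above, the flexibility of nonparametric fits and their reproducibility, can be sketched with base R's `loess` smoother. The helper `fit_once`, the sine-shaped true function, the noise level, and the `span` value are all hypothetical choices made for this illustration; the comparison uses in-sample error only and is not a substitute for proper validation.

```r
# Hypothetical helper: simulate a sample, then fit a loess smoother to it.
fit_once <- function(seed, n = 150) {
  set.seed(seed)                      # fixing the seed fixes the simulated sample
  x <- sort(runif(n, 0, 2 * pi))
  y <- sin(x) + rnorm(n, sd = 0.3)    # nonlinear true f, chosen for illustration
  fit <- loess(y ~ x, span = 0.5)     # no parametric form is imposed on f
  list(x = x, y = y, fitted = fitted(fit))
}

# Two runs with the same seed should give exactly the same fitted values.
run1 <- fit_once(seed = 42)
run2 <- fit_once(seed = 42)
same <- identical(run1$fitted, run2$fitted)

# A straight-line fit on the same sample, for comparison.
lin <- lm(y ~ x, data = data.frame(x = run1$x, y = run1$y))

# In-sample mean squared errors of the two fits.
mse_loess <- mean((run1$y - run1$fitted)^2)
mse_lm    <- mean(residuals(lin)^2)

print(same)
print(c(loess = mse_loess, lm = mse_lm))
```

The same habit of fixing the seed applies to methods that use randomness internally, such as resampling-based learners.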