- Hands-On Ensemble Learning with R
- Prabhanjan Narayanachar Tattar
- 300字
- 2025-04-04 16:30:55
Primary Biliary Cirrhosis
The pbc
dataset from the survival package is a benchmark dataset in the domain of clinical trials. Mayo Clinic collected the data, which is concerned with the primary biliary cirrhosis (PBC) of the liver. The study was conducted between 1974 and 1984. More details can be found by running pbc
, followed by library(survival)
on the R terminal. Here, the main time to the event of interest is the number of days between registration and either death, transplantation, or study analysis in July 1986, and this is captured in the time variable. Similarly to a survival study, the events might be censored and the indicator is in the column status. The time to event needs to be understood, factoring in variables such as trt
, age
, sex
, ascites
, hepato
, spiders
, edema
, bili
, chol
, albumin
, copper
, alk.phos
, ast
, trig
, platelet
, protime
, and stage
.
The eight datasets discussed up until this point have a target variable, or a regressand/dependent variable, and are examples of the supervised learning problem. On the other hand, there are practical cases in which we simply attempt to understand the data and find useful patterns and groups/clusters in it. Of course, it is important to note that the purpose of clustering is to find an identical group and give it a sensible label. For instance, if we are trying to group cars based on their characteristics such as length, width, horsepower, engine cubic capacity, and so on, we may find groups that might be labeled as hatch, sedan, and saloon classes, while another clustering solutions might result in labels of basic, premium, and sports variant groups. The two main problems posed in clustering are the choice of the number of groups and the formation of robust clusters. We consider a simple dataset from the factoextra
R package.