
# 5.11.2. CART – Step 1 - Create one “deep” tree (R action)

Icon:

Function: My_R_CART_Full

Property window:

Short description:

Create a tree using the CART algorithm

Long Description:

The CRT node uses Recursive Partitioning and Regression Trees as implemented in the R rpart library, based on the 1984 book by Breiman, Friedman, Olshen and Stone. The library is the work of Terry Therneau, Beth Atkinson, and Brian Ripley.

A complete introduction to the technique can be found in:

This generates a CART model and sends the following output tables:

Output pin 0: full table with predictions

Output pin 1: error table for selection of pruning parameters

Output pin 2: variable importance

The goal of this analysis is to generate a large tree and identify where to prune it (see the next action node). While this algorithm gained popularity in the early days of data mining, it has now largely fallen out of use.

The main advantage of CART is its ability to combine nominal and continuous variables to predict (typically) nominal groups. Being non-parametric, it adapts to fairly large datasets, and does not suffer from multicollinearity or non-normality of the data. When it was introduced, it was also highlighted that it yields better predictions than CHAID. Another key advantage is that it is intuitive, offering a methodology similar to a drill-down, but with some validation.

The key problem of CRT is that, as trees deepen, we are almost certain to overfit the data (some argue that trees should never go beyond three levels; others go up to 6 or 7 levels). Indeed, since each path down the tree is a succession of conditional probabilities, whatever comes out of a deep model becomes unrealistic, which is why many use combinations of hundreds of small trees (XGBoost, TreeNet, Forest of Stumps, etc.) to overcome this problem. This yields better models, but they are “black boxes”.

In practice, the first cut is often based on particularities of the sample, and the following cuts are dependent on the first one, making such models unstable for predictive purposes. They are, however, useful in terms of data exploration and understanding.

Parameters:

The CRT tree (Classification and Regression Tree) requires the following settings (an illustrative rpart call is given after this list):

Predictors: list of the independent variables we want to use to predict, in the form x1+x2+…+xk

Target: the variable we want to predict (binary, continuous, or multinomial)

Weight: the tree needs to be “balanced”, so it is a good idea to create a weight variable. If all categories are equally represented, this variable is a constant.

Job to do: classification or regression

Model Name: name of the model file. This is required to add pruning, or apply the model as is.

Plot Scree: select whether to plot the Scree Plot to identify where overfitting begins

Plot Tree: select whether to plot the tree

Plot Scree and Tree charts in a single window: self-explanatory

Tree display: simplified or fancy. The simplified one is much faster to display, but will not show text categories properly.

Seed: set a seed value to be able to change the starting point, or reproduce a previous model

Maximum depth of the tree: set how many successive splits are allowed. Note that very deep trees are unstable (overfitting). Typically, more than 10 is not worth exploring.

Minimum size to allow a split: CRT will not split nodes that have fewer observations than this value

Minimum size of a terminal node: when splitting, CRT will attempt to respect this size. If splitting a node would lead to a terminal node smaller than this criterion, the split will not happen

Use surrogates in model: this option allows you to control what happens in case of missing values.

Display Only: an observation with a missing value for the primary split rule is not sent further down the tree.

No Split for all missing: use surrogates, in order, to split observations that are missing the primary variable; if all surrogates are missing, the observation is not split.

Majority Rule (recommended): if all surrogates are missing, send the observation in the majority direction. This corresponds to the recommendations of Breiman et al. (1984).

Maximum Competing Splits: the number of competitor splits retained in the output. It is useful to know not just which split was chosen, but which variable came in second, third, etc.

Maximum Surrogate: the number of surrogate splits retained in the output. If this is set to zero, compute time will be reduced, since approximately half of the computational time (other than setup) is spent searching for surrogate splits. However, this will yield a model that is harder to put into production.
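For reference, here is a minimal R sketch of how these settings map onto an rpart call. This is an illustration, not the exact code the action runs: the data frame mydata, the columns target and x1…x3, the weight computation, and all control values shown are placeholder assumptions.

```r
library(rpart)

set.seed(42)                                  # "Seed": reproduce a previous model

# "Weight" (assumption): inverse class-frequency weights to balance the tree
w <- as.numeric(1 / table(mydata$target)[mydata$target])

fit <- rpart(
  target ~ x1 + x2 + x3,                      # "Target" and "Predictors" (x1+x2+...+xk)
  data    = mydata,
  weights = w,
  method  = "class",                          # "Job to do": "class" or "anova" (regression)
  control = rpart.control(
    maxdepth     = 10,                        # "Maximum depth of the tree"
    minsplit     = 20,                        # "Minimum size to allow a split"
    minbucket    = 7,                         # "Minimum size of a terminal node"
    usesurrogate = 2,                         # "Use surrogates": 0 = display only,
                                              #  1 = no split if all missing, 2 = majority rule
    maxcompete   = 4,                         # "Maximum Competing Splits"
    maxsurrogate = 5,                         # "Maximum Surrogate"
    cp           = 0.0001                     # a small cp keeps the tree "deep"
  )
)

printcp(fit)                                  # the CP table shown in the log below
plotcp(fit)                                   # the Scree plot
fit$variable.importance                       # values behind the variable-importance chart
```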

In the log, you will see the details of the analysis. The Scree plot shows the shape of the error reduction (and gives you a pruning criterion):

Once we observe a flat line, there is no further improvement in the relative error, hence the tree is most likely overfitting by creating additional cuts. In the log, we can see the details of this metric, where two “good” stopping criteria could be 18 splits, where the error seems to stabilize, or 47 splits, based on the minimum cross-validated error:

```
    CP        nsplit  rel error  xerror   xstd
1   0.207568   0      1          1        0.00527
2   0.191278   1      0.79243    0.79243  0.005003
3   0.018632   2      0.60115    0.60771  0.004611
4   0.011306   6      0.52663    0.53087  0.004396
5   0.009137   7      0.51532    0.52335  0.004373
6   0.005584   8      0.50618    0.51878  0.004359
7   0.002423   9      0.5006     0.50392  0.004312
8   0.00223    11     0.49575    0.50032  0.0043
9   0.002123   17     0.48237    0.49931  0.004297
10  0.00203    18     0.48025    0.49931  0.004297
11  0.001384   19     0.47822    0.49557  0.004285
12  0.001206   22     0.47407    0.49142  0.004271
13  0.001015   33     0.45828    0.48874  0.004262
14  0.000969   34     0.45727    0.48869  0.004262
15  0.0009     35     0.4563     0.48929  0.004264
16  0.000715   37     0.4545     0.48565  0.004252
17  0.000692   41     0.45159    0.48325  0.004244
18  0.000554   47     0.44712    0.48205  0.00424
19  0.000538   48     0.44656    0.48463  0.004249
20  0.000531   51     0.44495    0.48708  0.004257
```
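To illustrate how the minimum-error criterion can be read programmatically (the actual pruning is handled by the next action node), here is a short sketch using the fit object from the example above; which.min picks the row with the lowest cross-validated error:

```r
# Illustrative only: select the cp at the minimum cross-validated error (xerror)
cptab   <- fit$cptable
best_cp <- cptab[which.min(cptab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # in the table above, the row with 47 splits
```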

A chart with variable importance is also returned:

You will also find this information in text format in the Anatella log-window.