Cut Cut Cut

library(Dmisc)

# Synthetic data for the examples.
set.seed(123)
df <- data.frame(
  sex = rep(c('M', 'F'), each = 500),
  age = c(sample(20:60, 500, TRUE), sample(30:70, 500, TRUE))
)

When working with data, we often need to analyze numerical variables. In many cases, basic summary statistics such as the mean, sum, minimum, and maximum are sufficient, especially when examining how these variables relate to categorical variables.

When the situation requires comparing two numerical variables, we can use measures such as correlation, covariance, or regression to characterize the relationship between them. However, we often need to convert numerical variables into categorical ones to capture and highlight differences among demographic or population groups.

For this purpose, the statistical programming language R provides the cut function. According to its documentation, cut converts a numeric variable into a factor.[1]
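As a quick illustration of what converting to a factor means, the following minimal sketch applies base R's cut() to a small made-up vector. The values and the names ages and age_groups are assumptions chosen only for illustration.

```r
# Hypothetical ages, chosen only for illustration.
ages <- c(23, 35, 47, 59, 68)

# Each value is replaced by the interval it falls into.
age_groups <- cut(ages, breaks = c(20, 40, 60, 80))

is.factor(age_groups)  # TRUE
levels(age_groups)     # "(20,40]" "(40,60]" "(60,80]"
table(age_groups)      # 2 values in (20,40], 2 in (40,60], 1 in (60,80]
```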

Moreover, third-party packages extend the cut function with additional functionality. One example is the cut3 function from the Dmisc package. Next, we will examine how these functions work and compare cut3 with the available alternatives.

The first and fundamental difference between cut3 and the other functions designed for this purpose is that cut3 operates on a data.frame, while the others operate on the numeric vector of interest.

In terms of advantages, this can mean greater flexibility and efficiency in some cases. Instead of having to isolate a numerical vector and work with it individually, cut3 allows the user to operate directly on the complete data.frame. This can be particularly useful in situations where multiple columns need to be manipulated or analyzed simultaneously, as will be seen later.

However, this functionality also presents some disadvantages. The main one is that cut3 overwrites the variable in question within the data.frame, and it is not intuitive how to assign the result to a new variable.[2]
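One way to keep the original values is sketched below. The data.frame people and the column names age_group and age_raw are hypothetical. The cut3 call is left commented out and merely mirrors the call form used in this article, so only the base R lines are meant to run as written.

```r
# Hypothetical data, similar in shape to the df used in this article.
set.seed(1)
people <- data.frame(
  sex = rep(c("M", "F"), each = 5),
  age = sample(20:70, 10, TRUE)
)

# With cut(), the result is a vector, so it can go into any new column.
people$age_group <- cut(people$age, breaks = 4)

# With a function that overwrites the column in place, copy the column first.
people$age_raw <- people$age
# people <- Dmisc::cut3(people, var_name = "age", breaks = 4)  # not run here

str(people)
```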

Cuts

The most important argument of these functions, after the data, is breaks. This argument consists of an indication of how the intervals in the resulting categorical variable are constructed.

A Single Number

When the breaks argument is a single number (an integer greater than or equal to 2),[3] it is interpreted as the number of intervals into which the numerical variable should be divided when converting it into a categorical variable.

df2 <- df
df2$age <- cut(df2$age, breaks = 4)
table(df2)
#>    age
#> sex (19.9,32.5] (32.5,45] (45,57.5] (57.5,70]
#>   F          31       171       144       154
#>   M         155       159       150        36

In this case, except for a slight difference in terms of syntax, the same result could be achieved using cut3.

df3 <- Dmisc::cut3(
  df, 
  var_name = 'age', 
  breaks = 4
)
table(df3)
#>    age
#> sex (19.9,32.5] (32.5,45] (45,57.5] (57.5,70]
#>   F          31       171       144       154
#>   M         155       159       150        36
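To see where these interval limits come from, here is a minimal sketch with base R's cut() on a small hypothetical vector. As footnote 3 notes, the range of the data is split into segments of equal length; cut() also moves the two outer limits slightly outward so that the minimum and maximum are not left out.

```r
# Hypothetical ages spanning 20 to 70.
ages <- c(20, 28, 33, 45, 52, 61, 70)

groups <- cut(ages, breaks = 4)
nlevels(groups)  # 4 intervals
anyNA(groups)    # FALSE: the minimum (20) and maximum (70) are both binned

# Equally spaced limits over the observed range, before cut() widens the two ends.
limits <- seq(min(ages), max(ages), length.out = 5)
limits  # 20.0 32.5 45.0 57.5 70.0
```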

A Numeric Vector

When the breaks argument is passed as a numeric vector, the values contained in this vector are interpreted as the cut points for constructing the intervals into which the variable in question will be divided.

df2 <- df
df2$age <- cut(
  df2$age, 
  breaks = c(0, 20, 40, 60, 80, 100)
)
table(df2)
#>    age
#> sex (0,20] (20,40] (40,60] (60,80] (80,100]
#>   F      0     137     245     118        0
#>   M      5     235     260       0        0

Again, the same can be achieved using the cut3() function.

df3 <- df
df3 <- cut3(
  df3, 
  var_name = "age", 
  breaks = c(0, 20, 40, 60, 80, 100)
)
table(df3)
#>    age
#> sex (0,20] (20,40] (40,60] (60,80] (80,100]
#>   F      0     137     245     118        0
#>   M      5     235     260       0        0

It is important to note that this vector of values must start with a value less than the minimum of the variable and end with a value greater than or equal to the maximum of the variable. This is due to how the intervals are constructed.

Note the label (0,20] of the first level. The opening parenthesis means the lower limit is open, so only values strictly greater than 0 are included. The closing bracket means the upper limit is closed, so values less than or equal to 20 are included. You can close the first interval at its lower limit with the include.lowest argument.
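The effect of open and closed limits is easiest to see on boundary values. The following is a minimal sketch with base R's cut(), using a hypothetical vector x. The right argument belongs to cut() itself and is not discussed elsewhere in this article.

```r
x <- c(0, 20, 21)
brks <- c(0, 20, 40)

# Default: intervals are (a, b], so 0 itself is not covered.
default_cut <- as.character(cut(x, brks))
default_cut  # NA "(0,20]" "(20,40]"

# include.lowest = TRUE closes the first interval at its lower limit.
closed_cut <- as.character(cut(x, brks, include.lowest = TRUE))
closed_cut  # "[0,20]" "[0,20]" "(20,40]"

# right = FALSE flips the convention to [a, b).
left_closed <- as.character(cut(x, brks, right = FALSE))
left_closed  # "[0,20)" "[20,40)" "[20,40)"
```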

If the vector of cut points does not meet the criteria described above, values that do not fall into any interval are marked as NA in the resulting variable. Additionally, in cut3(), setting the .inf argument to TRUE extends the provided cut points with -Inf and Inf, so every value falls into some interval, even if it lies outside the range of the supplied cut points. This is useful when the minimum and maximum of the variable are not known beforehand and you want to include all values without setting specific limits.

cut3(
  df, 
  var_name = "age", 
  breaks = c(20, 40, 60, 80)
) # Will leave NAs: age 20 is not in (20,40]

cut3(
  df, 
  var_name = "age", 
  breaks = c(20, 40, 60, 80), 
  include.lowest = TRUE
) # Will not leave NAs: the first interval becomes [20,40]

cut3(
  df, 
  var_name = "age", 
  breaks = c(20, 40, 60, 80), 
  .inf = TRUE
) # Will not leave NAs: cut points extended with -Inf and Inf
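For comparison, the same idea can be reproduced with base R's cut() by adding -Inf and Inf to the cut points by hand. This is a sketch of the concept on hypothetical values, not a description of how cut3() implements .inf.

```r
x <- c(15, 20, 35, 65, 90)
brks <- c(20, 40, 60, 80)

# 15 and 90 lie outside the cut points, and 20 sits on the open lower limit.
bounded <- cut(x, breaks = brks)
sum(is.na(bounded))    # 3

# Extending the cut points with -Inf and Inf gives every value an interval.
unbounded <- cut(x, breaks = c(-Inf, brks, Inf))
sum(is.na(unbounded))  # 0
nlevels(unbounded)     # 5
```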

Finally, the values in this vector must be unique. This matters most when the cut points are not assigned manually but computed from the data (for example, from quantiles), since repeated values can then appear.
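A small sketch of how repeated cut points can arise, using base R only. The vector x is hypothetical and deliberately has many tied values. Dropping duplicates with unique() is one possible remedy, at the cost of ending up with fewer intervals.

```r
# Six of the ten values are tied at 1, so several quantiles coincide.
x <- c(rep(1, 6), 2, 3, 4, 5)
q <- quantile(x)
anyDuplicated(q) > 0  # TRUE: the 0%, 25% and 50% quantiles are all 1

# cut() refuses repeated break points.
res <- tryCatch(cut(x, breaks = q), error = function(e) "error")
res  # "error"

# Dropping the duplicates works, but leaves two intervals instead of four.
groups <- cut(x, breaks = unique(q), include.lowest = TRUE)
table(groups)
```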

Functions

One of the main innovations of cut3() compared to cut() is that the breaks argument can be a function that generates the numeric vector of cut points. Perhaps the most common use case is dividing the variable by quantiles.

Any additional arguments for the function that builds the break points can be supplied through the bf_args argument.

table(
  cut3(
    df, 
    var_name = "age", 
    breaks = quantile
  )
)
#>    age
#> sex (20,36] (36,45] (45,55] (55,70]
#>   F      76     126     126     172
#>   M     195     114     128      58

In the example code above, the cut3() function is used to divide the “age” variable in the df data.frame into groups or bins based on quantiles. Here, breaks is set as quantile, which means that the quantiles of the “age” distribution are used to define the break points.

An interesting feature of this approach is that it allows for greater flexibility in defining the groups. For example, instead of dividing the variable into intervals of equal width, you can divide it into groups based on the distribution of the data. This can be particularly useful when dealing with variables that have a skewed distribution or are highly concentrated in certain ranges.

Furthermore, by allowing breaks to be a function, cut3() offers the possibility of dynamically generating break points based on the data. This can facilitate the creation of more robust and adaptable analysis, as there is no need to define the break points beforehand.

Finally, it should be mentioned that the bf_args argument allows you to pass additional arguments to the breaks function. This offers even more flexibility, as you can customize the breaks function to suit your specific needs. For example, you could change the quantiles used to divide the variable, or you could use a completely different function to generate the break points.
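The general idea of passing a function plus extra arguments can be sketched in base R. The helper cut_with() and its args parameter below are hypothetical, written only to illustrate the concept; they are not the implementation of cut3(), nor necessarily the form that bf_args expects.

```r
# Hypothetical helper: compute the break points with break_fun, then cut.
cut_with <- function(x, break_fun, args = list(), ...) {
  brks <- do.call(break_fun, c(list(x), args))
  cut(x, breaks = unique(brks), ...)
}

set.seed(123)
age <- sample(20:70, 200, TRUE)

# Terciles instead of the default quartiles of quantile().
terciles <- cut_with(
  age,
  quantile,
  args = list(probs = c(0, 1/3, 2/3, 1)),
  include.lowest = TRUE
)
table(terciles)
```

Because the break points are computed inside the helper, the same call adapts to whatever data it receives.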

Dmisc::cut3_quantile

The Dmisc package already includes the cut3_quantile function, which is a variant of cut3 designed specifically to divide a variable into quantiles.

cut3_quantile takes a data.frame, a variable, and a set of probabilities that define the quantiles. If no probabilities are specified, by default the first quartile, median, and third quartile are used. It then calls cut3 with R’s quantile function as an argument for the break points.

table(
  cut3_quantile(
    df, 
    var_name = "age"
  )
)
#>    age
#> sex (-Inf,36] (36,45] (45,55] (55, Inf]
#>   F        76     126     126       172
#>   M       200     114     128        58

In the example code above, cut3_quantile divides the “age” variable in the df data.frame into quartiles.

Therefore, if you want to divide a variable into quantiles, you can directly use cut3_quantile instead of passing quantile as an argument for breaks in cut3.
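Based on the description above, a rough base R counterpart might look like the following sketch: quartiles as interior cut points, extended with -Inf and Inf as in the output shown. The data are hypothetical, and this only approximates the described behavior; it is not the code of cut3_quantile.

```r
set.seed(123)
age <- sample(20:70, 1000, TRUE)

# First quartile, median and third quartile as interior cut points.
inner <- quantile(age, probs = c(0.25, 0.5, 0.75))

# Open-ended outer intervals, so no value is left unbinned.
quartile_group <- cut(age, breaks = c(-Inf, inner, Inf))

levels(quartile_group)
table(quartile_group)
```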

Groups

Another addition to cut3() is the ability to define different cuts for different groups. This is particularly useful when break points are specified using functions.

table(
  cut3(
    df, 
    var_name = "age", 
    breaks = 4, 
    .inf = FALSE
  )
)
#>    age
#> sex (19.9,32.5] (32.5,45] (45,57.5] (57.5,70]
#>   F          31       171       144       154
#>   M         155       159       150        36
table(
  cut3(
    df, 
    var_name = "age", 
    breaks = 4, 
    .inf = FALSE, 
    groups = "sex"
  )
)
#>    age
#> sex (20,30] (30,40] (40,50] (50,60] (60,70]
#>   F       0     137     122     123     118
#>   M     134     106     141     119       0

In the previous code example, cut3() is used to divide the “age” variable in the data.frame df into groups, but the cuts are defined separately for each value of the “sex” variable. This means that the groups generated for “age” will be different for males and females, which can be very useful in analyses that require taking into account differences between groups.

This feature of cut3() is very powerful as it allows adapting the break points to the specific characteristics of each group. In many cases, the distribution of a variable can significantly vary among different groups, and using the same break points for all groups might result in an inaccurate representation of the data.

For instance, imagine that you are analyzing the age of participants in a study and you discover that the distribution of ages is quite different for males and females. If you use the same break points for both groups, you might end up with bins that contain many females but few males, or vice versa. By allowing you to define break points separately for each group, cut3() enables you to avoid this problem and ensure a more accurate representation of the data for each group.

In addition, this feature is particularly useful when break points are specified using functions, as these functions can adapt to the specific characteristics of each group. For example, you could use quantiles to define the break points, which would ensure that the bins contain approximately the same proportion of observations within each group, regardless of differences in the data distributions.
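The per-group idea can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. The data.frame people, the helper quartile_index() and the column age_quartile are hypothetical, and this only approximates what the groups argument does; it is not cut3()'s implementation.

```r
set.seed(123)
people <- data.frame(
  sex = rep(c("M", "F"), each = 500),
  age = c(sample(20:60, 500, TRUE), sample(30:70, 500, TRUE))
)

# Quartile index (1 to 4), computed from the quantiles of the vector it receives.
quartile_index <- function(x) {
  cut(x, breaks = quantile(x), include.lowest = TRUE, labels = FALSE)
}

# ave() calls quartile_index() separately for each sex.
people$age_quartile <- ave(people$age, people$sex, FUN = quartile_index)

# Rows: sex; columns: within-group quartile.
counts <- table(people$sex, people$age_quartile)
counts
```

Since the cut points are recomputed for each sex, quartile 1 for males and quartile 1 for females cover different age ranges.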

Weights

TODO

Labels

As you may have noticed in the previous examples, the labels of the resulting variable are built using the corresponding interval notation. However, this behavior can be modified by passing labels = FALSE. In this case, consecutive integers (1, 2, 3, ...) are used to name the intervals.

table(
  cut3(
    df, 
    var_name = "age", 
    breaks = 4, 
    labels = FALSE
  )
)
#>    age
#> sex   1   2   3   4
#>   F  31 171 144 154
#>   M 155 159 150  36

Furthermore, you can provide a vector specifying the labels you want to use in constructing the variable. Labels will be assigned in the order they are specified. Additionally, the number of labels must be exactly equal to the number of resulting bins.

table(
  cut3(
    df, 
    var_name = "age", 
    breaks = 4, 
    labels = c(
      "20 - 32 years", 
      "33 - 45 years", 
      "46 - 57 years", 
      "58 - 70 years"
    )
  )
)
#>    age
#> sex 20 - 32 years 33 - 45 years 46 - 57 years 58 - 70 years
#>   F            31           171           144           154
#>   M           155           159           150            36

This can be useful if you wish to customize the labels to make them more descriptive or easier to understand. For instance, you could use labels that describe the characteristics of individuals in each group, or you could use labels that are more consistent with the terminology used in your field of study. This can make your results clearer and easier to interpret, for both you and other people who may be working with your data.
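The labeling modes can be compared side by side with base R's cut(). The ages and label names below are hypothetical. Note that in base R, labels = FALSE returns plain integer codes rather than a factor.

```r
ages <- c(22, 38, 51, 67)
brks <- c(20, 35, 50, 70)

# Integer codes instead of interval labels.
codes <- cut(ages, breaks = brks, labels = FALSE)
codes  # 1 2 3 3

# Custom labels: one per interval, in order.
named <- cut(ages, breaks = brks, labels = c("young", "middle", "older"))
as.character(named)  # "young" "middle" "older" "older"

# A label vector of the wrong length is rejected.
bad <- tryCatch(
  cut(ages, breaks = brks, labels = c("young", "older")),
  error = function(e) "error"
)
bad  # "error"
```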

Conclusion

The cut3 function from the Dmisc package in R offers advantages and disadvantages. Compared to functions like cut, cut3 can offer greater flexibility and efficiency by working directly with data.frames rather than just with numeric vectors. This can be particularly helpful when you need to manipulate or analyze multiple columns at once.

However, it also has drawbacks: it overwrites the original variable within the data.frame, and assigning the result to a new variable is less intuitive. In practice, the choice between cut and cut3 will largely depend on the context and the specific needs of your analysis.

Lastly, it is always recommended to understand the differences and peculiarities of the different tools available before making a decision on which method to use to convert numeric variables into categorical ones.


  1. In R, a factor is a categorical variable type. It is similar to a text (character) variable type, but only takes a finite set of values (levels).

  2. This can be achieved by simply creating a variable that is a copy of the original before making the cut.

  3. When breaks is specified in this way, the variable is divided into segments of equal length. For more details, refer to the cut() function documentation.