Data Structures

This vignette is partly outdated since nested structure was implemented completely.

This vignette illustrates how the core of styler currently¹ works, i.e. how rules are applied to a parse table and how limitations of this approach can be overcome with a refined approach.

Status quo - the flat approach

Roughly speaking, a string containing code to be formatted is parsed with parse and the output is passed to getParseData in order to obtain a parse table with detailed information about every token. For a simple example string “a <- function(x) { if(x > 1) { 1+1 } else {x} }” to be formatted, the parse table on which styler performs the manipulations looks similar to the one presented below.

library("styler")
library("dplyr")

code <- "a <- function(x) { if(x > 1) { 1+1 } else {x} }"

(parse_table <- styler:::compute_parse_data_flat_enhanced(code))

## # A tibble: 24 x 14
##    line1  col1 line2  col2          token     text terminal short newlines
##    <int> <int> <int> <int>          <chr>    <chr>    <lgl> <chr>    <int>
##  1     1     0     1     0          START                NA  <NA>        0
##  2     1     1     1     1         SYMBOL        a     TRUE     a        0
##  3     1     3     1     4    LEFT_ASSIGN       <-     TRUE    <-        0
##  4     1     6     1    13       FUNCTION function     TRUE funct        0
##  5     1    14     1    14            '('        (     TRUE     (        0
##  6     1    15     1    15 SYMBOL_FORMALS        x     TRUE     x        0
##  7     1    16     1    16            ')'        )     TRUE     )        0
##  8     1    18     1    18            '{'        {     TRUE     {        0
##  9     1    20     1    21             IF       if     TRUE    if        0
## 10     1    22     1    22            '('        (     TRUE     (        0
## # ... with 14 more rows, and 5 more variables: lag_newlines <int>,
## #   spaces <int>, multi_line <lgl>, indention_ref_id <lgl>, indent <dbl>

The column spaces was computed from the columns col1 and col2, newlines was computed from line1 and line2 respectively.

So far, styler can set the spaces around the operators correctly. In our example, that involves adding spaces around +, so in the spaces column, element nine and ten must be set to one. This means that a space is added after 1 and after +. To get the spacing right and cover the various cases, a set of functions has to be applied to the parse table subsequently (and in the right order), which is essentially done via Reduce(). After all modifications on the table are completed, serialize_parse_data() collapses the text column and adds the number of spaces and line breaks specified in spaces and newlines in between the elements of text. If we serialize our table and don’t perform any modification, we obviously just get back what we started with.

styler:::serialize_parse_data_flat(parse_table)

## [1] "a <- function(x) { if(x > 1) { 1+1 } else {x} }"

Refining the flat approach - nesting the parse table

Although the flat approach is good place to start, e.g. for fixing spaces between operators, it has its limitations. In particular, it treats each token the same way in the sense that it does not account for the context of the token, i.e. in which sub-expression it appears. To set the indention correctly, we need a hierarchical view on the parse data, since all tokens in a sub-expression have the same indention level. Hence, a natural approach would be to create a nested parse table instead of a flat parse table and then take a recursion over all elements in the table, so for each sub(-sub etc.)-expression, a separate parse table would be created and the modifications would be applied to this table before putting everything back together. A function to create a nested parse table already exists in styler. Let’s have a look at the top level:

(l1 <- styler:::compute_parse_data_nested(code)[-1])

## # A tibble: 1 x 13
##    col1 line2  col2    id parent token terminal  text short token_before
##   <int> <int> <int> <int>  <int> <chr>    <lgl> <chr> <chr>        <chr>
## 1     1     1    47    49      0  expr    FALSE                     <NA>
## # ... with 3 more variables: token_after <chr>, internal <lgl>,
## #   child <list>

The tibble contains the column child, which itself contains a tibble. If we “enter” the first child, we can see that the expression was split up further.

l1$child[[1]] %>%
  select(text, terminal, child, token)

## # A tibble: 3 x 4
##    text terminal             child       token
##   <chr>    <lgl>            <list>       <chr>
## 1          FALSE <tibble [1 x 14]>        expr
## 2    <-     TRUE            <NULL> LEFT_ASSIGN
## 3          FALSE <tibble [5 x 14]>        expr

And further…

l1$child[[1]]$child[[3]]$child[[5]]

## # A tibble: 3 x 14
##   line1  col1 line2  col2    id parent token terminal  text short
##   <int> <int> <int> <int> <int>  <int> <chr>    <lgl> <chr> <chr>
## 1     1    18     1    18     9     45   '{'     TRUE     {     {
## 2     1    20     1    45    42     45  expr    FALSE            
## 3     1    47     1    47    40     45   '}'     TRUE     }     }
## # ... with 4 more variables: token_before <chr>, token_after <chr>,
## #   internal <lgl>, child <list>

… and so on. Every child that is not a terminal contains another tibble where the sub-expression is split up further - until we are left with tibbles that only contain terminals.

Recall the above example. a <- function(x) { if(x > 1) { 1+1 } else {x} }. In the last printed parse table, we can see that see that the whole if condition is a sub-expression of code, surrounded by two curly brackets. Hence, one would like to set the indention level for this sub-expression before doing anything with it in more detail. Later, when we progressed deeper into the nested table, we hit a similar pattern:

l1$child[[1]]$child[[3]]$child[[5]]$child[[2]]$child[[5]]

## # A tibble: 3 x 14
##   line1  col1 line2  col2    id parent token terminal  text short
##   <int> <int> <int> <int> <int>  <int> <chr>    <lgl> <chr> <chr>
## 1     1    30     1    30    20     30   '{'     TRUE     {     {
## 2     1    32     1    34    27     30  expr    FALSE            
## 3     1    36     1    36    26     30   '}'     TRUE     }     }
## # ... with 4 more variables: token_before <chr>, token_after <chr>,
## #   internal <lgl>, child <list>

Again, we have two curly brackets and an expression inside. We would like to set the indention level for the expression 1+1 in the same way as for the whole if condition.

The simple example above makes it evident that a recursive approach to this problem would be the most natural.

The code for a function that kind of sketches the idea and illustrates such a recursion is given below.

It takes a nested parse table as input and then does the recursion over all children. If the child is a terminal, it returns the text, otherwise, it “enters” the child to find the terminals inside of the child and returns them.

serialize <- function(x) {
  out <- Map(
    function(terminal, text, child) {
      if (terminal)
        text
      else
        serialize(child)
    },
    x$terminal, x$text, x$child
  )
  out
}

x <- styler:::compute_parse_data_nested(code)
serialize(x) %>% unlist

##  [1] "a"        "<-"       "function" "("        "x"        ")"       
##  [7] "{"        "if"       "("        "x"        ">"        "1"       
## [13] ")"        "{"        "1"        "+"        "1"        "}"       
## [19] "else"     "{"        "x"        "}"        "}"

How to exactly implement a similar recursion to not just return each text token separately, but the styled text as one string (or one string per line) is subject to future work, so would be the functions to be applied to a sub-expression parse table that create correct indention. Similar to compute_parse_data_flat_enhanced, the column spaces and newlines would be required to be computed by compute_parse_data_nested as well as a new column indention.

Final Remarks

Although a flat structure would possibly also allow us to solve the problem of indention, it is a less elegant and flexible solution to the problem. It would involve looking for an opening curly bracket in the parse table, set the indention level for all subsequent rows in the parse table until the next opening or closing curly bracket is hit and then intending one level further or setting indention back to where it was at the beginning of the table.

Note that the vignette just addressed the question of indention caused by curly brackets and has not dealt with other operators that would trigger indention, such as ( or +.

[1] at commit e6ddee0f510d3c9e3e22ef68586068fa5c6bc140

at commit e6ddee0f510d3c9e3e22ef68586068fa5c6bc140↩︎

Lorenz Walthert

2022-06-04

Status quo - the flat approach

Refining the flat approach - nesting the parse table

Final Remarks