TrueType-Aware Automatic Column Widths
Gabriel Becker
2025-06-23
Source: vignettes/auto_colwidths.Rmd
Introduction
TrueType fonts (i.e., those where different characters have different printed widths) complicate the calculation of column widths based on the contents of a table or listing, particularly when combined with verbose human-readable column and/or row labels.
junco provides default algorithms for calculating appropriate column widths for both tables and listings when exporting to RTF via tt_to_tlgrtf. These can be invoked explicitly by calling the def_colwidths function on a TableTree or listing_df object, along with a font specification.
Tables
Many tables have column labels many times longer than the data in that column’s cells; the width of cell data tends to be bounded by the fact it is a set of one to three numbers interspersed with punctuation, rather than words as is the case for labels.
Pagination Assumptions
tt_to_tlgrtf allows for horizontal rtables-style pagination, but does not perform vertical pagination; each vertical strip of the table (which results from horizontal pagination) is written to a separate file. The combined_rtf argument indicates whether a single combined RTF file should also be generated by stacking those separate sections of the table into one document (as different table objects).
Algorithm And Optimality Criterion
The column-width algorithm for tables is relatively simple. For table columns, it calculates the widths required so that no cell values will be word-wrapped. This is essentially what formatters::propose_column_widths does, except that propose_column_widths also takes the column labels into account, which we have found in practice to be much wider than the cells. def_colwidths also constrains the maximum width of the row labels to the width (in inches) specified via label_width_ins, with a default of two inches.
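The idea can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: a monospace font (so widths are character counts), a row-label cap given in characters rather than inches, and a hypothetical helper name, sketch_table_widths, that is not part of junco.

```r
# Hypothetical helper, not part of junco: a monospace approximation of the
# table strategy described above.
sketch_table_widths <- function(row_labels, cells, max_label_chars) {
  # Row-label pseudo-column: as wide as the longest label, up to the cap.
  label_width <- min(max(nchar(row_labels)), max_label_chars)
  # Data columns: wide enough for the widest cell value. Column labels are
  # deliberately ignored, so long labels are left to wrap.
  cell_widths <- vapply(cells, function(col) max(nchar(col)), integer(1))
  c(label_width, unname(cell_widths))
}

row_labels <- c("ASIAN", "WITHDRAWAL BY PARENT/GUARDIAN")
cells <- list(
  "Full Drug Name Of Drug X" = c("68 (50.75%)", "3"),
  "The Weird Other Arm" = c("73 (55.30%)", "2")
)
widths <- sketch_table_widths(row_labels, cells, max_label_chars = 20)
widths
```

The verbose column names in cells have no influence on the result; only the cell values and the capped row labels do.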
Examples
We can see this by comparing tables with the same structure and cell values but varying verbosity in their column and row labels.
library(junco)
#> Loading required package: formatters
#>
#> Attaching package: 'formatters'
#> The following object is masked from 'package:base':
#>
#> %||%
#> Loading required package: rtables
#> Loading required package: magrittr
#>
#> Attaching package: 'rtables'
#> The following object is masked from 'package:utils':
#>
#> str
#> Registered S3 method overwritten by 'tern':
#> method from
#> tidy.glm broom
adsl2 <- ex_adsl
adsl2$ARM2 <- adsl2$ARM
levels(adsl2$ARM2) <- c("A", "B", "C")
adsl2$ARM3 <- adsl2$ARM
levels(adsl2$ARM3) <- c("Full Drug Name Of Drug X", "Current Best-Practice Standard Of Care", "The Weird Other Arm")
## col-labels unmodified (middling width)
lyt1 <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("RACE") |>
  summarize_row_groups(format = "xx (xx.xx%)") |>
  analyze("DCSREAS")
tbl1 <- build_table(lyt1, adsl2)
head(tbl1)
#> A: Drug X B: Placebo C: Combination
#> ————————————————————————————————————————————————————————————————————————————
#> ASIAN 68 (50.75%) 67 (50.00%) 73 (55.30%)
#> ADVERSE EVENT 4 4 5
#> LACK OF EFFICACY 5 5 2
#> PHYSICIAN DECISION 2 4 4
#> PROTOCOL VIOLATION 1 7 5
#> WITHDRAWAL BY PARENT/GUARDIAN 3 1 2
## super narrow column labels
lyt2 <- basic_table() |>
  split_cols_by("ARM", labels_var = "ARM2") |>
  split_rows_by("RACE") |>
  summarize_row_groups(format = "xx (xx.xx%)") |>
  analyze("DCSREAS")
tbl2 <- build_table(lyt2, adsl2)
head(tbl2)
#> A B C
#> —————————————————————————————————————————————————————————————————————————
#> ASIAN 68 (50.75%) 67 (50.00%) 73 (55.30%)
#> ADVERSE EVENT 4 4 5
#> LACK OF EFFICACY 5 5 2
#> PHYSICIAN DECISION 2 4 4
#> PROTOCOL VIOLATION 1 7 5
#> WITHDRAWAL BY PARENT/GUARDIAN 3 1 2
## super wide column labels
lyt3 <- basic_table() |>
  split_cols_by("ARM", labels_var = "ARM3") |>
  split_rows_by("RACE") |>
  summarize_row_groups(format = "xx (xx.xx%)") |>
  analyze("DCSREAS")
tbl3 <- build_table(lyt3, adsl2)
head(tbl3)
#> Full Drug Name Of Drug X Current Best-Practice Standard Of Care The Weird Other Arm
#> —————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
#> ASIAN 68 (50.75%) 67 (50.00%) 73 (55.30%)
#> ADVERSE EVENT 4 4 5
#> LACK OF EFFICACY 5 5 2
#> PHYSICIAN DECISION 2 4 4
#> PROTOCOL VIOLATION 1 7 5
#> WITHDRAWAL BY PARENT/GUARDIAN 3 1 2
The default column widths in rtables (implemented via formatters::propose_column_widths) take the maximum width required for a label or value for each column (and the row-label pseudo-column):
propose_column_widths(tbl1)
#> [1] 41 11 11 14
This means that the width of the third column is slightly smaller for tbl2, as the column label is no longer wider than the group summary cell values. The first and second columns remain the same, as the cell value widths were already slightly larger than the labels in tbl1.
propose_column_widths(tbl2)
#> [1] 41 11 11 11
Meanwhile, the verbose column labels in tbl3 result in dramatically wider column widths, as propose_column_widths enforces no wrapping even within column labels:
propose_column_widths(tbl3)
#> [1] 41 24 38 19
In contrast, def_colwidths gives the same widths for the three data columns as propose_column_widths did for tbl2, and does so for all three tables:
def_colwidths(tbl1, fontspec = font_spec(), label_width_ins = 2, col_gap = 0)
#> [1] 30 11 11 11
def_colwidths(tbl2, fontspec = font_spec(), label_width_ins = 2, col_gap = 0)
#> [1] 30 11 11 11
def_colwidths(tbl3, fontspec = font_spec(), label_width_ins = 2, col_gap = 0)
#> [1] 30 11 11 11
We see, however, that the row-label width has been reduced due to the label_width_ins constraint, which we can vary up to the maximum width the row labels need with no wrapping:
## bigger than 2, but not what we got from propose_column_widths
def_colwidths(tbl1, fontspec = font_spec(), label_width_ins = 2.2, col_gap = 0)
#> [1] 33 11 11 11
## bigger than required so we get same row label width as propose_column_widths
def_colwidths(tbl1, fontspec = font_spec(), label_width_ins = 6, col_gap = 0)
#> [1] 41 11 11 11
While we have done these examples with the default monospace font used by rtables and formatters, the difference is often particularly large when using a TrueType font with verbose labels, as many letters have larger print widths than punctuation and numeric digit characters:
fspec_times <- font_spec("Times", 9)
propose_column_widths(tbl3, fontspec = fspec_times)
#> [1] 93 45 65 36
def_colwidths(tbl3, fontspec = fspec_times, label_width_ins = 2, col_gap = 0)
#> [1] 64 20 20 20
We note here that for our (fictional but realistically verbose) column labels in tbl3, the default behavior from formatters will not fit on a single page: even without padding between the columns, those widths take up
sum(propose_column_widths(tbl3, fontspec = fspec_times))
#> [1] 239
space-character widths (which is the unit formatters calculates widths in) while a standard page only has
formatters::page_lcpp(fontspec = fspec_times)$cpp
#> [1] 224
spaces of width available.
The column widths calculated by def_colwidths, however, easily fit on a single page.
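The comparison is simple arithmetic on the numbers printed above. As a quick check, with the values copied by hand from that output rather than recomputed:

```r
# Values copied from the output shown above.
page_cpp <- 224
widths_formatters <- c(93, 45, 65, 36)
widths_junco <- c(64, 20, 20, 20)

# Does each set of widths fit within one page, ignoring column gaps?
fits_formatters <- sum(widths_formatters) <= page_cpp
fits_junco <- sum(widths_junco) <= page_cpp

c(formatters = fits_formatters, junco = fits_junco)
```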
Listings
Listings, unlike tables, often have text in their cell values, sometimes even concatenations of multiple demographic variables into a single column. They also do not have the row-labels pseudo-column present in tables. As such, we need a different, and much more complicated, algorithm to calculate good column widths.
Pagination Assumptions
def_colwidths assumes that listings should not be horizontally paginated, so all columns, and any gaps between them, must fit within the width of a single page.
Optimality Criterion
For listings, we minimize the total number of lines a listing will require to print, including repetition of the table header. This helps control the total size of the resulting RTF file, and generally provides a better reading experience for the listing.
We further constrain our column widths such that no words within cell values will need to be broken up by word wrapping, if possible. We define “words” for this purpose as strings of characters separated by space(s) or “-”.
For this reason, we recommend that values concatenated into a single listing column be separated by, e.g., " / " rather than "/": even though that makes the value slightly longer, it gives the algorithm much more flexibility to find column widths that don’t break up individual “words”.
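A small base R comparison illustrates why the separator matters. The helper longest_word is hypothetical and simply applies the definition of “words” given above (split on spaces or hyphens), assuming a monospace font so that widths are character counts.

```r
tight <- "BLACK OR AFRICAN AMERICAN/F/74"
spaced <- "BLACK OR AFRICAN AMERICAN / F / 74"

# Hypothetical helper: length of the longest unbreakable "word" in a value,
# which is the narrowest column width that avoids breaking a word.
longest_word <- function(x) max(nchar(strsplit(x, "[ -]")[[1]]))

min_width_tight <- longest_word(tight)
min_width_spaced <- longest_word(spaced)
c(tight = min_width_tight, spaced = min_width_spaced)
```

With "/" alone, AMERICAN/F/74 counts as a single word, so the column can never be narrower than that whole string.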
This generally translates to finding widths where, after wrapping, no single column is wrapped many more times than the others within the majority of rows. In practice, we have found that this results in listings that are both legible and aesthetically reasonable.
Algorithm
The algorithm for selecting column widths has two parts. First, for each column individually, all widths that would result in different numbers of total lines for the cells in that column are determined; the constraint that words within cells not be broken up is key here, as it dramatically reduces the number of widths that actually result in different numbers of lines. Second, the space of candidate column widths is searched collectively for the optimal set whose combined width fits within the total available space.
We will use the following data to illustrate:
library(rlistings)
#> Loading required package: tibble
adae <- pharmaverseadam::adae
adae$AEOUT <- gsub("/", " / ", adae$AEOUT)
adsl <- pharmaverseadam::adsl
adsl <- adsl[, c("USUBJID", setdiff(names(adsl), names(adae)))]
lstdat <- merge(adae, adsl, by = "USUBJID")
var_labels(lstdat) <- c(var_labels(adae), var_labels(adsl)[-1])
lstdat$demog <- with_label(paste(lstdat$RACE, lstdat$SEX, lstdat$AGE, sep = " / "), "Demographic Information")
lsting <- as_listing(lstdat, key_cols = c("USUBJID"),
  disp_cols = c("ACTARM", "COUNTRY", "demog", "AESEV", "AEBODSYS", "AEDECOD", "ASTDTM", "AENDTM", "AEOUT", "EOSSTT"))
Candidate Column Widths
For example, the last cell in the demographics column contains the value
demcell <- lstdat$demog[nrow(lstdat)]
demcell
#> [1] "BLACK OR AFRICAN AMERICAN / F / 74"
Broken up according to our definition, it contains the following “words” which must remain whole during column width selection.
wrds <- strsplit(demcell, "[ -]")[[1]]
wrds
#> [1] "BLACK" "OR" "AFRICAN" "AMERICAN" "/" "F" "/"
#> [8] "74"
Assuming a monospace font for simplicity, then, the smallest possible width of the column is the width of its longest word, AMERICAN, i.e., 8.
Using that width, the first two words fit into a line, the third into another, the fourth on its own, and “words” five through eight all fit into a final line, for a total of four lines. We call this “packing” lines:
## width of each packed line: the words in each argument joined by single spaces
packed_widths <- function(...) {
  lst <- list(...)
  nchar(vapply(lst, paste, FUN.VALUE = "", collapse = " "))
}
packed_widths(wrds[1:2],
wrds[3],
wrds[4],
wrds[5:8])
#> [1] 8 7 8 8
Recall that we do not care which words are allocated where, only the total number of lines required. A column width of 10 would allow the fifth word (/) to be packed into the same line as the fourth, giving AMERICAN /, but it results in the same total number of lines, so it is not considered a distinct candidate column width with respect to that cell.
The next column width that results in fewer lines for that cell is one where words one through three are all able to be packed into a single line, with spaces between them, 16 in this case.
With that column width, we get three lines as we do not have enough room for the space required to consolidate the final two lines into one.
packed_widths(wrds[1:3], wrds[4], wrds[5:8])
#> [1] 16 8 8
Increasing the column width to 17, however, allows us to get down to two lines:
packed_widths(wrds[1:3],
wrds[4:8])
#> [1] 16 17
Finally, the last possible width with a different line total is the smallest width that will fit the entire value, i.e., 34.
So for this cell, there are four, and only four, candidate column widths.
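The enumeration above can be reproduced with a short base R sketch. It assumes a monospace font and greedy line packing; the helpers lines_needed and candidate_widths are hypothetical names used for illustration, and junco's internal implementation may differ.

```r
# Hypothetical helpers for illustration; junco's internals may differ.
# Greedily pack words into lines of at most `width` characters and
# return the number of lines used.
lines_needed <- function(words, width) {
  n_lines <- 1L
  used <- 0L
  for (w in nchar(words)) {
    if (used == 0L) {
      used <- w                      # first word on the line
    } else if (used + 1L + w <= width) {
      used <- used + 1L + w          # add a space plus the word
    } else {
      n_lines <- n_lines + 1L        # start a new line
      used <- w
    }
  }
  n_lines
}

candidate_widths <- function(cell) {
  words <- strsplit(cell, "[ -]")[[1]]
  # Try every width from the longest word up to the full value.
  widths <- seq(max(nchar(words)), nchar(cell))
  n_lines <- vapply(widths, function(wd) lines_needed(words, wd), integer(1))
  # Keep only the smallest width for each distinct line count.
  widths[!duplicated(n_lines)]
}

cand <- candidate_widths("BLACK OR AFRICAN AMERICAN / F / 74")
cand
```

Because the line count can only stay the same or drop as the width grows, keeping the first width at which each line count appears yields exactly the distinct candidates.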
Selecting The Optimal Set Of Widths
Once we have the full set of candidate widths for each column individually, the algorithm for selecting the optimal collective set is as follows:
0. Initialize
   a. Remove candidate widths which result in column labels requiring more than the allowable number of lines (default 3)
   b. Initialize with the smallest candidate width for each column
1. Determine the column which requires the largest total number of lines
2. Check if the total space allows for changing to the next candidate width for that column
   a. If it does, select that column width and go to step (1)
   b. Otherwise, end the search and spread any remaining available space equally among the columns
We are able to end the search at step (2b) because even if another column has a candidate width available that would require fewer lines, the total lines for the document are determined solely by the column which requires the most lines, so changing that other column would not affect the outcome.
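As a rough illustration of this search, here is a minimal base R sketch. It is not junco's implementation: pick_widths and the cands structure are hypothetical, the label-height filter is assumed to have been applied already, the search also stops when the worst column has no wider candidate left, and leftover space is spread in whole characters only.

```r
# Hypothetical sketch, not junco's implementation. `cands` has one element per
# column, each with increasing candidate `width`s and the total `lines` that
# column needs at each of them.
pick_widths <- function(cands, avail) {
  idx <- rep(1L, length(cands))            # start at the smallest widths
  cur <- function(field) {
    vapply(seq_along(cands),
           function(i) as.numeric(cands[[i]][[field]][idx[i]]), numeric(1))
  }
  repeat {
    worst <- which.max(cur("lines"))       # column needing the most lines
    nxt <- idx[worst] + 1L
    if (nxt > length(cands[[worst]]$width)) break   # no wider candidate left
    extra <- cands[[worst]]$width[nxt] - cands[[worst]]$width[idx[worst]]
    if (sum(cur("width")) + extra > avail) break    # next width does not fit
    idx[worst] <- nxt                      # accept it and search again
  }
  widths <- cur("width")
  # Spread leftover space equally (whole characters only in this sketch).
  widths + (avail - sum(widths)) %/% length(widths)
}

# Made-up line counts, for illustration only.
cands <- list(
  list(width = c(8, 16, 17, 34), lines = c(400, 300, 200, 100)),
  list(width = c(10, 20), lines = c(250, 120))
)
chosen <- pick_widths(cands, avail = 40)
chosen
```

In this toy input the first column is widened twice, then the second column once, and the search stops when the first column's last candidate (34) no longer fits.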
Example
def_colwidths calls down to listing_column_widths with default values when passed a listing_df object. We will call the latter directly here for explicitness, and to make the column widths more directly comparable via export_as_txt output.
fspec_times8 <- font_spec("Times", 8, 1)
cw <- listing_column_widths(lsting, col_gap = 0, fontspec = fspec_times8, verbose = TRUE)
#> Optimizng Column Widths
#> Initial lines required: 3979
#> Available adjustment: 33 spaces
#> COL 10 width: 26->51 lines req: 3825->1914
#> COL 6 width: 40->48 lines req: 3415->2974
txt <- export_as_txt(lsting, pg_width = inches_to_spaces(8.88, fontspec = fspec_times8),
lpp = NULL, colwidths = cw,
fontspec = fspec_times8, col_gap = 0)
txt2 <- strsplit(txt, "\n", fixed = FALSE)[[1]]
head(txt2)
#> [1] "Unique Subject Description of Analysis Start Analysis End "
#> [2] " Identifier Actual Arm Country Demographic Information Severity/Intensity Body System or Organ Class Dictionary-Derived Term Date/Time Date/Time Outcome of Adverse Event End of Study Status "
#> [3] "————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————"
#> [4] " 01-701-1015 Placebo USA WHITE / F / 63 MILD GENERAL DISORDERS AND ADMINISTRATION SITE APPLICATION SITE ERYTHEMA 2014-01-03 NA NOT RECOVERED / NOT RESOLVED COMPLETED "
#> [5] " CONDITIONS "
#> [6] " Placebo USA WHITE / F / 63 MILD GENERAL DISORDERS AND ADMINISTRATION SITE APPLICATION SITE PRURITUS 2014-01-03 NA NOT RECOVERED / NOT RESOLVED COMPLETED "
length(txt2)
#> [1] 2096
Compare this with giving each column an equal portion of the width (admittedly an ill-conceived strategy):
txtbad <- export_as_txt(lsting, pg_width = inches_to_spaces(8.88, fontspec = fspec_times8),
lpp = NULL, colwidths = rep(floor(320/11), 11),
fontspec = fspec_times8, col_gap = 0)
txt2bad <- strsplit(txtbad, "\n", fixed = TRUE)[[1]]
head(txt2bad)
#> [1] " Unique Subject Identifier Description of Actual Arm Country Demographic Information Severity/Intensity Body System or Organ Class Dictionary-Derived Term Analysis Start Date/Time Analysis End Date/Time Outcome of Adverse Event End of Study Status "
#> [2] "———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————"
#> [3] " 01-701-1015 Placebo USA WHITE / F / 63 MILD GENERAL DISORDERS AND APPLICATION SITE ERYTHEMA 2014-01-03 NA NOT RECOVERED / NOT RESOLVED COMPLETED "
#> [4] " ADMINISTRATION SITE "
#> [5] " CONDITIONS "
#> [6] " Placebo USA WHITE / F / 63 MILD GENERAL DISORDERS AND APPLICATION SITE PRURITUS 2014-01-03 NA NOT RECOVERED / NOT RESOLVED COMPLETED "
length(txt2bad)
#> [1] 2306
So we see that our algorithm saved 9.11 percent of the total lines required by (a set of) naive column widths in this instance.
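The percentage follows directly from the two line counts printed above (copied here by hand):

```r
lines_optimized <- 2096  # length(txt2)
lines_equal <- 2306      # length(txt2bad)

# Relative reduction in total lines, in percent.
pct_saved <- round(100 * (lines_equal - lines_optimized) / lines_equal, 2)
pct_saved
```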