Playing with post-hoc power with R – why we shouldn’t do it

You can also check this post, written in #blogdown, here: playing-with-post-hoc-power.

Current state of the matter

The reason for bringing this here is that I witnessed an interesting exchange some time ago, regarding an article and its use of post-hoc power, pinpointed by @ADAlthousePhD:

 

The tweets also refer to this great post: Observed power, and what to do if your editor asks for post-hoc power analyses, written by Daniël Lakens, in which this issue is discussed.

At first, I wasn't too interested in this topic (to be honest); but then I read the above-mentioned study, showcasing post-hoc calculations, and a few others that were spreading and being cited (even worse) in a field very, very close to mine: surgery. That's when it got personal.

Almost a year ago, I came across this beautiful paper: The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis, by John Hoenig et al.

This paper patiently explains why post-hoc power is bad. Interestingly, the manuscript aims to be understood by scientists of all backgrounds, not only statisticians. It uses appealing examples and develops them in a didactic, orderly manner, so a clinician (like myself) can grasp those concepts.

Also, the second paragraph begins using the word “Dismayingly”. How graphic.

What can we do about it?

My two cents on this discussion about post-hoc power come below, in the form of a small R tutorial. It shows (yet again), using the rationale in the paper, why it doesn't make sense to get power calculations after hypothesis testing.

I hope this can help those peeps who need a more 'hands-on' (rather than purely abstract) approach to learning stats! R serves as a great, flexible tool for this!

Observed power is a function of the p-value

(One determines the other)

TL;DR: nonsignificant p values always correspond to low observed powers.

This piece of code is inspired by the one written by Lakens, except that it is done with p-values from a Chi-Square test, making use of the pwr::pwr.chisq.test() function to get the power of the study. The only problem is that the effect size in this function is given by w; fortunately, w can be estimated from our contingency table using this formula (from Cohen's book, p. 216):

w = sqrt( Σ (P1i - P0i)² / P0i ), summing over the m cells, where:

  • P0i= the proportion in cell i posited by the null hypothesis,
  • P1i= the proportion in cell i posited by the alternate hypothesis and reflects the effect for that cell, and
  • m= the number of cells.

The reason I say "estimate" rather than "calculate" the range of w values is that we will use the sample proportions (observed, expected) instead of the population values (which we don't know). Regardless of how the effect size is estimated in this case, the results would be the same, so for didactic purposes this should be fine.
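
As a minimal, self-contained sketch of that estimation (the table counts below are made up purely for illustration, and N here is the total count in the table), this is how w can be obtained from a single observed 2×2 table and passed to pwr::pwr.chisq.test():

library(pwr)

# A hypothetical observed 2x2 table of counts (made-up numbers, for illustration only):
# rows = case/control, columns = exposed/unexposed
tab <- matrix(c(60, 40,
                45, 55),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("case", "control"), c("exposed", "unexposed")))

obs_prop <- prop.table(tab)                      # observed cell proportions
exp_prop <- prop.table(chisq.test(tab)$expected) # expected cell proportions under H0

# Cohen's w estimated from the sample proportions
w <- sqrt(sum((obs_prop - exp_prop)^2 / exp_prop))

# "Observed" (post-hoc) power for this table
pwr.chisq.test(w = w, N = sum(tab), df = 1, sig.level = 0.05)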

Defining the problem: genomic data

Let’s start with a simplified case for the simplest possible contingency table: a 2×2 table featuring an exposure and the affected status in a case-control study:

  • The Antithrombin-III-Hamilton disease is characterized by recurrent thromboembolic events (PMID: 3179438).
  • Caused by a DNA mutation exchanging Guanine (G) for Adenine (A) in the first base of codon 382 of the AT-III gene. This impairs its serine protease reactivity, making it less effective at blocking thrombus formation.
  • Thus, we can consider the “A” allele as the exposure, “G” as the absence of exposure, and the “cases” would be those with recurrent thromboembolic events.

We cannot simulate from a normal distribution with rnorm, as we would for t-tests. To simulate datasets with different table proportions, we will set a range of different frequency combinations for two of the cells:

library(tidyverse)
# Set basic parameters
lower_prob_range=0.3
upper_prob_range=0.7
prob_step=0.01
probabilities <- seq(from = lower_prob_range, to = upper_prob_range, by = prob_step)

# Create dataframe with all the possible effect sizes
mydata0 <- expand.grid(probabilities,probabilities) %>%
   tibble::as_tibble() %>%
   rename(
      `prob of exposure in cases` = Var1,
      `prob of exposure in controls` = Var2
   )

Next, we define a function that does all the calculations and stores the results in a list (the code is as inelegant and explicit as possible):

power_against_pvalue <- function(probabilities_expo_case=0.5, probabilities_expo_control=0.5, N=1000, exposure="exposed", unexposed="unexposed"){
   require(pwr)
   require(magrittr)

   table_exposure <- c(exposure, unexposed)
   
   # Make a data frame
   cases <- tibble(
      status = "case",
      allele = sample( table_exposure, N, replace=TRUE, prob=c(probabilities_expo_case, 1-probabilities_expo_case) )
   )
   controls <- tibble(
      status = "control",
      allele = sample( table_exposure, N, replace=TRUE, prob=c(probabilities_expo_control, 1-probabilities_expo_control) )
   )
   study <- cases %>%
      bind_rows(controls)
   
   # Observed counts
   .TableOC <- table(study$status, study$allele)
   # Observed proportions (row)
   .TableOPr <- prop.table(.TableOC, 1) # proportions by row
   # Observed proportions (table)
   .TableOP <- prop.table(.TableOC) # proportions by table
   # Chi-squared test
   xsq_test <- .TableOC %>% chisq.test()
   # Expected table counts:
   .TableEC <- xsq_test$expected
   # Expected table proportions:
   .TableEP <- prop.table(.TableEC) # proportions by table
   # get p-value
   xsq_pvalue <- xsq_test$p.value
   
   # Observed counts:
   H_n <- count_observed_ctrls_unexpo <- .TableOC[2,2]
   D_n <- count_observed_cases_unexpo <- .TableOC[1,2]
   H_e <- count_observed_ctrls_expo <- .TableOC[2,1] # A is the exposure
   D_e <- count_observed_cases_expo <- .TableOC[1,1] # A is the exposure
   
   # Get the OR
   OR <- ( D_e / H_e ) / ( D_n / H_n )
   
   # Observed table proportions
   pr_observed_ctrls_unexpo <- .TableOP[2,2]
   pr_observed_case_unexpo <- .TableOP[1,2]
   pr_observed_ctrls_expo <- .TableOP[2,1]
   pr_observed_case_expo <- .TableOP[1,1]
   
   # Expected table proportions:
   pr_expected_ctrls_unexpo <- .TableEP[2,2]
   pr_expected_case_unexpo <- .TableEP[1,2]
   pr_expected_ctrls_expo <- .TableEP[2,1]
   pr_expected_case_expo <- .TableEP[1,1]
   
   # Cohen's w
   w <- (
      (
         (pr_expected_ctrls_unexpo - pr_observed_ctrls_unexpo)^2 / pr_expected_ctrls_unexpo
      ) + (
         (pr_expected_ctrls_expo - pr_observed_ctrls_expo)^2 / pr_expected_ctrls_expo
      ) + (
         (pr_expected_case_unexpo - pr_observed_case_unexpo)^2 / pr_expected_case_unexpo
      ) + (
         (pr_expected_case_expo - pr_observed_case_expo)^2 / pr_expected_case_expo
      )
   )^(1/2)
   
   # Power
   pw <- pwr.chisq.test(
      w = w, 
      N = N, 
      df = 1, 
      sig.level = 0.05, 
      power = NULL
   )
   
   # Final result
   result0 <- list(
      prob_case_in_exposed = probabilities_expo_case,
      prob_control_in_exposed = probabilities_expo_control,
      N = N,
      w = w,
      power = pw$power,
      p_value = xsq_pvalue,
      OR = OR,
      observed_counts = .TableOC,
      expected_counts = .TableEC,
      observed_proportions = .TableOP,
      expected_proportions = .TableEP
   )
   result0
   
}

# A helper function to extract the results into new columns
power_against_pvalue_extractor <- function(power_versus_pvalue, extracted=NULL){
   if(is.null(extracted)){
      result <- power_versus_pvalue
   }else{
      result <- power_versus_pvalue[[extracted]]
   }
   result
}

We can execute this function and see the output:

power_against_pvalue(
   probabilities_expo_case=0.5, 
   probabilities_expo_control=0.5,
   N=1000, 
   exposure="A=exposed", 
   unexposed="G=unexposed" 
)
   $prob_case_in_exposed
   [1] 0.5
   
   $prob_control_in_exposed
   [1] 0.5
   
   $N
   [1] 1000
   
   $w
   [1] 0.02300609
   
   $power
   [1] 0.1124906
   
   $p_value
   [1] 0.3250515
   
   $OR
   [1] 1.096436
   
   $observed_counts
            
             A=exposed G=unexposed
     case          523         477
     control       500         500
   
   $expected_counts
            
             A=exposed G=unexposed
     case        511.5       488.5
     control     511.5       488.5
   
   $observed_proportions
            
             A=exposed G=unexposed
     case       0.2615      0.2385
     control    0.2500      0.2500
   
   $expected_proportions
            
             A=exposed G=unexposed
     case      0.25575     0.24425
     control   0.25575     0.24425

And now we can map this function to all the combinations of proportions we had stored in our data frame (using the wonderful purrr package):

mydata <- mydata0 %>%
   # calculate
   mutate(
      power_versus_pvalue = purrr::map2(
         `prob of exposure in cases`, `prob of exposure in controls`, 
         ~power_against_pvalue(.x, .y, N=1000, exposure="A=exposed", unexposed="G=unexposed")
         )
   ) %>%
   # extract elements
   mutate(
      power = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="power"),
      pvalue = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="p_value"),
      OR = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="OR"),
      cohen_w = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="w")
   ) %>%
   unnest(cols = c(power, pvalue, OR, cohen_w))
mydata %>% select(-power_versus_pvalue, -cohen_w) %>% head(3) %>% kable()
prob of exposure in cases prob of exposure in controls power pvalue OR
0.30 0.3 0.0500000 1.0000000 1.000000
0.31 0.3 0.1439164 0.2275967 1.128824
0.32 0.3 0.1547361 0.2025910 1.138560

We can see the power that corresponds to a p-value of 0.05:

mydata %>%
   filter(
      pvalue > 0.049 & pvalue < 0.051
   ) %>%
   select(-power_versus_pvalue, -cohen_w) %>% kable()
prob of exposure in cases prob of exposure in controls power pvalue OR
0.40 0.44 0.2925585 0.0508424 0.8339090
0.46 0.50 0.2962194 0.0490027 0.8350416
0.48 0.50 0.2960685 0.0490725 0.8351346
0.65 0.69 0.2930812 0.0508489 0.8263437

 

The power corresponding to a p-value of 0.05 is between 0.29 and 0.30 (exact results will vary each time the code runs; for didactic purposes I won't use set.seed here). In Hoenig's and Lakens' examples, using different tests, the correspondence was between p = 0.05 and a power of 0.5.

We can see how plotting power against the p-value (regardless of sample size; you can try with a different one) always yields the same relationship:

power_vs_pvalue <- mydata %>%
   ggplot(
      aes(
         x=power,
         y=pvalue,
         colour=pvalue
      )
   ) +
   geom_point() + 
   geom_hline(yintercept=0.05) + geom_vline(xintercept=0.29) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

power_vs_pvalue

Conclusion

Once you have set your data and statistical test to compute a p-value, your power is already fixed. Hence, the observed power doesn't add information and cannot be further interpreted, as higher p-values will always correspond to lower powers.

Hoenig's paper elaborates on more reasons why calculated (post-hoc) power cannot be interpreted (and hence used). The Discussion section is definitely worth reading.

Bonus: Plotting power & p-value against OR and Cohen's w

Here are 4 graphs corresponding to 1000 patients. Unlike the 'power versus p-value' representation, these are influenced by sample size. With bigger sample sizes:

  • As effect size increases, power increases more steeply and the p-value decreases more quickly,
  • OR values that correspond either to p-values under 0.05 or to power values over 0.8 are closer to 1.

or_vs_pvalue <- mydata %>%
   ggplot(
      aes(x=OR, y=pvalue, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.05) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

w_vs_pvalue <- mydata %>%
   ggplot(
      aes(x=cohen_w, y=pvalue, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.05) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

or_vs_power <- mydata %>%
   ggplot(
      aes(x=OR, y=power, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.8) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

w_vs_power <- mydata %>%
   ggplot(
      aes(x=cohen_w, y=power, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.8) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

cowplots <- list(or_vs_pvalue, w_vs_pvalue, or_vs_power, w_vs_power) 
mylabels <- letters[1:length(cowplots)]
composed_figure <- cowplot::plot_grid(plotlist = cowplots,
                 labels = mylabels, nrow = 2, align = "h") 

composed_figure

Different sample sizes

This .Rmd source code

You can download it from here

Introducing RStudio

whateverblog.

It’s been an exciting week for me! On Monday, we finally took the wraps off the product that I’ve been working on since leaving Microsoft 18 months ago.

First, a little background: There’s a programming language called R that is taking the world of statistical computing and data analysis by storm. If you’ve never heard of R, you can read more about it in this New York Times article. But the gist of it is:

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”

We’re trying to make R a better language by giving it a better IDE. It’s called RStudio. It’s free as in beer, and free as in speech. You…

View the original post: 282 more words

R Blogdown Setup in GitHub (2)

An updated tutorial to set up a blogdown blog

You can also check this post, written in #blogdown, here: r-blogdown-setup-in-github-2.

This is an updated blog post from the previous version: R Blogdown Setup in GitHub.

Inspired by this great blog by Elio Campitelli: eliocamp.github.io/codigo-r, and his settings for the hugo-tranquilpeak theme, by Louis Barranqueiro & Thibaud Leprêtre.

The changes I made involve many modifications to paths and directories (some of them very subtle), which invalidate much of my previous tutorial. The changes I basically made are:

  • I still use GitHub Pages for hosting. However, I changed the rendering folder from master to docs.
  • Made some other fixes to the config.toml file.

The steps I followed are detailed here:

Git & GitHub repos

  •  Set up a GitHub account, following for example this.
  •  Set up a new GitHub repo with a name of your choice (in my case talesofr). See this and this.
  •  Activate GitHub Pages. For this, you must go to the /settings section of your repo and find the section "GitHub Pages". In the dropdown menu, select: master branch /docs folder, and save.

Activating GitHub Pages

  • Create a git local repo in your machine:
    • Create manually a new directory called like your repository, e.g. talesofr.
    • Run in the terminal (Windows users have to install git first):
    cd /Git/talesofr # your path may be different
    git init # initiates repo in the directory
    git remote add origin https://github.com/[USERNAME]/talesofr # connects git local repo to remote Github repo
    git pull origin master # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded to your machine
  • You can add a .gitignore text file to the root directory of your repo. This tells git which files not to add to the repo, even if they are in the folder. One example of a .gitignore file is this (see also the sketch after this list).
  • For now, your repo is ready. We will now focus on creating & customising our Blogdown.
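
As an illustration (just an assumption of what you may want to exclude; adapt it to your own setup), a minimal .gitignore for an R/blogdown repo could simply list the files that RStudio generates locally:

.Rproj.user
.Rhistory
.RData
.Ruserdata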

RStudio and blogdown setup

  • We will open RStudio (my version is 1.1.419).
    • First, you may need to install Blogdown in R (see also the note after these steps on installing Hugo itself):
    install.packages("blogdown")
    • In RStudio, select the Menu > File > New Project following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project should be available in your RStudio version.
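
Blogdown also needs the Hugo binary itself to build the site. If it is not installed yet, it can be installed from the R console (a sketch; by default blogdown picks a recent Hugo version):

blogdown::install_hugo() # downloads and installs Hugo for blogdown to use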

Creating new Project

Selecting Hugo Blogdown format

Selecting Hugo Blogdown theme

A config.toml file appears


Customising paths and styles

Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.

Modify config.toml file

To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:

  • We need to set up the base URL to the “root” of the web page (https://[USERNAME].github.io/[REPO_NAME] in this case).
  • By default, the web page is published in the /public folder. We need it to be published in the /docs folder of the repository (we must create it if it doesn't exist yet), to match the master branch /docs folder setting of GitHub Pages (we'll see what that means):
baseURL = "https://aurora-mareviv.github.io/talesofr/"
publishDir = "docs"
  • Other useful global settings:
draft: false # if set to true, changes will not be published
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$"]
enableEmoji = true
...

Images & styling paths

We can revisit the config.toml file to make changes to the default settings.

The logo file that appears in the corner must be placed at the /docs/logo.png path. To modify it in the config.toml:

picture = "logo.png" # the path to the logo

The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images. To modify it in the config.toml:

coverImage = "myimage.jpg"

We want some custom css and js. We need to locate them in /static/css/my-style.css and in /static/js/myjs.js respectively.

  # Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
  # they have to be referred from static root. Example
   [[params.customCSS]]
     href = "css/my-style.css"

  # Custom JS. Put here your custom JS files. They are loaded after the theme JS;
  # they have to be referred from static root. Example
   [[params.customJS]]
     src = "js/myjs.js"

Custom css

Now, we can add arbitrary classes to our css file! (see above).

Since I started writing in Bootstrap, I have grown to love it. Since this theme already has Bootstrap classes, I brought in some others I didn't find in the theme (they're available for .md files, but currently not for .Rmd).

Here is an example of a custom css file that can be copied to the file /static/css/my-style.css:

/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme!*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/

.input-lg {
  font-size: 30px;
}
.input {
  font-size: 20px;
}
.font-sm {
    font-size: 0.7em;
}
.texttt {
  font-family: monospace;
}
.alert {
padding: 15px;
margin-bottom: 20px;
border: 1px solid transparent;
border-radius: 4px;
}
.alert-success {
color: #3c763d;
background-color: #dff0d8;
border-color: #d6e9c6;
}
.alert-danger,
.alert-error {
  color: #b94a48;
  background-color: #f2dede;
  border-color: #eed3d7;
}
.alert-info {
  color: #3a87ad;
  background-color: #d9edf7;
  border-color: #bce8f1;
}
.alert-gray {
  background-color: #f2f3f2;
  border-color: #f2f3f2;
}

/*style for printing*/
@media print {
    .noPrint {
       display:none;
   }
    }

/*link formatting*/
a:link {
    color: #478ca7;
    text-decoration: none;
} 
a:visited {
    color: #478ca7;
    text-decoration: none;
}
a:hover {
    color: #82b5c9;
    text-decoration: none;
}

Also, we have font-awesome icons!

Update! Hugo shortcodes in .Rmd !

Searching through blogdown/issues on GitHub, I found this good trick by Yihui:
You can add Hugo shortcodes if you wrap them with the function htmltools::HTML:

htmltools::HTML("{{< hl-text danger >}}
              this is a highlighted danger text
              {{< /hl-text >}}")

Custom javascript

We can also add really cool functions to enhance our post! (see the blogdown version of this post).

Here is the code of the button:

<div class="button well alert alert-danger text-center" id="myButton" onclick="alertColor()">
  <span id="toInfo"> <span class="fa fa-minus-circle"></span>alert-danger </span>
</div>

And here is the javascript function that you can copy into /static/js/myjs.js:

function alertColor() {
  var result = "<span class='fa fa-info-circle'></span>&nbsp;alert-info";
  document.getElementById("toInfo").innerHTML = result; 
  document.getElementById("myButton").style.backgroundColor = "#d9edf7";
  document.getElementById("myButton").style.color = "#3a87ad";
  document.getElementById("myButton").style.borderColor = "#bce8f1";
}

Site build with blogdown

Once our theme is ready, we can add some content, modifying or deleting the various examples we will find in /content/post.

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, on the right side of the IDE, you can examine the resulting html and see if something didn't go OK.

Deploying the site

Updating the local git repository

This can be done with simple git commands:

cd /Git/[REPO_NAME] # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Starting my Hugo blog" # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

Just go to the page https://[USERNAME].github.io/[REPO_NAME] and enjoy your blog!


Add R code

Works just the same as in Rmarkdown. R code is compiled into an html file and published as static web content in a few steps. Welcome to the era of reproducible blogging!

Figure 1 uses the ggplot2 library:

library(ggplot2)
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

Figure 1: diamonds plot with ggplot2.


A new post in blogdown

I know there is a procedure to create a new post using the interface in RStudio (via the "Addins" button in the RStudio IDE), but this manual one will also work (see also the console alternative further below):

I copied a blank .Rmd file into the folder /content/post. I gave it a different name than the other post, with the date and some explanatory text (I called it “2017-08-22-new-post.Rmd”).

Then, I added an appropriate YAML heading (similar to the one in the previous post, but changing dates and times). Beware, as the YAML header is indent-sensitive:

---
title: "New Post in Blogdown"
author: "1"
date: 2017-08-22T23:41:14-05:00
draft: false
categories: ["R"]
tags: ["R Markdown", "blogdown", "#rstats"]
thumbnailImagePosition: left
thumbnailImage: ./images/logo.png
metaAlignment: center
disable_comments: true
output:
  blogdown::html_page:
    toc: false
    fig_width: 8
    css: "/css/my-style.css"
--- 

Finally I added some content to the Rmarkdown and saved the changes.
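
For completeness, the "Addins" route mentioned above roughly corresponds to calling blogdown's helper from the console; a minimal sketch (whether it creates an .md or an .Rmd skeleton depends on your blogdown version and options):

blogdown::new_post(title = "New Post in Blogdown") # creates a new post skeleton under /content/post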

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, on the right side of the IDE, you can examine the resulting html and see if something didn't go OK.

Deploying the site with the new post

Updating the local git repository

This can be done with simple git commands:

cd /Git/[REPO_NAME] # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Adding a new post" # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

You can rinse and repeat this procedure for the rest of the posts.


This .Rmd source code

You can download it from here

 

An introduction to joint modeling in R

By J Espasandin, O Lado, C Díaz, A Bouzas, I Guler, A Baluja. 

You can also check this post, written in #blogdown, here: intro-joint-modeling-r.

Between the 19th and 21st of February, the learning activity titled "An Introduction to the Joint Modeling of Longitudinal and Survival Data, with Applications in R" took place, organized by the Interdisciplinary Group of Biostatistics (ICBUSC) and directed by Professor Carmen Cadarso-Suárez, from the University of Santiago de Compostela.

The activity had a markedly international character, with researchers from different European countries such as Germany, Portugal, the Netherlands, Greece and Turkey. It was also interdisciplinary, with attendees from different fields of research, such as statistics, biology, medicine, ecology and bioinformatics, coming from universities, biomedical institutions and industry.

The training activity was taught by Professor Dimitris Rizopoulos of the Erasmus University Medical Center in Rotterdam, a specialist in joint-modeling techniques. Professor Rizopoulos is the author of a book on joint modeling, as well as numerous publications and two related R packages: JM and JMbayes.

The joint modeling techniques presented during the scientific meeting allow for the simultaneous study of longitudinal and time-to-event data. Longitudinal data include repeated measurements of individuals over time, and time-to-event data represent the expected time before an event occurs (like death, an asthma crisis or a transplant). That combination of data frequently arises in the biomedical sciences, where it is common to analyze the evolution of a sick person over time.

This novel statistical tool is especially useful in the field of biomedicine: for instance, in patient follow-up studies after surgery; to design a personalised pattern of medical visits; to make predictions of survival based on the evolution of a patient, or to update those predictions in light of new data; to identify useful biomarkers; or to predict the outcome of patients with different chronic diseases such as diabetes, some types of cancer or cardiovascular disease.

The applicability of these models has been illustrated through the JM and JMbayes R packages (by D Rizopoulos), as well as the packages joineR (by Philipson et al.) and lcmm (by Proust-Lima et al.).

An overview of joint modeling

It basically combines (joins) the probability distributions from a linear mixed-effects model with random effects (which takes care of the longitudinal data) and a survival Cox model (which calculates the hazard ratio for an event from the censored data). The whole model and its parts can be extended in several ways:

  • To find latent population heterogeneity (latent class joint models).
  • Allow for multiple longitudinal markers.
  • Allow for the analysis of multiple failure times. This is the case of competing risks and recurrent events (for instance, when a child develops asthma attacks, to find the risk of recurrence).
  • Time-Dependent accelerated failure time (AFT) Models.
  • Dynamic predictions when new values are added for the longitudinal variable, using Maximum Likelihood Estimates and empirical Bayes estimates.

Also, the JM package has functions for discrimination and calibration (of a single marker, and between models): sensitivity & specificity, time-dependent ROCs and AUCs.

Applications for joint modeling

Citing D. Rizopoulos:

Joint models for longitudinal and time-to-event data have become a valuable tool in the analysis of follow-up data. These models are applicable mainly in two settings: First, when the focus is on the survival outcome and we wish to account for the effect of an endogenous time-dependent covariate measured with error, and second, when the focus is on the longitudinal outcome and we wish to correct for nonrandom dropout.

Summary

When do we need joint models for longitudinal and survival outcomes?

  • To handle endogenous time-varying covariates in a survival analysis context
  • To account for nonrandom dropout in a longitudinal data analysis context

How do joint models work?

  • A mixed model for the longitudinal outcome
  • A relative risk model for the event process
  • Explain interrelationships with shared random effects

Last but not least… a dynamic prediction GIF!

library(JM)
# Animation example 
# Mixed-effects model fit
lmeFit.p1 <- lme(log(pro) ~ time + time:treat, data = prothro,
    random = ~ time | id)  

# Cox survival model fit
survFit.p1 <- coxph(Surv(Time, death) ~ treat, data = prothros, x = TRUE)  

# Joint model
jointFit.p1 <- jointModel(lmeFit.p1, survFit.p1, timeVar = "time",
    method = "piecewise-PH-aGH")

# We are interested in producing predictions of survival probabilities for Patient 155
dataP155 <- prothro[prothro$id == 155, ]
len_id <- nrow(dataP155)

# We can plot the data
sfit3 <- survfitJM(jointFit.p1, newdata = dataP155[1:3, ]) 
sfit4 <- survfitJM(jointFit.p1, newdata = dataP155[1:4, ]) 

par(mfrow=c(1,2))
plotfit3 <- plot(sfit3, estimator="mean", include.y = TRUE, conf.int=0.95, fill.area=TRUE, col.area="lightblue", main="Patient 155")
plotfit4 <- plot(sfit4, estimator="mean", include.y = TRUE, conf.int=0.95, fill.area=TRUE, col.area="lightblue", main="Patient 155")

library(animation)
saveGIF({
  for(i in c(1:len_id)){
      sfit <- survfitJM(jointFit.p1, newdata = dataP155[1:i, ]) 
      plot(sfit, estimator="mean", include.y = TRUE, conf.int=0.95, fill.area=TRUE, col.area="lightblue", main="Patient 155")
      
  }
},ani.width = 400, ani.height=400)

A great crowd over there!

 

A minimal Project Tree in R

You can also check this post, written in #blogdown, here: minimal-project-tree-r.

Introduction

Over the last two days, some discussions arrived at my Twitter feed about how bad the following sentences are at the beginning of your R script/notebook, sparked by @JennyBryan's slides at the IASC-ARS/NZSA Conference:

setwd()

and

rm(list = ls())

Jenny Bryan offered a detailed explanation for this, as well as some fixes, in her tidyverse blog post. The main idea was:

  • To ensure reproducibility within a stable working directory tree. She proposes the very concise here::here(), but other methods are available, such as the template or the ProjectTemplate packages (see the sketch after this list).
  • To avoid wreaking havoc on other people's computers with rm(list = ls())!
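
As a minimal sketch of the here::here() idea (assuming the here package is installed and that the script lives inside an RStudio project or git repo containing a data folder):

library(here)
# here() finds the project root, so paths keep working no matter where the script is run from
mydata <- read.csv(here("data", "mydata.csv"))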

All of this buzz around project self-containment and reproducibility motivated me to finish a minimal directory tree that (with some variations) I have been using for this year’s data analysis endeavours.

It is an extremely simple tree which separates a /data, a /plot and an /img directory inside the main folder (root):

  • The data folder contains both raw data and processed data files saved by R.
  • The plot folder contains all the plots saved during the workflow.
  • The img folder has every other image (logos, etc) that R takes as an input to build the results.
  • Inside the root folder I store the main .R or .Rmd scripts.

This ensures that every folder has a unidirectional relationship with the root folder (except the data dir in this case). But the important thing is that the paths in the scripts are set relative to the root folder, so the entire tree can be copied elsewhere and still work as expected.

I also added some more features to the tree:

  • An .Rproj file
  • A parametrized .Rmd file
  • Git repository so the tree can be conveniently cloned or downloaded, with a .gitignore file:

Here is a sketch of how it works:

And here is the actual code of the notebook/script. I have not included regular markdown text outside the R chunks, as this template is intended to be changed and filled with new text each time:

Script code

# Installs missing libraries on render!
list.of.packages <- c("rmarkdown", "dplyr", "ggplot2", "Rcpp", "knitr", "Hmisc", "readxl")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos='https://cran.rstudio.com/')
library(dplyr)
library(knitr)
library(ggplot2)

Working directories

# directory where the notebook is
wdir <- getwd() 
# directory where data are imported from & saved to
datadir <- file.path(wdir, "data") # better than datadir <- paste(wdir, "/data", sep="")
# directory where external images are imported from
imgdir <- file.path(wdir, "img")
# directory where plots are saved to
plotdir <- file.path(wdir, "plot")
# the folder immediately above root
wdirUp <- dirname(wdir)

Data import

# Data name (stored as a parameter in the Rmarkdown notebook)
params <- NULL
params$dataname <- "cars"
# MSEXCEL
dataname <- params$dataname # archive name
routexl <- paste(datadir, "/", dataname, ".xlsx", sep="")  # complete route to archive

library(readxl)
mydata <- read_excel(routexl, sheet = 1)  # imports first sheet
# CSV / TSV (separated by tabs in this example)
dataname <- params$dataname # archive name
routecsv <- paste(datadir, "/", dataname, ".csv", sep="")  # complete route to archive

mydata <- read.csv(routecsv, 
         header = TRUE, 
         sep = "\t",
         dec = ".")

Data operations

# Hmisc::describe(mydata)
head(mydata)
     speed dist
   1     4    2
   2     4   10
   3     7    4
   4     7   22
   5     8   16
   6     9   10
p1 <- ggplot(mydata, aes(x=speed, y=dist)) + geom_point()
p1

Save plots

# TO PDF 
plotname1 <- "p1.pdf"
# TO PNG 
plotname2 <- "p1.png"

routeplot1 <- file.path(plotdir, plotname1)
routeplot2 <- file.path(plotdir, plotname2)
ggsave(routeplot1)  # (see http://ggplot2.tidyverse.org/reference/ggsave.html)
ggsave(routeplot2) 

Save data

# RDATA
save(mydata, file="data/mydata.RData")
# MSEXCEL # not run
dataname2 <- "mydata"  # name we will give to file
routexl2 <- paste(datadir, "/", dataname2, ".xlsx", sep="")   # complete route to future archive

library(xlsx)
write.xlsx(mydata, routexl2) # creates archive in specified route
# CSV / TSV (separated by tabs in this example)
dataname2 <- "mydata"  # name we will give to file
routecsv2 <- paste(datadir, "/", dataname2, ".csv", sep="")  # complete route to future archive

write.table(mydata, file = routecsv2, append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = TRUE)

This script (and the dir tree that contains it) is saving me a lot of time and headaches (where did I put that data?…). I hope it can also be useful for people out there!

Future improvements

 

Taming exam results in pdf with pdftools

You can also check this post, written in #blogdown, here: taming-exam-results-with-pdf.

Introduction

There are several ways to mine tables and other content from a pdf using R. After a lot of trial & error, here's how I managed to extract global exam results from an international, massive, yearly examination: the EDAIC.

This is my first use case of "pdf mining" with R, and also a fairly simple one. However, more complex and very fine examples of this can be found elsewhere, using both the pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: "FAIL / PASS". I was particularly interested in seeing how many of them passed the exam, as some indirect measure of how "hard" it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools")
library(pdftools) 
txt <- pdf_text("EDAIC.pdf") 
txt[1] 
class(txt[1]) 
  [1] "EDAIC Part I 2017                                                  Overall Results\n                                         Candidate N°       Result\n                                            107131            FAIL\n                                            119233            PASS\n                                            123744            FAIL\n                                            127988            FAIL\n                                            133842            PASS\n                                            135692            PASS\n                                            140341            FAIL\n                                            142595            FAIL\n                                            151479            PASS\n                                            151632            PASS\n                                            152787            PASS\n                                            157691            PASS\n                                            158867            PASS\n                                            160211            PASS\n                                            161970            FAIL\n                                            162536            PASS\n                                            163331            PASS\n                                            164442            FAIL\n                                            164835            PASS\n                                            165734            PASS\n                                            165900            PASS\n                                            166469            PASS\n                                            167241            FAIL\n                                            167740            PASS\n                                            168151            FAIL\n                                            168331            PASS\n                                            168371            FAIL\n                                            168711            FAIL\n                                            169786            PASS\n                                            170721            FAIL\n                                            170734            FAIL\n                                            170754            PASS\n                                            170980            PASS\n                                            171894            PASS\n                                            171911            PASS\n                                            172047            FAIL\n                                            172128            PASS\n                                            172255            FAIL\n                                            172310            PASS\n                                            172706            PASS\n                                            173136            FAIL\n                                            173229            FAIL\n                                            174336            PASS\n                                            174360            PASS\n                                            175177            FAIL\n                                            175180            FAIL\n                                            175184            FAIL\nYour candidate number is indicated on your admission document        Page 1 of 52\n"
  [1] "character"

These commands return a lengthy blob of text. Fortunately, there are some \n symbols that signal the new lines in the original document.

We will use these to split the blob into something more approachable, using tidyversal methods…

  • Split the blob.
  • Transform the resulting list into a character vector with unlist.
  • Trim leading white spaces with stringr::str_trim.
library(tidyverse) 
library(stringr) 
tx2 <- strsplit(txt, "\n") %>% # divide by carriage returns
  unlist() %>% 
  str_trim(side = "both") # trim white spaces
tx2[1:10]
   [1] "EDAIC Part I 2017                                                  Overall Results"
   [2] "Candidate N°       Result"                                                         
   [3] "107131            FAIL"                                                            
   [4] "119233            PASS"                                                            
   [5] "123744            FAIL"                                                            
   [6] "127988            FAIL"                                                            
   [7] "133842            PASS"                                                            
   [8] "135692            PASS"                                                            
   [9] "140341            FAIL"                                                            
  [10] "142595            FAIL"
  • Remove the very first row.
  • Transform into a tibble.
tx3 <- tx2[-1] %>% 
  data_frame() 
tx3
  # A tibble: 2,579 x 1
                             .
                         <chr>
   1 Candidate N°       Result
   2    107131            FAIL
   3    119233            PASS
   4    123744            FAIL
   5    127988            FAIL
   6    133842            PASS
   7    135692            PASS
   8    140341            FAIL
   9    142595            FAIL
  10    151479            PASS
  # ... with 2,569 more rows
  • Use tidyr::separate to split each row into two columns.
  • Remove all spaces.
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%  
  mutate(key = gsub('\\s+', '', key)) %>%
  mutate(value = gsub('\\s+', '', value)) 
tx4
  # A tibble: 2,579 x 2
           key    value
         <chr>    <chr>
   1 Candidate N°Result
   2    107131     FAIL
   3    119233     PASS
   4    123744     FAIL
   5    127988     FAIL
   6    133842     PASS
   7    135692     PASS
   8    140341     FAIL
   9    142595     FAIL
  10    151479     PASS
  # ... with 2,569 more rows
  • Remove rows that do not represent table elements.
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] 
tx5
  # A tibble: 2,424 x 2
        key value
      <chr> <chr>
   1 107131  FAIL
   2 119233  PASS
   3 123744  FAIL
   4 127988  FAIL
   5 133842  PASS
   6 135692  PASS
   7 140341  FAIL
   8 142595  FAIL
   9 151479  PASS
  10 151632  PASS
  # ... with 2,414 more rows

Extracting the results

We already have the table! now it’s time to get to the summary:

library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise (count = n()) %>%
  mutate(percent = paste( round( (count / sum(count)*100) , 1), "%" )) %>% 
  kable()
value count percent
FAIL 1017 42 %
PASS 1407 58 %

From these results we see that the EDAIC Part 1 exam doesn't have a particularly high pass rate. It is currently taken by medical specialists, but its difficulty lies in the very broad list of subjects covered, ranging from topics in applied physics to the entire human physiology, pharmacology, clinical medicine and the latest guidelines.

Despite being a hard test to pass (and despite the exam fee), it's becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.

 

 

Starting a Rmarkdown Blog with Blogdown + Hugo + Github

Finally, after 24h of failed attempts, I got my favourite Hugo theme up and running with RStudio and Blogdown.

All the steps I followed are detailed in my new Blogdown entry, which is also a GitHub repo. However, there is an updated version of the tutorial here:

https://talesofr.wordpress.com/2018/02/26/r-blogdown-setup-in-github-2/

After exploring some alternatives, like Shirin's (with Jekyll), and Amber Thomas' advice (which involved Git skills beyond my basic abilities), I was able to install Yihui's hugo-lithium-theme in a new repository.
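
For reference, a fresh site with a given theme can also be created from the R console; a minimal sketch (the exact theme repository name is an assumption and may have changed since):

blogdown::new_site(theme = "yihui/hugo-lithium-theme") # creates a new Hugo site using that theme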

However, I wanted to explore other blog templates hosted on GitHub, like:

The first three themes are currently linked in the blogdown documentation as being the most simple and easy to set up for inexperienced blog programmers, but I hope the list will grow in the following months. For those who are willing to experiment, the complete list is here.

Finally I chose the hugo-tranquilpeak theme, by Thibaud Leprêtre, for which I mostly followed Tyler Clavelle’s entry on the topic. This approach turned out to be easy and good, given some conditions:

  • Contrary to Yihui Xie’s advice, I felt brave enough to choose github.io to host my blog, instead of Netlify (I love my desktop integration with GitHub, so it was interesting for me not to move to another service for my static content).
  • In my machine, I installed Blogdown & Hugo using R studio (v 1.1.336).
  • In GitHub, it was easier for me to host the blog directly in my main GitHub Pages repository (always named [USERNAME].github.io), in the master branch, following Tyler's tutorial.
  • I have basic knowledge of html, css and javascript, so I didn't mind tinkering around with the theme.
  • My custom styles didn’t involve theme rebuilding. At this moment they’re simple cosmetic tricks.

The steps I followed were:

Git & GitHub repos

  •  Set up a GitHub repo with the name [USERNAME].github.io (in my case aurora-mareviv.github.io). See this and this.
  • Create a git repo in your machine:
    • Create manually a new directory called [USERNAME].github.io.
    • Run in the terminal (Windows users have to install git first):
    cd /Git/[USERNAME].github.io # your path may be different
    
    git init # initiates repo in the directory
    git remote add origin https://github.com/[USERNAME]/[USERNAME].github.io # connects git local repo to remote Github repo
    
    git pull origin master # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded
  • For now, your repo is ready. We will now focus on creating & customising our Blogdown.

RStudio and blogdown

  • We will open RStudio (v 1.1.336, development version as of today).
    • First, you may need to install Blogdown in R:
    install.packages("blogdown")
    • In RStudio, select the Menu > File > New Project following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project may not yet be available in your RStudio version (probably not for much longer).

Creating new Project

Selecting Hugo Blogdown format

Selecting Hugo Blogdown theme

A config.toml file appears


Customising paths and styles

Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.

Modify config.toml file

To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:

  • We need to set up the base URL to the “root” of the web page (https://USERNAME.github.io/, in this case)
  • By default, the web page is published in the "public" folder. We need it to be published in the root of the repository, to match the structure of the GitHub master branch:
baseurl = "/./" 
publishDir = "./"
  • Other useful global settings:
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$"]
enableEmoji = true

Images & styling paths

We can revisit the config.toml file to make changes to the default settings.

The logo that appears in the corner must be in the root folder. To modify it in the config.toml:

picture = "logo.png" # the path to the logo

The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images . To modify it in the config.toml:

coverImage = "myimage.jpg"

We want some custom css and js. We need to locate them in /static/css and in /static/js respectively.

# Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
# they have to be referred from static root. Example
customCSS = ["css/my-style.css"]

# Custom JS. Put here your custom JS files. They are loaded after the theme JS;
# they have to be referred from static root. Example
customJS = ["js/myjs.js"]

Custom css

We can add arbitrary classes to our css file (see below).

Since I started writing in Bootstrap, I miss it a lot. Since this theme already has Bootstrap classes, I brought in some others I didn't find in the theme (they're available for .md files, but currently not for .Rmd).

Here is my custom css file to date:

/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/

.input-lg {
  font-size: 30px;
}
.input {
  font-size: 20px;
}
.font-sm {
    font-size: 0.7em;
}
.texttt {
  font-family: monospace;
}
.alert {
padding: 15px;
margin-bottom: 20px;
border: 1px solid transparent;
border-radius: 4px;
}
.alert-success {
color: #3c763d;
background-color: #dff0d8;
border-color: #d6e9c6;
}
.alert-danger,
.alert-error {
  color: #b94a48;
  background-color: #f2dede;
  border-color: #eed3d7;
}
.alert-info {
  color: #3a87ad;
  background-color: #d9edf7;
  border-color: #bce8f1;
}
.alert-gray {
  background-color: #f2f3f2;
  border-color: #f2f3f2;
}

/*style for printing*/
@media print {
    .noPrint {
       display:none;
   }
    }

/*link formatting*/
a:link {
    color: #478ca7;
    text-decoration: none;
} 
a:visited {
    color: #478ca7;
    text-decoration: none;
}
a:hover {
    color: #82b5c9;
    text-decoration: none;
}

Also, we have font-awesome icons!

Site build with blogdown

Once our theme is ready, we can add some content, modifying or deleting the various examples we will find in /content/post.

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, on the right side of the IDE, you can examine the resulting html and see if something didn't go OK.

Deploying the site

Updating the local git repository

This can be done with simple git commands:

cd /Git/[USERNAME].github.io # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Starting my Hugo blog" # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

Just go to the page https://USERNAME.github.io and enjoy your blog!


R code

Works just the same as in Rmarkdown. R code is compiled into an html file and published as static web content in a few steps. Welcome to the era of reproducible blogging!

Figure 1 uses the ggplot2 library:

library(ggplot2)
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

Figure 1: diamonds plot with ggplot2.

Rmd source code

You can download it from here

I, for one, welcome the new era of reproducible blogging!

Updated tutorial

I updated (& hopefully improved) this tutorial, which you can check here.

