Tales of R

…messing around with free code


Playing with post-hoc power with R – why we shouldn’t do it

You can also check this post, written in #blogdown, here: playing-with-post-hoc-power.

Current state of the matter

The reason for bringing this here is that I witnessed an interesting exchange some time ago, regarding an article and its use of post-hoc power, pinpointed by @ADAlthousePhD:

 

The tweets also refer to this great post: Observed power, and what to do if your editor asks for post-hoc power analyses, written by Daniël Lakens, in which this issue is discussed.

At first, I wasn’t too interested in this topic (to be honest); but then I read the above-mentioned study, showcasing post-hoc calculations, and a few others that were spreading and being cited (even worse) in a field very, very close to mine: surgery. That’s when it got personal.

Almost a year ago, I came across this beautiful paper: The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis, by John Hoenig and Dennis Heisey.

This paper patiently explains why post-hoc power is bad. Interestingly, the manuscript aims to be understood by scientists of all backgrounds, not only statisticians. It uses appealing examples and develops them in a didactic, orderly manner, so a clinician (like myself) can grasp those concepts.

Also, the second paragraph begins using the word “Dismayingly”. How graphic.

What can we do about it?

My two cents on this discussion about post-hoc power come below, in the form of a small R tutorial. It shows (yet again), using the rationale in the paper, how it doesn’t make sense to get power calculations after hypothesis testing.

I hope this can help those peeps that need a more ‘hands-on’ (rather than purely abstract) approach to learning stats! R serves as a great, flexible tool for this!

Observed power is a function of the p-value

(One determines the other)

TL;DR: non-significant p-values always correspond to low observed power.

This piece of code is inspired by the one written by Lakens, except that it works with p-values from a chi-square test, making use of the pwr::pwr.chisq.test() function to get the power of the study. The only catch is that the effect size in this function is given by w; fortunately, w can be estimated from our contingency table using this formula (from Cohen’s book, p. 216):

w = √( Σi (P1i − P0i)² / P0i ), with the sum running over the m cells, where:

  • P0i = the proportion in cell i posited by the null hypothesis,
  • P1i = the proportion in cell i posited by the alternative hypothesis (it reflects the effect for that cell), and
  • m = the number of cells.

The reason I say “estimate” rather than “calculate” the range of w values is that we will use the sample proportions (observed, expected) instead of the population values (which we don’t know). Regardless of how the effect size is estimated, the results in this case would be the same, so for didactic purposes this should be fine.
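As a quick cross-check (not in the original post), the pwr package also ships a helper, ES.w1(), that implements this same formula; here is a minimal sketch with made-up cell proportions:

library(pwr)
# Hypothetical 2x2 table proportions (the four cells sum to 1); these numbers are made up
P0 <- c(0.25, 0.25, 0.25, 0.25) # proportions posited by the null hypothesis
P1 <- c(0.30, 0.20, 0.25, 0.25) # proportions posited by the alternative hypothesis
# Manual computation of w from the formula above
w_manual <- sqrt(sum((P1 - P0)^2 / P0))
# Same value via pwr::ES.w1()
w_pwr <- ES.w1(P0, P1)
c(w_manual, w_pwr) # both ~0.14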

Defining the problem: genomic data

Let’s start with a simplified case for the simplest possible contingency table: a 2×2 table featuring an exposure and the affected status in a case-control study:

  • The Antithrombin-III-Hamilton disease is characterized by recurrent thromboembolic events (PMID: 3179438).
  • It is caused by a DNA mutation exchanging Guanine (G) for Adenine (A) in the first base of codon 382 of the AT-III gene. This impairs its serine protease reactivity, making it less effective at blocking thrombus formation.
  • Thus, we can consider the “A” allele as the exposure, “G” as the absence of exposure, and the “cases” would be those with recurrent thromboembolic events.

Unlike with t-tests, we cannot simulate from an rnorm distribution. To simulate datasets with different table proportions, we will set a range of different frequency combinations for two of the cells:

library(tidyverse)
# Set basic parameters
lower_prob_range=0.3
upper_prob_range=0.7
prob_step=0.01
probabilities <- seq(from = lower_prob_range, to = upper_prob_range, by = prob_step)

# Create dataframe with all the possible effect sizes
mydata0 <- expand.grid(probabilities,probabilities) %>%
   tibble::as_tibble() %>%
   rename(
      `prob of exposure in cases` = Var1,
      `prob of exposure in controls` = Var2
   )

Next, we define a function that does all the calculations and stores the results in a list (the code is as inelegant and explicit as possible):

power_against_pvalue <- function(probabilities_expo_case=0.5, probabilities_expo_control=0.5, N=1000, exposure="exposed", unexposed="unexposed"){
   require(pwr)
   require(magrittr)

   table_exposure <- c(exposure, unexposed)
   
   # Make a data frame
   cases <- tibble(
      status = "case",
      allele = sample( table_exposure, N, replace=TRUE, prob=c(probabilities_expo_case, 1-probabilities_expo_case) )
   )
   controls <- tibble(
      status = "control",
      allele = sample( table_exposure, N, replace=TRUE, prob=c(probabilities_expo_control, 1-probabilities_expo_control) )
   )
   study <- cases %>%
      bind_rows(controls)
   
   # Observed counts
   .TableOC <- table(study$status, study$allele)
   # Observed proportions (row)
   .TableOPr <- prop.table(.TableOC, 1) # proportions by row
   # Observed proportions (table)
   .TableOP <- prop.table(.TableOC) # proportions by table
   # Chi-squared test
   xsq_test <- .TableOC %>% chisq.test()
   # Expected table counts:
   .TableEC <- xsq_test$expected
   # Expected table proportions:
   .TableEP <- prop.table(.TableEC) # proportions by table
   # get p-value
   xsq_pvalue <- xsq_test$p.value
   
   # Observed counts:
   H_n <- count_observed_ctrls_unexpo <- .TableOC[2,2]
   D_n <- count_observed_cases_unexpo <- .TableOC[1,2]
   H_e <- count_observed_ctrls_expo <- .TableOC[2,1] # A is the exposure
   D_e <- count_observed_cases_expo <- .TableOC[1,1] # A is the exposure
   
   # Get the OR
   OR <- ( D_e / H_e ) / ( D_n / H_n )
   
   # Observed table proportions
   pr_observed_ctrls_unexpo <- .TableOP[2,2]
   pr_observed_case_unexpo <- .TableOP[1,2]
   pr_observed_ctrls_expo <- .TableOP[2,1]
   pr_observed_case_expo <- .TableOP[1,1]
   
   # Expected table proportions:
   pr_expected_ctrls_unexpo <- .TableEP[2,2]
   pr_expected_case_unexpo <- .TableEP[1,2]
   pr_expected_ctrls_expo <- .TableEP[2,1]
   pr_expected_case_expo <- .TableEP[1,1]
   
   # Cohen's w
   w <- (
      (
         (pr_expected_ctrls_unexpo - pr_observed_ctrls_unexpo)^2 / pr_expected_ctrls_unexpo
      ) + (
         (pr_expected_ctrls_expo - pr_observed_ctrls_expo)^2 / pr_expected_ctrls_expo
      ) + (
         (pr_expected_case_unexpo - pr_observed_case_unexpo)^2 / pr_expected_case_unexpo
      ) + (
         (pr_expected_case_expo - pr_observed_case_expo)^2 / pr_expected_case_expo
      )
   )^(1/2)
   
   # Power
   pw <- pwr.chisq.test(
      w = w, 
      N = N, 
      df = 1, 
      sig.level = 0.05, 
      power = NULL
   )
   
   # Final result
   result0 <- list(
      prob_case_in_exposed = probabilities_expo_case,
      prob_control_in_exposed = probabilities_expo_control,
      N = N,
      w = w,
      power = pw$power,
      p_value = xsq_pvalue,
      OR = OR,
      observed_counts = .TableOC,
      expected_counts = .TableEC,
      observed_proportions = .TableOP,
      expected_proportions = .TableEP
   )
   result0
   
}

# A helper function to extract the results into new columns
power_against_pvalue_extractor <- function(power_versus_pvalue, extracted=NULL){
   if(is.null(extracted)){
      result <- power_versus_pvalue
   }else{
      result <- power_versus_pvalue[[extracted]]
   }
   result
}

We can execute this function and see the output:

power_against_pvalue(
   probabilities_expo_case=0.5, 
   probabilities_expo_control=0.5,
   N=1000, 
   exposure="A=exposed", 
   unexposed="G=unexposed" 
)
   $prob_case_in_exposed
   [1] 0.5
   
   $prob_control_in_exposed
   [1] 0.5
   
   $N
   [1] 1000
   
   $w
   [1] 0.02300609
   
   $power
   [1] 0.1124906
   
   $p_value
   [1] 0.3250515
   
   $OR
   [1] 1.096436
   
   $observed_counts
            
             A=exposed G=unexposed
     case          523         477
     control       500         500
   
   $expected_counts
            
             A=exposed G=unexposed
     case        511.5       488.5
     control     511.5       488.5
   
   $observed_proportions
            
             A=exposed G=unexposed
     case       0.2615      0.2385
     control    0.2500      0.2500
   
   $expected_proportions
            
             A=exposed G=unexposed
     case      0.25575     0.24425
     control   0.25575     0.24425

And now we can map this function to all the combinations of proportions we had stored in our data frame (using the wonderful purrr package):

mydata <- mydata0 %>%
   # calculate
   mutate(
      power_versus_pvalue = purrr::map2(
         `prob of exposure in cases`, `prob of exposure in controls`, 
         ~power_against_pvalue(.x, .y, N=1000, exposure="A=exposed", unexposed="G=unexposed")
         )
   ) %>%
   # extract elements
   mutate(
      power = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="power"),
      pvalue = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="p_value"),
      OR = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="OR"),
      cohen_w = purrr::map(power_versus_pvalue, power_against_pvalue_extractor, extracted="w")
   ) %>%
   unnest(cols = c(power, pvalue, OR, cohen_w))
mydata %>% select(-power_versus_pvalue, -cohen_w) %>% head(3) %>% knitr::kable()
prob of exposure in cases   prob of exposure in controls   power       pvalue      OR
0.30                        0.3                            0.0500000   1.0000000   1.000000
0.31                        0.3                            0.1439164   0.2275967   1.128824
0.32                        0.3                            0.1547361   0.2025910   1.138560

We can see the power that corresponds to a p-value of 0.05:

mydata %>%
   filter(
      pvalue > 0.049 & pvalue < 0.051
   ) %>%
   select(-power_versus_pvalue, -cohen_w) %>% knitr::kable()
prob of exposure in cases   prob of exposure in controls   power       pvalue      OR
0.40                        0.44                           0.2925585   0.0508424   0.8339090
0.46                        0.50                           0.2962194   0.0490027   0.8350416
0.48                        0.50                           0.2960685   0.0490725   0.8351346
0.65                        0.69                           0.2930812   0.0508489   0.8263437

 

The power corresponding to a p-value of 0.05 is between 0.29 and 0.30 (exact results will vary each time the code runs; for didactic purposes I won’t set a seed here). In Hoenig’s and Lakens’ examples, using different tests, a p-value of 0.05 corresponded to a power of 0.5.

We can see how plotting power against p-value (regardless of sample size; you can try with a different one) always yields the same relationship:

power_vs_pvalue <- mydata %>%
   ggplot(
      aes(
         x=power,
         y=pvalue,
         colour=pvalue
      )
   ) +
   geom_point() + 
   geom_hline(yintercept=0.05) + geom_vline(xintercept=0.29) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

power_vs_pvalue

Conclusion

Once you have your data and have chosen the statistical test to compute a p-value, your observed power is already fixed. Hence, observed power adds no information and cannot be further interpreted: higher p-values always correspond to lower observed power.
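To make this one-to-one mapping explicit, here is a minimal sketch (not in the original post) for a two-sided z-test, where observed power can be written directly as a function of the observed p-value and the significance level:

# Observed power as a function of the two-sided p-value (z-test, alpha = 0.05)
observed_power <- function(p, alpha = 0.05){
   z_obs  <- qnorm(1 - p/2)     # |z| statistic implied by the observed p-value
   z_crit <- qnorm(1 - alpha/2) # critical value
   pnorm(z_obs - z_crit) + pnorm(-z_obs - z_crit)
}
observed_power(0.05) # ~0.50, as in Hoenig's and Lakens' examples
observed_power(0.30) # a clearly non-significant p-value maps to a low observed power (~0.18)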

Hoenig’s paper elaborates on more reasons why calculated (post-hoc) power cannot be interpreted (and hence should not be used). The Discussion section is definitely worth reading.

Bonus: Plotting power & p-value against OR and Cohen’s w

Here are 4 graphs from the runs above (N = 1000 per group). Unlike the ‘power versus p-value’ representation, these are influenced by sample size. With bigger sample sizes:

  • As effect size increases, power increases more steeply and the p-value decreases more quickly,
  • OR values that correspond either to p-values under 0.05 or to power values over 0.8 are closer to 1.

or_vs_pvalue <- mydata %>%
   ggplot(
      aes(x=OR, y=pvalue, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.05) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

w_vs_pvalue <- mydata %>%
   ggplot(
      aes(x=cohen_w, y=pvalue, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.05) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

or_vs_power <- mydata %>%
   ggplot(
      aes(x=OR, y=power, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.8) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

w_vs_power <- mydata %>%
   ggplot(
      aes(x=cohen_w, y=power, colour=pvalue)
   ) +
   geom_point() + 
   geom_hline(yintercept=0.8) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")

cowplots <- list(or_vs_pvalue, w_vs_pvalue, or_vs_power, w_vs_power) 
mylabels <- letters[1:length(cowplots)]
composed_figure <- cowplot::plot_grid(plotlist = cowplots,
                 labels = mylabels, nrow = 2, align = "h") 

composed_figure

Different sample sizes
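The figures originally shown under this heading are not reproduced here. As a rough sketch of how they could be regenerated, reusing power_against_pvalue() and mydata0 from above with a different (hypothetical) sample size:

# Re-run the same mapping with a smaller sample size, e.g. N = 200
mydata_n200 <- mydata0 %>%
   mutate(
      power_versus_pvalue = purrr::map2(
         `prob of exposure in cases`, `prob of exposure in controls`,
         ~power_against_pvalue(.x, .y, N=200, exposure="A=exposed", unexposed="G=unexposed")
      ),
      power  = purrr::map_dbl(power_versus_pvalue, power_against_pvalue_extractor, extracted="power"),
      pvalue = purrr::map_dbl(power_versus_pvalue, power_against_pvalue_extractor, extracted="p_value"),
      OR     = purrr::map_dbl(power_versus_pvalue, power_against_pvalue_extractor, extracted="OR")
   )

# The power-versus-p-value relationship stays the same; the OR and w patterns shift with N
mydata_n200 %>%
   ggplot(aes(x=power, y=pvalue, colour=pvalue)) +
   geom_point() +
   geom_hline(yintercept=0.05) +
   scale_colour_gradientn(colours = terrain.colors(5)) +
   theme_bw() + theme(legend.position = "none")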

This .Rmd source code

You can download it from here

Introducing RStudio

whateverblog.

It’s been an exciting week for me! On Monday, we finally took the wraps off the product that I’ve been working on since leaving Microsoft 18 months ago.

First, a little background: There’s a programming language called R that is taking the world of statistical computing and data analysis by storm. If you’ve never heard of R, you can read more about it in this New York Times article. But the gist of it is:

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”

We’re trying to make R a better language by giving it a better IDE. It’s called RStudio. It’s free as in beer, and free as in speech. You…

View the original post (282 more words)

R Blogdown Setup in GitHub (2)

An updated tutorial to set up a blogdown blog

You can also check this post, written in #blogdown, here: r-blogdown-setup-in-github-2.

This is an updated blog post from the previous version: R Blogdown Setup in GitHub.

Inspired by this great blog, by Elio Campitelli: eliocamp.github.io/codigo-r and his settings on the hugo-tranquilpeak theme, by Louis Barranqueiro & Thibaud Leprêtre.

The changes I made involve many modifications to paths and directories (some very subtle), which invalidate much of my previous tutorial. Basically, the changes are:

  • I still use Github pages for hosting. However, I changed the rendering folder from master to docs.
  • Made some other fixes to the config.toml file.

The steps I followed are detailed here:

Git & GitHub repos

  •  Set up a GitHub account, following for example this.
  •  Set up a new GitHub repo with a name of your choice (in my case talesofr). See this  and this.
  •  Activate GitHub Pages. For this, you must go to the /settings section of your repo and find the section “GitHub Pages”. In the dropdown menu, select master branch /docs folder and save.

Activating GitHub pages


  • Create a git local repo in your machine:
    • Create manually a new directory called like your repository, e.g. talesofr.
    • Run in the terminal (Windows users have to install git first):
    cd /Git/talesofr # your path may be different
    git init # initiates repo in the directory
    git remote add origin https://github.com/[USERNAME]/talesofr # connects git local repo to remote Github repo
    git pull origin master # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded to your machine
  • You can add a .gitignore text file to the root directory of your repo. This tells git which files not to add to the repo, even if they are inside the folder. One example of a .gitignore file is this.
  • For now, your repo is ready. We will now focus on creating & customising our Blogdown.

RStudio and blogdown setup

  • We will open RStudio (my version is 1.1.419).
    • First, you may need to install Blogdown in R:
    install.packages("blogdown")
    • In RStudio, select the Menu > File > New Project following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project should be available in your RStudio version.

Creating new Project


Selecting Hugo Blogdown format


Selecting Hugo Blogdown theme


A config.toml file appears



Customising paths and styles

Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.

Modify config.toml file

To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:

  • We need to set up the base URL to the “root” of the web page (https://[USERNAME].github.io/[REPO_NAME] in this case).
  • By default, the web page is published in the /public folder. We need it to be published in the /docs folder of the repository (we must create it if it doesn’t exist yet), to match the GitHub Pages setting we chose earlier (master branch /docs folder):
baseURL = "https://aurora-mareviv.github.io/talesofr/"
publishDir = "docs"
  • Other useful global settings:
draft: false # if set to true, changes will not be published
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$"]
enableEmoji = true
...

Images & styling paths

We can revisit the config.toml file to make changes to the default settings.

The file for the logo that appears in the corner must be placed in the /docs/logo.png path. To modify it in the config.toml:

picture = "logo.png" # the path to the logo

The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images. To modify it in the config.toml:

coverImage = "myimage.jpg"

We want some custom css and js. We need to locate them in /static/css/my-style.css and in /static/js/myjs.js respectively.

  # Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
  # they have to be referred from static root. Example
   [[params.customCSS]]
     href = "css/my-style.css"

  # Custom JS. Put here your custom JS files. They are loaded after the theme JS;
  # they have to be referred from static root. Example
   [[params.customJS]]
     src = "js/myjs.js"

Custom css

Now, we can add arbitrary classes to our css file! (see above).

I started out writing in Bootstrap, and I love it a lot. Since this theme already has Bootstrap classes, I brought in some others I didn’t find in the theme (they’re available for .md files, but currently not for .Rmd files).

Here is an example of a custom css file that can be copied to the file /static/css/my-style.css:

/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme!*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/

.input-lg {
  font-size: 30px;
}
.input {
  font-size: 20px;
}
.font-sm {
    font-size: 0.7em;
}
.texttt {
  font-family: monospace;
}
.alert {
  padding: 15px;
  margin-bottom: 20px;
  border: 1px solid transparent;
  border-radius: 4px;
}
.alert-success {
  color: #3c763d;
  background-color: #dff0d8;
  border-color: #d6e9c6;
}
.alert-danger,
.alert-error {
  color: #b94a48;
  background-color: #f2dede;
  border-color: #eed3d7;
}
.alert-info {
  color: #3a87ad;
  background-color: #d9edf7;
  border-color: #bce8f1;
}
.alert-gray {
  background-color: #f2f3f2;
  border-color: #f2f3f2;
}

/*style for printing*/
@media print {
    .noPrint {
        display: none;
    }
}

/*link formatting*/
a:link {
    color: #478ca7;
    text-decoration: none;
} 
a:visited {
    color: #478ca7;
    text-decoration: none;
}
a:hover {
    color: #82b5c9;
    text-decoration: none;
}

Also, we have font-awesome icons!

Update! Hugo shortcodes in .Rmd !

Searching through blogdown/issues in GitHub, I found this good trick by Yihui:
You can add Hugo shortcodes if you wrap them with the function htmltools::HTML:

htmltools::HTML("{{< hl-text danger >}}
              this is a highlighted danger text
              {{< /hl-text >}}")

Custom javascript

We can also add really cool functions to enhance our post! (see the blogdown version of this post).

Here is the code of the button:

<div class="button well alert alert-danger text-center" id="myButton" onclick="alertColor()">
  <span id="toInfo"> <span class="fa fa-minus-circle"></span>alert-danger </span>
</div>

And here is the javascript function that you can copy into /static/js/myjs.js:

function alertColor() {
  var result = "<span class='fa fa-info-circle'></span>&nbsp;alert-info";
  document.getElementById("toInfo").innerHTML = result; 
  document.getElementById("myButton").style.backgroundColor = "#d9edf7";
  document.getElementById("myButton").style.color = "#3a87ad";
  document.getElementById("myButton").style.borderColor = "#bce8f1";
}

Site build with blogdown

Once our theme is ready, we can add some content, modifying or deleting the various examples we will find in /content/post.

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, on the right side of the IDE, you can examine the resulting html and check that everything went OK.

Deploying the site

Updating the local git repository

This can be done with simple git commands:

cd /Git/[REPO_NAME] # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Starting my Hugo blog" # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

Just go to the page https://[USERNAME].github.io/[REPO_NAME] and enjoy your blog!


Add R code

Works just the same as in Rmarkdown. R code is compiled into an html and published as static web content in a few steps. Welcome to the era of reproducible blogging!

Figure 1 uses the ggplot2 library:

library(ggplot2)
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()


Figure 1: diamonds plot with ggplot2.


A new post in blogdown

I know there is a procedure to create a new post using the interface in RStudio (via the “Addins” button in RStudio IDE) but this one will also work:

I copied a blank .Rmd file into the folder /content/post. I gave it a different name than the other post, with the date and some explanatory text (I called it “2017-08-22-new-post.Rmd”).

Then, I added an appropriate YAML heading (similar to the one in the previous post, but changing dates and times). Beware, as the YAML header is indentation-sensitive:

---
title: "New Post in Blogdown"
author: "1"
date: 2017-08-22T23:41:14-05:00
draft: false
categories: ["R"]
tags: ["R Markdown", "blogdown", "#rstats"]
thumbnailImagePosition: left
thumbnailImage: ./images/logo.png
metaAlignment: center
disable_comments: true
output:
  blogdown::html_page:
    toc: false
    fig_width: 8
    css: "/css/my-style.css"
--- 

Finally I added some content to the Rmarkdown and saved the changes.

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, on the right side of the IDE, you can examine the resulting html and check that everything went OK.

Deploying the site with the new post

Updating the local git repository

This can be done with simple git commands:

cd /Git/[USERNAME].github.io # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Adding a new post" # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

You can rinse and repeat this procedure for the rest of the posts.


This .Rmd source code

You can download it from here

 

An introduction to joint modeling in R

By J Espasandin, O Lado, C Díaz, A Bouzas, I Guler, A Baluja. 

You can also check this post, written in #blogdown, here: intro-joint-modeling-r.

Between the 19th and 21st of February, the learning activity titled “An Introduction to the Joint Modeling of Longitudinal and Survival Data, with Applications in R” took place, organized by the Interdisciplinary Group of Biostatistics (ICBUSC) and directed by Professor Carmen Cadarso-Suárez, from the University of Santiago de Compostela.

The activity had a markedly international character, with researchers from different European countries such as Germany, Portugal, the Netherlands, Greece and Turkey. It was also interdisciplinary, with attendees from different fields of research, such as statistics, biology, medicine, ecology and bioinformatics, belonging to different universities, biomedical institutions and industry.

The course was taught by Professor Dimitris Rizopoulos of the Erasmus University Medical Center in Rotterdam, a specialist in joint-modeling techniques. Professor Rizopoulos is the author of a book on joint modeling, as well as numerous publications and two related R packages: JM and JMbayes.

The joint modeling techniques presented during the scientific meeting allow for the simultaneous study of longitudinal and time-to-event data. Longitudinal data include repeated measurements of individuals over time, and time-to-event data represent the expected time before an event occurs (like death, an asthma crisis or a transplant). That combination of data frequently arises in the biomedical sciences, where it is common to analyze the evolution of a sick person over time.
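To get a feel for what such data look like, one can glance at the prothro (longitudinal, repeated measurements per patient) and prothros (one row per patient) datasets that ship with the JM package and are used in the example further below:

library(JM) # also loads nlme and survival
head(prothro)  # longitudinal data: repeated prothrombin measurements over time
head(prothros) # survival data: one row per patient, with follow-up time and death indicator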

This novel statistical tool is especially useful in the field of biomedicine: for instance, in patient follow-up studies after surgery; to design a personalised pattern of medical visits; to carry out predictions of survival based on a patient’s evolution, or to update those predictions in light of new data; to identify useful biomarkers; or to predict patient outcomes in different chronic diseases such as diabetes, some types of cancer or cardiovascular disease.

The applicability of these models was illustrated with the JM and JMbayes R packages (by D. Rizopoulos), as well as the packages joineR (by Philipson et al.) and lcmm (by Proust-Lima et al.).

An overview of joint modeling

It basically combines (joins) the probability distributions from a linear mixed-effects model with random effects (which takes care of the longitudinal data) and a survival Cox model (which calculates the hazard ratio for an event from the censored data). The whole model and its parts can be extended in several ways:

  • To find latent population heterogeneity (latent class joint models).
  • Allow for multiple longitudinal markers.
  • Allow for the analysis of multiple failure times. This is the case of competing risks and recurrent events (for instance, when a child develops asthma attacks, to find the risk of recurrence).
  • Time-Dependent accelerated failure time (AFT) Models.
  • Dynamic predictions when new values are added for the longitudinal variable, using Maximum Likelihood Estimates and empirical Bayes estimates.

Also, the JM package has functions for discrimination and calibration (of a single marker and between models): sensitivity & specificity, time-dependent ROCs and AUCs.

Applications for joint modeling

Citing D. Rizopoulos:

Joint models for longitudinal and time-to-event data have become a valuable tool in the analysis of follow-up data. These models are applicable mainly in two settings: First, when the focus is on the survival outcome and we wish to account for the effect of an endogenous time-dependent covariate measured with error, and second, when the focus is on the longitudinal outcome and we wish to correct for nonrandom dropout.

Summary

When do we need joint models for longitudinal and survival outcomes?

  • To handle endogenous time-varying covariates in a survival analysis context
  • To account for nonrandom dropout in a longitudinal data analysis context

How do joint models work?

  • A mixed model for the longitudinal outcome
  • A relative risk model for the event process
  • Explain interrelationships with shared random effects

Last but not least… a dynamic prediction GIF!

library(JM)
# Animation example 
# Mixed-effects model fit
lmeFit.p1 <- lme(log(pro) ~ time + time:treat, data = prothro,
    random = ~ time | id)  

# Cox survival model fit
survFit.p1 <- coxph(Surv(Time, death) ~ treat, data = prothros, x = TRUE)  

# Joint model
jointFit.p1 <- jointModel(lmeFit.p1, survFit.p1, timeVar = "time",
    method = "piecewise-PH-aGH")

# We are interested in producing predictions of survival probabilities for Patient 155
dataP155 <- prothro[prothro$id == 155, ]
len_id <- nrow(dataP155)

# We can plot the data
sfit3 <- survfitJM(jointFit.p1, newdata = dataP155[1:3, ]) 
sfit4 <- survfitJM(jointFit.p1, newdata = dataP155[1:4, ]) 

par(mfrow=c(1,2))
plotfit3 <- plot(sfit3, estimator="mean", include.y = TRUE, conf.int=0.95, fill.area=TRUE, col.area="lightblue", main="Patient 155")
plotfit4 <- plot(sfit4, estimator="mean", include.y = TRUE, conf.int=0.95, fill.area=TRUE, col.area="lightblue", main="Patient 155")

library(animation)
saveGIF({
  for(i in c(1:len_id)){
      sfit <- survfitJM(jointFit.p1, newdata = dataP155[1:i, ]) 
      plot(sfit, estimator="mean", include.y = TRUE, conf.int=0.95, fill.area=TRUE, col.area="lightblue", main="Patient 155")
      
  }
},ani.width = 400, ani.height=400)

A great crowd over there!

 

A minimal Project Tree in R

You can also check this post, written in #blogdown, here: minimal-project-tree-r.

Introduction

Over the last two days, some discussions arrived at my Twitter feed about how bad the following sentences are at the beginning of your R script/notebook, sparked by @JennyBryan’s slides at the IASC-ARS/NZSA Conference:

setwd()

and

rm(list = ls())

Jenny Bryan offered a detailed explanation for this, as well as some fixes, in her tidyverse blog post. The main idea was:

  • To ensure reproducibility within a stable working directory tree. She proposes the very concise here::here(), but other methods are available, such as the template or ProjectTemplate packages (see the short sketch after this list).
  • To avoid wreaking havoc on other people’s computers with rm(list = ls())!
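As a minimal illustration (not part of the original post), here::here() builds paths relative to the project root, so a script keeps working when the whole tree is copied elsewhere:

library(here)
here() # absolute path to the project root (e.g. where the .Rproj file lives)
# 'mydata.csv' is a hypothetical file inside the /data folder of the tree
mydata <- read.csv(here("data", "mydata.csv"))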

All of this buzz around project self-containment and reproducibility motivated me to finish a minimal directory tree that (with some variations) I have been using for this year’s data analysis endeavours.

It is an extremely simple tree, which separates a /data, a /plot and an /img directory inside the main folder (root):

  • The data folder contains both raw data and processed data files saved by R.
  • The plot folder contains all the plots saved during the workflow.
  • The img folder has every other image (logos, etc) that R takes as an input to build the results.
  • Inside the root folder I store the main .R or .Rmd scripts.

This ensures that every folder has a unidirectional relationship with the root folder (except the data dir in this case). But the important thing is that the paths in the scripts are set relative to the root folder, so the entire tree can be copied elsewhere and still work as expected.

I also added some more features to the tree:

  • An .Rproj file
  • Parametrize the .Rmd file
  • Git repository so the tree can be conveniently cloned or downloaded, with a .gitignore file:

Here is a sketch of how it works:

And here is the actual code of the notebook/script. I have not included regular markdown text outside the R chunks, as this template is intended to be changed and filled with new text each time:

Script code

# Installs missing libraries on render!
list.of.packages <- c("rmarkdown", "dplyr", "ggplot2", "Rcpp", "knitr", "Hmisc", "readxl")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos='https://cran.rstudio.com/')
library(dplyr)
library(knitr)
library(ggplot2)

Working directories

# directory where the notebook is
wdir <- getwd() 
# directory where data are imported from & saved to
datadir <- file.path(wdir, "data") # better than datadir <- paste(wdir, "/data", sep="")
# directory where external images are imported from
imgdir <- file.path(wdir, "img")
# directory where plots are saved to
plotdir <- file.path(wdir, "plot")
# the folder immediately above root
wdirUp <- dirname(wdir) # dirname() is simpler and portable across operating systems

Data import

# Data name (stored as a parameter in the Rmarkdown notebook)
params <- NULL
params$dataname <- "cars"
# MSEXCEL
dataname <- params$dataname # archive name
routexl <- paste(datadir, "/", dataname, ".xlsx", sep="")  # complete route to archive

library(readxl)
mydata <- read_excel(routexl, sheet = 1)  # imports first sheet
# CSV / TSV (separated by tabs in this example)
dataname <- params$dataname # archive name
routecsv <- paste(datadir, "/", dataname, ".csv", sep="")  # complete route to archive

mydata <- read.csv(routecsv, 
         header = TRUE, 
         sep = "\t",
         dec = ".")

Data operations

# Hmisc::describe(mydata)
head(mydata)
     speed dist
   1     4    2
   2     4   10
   3     7    4
   4     7   22
   5     8   16
   6     9   10
p1 <- ggplot(mydata, aes(x=speed, y=dist)) + geom_point()
p1

Save plots

# TO PDF 
plotname1 <- "p1.pdf"
# TO PNG 
plotname2 <- "p1.png"

routeplot1 <- file.path(plotdir, plotname1)
routeplot2 <- file.path(plotdir, plotname2)
ggsave(routeplot1)  # (see http://ggplot2.tidyverse.org/reference/ggsave.html)
ggsave(routeplot2) 

Save data

# RDATA
save(mydata, file="data/mydata.RData")
# MSEXCEL # not run
dataname2 <- "mydata"  # name we will give to file
routexl2 <- paste(datadir, "/", dataname2, ".xlsx", sep="")   # complete route to future archive

library(xlsx)
write.xlsx(mydata, routexl2) # creates archive in specified route
# CSV / TSV (separated by tabs in this example)
dataname2 <- "mydata"  # name we will give to file
routecsv2 <- paste(datadir, "/", dataname2, ".csv", sep="")  # complete route to future archive

write.table(mydata, file = routecsv2, append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = TRUE)

This script (and the dir tree that contains it) is saving me a lot of time and headaches (where did I put that data?…). I hope it can also be useful for people out there!

Future improvements

 

Taming exam results in pdf with pdftools

You can also check this post, written in #blogdown, here: taming-exam-results-with-pdf.

Introduction

There are several ways to mine tables and other content from a pdf, using R. After a lot of trial & error, here’s how I managed to extract global exam results from an international, massive, yearly examination, the EDAIC.

This is my first use case of “pdf mining” with R, and also a fairly simple one. However, more complex and very fine examples of this can be found elsewhere, using both pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested in seeing how many of them passed the exam, as an indirect measure of how “hard” it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools")
library(pdftools) 
txt <- pdf_text("EDAIC.pdf") 
txt[1] 
class(txt[1]) 
  [1] "EDAIC Part I 2017                                                  Overall Results\n                                         Candidate N°       Result\n                                            107131            FAIL\n                                            119233            PASS\n                                            123744            FAIL\n                                            127988            FAIL\n                                            133842            PASS\n                                            135692            PASS\n                                            140341            FAIL\n                                            142595            FAIL\n                                            151479            PASS\n                                            151632            PASS\n                                            152787            PASS\n                                            157691            PASS\n                                            158867            PASS\n                                            160211            PASS\n                                            161970            FAIL\n                                            162536            PASS\n                                            163331            PASS\n                                            164442            FAIL\n                                            164835            PASS\n                                            165734            PASS\n                                            165900            PASS\n                                            166469            PASS\n                                            167241            FAIL\n                                            167740            PASS\n                                            168151            FAIL\n                                            168331            PASS\n                                            168371            FAIL\n                                            168711            FAIL\n                                            169786            PASS\n                                            170721            FAIL\n                                            170734            FAIL\n                                            170754            PASS\n                                            170980            PASS\n                                            171894            PASS\n                                            171911            PASS\n                                            172047            FAIL\n                                            172128            PASS\n                                            172255            FAIL\n                                            172310            PASS\n                                            172706            PASS\n                                            173136            FAIL\n                                            173229            FAIL\n                                            174336            PASS\n                                            174360            PASS\n                                            175177            FAIL\n                                            175180            FAIL\n                                            175184            FAIL\nYour candidate number is indicated on your admission document        Page 1 of 52\n"
  [1] "character"

These commands return a lengthy blob of text. Fortunately, there are some \n symbols that signal the new lines in the original document.

We will use these to split the blob into something more approachable, using tidyversal methods…

  • Split the blob.
  • Transform the resulting list into a character vector with unlist.
  • Trim leading white spaces with stringr::str_trim.
library(tidyverse) 
library(stringr) 
tx2 <- strsplit(txt, "\n") %>% # divide by carriage returns
  unlist() %>% 
  str_trim(side = "both") # trim white spaces
tx2[1:10]
   [1] "EDAIC Part I 2017                                                  Overall Results"
   [2] "Candidate N°       Result"                                                         
   [3] "107131            FAIL"                                                            
   [4] "119233            PASS"                                                            
   [5] "123744            FAIL"                                                            
   [6] "127988            FAIL"                                                            
   [7] "133842            PASS"                                                            
   [8] "135692            PASS"                                                            
   [9] "140341            FAIL"                                                            
  [10] "142595            FAIL"
  • Remove the very first row.
  • Transform into a tibble.
tx3 <- tx2[-1] %>% 
  data_frame() 
tx3
  # A tibble: 2,579 x 1
                             .
                         <chr>
   1 Candidate N°       Result
   2    107131            FAIL
   3    119233            PASS
   4    123744            FAIL
   5    127988            FAIL
   6    133842            PASS
   7    135692            PASS
   8    140341            FAIL
   9    142595            FAIL
  10    151479            PASS
  # ... with 2,569 more rows
  • Use tidyr::separate to split each row into two columns.
  • Remove all spaces.
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%  
  mutate(key = gsub('\\s+', '', key)) %>%
  mutate(value = gsub('\\s+', '', value)) 
tx4
  # A tibble: 2,579 x 2
           key    value
         <chr>    <chr>
   1 Candidate N°Result
   2    107131     FAIL
   3    119233     PASS
   4    123744     FAIL
   5    127988     FAIL
   6    133842     PASS
   7    135692     PASS
   8    140341     FAIL
   9    142595     FAIL
  10    151479     PASS
  # ... with 2,569 more rows
  • Remove rows that do not represent table elements.
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] 
tx5
  # A tibble: 2,424 x 2
        key value
      <chr> <chr>
   1 107131  FAIL
   2 119233  PASS
   3 123744  FAIL
   4 127988  FAIL
   5 133842  PASS
   6 135692  PASS
   7 140341  FAIL
   8 142595  FAIL
   9 151479  PASS
  10 151632  PASS
  # ... with 2,414 more rows

Extracting the results

We already have the table! now it’s time to get to the summary:

library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise (count = n()) %>%
  mutate(percent = paste( round( (count / sum(count)*100) , 1), "%" )) %>% 
  kable()
value   count   percent
FAIL    1017    42 %
PASS    1407    58 %

From these results we see that the EDAIC Part I exam doesn’t have a particularly high clearance rate. It is currently taken by medical specialists, but its difficulty lies in the very broad list of subjects covered, ranging from topics in applied physics to the entire human physiology, pharmacology, clinical medicine and the latest guidelines.

Despite being a hard test to pass (and the exam fee), it’s becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.

 

 

Starting a Rmarkdown Blog with Blogdown + Hugo + Github

Finally, after 24h of failed attempts, I got my favourite Hugo theme up and running with RStudio and Blogdown.

All the steps I followed are detailed in my new Blogdown entry, which is also a GitHub repo. However, there is an updated version of the tutorial here:

R Blogdown Setup in GitHub (2)

After exploring some alternatives, like Shirin’s (with Jekyll), and Amber Thomas’s advice (which involved Git skills beyond my basic abilities), I was able to install Yihui’s hugo-lithium-theme in a new repository.

However, I wanted to explore other blog templates hosted on GitHub, like:

The first three themes are currently linked in the blogdown documentation as being the most simple and easy to set up for inexperienced blog programmers, but I hope the list will grow over the following months. For those who are willing to experiment, the complete list is here.

Finally I chose the hugo-tranquilpeak theme, by Thibaud Leprêtre, for which I mostly followed Tyler Clavelle’s entry on the topic. This approach turned out to be easy and good, given some conditions:

  • Contrary to Yihui Xie’s advice, I felt brave enough to choose github.io to host my blog, instead of Netlify (I love my desktop integration with GitHub, so it was interesting for me not to move to another service for my static content).
  • In my machine, I installed Blogdown & Hugo using R studio (v 1.1.336).
  • In GitHub, it was easier for me to host the blog directly in my main GitHub Pages repository (always named [USERNAME].github.io), in the master branch, following Tyler’s tutorial.
  • I have basic knowledge of html, css and javascript, so I didn’t mind tinkering around with the theme.
  • My custom styles didn’t involve theme rebuilding. At this moment they’re simple cosmetic tricks.

The steps I followed were:

Git & GitHub repos

  • Setting a GitHub repo with the name [USERNAME].github.io (in my case aurora-mareviv.github.io). See this and this.
  • Create a git repo in your machine:
    • Create manually a new directory called [USERNAME].github.io.
    • Run in the terminal (Windows users have to install git first):
    cd /Git/[USERNAME].github.io # your path may be different
    
    git init # initiates repo in the directory
    git remote add origin https://github.com/[USERNAME]/[USERNAME].github.io # connects git local repo to remote Github repo
    
    git pull origin master # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded
  • For now, your repo is ready. We will now focus on creating & customising our Blogdown.

RStudio and blogdown

  • We will open RStudio (v 1.1.336, development version as of today).
    • First, you may need to install Blogdown in R:
    install.packages("blogdown")
    • In RStudio, select the Menu > File > New Project following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project may not yet be available in your RStudio version (though probably not for much longer).

Creating new Project


Selecting Hugo Blogdown format


Selecting Hugo Blogdown theme


A config.toml file appears



Customising paths and styles

Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.

Modify config.toml file

To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:

  • We need to set up the base URL to the “root” of the web page (https://USERNAME.github.io/, in this case)
  • By default, the web page is published in the “public” folder. We need it to be published in the root of the repository, to match the structure of the GitHub master branch:
baseurl = "/./" 
publishDir = "./"
  • Other useful global settings:
ignoreFiles = ["\\.Rmd$", "\\.Rmarkdown$", "_files$", "_cache$"]
enableEmoji = true

Images & styling paths

We can revisit the config.toml file to make changes to the default settings.

The logo that appears in the corner must be in the root folder. To modify it in the config.toml:

picture = "logo.png" # the path to the logo

The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images. To modify it in the config.toml:

coverImage = "myimage.jpg"

We want some custom css and js. We need to locate them in /static/css and in /static/js respectively.

# Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
# they have to be referred from static root. Example
customCSS = ["css/my-style.css"]

# Custom JS. Put here your custom JS files. They are loaded after the theme JS;
# they have to be referred from static root. Example
customJS = ["js/myjs.js"]

Custom css

We can add arbitrary classes to our css file (see below).

I started out writing in Bootstrap, and I miss it a lot. Since this theme already has Bootstrap classes, I brought in some others I didn’t find in the theme (they’re available for .md files, but currently not for .Rmd files).

Here is my custom css file to date:

/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/

.input-lg {
  font-size: 30px;
}
.input {
  font-size: 20px;
}
.font-sm {
    font-size: 0.7em;
}
.texttt {
  font-family: monospace;
}
.alert {
  padding: 15px;
  margin-bottom: 20px;
  border: 1px solid transparent;
  border-radius: 4px;
}
.alert-success {
  color: #3c763d;
  background-color: #dff0d8;
  border-color: #d6e9c6;
}
.alert-danger,
.alert-error {
  color: #b94a48;
  background-color: #f2dede;
  border-color: #eed3d7;
}
.alert-info {
  color: #3a87ad;
  background-color: #d9edf7;
  border-color: #bce8f1;
}
.alert-gray {
  background-color: #f2f3f2;
  border-color: #f2f3f2;
}

/*style for printing*/
@media print {
    .noPrint {
        display: none;
    }
}

/*link formatting*/
a:link {
    color: #478ca7;
    text-decoration: none;
} 
a:visited {
    color: #478ca7;
    text-decoration: none;
}
a:hover {
    color: #82b5c9;
    text-decoration: none;
}

Also, we have font-awesome icons!

Site build with blogdown

Once our theme is ready, we can add some content, modifying or deleting the various examples we will find in /content/post.

We need to make use of Blogdown & Hugo to compile our .Rmd file and create our html post:

blogdown::build_site()
blogdown::serve_site()

In the viewer, on the right side of the IDE, you can examine the resulting html and check that everything went OK.

Deploying the site

Updating the local git repository

This can be done with simple git commands:

cd /Git/[USERNAME].github.io # your path to the repo may be different
git add . # indexes all files that will be added to the local repo
git commit -m "Starting my Hugo blog" # adds all files to the local repo, with a commit message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

Just go to the page https://USERNAME.github.io and enjoy your blog!


R code

Works just the same as in Rmarkdown. R code is compiled into an html and published as static web content in a few steps. Welcome to the era of reproducible blogging!

Figure 1 uses the ggplot2 library:

library(ggplot2)
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()


Figure 1: diamonds plot with ggplot2.

Rmd source code

You can download it from here

I, for one, welcome the new era of reproducible blogging!

Updated tutorial

I updated (& hopefully improved) this tutorial, which you can check here.

hexbins

Quick wordclouds from PubMed abstracts – using PMID lists in R

Wordclouds are one of the most visually straightforward, compelling ways of displaying text info in a graph.

Of course, we have a lot of web pages (and even apps) that, given an input text, will plot you some nice tagclouds. However, when you need reproducible results, or to get complex tasks done (like combined wordclouds from several files), a programming environment may be the best option.

In R there are (as always) several alternatives to get this done, such as tagcloud and wordcloud.

For this script I used the following packages:

  • «RCurl» to retrieve a PMID list, stored in my GitHub account as a .csv file.
  • «RefManageR» and «plyr» to retrieve and arrange PM records. To fetch the info from the inets, we’ll be using the PubMed API (free version, with some limitations).
  • Finally, «tm» and «SnowballC» to prepare the data, and «wordcloud» to plot the wordcloud. This part of the script is based on this one from Georeferenced.

One of the advantages of using RefManageR is that you can easily change the field which you are importing from, and it usually works flawlessly with the PubMed API.

My biggest sources of problems when running this script: download caps, busy hours, and firewalls!

At the beginning of the gist, there is also a handy function that automagically downloads all needed packages for you.

This script creates two directories in your working directory: 'corpus1' for the abstracts file, and 'wordcloud' to store the plot.

To source the script, simply type in the R console:

library(devtools)
source_url("https://gist.githubusercontent.com/aurora-mareviv/697cbb505189591648224ed640e70fb1/raw/b42ac2e361ede770e118f217494d70c332a64ef8/pmid.tagcloud.R")

And here is the code:


#########################################################
#### CAPTURE ABSTRACTS FROM PMIDs & MAKE A WORDCLOUD ####
#########################################################
# GNU-GPL license
# Author: Mareviv (https://talesofr.wordpress.com)
# Script to retrieve and mine abstracts from PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
# Uses the function ReadPubMed from the package RefManageR. It reads the PubMed API similarly to the search engine in PubMed's page.
# This script creates two directories in your working directory: 'corpus1' for the abstracts file, and 'wordcloud' to store the plot.
# First, automagically install needed libraries:
list.of.packages <- c("slam") # installing 'slam' gives error in OSX Yosemite/El Capitan. This is an attempt to fix it, specifying 'type="binary"'.
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos='https://cran.rstudio.com/', type="binary")
list.of.packages <- c("RCurl", "RefManageR", "plyr", "tm", "wordcloud", "SnowballC")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos='https://cran.rstudio.com/')
# Get and store the working directory
wdir <- getwd()
# 1. Import PMIDs
message("retrieving PMIDs info…")
library(RCurl)
urli <- getURL("https://gist.githubusercontent.com/aurora-mareviv/14e5837814a8d8d47c20/raw/90b198bae82154688dcd9a2596af798612e6619f/pmids.csv", ssl.verifypeer = FALSE)
pmids <- read.csv(textConnection(urli))
message("PMID info succesfully retrieved")
# 2. Loop several queries to PubMed and return in a data.frame
index <- pmids$pmId[1:length(pmids$pmId)]
# The PubMed (free) API may give problems with large queries, so we'll prefer a shorter vector for this test:
index50 <- pmids$pmId[1:50]
library(RefManageR)
library(plyr)
message("connecting to the free PubMed API…")
auth.pm <- ldply(index50, function(x){
tmp <- unlist(ReadPubMed(x, database = "PubMed", mindate = 1950))
tmp <- lapply(tmp, function(z) if(is(z, "person")) paste0(z, collapse = ",") else z)
data.frame(tmp, stringsAsFactors = FALSE)
})
message("abstract data successfully downloaded!")
# 3. Create a directory to write the abstracts.txt file into: (this folder can only contain this .txt file!)
corpus.dir <- paste(wdir, "corpus1", sep="/")
message(paste("creating new directory: ", corpus.dir, sep=""))
dir.create(corpus.dir)
setwd(corpus.dir)
# 4. Extract abstracts to a .txt
text <- paste(auth.pm$abstract)
message(paste("writing file: ", corpus.dir, "/abstracts.txt", sep=""))
writeLines(text, "abstracts.txt")
# 5. Create tagcloud
library(tm)
library(wordcloud)
library(SnowballC)
message("constructing the tagcloud…")
abstract <- Corpus (DirSource(corpus.dir)) # import text file in this directory
abstract <- tm_map(abstract, stripWhitespace) # transformations
abstract <- tm_map(abstract, content_transformer(tolower))
abstract <- tm_map(abstract, removeWords, stopwords("english"))
# abstract <- tm_map(abstract, stemDocument) # optional in this case
abstract <- tm_map(abstract, removeNumbers) # optional in this case
abstract <- tm_map(abstract, removePunctuation)
# tuning
abstract <- tm_map(abstract, removeWords, "methods")
abstract <- tm_map(abstract, removeWords, "results")
abstract <- tm_map(abstract, removeWords, "conclusions")
abstract <- tm_map(abstract, removeWords, "conclusion")
abstract <- tm_map(abstract, removeWords, "whether")
abstract <- tm_map(abstract, removeWords, "due")
# 6. Print image in a new folder: wordcloud
plot.dir <- paste(wdir, "wordcloud", sep="/")
message(paste("creating new directory: ", plot.dir, sep=""))
dir.create(plot.dir)
setwd(plot.dir)
message(paste("printing file: ", plot.dir, "/wordcloud.png", sep=""))
png(file = "wordcloud.png", width = 1500, height = 1500, units = "px", res = 300, bg = "transparent")
wordcloud(abstract, scale=c(5,0.5), max.words=150, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
dev.off()
# 7. Reset the working directory
setwd(wdir)

pmid.tagcloud.R
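
If you want to try it, one way (assuming you save the gist above as pmid.tagcloud.R inside a dedicated, otherwise empty folder; the folder name below is just an example) is to source it from that directory:

setwd("~/pubmed_wordcloud")   # hypothetical path; change it to your own folder
source("pmid.tagcloud.R")     # creates ./corpus1 and ./wordcloud in that folder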

Enjoy!

wordcloud

R/Shiny for clinical trials: simple randomization tables

One of the things I like most about R + Shiny is that it lets me serve the power and flexibility of R in small «chunks» that cover different needs, so people who don't use R can still benefit from it. Even better, programming these little utilities is fun and easy, even for someone without a formal programming background.

Here's a small hack in R/Shiny: it covered an urgent need in a study that required randomising patients to two treatment branches, in what is commonly known as a clinical trial. This task posed some challenges:

  • First, this trial was not financed in any way (at least initially). It was a small, independent study comparing two approved techniques for chronic pain, so the sponsor had to avoid expensive software or services.
  • Another reason for customising the software was that treatment groups were partially 'blind': blinding applied to the people who assessed effectiveness and… also to the statistical analysis (treatment administration itself was open-label). In practice, the person analysing the data knows which group each patient belongs to, but not which treatment was given to each group.

To tackle the points above, my app should have two main features:

  • The sponsor (here, a medical doctor) must be able to control study blinding effectively and also allow emergency unblinding. This control should extend to data analysis, to minimise bias in favour of either treatment.
  • R has tools to create random samples, but the MD sponsoring the study doesn't use R, so we needed a friendly interface for creating the randomisation tables.

Here’s how I got it to work:

  • The very core of this Shiny app is the combination of the set.seed and sample R functions. The PIN (the argument passed to set.seed) works like a secret passcode that maps to one particular random table: every time I enter '5432', the random tables will look exactly the same (see the minimal sketch after this list). This protects against accidental unblinding, as nobody can reproduce the correct random table without the proper PIN, even with full access to the app's source code.
  • The tables are built column by column, ordered at first; then the treatment column (in the random table) and the Group column (in the PIN table) are shuffled via the sample function.
  • Once created, the tables can be downloaded as .CSV files, then printed, signed and dated to document the randomisation procedure. The app's open source code plus the PIN keep the procedure reproducible for years to come.
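
To illustrate the core idea outside the app, here is a minimal sketch of that seed-plus-shuffle mechanism; the PIN 5432, the 8 patients and the helper name make_random_table are just made up for this example:

# Minimal sketch of the app's core: a fixed seed (the PIN) makes the shuffled allocation reproducible
make_random_table <- function(pin, ncases) {
  set.seed(pin)                                    # the PIN links to one, and only one, table
  treatment <- rep(c("group 1", "group 2"), length.out = ncases)
  data.frame(patient   = paste0("patient", seq_len(ncases)),
             treatment = sample(treatment))        # shuffle the ordered treatment column
}
make_random_table(5432, 8)  # same PIN -> always the same allocation
make_random_table(1111, 8)  # different PIN -> a different table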

Unfortunately I wasn’t able to insert iframes to embed the app, so I posted a screenshot:

Random table generator for clinical trials

The app is far from perfect, but it covers the basic needs for the trial. You can test it here:

http://aurora.shinyapps.io/random_gen

The GitHub repo is available here. Feel free to use, adapt or fork it to your needs!

https://github.com/aurora-mareviv/random_gen
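
If you'd rather run it locally, and assuming the repo keeps the usual ui.R/server.R layout at its top level, something like this should work from an R session:

# run the app straight from GitHub (requires the shiny package)
# install.packages("shiny")  # if not installed yet
library(shiny)
runGitHub("random_gen", "aurora-mareviv")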

Also, you can cite it if it’s been useful for your study methods!


# server.R
shinyServer(function(input, output) {
  # f() builds the patient randomisation table
  f <- function(seed, ncases, branches){
    set.seed(seed)
    branch <- branches
    if(branch==2){ # table creation
      rond1 <- round(ncases/2, 0)
      rond2 <- ncases-rond1
      patient <- seq(1:ncases)
      code <- paste("P", patient, sep="")
      patient <- paste("patient", patient, sep="")
      treatment <- c(rep("group 1", rond1), rep("group 2", rond2))
      order <- seq(1:ncases)
    }
    if(branch==3){ # table creation
      rond1 <- round(ncases/3, 0)
      rond2 <- rond1
      rond3 <- ncases-(rond1+rond2)
      patient <- seq(1:ncases)
      code <- paste("P", patient, sep="")
      patient <- paste("patient", patient, sep="")
      treatment <- c(rep("group 1", rond1), rep("group 2", rond2), rep("group 3", rond3))
      order <- seq(1:ncases)
    }
    random.0 <- data.frame(patient, code, treatment, order)
    random.1 <- transform(random.0, treatment = sample(treatment)) # here goes the randomisation (sampling the treatment column)
    random.1
  }
  # g() builds the treatment/PIN key table
  g <- function(seed, branches){
    set.seed(seed)
    branch <- branches
    if(branch==2){ # table creation
      Treatment.Key <- c(paste(input$tta), paste(input$ttb))
      Group <- c("group 1", "group 2")
    }
    if(branch==3){ # table creation
      Treatment.Key <- c(paste(input$tta), paste(input$ttb), paste(input$ttc))
      Group <- c("group 1", "group 2", "group 3")
    }
    chave <- data.frame(Treatment.Key, Group)
    chave.rand <- transform(chave, Group = sample(Group)) # here goes the randomisation (sampling the Group column)
    names(chave.rand)[1] <- paste("Treatment.PIN_is_", seed, sep="")
    chave.rand
  }
  mydata <- reactive(f(input$seed, input$ncases, input$branches))
  mydatachave <- reactive(g(input$seed, input$branches))
  # Show the final calculated values from the RAND table
  output$randTable <- renderDataTable({
    mydata <- f(input$seed, input$ncases, input$branches)
    mydata
  })
  output$randChave <- renderDataTable({
    mydatachave <- g(input$seed, input$branches)
    mydatachave
  }, options = list(searching = FALSE, paging = FALSE, caption = 'Table 1: This is it'))
  output$text1 <- renderText({
    paste("This is a randomization table for a study involving ",
          input$ncases, " patients and ", input$branches, " branches of treatment.",
          sep="")
  })
  output$text2 <- renderText({
    paste("- The treatment PIN table can be used to mask treatments when the group allocation must be unblinded (e.g. for data analysis).",
          sep="")
  })
  output$text3 <- renderText({
    paste("- The random table assigns patients to the ", input$branches, " branches/groups.",
          sep="")
  })
  info <- sessionInfo()
  output$version <- renderText({
    paste(info$R.version[c(13, 2)]$version.string, info$R.version[c(13, 2)]$arch,
          sep=", ")
  })
  output$downloadChave <- downloadHandler(
    filename = function() { paste(input$title, 'treatment_allocation_PIN.csv', sep='-') },
    content = function(file) {
      write.csv(mydatachave(), file, na="")
    }
  )
  output$downloadData <- downloadHandler(
    filename = function() { paste(input$title, 'random_table.csv', sep='-') },
    content = function(file) {
      write.csv(mydata(), file, na="")
    }
  )
})



# ui.R
library(shiny)
# Define the UI for the randomisation table application
shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Randomization table for clinical trials!"),
  # Sidebar with the study inputs
  sidebarPanel(
    textInput("title", "Set your study title:", "My trial name"),
    numericInput("ncases", label = "Total number of patients:", value = 60),
    numericInput("seed", label = "Set your secret passcode:", value = 12345),
    selectInput("branches", "Number of branches:",
                choices = c("2", "3")),
    textInput("tta", "Name your first branch:", "First branch"),
    textInput("ttb", "Name your second branch:", "Second branch"),
    conditionalPanel(
      condition = "input.branches == 3",
      textInput("ttc", "Name your third branch:", "Third branch")),
    textOutput("text1"),
    textOutput("text2"),
    textOutput("text3"),
    textOutput("version"),
    helpText("Random table for blind clinical trials.
             Written in R/Shiny by A. Baluja."),
    downloadButton('downloadChave', 'Download PIN table'),
    downloadButton('downloadData', 'Download random table')
  ),
  # Show the tables summarising the values entered
  mainPanel(
    div(dataTableOutput("randChave"), style = "font-size:80%"),
    div(dataTableOutput("randTable"), style = "font-size:80%")
  )
))
# To execute this script from inside R, place the ui.R and server.R files in the session's working directory and type:
# runApp()
# To run this Gist directly from the Internet, type:
# shiny::runGist(' ')

Compile BayesX from source code – via Fink in OSX 10.10

My PC just passed away: my good old machine from 2009… so I decided to buy a desktop computer running OSX 10.10 (although I can't rule out partitioning the disk and installing Linux as well…). For now, I missed some apps, which I installed with Fink (a popular package manager for macOS built on Debian's apt/dpkg tools).

One of the programs I was fiddling around with was BayesX: «Bayesian Inference in Structured Additive Regression Models», authored by great people at the University of Munich (especially Nadja and Thomas, whom I'm glad to know personally).

The problem is that I had BayesX binaries for both Linux and Windows, but now I needed one for my Mac. Here's how I did it (pretty easy once I figured out the steps!):

1. OBTAIN XCODE, XQUARTZ AND JAVA

Xcode, Xquartz and Java are the first tools you will need in order to build Fink, and hence BayesX.

2. INSTALL FINK

Fink ports GNU/Linux software to macOS. You can install it easily by following the steps on this page.

Because my OS was 10.10, I had to follow the instructions to compile Fink from source. Fortunately, the page provides a handy helper script for the Terminal (for OSX 10.10) that goes through all the steps. Open the Terminal, run either the script or the commands one by one, and follow the instructions. It may take a while to download and install.

3. DOWNLOAD BAYESX SOURCE CODE

Go to the Download page and get the source. I suggest storing the .zip file in a dedicated directory such as /Users/me/Downloads/source/code (replacing «me» with your username). Then open the Terminal and type:

cd /Users/me/Downloads/source/code
unzip -a bayesxsource_3_0_1.zip # bayesxsource_3_0_1.zip or whatever the name is

4. INSTALL CMAKE AND GSL

To compile this source on a Mac, you'll need two more components (at least in this tutorial): CMake and the GNU Scientific Library (GSL).

CMake is used to create a custom Makefile which will drive the compilation. You can install it by typing in the Terminal:

sudo apt-get install cmake

Or you can use a graphical user interface for CMake (very useful for debugging).

To install GSL, simply type:

sudo apt-get install gsl

5. COMPILE BAYESX

Once everything is installed you can start compiling. Run in the Terminal:

cd /Users/me/Downloads/source/code
cmake . # this will create the Makefile
cd /Users/me/Downloads/source/ # locate the Makefile
make

You will obtain a Unix executable file called BayesX in /Users/me/Downloads/source. Double-clicking on it should open the BayesX prompt!

6. FINAL NOTES

Please see README.BayesX in the source for more information and troubleshooting.

I have tested this tutorial on another Mac (OSX 10.10) without any of this software pre-installed, and it worked well. However, you may run into issues depending on your versions of Xcode, Fink, Java, GSL… etc. Despite this, I hope these steps will help you.

Also, this only compiles the console version of BayesX. I'm not sure yet how to build the Java user interface, but I will update this post with any progress.

Hope you enjoy!

bayesicon
