Taming exam results in pdf with pdftools

You can also check this post, written in #blogdown, here: taming-exam-results-with-pdf.

Introduction

There are several ways to mine tables and other content from a PDF using R. After a lot of trial and error, here’s how I managed to extract the overall exam results of a large, international, yearly examination, the EDAIC.

This is my first use case of “pdf mining” with R, and also a fairly simple one. However, more complex and very fine examples of this can be found elsewhere, using both pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested in seeing how many candidates passed the exam, as an indirect measure of how “hard” it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools") # only needed once
library(pdftools)
txt <- pdf_text("EDAIC.pdf") # one character string per page
txt[1] # text of the first page
class(txt[1])
  [1] "EDAIC Part I 2017                                                  Overall Results\n                                         Candidate N°       Result\n                                            107131            FAIL\n                                            119233            PASS\n                                            123744            FAIL\n                                            127988            FAIL\n                                            133842            PASS\n                                            135692            PASS\n                                            140341            FAIL\n                                            142595            FAIL\n                                            151479            PASS\n                                            151632            PASS\n                                            152787            PASS\n                                            157691            PASS\n                                            158867            PASS\n                                            160211            PASS\n                                            161970            FAIL\n                                            162536            PASS\n                                            163331            PASS\n                                            164442            FAIL\n                                            164835            PASS\n                                            165734            PASS\n                                            165900            PASS\n                                            166469            PASS\n                                            167241            FAIL\n                                            167740            PASS\n                                            168151            FAIL\n                                            168331            PASS\n                                            168371            FAIL\n     
                                       168711            FAIL\n                                            169786            PASS\n                                            170721            FAIL\n                                            170734            FAIL\n                                            170754            PASS\n                                            170980            PASS\n                                            171894            PASS\n                                            171911            PASS\n                                            172047            FAIL\n                                            172128            PASS\n                                            172255            FAIL\n                                            172310            PASS\n                                            172706            PASS\n                                            173136            FAIL\n                                            173229            FAIL\n                                            174336            PASS\n                                            174360            PASS\n                                            175177            FAIL\n                                            175180            FAIL\n                                            175184            FAIL\nYour candidate number is indicated on your admission document        Page 1 of 52\n"
  [1] "character"

These commands return a lengthy blob of text. Fortunately, it contains \n characters that mark the line breaks of the original document.
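
To see what those \n characters do, here is a minimal sketch on a made-up page. The string `page` is invented for illustration and only imitates the layout of the real output:

```r
# Hypothetical stand-in for one element of the pdf_text() output
page <- "EDAIC Part I 2017      Overall Results\n      107131      FAIL\n      119233      PASS\n"

print(page)  # shows the raw string, with "\n" displayed as an escape
cat(page)    # interprets "\n" and prints three separate lines

# Count the line breaks in the string
n_breaks <- lengths(regmatches(page, gregexpr("\n", page, fixed = TRUE)))
n_breaks
```

Printing with cat() is a quick way to check that the text layout matches the pdf before doing any splitting.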

We will use these to split the blob into something more approachable, using tidyversal methods…

  • Split the blob.
  • Transform the resulting list into a character vector with unlist.
  • Trim leading and trailing white spaces with stringr::str_trim.
library(tidyverse) 
library(stringr) 
tx2 <- strsplit(txt, "\n") %>% # split at newline characters
  unlist() %>% 
  str_trim(side = "both") # trim white spaces
tx2[1:10]
   [1] "EDAIC Part I 2017                                                  Overall Results"
   [2] "Candidate N°       Result"                                                         
   [3] "107131            FAIL"                                                            
   [4] "119233            PASS"                                                            
   [5] "123744            FAIL"                                                            
   [6] "127988            FAIL"                                                            
   [7] "133842            PASS"                                                            
   [8] "135692            PASS"                                                            
   [9] "140341            FAIL"                                                            
  [10] "142595            FAIL"
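
As a rough base R equivalent of the split, unlist and trim steps, the sketch below uses two invented pages (`pages` is hypothetical data, not the real pdf) to show why unlist is needed: strsplit returns one list element per page.

```r
# Two made-up pages, imitating the structure of the pdf_text() output
pages <- c(
  "Header A\n   107131      FAIL\n   119233      PASS\n",
  "   123744      FAIL\nFooter text   Page 2 of 2\n"
)

by_page <- strsplit(pages, "\n")      # list with one character vector per page
all_lines <- trimws(unlist(by_page))  # single flat vector, white space trimmed

length(by_page)
all_lines
```

Here trimws() plays the role of str_trim(side = "both"): both remove white space at the start and end of each string and leave the spaces in the middle untouched.
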
  • Remove the very first row.
  • Transform into a tibble.
tx3 <- tx2[-1] %>% 
  tibble() # data_frame() is deprecated in recent versions of tibble
tx3
  # A tibble: 2,579 x 1
                             .
                         <chr>
   1 Candidate N°       Result
   2    107131            FAIL
   3    119233            PASS
   4    123744            FAIL
   5    127988            FAIL
   6    133842            PASS
   7    135692            PASS
   8    140341            FAIL
   9    142595            FAIL
  10    151479            PASS
  # ... with 2,569 more rows
  • Use tidyr::separate to split each row into two columns.
  • Remove all spaces.
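# split each row at the first space; extra = "merge" keeps the rest in "value"
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%
  mutate(key = gsub('\\s+', '', key)) %>% # drop any remaining white space
  mutate(value = gsub('\\s+', '', value))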
tx4
  # A tibble: 2,579 x 2
           key    value
         <chr>    <chr>
   1 Candidate N°Result
   2    107131     FAIL
   3    119233     PASS
   4    123744     FAIL
   5    127988     FAIL
   6    133842     PASS
   7    135692     PASS
   8    140341     FAIL
   9    142595     FAIL
  10    151479     PASS
  # ... with 2,569 more rows
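
To make the splitting logic explicit, here is a small base R sketch that imitates the step above (it is not how tidyr implements separate). The two rows are invented examples, and the header is written as "No" to keep the snippet ASCII-only:

```r
# Toy rows (hypothetical), already trimmed
rows <- c("Candidate No       Result", "107131            FAIL")

# Position of the first space in each row
first_space <- regexpr(" ", rows, fixed = TRUE)

# Everything before the first space becomes the key ...
key <- substr(rows, 1, first_space - 1)
# ... and everything after it becomes the value, with all spaces removed
value <- gsub("\\s+", "", substring(rows, first_space + 1))

split_rows <- data.frame(key, value, stringsAsFactors = FALSE)
print(split_rows)
```

This also shows why the header row ends up with the two header words glued together in the value column: only the first space is used as the separator, and the remaining spaces are then deleted.
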
  • Remove rows that do not represent table elements.
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] 
tx5
  # A tibble: 2,424 x 2
        key value
      <chr> <chr>
   1 107131  FAIL
   2 119233  PASS
   3 123744  FAIL
   4 127988  FAIL
   5 133842  PASS
   6 135692  PASS
   7 140341  FAIL
   8 142595  FAIL
   9 151479  PASS
  10 151632  PASS
  # ... with 2,414 more rows
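
A more compact alternative, sketched here in base R, is to pull the code and the result out of each line with a single regular expression. It assumes, as described earlier, that every result line is a 6-digit code followed by PASS or FAIL. The `lines` vector is invented for illustration:

```r
# Hypothetical trimmed lines: a title, a header, two results and a footer
lines <- c(
  "EDAIC Part I 2017      Overall Results",
  "Candidate No       Result",
  "107131            FAIL",
  "119233            PASS",
  "Your candidate number is indicated on your admission document        Page 1 of 52"
)

# Capture group 1: the 6-digit code; capture group 2: the result
pattern <- "^([0-9]{6})\\s+(PASS|FAIL)$"
hits <- regmatches(lines, regexec(pattern, lines))
hits <- hits[lengths(hits) == 3]  # keep only lines that matched

results <- data.frame(
  key = vapply(hits, `[`, character(1), 2),
  value = vapply(hits, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
results
```

Titles, headers and footers are dropped because they do not match the pattern, so no separate filtering step is needed.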

Extracting the results

We already have the table! Now it’s time to get to the summary:

library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise(count = n()) %>%
  mutate(percent = paste(round(count / sum(count) * 100, 1), "%")) %>%
  kable()
  value   count   percent
  FAIL     1017   42 %
  PASS     1407   58 %
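
As a quick cross-check, the same percentages can be recomputed in base R from the counts in the table above. The `value` vector below is rebuilt from those counts, not read from the pdf:

```r
# Rebuild the result column from the counts reported above
value <- c(rep("FAIL", 1017), rep("PASS", 1407))

counts <- table(value)                         # number of candidates per result
percent <- round(100 * prop.table(counts), 1)  # share of each result, in %

sum(counts)  # total number of candidates
percent
```

The two counts add up to the 2,424 rows of the cleaned table.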

From these results we see that the EDAIC Part I exam doesn’t have a particularly high pass rate. It is taken by medical specialists, but its difficulty lies in the very broad list of subjects covered, ranging from applied physics and the whole of human physiology to pharmacology, clinical medicine and the latest guidelines.

Despite being a hard test to pass (and despite the exam fee), it’s becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.

 

 


An .EPS to .PDF converter (using LaTeX!)

I am about to go on a short holiday, so I was tidying up the lines of code I had scattered around before leaving… And I found this: a minimal EPS to PDF converter, which is little more than a LaTeX template.

It is intended for converting an .EPS graph to the .PDF format. You can copy & paste this whole code into a blank text file (with a .TEX extension) and run it with a TeX editor. To install and use LaTeX, here is a previous post about it.

When you have compiled it, you can search in the same file’s directory for the newly created PDF graph!
