…start using R, from scratch!

Some time ago, once I was able to use R by myself, I found some colleagues and other people who wanted to learn R as well. I pointed them to help pages, to CRAN repositories… but in some cases they said they didn’t know how to start using those resources. Obviously, the main self-perceived limitation for non-programmers is the use of “commands” -ok, many of the 80s kids will remember typing command lines to launch games such as Pac-Man or Frogger… :).

At the same time, they also wanted to refresh some basic statistics and get a general feel for their data before asking a statistician for help. A quick way to help them was to write some scripts that guide them through basic commands, letting them see results in real time and recycle the code for their own data.

If you have just started using R, maybe they can be useful for you too. Either way, I recommend keeping some open “plain text” file(s) where you paste your favorite commands and clone/modify them to suit your needs. Remember to store the files where you can access them later!

  • Tip: you can change the extension of your mytext.txt file to mytext.R, telling Windows to keep opening it with Notepad. It will still be a plain text document, but some text editors will recognize it as an “R script” and highlight the content accordingly.
  • Apart from Notepad in Windows, there is a bunch of other text/code editors which are more pleasant to use. See for example RStudio and Notepad++.

Copy the Gists below into your own text files, and begin playing with R!

Power and sample size calculator for mitochondrial DNA association studies (Shiny)

The functions detailed in the piece of code below (in a Gist) have been useful to me when I had to calculate many possible scenarios of statistical power and sample size. The formulae were taken from the article by Samuels et al., AJHG 2006, and the script even proved useful for making a variety of comparative plots.
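As a generic point of comparison (this is not the Samuels et al. method implemented in the Gist), a basic chi-square power/sample size calculation can be sketched with the pwr package; the effect size, degrees of freedom and significance level below are illustrative assumptions:

library(pwr)  # install.packages("pwr") if needed

# Solve for the total sample size of a chi-square association test
# (w, df and sig.level are illustrative assumptions, not values from the paper)
pwr.chisq.test(w = 0.3,         # assumed effect size
               df = 5,          # e.g. 6 haplogroup categories - 1
               sig.level = 0.05,
               power = 0.80)    # N is omitted, so the function solves for it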

The script is intended for estimating power/sample size in association studies involving mitochondrial DNA haplogroups (which are categories whose frequencies depend on each other), on a chi-square test basis. The problem with scripts is that sometimes they aren’t as friendly to many people as GUIs are. There are several ways around this but, as I don’t have a programming background (apart from R), the most straightforward one for me was Shiny.

Shiny is a friendly framework which allows for great interactive features (see its Tutorial), and an app loads in the web browser from an open R console just by running:

library(shiny)
shiny::runGist('5900303')

Where 5900303 is the ID of the Gist. Here is the source:

To work with the files on your own computer, just run R from the directory containing ui.R and server.R, and launch the app with the command:

runApp()

If this doesn’t work, you can pass the complete path to the directory containing the ui and server files:

runApp("path/to/directory")

BONUS / EDIT:

Here is a new Gist, which displays a simple graph from two power/ncases values (it was hard for me to come up with this, thanks mostly to Stack Overflow and to MadScone):

library(shiny)
shiny::runGist('5895082')
Structure of the human mitochondrial genome. (Photo credit: Wikipedia)

An .EPS to .PDF converter (using LaTeX!)

I am about to go on a short holiday, so I was tidying up the code lines I had scattered around before leaving… and I found this: a minimal EPS to PDF converter, which is little more than a LaTeX template.

It is intended for transforming an .EPS graph into .PDF format. You can copy & paste the whole code into a blank text file (with a .TEX extension) and compile it with a TeX editor. To install and use LaTeX, here is a previous post about it.
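As a rough illustration (the embedded template itself may differ), a minimal converter of this kind, assuming you compile with pdflatex plus the epstopdf package, and using a hypothetical figure file named mygraph.eps, could be:

\documentclass{article}
\usepackage{graphicx}
\usepackage{epstopdf} % converts the .eps to .pdf on the fly under pdflatex
\pagestyle{empty}     % no page number around the graph
\begin{document}
% mygraph.eps is a placeholder: put your own figure file here
\includegraphics{mygraph.eps}
\end{document}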

Once you have compiled it, you will find the newly created PDF graph in the same directory as the .TEX file!

-omics in 2013

Originally published at What You're Doing Is Rather Desperate:

Just how many (bad) -omics are there anyway? Let’s find out.

Update: code and data now at GitHub

1. Get the raw data

It would be nice if we could search PubMed for titles containing all -omics in a single query. However, we cannot, since leading wildcards don’t work in PubMed search. So let’s just grab all articles from 2013 and save them in a format which includes titles. I went with “Send to…File”, “Format…CSV”, which returns 575,068 records in pubmed_result.csv, around 227 MB in size.

2. Extract the -omics
Titles are in column 1 and we only want the -omics, which the original post extracts with a grep one-liner (a rough R equivalent is sketched below).

Note: grep changed, so the original command now works. Note: this approach will miss a very few cases where omics is preceded by a hyphen. That included the…
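Since the original one-liner is not preserved in this excerpt, here is a rough R sketch of that extraction step; it assumes the pubmed_result.csv from step 1, with titles in column 1, and the regular expression is an approximation:

# Rough R equivalent of the grep step (pubmed_result.csv from step 1 assumed,
# article titles in column 1; the regex is an approximation)
titles  <- read.csv("pubmed_result.csv", stringsAsFactors = FALSE)[[1]]
matches <- regmatches(tolower(titles), gregexpr("[a-z]+omics", tolower(titles)))
omics   <- unlist(matches)
head(sort(table(omics), decreasing = TRUE), 10)  # ten most frequent -omics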

View original: 272 more words

Biostatisticians… beware of fuzzy researchers!

Over the last few days I was thinking about how researchers could collaborate efficiently with their experts in statistics. With the increasing complexity of science, exchanging information can be crucial to getting the best results. But what happens when a researcher doesn’t give a statistician all the information they need?

When someone asks for help with statistics -as a clinician, I often need it-, many biostatisticians make sure that some basic points are met in a research study -prior to their analysis-, and I think they should ask about them as soon as possible:

  1. Is the research question sensible, and supported by the literature in any way? A good research idea should be accompanied by a short review with pros, cons and all the important references, which can guide you to the most appropriate or most widely used methods in a given field. Ask for it, and read it. If the study has a flaw at this point, it’s easier to correct now than later.
  2. Is the research question defined and detailed at the end of that review? Does it have a main objective? Do they plan further analyses depending on the results? Do the researchers give enough information for a sample size calculation? With these points you can avoid getting stuck in the middle of a study for a long time. The scientific method is always a good guide when things don’t go as expected.
  3. Data? OK, they have planned something taking into account their available knowledge. But what about the resources? Can they collect all the data they need? If not, how can the design be adapted? Often they have to change something in the research question and start again. But even in the worst case, this takes much less time than not doing things right from the beginning.
  4. Have they -researchers, clinicians, whoever- met all three tips above? Then your chances of helping them answer the question will increase, and even the time to the answer may decrease substantially.

In an ideal world, everyone follows good methods and asks the right questions… But sometimes things go wrong, and the statistician doesn’t usually have many chances to realise what’s really going on. Here are two examples:

One example:

It was a pioneering study of some pathology in children. I think it was multicenter and observational -a cohort-, and it evaluated some outcomes at the end. I don’t remember all the details, but the statisticians told me they got stuck -see point 2- on one covariate that was incorrectly defined. In fact, it mixed patients from potentially very different categories! And the worst part was that the doctors who designed the study were aware of this drawback. But they didn’t warn of it, and they didn’t discuss it with the statisticians. That is a waste of time. They knew they should have retrieved another covariate, so those groups could be separated. At the same time, the statisticians were thinking of changing the type of analysis…

Another example:

This one had more severe errors, in the sense of a question badly supported by previous knowledge, and horribly tested. On the other hand, the statistical methods used were really good, and also broadly applicable to the type of data involved. I’m going to be more explicit here:

That study, in psychology, was based on measuring changes in tiny blood vessels -capillaries- in some place inside the “brain”. But to do that, they measured those changes in a poorly related part of the brain! The neuronal mass that lies inside the skull is not a uniform entity: it is an entire system, composed of multiple, different “organs”. The cortex is different from the thalamus, from the locus coeruleus, from the putamen… In any case, “Carpenter’s” is a good book to start studying neuroanatomy.

When I asked the biostatistician whether those people were sure of what they were doing, the answer was: “the researchers told me that this is supported by the scientific literature”. So I was curious enough to look on Google, and then in PubMed, for evidence. No background was found other than this single study.

The technique they used was hemoencephalography (HEG). It measures changes in capillary blood flow in both frontal lobes of the brain cortex, using infrared light reflected by red blood cells -more info in the links above-. It seems much better known in other fields of psychology, where it is used for neurofeedback, on the premise that vascular changes may reflect neural activity at that exact place. Unfortunately, its use is limited to superficial parts of the brain, so that the infrared light can reach the capillaries through the skin; it cannot assess deep frontal areas, only superficial ones. The technique also has life-saving applications in anaesthesiology and critical care -see the second picture below-, where blood oxygen saturation in the brain hemispheres can be acutely altered by an infarct or emboli and, thanks to this kind of device, detected and treated early.

If you search PubMed for the word “hemoencephalography”, you will find -as of today- only one study that uses it, for neurofeedback in migraines -which have an obvious cortical component, amenable to study by HEG-. So there is no scientific evidence at all for its use in capillaries other than the frontal ones. Unsurprisingly, the poor statistician explained that they had ended up with uninterpretable results. Maybe if the researchers had complied with the conditions in point 1, the study would have been completely redesigned -because the inner parts of the brain can be studied by positron emission tomography or by functional MRI-. In that case, the statistical work would have been far more rewarding.

To end this post, here are some pictures of cool brain monitoring in anaesthesia:

  • Electroencephalography monitor with bispectral index and raw bilateral input (heatmaps!):

BIS bilateral

  • Variety of stickers to monitor brain electrical activity (BIS monitor) and oximetry changes with infrared light (INVOS, with a mechanism similar to HEG):

stickers to monitor brain

SPSS, SAS or R… which one to choose?

mareviv:

A very interesting post!

Originally published at FreshBiostats:

Performing statistical analyses by hand in the era of new technologies would seem crazy. Nowadays there are three main statistical programs for doing statistics: IBM SPSS, SAS and R, as can be read in a more extensive post on this site. Sometimes biostatisticians need to use more than one package to carry out their analyses. This means that users of these programs have to move from one environment to another, from front-end to back-end, using different wizards and graphical interfaces, wasting in many cases a significant amount of time. Because of that, in this post I would like to address the differences between the aforementioned programs, pointing out the advantages and the most suitable usage of each one from my personal point of view.

For a good choice of a statistical program, one should take into account several issues such as the following:

1. Habit: It refers to how…

View original: 576 more words

Using gdata, for MS Windows users

I use both GNU/Linux and Windows systems on a regular basis… so I’m aware of the advantages (more of them for GNU/Linux, in my case) and disadvantages of both.

Recently I needed to analyse a database from a remote location: an Excel (*.xlsx) file.

The problem was that I couldn’t get my gdata library to work… it threw some weird errors about a missing Perl interpreter… I just needed to install one. Following this tutorial on CRAN, I downloaded ActivePerl.

Then I followed the instructions to install it, leaving the default options. The installer sets the PATH to the interpreter, so R can finally find Perl…

Then just start an R session… Done!
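For reference, a minimal sketch of the call this enables (the file name is hypothetical, and the explicit perl path is only needed when the interpreter is not on your PATH):

library(gdata)

# Read the first sheet of the Excel file (file name is hypothetical)
mydata <- read.xls("remote_database.xlsx", sheet = 1)

# If R still cannot find Perl, point read.xls to the interpreter directly
# (the path below is just an example for a default ActivePerl install):
# mydata <- read.xls("remote_database.xlsx", sheet = 1,
#                    perl = "C:/Perl64/bin/perl.exe")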