As I mentioned in previous posts, I often work with Next Generation Sequencing data. This means dealing with many variables that hold text, that is, sequences of characters that may also contain spaces or digits, such as gene names, functional categories or amino acid change annotations. In programming, this type of data is called a string.
Finding matches is one of the most common tasks involving strings. Along the way, it is sometimes necessary to reformat or recode these variables, as well as to search them for patterns.
Some R functions I have found quite useful when handling this kind of data include the following:
- colsplit() in the reshape package. It splits a column into several columns, using a regular expression to describe the separator
- grepl() for subsetting based on string values that match a given pattern. Here again, a regular expression describes the pattern
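
As a minimal sketch of how these two functions might be used together, the example below filters and splits a small annotation column. The data frame `variants`, its column names, the gene labels and the ":" separator are all hypothetical values chosen for illustration, not data from this post. The base R `strsplit()` fallback is my own addition so that the sketch still runs when the reshape package is not installed.

```r
# Hypothetical annotation strings of the form "gene:amino_acid_change".
variants <- data.frame(
  annotation = c("GeneA:p.Ala10Val", "GeneB:p.Arg20His",
                 "GeneC:p.Gly30Asp", "GeneB:p.Tyr40Cys"),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector that is TRUE where the pattern matches.
# "^GeneB:" matches strings that start with "GeneB:".
is_gene_b <- grepl("^GeneB:", variants$annotation)

# Use the logical vector to keep only the matching rows.
gene_b_variants <- variants[is_gene_b, , drop = FALSE]

# Split the annotation column into two columns at the ":" separator.
if (requireNamespace("reshape", quietly = TRUE)) {
  # colsplit() from the reshape package, if it is available.
  parts <- reshape::colsplit(variants$annotation, split = ":",
                             names = c("gene", "aa_change"))
} else {
  # Assumed fallback using base R only.
  pieces <- strsplit(variants$annotation, ":", fixed = TRUE)
  parts <- data.frame(
    gene = sapply(pieces, `[`, 1),
    aa_change = sapply(pieces, `[`, 2),
    stringsAsFactors = FALSE
  )
}

print(gene_b_variants)
print(parts)
```

Depending on the R and reshape versions, the columns returned by colsplit() may come back as factors rather than character vectors, so wrapping them in as.character() before further string matching is a safe habit.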