Getting biobank data into R, part 2
Thanks to RStudio and Rcpp I’ve succeeded in getting an Rcpp package up and running and making it link to the bgen code. Now it’s down to writing the C++ code that makes it actually work.
Implementing the interface
We want to implement this function:
Rcpp::List load(
std::string const& filename,
Rcpp::DataFrame ranges
) {
to load data from a bgen file. Here’s how we’ll do it.
Opening the files
First, we’ll open the bgen file and its index. For this I will use the genfile::bgen::View
and genfile::bgen::IndexQuery
classes that are part of the implementation of bgenix in the bgen repo. (The relevant #include
s need to go at the top of the file.)
Warning: The API for the View
and IndexQuery
classes is somewhat experimental, meaning that it might change in future. (In particular in the current design, the View
class really represents both the bgen file, and a view of that bgen file; likewise IndexQuery
represents both the index and a query against that index. That’s a bit weird.) But they work, so let’s use them without further ado:
using namespace genfile::bgen ;
using namespace Rcpp ;
View::UniquePtr view = View::create( filename ) ;
IndexQuery::UniquePtr query = IndexQuery::create( filename + ".bgi" ) ;
Good! Files successfully opened (or not, in which case an exception will be thrown, we’ll deal with that later if we need to).
(The UniquePtr
s here are typedefs for std::auto_ptr
, which represents unique ownership of the pointed-to objects. These days we are supposed to use std::unique_ptr
instead, but I’m keeping to std::auto_ptr
for the moment to keep the code working on older compilers.)
Setting up the query
Next we set up the query. I’m going to assume that ranges
is a dataframe specifying a set of genomic ranges, each with a specified chromosome
, start
coordinate, and end
coordinate. (By convention intervals in bgen are taken to be 1-based, closed intervals). Let’s add these to the query:
for( std::size_t i = 0; i < ranges.nrows(); ++i ) {
query->include_range(
genfile::bgen::GenomicRange(
ranges['chromosome'][i],
ranges['start'][i],
ranges['end'][i]
)
) ;
}
Now: I told you the API isn’t perfect. The query currently needs to be initialised before it can be added to the view:
query->initialise() ;
view->set_query( query ) ;
At this point, view
represents a view of the specified ranges in the bgen file.
Reading the variant data
The load()
function has to return quite a lot of information:
- The list of variants in the query
- The list of samples in the data
- The ploidy of each sample at each variant
- The genotype probabilities for each sample at each variant
In addition, these need nice things like useful row and column names. So there’s a bit of fiddling around to do. First let’s set up storage for the data we’ll need:
std::size_t const number_of_variants = view->number_of_variants() ;
// For this example we assume diploid samples and two alleles
DataFrame variants = DataFrame::create(
Named("chromosome") = StringVector( number_of_variants ),
Named("position") = IntegerVector( number_of_variants ),
Named("rsid") = StringVector( number_of_variants ),
Named("allele0") = StringVector( number_of_variants ),
Named("allele1") = StringVector( number_of_variants )
) ;
StringVector sampleNames ;
(Rcpp::StringVector
turns out to be the same thing as Rcpp:CharacterVector
. I dunno why there are two names for the same thing.)
We’ll return the genotype data as a 3 dimensional array indexed by the variant, the sample, and the genotype. (For now we’ll assume at most three genotypes exist). Likewise the ploidy will be stored as a 2d array. In this version of Rcpp there seems to be no multidimensional array class, and the right way to set this up is to pass a Dimension
object to the vector constructor instead:
Dimension data_dimension = Dimension( number_of_variants, number_of_samples, 3ul ) ;
Dimension ploidy_dimension = Dimension( number_of_variants, number_of_samples ) ;
NumericVector data = NumericVector( data_dimension ) ;
IntegerVector ploidy = IntegerVector( ploidy_dimension ) ;
Now we’re ready to get data. First let’s get the list of sample names from the bgen file (some bgen files have no sample names, in this case the View class provides a dummy set of names):
view->get_sample_ids(
[&sampleNames]( std::string const& name ) { sampleNames.push_back( name ) }
) ;
The middle line here is a C++11 lambda function that adds each name to the list of sample names.
Now let’s iterate the variants. In the bgen::View
class, that’s done by alternately calling the read_variant()
and the ignore_genotype_data_block()
or read_genotype_data_block()
methods. (For the moment we’ll ignore the genotypes themselves, I’ll get back to this below). Like this:
for( std::size_t variant = 0; variant < number_of_variants; ++variant ) {
view->read_variant( &SNPID, &rsid, &chromosome, &position, &alleles ) ;
variants['chromosome'][i] = chromosome ;
variants['position'][i] = position ;
variants['rsid'][i] = rsid ;
variants['allele0'][i] = alleles[0] ;
variants['allele1'][i] = alleles[1] ;
view->ignore_genotype_data_block() ; // will be fixed later
}
For ease of use we’ll give variants
row names of the variant IDs:
variants.attr( "row.names" ) = rsids ;
Finally, we need to put everything together in an Rcpp::List
as the return value:
List result ;
result[ "variants" ] = variants ;
result[ "samples" ] = sampleNames ;
result[ "ploidy" ] = ploidy ;
result[ "data" ] = data ;
And we’re done!
return( result ) ;
}
Some things I learned about Rcpp
On trying to compile this I learned some things about Rcpp:
Thing 1: Rcpp.h
includes R header files that define some macros, including one called ERROR
(defined in RS.h). Bad, bad R! This means you’d better #include <Rcpp.h>
last, not before your other files, otherwise you’ll get weirdo errors if you’ve used ERROR
anywhere. In my case, this broke the sqlite headers that try to define an enum value SQLITE_ERROR
.
Thing 2: You can’t access dataframe elements as I did above, like mydataframe["column"][0] = 5
. This is because Rcpp doesn’t know what the type of each column in a DataFrame is. Instead you make a reference to the column first, as in
IntegerVector const& column = mydataframe["column"];
column[0] = 5 ;
or you use the as<>
function to specify the type, as in
as< IntegerVector >( mydataframe["column"] ) = 5 ;
Thing 3: C++11 support was not turned on for me, so it worked better to replace that C++11 lambda function with a class built for the purpose, like this one:
struct set_sample_names {
set_sample_names( Rcpp::StringVector* result ):
m_result( result )
{
assert( result != 0 ) ;
}
void operator()( std::string const& value ) {
m_result->push_back( value ) ;
}
private:
Rcpp::StringVector* m_result ;
} ;
which can be used like
view->get_sample_ids( set_sample_names( &sampleNames ) ) ;
instead of the lambda function.
Does it work?
Let’s try. Apple-shift-b again (or R CMD INSTALL
or whatever) and let’s try:
> library( rbgen )
> D = load( "example/example.16bits.bgen", data.frame( chromosome = '01', start = 0, end = 100000 ))
> str( D )
List of 4
$ variants:'data.frame': 198 obs. of 5 variables:
..$ chromosome: Factor w/ 1 level "01": 1 1 1 1 1 1 1 1 1 1 ...
..$ position : int [1:198] 1001 2000 2001 3000 3001 4000 4001 5000 5001 6000 ...
..$ rsid : Factor w/ 198 levels "RSID_10","RSID_100",..: 3 111 4 122 5 133 6 144 7 155 ...
..$ allele0 : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
..$ allele1 : Factor w/ 1 level "G": 1 1 1 1 1 1 1 1 1 1 ...
$ samples : chr [1:500] "sample_001" "sample_002" "sample_003" "sample_004" ...
$ ploidy : int [1:198, 1:500] 0 0 0 0 0 0 0 0 0 0 ...
$ data : num [1:198, 1:500, 1:3] 0 0 0 0 0 0 0 0 0 0 ...
> head( D$variants )
chromosome position rsid allele0 allele1
RSID_101 01 1001 RSID_101 A G
RSID_2 01 2000 RSID_2 A G
RSID_102 01 2001 RSID_102 A G
RSID_3 01 3000 RSID_3 A G
RSID_103 01 3001 RSID_103 A G
RSID_4 01 4000 RSID_4 A G
It works!
Of course - we haven’t actually got any genotype data yet. For that, see the next post.