<h1>Me vs. the EGA part 4: losing again (2019-05-15)</h1>

<p>It’s <a href="/bioinformatics/data/2019/05/12/Me_versus_the_European_Genome_Phenome_Archive_part_three.html">not over</a>.
As it turns out I’d omitted to include the README files.</p>
<p>I decided this was such a little detail that I’d ask the EGA to just deal with it.</p>
<p>However, they suggested that I instead add them as new analyses via the
<a href="https://www.ebi.ac.uk/ena/submit/webin/">EGA webin site</a>.</p>
<p>I retorted that based on inspection of the XML schema, the analysis XML has no appropriate
<code class="highlighter-rouge">ANALYSIS_TYPE</code>*.</p>
<p>They suggested I send it in with <code class="highlighter-rouge">ANALYSIS_TYPE</code> set to <code class="highlighter-rouge">SAMPLE_PHENOTYPE</code> and file type <code class="highlighter-rouge">readme_file</code>.</p>
<p>I spent half an hour building the relevant XML file (more annoying than you might think,
because these files are supposed to include references to the samples). It looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><ANALYSIS_SET>
<ANALYSIS alias="my_README_files" center_name="<center name>" broker_name="EGA" analysis_center="<center name>">
<TITLE>My README</TITLE>
<DESCRIPTION>Release note for my data</DESCRIPTION>
<STUDY_REF refname="alias of my study" refcenter="<center name>"/>
<SAMPLE_REF refname="sample 1 alias" refcenter="<center name>" label="<sample label>"/>
...
(etc.)
<ANALYSIS_TYPE>
<SAMPLE_PHENOTYPE></SAMPLE_PHENOTYPE>
</ANALYSIS_TYPE>
<FILES>
<FILE filename="<path to release note file>" filetype="readme_file" checksum_method="MD5" checksum="470d0e6f8f8f6dc1794a0d28aa63bae5" unencrypted_checksum="53405d44a4593b7bbfa2d62f7853000c"/>
</FILES>
</ANALYSIS>
</ANALYSIS_SET>
...
</code></pre></div></div>
<p>I sent it in. Quoth it:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> In analysis, alias:"my_README_files", accession:"".
Invalid group of files: 1 "readme_file" file.
Supported file grouping(s) are: [ at least 1 "phenotype_file" files, any number of "readme_file" file].
</code></pre></div></div>
<p>In other words, it doesn’t work.</p>
<p><small>* The docs on ANALYSIS_TYPE aren’t very clear. There are <a href="https://www.ebi.ac.uk/ega/submission/sequence/programmatic_submissions/prepare_xmls">these
docs</a>, which
talk about BAMs (REFERENCE_ALIGNMENT), VCFs (SEQUENCE_VARIATION), and Phenotype files
(SAMPLE_PHENOTYPE). I think these are the correct docs to follow for EGA, i.e. maybe these three
are all that are permitted.
</small></p>
<p><small> Confusingly, though, those docs link to <a href="https://www.ebi.ac.uk/ena/submit/read-xml-format-1-5">these
docs</a>, which point to the <a href="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.analysis.xsd">XML schema
files</a>, which list the following
analysis types: REFERENCE_ALIGNMENT, SEQUENCE_VARIATION, SEQUENCE_ASSEMBLY, SEQUENCE_FLATFILE,
SEQUENCE_ANNOTATION, REFERENCE_SEQUENCE, SAMPLE_PHENOTYPE, PROCESSED_READS, GENOME_MAP,
AMR_ANTIBIOGRAM, PATHOGEN_ANALYSIS, and TRANSCRIPTOME_ASSEMBLY. And not to be left out, there are
also <a href="https://ena-docs.readthedocs.io/en/latest/programmatic.html">these docs</a>, which have examples
for SEQUENCE_VARIATION, REFERENCE_ALIGNMENT and GENOME_MAP, and feature (dead) links to the
schema. </small></p>
<p><small>But in any case it turns
out that not all of these analysis types are allowed; for example, trying to submit an analysis of type
SEQUENCE_ANNOTATION gives you: “In analysis, alias:"my_README_files", accession:"". Invalid
analysis type SEQUENCE_ANNOTATION.”
</small></p>

<h1>Me vs. the EGA part 3: winning (2019-05-12)</h1>

<p>If you’ve been following these posts so far, you’ll have XML files that <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">register a study and
samples</a>, and an XML file that <a href="/bioinformatics/data/2019/05/02/Me_versus_the_European_Genome_Phenome_Archive_part_two.html">registers an
analysis</a> (i.e. that lists all the
actual data files). And you’ve uploaded all your files to your FTP inbox, and you’ve run them
through the test service, and it all responded without any errors.</p>
<p>Right?</p>
<p>Right. So now you need to join all this into datasets.</p>
<h3 id="anti-aliasing">Anti-aliasing</h3>
<p>Since we’ve generated a unique ‘alias’ for each object (studies, samples, analysis files) we should
be able to use that to link our datasets together, right? Wrong. For the dataset XML we need the
accessions.</p>
<p>To get accessions for the analysis files, we need to actually submit all the stuff we had already. I
mean you <em>could</em> do that by submitting to the real (non test) service, like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl \
-u <ega-box-XXX>:<password> \
-F "SUBMISSION=@submit.xml" \
-F "[THING]=@[XML filename]" \
https://www.ebi.ac.uk/ena/submit/drop-box/submit/
</code></pre></div></div>
<p>and capturing the response, which is another XML file, and parsing it to get the accessions, but
there’s an easier way.</p>
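<p>(If you do go the curl route, scraping the accessions out of the receipt is easy enough. Here’s a minimal sketch; the trimmed demo receipt below is modelled on what the service returns, and on a real run you’d capture curl’s output into <code class="highlighter-rouge">receipt.xml</code> instead.)</p>

```shell
# Demo receipt, modelled on what the submission service returns
# (on a real run: curl ... > receipt.xml).
cat > receipt.xml <<'EOF'
<RECEIPT receiptDate="2019-05-12T10:00:00.000+01:00" success="true">
     <ANALYSIS accession="EGAZ00001399343" alias="test_analysis_1" status="PRIVATE"/>
     <SUBMISSION accession="EGA00001532515" alias="SUBMISSION-12-05-2019"/>
</RECEIPT>
EOF
# The receipt XML is flat, so plain sed is enough: print "alias accession".
sed -n 's/.*<ANALYSIS accession="\([^"]*\)" alias="\([^"]*\)".*/\2 \1/p' receipt.xml
```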
<h3 id="the-easier-way-to-submit">The easier way to submit</h3>
<p>Go to <a href="https://www.ebi.ac.uk/ena/submit/webin/">this EGA Webin site</a> and log
in. You’ll see a ‘Submit’ tab and a ‘Submit XML files’ button. They give us a way to submit the XML
files and get useful results back.</p>
<p>Start by choosing your submission xml (<code class="highlighter-rouge">submit.xml</code>) and your study xml (<code class="highlighter-rouge">study.xml</code>) described in
<a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">part one</a>.
Submit them. You get back a tab-separated file giving you the accession for your study, and
also a receipt file. Save them.</p>
<p>Now do the same for each of the ‘samples’ and ‘analysis’ XMLs in turn. (To do this you have to
unselect the XML you’ve already selected. I couldn’t see a way to do this except by <code class="highlighter-rouge">Ctrl-R</code> reloading
the page, but that seems to work.) You get back more accessions. Save them.</p>
<p>So now you’ve got files listing the accessions - one for the study, one for each of the samples,
and importantly, one for each of the analysis objects (files) in your dataset.</p>
<h2 id="endgame">Endgame</h2>
<p>This is now pretty straightforward. You basically want this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><DATASETS>
<DATASET alias="My cool datasets" center_name="<center name>" broker_name="EGA">
<TITLE>My cool dataset</TITLE>
<DATASET_TYPE>Whole genome sequencing</DATASET_TYPE>
<ANALYSIS_REF accession="<accession of first analysis>" />
<ANALYSIS_REF accession="<accession of second analysis>" />
...
<POLICY_REF accession="<accession of data access policy>" refcenter="<center name>"/>
<DATASET_LINKS>
<DATASET_LINK>
<URL_LINK>
<LABEL>My website</LABEL>
<URL><URL of my website></URL>
</URL_LINK>
</DATASET_LINK>
</DATASET_LINKS>
</DATASET>
...
</DATASETS>
</code></pre></div></div>
<p>Repeat the <code class="highlighter-rouge"><DATASET></code> for as many datasets* as you have.</p>
<p><small>* Given the presence of <code class="highlighter-rouge"><DATASET_TYPE></code> in there, I was worried I wouldn’t be able to
include all my desired files in the same dataset. This worry seems to have been unfounded - I
included a bunch of analyses specifying CRAM files, and an analysis specifying the VCF of
microarray genotypes and the README file and two annotation files, and it seems to work.</small></p>
<p>Before writing the XML above you need to know:</p>
<ul>
<li>the accession of all your analyses (from the submission step above)</li>
<li>and the accession of your data access policy. Here I ran into a couple of issues:</li>
</ul>
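<p>(Once you’ve pulled the analysis accessions out of the saved receipt files, generating the repetitive <code class="highlighter-rouge">ANALYSIS_REF</code> lines is a one-liner. A sketch; <code class="highlighter-rouge">accessions.txt</code> and the demo accessions are my invention.)</p>

```shell
# accessions.txt: one analysis accession per line, collected from the
# receipts you saved during submission (file name and accessions invented).
printf 'EGAZ00001399343\nEGAZ00001399344\n' > accessions.txt
# Turn each accession into an ANALYSIS_REF element for the dataset XML:
sed 's/.*/<ANALYSIS_REF accession="&" \/>/' accessions.txt
```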
<h3 id="policy-issues">Policy issues</h3>
<p>First, although there’s an EGA page <a href="https://www.ebi.ac.uk/ega/submission/data_access_committee">listing data access committees</a>, there’s no such page listing
policies. So if you’ve got an existing policy but don’t know its accession, you have to ask the
<a href="mailto:ega-helpdesk@ebi.ac.uk">EGA helpdesk</a>.</p>
<p>Second, even if you do have a policy, the submission system might not know about it. Advice from
the EGA helpdesk was that this happens because these objects are too old, and were entered
manually, and the procedure has now changed completely. However this isn’t much help when uploading
data.</p>
<p>I’ve chosen to work around this by creating a new data access committee and a new data access
policy. Then I’m going to email the helpdesk and ask them to link it to the correct policy and DAC
when the data is processed. Luckily the two XMLs used for this are pretty simple:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><DAC_SET>
<DAC alias="Placeholder for EGAC<my policy number>" center_name="<center_name>" broker_name="EGA">
<TITLE>My data access committee</TITLE>
<CONTACTS>
<CONTACT name="My IDAC" email="<my dac email address>" organisation="<my organisation>" telephone_number=""/>
</CONTACTS>
</DAC>
</DAC_SET>
</code></pre></div></div>
<p>and</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><POLICY_SET>
<POLICY alias="Placeholder for EGAP<my policy number>" center_name="<center_name>" broker_name="EGA">
<TITLE>My data access policy</TITLE>
<DAC_REF accession="EGAC00001001201" refcenter="<center_name>"/>
<POLICY_TEXT>See policy EGAP<my policy number></POLICY_TEXT>
<POLICY_LINKS>
<POLICY_LINK>
<URL_LINK>
<LABEL>Data Access Agreement</LABEL>
<URL><my policy URL></URL>
</URL_LINK>
</POLICY_LINK>
</POLICY_LINKS>
</POLICY>
</POLICY_SET>
</code></pre></div></div>
<p>(Note that although the dataset only needs the policy, not the DAC, you need to do the DAC because
otherwise you can’t create the policy).
<p>These two XMLs are pretty easy to submit, and now you’ve got a policy accession to include in your
dataset.</p>
<h2 id="return-of-the-worklife-balance">Return of the work/life balance</h2>
<p>Check it out: <img src="/images/2019-05-12 EGA success.png" alt="success" class="img-responsive" /></p>
<p>Sit back. Take a deep breath. Now go and play football.</p>
<p>** END **</p>

<h1>Me vs. the EGA part 2b: The EGA strikes back (2019-05-11)</h1>

<p>Saturday morning. The sun has broken free of the morning haze. Birds flit, twittering, between branches.
My kids are out in the garden, laughing as they play.</p>
<p>One of them comes running over and asks <em>me</em> to play. No.</p>
<p>Off he goes.</p>
<p>A moment later he’s back. Come on, will I play? Nope.</p>
<p>He frowns and back he goes to his game, which as far as I can tell involves kicking a football as
high as he can in the air, falling over, and then giggling. And then getting up and doing it again.</p>
<p>Two minutes later and something nudges my arm. <em>Why</em> won’t I play?</p>
<p>“Well,” I say,</p>
<h2 id="file-naming">File naming</h2>
<p>Haven’t you read the comments on file naming I made in <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">part 1</a> and <a href="/bioinformatics/data/2019/05/02/Me_versus_the_European_Genome_Phenome_Archive_part_two.html">part 2</a>? The reason I can’t play football with
you is that on Thursday I discovered one of my files was wrongly named on the EGA FTP site, and
that wasted 24 hours, and then yesterday I decided one of the files should be named slightly more
consistently with another file, and that wasted another 24 hours, and if I don’t get this done by
Monday then I shan’t be able to do all the other things I have to do next week, and if I don’t do
it today then I shan’t be able to play with you tomorrow either. So, <a href="/bioinformatics/data/2019/05/12/Me_versus_the_European_Genome_Phenome_Archive_part_three.html">what I’m doing now is</a>,</p>
<p>(He goes back to his game.)</p>

<h1>Me vs. the EGA part 2: uploading data (2019-05-02)</h1>

<p>In <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">part one</a> of this series, we took on the EGA by
submitting a study and a set of samples to it. The EGA is now intrigued by the taste of data we’ve
sent it - not to mention confused by our persistence. It was pretty sure we would choose <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">options 1 or 2</a>. In other words it’s on
the back foot.</p>
<p>Nevertheless, there’s still a long way to go. This post is about actually getting data uploaded and submitted.
First, remember to ignore <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html#What-to-ignore">all
the red herrings</a>. We’re using the XML schema
to do this because it’s the One True Way*.</p>
<p><small>* that I got to work.</small></p>
<h2 id="where-to-put-the-data">Where to put the data</h2>
<p>The basic trick is to place files in absolute paths on the FTP site that are the paths you want the data users to download.*</p>
<p><small>* This may seem obvious, but my initial instinct was to put data in well-organised subfolders of my
choosing, which wouldn’t be exposed to data users. Turns out that won’t work.</small></p>
<p>For my study I want people to get a folder of data in this form:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> EGAS0000XXXXXX/
dataset1_name/
sample1.cram
sample1.cram.crai
sample2.cram
sample2.cram.crai
...
dataset2_name/
sample1.cram
sample1.cram.crai
...
...
</code></pre></div></div>
<p>Where <code class="highlighter-rouge">EGAS0000XXXXXX</code> is the EGA identifier for the study. You did make a note of this, right?* So I’ve uploaded data into a top-level folder <code class="highlighter-rouge">EGAS0000XXXXXX</code> on the FTP site.</p>
<p><small>* personally, I didn’t know the EGA identifier when I uploaded the data. Instead I used the
placeholder folder name <code class="highlighter-rouge">EGAS0000XXXXXX</code> as shown above. Then I renamed it after getting the study
ID during the real submission. Another way would be to submit the study XML from the previous post to
<a href="https://www.ebi.ac.uk/ena/submit/webin/">the live service</a> first, as that will give you the EGA ID.
I’ll cover that in the next post. Yet another way is to simply not name your files with the EGA identifier in.
I’m doing it because I’m stubborn that way.</small></p>
<h2 id="how-to-prepare-the-data">How to prepare the data</h2>
<p>You can’t just upload your data to the EGA - you have to encrypt it and compute checksums first.
You upload the encrypted files (<code class="highlighter-rouge">*.gpg</code>), the checksums (<code class="highlighter-rouge">*.md5</code>), and the checksums of the
encrypted files (<code class="highlighter-rouge">*.gpg.md5</code>). Is all this encryption and checksumming overkill? I don’t know, but that’s what you have to do.</p>
<p>EGA provides a tool for this, called <code class="highlighter-rouge">EgaCryptor.jar</code>. Get it
<a href="https://ega-archive.org/submission/tools/egacryptor">here</a>. Run it on each file as</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ java -jar ../EgaCryptor.jar -file <filename>
</code></pre></div></div>
<p>I’m not going to go into how to do this across many files. Use <code class="highlighter-rouge">find -exec</code> to do it, or use a
compute cluster. This can be a mini-adventure, set in the broader narrative context of the larger
one, for you to complete on your own. There may be challenges on the way, but I reckon you can handle
them.</p>
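<p>(For the record, the <code class="highlighter-rouge">find -exec</code> version of that mini-adventure looks something like the sketch below. The demo tree and the <code class="highlighter-rouge">echo</code> stand in for your real study folder and for the actual <code class="highlighter-rouge">EgaCryptor.jar</code> call.)</p>

```shell
# Demo tree standing in for the real study folder:
mkdir -p EGAS0000XXXXXX/dataset1_name
touch EGAS0000XXXXXX/dataset1_name/sample1.cram \
      EGAS0000XXXXXX/dataset1_name/sample1.cram.crai
# Select everything that is not already an EgaCryptor artefact;
# on a real run, swap the echo for: java -jar ../EgaCryptor.jar -file "$f"
find EGAS0000XXXXXX -type f ! -name '*.gpg' ! -name '*.md5' |
while read -r f; do
  echo "encrypting $f"
done
```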
<p>You now upload all these files to your ega box. For large directories of files, I found
<a href="https://www.ncftp.com/ncftp/"><code class="highlighter-rouge">ncftpput</code></a> to be the simplest tool for this. So if, locally, your
files are all in a local <code class="highlighter-rouge">EGAS0000XXXXXX</code> folder:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ncftpput -u <ega-box-XXX> -p <password> -R ftp.ega.ebi.ac.uk /EGAS0000XXXXXX/ ./EGAS0000XXXXXX/
</code></pre></div></div>
<p>Where <code class="highlighter-rouge">-R</code> specifies recursive mode, and it’s <code class="highlighter-rouge">ncftpput [options] host remotedir localdir</code>.
So what you’ve got now on the ftp site is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EGAS0000XXXXXX/
dataset1_name/
sample1.cram.gpg
sample1.cram.gpg.md5
sample1.cram.md5
sample1.cram.crai.gpg
sample1.cram.crai.gpg.md5
sample1.cram.crai.md5
sample2.cram.gpg
sample2.cram.gpg.md5
sample2.cram.md5
sample2.cram.crai.gpg
sample2.cram.crai.gpg.md5
sample2.cram.crai.md5
...
dataset2_name/
sample1.cram.gpg
sample1.cram.gpg.md5
sample1.cram.md5
sample1.cram.crai.gpg
sample1.cram.crai.gpg.md5
sample1.cram.crai.md5
...
...
</code></pre></div></div>
<p>Note there aren’t any of the unencrypted files here. I achieved that by first making a parallel
directory structure with all the <code class="highlighter-rouge">.md5</code> and <code class="highlighter-rouge">.gpg</code> and <code class="highlighter-rouge">.gpg.md5</code> files in it (but not the original
files) before running <code class="highlighter-rouge">ncftpput</code>.</p>
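<p>(Building that parallel directory is a few lines of shell. A sketch, with a demo tree standing in for the real one; hard links avoid duplicating huge files, but <code class="highlighter-rouge">cp</code> works too.)</p>

```shell
# Demo source tree (stands in for the real study folder):
mkdir -p EGAS0000XXXXXX/dataset1_name
cd EGAS0000XXXXXX/dataset1_name
touch sample1.cram sample1.cram.gpg sample1.cram.gpg.md5 sample1.cram.md5
cd ../..
# Build a parallel 'upload' tree holding only the EgaCryptor outputs:
find EGAS0000XXXXXX -type f \( -name '*.gpg' -o -name '*.md5' \) |
while read -r f; do
  mkdir -p "upload/$(dirname "$f")"
  ln "$f" "upload/$f"   # hard link instead of a copy
done
```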
<h2 id="how-to-submit-a-cram-file">How to submit a CRAM file</h2>
<p>It’s time to feed the monster again. In this step we tell EGA about our files by submitting another XML.</p>
<p>This is going to get a bit complicated because the XML has to contain</p>
<ul>
<li>a reference to the study - as in the study XML we <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">submitted before</a>. (This is where the ‘alias’ for the study is used.)</li>
<li>the full name of the unencrypted file</li>
<li>the md5sum of the unencrypted file</li>
<li>and the md5sum of the encrypted file</li>
</ul>
<p>For the CRAM files we’re working with, it also has to contain</p>
<ul>
<li>a reference to the relevant sample - as in the sample XML we <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">submitted before</a>. (This is where the ‘alias’ for the sample is used.)</li>
<li>details on the reference used for alignment. This includes the name and accession of the reference, and those of all the reference sequence names.</li>
</ul>
<h3 id="the-analysis-xml">The analysis XML</h3>
<p>For a test with one sample I’m going to assume:</p>
<ul>
<li>the study alias is <code class="highlighter-rouge">my_study_v1</code></li>
<li>the sample alias is <code class="highlighter-rouge">illumina_hiseq:test_sample_1</code></li>
<li>the files are at <code class="highlighter-rouge">EGAS0000XXXXXX/dataset1/test_sample_1.cram[.crai]</code>.</li>
</ul>
<p>My CRAM files are aligned to GRCh37. In the analysis XML we’re supposed to give an accession for this
reference - in my case it is the GenBank assembly accession
<a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000001405.1"><code class="highlighter-rouge">GCA_000001405.1</code></a>. Moreover, you’re
supposed to also give an accession for each reference contig. This makes <a href="/download/cram.xml">the full XML</a> pretty long. Here’s a simplified version:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><ANALYSIS_SET>
<ANALYSIS alias="test_analysis_1" center_name="<center name>" broker_name="EGA" >
<TITLE>
Aligned reads file for illumina_hiseq:test_sample_1
</TITLE>
<DESCRIPTION>Aligned sequence reads for illumina_hiseq:test_sample_1, mapped to GRCh37</DESCRIPTION>
<STUDY_REF refname="my_study_v1" refcenter="<center name>"/>
<SAMPLE_REF refname="illumina_hiseq:test_sample_1" refcenter="<center name>" label="illumina_hiseq:test_sample_1"/>
<ANALYSIS_TYPE>
<REFERENCE_ALIGNMENT>
<ASSEMBLY>
<STANDARD refname="GRCh37" accession="GCA_000001405.1"/>
</ASSEMBLY>
<SEQUENCE accession="CM000663.1" label="1"/>
<SEQUENCE accession="CM000664.1" label="2"/>
(etc.)
</REFERENCE_ALIGNMENT>
</ANALYSIS_TYPE>
<FILES>
<FILE filename="EGAS0000XXXXXX/dataset1/test_sample_1.cram" filetype="cram" checksum_method="MD5" checksum="fd847e0c4849ec50cdf310accd4293b0" unencrypted_checksum="d41d8cd98f00b204e9800998ecf8427e"/>
<FILE filename="EGAS0000XXXXXX/dataset1/test_sample_1.cram.crai" filetype="crai" checksum_method="MD5" checksum="2b37te0c4994f850cdf310accd8b04c" unencrypted_checksum="e32d7cd98f00b204e5801958ecf772S"/>
</FILES>
</ANALYSIS>
</ANALYSIS_SET>
</code></pre></div></div>
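<p>(The <code class="highlighter-rouge">(etc.)</code> above hides one <code class="highlighter-rouge"><SEQUENCE></code> element per contig, which for GRCh37 is a lot of typing. They can be generated from a label-to-accession table; the two-row TSV below covers just the first two chromosomes, and the full set of pairs can be pulled from the GRCh37 assembly report on NCBI.)</p>

```shell
# chromosomes.tsv: label <TAB> GenBank accession (first two GRCh37 chromosomes)
printf '1\tCM000663.1\n2\tCM000664.1\n' > chromosomes.tsv
# Emit one SEQUENCE element per row, ready to paste into the analysis XML:
awk -F'\t' '{ printf "<SEQUENCE accession=\"%s\" label=\"%s\"/>\n", $2, $1 }' chromosomes.tsv
```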
<p>How to get the checksums in there? Well, to upload the files to the FTP site you had to encrypt them.
So you can read the checksums from the <code class="highlighter-rouge">.md5</code> files that were created.</p>
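<p>(I don’t know exactly what EgaCryptor writes inside its <code class="highlighter-rouge">.md5</code> files beyond the hash itself, so the sketch below just takes the first whitespace-separated field, which copes with both a bare hash and the <code class="highlighter-rouge">md5sum</code>-style “hash filename” layout. The demo files and hashes are invented.)</p>

```shell
# Two plausible .md5 layouts (demo files; the real ones come from EgaCryptor):
printf 'd41d8cd98f00b204e9800998ecf8427e\n' > test_sample_1.cram.md5
printf 'fd847e0c4849ec50cdf310accd4293b0  test_sample_1.cram.gpg\n' > test_sample_1.cram.gpg.md5
# First whitespace-separated field = the hash, whatever the layout:
unencrypted=$(awk '{print $1; exit}' test_sample_1.cram.md5)
encrypted=$(awk '{print $1; exit}' test_sample_1.cram.gpg.md5)
echo "checksum=$encrypted unencrypted_checksum=$unencrypted"
```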
<p>Submit with:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl \
-u <ega-box-XXX>:<password> \
-F "ANALYSIS=@cram.xml" \
-F "SUBMISSION=@submit.xml" \
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
</code></pre></div></div>
<p>If successful it’ll return something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><RECEIPT receiptDate="2019-05-01T13:49:59.366+01:00" submissionFile="submit.xml" success="true">
<ANALYSIS accession="EGAZ00001399343" alias="test_analysis_1" status="PRIVATE"/>
<SUBMISSION accession="EGA00001532515" alias="SUBMISSION-01-05-2019-13:49:59:289"/>
<MESSAGES>
<INFO>Submission has been committed.</INFO>
<INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
</MESSAGES>
<ACTIONS>ADD</ACTIONS>
<ACTIONS>PROTECT</ACTIONS>
</RECEIPT>
</code></pre></div></div>
<p>If not there’ll be errors.</p>
<h3 id="some-of-it-has-issues">Some of it has issues</h3>
<p>At this point I ran into a problem. For this XML to go in, the relevant encrypted (<code class="highlighter-rouge">.gpg</code>) files
have to be on the FTP site in the specified locations (though the md5sums don’t seem to actually have
to be right - don’t know why, but I imagine this is checked later). However, as I mentioned
in my earlier post,</p>
<p><em>There’s a delay in the system recognising what you’ve put on the FTP site.</em></p>
<p>As far as I can tell, this is a delay of up to 24 hours, though I couldn’t find this documented.</p>
<p>Depending on how you work, this might not affect you. It is a real pain in the neck for me, as I’m
a last-minute tinkerer trying to “get things right” for submission today. But having just renamed some files,
I’m going to have to wait. The moral is: pick your filenames carefully from the start, and stick with them.</p>
<p>(Another complication with the above is that my sequences were actually aligned to the <code class="highlighter-rouge">hs37d5</code>
reference (<a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence">available
here</a>) that includes <a href="https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use">a concatenated set of decoy
sequences</a>. I couldn’t find
an accession for that and don’t know right now if this will come back to bite me later.)</p>
<h3 id="submitting-many-cram-files">Submitting many CRAM files</h3>
<p>This is now easy, right? Just repeat the <code class="highlighter-rouge"><ANALYSIS></code> tags as many times as you want within the XML. Of
course, you’ll have to write a computer program to generate all that XML. You’ll have to make this
code read all the <code class="highlighter-rouge">.md5</code> files, so that you can put the md5 values in the XML.</p>
<p>Should be straightforward. I’ll leave it to you.</p>
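<p>(Fine, here’s a starting point: a shell sketch that stamps out one heavily trimmed <code class="highlighter-rouge"><ANALYSIS></code> element per CRAM from its <code class="highlighter-rouge">.md5</code> files. The demo file names and hashes are invented, and the real elements need all the attributes and references shown above.)</p>

```shell
# Demo inputs: per-sample checksum files like the ones EgaCryptor produces.
mkdir -p EGAS0000XXXXXX/dataset1
for s in sample1 sample2; do
  printf 'aaaa1111\n' > "EGAS0000XXXXXX/dataset1/$s.cram.md5"
  printf 'bbbb2222\n' > "EGAS0000XXXXXX/dataset1/$s.cram.gpg.md5"
done
# One (heavily trimmed) ANALYSIS element per CRAM file:
for f in EGAS0000XXXXXX/dataset1/*.cram.md5; do
  cram=${f%.md5}                                 # path of the unencrypted CRAM
  plain=$(awk '{print $1; exit}' "$f")           # md5 of the unencrypted file
  enc=$(awk '{print $1; exit}' "$cram.gpg.md5")  # md5 of the encrypted file
  cat <<EOF
<ANALYSIS alias="analysis_$(basename "$cram" .cram)">
  <FILES>
    <FILE filename="$cram" filetype="cram" checksum_method="MD5" checksum="$enc" unencrypted_checksum="$plain"/>
  </FILES>
</ANALYSIS>
EOF
done > analyses.xml
```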
<h3 id="but-what-if-i-dont-have-cram-files">But what if I don’t have CRAM files?</h3>
<p>Oh. Well, as it turns out there are lots of different things that can be submitted. As the
<a href="https://ega-archive.org/files/Analysis_BAM.xml">example XML file</a> says:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><!-- Many other filetypes can be used to define your analysis file and
multiple file types can be submitted for a single analysis.
For example, you may wish to submit a readme_file and
phenotype_file to accompany your bam file:
<"cram"/>
<"tabix"/>
<"wig"/>
<"bed"/>
<"gff"/>
<"fasta"/>
<"contig_fasta"/>
<"contig_flatfile"/>
<"scaffold_fasta"/>
<"scaffold_flatfile"/>
<"scaffold_agp"/>
<"chromosome_fasta"/>
<"chromosome_flatfile"/>
<"chromosome_agp"/>
<"chromosome_list"/>
<"unlocalised_contig_list"/>
<"unlocalised_scaffold_list"/>
<"sample_list"/>
<"readme_file"/>
<"phenotype_file"/>
<"OxfordNanopore_native"/>
<"other"/>
-->
</code></pre></div></div>
<p>Unfortunately here the EGA is pernickety (I mean “controlled”) and I found that I couldn’t submit exactly what
I wanted. Instead I got messages of this form:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <ERROR>
In analysis, alias:"illumina_hiseq:test_sample_1", accession:"".
Invalid group of files: 1 "other" file, 1 "cram" file, 1 "crai" file.
Supported file grouping(s) are:
[1 "bam" file, 0..1 "bai" files, 0..1 "readme_file" files],
[1 "cram" file, 0..1 "crai" files, 0..1"readme_file" files].
</ERROR>
</code></pre></div></div>
<p>So this suggests you <a href="https://www.youtube.com/watch?v=oqMl5CRoFdk">can’t just submit what you want</a>.
A CRAM file can only go with an index file and a README, not e.g. an annotation file or any other associated information.</p>
<p>This is a bit of a problem for my data, because in addition to the CRAM files I want to release</p>
<ul>
<li>a VCF file of array genotypes for these samples.</li>
<li>annotation files for the sequenced and chip-typed samples</li>
<li>and a README.</li>
</ul>
<p>And I want to release them all inside the same dataset.</p>
<p>One way to do this seems to be to link them all to a genotyping analysis included within
the <code class="highlighter-rouge">ANALYSIS_SET</code>. Like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><ANALYSIS alias="omni_typing_for_test_project" center_name="<center name>" broker_name="EGA" >
<TITLE>Illumina Omni 2.5M genotyping</TITLE>
<DESCRIPTION>Illumina Omni 2.5M genotyping</DESCRIPTION>
<STUDY_REF refname="my_study_v1" refcenter="<center name>"/>
<SAMPLE_REF refname="illumina_omni2.5M:test_sample_1" refcenter="MalariaGEN" label="3999807010_R01C01"/>
<SAMPLE_REF refname="illumina_omni2.5M:test_sample_2" refcenter="MalariaGEN" label="3999807010_R01C01"/>
...
<ANALYSIS_TYPE>
<SEQUENCE_VARIATION>
<ASSEMBLY>
<STANDARD refname="GRCh37" accession="GCA_000001405.1"/>
</ASSEMBLY>
<SEQUENCE accession="CM000663.1" label="1"/>
<SEQUENCE accession="CM000664.1" label="2"/>
...
<EXPERIMENT_TYPE>Genotyping by array</EXPERIMENT_TYPE>
</SEQUENCE_VARIATION>
</ANALYSIS_TYPE>
<FILES>
<FILE filename="test_submission/test_samples.vcf.gz" filetype="vcf" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
<FILE filename="test_submission/test_samples.vcf.gz.tbi" filetype="tabix" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
<FILE filename="test_submission/test_samples.tsv" filetype="other" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
<FILE filename="test_submission/README.md" filetype="readme_file" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
</FILES>
<ANALYSIS_ATTRIBUTES>
<ANALYSIS_ATTRIBUTE>
<TAG>platform</TAG>
<VALUE>Illumina Omni 2.5M</VALUE>
</ANALYSIS_ATTRIBUTE>
</ANALYSIS_ATTRIBUTES>
</ANALYSIS>
</code></pre></div></div>
<p>For other types of data, you’re <a href="https://www.ebi.ac.uk/ega/submission/sequence/programmatic_submissions/prepare_xmls">on your own</a>.
(or maybe I should point you at <a href="https://ena-docs.readthedocs.io/en/latest/programmatic.html">this</a>
or <a href="https://ega-archive.org/submission/sequence/programmatic_submissions/working_xml">this</a>
or <a href="https://ega-archive.org/submission/sequence/programmatic_submissions/prepare_xml">this</a>.
Yes, that’s a lot of stuff to read but you didn’t hear Hercules complaining, did you?)</p>
<h2 id="how-goeth-the-quest">How goeth the quest?</h2>
<p>The EGA, let’s be honest, is fighting back. Although we’ve navigated its network of complicated XML
schema, it has brought to bear stringent rules that confound our expectations, and is trying to
annoy us by pretending our data isn’t even there. This is as good a time as any to give up - to
choose life, say, or to choose to go and watch a colour TV, or to go back and choose <a href="/bioinformatics/data/2019/05/01/Me_versus_the_European_Genome_Phenome_Archive.html">option 1 or
2</a>. But
I’m not going to. I’m going to defeat the beast, and then I’m going to have a good moan/gloat about
it on my blog.</p>
<p>(Although <a href="/bioinformatics/data/2019/05/11/The-EGA-strikes-back.html">in practice…</a>)</p>

<h1>Me vs. EGA (2019-05-01)</h1>

<p>When <a href="https://en.wikipedia.org/wiki/Labours_of_Hercules#Second_labour:_Lernaean_Hydra">Hercules faced up against
Hydra</a>, he didn’t
have to interface with each head using its own set of API endpoints. He just lopped them off.</p>
<p>Sadly, things are more complex these days. Any hero(in)es planning to submit data to the <a href="https://ega-archive.org">European
Genome-Phenome Archive</a> (the EGA) will face a creature of terrifying
proportions. Here are some options for dealing with this:</p>
<ol>
<li>Don’t bother. Data release is not that important. (recommended*)</li>
<li>Employ someone else to do it for you. (recommended)</li>
<li>You have to do it yourself? No way! You should totally go back and consider options 1 and 2.</li>
<li>Ok, well, you’re in for a rough ride but this post might help.</li>
</ol>
<p><small>* this isn’t really recommended.</small></p>
<h2 id="the-perils-ahead">The perils ahead</h2>
<p>More specifically I’m going to show you how I submitted our genome sequencing data to the EGA.
What I’ve got is 6 folders, each containing ~50 CRAM files, a VCF file of genotypes from
array typing, and a couple of annotation files.
On the EGA I want these to appear as one study containing 6 datasets.</p>
<p>To do this I had to write:</p>
<ul>
<li>one submission XML (reused at each step)</li>
<li>one study XML describing the study</li>
<li>one samples XML listing all the samples (actually I split this into two, one for sequenced samples and one for microarray samples).</li>
<li>an ‘analysis’ XML, listing all the CRAM files and all the other files.</li>
<li>a ‘dataset’ XML, specifying how all the analysis files fit into the six datasets.</li>
</ul>
<p>i.e. 6 XML files in total*.</p>
<p>Actually, it turned out I also had to write</p>
<ul>
<li>a data access committee XML, and</li>
<li>a policy XML</li>
</ul>
<p>Writing all these XML files is not much fun. It’s not any fun at all†.
But hey, you didn’t see Hercules complaining, did you?</p>
<p><small>* Actually, I also had to register a new data access committee and a
new data access policy, even though we already have a
<a href="https://www.ebi.ac.uk/ega/dacs/EGAC00000000002">data access committee</a> and a
data access policy. They were among the first ones created. I think the explanation is that
these objects are old, and hence should be ignored, but it’s not a view I would
generally subscribe to.
</small></p>
<p><small>†except perhaps for that moment when your XML <em>finally goes in successfully</em>.
That’s fun.</small></p>
<h3 id="what-to-ignore">What to ignore</h3>
<p>This may be the most important section of the post. Ignore the
<a href="https://www.ebi.ac.uk/ega/home">two</a> EGA <a href="https://ega-archive.org">websites</a>, the <a href="https://ega-archive.org/submitter-portal/#/">submitter
portal</a>, the <a href="https://www.ebi.ac.uk/ena/submit/sra/#home">other submitter
portal</a>, and the <a href="https://ega-archive.org/submission/array_based/metadata">Excel spreadsheet-based submission
process</a>. Ignore the <a href="https://ega-archive.org/submission/programmatic_submissions/submitting-metadata">JSON-based REST
API</a>, no matter
how tempting. Read the docs if you want to but beware that they all seem to be subtly misleading.</p>
<p>I am here to show you the One True Way*.</p>
<p><small>*that I got to work.</small></p>
<h3 id="what-not-to-ignore">What not to ignore</h3>
<p>Even though we are ignoring the <a href="https://www.ebi.ac.uk/ena/submit/sra/#home">EGA Webin site</a>, we are not
going to ignore the <a href="https://www.ebi.ac.uk/ena/submit/webin/">other EGA Webin site</a>, because this is
the one that actually works, and it consumes the XML we’ll be writing.
We will use that for our actual submission. But since that site doesn’t seem to
have a version for testing, for most of this post we’ll use <code class="highlighter-rouge">curl</code> to submit programmatically
to the test service instead.</p>
<h3 id="what-you-will-need">What you will need</h3>
<p>Your weapons are: an EGA submission account (presumably called something like <code class="highlighter-rouge">ega-box-XXX</code> with an associated
password). And you need to know the ‘center name’ associated with that account. For me, the center
name is “MalariaGEN”. If you don’t have these things, or don’t know them, your adventure is over.
<a href="https://ega-archive.org/submission-form.php">Register an account</a> or <a href="mailto:helpdesk@ega-archive.org">contact the EGA helpdesk</a>
about your existing account* and then come back and see me. (Or reconsider Option 1?)</p>
<p><small>* If you’ve got an existing data access committee / policy you want to use, now’s a good
time to also ask the EGA helpdesk what their accessions are. There’s a page listing
<a href="https://www.ebi.ac.uk/ega/submission/data_access_committee">DACs</a> but not one listing policies.
But if you’re creating new ones, this doesn’t matter.</small></p>
<h3 id="a-note-on-filenames">A note on filenames</h3>
<p>The EGA seems to careth not what you call your files, but you should care. In fact what you should do,
right now and before you do anything else, is to write down the exact file names and file structure
that your data release will occupy. And then never change it.</p>
<p>Why? There are two reasons.</p>
<ol>
<li>
<p>It’ll stop you faffing about with renaming files later.</p>
</li>
<li>
<p>The EGA submission system <em>does not recognise changes to its FTP inboxes
straight away</em>. It appears to take overnight to do it - no doubt it’s a CRON job or some other such thing
running in the wee hours.</p>
</li>
</ol>
<p>As I’ll describe later, this can lead to substantial delays.</p>
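<p>A cheap way to pin the file structure down - and to collect the MD5 checksums that the analysis XML will later want for each file - is to write a checksum manifest once and treat it as canonical. Here is a minimal sketch in Python; the function names and manifest layout are my own, not anything the EGA requires:</p>

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 of a file without reading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(release_dir, manifest_path):
    """Record every file's relative path and MD5 checksum.

    Write the manifest OUTSIDE release_dir, then never rename anything."""
    release_dir = Path(release_dir)
    with open(manifest_path, "w") as out:
        for path in sorted(release_dir.rglob("*")):
            if path.is_file():
                out.write(f"{md5sum(path)}  {path.relative_to(release_dir)}\n")
```

<p>Run it once over the release folder and keep the manifest somewhere safe; if a checksum or path ever changes later, you’ve renamed or touched something you shouldn’t have.</p>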
<h2 id="get-on-with-it">Get on with it</h2>
<p>Riiight…curiously enough, the only way to defeat this monster is to submit to it.</p>
<h3 id="submission-xml">Submission XML</h3>
<p>To submit anything you need this piece of XML.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><SUBMISSION_SET>
<SUBMISSION alias="" center_name="<your center name>" broker_name="EGA">
<ACTIONS>
<ACTION>
<ADD/>
</ACTION>
<ACTION>
<PROTECT/>
</ACTION>
</ACTIONS>
</SUBMISSION>
</SUBMISSION_SET>
</code></pre></div></div>
<p>You don’t need any of the other complexities in the documentation. The <code class="highlighter-rouge">alias</code> can be empty, and
you don’t need to list any files under the <code class="highlighter-rouge">ADD</code> action. Fill in your ‘center name’ as described
above. Save this to a file called <code class="highlighter-rouge">submit.xml</code>.</p>
<p>Test it out using the <code class="highlighter-rouge">curl</code> command, like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl \
-u <ega-box-XXX>:<password> \
-F "SUBMISSION=@submit.xml" \
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
</code></pre></div></div>
<p>(Fill in your EGA username and password). This sends a request to the <a href="https://www-test.ebi.ac.uk/ena/submit/drop-box/swagger-ui.html">EGA test
service</a>, which replies:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="cp"><?xml-stylesheet type="text/xsl" href="receipt.xsl"?></span>
<span class="nt"><RECEIPT</span> <span class="na">receiptDate=</span><span class="s">"2019-05-01T08:50:59.853+01:00"</span> <span class="na">submissionFile=</span><span class="s">"submit.xml"</span> <span class="na">success=</span><span class="s">"false"</span><span class="nt">></span>
<span class="nt"><MESSAGES></span>
<span class="nt"><ERROR></span>The submission must contain at least one object in addition to the submission for ADD, MODIFY and VALIDATE actions.<span class="nt"></ERROR></span>
<span class="nt"><INFO></span>Submission has been rolled back.<span class="nt"></INFO></span>
<span class="nt"><INFO></span>This submission is a TEST submission and will be discarded within 24 hours<span class="nt"></INFO></span>
<span class="nt"></MESSAGES></span>
<span class="nt"><ACTIONS></span>ADD<span class="nt"></ACTIONS></span>
<span class="nt"><ACTIONS></span>PROTECT<span class="nt"></ACTIONS></span>
<span class="nt"></RECEIPT></span>
</code></pre></div></div>
<p>We’ve got the EGA’s attention! Now we need to lob something substantial into its mandibles.</p>
<h2 id="submitting-a-study">Submitting a study</h2>
<p>For the EGA, a “study” is a collection of datasets. To submit a study you
need a study name and an abstract.</p>
<p>Also, you <em>must generate a unique identifier for this study</em>. This is called the alias, and it is
used to refer to the study later. It is not shared with anyone, but it is
supposed to be unique across everything ever submitted from the account. So it’s probably best to make it
quite specific.</p>
<p>So something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><STUDY_SET>
<STUDY alias="my_study_v1" center_name="MalariaGEN">
<DESCRIPTOR>
<STUDY_TITLE>My whole genome sequencing study</STUDY_TITLE>
<STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
<STUDY_ABSTRACT>We sequenced some stuff, and it was good</STUDY_ABSTRACT>
</DESCRIPTOR>
<STUDY_ATTRIBUTES>
<STUDY_ATTRIBUTE>
<TAG>url</TAG>
<VALUE>https://my.study.website.org/</VALUE>
</STUDY_ATTRIBUTE>
</STUDY_ATTRIBUTES>
</STUDY>
</STUDY_SET>
</code></pre></div></div>
<p>To submit it:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl \
-u <ega-box-XXX>:<password> \
-F "STUDY=@study.xml" \
-F "SUBMISSION=@submit.xml" \
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
</code></pre></div></div>
<p>If this works, you’ll get a reply containing <code class="highlighter-rouge"><INFO>Submission has been committed.</INFO></code>.</p>
<p><strong>Note</strong>: In a real run you would make a note of the EGA ID that you received, as it’s important later. But we’re only testing right now.</p>
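<p>If you’re scripting these submissions, you can pull the success flag, the messages and any new accessions out of the receipt automatically rather than eyeballing the XML. A sketch - it assumes that on success the receipt contains the created object as a child element carrying an <code class="highlighter-rouge">accession</code> attribute, which matches the receipts I’ve seen, but do check against your own:</p>

```python
import xml.etree.ElementTree as ET

def parse_receipt(receipt_xml):
    """Return (success, accessions, messages) from an EGA/ENA-style receipt.

    accessions is a list of (tag, accession) pairs, e.g. ("STUDY", "ERP...");
    messages is a list of (level, text) pairs from the MESSAGES block."""
    root = ET.fromstring(receipt_xml)
    success = root.get("success") == "true"
    # Created objects (STUDY, SAMPLE, ...) carry the ID we need to record.
    accessions = [
        (el.tag, el.get("accession"))
        for el in root
        if el.get("accession")
    ]
    messages = [(el.tag, el.text) for el in root.findall("./MESSAGES/*")]
    return success, accessions, messages
```

<p>Logging the <code class="highlighter-rouge">accessions</code> list for every real submission saves a lot of grief later, when the dataset XML needs to refer back to these objects.</p>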
<p>You can of course add more attributes to the study - I think they are arbitrary tag/value pairs -
but it’s not clear what they end up being used for. In my study, I added a couple of citations
since that felt like the right thing to do.</p>
<p>(I’ll admit it - there is a perverse satisfaction in having my XML consumed like this.
It’s tempting to submit this again, just to see what happens. Turns out this particular hydra won’t eat
the same thing twice, so we’re going to have to feed it something else.)</p>
<h2 id="submitting-a-sample">Submitting a sample</h2>
<p>This monster’s appetite is piqued but far from sated. We need to feed it some actual data,
in the hope it will later be excreted onto the EGA website. To register a sample you
need this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><SAMPLE_SET>
<SAMPLE alias="illumina_hiseq:<a_unique_sample_identifier>" center_name="MalariaGEN">
<TITLE>
My whole-genome sequenced sample
</TITLE>
<SAMPLE_NAME>
<TAXON_ID>9606</TAXON_ID>
<SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME>
<COMMON_NAME>human</COMMON_NAME>
</SAMPLE_NAME>
<DESCRIPTION>A whole-genome sequenced human sample</DESCRIPTION>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE>
<TAG>subject_id</TAG>
<VALUE><a_subject_identifier></VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>sex</TAG>
<VALUE>female</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>phenotype</TAG>
<VALUE>genome</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
</SAMPLE_SET>
</code></pre></div></div>
<p>Once again the sample needs a unique alias. For me, samples were given a unique-ish identifier by
the Sanger centre when they were processed. They were genotyped and sequenced but it’s not likely
there’ll ever be more data on these samples, so the formula ‘illumina_hiseq:<Sanger sample
identifier>’ should be enough. You might need to generate something unique yourself.</p>
<p>For each sample, it turns out you must also provide:</p>
<ul>
<li>a taxon id (that’s the <code class="highlighter-rouge"><TAXON_ID>9606</TAXON_ID></code> bit).</li>
<li>a <code class="highlighter-rouge">subject_id</code> or <code class="highlighter-rouge">donor_id</code> (I don’t know if there’s a difference between these - I used <code class="highlighter-rouge">subject_id</code> as above).</li>
<li>a <code class="highlighter-rouge">sex</code> or <code class="highlighter-rouge">gender</code></li>
<li>a <code class="highlighter-rouge">phenotype</code></li>
</ul>
<p>That’s documented on <a href="https://www.ebi.ac.uk/ega/submission#sample">this page</a>. I <em>think</em> that the
allowable values for <code class="highlighter-rouge">gender</code> and <code class="highlighter-rouge">sex</code> are <code class="highlighter-rouge">male</code>, <code class="highlighter-rouge">female</code>, or <code class="highlighter-rouge">unknown</code>.</p>
<p>The <code class="highlighter-rouge">phenotype</code> is supposed to come from the <a href="http://bioportal.bioontology.org/ontologies/EFO">Experimental Factor
Ontology</a>. That’s a whole other beast, and as
it lunges at me I remember that my samples don’t have any measured phenotypes. Panic! But quick as
a flash, I parry by writing for the phenotype the only thing that was measured - the
<a href="http://bioportal.bioontology.org/ontologies/EFO/?p=classes&conceptid=http%3A%2F%2Fwww.ebi.a
c.uk%2Fefo%2FEFO_0004420">genome</a>*.</p>
<p>Save the above file in <code class="highlighter-rouge">sample.xml</code> and submit it:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl \
-u <ega-box-XXX>:<password> \
-F "SAMPLE=@sample.xml" \
-F "SUBMISSION=@submit.xml" \
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
</code></pre></div></div>
<p><small>* Is the genome a phenotype? Yes, it’s a 3.2 billion-dimensional one.</small></p>
<h2 id="submitting-lots-of-samples">Submitting lots of samples</h2>
<p>The monster is still hungry! Hungry hungry hungry! We need to feed it more samples.
Do this by including multiple <code class="highlighter-rouge"><SAMPLE></code> blocks in the above XML. Then submit as before*.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl \
-u <ega-box-XXX>:<password> \
-F "SAMPLE=@sample.xml" \
-F "SUBMISSION=@submit.xml" \
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/
</code></pre></div></div>
<p><small>* Also, I reckon you’ll want to start keeping a log of the
result of these transactions - e.g. by piping the output into a file,
<code class="highlighter-rouge">curl ... > sample_result.xml</code>. But for the real run I found
the <a href="https://www.ebi.ac.uk/ena/submit/webin/">Webin portal</a> easier, because it collects the output for you.</small></p>
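<p>Writing hundreds of <code class="highlighter-rouge"><SAMPLE></code> blocks by hand is a bad idea - generate them from a table instead. A sketch that emits the same shape of XML as the single-sample example above; the input field names and the <code class="highlighter-rouge">illumina_hiseq:</code> alias formula are just the ones used in this post, so adapt them to wherever your sample metadata lives:</p>

```python
import xml.etree.ElementTree as ET

def build_sample_set(samples, center_name="MalariaGEN"):
    """Build a SAMPLE_SET XML string from a list of dicts with keys:
    'id', 'subject_id', 'sex', 'phenotype'."""
    sample_set = ET.Element("SAMPLE_SET")
    for s in samples:
        sample = ET.SubElement(
            sample_set, "SAMPLE",
            alias=f"illumina_hiseq:{s['id']}", center_name=center_name,
        )
        ET.SubElement(sample, "TITLE").text = "My whole-genome sequenced sample"
        name = ET.SubElement(sample, "SAMPLE_NAME")
        ET.SubElement(name, "TAXON_ID").text = "9606"
        ET.SubElement(name, "SCIENTIFIC_NAME").text = "Homo sapiens"
        ET.SubElement(name, "COMMON_NAME").text = "human"
        attrs = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")
        # The three attributes the EGA insists on for each sample.
        for tag in ("subject_id", "sex", "phenotype"):
            attr = ET.SubElement(attrs, "SAMPLE_ATTRIBUTE")
            ET.SubElement(attr, "TAG").text = tag
            ET.SubElement(attr, "VALUE").text = s[tag]
    return ET.tostring(sample_set, encoding="unicode")
```

<p>Write the result to <code class="highlighter-rouge">sample.xml</code> and submit with the same <code class="highlighter-rouge">curl</code> command as before.</p>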
<h2 id="the-journey-continues">The journey continues</h2>
<p>Congratulations! You have survived the first encounter. (Admittedly, we’re
still using the <a href="https://www-test.ebi.ac.uk/ena/submit/drop-box/swagger-ui.html">test monster</a>,
but hey.)</p>
<p>In <a href="/bioinformatics/data/2019/05/02/Me_versus_the_European_Genome_Phenome_Archive_part_two.html">part two</a> we will conduct a flanking
manoeuvre via the EGA FTP site, confronting it with some actual data.</p>

ENCALS Oxford 2018 day three
2018-06-22T00:00:00+00:00
/als/2018/06/22/ENCALS-Day-three-report

<p>For day one, see <a href="/als/2018/06/20/ENCALS-Day-one-report.html">here</a>.
For day two, see <a href="/als/2018/06/21/ENCALS-Day-two-report.html">here</a>.</p>
<hr />
<h3 id="summary-of-day-two">Summary of day two</h3>
<p>Highlights of yesterday were the longer talks - James Shorter’s talk on ‘reversing phase transitions’ (i.e. about prion-like proteins and methods of turning them back to their native conformation), and the talks by Christine Holt and Jik Nijssen about translation in axons. There were some very cool real-time videos of transcriptional and translational activity in axons. Matthew Wood’s talk about improving design & delivery of therapies was of interest - there’s more detail on that in the talks later today.</p>
<p>It seems clear that misfolded protein aggregation is a major feature of ALS. However, this seems to affect a bunch of different proteins - <a href="https://en.wikipedia.org/wiki/TARDBP">TDP-43</a>, C9orf72, SOD1, FUS, all of which contain ‘prion-like’ domains (domains with specific amino acid composition), and all of which have been observed in aggregated form in neurons of individuals suffering from ALS. These aggregates seem to self-catalyse and spread. Extrapolating a little, maybe they spread along the brain connectome during disease progression (c.f. Jill Meier’s talk from <a href="/als/2018/06/20/ENCALS-Day-one-report.html">Day one</a>). And, excitingly, maybe aggregation can be reversed (c.f. James Shorter’s talk). Similar effects are known for other neurodegenerative diseases, e.g. <a href="https://www.michaeljfox.org/understanding-parkinsons/living-with-pd/topic.php?alpha-synuclein">alpha-synuclein</a> in Parkinson’s. What was less clear to me is how all these different proteins relate to ALS. Can each of them cause ALS on their own? Do they catalyse misfolding of each other? There’s also still a bit of a question as to whether this is a cause or a symptom. However, there is evidence that reversing the aggregation reverses degradation. Is ‘neurodegenerative disease’ really a spectrum of damage due to aggregation of a variety of different misfolded proteins?</p>
<p>(There’s also a category of talk I have real trouble with. They go like this: we know disease or treatment X is involved in disease in humans. So we’ll make organism Y (= mice, or rats, or zebrafish, or less problematically stem-cell derived neurons) and we’ll do thing Z to them (where Z = knock out or knock down the gene, or overexpress it, or insert a humanised genetic variant, or inject some prion-like protein, etc.). Then we’ll observe what happens. Look, things happen! This often seems problematic to me as a) it’s often purely observational and b) it’s often pretty unclear how results can or could ever be interpreted in terms of human ALS. Not all this type of work suffers from these problems, but from my fairly naive vantage point I often find it difficult to tell and wish speakers would spend more time on motivation.)</p>
<p>Today’s session is largely about novel approaches to therapy and clinical trials; see the rest of this post for a summary.</p>
<hr />
<h3 id="sesson-7">Sesson 7</h3>
<p><em>Russell McLaughlin</em> - why GWAS is (still) useful in ALS
Russell is the winner of the ENCALS young investigator award. Begins with the genetic architecture of ALS. Genes affect the phenotype. Back in the ’90s we might have thought about individual genes (like SOD1). As time went on more genes were discovered. A picture of genetic heterogeneity emerged. But now talking about polygenic risk. Individuals will carry different amounts of this depending on alleles carried. Brief cartoon of a GWAS. Manhattan plot for ALS, 8 loci appear above the ‘genome-wide significant’ line. van Rheenen et al Nature Genetics 2016. But there’s extra signal in the rest of the genome. E.g. polygenic risk scores for prediction. Plot showing small amounts of prediction that increase as you include more SNPs (but remains low overall). Conclusion: a lot more ALS genes remain to be discovered. What will we discover in future? Shows LDScore and GTEx expression, signal is in central nervous system. Comparison with schizophrenia. Cross-trait prediction: does genetic schizophrenia risk predict ALS? Answer is yes - schizophrenia genes predict ALS status (quite a bit better than ALS genes do). This seems quite specific to schizophrenia. Does that mean ALS / SCZ should co-occur? Says no (cartoon of a joint distribution of liability, with a vertical threshold for SCZ and a horizontal one for ALS; the sets don’t overlap much). SNP chips don’t just predict disease. Population genetics of Ireland and Britain, showing a PCA-like analysis of Britain and Ireland. Iceberg analogy (what’s left to discover is below the surface).</p>
<p><em>Bart Swinnen</em> (Leuven) - Pur-alpha provides a potential link between RNA toxicity and loss-of-function in C9orf72 ALS.
Is there evidence for RNA toxicity? And what is its mechanism? Uses a zebrafish model. I didn’t listen to this, c.f. comments above.</p>
<p><em>Ziqiang Lin</em> (KCL)
MRI imaging reveals frontal cortical and cerebellar deficits in TDP-43(Q331K) knock-in mice. Still not very happy about this kind of talk. Anyway, tensor-based morphometry demonstrated brain volume changes in mutants. Primarily in frontal and cerebellar cortex. E.g. orbital cortex, volume decreased by 13%, P=0.027 (I didn’t catch the counts but they were something like 15 mice in each group).</p>
<p><em>Angela Genge</em> (Montreal) - on clinical trials in ALS.
Where are we? Riluzole approved 23 years ago (though not everywhere). Edaravone approved in Japan, South Korea and USA, being evaluated at Health Canada and the <a href="http://www.ema.europa.eu/ema/">EMA</a>. It took many years to progress to the pivotal trial, done in Japan. Final study was carefully crafted based on the results of the previous study to pick patients for the study. Many other drugs failed. What has been learned? Placebo arms can be a problem - because to prove efficacy the placebo arm has to progress. Need a way of stratifying or randomising to ensure placebos progress. Tolerability is a problem - side effects have important consequences on the ability to power a study; are we prepared for drugs that have a benefit but a side effect? Including patients too far into disease can be a problem - does this jeopardise results? Maybe should look for early / pre-disease individuals. Rapid and very slow progressors can also be a problem. And how to handle genetics? Classic way is: don’t do a genetic test unless there is a family history. But this can give false negatives. In speaker’s clinic, decided to screen every probable ALS patient for genes for which a drug is in development: SOD1 and C9orf72, and for FUS in under-30s. (Have never seen ‘a walking talking TDP-43 mutant’). Surprised to find half of SOD1 patients that went into the trial had no family history. Without screening would have missed 50% of patients. Says you have to ask family history twice. Because after 1st ask, they will go home and talk to family & physician and discover there was history of neurodegenerative diseases, and fatal diseases that went un-diagnosed. Now new design principles. Try to force homogeneity in population & speed of progression; make effective use of biomarkers to monitor progress; set long enough duration of trial (difficult because it costs ~1 million per month of study during recruitment period); and look carefully at time from onset for enrollment.
Now will talk about North American side. In Jan this year a draft guidance document was issued by the FDA (maybe <a href="https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM596718.pdf">this</a>). One piece of guidance is, if looking at something like intrathecal delivery (i.e. not something easy like a pill), also develop and test the delivery device at the same time. Also efficacy needs to be clinically meaningful on symptoms, function, or mortality (with a stress on the latter). Safety consideration: need adequate number of patients and duration - even for a re-purposed drug. Study design must be randomised, placebo controlled, double-blind, add-on, dose response, time-to-event. Efficacy endpoint: survival or function, but also could be new endpoint if appropriate. Functional endpoints should be early and frequent with safety assessments. Now the ‘Orphan Drug Act’. She notes we have lots of healthy mice running around, but not so many patients. Quotes one prominent member of this conference “I don’t believe anything in mice”. ‘Orphan drug’ is one intended for use in a rare disease. This means < 200K people in U.S. population. This means get strong financial incentives if the drug is ‘designated’ as orphan, and 7 year marketing exclusivity. Moreover can apply before phase I or throughout development programme. Designated based on disease/condition, prevalence, and scientific rationale (which can be based on clinical, lab, or in vitro data). E.g. in 2012, 30% of orphan drugs were designated exclusively based on animal model or in-vitro based evidence. Notes EMA and Swiss processes are different. So where are we? Do we have outcome measures that are validated and reliable? Yes, but can improve upon them. What about biomarkers that correlate with clinical changes? Maybe. Are we identifying patients early enough? (Says we should specifically focus on possible and lab-supported probable, i.e. very early, patients.)
Do we have good diagnostic criteria? Finally, how much change in an outcome measure is necessary to say it’s a significant improvement (in terms of difference in quality of life)? Does this depend on the expense and difficulty of treatment? We need a view on this. Last slides are her opinion: let’s make tighter selections on patients going in. Says we should all be on a mission to capture patients in the first 6 months after diagnosis. Reduce delays. Put minimum scores (e.g. min <a href="http://www.outcomes-umassmed.org/als/alsscale.aspx">ALSFRS</a>). Collect blood on every patient so that genetic makeup can be assessed. Determine rate of progression (for example, based on ALSFRS monthly for 3 months before screening). Promote the use of prediction algorithms (c.f. <a href="http://www.prize4life.org">Prize4Life</a>, <a href="http://encalssurvivalmodel.org">ENCALs prediction model</a>). Avoid run-in designs (which effectively waste 3 months at the start before treatment begins).</p>
<p>Questioner asks about getting people early diagnosis, says it is very very difficult to get people prior to one year after diagnosis. Speaker agrees, advocacy groups need to get awareness to a place where we incentivise that. Also refers to clinician friend who, when patient is young, tends to take longer to make diagnosis. So get GPs to send patients earlier.</p>
<p><em>Ruben van Eijk</em> (Utrecht)
Evidence-based trial design. Endpoint of today’s talk is ‘time to fall asleep’. For ALS the event is onset, disease stage, or survival. Could be measured in ms (for cultures), days or years in patients. Gold standard in ALS is to have little doubt about efficacy and be unaffected by ‘deblinding’. Systematic review of ALS trials 2000-2018. Dexpramipexole (2013) 942 patients, Olesoxime 513 patients (2014), Pentoxifylline (2006) 400 patients - all showed similar (I think hypothesised) efficacy. Which sample size is right for this? Definition of ‘event’. How to incorporate an unexpected event (like tracheostomy)? 4 of those trials had death as ‘event’. Observed survival rates lower than hypothesised survival rates (I think he’s saying this is true across trials in his table). Movie about the
TRICALS website, a tool for evidence-based trial design for ALS. (Looks pretty cool but I can’t find a working link - I think it’s www.tricals.org but this seems to be down.) Also contains documentation (and code) for the statistical analysis being performed. That is very nice (but offline). Aim 1: standardisation (and centralisation) of ALS trial design; 2: provide guidance and reduce erroneous assumptions; 3: implementation platform for advanced trial methodology.</p>
<p><em>Helene Tran</em> (U. Massachusetts) - Optimisation of preclinical nucleic acid-based therapeutics.
Intro about C9orf72 repeat expansion, found in ~40% familial and ~7% sporadic ALS patients. Expansion transcribed into repeat RNA that translates into repeat peptides that form aggregates inside cells - nucleus and cytoplasm. Disease models (mice, iPSC-derived cells, etc) agree this peptide is toxic. Hypothesise can silence the repeat RNA to fix this. Use antisense oligonucleotides that enter the cytoplasm or nucleus, hybridise to target RNA, and result in RNase H-mediated degradation. Why use <a href="https://en.wikipedia.org/wiki/Allele-specific_oligonucleotide">ASO</a>s (single-stranded nucleic acids)? Other therapies using ASOs are now being approved. Now about optimisation. 1: identify ASO sequences that specifically target C9orf72 repeat-containing variants. Shows C9orf72 transcript variants, V1, V2, V3, differing in their use of exons. V2 is the most widely transcribed but does not express the repeat. So can target V1 and V3 without removing C9orf72 translation completely. Designed ASOs, found indeed they knocked down V1 and V3 but not V2. Also staining indicates reduced # of repeat peptides per cell. Now mice. ASOs do not cross the blood-brain barrier, so intracerebroventricular injection. Tested 3 sequences, 2 were not well tolerated (sequence=GCCCCTAGCGCGCGACTC). But one was, and it spread around the brain. Now talking about ASO stability: inherently unstable. Modified backbone by adding phosphorothioate (PS) linkage, increasing stability. Also added sugar modification into RNA-like conformation; this increases binding affinity. Tests (more mice) indicated the modified and unmodified ASOs (ASO5) have similar effects, but modified ASO is better tolerated (lower body weight loss…that’s a bit worrying). Next tried to improve tolerability by studying dose. Found 200 micrograms admissible without adverse effects, which is pretty low. Added modified versions (ASO5-1 and ASO5-2) that mix PS linkage and phosphodiester (PO) linkage. Found a more tolerable version.
Now worked (on more mice) to study effective dosage. Further work on stability and longevity of ASO. Seems to work up to 12 weeks after injection. To conclude: first generation ASO not that bad; if you optimise you can make something that is both safe and active and lasts for a while. It does not exacerbate C9orf72 <a href="https://en.wikipedia.org/wiki/Haploinsufficiency">haploinsufficiency</a>. Concludes by noting this is Robert Brown’s lab.</p>
<p>Questioner asks if the oligos might target any other genes; the answer is that they did work suggesting it may target some intergenic regions but not other genes.</p>
<p><em>Jean-Cosme Dodart</em> (Massachusetts)
Wave Life Sciences, ‘<a href="https://alsnewstoday.com/tag/wve-3972-01/">WVE-3972-01</a>’ (This is what Helene Tran was talking about). I think it is in a <a href="https://adisinsight.springer.com/trials/700295802">clinical trial</a> (more news on <a href="https://www.als.net/news/targeting-c9orf72-als-wave-life-sciences-advances-aso-toward-clinic-pfizer-and-sangamo-partner-on-zinc-finger-approach/">this and another C9orf72 therapy</a>) starting in 2018. Lots of details in this talk but I’m already convinced by the previous talk. One thing is visualisation of WVE-3972-01 results in monkey CNS tissue. (Poor monkeys). Toxicology studies are ongoing, clinical trials in Q4 2018.</p>
<p><em>Rubika Balendra</em> (UCL)
Another approach to targeting C9orf72 repeat expansions. C9orf72 repeat RNA forms a G-quadruplex secondary structure (diagram of this which looks like a sort of Escher version of a cube). Look for small molecules that stabilise C9orf72 RNA by screening small molecules. Picked 3 particular molecules. Use iPS-derived motor neurons; takes 30 days and gets 90% pure motor neurons, and these do exhibit C9orf72 repeat expansion expression. G-quadruplex-binding small molecules have low toxicity in these cells. Study in derived cortical and motor neurons, two molecules (called DB1246 and DB1247) were most effective in reducing quadruplexes. Lots more evidence it is working, in Drosophila. Video of live drosophila larvae dissection. Lovely. Conclusion: potential for therapy.</p>
<p><em>Maria Grazia Biferi</em> (IM, Paris) - about a new gene therapy for SOD1-linked ALS.
Prize4Life award. Replicating increase in survival in SOD1 mice. Trying to apply similar method to C9orf72 mutants. Using antisense approaches. Exploring using a gene therapy vector (<a href="https://flintbox.com/public/project/30799">AAV10</a>). It allows longer antisense RNA to be used. Unfortunately I have drifted off to think about other things in this talk.</p>
<p><em>Julian Gold</em> (Sydney) An open-label trial of Triumeq in patients with ALS.
Human endogenous retroviruses: a link to MND/ALS: <a href="https://mndresearch.blog/2017/09/25/lighthouse-project-shines-a-beacon-on-hervs-and-their-role-in-als/">The Lighthouse Project</a>. Phase IIa study of antiretroviral therapy (which is used to treat almost everyone with HIV). Investigating safety and tolerability and efficacy parameters. 40 patients at 4 sites in Sydney and Melbourne. MND confirmed < 24 months ago. All patients HLA B*5701 negative (this is to avoid allergy to one of the drugs). Ten-wekk lead in phase, then treated with ‘open-label’ fashion with Triumec. It is a combination of three drugs (Abacivir, lamivudine,dolegurine or something), it is one tablet once a day. Screened 44, 3 dropped out for social reasons and were replaced, during treatment 5 dropped out and 35 finished study. Of 5 dropouts, 2 did so due to high liver function tests, possibly due to alcohol, 3 because of reasons unrelated to Triumec. AEs (= side effects), generally unrelated to drug. Concluded it was safe and well tolerated, not interacting with Riluzole, no vital sign indicators => primary outcome met. Secondary outcome: ALSFRS-R, forced vital capacity (FVC), neurophysiological index (NPI), biomarkers (including P75), survival. ALSFRS-R more or less flat per patient. ALSFRD-R dropped during both lead-in by about 1 point per month - relatively quick progression and during treatment by about half that. Used matched ‘historical controls’ from the GSK ‘NoGo’ study. Some evidence trajectory improved for lighthouse patients. Similarly for FVC and NPI (but there are pretty massive confidence intervals here). Kind of similar for P75 biomarker, some stratification here (that looks a bit tenuous to me because, again confidence intervals are wide). Then survival, compared with ENCALs survival curve. (Can these be compared? They are different studies). But says there appears to be something going on. What does it all mean? 
Parameters look like they are influencing patients in the right direction. OK, why might it work? Infections limited to somatic cells only allow horizontal transmission. The most common one that we know of is HIV. It lands on the cell and releases its core of RNA into the cell, and then it produces an amazing enzyme called reverse transcriptase. This converts RNA to DNA, which is integrated into our genome. It becomes part of our genome! Species have been infected with these retroviruses over the last 40 million years - leading to natural selection. Diagram of these across species. ‘<a href="https://www.ncbi.nlm.nih.gov/pubmed/8797733">HERV-K</a>’ is the last one to have been integrated into our genome - 8% of our genome is HERVs (retrotransposons). But of these only HERV-K is still able to produce an active virus. A while back some groups looked to see if there was an association with ALS, and found increased reverse transcriptase in ALS compared to controls and other diseases. HERV-K is present at 35 different places across the genome, and can be upregulated, producing viral particles which bud off and transfer to other cells. It’s been found to damage (only) motor neurons. Is this cause or effect? Well, HERV-K introduced into transgenic mice induces MND (<a href="http://stm.sciencemag.org/content/7/307/307ra153">Wenxue Li et al 2015</a>). So can antiretrovirals make any difference? Looked at IC90 and found components of Triumeq were more effective against HERV-K than against HIV. Is this a clue? It may certainly be worth looking at further.</p>
<p><em>Jonathan Katz</em> (San Francisco)
NP001 Phase 2 Results. Starts by saying this trial was negative. NP001 is a pH-adjusted IV formulation of sodium chlorite. Had been thought to do something. Earlier clinical trial: this kind of worked, in 136 patients - particularly, effects in some strata analysed post hoc. Phase 2b had 138 patients. Well-balanced study. Drug was well tolerated. Primary endpoint: both groups fell by the same amount. (Although his plots show the treated patients doing better than the placebos - not much better, not significantly better, but better.) Serum IL-18 did not change, so it may be that the drug didn’t work. Take-home message 1: beware post-hoc cohort analyses in the name of heterogeneity - statistical risk! (I guess this is garden of forking paths / researcher degrees of freedom etc.) Dose ranging (including different dosages) itself doubles the chance of a ‘winner’ (due to statistical noise). Was there a sufficient outside sounding board? (They mainly had investors and infrastructure members that benefited from the trial.)</p>
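<p>Katz’s multiple-comparisons point is easy to demonstrate. Here is a toy simulation (mine, not from the talk): even when no arm has any real effect, the chance that <em>some</em> arm beats placebo at p < 0.05 grows with the number of arms/doses tested.</p>

```python
import random
import statistics

def one_null_trial(n_arms, n_per_arm=50, alpha_z=1.96):
    """Simulate a trial where NO arm has any real effect; report whether
    at least one arm still 'wins' against placebo at roughly p < 0.05.
    Outcomes are standard normal; uses a two-sample z statistic."""
    placebo = [random.gauss(0, 1) for _ in range(n_per_arm)]
    for _ in range(n_arms):
        arm = [random.gauss(0, 1) for _ in range(n_per_arm)]  # no true effect
        # z statistic for difference of means (known unit variances)
        z = (statistics.mean(arm) - statistics.mean(placebo)) / (2 / n_per_arm) ** 0.5
        if abs(z) > alpha_z:
            return True
    return False

def winner_rate(n_arms, n_trials=2000):
    """Fraction of null trials that produce a spurious 'winner'."""
    random.seed(1)
    return sum(one_null_trial(n_arms) for _ in range(n_trials)) / n_trials

# More arms => more chances for a spurious 'winner' by noise alone.
print(winner_rate(1), winner_rate(2), winner_rate(4))
```

(The same logic applies to post-hoc subgroups: each extra comparison is another lottery ticket for a false positive.)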
<hr />
<h3 id="summary">Summary</h3>
<p>Here are a few points from today’s talks that stood out to me:</p>
<p>1: Angela Genge’s comment that half of all ALS patients with SOD1 mutations entering clinical trials had no family history. I.e. family history may not be a reliable marker of genetic causes.</p>
<p>2: Greatest progress seems to be being made on therapies for individuals carrying the C9orf72 repeat expansion. There were a couple of talks about this, but the ones on <a href="https://alsnewstoday.com/tag/wve-3972-01/">WVE-3972-01</a> seemed particularly compelling, with detailed work on optimising the therapy, and this is entering clinical trials.</p>
<p>3: I’d expected more to be said about <a href="https://en.wikipedia.org/wiki/Masitinib">Masitinib</a>, which as far as I can tell is hoped to be generally protective in a mild way. It sounded like there is some difficulty getting approval for its <a href="https://clinicaltrials.gov/ct2/show/NCT03127267">clinical trial</a>.</p>
<p>4: Very valid points from Jonathan Katz, whose talk was really a cautionary tale about drawing unwarranted hope from post-hoc analyses of clinical trial data. (I.e. look hard enough and you can always find an effect, c.f. ‘researcher degrees of freedom’, but this does not mean it is real.)</p>
<p>5: Julian Gold’s talk on Triumeq was very thought-provoking. I did not really believe in the effects found in clinical trials. But the hypothesis, described in the second half of the talk - about activation of latent retroviruses encoded in our DNA - was really interesting.</p>
<p>6: Finally, I also noted there were no talks about looking for <a href="https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/de-novo-mutation">de novo mutations</a> (i.e. mutations that occur in the patient, not inherited from parents) in sporadic ALS patients. I was a little surprised about this because I’ve seen work on finding putatively causal de novo mutations in a previous talk by <a href="https://ichidalab.usc.edu">Justin Ichida</a>, and thought that might be an area of active research here.</p>

<h2 id="encals-oxford-2018-day-two">ENCALS Oxford 2018 day two (2018-06-21)</h2>

<p>For day one, see <a href="/als/2018/06/20/ENCALS-Day-one-report.html">here</a>.
For day three, see <a href="/als/2018/06/22/ENCALS-Day-three-report.html">here</a>.</p>
<hr />
<h3 id="summary-of-first-day">Summary of first day</h3>
<p>I confess I was a bit nonplussed by yesterday’s first session, in which it was suggested that hyperexcitable neurons are important (but doubt was cast on how specific to ALS this is), that exercise & athleticism may be associated (but evidence wasn’t presented except the <a href="https://www.ncbi.nlm.nih.gov/pubmed/19267274">Italian footballers</a>; surely there are other footballers around to test this in?), and that electric shocks or exposure to high-frequency electric/magnetic fields may be causal (though there’s some doubt about confounding). In the second session we heard that telomere length is associated with increased risk, an opposite effect to that <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4789589/">found in mice</a>.</p>
<p>Of course, any of these could be true, but surely they can’t <em>all</em> be true. (But given the rarity of disease, they could all be noise.) I’m left not quite knowing what to believe.</p>
<p>It’s also clearly not known how currently-used drugs work, and anyway they don’t seem to work very well.</p>
<p>So what I think this says is the field is still searching for the real causes.</p>
<p>Not surprisingly I liked the genetics talks yesterday better (as well as the talk from Jill Meier on spread of neuron impairment along the brain <a href="http://www.dutchconnectomelab.nl">connectome</a>). That’s maybe not surprising as I’m a geneticist. The first talk of this session said we need large GWAS. There are some clearly real genetic signals around (particularly in the versions of ALS that run in families). Genotypes are easier than most things to reason about causally, because the (adult) outcome can’t generally affect the genotype. (Although there was an observation of somatic C9orf72 variation yesterday, which was also interesting.) The high estimated heritability means that a lot of the causal effects are likely genetic (although worth noting that the first speaker yesterday had a somewhat opposing view). Many of these effects will be small (as individual genetic effects) - finding them will require large sample sizes. And work is in progress to collect these. <a href="https://www.projectmine.com">Project MinE’s website</a> says they’ve collected 47% of their target 22,500 DNA samples; these’ll all be sequenced so it’ll be a fairly colossal set of data. (I don’t know where the other 53% are coming from.)</p>
<p>On to day two’s talks:</p>
<hr />
<h3 id="session-3">Session 3</h3>
<p><em>James Shorter</em> (U Penn)
“Reversing aberrant phase transitions connected to ALS”
About <a href="https://en.wikipedia.org/wiki/TARDBP">TDP-43</a> and <a href="https://en.wikipedia.org/wiki/FUS_(gene)">FUS</a>, which form cytoplasmic ‘inclusions’ in the degenerating neurons of ALS and FTD patients. TDP-43 pathology is very common in ALS (93% of patients), while FUS pathology is rarer, maybe 1%; a bit more is seen in FTD, which also sees TDP-43 and <a href="http://www.uniprot.org/uniprot/P10636">TAU</a> pathologies. Both TDP-43 and FUS are RNA-binding proteins (RBPs) with ‘prion-like’ domains. They shuttle in and out of the nucleus, performing RNA transport, and function in splicing, transcription & RNA processing. They help regulate thousands of human genes. “<a href="http://www.ncbi.nlm.nih.gov/pubmed/28389532">Prion-like domains</a>” (PrLDs) are distinctive low-complexity domains enriched in Q, N, S, Y and G, which enable RBPs to form self-templating fibrils. They are a <em>type</em> of low-complexity domain. A prion is an infectious protein that can exist in at least two forms, one of which is self-templating: the templating form converts native proteins into the self-templating state. C.f. mad cow disease. Prion proteins are found in yeast (SUP35, URE2, RNQ1). It is the domain that’s important, e.g. if the domain is deleted, SUP35 loses its ability to template. Moreover, inserting the domain into another protein makes it prion-like. Also, one can scramble the amino acid sequence of the domain: it still acts like a prion. So it’s the composition, not the sequence, that’s important. Developed algorithms to find PrLDs in genomes. In humans, they find 240 proteins with prion-like domains. ~30% are RNA-binding proteins, another ~30% are DNA-binding, and ~75% have a nuclear localisation sequence - often nuclear proteins. Ranked list of proteins: FUS and TDP-43 are in there; wondered about other genes on this list. Lots of refs here, e.g. <a href="http://www.ncbi.nlm.nih.gov/pubmed/28389532">Harrison & Shorter 2017</a>. 
Indeed several of these genes are now linked to neurodegenerative disease, e.g. TIA1 is connected to rare forms of ALS. It seems to be expansions within these prion-like domains. An example is TDP-43, where almost all the disease mutations cluster in the PrLD. So why do humans keep these domains? It turns out the domain is important for the function of these genes. Now showing an experiment observing phase transition of the protein into a pathological fibrillar aggregate (e.g. Murakami et al Neuron, <a href="http://www.ncbi.nlm.nih.gov/pubmed/26996412">March et al Brain Res 2016</a>). In the liquid phase, these proteins clump and eventually convert into a gel-like phase, an ‘aberrant phase transition’. Are these aberrant phase transitions actually neurotoxic? For TDP-43: tag TDP-43 with the Cry2 light-responsive protein; now one can shine blue light on the cell to induce the aberrant phase transition and ask whether it’s toxic to cultured neurons. Timelapse photos of TDP-43 in neurons under light and no light. It behaves differently: aberrant TDP-43 spreads under blue light and is damaging to cells. Can we find agents that reverse the aberrant phase transitions? An attractive strategy because it could reverse toxicity. Using Karyopherin-beta2 (KapB2, or transportin), a nuclear-import receptor for FUS, TAF15, EWSR1, and many of the proteins on their list of prion-like proteins. It traffics proteins through the nuclear pore (which is itself a complex viscous gel). Ran inside the nucleus separates KapB2 from its cargo. Now Guo et al Cell 2018. Found KapB2 rapidly disassembles FUS fibrils. Then tried KapB2-WWAA (a modified KapB2), which can’t disassemble FUS. Adding Ran suggested this disassembly activity is likely localised to the cytoplasm, not the nucleus. What about other FUS-binding proteins? E.g. HDA1, FUS antibodies: these did not have these properties, so simply binding is not enough. So what’s going on? 
It seems that KapB2 engages the PY-NLS on FUS fibrils (like a FUS antibody), but also makes secondary contact with the PrLD of FUS, and this leads to disaggregation, giving a soluble KapB2-FUS complex. Ran-GTP completes the disaggregation, separating KapB2 from FUS. What about other RNA-binding proteins? KapB2 works for TAF15 and another PrLD protein’s fibrils. Extends to others. But annoyingly, not some mutant forms - FUS-R495X and FUS-P525L. (What about TDP-43? He doesn’t mention that here.) Oh, here we go: importin alpha and KapB1 disassemble TDP-43 (as does something called A503). Now looking at macroscopic images. FUS spontaneously self-assembles into macroscopic hydrogel stages. What does KapB2 do to this? Sure enough, KapB2 completely disrupts the structure. Pretty striking. Now a real-time video of FUS liquid droplets being dissolved. Now in yeast cells. FUS is toxic to yeast cells. Switch KapB2 on after FUS has accumulated: it eliminated FUS in the cytoplasm. We would like to be able to do this in ALS patients. Now showing reversal of neuron degeneration in drosophila (fruit flies). Conclusions: some sort of nuclear import defect leads to FUS localising to the cytoplasm, leading to solid-state aggregates. He thinks KapB2 can reverse all of these stages, bringing it back to the normal state. A parallel system (importin alpha + KapB1) also does it for TDP-43. (Would also be good to look for small molecules / drugs that increase KapB2 expression.) Future: re-engineering KapB2 to recognise mutant FUS forms; preliminary data says this may be working. (This talk was nice.) A questioner asks about inside the nucleus, where this system is not going to work because of Ran-GTP levels. The speaker says yes, sometimes you see aggregates in the nucleus, and he’d like to know if there’s a parallel mechanism there.</p>
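<p>The composition-over-sequence idea behind those PrLD-finding algorithms can be sketched in a few lines. This is my own toy illustration, not the published method (the real algorithms are considerably more sophisticated): slide a window along a protein sequence and flag windows enriched in the Q/N/S/Y/G residues typical of prion-like domains.</p>

```python
# Toy composition-based PrLD scan: scores depend only on amino-acid
# composition within a window, not on the exact sequence order.
PRION_LIKE = set("QNSYG")

def prld_scores(protein, window=60):
    """Fraction of prion-like residues in each sliding window."""
    return [
        sum(aa in PRION_LIKE for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]

def has_candidate_prld(protein, window=60, threshold=0.6):
    """Flag proteins with any window dominated by prion-like residues."""
    return any(s >= threshold for s in prld_scores(protein, window))

# Hypothetical sequences for illustration: a Q/N-rich stretch scores
# high; a mixed 'globular-like' one does not.
rich = "QNQNSYGQNSGY" * 10
mixed = "MKLVADERWHPC" * 10
print(has_candidate_prld(rich), has_candidate_prld(mixed))  # prints: True False
```

Note that scrambling `rich` would not change its score - exactly the scrambling observation from the talk.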
<p><em>Raphael Munoz-Nuez</em> (Paris ICM) - studying the interaction between TDP-43 and SQSTM1/P62
Most of the common ALS genes discovered are involved in protein clearance; another group is involved in RNA metabolism, as in the previous talk. TDP-43 has a nucleic acid binding domain; the PrLD leads to phosphorylated and ubiquitinated TDP-43 aggregates. (Line et al 2013, Neumann et al 2006). Now sequestosome 1/p62. Looking for protein-protein or protein-RNA level interactions between these two proteins. Work on zebrafish, transient knockdown models. Not really listening to this.</p>
<p><em>Pietro Fratta</em> (UCL)
TDP-43 is central to ALS pathogenesis. TDP-43 mutations are clearly established as causing disease. We know a lot about it. (C.f. <a href="https://www.ncbi.nlm.nih.gov/pubmed/20400460">Lagier-Tourenne et al</a>.) TDP-43 is dosage-sensitive: even subtle alterations in protein levels induce splicing changes. About experiments trying to get normal expression (I think). I’m not following this. A conclusion is that TDP-43 gain-of-function mutations produce novel splicing events, “skiptic exons”.</p>
<p><em>Jolien Seyaert</em> (Leuven)
FUS inclusions are found in ALS and FTD patients. Aim: characterise FUS toxicity in a Drosophila model. Developed 3 fly lines in which the human FUS gene + 2 mutant forms were inserted. Expressed in the central nervous system using a GAL4 expression system. Pictures of flies that are maldeveloped (poor wings and skin). Try to understand this. Study particular neurons called CCAP neurons. A subset of these secrete bursicon; is this going wrong? Seems to be less bursicon being expressed by these neurons. CCAP neuron loss. Can be rescued by preventing apoptosis. Table of human families from <a href="https://www.ncbi.nlm.nih.gov/pubmed/19251627">Kwiatkowski et al 2009, Science</a>; the point is that there’s a wide range of age of onset and also disease progression within carriers of the same mutation (and within families). Now talking about ‘exportin 1’. Exportin 1 knockdown counteracts the formation of FUS inclusions. (I think FUS inclusions = the FUS accumulations described in earlier talks, but not sure.) Hypothesis: exportin binds to FUS as it leaves the nuclear complex, and sequesters it into ‘stress granules’, where it aggregates. No exportin => less aggregation => protective. Says NUP154 and exportin 1 are potential suppressors of FUS neurotoxicity.</p>
<p><em>Somebody</em> (somewhere) (I thought this was Ziquang Lin, but she spoke on day 3 on a different topic. So this is somebody else).
‘SOD1 prions transmit ALS to hemizygous hSOD1^D90A transgenic mice’
SOD1 exists in all cells in the human body; mutations are known to cause ALS. ‘Seeding principle’, modified from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3963807/">Jucker et al 2013</a>. The natural initial phase of fibril formation is slow. Injecting seed SOD1 fibrils leads to much quicker growth of aggregates and death (of mice). Are SOD1 proteins from mutant lines seeds for this? Tried by injecting them, using two strains (‘A’ and ‘B’) and a control strain; 1ul of seed (~5ng hSOD1). Strain A- and B-seeded mice died much quicker, but the control strain and a non-inoculated strain didn’t show this. (So I think she is saying the mutant SOD1 is not seeding.) (I forgot to say: poor mice.) Now another SOD1 mutation, discovered in a patient in 1995 (<a href="https://www.ncbi.nlm.nih.gov/pubmed/9365366">Andersen PM et al, Brain 1997</a>). This talk had lots of similar-looking plots; I got a bit lost. The conclusion is that ‘strain A’-like truncated humanised SOD1 did induce a transmissible aggregation causing disease and death.</p>
<hr />
<h3 id="coffee-break">Coffee break</h3>
<p>Over the break I was looking at posters, some of them are:</p>
<ul>
<li>
<p>a poster about assessment of microsatellite repeats using whole-genome sequencing data from <a href="https://www.projectmine.com">Project MinE</a>. (They use <a href="https://github.com/tfwillems/HipSTR">HipSTR</a> and other existing software.)</p>
</li>
<li>
<p>a poster about <a href="https://en.wikipedia.org/wiki/Masitinib">Masitinib</a>. Masitinib has undergone an initial phase III trial in which it appeared successfully therapeutic, with modest effects on ALS. It is now undergoing a <a href="https://clinicaltrials.gov/ct2/show/NCT03127267">second phase III clinical trial</a> that is recruiting in 2018. (See also <a href="https://alsnewstoday.com/2017/05/19/masitinib-slows-als-progression-effective-treatment-trial-data-shows-says-ab-science/">this</a>, which I think is about an earlier trial.)</p>
</li>
</ul>
<hr />
<h3 id="session-4">Session 4</h3>
<p><em>Olaf Ansorge</em> (Oxford) - Neuropathological heterogeneity across ALS.
Starts with an outline of 6 themes: What do we talk about when we talk about ALS? (<a href="https://en.wikipedia.org/wiki/Nosology">Nosology</a>.) Phenotypic extremes. Does ‘incidental’ ALS exist? Selective vulnerability. Genotype and neuropathological phenotype: powerful allies. Finally a clinical vignette. 1: ALS is primarily a clinical diagnosis - upper and lower motor neuron signs and symptoms. The clinician infers <em>amyotrophy</em> (affecting lower motor neurons (LMN)) and <em>lateral sclerosis</em> (UMN). Anatomy of UMNs connected to LMNs in brain and spine. Pictures of two real spines, from a healthy and an ALS patient. The spine is visibly altered with neurogenic atrophy (the amyotrophic component). Now a cross-section of the spinal cord: severe degeneration where spinal column material becomes soft. So the initial signs are clinical; one can then further refine by molecular pathology. Extremes of phenotype: what are the boundaries of ALS? Now presymptomatic neuropathology. Plot (I think it’s a cartoon) showing protein aggregation of TDP-43 prior to clinical diagnosis. Does one ever see TDP pathology in individuals not suffering from ALS? He thinks no, because it’s never seen in individuals in their brain bank. This differs from Alzheimer’s and Parkinson’s disease, which show a linear trend. In ALS he postulates that the onset of protein aggregation is a fairly rapid “catastrophic” event. That’s pretty interesting. (Notes there may be some reports of this from Asian studies.) “Selective vulnerability”: 95% of ALS is defined by mislocalisation of TDP-43, aggregating into fibrils. A classification scheme was established for the FTD field, types A, B and C, with different types of aggregation linked to different clinical presentations. Sporadic cases seem to be ‘type B’; C9 cases seem to be a mixture of types A and B. What that means is not entirely clear. There is a striking oligodendrogliopathy in ALS. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4918136/">Takeuchi et al 2016</a>. 
Seen in grey matter, rarely in white matter. What does this mean for propagation? (Not all oligodendrocytes are created equal: some are ‘satellites’, some are ‘interfascicular’; they do different things, and understanding this -opathy in them would be valuable.) Now intra-individual versus inter-individual TDP-43 ‘strain’ diversity. Is there diversity within individuals? E.g. can’t find specific C-terminal fragments in the spinal cord, suggesting it is specific to parts of the nervous system, although he cautions this could be an artifact. Now talking about the major effector proteins: SOD1, TDP-43, FUS. Genes and pathology define the beginning and end of the pathogenic pathway. TDP-43 in neurons and oligodendroglia. SOD1 in neurons and maybe astroglia. FUS in neurons and oligodendroglia. TDP-43 seems to be the pathological effector of mutations in a range of genes. Lastly a clinical vignette: a man with a clinical diagnosis of FTD. Sequenced, found het for a TDP-43 mutation (c.859G>A, p.(Gly287Ser)). Children at 50% risk of inheriting this mutation. This man came for autopsy, and they did a full research autopsy. The conclusion is that this is full-blown Alzheimer’s pathology, with no evidence of TDP-43 dysfunction: an Alzheimer’s FTD mimic. Brain sequencing confirmed this. Notes: genetic services have an impossible task if they are required to study in detail the evidence in the literature before issuing a report. Ends with a picture of the <a href="https://www.gtc.ox.ac.uk/about-gtc/history-and-architecture/the-radcliffe-observatory.html">Oxford Radcliffe Observatory</a>.</p>
<p><em>Chris Henstridge</em> (Edinburgh)
Synapse loss in the prefrontal cortex. Evidence that ALS and FTD lie on the same spectrum: genetic evidence; pathological evidence (half of FTD patients also present with ALS) - but also clinical evidence re: symptoms. A third of ALS patients also have cognitive decline that is similar to, but not diagnosed as, FTD. Could similar brain changes underlie this? Focus on <a href="https://en.wikipedia.org/wiki/Synapse">synapse</a> loss, which is a shared mechanism in neurodegenerative disease. Synapse loss is seen early - before neuron loss - in PD, AD, ALS and FTD. Data is from postmortem tissue analysis: 20 ALS and 5 control patients. Tissue preserved in many formats for different research approaches, enabling tissue electron microscopy; observe ~100 synapses per case. The decrease in ALS patients is statistically significant. Shows a video of <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2080672/">array tomography</a>: cut cortex columns, fresh pieces of brain are embedded in resin, can cut ribbons 70nm in thickness. Can do conventional approaches on these ribbons and then reassemble into a 3d map. Prefrontal cortex (29 ALS and 14 controls, ~36,000 synapses per case). The difference is still statistically significant (but no more so than before; there’s more data here, so I guess this indicates the observed effect is smaller). Of the 29 ALS cases, 23 had been cognitively screened pre-mortem: 16 unimpaired and 7 impaired, meaning they have a subtle cognitive change not as strong as FTD. Impaired patients had the lowest synaptic counts. Conclusion: synapse loss is associated with cognitive change. So what’s driving that? Found no association between Alzheimer’s-like pathology and synapse density. pTDP-43 does associate weakly with lower synapse density. pTDP-43 aggregations accumulate at the synapse in ALS. Pictures show large clumps but also a lot of small clumps, which localise with synapses. But just because they’re there, are they causing anything? 
Trying to get at this at the moment; need to use mouse models. Single Q331K mutation knock-in. Found ‘frontal-dependent behavioural change’: <a href="https://www.ncbi.nlm.nih.gov/pubmed/29556029">TDP-43 gains function due to perturbed autoregulation…</a>. Can’t say what the mechanism of synaptic loss is, but can say that that single mutation is leading to synaptic loss. Also <a href="https://www.ncbi.nlm.nih.gov/pubmed/28669544">this</a>. Microglia-dependent synapse loss? TDP-43 KO in microglia results in fewer cortical synapses, and more synaptic material engulfed by microglia. Summarises and describes ongoing studies on this.</p>
<p><em>Noemi Gatto</em> (Sheffield)
About misfolded SOD1 in <a href="https://en.wikipedia.org/wiki/Astrocyte">astrocytes</a>. Diagram of pathogenic mechanisms in SOD1 ALS. Aim: to determine the role of wild-type SOD1 in sporadic ALS cases, using iNPCs (induced neural progenitor cells). Slide about making these. Is SOD1 in the nucleus? Staining indicates it is. Found higher levels of nuclear misfolded SOD1 in sALS (but the sample size is tiny - 3 in total, I think). shRNA (<a href="https://en.wikipedia.org/wiki/Short_hairpin_RNA">short hairpin RNA</a>) for SOD1 successfully reduces the level of SOD1. Also observe a higher level in C9 patients. (Not clear to me if these are all different from the previous ones.) SOD1-mutation patients don’t have a higher level of nuclear SOD1. I don’t think any of these sample sizes are large. Now, does CRM1 (which I think is exportin 1, see <a href="http://www.uniprot.org/uniprot/O14980">here</a>) decrease? Preliminary data suggest yes, there’s a decrease. So these data indicate SOD1 is detectable in the nucleus of astrocytes, and reveal a link between a possible unknown function of SOD1 and sALS pathophysiology.</p>
<p><em>Matthew Nolan</em> (Oxford)
About selective vulnerability in ALS. Says several cell types (not just neurons) are involved, and proteinopathies seem to be involved in all of these. Picture of (real, I think) primary motor cortex and brain, showing the parts controlling face, hand, leg. Huh. Aims to characterise the pathology of primary motor cortex across the spectrum of ALS, and compare this across molecularly-defined subtypes. Uses a variety of markers to assess selective vulnerability across specific cell types. A problem is that these markers don’t work well in the long-fixed material found in most biobanks; using short-fixed material this works. The study has 19 controls, and 43 sporadic, 18 C9orf72, 11 SOD1, 9 FUS and 5 other gene-variant ALS patients. Severity of pTDP-43 deposition correlated with the extent of microglial activation. Now looking at UMN versus LMN burden of <a href="https://en.wikipedia.org/wiki/CD68">CD68</a> (which I reckon is one of their chosen markers). I got distracted here.
<p><em>Roisin McMackin</em> (Trinity College, Dublin)
ALS as a network disorder; measuring network change. Pros and cons of electroencephalography (EEG). Pros: 1: it gives a direct measure of neuronal function; 2: dysfunction precedes cell death, so one can measure change before cell loss; 3: the cost is tens of thousands rather than millions (for <a href="https://en.wikipedia.org/wiki/Magnetoencephalography">MEG</a>, <a href="https://en.wikipedia.org/wiki/Magnetic_resonance_imaging">MRI</a>, <a href="https://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging">fMRI</a>, PET). Cons: the spatial resolution is not as high. However, ‘source localisation’ techniques allow one to improve the resolution, c.f. <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0091441">Muthuramen et al 2014</a>. A network of electrodes allows EEG to be turned into a high-resolution map. Example: ‘mismatch negativity’, where EEG gives poor resolution. Investigated using source localisation methods: dipole fitting, LCMV, eLORETA. 58 ALS patients; 7 had C9, 12 with family history (+1 with a family history of FTD). Results: dipole fitting gives significant effects in the inferior and superior <a href="https://en.wikipedia.org/wiki/Frontal_gyrus">frontal gyri</a>. Now using eLORETA, which has relatively low spatial resolution; nevertheless it gave reliable info and they proceeded with this. Couldn’t detect a significant group-level difference. Now the result using LCMV, which uses a beam-forming approach to localisation. This seemed to detect better (I think). Using ‘empirical Bayesian significance testing with an FDR of 10%’! Yay! A colour-word interference test in 27 patients suggests the localised activity does reflect impaired cognitive flexibility. Hmm.</p>
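<p>For reference, an ‘FDR of 10%’ means controlling the expected fraction of reported findings that are false. The empirical-Bayes details weren’t given in the talk, but the classic Benjamini-Hochberg step-up procedure (a different, simpler FDR method) gives the flavour; the p-values below are made up for illustration.</p>

```python
def benjamini_hochberg(pvals, fdr=0.10):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected while controlling the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * fdr,
    # then reject the k smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * fdr:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, fdr=0.10))  # -> [0, 1, 2, 3, 4, 5]
```

Note that 0.042 is rejected even though 0.041 initially fails its own threshold: the step-up rule rejects everything below the largest passing rank.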
<hr />
<h3 id="lunch">Lunch</h3>
<p>Lunch. Sandwiches, crisps, water, banana. Look at posters.</p>
<hr />
<p><em>Christine Holt</em> (Cambridge)
Axonal mRNA biology: implications for axonal maintenance. Works on the embryonic visual system - retinal ganglion cells, whose axons form the optic nerve - in tadpoles. Over the years they have discovered cues during growth that guide growth and branching of the axon, and also survival cues that are important to maintain the axon. Central dogma: DNA -> RNA -> protein. In axons, proteins get shipped out along the axon, but this could be a problem if axons are long. It is quite slow: it might take a couple of days along a long axon, which isn’t much good if you need a fast response. Cells have come up with another mechanism: transport the RNA and translate it in the far reaches of the cell. It is mostly repressed during transport; then a signal comes in that activates protein synthesis. Cites Campbell and Holt 2001; Ming et al 2002; Brittis et al 2002; Wu & Jaffrey 2005; Leung et al 2006; Yao et al 2006. Evidence: can artificially guide axon growth. Adding a protein inhibitor, axons continued to grow; also true if the axon was cut. “Netrin gradient elicits asymmetric ‘near-side’ beta-actin synthesis.” RNA is trafficked to axons via RNA-binding proteins (RNPs, e.g. ZBP1, FRX, FUS). Cool video of beta-actin RNA moving along the axon during axon growth; also see mitochondrial movement - a dynamic process. EB1 (an indicator of microtubules) is also moving. Can look live at the translation of an mRNA in vivo. Video of a simultaneous image of RNA and translation. That is very very cool. Laser capture of growth cones: 1000s of mRNAs in frog and mouse axons and growth cones, spanning many categories of function, but they really wanted to know what is actually being translated in axons <em>in vivo</em>. <a href="https://www.cell.com/cell/abstract/S0092-8674(16)30580-3">Shigeoka et al Cell 2016</a>. Use ‘Axon-TRAP-RiboTag’: immunoprecipitation of ribosome-bound mRNAs in retinal axons. Get different stages during growth. “The translatome is developmentally regulated”. 
Shows gene ontology analysis from the <a href="https://www.cell.com/cell/abstract/S0092-8674(16)30580-3">same paper</a>. Now talking about Lamin B2 (LB2): synthesised in axons in response to growth stimulation. Axon survival requires axonal translation (of LB2). LB2 lacking the nuclear localisation sequence rescues axon degeneration. LB2 colocalises with mitochondria, and knockdown leads to mitochondrial dysfunction. Now, is there a link from FUS to axonal translation? Shows again the gelling of FUS mutant proteins (as in a previous talk, see above). FUS mutants decrease protein synthesis in axons.</p>
<p><em>Wenting Guo</em> (Leuven) - I didn’t listen to this.</p>
<p><em>Axel Freischmidt</em> (Ulm) - I didn’t listen to this.</p>
<p><em>Laura Fumagalli</em> (Leuven)
C9orf72 expansions cause axonal transport defects in iPSC-derived neurons. Didn’t listen to this either, but there’s a cool video of movement of mitochondria along axons, impaired in C9orf72-mutant iPSC-derived axons. This is real footage sped up about 100X or so.</p>
<p><em>Jik Nijssen</em> (Stockholm) - About ‘Axon-seq’, I guess about <a href="https://www.biorxiv.org/content/early/2018/05/14/321596">this</a>.
Motor neurons are the longest cells in the body. Muscle denervation and axon retraction occur first in ALS; motor neuron somas are lost later in disease. Used a microfluidic system to isolate motor axons, from mouse-derived motor neurones. Use a gradient of growth factors to induce axon growth through microfluidic channels. Pretty cool! Stained with MAP2 and TAU to ensure they are not getting dendrites, but motor axons. Now axonal material can be harvested separately: can isolate axons from the rest of the cell. Did RNA-seq applied specifically to the motor axon compartment. Shows some QC. The amount of axonal mRNA from all of the thousands of axons is about the same as the mRNA from a single isolated motor neuron - so not very much. But get signal across axons. Multiple differentially expressed genes (n=771 axon-enriched); n>200 entirely axon-excluded transcripts. Shows the 25 highest-enriched-in-axons genes; enrichment analysis too. To validate this, considered published studies of primary motor axons and primary DRG axons (Saal et al 2014, Gumy et al 2011, Minis et al 2014): a core signature of 1750 genes expressed in all of these datasets. In the data, surprised to see a number of transcription factors (known for function in the nucleus) turning up, but it seems like they may have function outside the nucleus. Example: <a href="https://www.uniprot.org/uniprot/P67809">YBX1</a> showed very high expression in axons and was enriched in axons. It is involved in RNA transport and in splicing - in fact in every step. Now consider axons from a SOD1-mutant mouse line: found differentially expressed genes. ALS-linked genes: NEK1, MGRN1, NRP1. Perspectives: want to transition the system to humans (using human iPS lines). Also want to make it more ‘in-vivo-like’ by introducing muscle cells into the system, which they hope will help stabilise their transcriptome.</p>
<p><em>Laura Ferraiuolo</em> (SITraN, Sheffield)
About mRNAs secreted by C9orf72 patient-derived astrocytes. iNPC direct conversion, as described in a talk this morning. Astrocytes from C9 patients induce motor neuron death (Meyer et al PNAS 2014). Now about extracellular vesicles (EVs). Misfolded proteins (a-synuclein, SOD1, something else) are found in EVs; ref <a href="https://www.ncbi.nlm.nih.gov/pubmed/28334866">Aoki Y et al, Brain 2017</a>. Decreased EV biogenesis protein transcripts in C9 astrocytes. Shows funky microscope detecting by shining lasers. Or something. EVs express CD63 but not CD9. Most RNA content is mRNA. Quantified this using the GeneChip miRNA array 4.0. A hundred or so upregulated, 70 or so downregulated. (This seems like more data is needed). ‘Axon guidance’ is the most enriched pathway though. Huh. Focus on miR-146a for a bit.</p>
<p><em>Caitia Gomes</em> (Lisbon)
About astrocytes from ALS patient fibroblasts. Only two drugs for ALS - riluzole and edaravone - and they only slightly reduce disease (if at all). There is no cure. Astrocytes play a role in ALS through interaction with motor neurons, expressing neurotoxins. Some evidence of miRNA dysregulation in astrocytes in ALS patients. Goal is to understand the mechanisms of this. Mouse SOD1 model. Focus on miR-146a. Looking at the neurotoxic effect. I’m not listening to this.</p>
<hr />
<h3 id="session-6">Session 6</h3>
<p><em>Ana Candalija</em> (Oxford)
Transcriptomic analysis of iPSC-derived motor neurons from C9orf72 ALS/FTD patients.
Jings. This talk is more cell visualisation. It’s all good stuff (presumably) but I’m running out of concentration :(. They are going to use C9-corrected lines made by CRISPR/Cas9 editing to work around natural experimental variability. Anyhoo, they find differentially expressed genes in C9 versus edited lines. SYT11 is one of these. (GB: It’s of course good to look at differentially expressed genes. But just because genes are differentially expressed does not make them causal. They could be differentially expressed because they are influenced by other molecules that are in the causal pathway, or because the causal pathway induces large global changes to the cell. In fact, given the complexity of how transcription works, it’s not impossible that minor changes in expression of protein 1 (a causal protein) effect major changes in expression of many other proteins. It’s unlikely that studying those proteins will lead to therapies. So maybe it’s better to look (for example) for transcription factors that influence all of the differentially expressed genes? Anyway, that’s not what these talks have done).</p>
<p><em>Hortense de Calbiac</em> (Paris ICM)
Synergistic mechanisms of C9orf72 gain and loss of function. ALS has a strong genetic component, more than 20 genes involved. Diagram of the autophagy pathway. E.g. Lee JK et al biophysica (I think). Pathogenicity of the C9orf72 mutation. Summary of the C9orf72 repeat expansion, found by GWAS and fine-mapped. So, a zebrafish model. Plot the swimming path of zebrafish in a petri dish after touch (touch-evoked escape response). Video of this. Cool! They move! Quantify this in C9orf72 knockdown and ‘mismatch’ (dunno what this is). C9 loss-of-function disrupts poly(GP) clearance. (I think poly(GP) is a marker of the C9 repeat expansion, c.f. <a href="http://stm.sciencemag.org/content/9/383/eaai7866">this</a>). (At this stage in the afternoon, I find I can’t hang on to this kind of talk even when I’m trying. It seems to consist of slide after slide of bar plots that show some kind of effect. The axes are labelled differently. They mean similar and/or different things. They have things like ‘P62’ written on them. Does this make sense if you know <a href="https://en.wikipedia.org/wiki/Nucleoporin_62">what P62 is</a>? (I didn’t). I used to think this was all fine & natural - after all I’m not an experimentalist, so no wonder I don’t understand experimental talks. Now I think speakers (this applies to all fields) should go out of their way to make me understand what they’re talking about. This particular talk is no worse than many others in this respect (for many of the other talks that appeared to me like this, I simply didn’t listen), it just happens to land in that spot in late afternoon where I really, really can’t follow despite trying.)</p>
<p><em>Matthew Wood</em> (Oxford, also spin-out companies that he will mention).
Starts with a slide of human conditions with known molecular basis from <a href="https://www.omim.org">OMIM</a>. Many thousands, while only ~500 have therapies. Prospects for genetic medicines. Genome-based therapeutic technology has large potential, with significant recent progress. Realising this will depend on overcoming intracellular drug delivery. Ok, nucleic acid-based drugs exploit a variety of molecular mechanisms: gene expression, gene splicing, gene silencing, gene editing. Most companies are using first-generation technologies - they are working, but not well enough. Examples. Eteplirsen for DMD (muscular dystrophy), approved 2016 by the FDA - a controversial decision as it has very little effect (but it’s safe). Nusinersen for spinal muscular atrophy (SMA), approved 2016 by the FDA - the efficacy in this case is actually pretty good; probably this is an exception. And for Huntington’s disease a number of companies are developing oligonucleotide drugs. But the speaker’s perspective is that clinical benefit for most of these will be modest - at best. Reason: intracellular delivery is the major barrier. The drugs don’t get into cells readily but stay outside. Need to get them into the cell and into the compartment where they need to be active (e.g. the nucleus for oligonucleotides). This is the major scientific barrier to effective genetic medicines. Now about developing next-gen oligonucleotides. Two areas for development are the backbone chemistry of the drug (will give an example of stereochemistry), and delivery - he will talk about peptide delivery and nanotechnology. Now talking about the ‘WAVE platform’ exploiting stereochemistry, by WAVE Life Sciences. Oligonucleotides can occur in left- and right-handed forms. This wasn’t previously appreciated. In a 20 nt oligonucleotide, each pair of consecutive nucleotides can be linked left- or right-handed, giving 2^19 variants. Does that matter? Well, it might. They now have technology to try to generate a stereochemically optimised drug, and have worked through this for muscular dystrophy. E.g. exon 51: increased dystrophin restoration, with the stereochemically optimised drug much better. Clinical trials in the last quarter of 2017. WAVE is also developing stereo-optimised drugs for C9orf72. Now talking about peptide delivery technology, developed with Gait and the MRC LMB; new spin-out company PepGen. In muscular dystrophy, benchmarked against the FDA-approved drug, the peptide-delivered drug is significantly more active - he thinks 100 to 1000 times more effective. In mice it can reach ~80% of normal levels of dystrophin, generating almost complete physiological restoration, whereas the standard drug gives ~1%. Have tried this in SMA. The nusinersen drug modulates splicing to generate full-length SMN2 mRNA transcript, but is delivered <a href="https://en.wikipedia.org/wiki/Intrathecal_administration">intrathecally</a> (i.e. into the spinal canal). With peptides, in mice, can again generate increased levels of delivery through intravenous administration. Restores life span and physiology. This is going into clinical trial. Thinks they can do something similar in Huntington’s. (Hammond et al 2016.) PepGen is developing a peptide platform technology for nucleic acid drug delivery. Now talking about exosome-based nanotechnology. Some kind of natural nanoparticle (extracellular vesicle). I dunno what. Could we use these for delivery? We engineer these nanoparticles for this. An example is exogenous RNA encapsulation, with a rabies virus peptide on the surface, to get the payload into the nervous system. (Mol Ther. 2017). Delivered double-stranded RNA to mouse cortical neurons, giving targeted silencing of the BACE-1 gene. But why was it so effective? (Heusermann et al JCB 2016). These natural nanoparticles behave very much like viruses: they enter as single particles via cell uptake and are rapidly transported inside cells. Shows a model for how they deliver their cargo; eventually they get degraded by the lysosome. Wow. Concludes.</p>
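<p>(The 2^19 figure is simple combinatorics: a 20-nt oligonucleotide has 19 internucleotide linkages, and if each can independently take either of two chiral configurations, the variant count is 2 to the power 19. A trivial sketch of the arithmetic:)</p>

```python
def stereoisomer_count(n_nucleotides: int) -> int:
    """Each of the n-1 backbone linkages can take one of two chiralities."""
    return 2 ** (n_nucleotides - 1)

print(stereoisomer_count(20))  # -> 524288, i.e. 2^19
```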
<p>A questioner asks if his nanoparticle approach can deliver larger molecules - e.g. DNA. Response is yes, but it’s challenging, they are trying to deliver a version of dystrophin, but a cut-down version because it’s smaller.</p>
<hr />
<p><em>Debate</em> - “This house believes ALS is a prion-like disease”, chaired by <a href="https://www.ndcn.ox.ac.uk/team/martin-turner">Martin Turner</a>.
FOR: <em>James Shorter</em> (U Penn)
AGAINST: <em>Simon Mead</em> (UCL)</p>
<p>(I’m a bit surprised about this debate. My reading of the conference is that there’s good evidence for agglutination (of several proteins) in ALS. Witness: the connectome talk, the talk about reversing aberrant phase transitions, the many slides showing aggregating FUS or TDP-43, the one about bioinformatically identified prion-like domains. And pictures of test tubes with agglutinated FUS. Isn’t this it? Anyway, let’s see what the debate says. We have to vote via a website.)</p>
<p>J.S. defines ‘prion-like’ (or prionoid) as an infectious protein-like thing that can transmit phenotype within an individual and <em>experimentally</em> between individuals. E.g. alpha-synuclein in Parkinson’s. ALS is not a prion disorder (no transmission between individuals) but there’s strong evidence it’s prion-like. Motor neuron loss starts focally in the central nervous system (CNS), and spreads during disease progression. This is compatible with prion-like spread. Moreover, the pathology of ALS also spreads. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3943211/">Braak et al Nat. Rev. Neurol 2013</a>. Evidence from in vitro studies: the proteins should be able to spontaneously assemble. Indeed this has been done for SOD1 and TDP-43 (not yet done for FUS at the time of writing his paper in 2015, but probably has been by now). TDP-43 and FUS are both RNA-binding proteins with prion-like domains, which enable them to assemble into fibrils. These domains get their name from amino acid similarity to real prions: you can insert them into other genes and make them prion-like, remove them from prion genes and abolish the prion behaviour, etc. To close: to really have definitive evidence, one should be able to make recombinant protein in the prion conformation and inject it (say into a mouse); the mouse should become diseased. This has been done for PrP (by Ma and Prusiner) and <a href="https://www.michaeljfox.org/understanding-parkinsons/living-with-pd/topic.php?alpha-synuclein">alpha-synuclein</a> (Lee), and is being done for SOD1, TDP-43, and FUS. E.g. <a href="https://www.ncbi.nlm.nih.gov/pubmed/26650262">Ayers et al Acta Neuropathologica 2016</a>, induced MND. Not quite there, because this is not a wild-type mouse. But the weight of evidence is very much toward this. So yes, ALS is a prion-like disorder.</p>
<p>S.M. draws attention to the wording ‘prion-like disease’. First says he is from the institute of prion disease at UCL. ALS-implicated proteins do show a seeded-polymerisation-like mechanism, but that’s only a small part of the picture. Mammalian prions are pathogens that invade, evolve, and kill the host. “Prion-like” is a poorly defined term. E.g. prion labs are extremely carefully contained - not so for ALS labs! Shows a cartoon of the ‘seeded polymerisation’ mechanism. How can this explain the diversity of diseases found? Prion diseases occur in massive epidemics - <a href="https://en.wikipedia.org/wiki/Kuru_(disease)">Kuru</a> in Papua New Guinea lost 10% of the population; that was transmitted through mortuary feasts. Mammalian prion diseases - BSE, vCJD - are highly contagious; trace amounts of prions can kill. CJD has strains - ‘sporadic’ and ‘variant’ CJD with different clinical phenotypes. Table of features of ALS versus prion diseases. In particular the evidence of spreading is different from prion diseases. So, a critique of the term ‘prion-like’: if a doctor said you had a ‘virus-like’ disorder you’d want a second opinion. It’s either a virus or it isn’t. Now tackling specific aspects. <em>Seeded polymerisation</em> - well, this could apply to a large proportion of human diseases - c.f. amyloids, Astbury 1935, Jarrett and Lansbury 1993 - pretty much all neurodegenerative disorders, and more. Almost any protein can adopt the amyloid state under the right conditions. ‘Prion-like’ is unhelpful. What about spreading? Propagation of the pathogen is not propagation of pathology. Agrees there is evidence that ALS progression is determined by neuroanatomical pathways. But most diseases spread - could be a gradient of sensitivity, could be the disease agent itself. Spreading is not “prion-like”. Strains of prion diseases are defined by pattern of pathology. There is heterogeneity in ALS, but it doesn’t correlate with biochemical features of ALS-implicated proteins. Finally, notes there are no implications of the term ‘prion-like’. There are already decisions for prion diseases that give guidance on care, treatment etc. The data are just not strong enough for ALS.</p>
<p>This was followed by a discussion. My sense is that the ‘against’ argument was not really made sincerely - as he said, the dude is from the institute of prion disease - and he says he will be convinced if the misfolded proteins can be shown to induce disease in wild-type mice. Most of his argument is about prion strains, and he accepts that the evidence on this may well come in time.</p>
<p>Despite this, it was a victory for the nays. (However, unfortunately the voting system went wrong, so only 5 people’s votes were counted - including mine :).)</p>
<p>See also <a href="/als/2018/06/20/ENCALS-Day-one-report.html">day one</a> or <a href="/als/2018/06/22/ENCALS-Day-three-report.html">day three</a>.</p>
<hr />
<h2>ENCALS Oxford 2018: day one (2018-06-20)</h2>
<p>I know very little about <a href="https://en.wikipedia.org/wiki/Amyotrophic_lateral_sclerosis">the disease known as Amyotrophic Lateral Sclerosis or Motor Neurone Disease</a>. I’ve decided to come to the <a href="https://www.encals.eu/meetings/encals-meeting-2018-oxford-england/">ENCALS conference in Oxford</a> to find out. To this end I’m going to do what <a href="https://gavinband.wordpress.com/2010/10/07/gcd-2010-day-1/">I’ve</a> <a href="https://gavinband.wordpress.com/2011/04/12/emgm-2011/">sometimes</a> <a href="https://gavinband.wordpress.com/2011/04/12/emgm-2011-day-two/">done</a> <a href="https://gavinband.wordpress.com/2014/06/08/gem-2014-day-one/">before</a>: write down everything everyone says and then try to summarise…let’s see how it goes.</p>
<p>(See also my notes on <a href="/als/2018/06/21/ENCALS-Day-two-report.html">day two</a> or <a href="/als/2018/06/22/ENCALS-Day-three-report.html">day three</a>).</p>
<h2 id="wednesday-20th">Wednesday 20th</h2>
<hr />
<h3 id="session-1">Session 1</h3>
<p>It’s pretty packed (not to mention hot) in here - apparently the conference is over-subscribed and I couldn’t get into the main lecture hall at first - but later I sneaked in.</p>
<p><em>Leonard van den Berg</em> (<a href="https://ern-euro-nmd.eu/contact/leonard-van-den-berg/">Utrecht</a>) opens the conference with a couple of slides from previous meetings and a quiz on Britishness. (Sample questions: “True or false: Britain owns Australia?”; “Have you ever accidentally said ‘thank you’ to a cash machine?”; etc.) Much hilarity ensues.</p>
<p><em>Evy Reviers</em> briefly introduces <a href="https://als.eu">EUpALS</a> - “European organisation for professionals and patients with ALS”, established 2017. How can you become a member? Members are national ALS organisations established in Europe. Three activity domains: 1: informing and supporting ALS patients, 2: defending ALS patient rights, 3: stimulating research. New members are charged 1 euro as a symbolic fee. Members in Iceland, France, Spain, Portugal, Italy & elsewhere (but not the UK, it looks like). EUpALS participates in project proposals at European scale.</p>
<p>Now on to scientific talks:</p>
<p><em>Matthew Kiernan</em> (Sydney) “Roadmaps to therapy in ALS”
Takes tie off because he was told to. Starts with interesting looking picture which looks like a series of mountain ranges, says this will be explained at end. Talking about ALS in terms of ‘transmission’. He is current editor of <a href="http://jnnp.bmj.com">http://jnnp.bmj.com</a>. Human motor transmission: machine which provides controlled application of power.
Picture of a crossing in the spinal cord, a spinal nerve, and the lower motor neurons. Hodgkin and Huxley - I guess this is <a href="https://en.wikipedia.org/wiki/Hodgkin–Huxley_model">this</a>. They worked on squid axons before the war, but got distracted by the war. After the war - no squid. Anyhow, they worked out that neurons transmit through sodium channels. Transmission is like “kangaroo jumping”. Hugh Bostock found ways to study strength/duration of pulses - how much current do you need to apply to a neuron to activate it? These techniques are now commercial. Refers to Wainger et al, “<a href="https://doi.org/10.1016/j.celrep.2014.03.019">Intrinsic membrane hyperexcitability…</a>”. This hyperexcitability seems to increase in ALS patients. ALS diagnosis: we know in MND that axons are dying. Mentions “Motor Neurone Disease is a clinical diagnosis”, <a href="http://pn.bmj.com/content/12/6/396">Turner & Talbot</a>. The speaker agrees but says technology gives valuable insights. Interesting slide linking repeat activation of a neuron to the clinical aspect of fatigue when using muscles. Picture of two hands showing classic “<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3424792/">split hand</a>” presentations (slide says FDI, ADM, APB - I dunno what these are). We are the only species that needs fine pincer control; our brains have adapted for this. Picture of <a href="https://en.wikipedia.org/wiki/Lou_Gehrig">Lou Gehrig</a> which shows the ‘split hand’. Origin of ALS: we know it is a primary neurodegenerative disease, but we haven’t worked out how it gets into the <a href="https://en.wikipedia.org/wiki/Betz_cell">upper motor neurons</a> (UMNs). Difficulty evaluating UMN dysfunction as there’s no technology to do it. MRI techniques are helpful. But he has been focussing on taking some existing techniques to look at hyperexcitability of UMNs.
Try threshold tracking of <a href="https://doi.org/10.1016/j.neuron.2007.06.026">transcranial magnetic stimulation</a> (TMS) in <a href="https://bnf.nice.org.uk/drug/riluzole.html">Riluzole</a>-treated patients. Similarly in cohorts of ALS patients, some are actually not ALS but TMS can help distinguish them. ALS - is it all in the genes? The speaker says no, it can’t be. Onset is a roughly linear relationship with age - “six steps to ALS” (I think he’s referring to <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4197338/">this</a>, which I’ve read before. I don’t understand from this why it can’t be genetic). Cognitive and behavioural symptoms in ALS. Video of an ALS patient showing fasciculation (i.e. <a href="https://en.wikipedia.org/wiki/Fasciculation">muscle twitches</a>). Most muscles have twitches, but more fasciculations => greater progression of disease. Now ALS, physiology & metabolism. Another thing that’s strange: pre-morbid physiology. Patients tend to develop ALS out of the blue - they tend to be quite athletic, tend to have normal or low BMI, and have reduced coronary artery disease (as do their relatives). Stresses this is somewhat anecdotal. Refers to this about <a href="http://www.ncbi.nlm.nih.gov/pubmed/15634730">ALS in Italian footballers</a>. I’ve heard this story (about athleticism) but wasn’t sure I believed it - interesting to hear it brought up. If you talk to patients & partners, they report behavioural changes quite a while before clinical symptoms are identified. Need national population registries (like ENCALS, which they’re trying to replicate in Australia as ‘PACTALs’). Challenges for ALS/MND in pan-Asian countries - it may have a different genetic makeup. Now talking about brain structure and ALS transmission. In summary, there seem to be ‘six steps’, genes are involved, other factors, and apparently protein spread (c.f. 1960s experiments injecting monkeys with ALS motor neurons). How do we link transmission and hyperexcitability?
Finally here’s the explanation of the first picture: it is energy released by stars when they are dying. A pulsar is like a hyperexcitable motor neuron! Er…is it? Wow. (It’s larger, I should imagine). Main conclusion: our understanding of ALS is evolving.</p>
<p>In answer to a question, brings up <a href="https://en.wikipedia.org/wiki/TARDBP">TDP-43</a>, a form of which is implicated in frontotemporal dementia and ALS; it also apparently represses HIV transcription. Another question is about hyperexcitability - it occurs also in lots of other diseases (like stroke), how does he hang that together? Says he hasn’t really quantified that. Is this higher in magnitude than seen in epilepsy? Doesn’t know. (See the region <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr:11072679-11085549">here</a>. Maybe it’s rs80356717.)</p>
<p><em>Boudewijn Sleutjes</em> (Utrecht)
“Biophysical basis of the acute effects of riluzole and retigabine…”
Starts with hyperexcitability again: symptoms are fasciculations, cramps, hyperreflexia, spasticity.
Riluzole -> Na+ channel function, inhibits persistent Na+ currents. It partly normalises hyperexcitability. Retigabine -> K+ channel activator, also reduces MN hyperexcitability. Will focus on a 2018 randomised control trial. It included 18 patients (4 of which are familial with C9orf72 mutations). <a href="https://en.wikipedia.org/wiki/Retigabine">Retigabine</a> (300mg), riluzole (100mg), and placebo randomly assigned. Mixed model analysis applied. Nerve excitability testing: non-invasive estimation of axonal membrane potential and ion channel activity. Temperatures were controlled by warming nerves beforehand and throughout tests. Standard <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2915523/">TROND protocol</a>. Compared with a mathematical neuron model. Results show no acute effects of riluzole on excitability within 6 hours. But for retigabine, significant (but perhaps not large) effects were found. What is the biophysical basis? Uses simulations; these match pre-dose results very well. Large input parameter set, but varied only one parameter at a time to avoid overfitting. In the post-dose (1.5h) results, the best fit had a downward shift of the half-activation potential; similarly at 6h. The model suggests a hyperpolarizing shift of V0.5 of slow K^+ channels. This has some experimental support. Conclusion: excitability is a reliable biomarker, and modelling it helps identify mechanism. Riluzole has no acute effects (maybe the time period was too short; also patients had already been on riluzole for > 1 year). Long-term administration may reduce Na^+ influx and Ca^2+-mediated degeneration.</p>
<p><em>Jill Meier</em> (Utrecht) - Connectome-based disease progression in ALS. Can we predict disease progression using MRI?
ALS on a microscale: pTDP-43 spreading in a “prion-like” manner (references Brettschneider et al 2013, Schmidt et al 2016, Braak et al.). Connectome: a huge longitudinal dataset in which patients are scanned. This works like this: divide the brain into 82 individual regions, then use an algorithm to track the connective fibers between regions. This gives a ‘structural brain network’. (This looks pretty cool). A link is when there’s a strong enough connection observed between brain regions. We can then measure each link in two different ways - the number of streamlines (like the bandwidth) and the speed of motion, referred to as ‘FA’. (I.e. current and voltage). In ALS macroscale connectomes, there is impairment in FA (quality of connection), decreasing over time. Also it spreads, apparently along the connections (rather than just outwards across the whole brain). Question: can we predict progression with MRI? First have to answer: 1: what is the progression - can we predict it with a random walk (along the brain network) approach? 39 ALS patients, 4 scans, 5.5 months between scans on average, matched controls. Refs: <a href="https://www.ncbi.nlm.nih.gov/pubmed/20600983">Zalesky A et al 2010 NeuroImage</a>. They find a growing impaired connected component (based on the MRI pictures) from cortex to other parts of the brain, identified by comparison with control samples. (I assume this is a different connected component in each individual). She’s a mathematician but will leave out the formulae. Aw! “Self-multiplying NOS-biased random walker” (I think NOS = brain network). I.e. it walks along the observed network. Main result: compare simulations to real data, straight line fit, correlation=0.7. I.e. can predict with r~0.7 the state of progression based on this model. Also shows differences between ‘Brettschneider’ stages (maybe <a href="http://www.ncbi.nlm.nih.gov/pubmed/23686809">this</a>). 141 patients with 2 scans, 141 age-gender matched controls. Something else about prediction between scans.
Conclusion: it works, there’s a growing impaired connected component. Future: clinical features? Functional data? (This talk was cool.)</p>
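<p>(I don’t know the talk’s actual model, but a “self-multiplying random walker” on a brain network can be sketched roughly like this - the toy graph, step count, and spawn probability below are all my own made-up illustration: walkers start in a seed region, hop along observed structural connections, occasionally spawn copies, and regions count as reached once a walker has visited them.)</p>

```python
import random

def simulate_spread(adjacency, seed_region, steps=200, spawn_prob=0.05, rng=None):
    """Toy self-multiplying random walk: returns the set of regions visited."""
    rng = rng or random.Random(0)
    walkers = [seed_region]
    visited = {seed_region}
    for _ in range(steps):
        new_walkers = []
        for node in walkers:
            nxt = rng.choice(adjacency[node])   # hop along a structural connection
            visited.add(nxt)
            new_walkers.append(nxt)
            if rng.random() < spawn_prob:       # occasionally self-multiply
                new_walkers.append(node)
        walkers = new_walkers
    return visited

# Tiny made-up 'connectome': a seed region connected outwards in a chain.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(simulate_spread(adjacency, seed_region=0)))  # spread covers the chain
```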
<p><em>Alexander Thompson</em> (Oxford) - CSF chitinase protein performance as ALS biomarkers.
This is a clinically-oriented talk about Chitotriosidase-1 (CHIT1), and also CHI3L1 and CHI3L2. Measured using ELISA (which is antibody binding). Can they be used as predictors? The significance of these particular proteins is lost on me and I’m not really following this. Also, while I’m moaning, the measured levels are compared between individuals in a plot with ‘***’ (meaning ‘highly statistically significant’) and ‘NS’ (meaning ‘not significant’). These are typically not the right way to think about comparisons like this. For example, in one comparison, all of group two is below the median of group 1 (visually at least), yet this is ‘N.S.’ and the speaker says there’s no difference. I think that’s a flawed way of looking at the data. Anyway, I’m saying this not because his results are wrong (I’m not really listening) but because I started thinking about it. The comparison is between ALS, HC, Mim, PS, and AGC patients, but I don’t know what the latter 4 are. Conclusion: chitinases are an emerging CSF biomarker in ALS.</p>
<p><em>Henk-Jan Westeneng</em> (Utrecht) - About imaging in asymptomatic C9orf72 repeat expansion carriers and non-carriers.
Starts with Westeneng et al 2017, but a different story in Geevasinga et al Nat Rev Genetics 2014. I missed the slide on the number of samples, but I think it is one extended family. Use a 7 Tesla scanner to generate a 3d scan of the whole brain. Investigated all parts of the brain - grey matter and white matter, and deeper in the brain. Studied 6 metabolites. Quality control: this includes ‘Cramer-Rao lower bounds’ (huh?). Bayesian linear mixed model. Kinship as a random effect for family. Age as a covariate. Weakly informative priors N(0,0.25) on coefficients, T distribution on the residual sd. Computed P-values, adjusted for FDR, and reported at 5% FDR. Results: GPE (a measure of cell membrane breakdown). Clear differences - more GPE in C9orf72+ carriers, in all parts of the brain. Also for <a href="http://en.wikipedia.org/wiki/Phosphatidylethanolamine">PE</a>, though maybe less widely. Also one more, <a href="http://en.wikipedia.org/wiki/Uridine_diphosphate_glucose">UDPG</a>. Can we use this as a biomarker? Tried to predict the C9orf72 mutation. Shows prediction intervals for carrier status. It has nontrivial (but I wouldn’t say strong) predictive ability. Suggests this can therefore be used as a monitor of treatment (if it is in the causal pathway). Increase of GPE previously reported in Alzheimer’s and Parkinson’s disease patients. Neurodegeneration in pre-symptomatic patients starts maybe 10 years before clinical diagnosis.</p>
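<p>(The FDR adjustment mentioned is presumably Benjamini-Hochberg - the standard choice, though the talk didn’t say: sort the m p-values, find the largest rank k with p(k) ≤ k·q/m, and reject the k smallest. A minimal sketch:)</p>

```python
def bh_reject(pvalues, q=0.05):
    """Benjamini-Hochberg: indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank  # largest rank satisfying the BH condition
    return set(order[:k])  # reject the k smallest p-values

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(sorted(bh_reject(pvals)))  # -> [0, 1]: only the two smallest survive at 5% FDR
```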
<p><em>Greig Joilin</em> (Sussex)
Non-coding RNA serum biomarkers in ALS. Schematic of RNA biomarkers (mRNA, ncRNA, tRNA, rRNA, lncRNA, microRNA, piwi-RNA, snoRNA, snRNA). (Man, there is no way all of these represent distinct biological things. I reckon they are just a spectrum of the same thing. Hmm - I was talking to a dude the other day about short noncoding RNAs; he said part of this is that traditional methods couldn’t get at short RNAs, which is one reason they’re under-studied). Methods: 24 controls, 17 disease mimics, 24 slow-progression ALS, 24 fast-progressing ALS patients. RNA-seq on MiSeq, 75bp paired end, at the <a href="http://www.well.ox.ac.uk/ogc/about-ogc/">Oxford Genomics Centre</a> (that’s in the WCHG, which is where I work). Shows a breakdown (i.e. pie chart) of the types of differentially expressed ncRNAs, e.g. 12% are miRNAs, 28% are tRNA. The plot of differentiation goes up to about P=1E-5. Is this enough to be convinced? I dunno. Another plot has the ***’s and ‘N.S.’s again. Come on. MIR-A and MIR-B appear correlated with fast/slow progression status. MIR206 has been reported as up-regulated, but they could only detect that in one of their samples, where it has a binary-like pattern - possibly they are below the detection threshold. ‘Combined signature is 78.5% accurate’ in separating healthy and disease. But this is on the same set of samples, right? So it’s overfit. But more samples are coming.</p>
<p><em>Susan Peters</em> (Utrecht)
Electric shock and extremely low-frequency magnetic field exposure and risk of ALS (<a href="http://www.euromotorproject.eu">Euro-MOTOR</a>). Do ‘electrical occupations’ play a role in ALS? Extremely low frequency (ELF) magnetic fields are ubiquitous. Recent meta-analysis: increased risk, but heterogeneous across ~10 studies (I² = 75%). Stratified studies into those with full job history and those without; get I² = 0% and a significant increase in risk in the former set. Oh! Interesting. (But it seems to say P=0.493. Eh?). Euro-MOTOR: 1,600 ALS cases and 3,000 controls. Population-based controls, clinical data + questionnaires. Exposure assessment. Logistic regression with adjustments for sex, age, alcohol use, and something else. Results: OR = 1.16 (1.01-1.33) for ELF-MF, and OR = 1.23 (1.05-1.43) for electric shocks. Huh. After adjusting for each other, similar results (but maybe weaker - I don’t understand this). Compare to earlier findings in a Dutch cohort (Koeman et al 2017): they found a clear exposure-dependent association for ELF-MF, but the speaker did not see this increase with amount of exposure. Euro-MOTOR suggests both exposures might play a role. The signals could be driven by something else associated with these types of jobs. Euro-MOTOR is good - a large number of clinically confirmed cases. But there is some recall bias (cases think harder about their history. That is totally relevant here. Could that explain the effect? Hard to imagine people can’t remember what jobs they’ve had, but maybe there’s a difference in what’s reported as involving electricity, especially for low-frequency MF) and controls have higher educational levels.</p>
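<p>(The reported ORs come from logistic regression, but the unadjusted version is easy to reproduce from a 2x2 exposure table: OR = (a·d)/(b·c), with a 95% CI from SE(log OR) = sqrt(1/a + 1/b + 1/c + 1/d). A sketch with hypothetical counts - not the Euro-MOTOR data:)</p>

```python
import math

def odds_ratio_ci(a, b, c, d):
    """a=exposed cases, b=unexposed cases, c=exposed controls, d=unexposed controls.
    Returns (odds ratio, 95% CI lower, 95% CI upper) via the Woolf log method."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(odds_ratio) - 1.96 * se)
    hi = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lo, hi

# Hypothetical counts for illustration only.
or_, lo, hi = odds_ratio_ci(a=400, b=1200, c=600, d=2400)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```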
<hr />
<h3 id="session-2">Session 2</h3>
<p><em>Michael van Es</em> (Utrecht) - how many DNA samples is enough?
This talk is intended to be provocative. And he is Dutch. So he can be provocative without trying.
How many samples is enough? Enough for what? How many is enough to find all ALS genes? Ok, step back. Why genetics in the first place? ALS sometimes runs in families, so it’s a safe assumption that genes are involved. Sporadic ALS is also genetic (60% heritable). Could possibly be useful for diagnosis, and can aid genetic counselling, but is most useful in studying disease pathology. Gene discoveries get translated into disease models (e.g. mouse models, poor mice). Do we really need to find <em>all</em> risk factors? Genetics (slide has SOD1, ALSin, TARDBP, FUS, ANG, VCP, …) and environment lead to TDP-43 that leads to ALS (prion-like spreading, mitochondrial dysfunction, inflammation, excitotoxicity, axonal transport). Going to argue it’s important to chase the genes. One line is that genetics tells us we have different types of ALS: SOD1, C9orf72, FUS etc. Will there be a drug applicable to all of these? Don’t know, but it seems unlikely, so this may mean stratified treatment. Also caused by different types of genetic variation. Familial ALS rare => genetic variant also rare; sporadic ALS => multiple risk factors which could be common. 126 “ALS-related” genes reported (<a href="http://alsod.iop.kcl.ac.uk/index.aspx">ALSoD</a>), but most are not validated - many of these are false, and this needs to be cleaned up. Genetics of ALS: in familial ALS we have come a long way (SOD1, C9orf72, FUS, TDP43). But most patients are ‘sporadic’ and we’re a long way away from this. GWAS are probably an effective tool. Common genetic variants (5% and above), compared between disease cases and controls. Tells us what a <a href="https://en.wikipedia.org/wiki/Manhattan_plot">manhattan plot</a> is and about statistical power (cue slide of bored people sleeping). Ballpark estimate: the 1st study in ALS had 276 ALS cases and 271 controls, no convincing signals.
Latest: van Rheenen et al, Nature Genetics 2016 <a href="https://www.nature.com/articles/ng.3622">https://www.nature.com/articles/ng.3622</a> (12,577 cases, 23,475 controls). GWAS have critiques. One is that they are way too expensive for what they deliver. But prices are dropping, now ~$25 per sample. Data sharing means can share controls with other studies. What have we learned from GWAS studies? Shows heritability by MAF in schizophrenia (SCZ) and ALS. SCZ mainly driven by more common variants. For ALS appears to be more rare. Also shows the 9 ‘hits’. MOBP, C9orf72, TBK1, SCFD1, SARM, UNC13A, C21orf2 (that’s 7, dunno where the other two are). The other criticism: risk for each variant is small. However, even a variant with a small OR on disease can have substantial effect: shows plot of UNC13A variant which has small effect on disease risk but large effect on phenotype / survival time. Van Eijk et al, Neurology (2017). Now talking about whole genome sequencing. Gives you SNPs, coding and non-coding genome, structural variants, repeat expansions. Bar chart of % of genome in different functional classes (1.5% is protein coding). Project MinE. Aims to sequence 15,000 ALS cases & 7,500 controls. Map of participating countries. Now have 14,000 samples sequenced. Next data freeze will give more samples and better power, using additional controls obtained from another study. Describes rare variant burden (RVB) analysis. QQ-plot of RVB analysis in ALS genes works very well (SOD1 at top. Also lambda = 0.95, why is this?). What about repeat expansions? Example: expansion with final P-value=10^-7. So this is ‘exome-wide significant’. What about the non-coding genome? Burden test other functional elements? What about follow-up studies (as unlikely to be able to make mouse models)? Mentions ‘organoids’ which is one way to go. Whole-genome sequencing may become standard work on a patient. Diagnostic analysis for 1 gene = 300 euro. So cost for a few of these is more than for WGS.
Need to start thinking about how to incorporate this information. Slide on gene therapy (e.g. antisense study for C9orf72). Gene delivery, gene silencing, but also need to think about how to show these are effective. Cross-phenotype studies? So: what we need to do in genetics is clear, it will be complicated but will get there in the end.</p>
<p>Marie Ryan (Dublin)
Oligogenic and discordant inheritance: population based genomic study of Irish kindreds carrying the C9orf72 repeat expansion (hereafter referred to as C9). C9 is one of the top 4 causes of ALS. Known pleiotropy between ALS and psychiatric disorders. 1022 DNA samples (blood) screened for C9. 269 individuals with familial ALS. 131 individuals in 122 families have next-gen sequencing (Ryan et al, Nature Genetics). Screened for 38 genes considered linked to ALS or FTD. Results: identified 89 C9 +ve individuals. Reduced survival of C9 carriers. This talk is going by very fast indeed. Also ‘Oligogenic inheritance’ (n=11) but didn’t catch this. Shows one family with C9 and ALS, apparently not co-segregating (different individuals with C9 and with ALS, except one). Huh. Considered chance, lab error, or whether there might be a 2nd mendelian gene. Some evidence of that. Or whether there’s somatic instability. Did indeed find somatic instability of the C9 repeat expansion. Huh.</p>
<p><em>Ahmad Al Khleifat</em> (King’s college London)
Next-gen sequencing study of telomere length. Describes structural variation (deletions, inversions, duplications, tandem repeats.) Example is telomere. Unfortunately I missed the info on the size of this study. Telomere length comparison between ALS and healthy controls. Age has an effect, gender does too, and so does case status (P=0.008). Then survival analysis: patients with longer telomeres survive longer (P=0.003). Assessment of 9 loci previously associated with telomere length at ‘genome-wide significance’ in European populations. Found rs6772228, rs8105767 are both associated with ALS. Conclusion: longer telomeres are associated with increased risk of ALS; this makes for a trade-off in risk. (But it seems like opposite effects have been <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4789589/">observed in mice</a>.)</p>
<p><em>Kevin Kenna</em> (Dublin) - about KIF5A as a novel ALS gene.
KIF5A identified by LOF burden analysis. 2014 (n=363). 2018 (n=1,463), across many European (or European ancestry) cohorts. Case-control gene burden analysis. 41,410 control whole exomes. This finds KIF5A with OR = 32 (9-135)! And P=5.5x10^-7. And other genes at ‘exome-wide significance’. (GB: jings, haven’t we moved past that?) Quick description of the bioinformatic work that goes into making this actually work well, including looking at qq plots, looking at possible confounders, and validating KIF5A variants by Sanger sequencing. Now table of patient variants. Including exon skipping (splice) event. Now: KIF5A also identified by GWAS. 8,000 cases and 36,000 controls. 20,416 cases vs 58,914 controls (McLaughlin, Nat Comm 2017). Shows diagram of KIF5A, has ‘motor domain’, ‘stalk’, and ‘tail’. Most of these mutations occur in motor domain. Involved in kinesin complex, which transports things along microtubules. Now replication. 9,000 cases and 2,000 controls, replicated GWAS SNP though not rare mutation. Summary: significant excess of KIF5A LOF mutations in ALS (high OR, rare atypical phenotype). (GB: though I don’t believe the OR). + interesting stuff about function. Published in Neuron 21 March 2018, author list is 5 pages, it’s this: <a href="https://www.sciencedirect.com/science/article/pii/S089662731830148X">https://www.sciencedirect.com/science/article/pii/S089662731830148X</a>. (I’m a bit worried about this (again). They find a massive odds ratio in their large discovery cohort. It will be enlarged through Winner’s curse. For that reason, usually people take the replication effect size. But they don’t replicate this in 11,000 samples (2). Surely if the effect was that large it would replicate?)</p>
<p>(GB: there’s a bunch of stuff going on here that could be problematic. First, the discovery phase rests on only 9 LOF variants in KIF5A: 6 in 1,138 cases and 3 in 19,494 controls. That difference is highly statistically significant because of the large number of samples, but it’s problematic because it’s so easy for experimental or bioinformatic factors to introduce small numbers of counts like this, artifactually. This is particularly so because the definition of LOF here includes ‘splice sites’ (splicing out of exons) that are predicted computationally based on DNA, RNA, and genetic variation databases. So they might not be real! The effect size seen in discovery is also absolutely massive (OR=32): it suffers massively from Winner’s curse. None of this is helped by references to ‘exome-wide significance’, which is not really helpful: it is a threshold based on naive statistical arguments and not on actually pertinent information.)</p>
<p><em>Lara Marrone</em> (Dresden) - Modelling FUS ALS using iPSC lines.
About iPSC-derived motor neurons. I’m not following this (though I’ve previously seen a cool talk about this type of work from Justin Ichida).</p>
<p><em>Mattia Perez</em> (Strasbourg)
About accumulation of ‘exogenous recombinant FUS’ in cortical neurons in mouse ALS models. Protein aggregation is not randomly distributed across brain. FUS is normally located in nucleus of cells, where it helps shuttle molecules between nucleus and cytoplasm. Is FUS pathology spreading in the brain like a prion? To answer: injected mice in the brain with recombinant FUS (poor mice again. Haven’t these people read the hitch-hikers guide?) But this is work in progress; aggregates were too sticky so results weren’t reproducible. So instead injected soluble FUS-GFP. Stereotaxic injections in motor cortex and hippocampus (4-6 month old mice). Injected on day 0 and sacrificed after 3 and 30 days. Used cell staining. Staining implies cortical neurons do take up these proteins. The FUS mutation does not modify entry into cortical neurons. After 30 days, FUS-GFP immunoreactivity is weaker and confined to large aggregate structures. Same in mutant mice. No obvious effect of the FUS mutants. So what happens to endogenous FUS protein? Partial colocalisation. Most obvious in hippocampus. Might be technical issue, investigating this. Conclusion: recombinant proteins can enter neurons, but not genotype dependent. Do seem to be aggregate-like structures that need to be identified and might recruit endogenous FUS. Future: inject more mice. Hmm. Have also developed a mouse line in which can induce mutation in neurons in a labelled way.</p>
<p>Albert Rudolf (German Network for MND)
About safety and efficacy of Rasagiline as an add-on therapy to riluzole. A randomised, double-blind, parallel-group, placebo-controlled trial. Study partially supported by drug company.
Study rationale: previously, Rasagiline showed a significant, dose-dependent therapeutic effect on motor function and survival in mice. Largest extension of life (~20%) was seen in riluzole+rasagiline treatment. Plots of survival. Apparently improved after 6 and 12 months, but not by end of study (18 months). In ALS-FRS-R, rasagiline has a significant effect. Conclude there’s a difference of treatment effect in fast and slow progressors. In this study, the ALS-FRS shows a plateau in the beginning, predictive of further decline. Rasagiline (1mg + riluzole) did not show effect on primary endpoint. But post-hoc analysis showed effect at intermediate times.</p>
<hr />
<p>More <a href="/als/2018/06/21/ENCALS-Day-two-report.html">tomorrow</a>.</p>
<p><em>I know very little about the disease known as Amyotrophic Lateral Sclerosis or Motor Neurone Disease. I’ve decided to come to the ENCALS conference in Oxford to find out. To this end I’m going to do what I’ve sometimes done before: write down everything everyone says and then try to summarise… let’s see how it goes.</em></p>
<h1 id="getting-biobank-data-into-r-part-3">Getting biobank data into R, part 3 (2017-05-18)</h1>
<p>Well, I’ve succeeded in <a href="/biobank/bgen/2017/05/16/Getting_biobank_data_into_R.html">making an Rcpp package</a> and <a href="/biobank/bgen/2017/05/17/Getting_biobank_data_into_R_part_2.html">loading variants</a> into R. For the final piece of the puzzle it’s time to actually get at the data.</p>
<h2 id="the-bgen-repo-api">The BGEN repo API</h2>
<p>The API for reading data was designed with three ideas in mind:</p>
<ol>
<li>It should not impose specific data structures on client code;</li>
<li>It should provide enough information to set up whatever storage the client wanted to use;</li>
<li>and it should be easy to use.</li>
</ol>
<p>To achieve this the BGEN code reports data back to the user via an indirect API. The user provides a ‘setter’ object that is supposed to know what to do with the data; the bgen code calls methods of that setter object to set the data. This indirection makes using the code a bit more complex, but on the other hand, it’s an advantage in situations like this where we want to get data into a particular data structure that bgen doesn’t know about (in this case an Rcpp array).</p>
<p>The API is described fully <a href="https://bitbucket.org/gavinband/bgen/wiki/The_parse_probability_data_API">here</a> but a short version is that, at each variant, the View class will follow this pseudocode:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. initialise and set storage sizes
2. for each sample:
3. if the client wants data for this sample:
4. set ploidy and storage size for this sample
5. set each probability value for this sample
6. finalise
</code></pre></div></div>
<p>translating into these method calls on the setter object:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. initialise()
set_min_max_ploidy()
2. for each sample i:
3. if set_sample(i):
4. set_number_of_entries()
5. for each probability value p:
6. set_value(p)
7. finalise()
</code></pre></div></div>
<p>All we have to do is implement these methods.</p>
<h2 id="implementing-the-setter">Implementing the setter</h2>
<p>Here we go:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct DataSetter {
</code></pre></div></div>
<p>We want to fill a 2d array of ploidy values (represented by an <code class="highlighter-rouge">Rcpp::IntegerVector</code>) and a 3d array of probability values (represented by an <code class="highlighter-rouge">Rcpp::NumericVector</code>), that have <a href="/biobank/bgen/2017/05/17/Getting_biobank_data_into_R_part_2.html">already been allocated</a>. Unfortunately, Rcpp doesn’t seem to have a multidimensional array class, so we have to do the indexing ourselves. So let’s begin by passing in pointers to the result fields, their dimensions, and also the index of the variant we are working on:</p>
<pre><code class="language-C++"> DataSetter(
IntegerVector* ploidy,
Dimension const& ploidy_dimension,
NumericVector* data,
Dimension const& data_dimension,
std::size_t variant_i
):
m_ploidy( ploidy ),
m_ploidy_dimension( ploidy_dimension ),
m_data( data ),
m_data_dimension( data_dimension ),
m_variant_i( variant_i )
{}
</code></pre>
<p>All these <code class="highlighter-rouge">m_</code> variables are declared as private members of the class. I tend to put these at the end of the class, but I’ll skip that bit for this post. Instead let’s go ahead and implement the API. Because storage is already allocated, there’s not much to do in the first two methods, but let’s add some sanity checks anyway. For the sake of this example we are assuming variants are biallelic, that all samples are at most diploid, and we’ll check the data sizes are large enough:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void initialise( std::size_t number_of_samples, std::size_t number_of_alleles ) {
// Nothing to do but run sanity checks.
// Put them here
if(!(
(number_of_alleles == 2)
&& ( m_data_dimension[0] > m_variant_i )
&& ( m_data_dimension[1] == number_of_samples )
&& ( m_data_dimension[2] == 3 )
&& ( m_ploidy_dimension[0] > m_variant_i )
&& ( m_ploidy_dimension[1] == number_of_samples )
)) {
throw genfile::bgen::BGenError() ;
}
}
void set_min_max_ploidy( genfile::bgen::uint32_t min_ploidy, genfile::bgen::uint32_t max_ploidy, genfile::bgen::uint32_t min_entries, genfile::bgen::uint32_t max_entries ) {
if( max_ploidy > 2 ) {
throw genfile::bgen::BGenError() ;
}
}
</code></pre></div></div>
<p>For the moment we want data on all samples, so let’s tell bgen that. We also cache the sample index here for later use:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> bool set_sample( std::size_t i ) {
m_sample_i = i ;
return true ;
}
</code></pre></div></div>
<p>The <code class="highlighter-rouge">set_number_of_entries()</code> call tells us the ploidy (as well as the number of probability values) for the current sample. Let’s store that at the appropriate index:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void set_number_of_entries(
std::size_t ploidy,
std::size_t number_of_entries,
genfile::OrderType order_type,
genfile::ValueType value_type
) {
// Sanity checks go here
// Compute the index for this variant and sample in the 2d ploidy matrix:
int flatIndex = m_variant_i + (m_sample_i * m_ploidy_dimension[0]) ;
(*m_ploidy)[ flatIndex ] = ploidy ;
}
</code></pre></div></div>
<p>To store the values themselves is similar. There are two versions of the <code class="highlighter-rouge">set_value()</code> method, one for non-missing and one for missing data. In the latter we’ll set the appropriate R missing value:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void set_value( uint32_t entry_i, double value ) {
int flatIndex = m_variant_i + (m_sample_i * m_data_dimension[0]) + (entry_i * m_data_dimension[0] * m_data_dimension[1]) ;
(*m_data)[ flatIndex ] = value ;
}
void set_value( uint32_t entry_i, genfile::MissingValue value ) {
int flatIndex = m_variant_i + (m_sample_i * m_data_dimension[0]) + (entry_i * m_data_dimension[0] * m_data_dimension[1]) ;
(*m_data)[ flatIndex ] = NA_REAL ;
}
</code></pre></div></div>
<p>There’s nothing to do in finalise() either:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> void finalise() {
}
</code></pre></div></div>
<p>And we’re done!</p>
<h2 id="using-the-setter-object">Using the setter object</h2>
<p>Remember this code?</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for( std::size_t variant = 0; variant < number_of_variants; ++variant ) {
view->read_variant( &SNPID, &rsid, &chromosome, &position, &alleles ) ;
// (stuff to do with variant id data here)
view->ignore_genotype_data_block() ; // will be fixed later
}
</code></pre></div></div>
<p>Let’s change it to read the data instead of ignoring it:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> for( std::size_t variant = 0; variant < number_of_variants; ++variant ) {
view->read_variant( &SNPID, &rsid, &chromosome, &position, &alleles ) ;
// (stuff to do with variant id data here)
// Construct the setter object for this variant
DataSetter setter(
&ploidy, ploidy_dimension, &data, data_dimension, variant
) ;
view->read_genotype_data_block( setter ) ;
}
</code></pre></div></div>
<p>Job done.</p>
<p>A final tweak is to give the resulting data suitable names; I’ll skip that here.</p>
<h2 id="does-it-work">Does it work?</h2>
<p>R CMD INSTALL the package and let’s test it:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="w"> </span><span class="n">rbgen</span><span class="w"> </span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">D</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">load</span><span class="p">(</span><span class="w"> </span><span class="s2">"example/example.16bits.bgen"</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'01'</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100000</span><span class="w"> </span><span class="p">))</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">str</span><span class="p">(</span><span class="w"> </span><span class="n">D</span><span class="w"> </span><span class="p">)</span><span class="w">
</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="o">$</span><span class="w"> </span><span class="n">variants</span><span class="o">:</span><span class="s1">'data.frame'</span><span class="o">:</span><span class="w"> </span><span class="m">198</span><span class="w"> </span><span class="n">obs.</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="n">variables</span><span class="o">:</span><span class="w">
</span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">chromosome</span><span class="o">:</span><span class="w"> </span><span class="n">Factor</span><span class="w"> </span><span class="n">w</span><span class="o">/</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="s2">"01"</span><span class="o">:</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">position</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">int</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">198</span><span class="p">]</span><span class="w"> </span><span class="m">1001</span><span class="w"> </span><span class="m">2000</span><span class="w"> </span><span class="m">2001</span><span class="w"> </span><span class="m">3000</span><span class="w"> </span><span class="m">3001</span><span class="w"> </span><span class="m">4000</span><span class="w"> </span><span class="m">4001</span><span class="w"> </span><span class="m">5000</span><span class="w"> </span><span class="m">5001</span><span class="w"> </span><span class="m">6000</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">rsid</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">Factor</span><span class="w"> </span><span class="n">w</span><span class="o">/</span><span class="w"> </span><span class="m">198</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="s2">"RSID_10"</span><span class="p">,</span><span class="s2">"RSID_100"</span><span class="p">,</span><span class="n">..</span><span class="o">:</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">111</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="m">122</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="m">133</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="m">144</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="m">155</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">allele0</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">Factor</span><span class="w"> </span><span class="n">w</span><span class="o">/</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="s2">"A"</span><span class="o">:</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">allele1</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">Factor</span><span class="w"> </span><span class="n">w</span><span class="o">/</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="s2">"G"</span><span class="o">:</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="o">$</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">chr</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">500</span><span class="p">]</span><span class="w"> </span><span class="s2">"sample_001"</span><span class="w"> </span><span class="s2">"sample_002"</span><span class="w"> </span><span class="s2">"sample_003"</span><span class="w"> </span><span class="s2">"sample_004"</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="o">$</span><span class="w"> </span><span class="n">ploidy</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">int</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">198</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">500</span><span class="p">]</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="o">-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">"dimnames"</span><span class="p">)</span><span class="o">=</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">..</span><span class="w"> </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">chr</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">198</span><span class="p">]</span><span class="w"> </span><span class="s2">"RSID_101"</span><span class="w"> </span><span class="s2">"RSID_2"</span><span class="w"> </span><span class="s2">"RSID_102"</span><span class="w"> </span><span class="s2">"RSID_3"</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="w"> </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">chr</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">500</span><span class="p">]</span><span class="w"> </span><span class="s2">"sample_001"</span><span class="w"> </span><span class="s2">"sample_002"</span><span class="w"> </span><span class="s2">"sample_003"</span><span class="w"> </span><span class="s2">"sample_004"</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="o">$</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">198</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="m">0.00168</span><span class="w"> </span><span class="kc">NA</span><span class="w"> </span><span class="m">0.91611</span><span class="w"> </span><span class="m">0.00507</span><span class="w"> </span><span class="m">0.99286</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="o">-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="o">*</span><span class="p">,</span><span class="w"> </span><span class="s2">"dimnames"</span><span class="p">)</span><span class="o">=</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">3</span><span class="w">
</span><span class="n">..</span><span class="w"> </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">chr</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">198</span><span class="p">]</span><span class="w"> </span><span class="s2">"RSID_101"</span><span class="w"> </span><span class="s2">"RSID_2"</span><span class="w"> </span><span class="s2">"RSID_102"</span><span class="w"> </span><span class="s2">"RSID_3"</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="w"> </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">chr</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">500</span><span class="p">]</span><span class="w"> </span><span class="s2">"sample_001"</span><span class="w"> </span><span class="s2">"sample_002"</span><span class="w"> </span><span class="s2">"sample_003"</span><span class="w"> </span><span class="s2">"sample_004"</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="n">..</span><span class="w"> </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">chr</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="s2">"0"</span><span class="w"> </span><span class="s2">"1"</span><span class="w"> </span><span class="s2">"2"</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="w"> </span><span class="n">D</span><span class="o">$</span><span class="n">variants</span><span class="w"> </span><span class="p">)</span><span class="w">
</span><span class="n">chromosome</span><span class="w"> </span><span class="n">position</span><span class="w"> </span><span class="n">rsid</span><span class="w"> </span><span class="n">allele0</span><span class="w"> </span><span class="n">allele1</span><span class="w">
</span><span class="n">RSID_101</span><span class="w"> </span><span class="m">01</span><span class="w"> </span><span class="m">1001</span><span class="w"> </span><span class="n">RSID_101</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="n">G</span><span class="w">
</span><span class="n">RSID_2</span><span class="w"> </span><span class="m">01</span><span class="w"> </span><span class="m">2000</span><span class="w"> </span><span class="n">RSID_2</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="n">G</span><span class="w">
</span><span class="n">RSID_102</span><span class="w"> </span><span class="m">01</span><span class="w"> </span><span class="m">2001</span><span class="w"> </span><span class="n">RSID_102</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="n">G</span><span class="w">
</span><span class="n">RSID_3</span><span class="w"> </span><span class="m">01</span><span class="w"> </span><span class="m">3000</span><span class="w"> </span><span class="n">RSID_3</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="n">G</span><span class="w">
</span><span class="n">RSID_103</span><span class="w"> </span><span class="m">01</span><span class="w"> </span><span class="m">3001</span><span class="w"> </span><span class="n">RSID_103</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="n">G</span><span class="w">
</span><span class="n">RSID_4</span><span class="w"> </span><span class="m">01</span><span class="w"> </span><span class="m">4000</span><span class="w"> </span><span class="n">RSID_4</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="n">G</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">D</span><span class="o">$</span><span class="n">data</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="m">0</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">sample_001</span><span class="w"> </span><span class="m">0.0016784924</span><span class="w"> </span><span class="m">0.0023498894</span><span class="w"> </span><span class="m">0.9959716182</span><span class="w">
</span><span class="n">sample_002</span><span class="w"> </span><span class="m">0.0070191501</span><span class="w"> </span><span class="m">0.0003662165</span><span class="w"> </span><span class="m">0.9926146334</span><span class="w">
</span><span class="n">sample_003</span><span class="w"> </span><span class="m">0.9959411002</span><span class="w"> </span><span class="m">0.0036316472</span><span class="w"> </span><span class="m">0.0004272526</span><span class="w">
</span><span class="n">sample_004</span><span class="w"> </span><span class="m">0.0002441444</span><span class="w"> </span><span class="m">0.9976195926</span><span class="w"> </span><span class="m">0.0021362631</span><span class="w">
</span><span class="n">sample_005</span><span class="w"> </span><span class="m">0.0020141909</span><span class="w"> </span><span class="m">0.9965819791</span><span class="w"> </span><span class="m">0.0014038300</span><span class="w">
</span><span class="n">sample_006</span><span class="w"> </span><span class="m">0.0023498894</span><span class="w"> </span><span class="m">0.0019226368</span><span class="w"> </span><span class="m">0.9957274739</span><span class="w">
</span><span class="n">sample_007</span><span class="w"> </span><span class="m">0.0112001221</span><span class="w"> </span><span class="m">0.9808346685</span><span class="w"> </span><span class="m">0.0079652094</span><span class="w">
</span><span class="n">sample_008</span><span class="w"> </span><span class="m">0.0072632944</span><span class="w"> </span><span class="m">0.9922178988</span><span class="w"> </span><span class="m">0.0005188067</span><span class="w">
</span><span class="n">sample_009</span><span class="w"> </span><span class="m">0.0029907683</span><span class="w"> </span><span class="m">0.9944457160</span><span class="w"> </span><span class="m">0.0025635157</span><span class="w">
</span><span class="n">sample_010</span><span class="w"> </span><span class="m">0.0024719615</span><span class="w"> </span><span class="m">0.9781490806</span><span class="w"> </span><span class="m">0.0193789578</span><span class="w">
</span></code></pre></div></div>
<p>So it works!</p>
<h2 id="the-finished-product">The finished product</h2>
<p>The code from these blogs is <a href="https://bitbucket.org/gavinband/bgen/src/tip/R/package/">currently available</a> as part of the ‘default’ branch of the <a href="http://www.bitbucket.org/gavinband/bgen">bgen repo</a>. (Commit <a href="https://bitbucket.org/gavinband/bgen/src/1332153121bb/R/package/">1332153121bb</a> as of this writing). I’ve made a few tweaks there:</p>
<ul>
<li>Calling a function ‘load’ is not very friendly as it masks the base function. So I’ve called the main function <code class="highlighter-rouge">bgen.load</code> instead.</li>
<li>I’ve arranged for the package to be assembled in the build dir during the normal build phase of the bgen repo. So after compiling the bgen repo, you can install it with <code class="highlighter-rouge">R CMD INSTALL build/R/rbgen</code>.</li>
<li>It is highly experimental (not to mention currently largely untested) so if you do use it, use it with a liberal dose of sanity checking.</li>
</ul>
<p>I’ve also noticed that there are interactions between the compiler used to compile the bgen repo and the compiler used to compile the version of R you’re using. (They’d better be the same.) On our compute cluster, for example, I do</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CXX=/path/to/gcc/5.4.0/bin/g++ CC=/path/to/gcc/5.4.0/bin/gcc ./waf-1.5.18 configure
</code></pre></div></div>
<p>and then I can install the package into a version of R also built with gcc 5.4.0. I think this type of issue is going to be inevitable.</p>
<h2 id="the-future">The future</h2>
<p>I expect to develop this further and hopefully move it to the ‘master’ branch when it becomes stable enough for regular use. In the meantime, if you do try it, let me know how you get on.</p>
<p><a href="https://bitbucket.org/gavinband/bgen/src/tip/R/package/">Enjoy!</a>.</p>Well, I’ve succeeded in making an Rcpp package and loading variants into R. For the final piece of the puzzle it’s time to actually get at the data.Getting biobank data into R, part 22017-05-17T00:00:00+00:002017-05-17T00:00:00+00:00/biobank/bgen/2017/05/17/Getting_biobank_data_into_R_part_2<p>Thanks to <a href="">RStudio</a> and <a href="">Rcpp</a> I’ve succeeded in <a href="/biobank/bgen/2017/05/16/Getting_biobank_data_into_R.html">getting an Rcpp package up and running and making it link to the bgen code</a>. Now it’s down to writing the C++ code that makes it actually work.</p>
<h2 id="implementing-the-interface">Implementing the interface</h2>
<p>We want to implement this function:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Rcpp::List load(
std::string const& filename,
Rcpp::DataFrame ranges
) {
</code></pre></div></div>
<p>to load data from a bgen file. Here’s how we’ll do it.</p>
<h3 id="opening-the-files">Opening the files</h3>
<p>First, we’ll open the bgen file and its index. For this I will use the <code class="highlighter-rouge">genfile::bgen::View</code> and <code class="highlighter-rouge">genfile::bgen::IndexQuery</code> classes that are part of the implementation of <a href="">bgenix</a> in the bgen repo. (The relevant <code class="highlighter-rouge">#include</code>s need to go at the top of the file.)</p>
<p><strong>Warning</strong>: The API for the <code class="highlighter-rouge">View</code> and <code class="highlighter-rouge">IndexQuery</code> classes is somewhat experimental, meaning that it might change in future. (In particular in the current design, the <code class="highlighter-rouge">View</code> class really represents both the bgen file, and a view of that bgen file; likewise <code class="highlighter-rouge">IndexQuery</code> represents both the index and a query against that index. That’s a bit weird.) But they work, so let’s use them without further ado:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> using namespace genfile::bgen ;
using namespace Rcpp ;
View::UniquePtr view = View::create( filename ) ;
IndexQuery::UniquePtr query = IndexQuery::create( filename + ".bgi" ) ;
</code></pre></div></div>
<p>Good! Files successfully opened (or not, in which case an exception will be thrown; we’ll deal with that later if we need to).</p>
<p>(The <code class="highlighter-rouge">UniquePtr</code>s here are typedefs for <code class="highlighter-rouge">std::auto_ptr</code>, which represents unique ownership of the pointed-to objects. These days we are supposed to use <code class="highlighter-rouge">std::unique_ptr</code> instead, but I’m keeping to <code class="highlighter-rouge">std::auto_ptr</code> for the moment to keep the code working on older compilers.)</p>
<h3 id="setting-up-the-query">Setting up the query</h3>
<p>Next we set up the query. I’m going to assume that <code class="highlighter-rouge">ranges</code> is a dataframe specifying a set of genomic ranges, each with a <code class="highlighter-rouge">chromosome</code>, a <code class="highlighter-rouge">start</code> coordinate, and an <code class="highlighter-rouge">end</code> coordinate. (By convention, ranges in bgen are 1-based, closed intervals.) Let’s add these to the query:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    for( std::size_t i = 0; i < ranges.nrows(); ++i ) {
        query->include_range(
            genfile::bgen::GenomicRange(
                ranges["chromosome"][i],
                ranges["start"][i],
                ranges["end"][i]
            )
        ) ;
    }
</code></pre></div></div>
<p>Now: I told you the API isn’t perfect. The query currently needs to be initialised before it can be added to the view:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> query->initialise() ;
view->set_query( query ) ;
</code></pre></div></div>
<p>At this point, <code class="highlighter-rouge">view</code> represents a view of the specified ranges in the bgen file.</p>
<h3 id="reading-the-variant-data">Reading the variant data</h3>
<p>The <code class="highlighter-rouge">load()</code> function has to return quite a lot of information:</p>
<ul>
<li>The list of variants in the query</li>
<li>The list of samples in the data</li>
<li>The ploidy of each sample at each variant</li>
<li>The genotype probabilities for each sample at each variant</li>
</ul>
<p>In addition, these need nice things like useful row and column names. So there’s a bit of fiddling around to do. First let’s set up storage for the data we’ll need:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> std::size_t const number_of_variants = view->number_of_variants() ;
// For this example we assume diploid samples and two alleles
DataFrame variants = DataFrame::create(
Named("chromosome") = StringVector( number_of_variants ),
Named("position") = IntegerVector( number_of_variants ),
Named("rsid") = StringVector( number_of_variants ),
Named("allele0") = StringVector( number_of_variants ),
Named("allele1") = StringVector( number_of_variants )
) ;
StringVector sampleNames ;
</code></pre></div></div>
<p>(<code class="highlighter-rouge">Rcpp::StringVector</code> <a href="http://dirk.eddelbuettel.com/code/rcpp/html/instantiation_8h_source.html">turns out to be</a> the same thing as <code class="highlighter-rouge">Rcpp::CharacterVector</code>. I dunno why there are two names for the same thing.)</p>
<p>We’ll return the genotype data as a 3 dimensional array indexed by the variant, the sample, and the genotype. (For now we’ll assume at most three genotypes exist). Likewise the ploidy will be stored as a 2d array. In this version of Rcpp there seems to be no multidimensional array class, and the right way to set this up is to pass a <code class="highlighter-rouge">Dimension</code> object to the vector constructor instead:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    // the number of samples also comes from the view
    std::size_t const number_of_samples = view->number_of_samples() ;
    Dimension data_dimension = Dimension( number_of_variants, number_of_samples, 3ul ) ;
    Dimension ploidy_dimension = Dimension( number_of_variants, number_of_samples ) ;
    NumericVector data = NumericVector( data_dimension ) ;
    IntegerVector ploidy = IntegerVector( ploidy_dimension ) ;
</code></pre></div></div>
<p>Now we’re ready to get data. First let’s get the list of sample names from the bgen file (some bgen files have no sample names; in this case the View class provides a dummy set of names):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    view->get_sample_ids(
        [&sampleNames]( std::string const& name ) { sampleNames.push_back( name ) ; }
    ) ;
</code></pre></div></div>
<p>The middle line here is a C++11 lambda function that adds each name to the list of sample names.</p>
<p>Now let’s iterate over the variants. In the <code class="highlighter-rouge">bgen::View</code> class, that’s done by alternately calling the <code class="highlighter-rouge">read_variant()</code> and the <code class="highlighter-rouge">ignore_genotype_data_block()</code> or <code class="highlighter-rouge">read_genotype_data_block()</code> methods. (For the moment we’ll ignore the genotypes themselves; I’ll get back to this below.) Like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    std::string SNPID, rsid, chromosome ;
    uint32_t position ;
    std::vector< std::string > alleles ;
    StringVector rsids( number_of_variants ) ;
    for( std::size_t variant = 0; variant < number_of_variants; ++variant ) {
        view->read_variant( &SNPID, &rsid, &chromosome, &position, &alleles ) ;
        variants["chromosome"][variant] = chromosome ;
        variants["position"][variant] = position ;
        variants["rsid"][variant] = rsid ;
        variants["allele0"][variant] = alleles[0] ;
        variants["allele1"][variant] = alleles[1] ;
        rsids[variant] = rsid ;
        view->ignore_genotype_data_block() ; // will be fixed later
    }
</code></pre></div></div>
<p>For ease of use we’ll give <code class="highlighter-rouge">variants</code> row names of the variant IDs:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    variants.attr( "row.names" ) = rsids ;
</code></pre></div></div>
<p>Finally, we need to put everything together in an <code class="highlighter-rouge">Rcpp::List</code> as the return value:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> List result ;
result[ "variants" ] = variants ;
result[ "samples" ] = sampleNames ;
result[ "ploidy" ] = ploidy ;
result[ "data" ] = data ;
</code></pre></div></div>
<p>And we’re done!</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> return( result ) ;
}
</code></pre></div></div>
<h2 id="some-things-i-learned-about-rcpp">Some things I learned about Rcpp</h2>
<p>On trying to compile this I learned some things about Rcpp:</p>
<p><em>Thing 1</em>: <code class="highlighter-rouge">Rcpp.h</code> includes R header files that define some macros, including one called <code class="highlighter-rouge">ERROR</code> (defined in <a href="https://svn.r-project.org/R/trunk/src/include/R_ext/RS.h">RS.h</a>). Bad, bad R! This means you’d better <code class="highlighter-rouge">#include <Rcpp.h></code> last, not before your other files, otherwise you’ll get weirdo errors if you’ve used <code class="highlighter-rouge">ERROR</code> anywhere. In my case, this broke the sqlite headers that try to define an enum value <code class="highlighter-rouge">SQLITE_ERROR</code>.</p>
<p><em>Thing 2</em>: You can’t access dataframe elements as I did above, like <code class="highlighter-rouge">mydataframe["column"][0] = 5</code>. This is because Rcpp doesn’t know what the type of each column in a DataFrame is. Instead you make a reference to the column first, as in</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>IntegerVector column = mydataframe["column"] ;
column[0] = 5 ;
</code></pre></div></div>
<p>or you use the <code class="highlighter-rouge">as<></code> function to specify the type, as in</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>as< IntegerVector >( mydataframe["column"] )[0] = 5 ;
</code></pre></div></div>
<p><em>Thing 3</em>: C++11 support was not turned on for me, so it worked better to replace that C++11 lambda function with a class built for the purpose, like this one:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include <cassert>

struct set_sample_names {
set_sample_names( Rcpp::StringVector* result ):
m_result( result )
{
assert( result != 0 ) ;
}
void operator()( std::string const& value ) {
m_result->push_back( value ) ;
}
private:
Rcpp::StringVector* m_result ;
} ;
</code></pre></div></div>
<p>which can be used like</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> view->get_sample_ids( set_sample_names( &sampleNames ) ) ;
</code></pre></div></div>
<p>instead of the lambda function.</p>
<h2 id="does-it-work">Does it work?</h2>
<p>Let’s try it out. Apple-shift-b again (or <code class="highlighter-rouge">R CMD INSTALL</code> or whatever), and then:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> library( rbgen )
> D = load( "example/example.16bits.bgen", data.frame( chromosome = '01', start = 0, end = 100000 ))
> str( D )
List of 4
$ variants:'data.frame': 198 obs. of 5 variables:
..$ chromosome: Factor w/ 1 level "01": 1 1 1 1 1 1 1 1 1 1 ...
..$ position : int [1:198] 1001 2000 2001 3000 3001 4000 4001 5000 5001 6000 ...
..$ rsid : Factor w/ 198 levels "RSID_10","RSID_100",..: 3 111 4 122 5 133 6 144 7 155 ...
..$ allele0 : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
..$ allele1 : Factor w/ 1 level "G": 1 1 1 1 1 1 1 1 1 1 ...
$ samples : chr [1:500] "sample_001" "sample_002" "sample_003" "sample_004" ...
$ ploidy : int [1:198, 1:500] 0 0 0 0 0 0 0 0 0 0 ...
$ data : num [1:198, 1:500, 1:3] 0 0 0 0 0 0 0 0 0 0 ...
> head( D$variants )
chromosome position rsid allele0 allele1
RSID_101 01 1001 RSID_101 A G
RSID_2 01 2000 RSID_2 A G
RSID_102 01 2001 RSID_102 A G
RSID_3 01 3000 RSID_3 A G
RSID_103 01 3001 RSID_103 A G
RSID_4 01 4000 RSID_4 A G
</code></pre></div></div>
<p>It works!</p>
<p>Of course - we haven’t actually got any genotype data yet. For that, see the <a href="/biobank/bgen/2017/05/18/Getting_biobank_data_into_R_part_3.html">next post</a>.</p>Thanks to RStudio and Rcpp I’ve succeeded in getting an Rcpp package up and running and making it link to the bgen code. Now it’s down to writing the C++ code that makes it actually work.