Me vs. the EGA part 2: uploading data

In part one of this series, we took on the EGA by submitting a study and a set of samples to it. The EGA is now intrigued by the taste of data we’ve sent it - not to mention confused by our persistence. It was pretty sure we would choose options 1 or 2. In other words it’s on the back foot.

Nevertheless, there’s still a long way to go. This post is about actually getting data uploaded and submitted. First, remember to ignore all the red herrings. We’re using the XML schema to do this because it’s the One True Way*.

* that I got to work.

Where to put the data

The basic trick is to place files in absolute paths on the FTP site that are the paths you want the data users to download.

* This may seem obvious. But my initial instinct was to put data in well-organised subfolders of my choosing, but that wouldn’t be exposed to data users. Turns out that won’t work.

For my study I want people to get a folder of data in this form:

    EGAS0000XXXXXX/
        dataset1_name/
            sample1.cram
            sample1.cram.crai
            sample2.cram
            sample2.cram.crai
            ...
        dataset2_name/
            sample1.cram
            sample1.cram.crai
            ...
        ...

Where EGAS0000XXXXXX is the EGA identifier for the study. You did make a note of this right?* So I’ve uploaded data into a top-level folder EGAS0000XXXXXX on the FTP site.

* personally, I didn’t know the EGA identifier when I uploaded the data. Instead I used the placeholder folder name EGAS0000XXXXXX as shown above. Then I renamed it after getting the study ID during the real submission. Another way would be to submit the study XML from the previous post to the live service first, as that will give you the EGA ID. I’ll cover that in the next post. Yet another way is to simply not name your files with the EGA identifier in. I’m doing it because I’m stubborn that way.

How to prepare the data

You can’t just upload your data to the EGA - you have to encrypt it and compute checksums first. You upload the encrypted files (*.gpg), the checksums (*.md5), and the checksums of the encrypted files (*.gpg.md5). Is all this encryption and checksumming overkill? I don’t know, but that’s what you have to do.

EGA provides a tool for this, called EgaCryptor.jar. Get it here. Run it on each file as

$ java -jar ../EgaCryptor.jar -file <filename>

I’m not going to go into how to do this across many files. Use find -exec to do it, or use a compute cluster. This can be a mini-adventure, set in the broader narrative context of the larger one, for you to complete on your own. There may be challenges on the way, but I reckon you can handle them.

You now upload all these files to your ega box. For large directories of files, I found ncftpput to be the simplest tool for this. So if, locally, your files are all in a local EGAS0000XXXXXX folder:

$ ncftpput -u <ega-box-XXX> -p <password> -R ftp.ega.ebi.ac.uk /EGAS0000XXXXXX/ ./EGAS0000XXXXXX/

Where -R specifies recursive mode, and it’s ncftpput [options] host remotedir localdir. So what you’ve got now on the ftp site is:

EGAS0000XXXXXX/
    dataset1_name/
        sample1.cram.gpg
        sample1.cram.gpg.md5
        sample1.cram.md5
        sample1.cram.crai.gpg
        sample1.cram.crai.gpg.md5
        sample1.cram.crai.md5
        sample2.cram.gpg
        sample2.cram.gpg.md5
        sample2.cram.md5
        sample2.cram.crai.gpg
        sample2.cram.crai.gpg.md5
        sample2.cram.crai.md5
        ...
    dataset2_name/
        sample1.cram.gpg
        sample1.cram.gpg.md5
        sample1.cram.md5
        sample1.cram.crai.gpg
        sample1.cram.crai.gpg.md5
        sample1.cram.crai.md5
        ...
    ...

Note there aren’t any of the unencrypted files here. I acheived that by first making a parallel directory structure with all the .md5 and .gpg and .gpg.md5 files in it (but not the original files) before running ncftpput.

How to submit a CRAM file

It’s time to feed the monster again. In this step we tell EGA about our files by submitting another XML.

This is going to get a bit complicated because the XML has to contain

a reference to the study - as in the study XML we submitted before. (This is where the ‘alias’ for the study is used.)
the full name of the unencrypted file
the md5sum of the unencrypted file
and the md5sum of the encrypted file

For the CRAM files we’re working with, it also has to contain

a reference to the relevant sample - as in the sample XML we submitted before. (This is where the ‘alias’ for the sample is used.)
details on the reference used for alignment. This includes the name and accession of the reference, and those of all the reference sequence names.

The analysis XML

For a test with one sample I’m going to assume:

the study alias is my_study_v1
the sample alias is illumina_hiseq:test_sample_1
the files are at EGAS0000XXXXXX/dataset1/test_sample1.cram[.crai].

My CRAM files are aligned to GRCh37. In the analysise XML we’re supposed to give an accession for this reference - in my case it is the GenBank assembly accession GCA_000001405.1. Moreover, you’re supposed to also give an accession for each reference contig. This makes the full XML pretty long. Here’s a simplified version:

<ANALYSIS_SET>
<ANALYSIS alias="test_analysis_1" center_name="<center name>" broker_name="EGA" >
        <TITLE>
            Aligned reads file for illumina_hiseq:test_sample_1
        </TITLE>
        <DESCRIPTION>Aligned sequence reads for illumina_hiseq:test_sample_1, mapped to GRCh37</DESCRIPTION>
        <STUDY_REF refname="my_study_v1" refcenter="<center name>"/>
        <SAMPLE_REF refname="illumina_hiseq:test_sample_1" refcenter="<center name>" label="illumina_hiseq:test_sample_1"/>
        <ANALYSIS_TYPE>
            <REFERENCE_ALIGNMENT>
                <ASSEMBLY>
                    <STANDARD refname="GRCh37" accession="GCA_000001405.1"/>
                </ASSEMBLY>
                <SEQUENCE accession="CM00663.1" label="1"/>
                <SEQUENCE accession="CM00664.1" label="2"/>
                (etc.)
            </REFERENCE_ALIGNMENT>
        </ANALYSIS_TYPE>
        <FILES>
            <FILE filename="EGAS0000XXXXXX/dataset1/test_sample_1.cram" filetype="cram" checksum_method="MD5" checksum="fd847e0c4849ec50cdf310accd4293b0" unencrypted_checksum="d41d8cd98f00b204e9800998ecf8427e"/>
            <FILE filename="EGAS0000XXXXXX/dataset1/test_sample_1.cram.crai" filetype="crai" checksum_method="MD5" checksum="2b37te0c4994f850cdf310accd8b04c" unencrypted_checksum="e32d7cd98f00b204e5801958ecf772S"/>
        </FILES>
    </ANALYSIS>
</ANALYSIS_SET>

How to get the checksums in there? Well to upload the files to the FTP site you had to encrpyt them. So you can read the checksums from the .md5 files that were created.

Submit with:

$  curl \
-u <ega-box-XXX>:<password> \
-F "ANALYSIS=@cram.xml" \
-F "SUBMISSION=@submit.xml" \
https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/

If successful it’ll return something like this:

<RECEIPT receiptDate="2019-05-01T13:49:59.366+01:00" submissionFile="submit.xml" success="true">
     <ANALYSIS accession="EGAZ00001399343" alias="test_analysis_1" status="PRIVATE"/>
     <SUBMISSION accession="EGA00001532515" alias="SUBMISSION-01-05-2019-13:49:59:289"/>
     <MESSAGES>
          <INFO>Submission has been committed.</INFO>
          <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
     </MESSAGES>
     <ACTIONS>ADD</ACTIONS>
     <ACTIONS>PROTECT</ACTIONS>
</RECEIPT>

If not there’ll be errors.

Some of it has issues

At this point I ran into a problem. For this XML to go in, the relevant encrypted (.gpg) files have to be on the FTP site in the specified locations (though the md5sums don’t seem to actually have to be right - don’t know why, but I imagine this is checked later). However, as I mentioned in my earlier post,

There’s a delay in the system recognising what you’ve put on the FTP site.

As far as I can tell, this is a delay of up to 24 hours, though I couldn’t find this documented.

Depending on how you work, this might not affect you. It is a real pain in the neck for me, as I’m a last-minute tinkerer trying to “get things right” for submission today. But having just renamed some files, I’m going to have to wait. The moral is: pick your filenames carefully from the start, and stick with them.

(Another complication with the above is that my sequences were actually aligned to the hs37d5 reference (available here) that includes a concatenated set of decoy sequences. I couldn’t find an accession for that and don’t know right now if this will come back to bite me later.)

Submitting many CRAM files

This is now easy, right? Just repeat the <ANALYSIS> tags as many times as you want within the XML. Of course, you’ll have to write a computer program to generate all that XML. You’ll have to make this code read all the .md5 files, so that you can put the md5 values in the XML.

Should be straightforward. I’ll leave it to you.

But what if I don’t have CRAM files.

Oh. Well, as it turns out there are lots of different stuff that can be submitted. As the https://ega-archive.org/files/Analysis_BAM.xml says:

<!-- Many other filetypes can be used to define your analysis file and
    multiple file types can be submitted for a single analysis.
    For example, you may wish to submit a readme_file and
    phenotype_file to accompany your bam file: 
    <"cram"/>
    <"tabix"/>
    <"wig"/>
    <"bed"/>
    <"gff"/>
    <"fasta"/>
    <"contig_fasta"/>
    <"contig_flatfile"/>
    <"scaffold_fasta"/>
    <"scaffold_flatfile"/>
    <"scaffold_agp"/>
    <"chromosome_fasta"/>
    <"chromosome_flatfile"/>
    <"chromosome_agp"/>
    <"chromosome_list"/>
    <"unlocalised_contig_list"/>
    <"unlocalised_scaffold_list"/>
    <"sample_list"/>
    <"readme_file"/>
    <"phenotype_file"/>
    <"OxfordNanopore_native"/>
    <"other"/>
    -->

Unfortunately here the EGA is pernickety (I mean “controlled”) and I found that I couldn’t submit exactly what I wanted. Instead I got messages of this form:

 <ERROR>
 In analysis, alias:"illumina_hiseq:test_sample_1", accession:"".
 Invalid group of files: 1 "other" file, 1 "cram" file, 1 "crai" file.
 Supported file grouping(s) are:
 [1 "bam" file, 0..1 "bai" files, 0..1 "readme_file" files],
 [1 "cram" file, 0..1 "crai" files, 0..1"readme_file" files].
 </ERROR>

So this suggests you can’t just submit what you want. A CRAM file can only go with an index file and a README, not e.g. an annotation file or any other associated information.

This is a bit of a problem for my data, because in addition to the CRAM files I want to release

a VCF file of array genotypes for these samples.
annotation files for the sequenced and chip-typed samples
and a README.

And I want to release them all inside the same dataset.

One way to do this seems to be to link them all to a genotyping analysis included within the ANALYSIS_SET. Like this:

<ANALYSIS alias="omni_typing_for_test_project" center_name="<center name>" broker_name="EGA" >
    <TITLE>Illumina Omni 2.5M genotyping</TITLE>
    <DESCRIPTION>Illumina Omni 2.5M genotyping</DESCRIPTION>
    <STUDY_REF refname="my_study_v1" refcenter="<center name>"/>
    <SAMPLE_REF refname="illumina_omni2.5M:test_sample_1" refcenter="MalariaGEN" label="3999807010_R01C01"/>
    <SAMPLE_REF refname="illumina_omni2.5M:test_sample_2" refcenter="MalariaGEN" label="3999807010_R01C01"/>
    ...
    <ANALYSIS_TYPE>
        <SEQUENCE_VARIATION>
        <ASSEMBLY>
            <STANDARD refname="GRCh37" accession="GCA_000001405.1"/>
        </ASSEMBLY>
        <SEQUENCE accession="CM00663.1" label="1"/>
        <SEQUENCE accession="CM00664.1" label="2"/>
        ...
        <EXPERIMENT_TYPE>Genotyping by array</EXPERIMENT_TYPE>
        </SEQUENCE_VARIATION>
    </ANALYSIS_TYPE>
    <FILES>
        <FILE filename="test_submission/test_samples.vcf.gz" filetype="vcf" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
        <FILE filename="test_submission/test_samples.vcf.gz.tbi" filetype="tabix" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
        <FILE filename="test_submission/test_samples.tsv" filetype="other" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
        <FILE filename="test_submission/README.md" filetype="readme" checksum_method="MD5" checksum="47f0d94097b043fa4ec6a028da8c2de4" unencrypted_checksum="ac5dd03c3927f39806c4e593e78290d2"/>
    </FILES>
    <ANALYSIS_ATTRIBUTES>
        <ANALYSIS_ATTRIBUTE>
        <TAG>platform</TAG>
        <VALUE>Illumina Omni 2.5M</VALUE>
        </ANALYSIS_ATTRIBUTE>
    </ANALYSIS_ATTRIBUTES>
</ANALYSIS>

For other types of data, you’re on your own. (or maybe I should point you at this or this or this. Yes, that’s a lot of stuff to read but you didn’t hear Hercules complaining, did you?)

How goeth the quest?

The EGA, let’s be honest, is fighting back. Although we’ve navigated its network of complicated XML schema, it has brought to bear stringent rules that confound our expectations, and is trying to annoy us by pretending our data isn’t even there. This is as good a time as any to give up - to choose life, say, or to choose to go and watch a colour TV, or to go back and choose option 1 or 2. But I’m not going to. I’m going to defeat the beast, and then I’m going to have a good moan/gloat about it on my blog.

(Although in practice…)