Help

Introduction

The GigaDB website allows any user to browse, search and view datasets and to access data files. If you want to submit a dataset, save searches or be alerted to new content of interest, we ask that you create an account.

A 'Latest news' section announces updates and new features of the database, and the RSS feed automatically announces each new dataset release.

The GigaDB homepage allows you to browse datasets by type, e.g. Genomic, Metagenomic or Transcriptomic. Clicking on the DOI (digital object identifier) or image will take you directly to the webpage for the dataset of interest.

Alternatively, you can use the search functions to find datasets, samples or files of interest.


GigaDB search

Search Operation

To search across all Dataset, Sample and File records in GigaDB, simply enter a search term in the search bar found at the top of all GigaDB pages.

The search is case-insensitive, so uppercase and lowercase keywords return the same results.

Search results

The search results are grouped by GigaDB Datasets (G), Samples (S) and Files (F).

For each dataset result, the author names and DOI are displayed. Hovering over a dataset name shows the dataset description. Dataset and sample names link to the DOI page for those data, and file links are provided for download.

For each sample result, the sample name, species name and species ID are displayed with links to the NCBI taxonomy page for the species and to the GigaDB dataset page.

For each file result, the file name, file type and file size are displayed with a direct link to the FTP server location of that file.

Only objects that directly match the search term are displayed in the search results. For Files, this means that only files matching the search term are listed; all other files within the same dataset will NOT be displayed.

For example, searching for the term “Potato” will return the dataset with the title “Genomic data from the potato”, which contains 17 files. The search results table will display only 3 of those 17 files, because only 3 contain the search term “potato”. To find all data associated with a dataset, you must follow the link to the dataset page.
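The matching behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the GigaDB implementation: the `FileRecord` class, the file names and the choice to match on the file name alone are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    dataset: str  # title of the dataset the file belongs to
    name: str     # file name (hypothetical values below)


def matching_files(files, term):
    """Return only the files whose own name contains the term, ignoring case."""
    needle = term.casefold()
    return [f for f in files if needle in f.name.casefold()]


# Hypothetical dataset with 17 files, 3 of which mention "potato".
title = "Genomic data from the potato"
files = [FileRecord(title, f"potato_chr{i}.fa") for i in range(1, 4)]
files += [FileRecord(title, f"scaffold_{i}.agp") for i in range(1, 15)]

hits = matching_files(files, "Potato")
print(len(files), len(hits))
```

The other 14 files are still part of the dataset; they are simply absent from the result list, which is why the dataset page remains the place to see everything.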


Filtering results

To the left of the search results, filters let you refine what is shown. By default all filters are disabled, so you see every result for your keyword. To hide some results, choose the relevant filter and select the options that match what you want to see.

Filter options for Datasets:

Filter options for Samples:

Filter options for Files:

Click the 'Apply Filters' button to see your refined results table.


Using Aspera

As many of the GigaDB datasets are large, several in the terabyte range, we have installed Aspera to provide a faster and more reliable method for users to download files from the GigaDB FTP server:

http://aspera.gigadb.org/?B=pub/10.5524/100001_101000/

Login: gigadb
Password: gigadb

In order to use Aspera to download files you first need to install the free AsperaConnect web browser plug-in. For information on setup and use see the documentation on the plug-in site and the Aspera Connect User Guide.

For bulk downloads, we recommend transferring files programmatically with the 'ascp' command-line utility, which is delivered with the AsperaConnect product.
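As a rough sketch of what a scripted transfer might look like, the Python helper below only assembles an 'ascp' argument list and prints it; it does not run anything. The host name and remote path are taken from the URL above, and the port 33001 and the 100m rate cap are assumptions, so check them against the Aspera Connect User Guide before use. ascp can typically read the password from the ASPERA_SCP_PASS environment variable.

```python
import shlex


def build_ascp_command(remote_path, dest_dir, host="aspera.gigadb.org",
                       user="gigadb", max_rate="100m", ssh_port=33001):
    """Assemble an ascp argument list; nothing is executed here."""
    return [
        "ascp",
        "-Q",                 # fair transfer policy
        "-T",                 # disable encryption for higher throughput
        "-l", max_rate,       # cap on the transfer rate (assumed value)
        "-P", str(ssh_port),  # SSH port (assumed value)
        f"{user}@{host}:{remote_path}",
        dest_dir,
    ]


# Remote path copied from the browser URL above; the destination is hypothetical.
cmd = build_ascp_command("pub/10.5524/100001_101000/", "./downloads")
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` once the values have been confirmed.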


Submission guidelines

GigaDB is an open-access database. As such, all data submitted to GigaDB must be fully consented for public release (for more information about our data policies, please see our Terms of use page).

All sequence, assembly, variation, and microarray data must be deposited in a public database at NCBI, EBI, or DDBJ before you submit them to GigaDB. If you would like GigaDB to host files associated with genomic data that are not fully consented for public release, you must first submit the non-public data to dbGaP or EGA.

Step 1 - Create an account or log in to GigaDB

Step 2 - Download and complete the Excel template file. Completed example files for the E. coli (10.5524/100001) and Sorghum (10.5524/100012) datasets are available.

The template file contains:

Mandatory fields are highlighted in yellow.

Study

Required information includes submitter name, email and affiliation; upload status (whether the dataset can be published immediately after review, 'Publish', or should be held until publication, 'HUP'); author list; dataset type(s) (selected from a controlled vocabulary list); dataset title and description; estimated total size of the files that will be submitted; and dataset image information.

Optional information includes links to additional resources and related manuscripts, accessions for data in other databases (prefixes are found in the Links tab), and relationship (if any) to a previously published GigaDB dataset (selected from a controlled vocabulary list).

Samples

Required information includes a sample ID or name (please use an NCBI BioSample ID when possible), species NCBI taxonomy ID, and species common name.

Optional information includes sample attributes (these are automatically populated in GigaDB if an NCBI BioSample ID is provided).

Files

Required information includes a file name or path relative to your home directory and file type (selected from a controlled vocabulary list). A readme file must be provided.

Optional information includes a file description and a sample ID or name.
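Before uploading, the required fields listed above can be checked with a small script. The sketch below is an assumption-laden illustration: the field keys are hypothetical names chosen for the example and do not correspond to the actual column labels in the Excel template.

```python
# Hypothetical field keys; the real Excel template may label these differently.
REQUIRED_FIELDS = {
    "study": ["submitter_name", "submitter_email", "submitter_affiliation",
              "upload_status", "author_list", "dataset_types", "title",
              "description", "estimated_size", "image"],
    "sample": ["sample_id", "taxonomy_id", "common_name"],
    "file": ["file_name", "file_type"],
}


def missing_fields(section, record):
    """List the required keys that are absent or blank in one record."""
    return [key for key in REQUIRED_FIELDS[section]
            if not str(record.get(key) or "").strip()]


def check_submission(study, samples, files):
    """Collect readable problems; an empty list means nothing was flagged."""
    problems = [f"study: missing {k}" for k in missing_fields("study", study)]
    if study.get("upload_status") not in (None, "", "Publish", "HUP"):
        problems.append("study: upload_status must be Publish or HUP")
    for i, sample in enumerate(samples, start=1):
        problems += [f"sample {i}: missing {k}"
                     for k in missing_fields("sample", sample)]
    for i, entry in enumerate(files, start=1):
        problems += [f"file {i}: missing {k}"
                     for k in missing_fields("file", entry)]
    if not any(entry.get("file_type") == "Readme" for entry in files):
        problems.append("files: a readme file must be provided")
    return problems


# Placeholder records: complete except that no readme file is listed.
study = {key: "placeholder" for key in REQUIRED_FIELDS["study"]}
study["upload_status"] = "HUP"
samples = [{"sample_id": "S1", "taxonomy_id": "12345", "common_name": "example"}]
files = [{"file_name": "data/genome.fa", "file_type": "Genome assembly"}]
print(check_submission(study, samples, files))
```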

Step 3 - Confirm you have read our Terms of use page and upload the completed Excel template file.

You can expect a response from the GigaDB team within 5 days to verify the information in your submission and to arrange upload of your files to our FTP site.

If you have any questions, please contact us at database@gigasciencejournal.com.


Controlled vocabulary

Dataset types

Genomic - includes all genetic and genomic data, e.g. sequence, assemblies, alignments, genotypes, variation and annotation.
Minimal requirements: DNA sequence data, e.g. next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).

Epigenomic - includes methylation and histone modification data.
Minimal requirements: Details on methylation sites/status, e.g. qmap files, OR details on histone modification sites/status.

Metagenomic - includes all genetic and genomic data, e.g. sequence, assemblies, alignments, genotypes, variation and annotation, from environmental samples.
Minimal requirements: Environmental DNA sequence data, e.g. next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).

Proteomic - includes all mass spectrometry data.
Minimal requirements: Peptide/protein data, e.g. mass spectrometry data.

Transcriptomic - includes all data relating to mRNA.
Minimal requirements: RNA sequence data, e.g. next-gen raw reads (fastq files) OR transcript statistics, e.g. RNA coverage/depth.

Additional dataset types can be added, upon review, as new submissions are received.


File types

File types and examples of associated file extensions:

Alignments: .bam, .chain, .maf, .net, .sam

Allele frequencies: .frq

Annotation: .gff, .ipr, .kegg, .wego

Coding sequence: .cds, .fa

InDels: .gff, .txt, .vcf

ISA-Tab: see ISA tools

Genome assembly: .agp, .contig, .depth, .fa, .length, .scafseq

Genome sequence: .fastq, .fq

Haplotypes: .haplotype

Methylome data: .fa, .qmap, .rpm, .txt

Protein sequence: .fa, .pep

Readme: .pdf, .txt

SNPs: .annotation, .gff, .txt, .vcf

SVs: .gff, .txt, .vcf

Transcriptome data: .depth, .rpkm, .wig

Other: .xls, .pdf, .txt

Additional file types can be added, upon review, as new submissions are received.
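Because several extensions appear under more than one file type, an extension alone does not always identify the type. The Python sketch below encodes the list above as a lookup table and returns every candidate type for a file name; the function name is made up for this example, and ISA-Tab is omitted because no extension is listed for it.

```python
import os

# File types and example extensions, copied from the list above.
FILE_TYPES = {
    "Alignments": [".bam", ".chain", ".maf", ".net", ".sam"],
    "Allele frequencies": [".frq"],
    "Annotation": [".gff", ".ipr", ".kegg", ".wego"],
    "Coding sequence": [".cds", ".fa"],
    "InDels": [".gff", ".txt", ".vcf"],
    "Genome assembly": [".agp", ".contig", ".depth", ".fa", ".length", ".scafseq"],
    "Genome sequence": [".fastq", ".fq"],
    "Haplotypes": [".haplotype"],
    "Methylome data": [".fa", ".qmap", ".rpm", ".txt"],
    "Protein sequence": [".fa", ".pep"],
    "Readme": [".pdf", ".txt"],
    "SNPs": [".annotation", ".gff", ".txt", ".vcf"],
    "SVs": [".gff", ".txt", ".vcf"],
    "Transcriptome data": [".depth", ".rpkm", ".wig"],
    "Other": [".xls", ".pdf", ".txt"],
}


def candidate_types(filename):
    """Return, sorted by name, every file type that lists this file's extension."""
    extension = os.path.splitext(filename.lower())[1]
    return sorted(t for t, exts in FILE_TYPES.items() if extension in exts)


print(candidate_types("variants.vcf"))
print(candidate_types("genome.fa"))
```

A compressed file such as `reads.fq.gz` yields no candidates here, since only the final extension is inspected; the submitter still has to choose the file type in the template.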


File formats

AGP (.agp) - the Accessioned Golden Path (AGP) file describes the assembly of a larger sequence object from smaller objects:

chr1 1 1972671 0 W scaffold43 1 1972671 m
chr1 1972672 3061819 1 W scaffold8 1 1089148 p
chr1 3061820 3181505 2 W scaffold548 1 119686 m
chr1 3181506 4176151 3 W scaffold313 1 994646 m

The large object can be a contig, a scaffold (supercontig), or a chromosome.
See AGP Specification v2.0
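The nine columns of the example rows can be read with a short Python sketch. The column names follow the usual AGP layout (object, object begin and end, part number, component type, component ID, component begin and end, orientation). The sketch assumes component rows only, as in the example, and keeps the orientation value (m or p here) as plain text.

```python
from typing import NamedTuple


class AgpComponent(NamedTuple):
    object_id: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str
    component_beg: int
    component_end: int
    orientation: str


def parse_component_line(line):
    """Split one whitespace-separated component row into typed fields."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return AgpComponent(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                        f[5], int(f[6]), int(f[7]), f[8])


def rows_consistent(rows):
    """True when each span length agrees and rows of one object are contiguous."""
    for prev, cur in zip(rows, rows[1:]):
        if cur.object_id == prev.object_id and cur.object_beg != prev.object_end + 1:
            return False
    return all(r.object_end - r.object_beg == r.component_end - r.component_beg
               for r in rows)


EXAMPLE = """\
chr1 1 1972671 0 W scaffold43 1 1972671 m
chr1 1972672 3061819 1 W scaffold8 1 1089148 p
chr1 3061820 3181505 2 W scaffold548 1 119686 m
chr1 3181506 4176151 3 W scaffold313 1 994646 m
"""
rows = [parse_component_line(line) for line in EXAMPLE.splitlines()]
print(rows_consistent(rows))
```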

BAM (.bam) - the Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.

BIGWIG (.bw) - the BIGWIG format is for storing dense, continuous data (such as GC percent, probability scores, and transcriptome data) that will be displayed in the UCSC Genome Browser as a graph. BIGWIG files are created initially from wiggle (WIG) type files, using the program wigToBigWig.

CHAIN (.chain) - the CHAIN format describes a pairwise alignment that allows gaps in both sequences simultaneously and is used by the UCSC Genome Browser.

CONTIG (.contig) - the CONTIG format is a direct output from the SOAPdenovo assembly program:

>1 length 32 cvg_0.0_tip_0
GAGAACGGCGAAGCCTGCTCGGGCCCGTTATA
>3 length 32 cvg_23.0_tip_0
TAGCAGCGATTTGATCAAACTCAATCTTACCG
>5 length 32 cvg_40.0_tip_0
GGTAAGATTGAGTTTGATCAAATCGCTGCTAT
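The records above follow a FASTA-like layout, so they can be read with a few lines of Python. This is a sketch based only on the three example records: the header is assumed to be an ID, the word "length", the sequence length, and a tag of the form cvg_X_tip_Y, where reading "cvg" as coverage is an assumption.

```python
def read_contigs(text):
    """Parse SOAPdenovo-style contig records into a list of dictionaries."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header such as: >3 length 32 cvg_23.0_tip_0
            contig_id, _, length, tag = line[1:].split()
            tag_parts = tag.split("_")  # ['cvg', '23.0', 'tip', '0']
            current = {
                "id": contig_id,
                "length": int(length),
                "coverage": float(tag_parts[1]),  # assumed meaning of cvg
                "tip": int(tag_parts[3]),
                "seq": "",
            }
            records.append(current)
        elif current is not None:
            current["seq"] += line
    return records


EXAMPLE = """\
>1 length 32 cvg_0.0_tip_0
GAGAACGGCGAAGCCTGCTCGGGCCCGTTATA
>3 length 32 cvg_23.0_tip_0
TAGCAGCGATTTGATCAAACTCAATCTTACCG
>5 length 32 cvg_40.0_tip_0
GGTAAGATTGAGTTTGATCAAATCGCTGCTAT
"""
contigs = read_contigs(EXAMPLE)
print(all(len(c["seq"]) == c["length"] for c in contigs))
```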

EXCEL (.xls, .xlsx) - Microsoft Office spreadsheet files

FASTA (.fasta, .fa, .seq, .cds, .pep, .scafseq [SOAPdenovo output file - sequence of each scaffold]) - FASTA is a text-based format for representing either nucleotide sequences or peptide sequences.

FASTQ (.fq, .fastq) - the FASTQ format stores sequences (usually nucleotide sequence) and Phred qualities in a single file.

GFF (.gff) - The General Feature Format (GFF) is used for describing genes and other features of DNA, RNA and protein sequences.

IPR (.ipr) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the IPR (InterPro) ID(s):

CR_ENSP00000334840
CR_ENSMMUP00000018123 IPR000504 IPR003954
CR_ENSP00000333725 IPR001781 IPR015880 IPR007087 IPR001909

See WEGO: a web tool for plotting GO annotations

KEGG (.kegg) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the KEGG (Kyoto Encyclopedia of Genes and Genomes) ID(s):

CR_ENSMMUP00000031408 ko03010
CR_ENSP00000364815 ko00970 ko00290
CR_ENSP00000414605 ko05146 ko04510 ko04512

See WEGO: a web tool for plotting GO annotations

MAF (.maf) - the Multiple Alignment Format (MAF) stores a series of multiple alignments at the DNA level between entire genomes.

NET (.net) - the NET file format is used to describe the axtNet data that underlie the net alignment annotations in the UCSC Genome Browser.

PDF (.pdf) - portable document format

PNG (.png) - portable network graphics

QMAP (.qmap) - QMAP files are generated for methylation data from an internal BGI pipeline.

QUAL (.qual) - the QUAL file format stores base quality scores for next-generation sequencing data (similar in layout to FASTA).

RPKM (.rpkm) - gene expression levels are reported as Reads Per Kilobase per Million mapped reads (RPKM), e.g. a 1 kb transcript with 1000 alignments in a sample of 10 million reads (of which 8 million can be mapped) will have RPKM = 1000/(1 * 8) = 125:

ENSP00000379387 15.5651433366423 6002951 289 3093
ENSP00000349977 24.7483107230444 6002951 398 2679
ENSP00000368887 24.6477413647837 6002951 174 1176
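The calculation can be written as a small Python function. The column interpretation used for the example row (ID, RPKM, total mapped reads, reads assigned to the gene, gene length in bases) is inferred from the numbers rather than stated in the format description, so treat it as an assumption; under that reading the function reproduces the value in the first row.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (length_bp * total_mapped_reads)


# Worked example from the text: 1 kb transcript, 1000 alignments, 8 million mapped reads.
print(rpkm(1000, 1000, 8_000_000))

# First example row, assuming columns 3-5 are mapped reads, read count and length.
print(rpkm(289, 3093, 6002951))
```

The factor of 1e9 combines the two scalings in the name: 1e3 for "per kilobase" and 1e6 for "per million mapped reads".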

SAM (.sam) - the Sequence Alignment/Map (SAM) format is a TAB-delimited text format consisting of an optional header section and an alignment section. Most often it is generated as a human-readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines.
See The Sequence Alignment/Map format and SAMtools

TAR (.tar) - an archive containing other files

TEXT (.doc, .readme, .text, .txt) - a text file

VCF (.vcf) - the Variant Call Format (VCF) is a text file format for representing variation data, e.g. SNPs, InDels, CNVs, SVs, microsatellites and genotypes.

WEGO (.wego) - the Web Gene Ontology (WEGO) Annotation format consists of the protein ID, followed by column(s) that are the GO ID(s):

Bmb015379_2_IPR001092
Bmb003749_1_IPR006329 GO:0009168 GO:0003876
Bmb006173_1_IPR000909 GO:0007165 GO:0004629 GO:0007242

See WEGO: a web tool for plotting GO annotations
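The .ipr, .kegg and .wego examples share one layout: a protein ID followed by zero or more annotation IDs. A minimal Python sketch for reading such lines, assuming whitespace-separated columns as in the examples, could look like this:

```python
def parse_annotation_line(line):
    """Split one line into (protein_id, [annotation IDs]); the list may be empty."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    return fields[0], fields[1:]


# Lines taken from the .wego and .ipr examples above.
for line in ["Bmb015379_2_IPR001092",
             "Bmb003749_1_IPR006329 GO:0009168 GO:0003876",
             "CR_ENSMMUP00000018123 IPR000504 IPR003954"]:
    protein_id, annotations = parse_annotation_line(line)
    print(protein_id, len(annotations))
```

A protein with no annotation, like the first line, simply yields an empty list.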

WIG (.wig) - the output file from TopHat is a UCSC wigglegram of alignment coverage.

UNKNOWN - any file format not in this list

XML (.xml) - eXtensible Markup Language


Upload status

Publish: this dataset is fully consented for immediate release upon GigaDB approval

HUP: this dataset should be Held Until Publication (HUP)


DOI relationship

The DOI relationship vocabulary is taken from the DataCite 'relationType' schema property (ID=12.2).

Definition: Description of the relationship of the resource being registered (A) and the related resource (B).

IsSupplementTo: indicates that A is a supplement to B

IsSupplementedBy: indicates that B is a supplement to A

IsNewVersionOf: indicates A is a new edition of B, where the new edition has been modified or updated

IsPreviousVersionOf: indicates A is a previous edition of B

IsPartOf: indicates A is a portion of B; may be used for elements of a series

HasPart: indicates A includes the part B

References: indicates B is used as a source of information for A

IsReferencedBy: indicates A is used as a source of information by B
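The eight terms above form four inverse pairs: if A has a given relationship to B, then B has the paired relationship to A. The Python sketch below encodes that pairing; the function name and the placeholder identifiers "A" and "B" are illustrative only.

```python
# Each term listed above paired with its inverse.
_PAIRS = [
    ("IsSupplementTo", "IsSupplementedBy"),
    ("IsNewVersionOf", "IsPreviousVersionOf"),
    ("IsPartOf", "HasPart"),
    ("References", "IsReferencedBy"),
]
INVERSE = {}
for forward, backward in _PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward


def inverse_statement(resource_a, relation, resource_b):
    """Restate 'A relation B' from the point of view of B."""
    return resource_b, INVERSE[relation], resource_a


print(inverse_statement("A", "IsSupplementTo", "B"))
```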