About
This site provides the test data that were used for the efficient construction of a compressed de Bruijn graph for pan-genome analysis.
Human Genome
The file 7 genomes has a file size of 21,625,319,541 bytes (md5sum: affc827aa48b7cfd07eb9d7e071a3bf3) and contains 21,201,290,946 base pairs. It was created by concatenating the following files in the order listed:
hg16 (NCBI34) from July 2003
Download (md5sum: 9c4567258b47b6dd466225c58da65eb4)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg16/chromosomes/
Comment: Modified file - converted lowercase to uppercase and removed 3 characters (RR and M) from chromosome 3.
hg17 (NCBI35) from May 2004
Download (md5sum: 57f5af6e6004497f82b284b75a712486)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/
Comment: Modified file - converted lowercase to uppercase and removed 3 characters (RR and M) from chromosome 3.
hg18 (NCBI36) from Mar. 2006
Download (md5sum: f37590f3007ac483488891113f222dc8)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/
Comment: Modified file - converted lowercase to uppercase and removed 3 characters (RR and M) from chromosome 3.
hg19 (GRCh37) from Feb. 2009
Download (md5sum: 55c0eb9b019d9f727b0d0ae42b5ca237)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/
Comment: Modified file - converted lowercase to uppercase.
hg38 (GRCh38) from Dec. 2013
Download (md5sum: ea47ff706942f5e58b327aac61e528d6)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/
Comment: Modified file - converted lowercase to uppercase.
maternal haplotype of NA12878
The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.
Download (md5sum: 4a5e7ffec07364de66e56022d5864107)
Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip
Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.
paternal haplotype of NA12878
The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.
Download (md5sum: 75e170b383de42aeb14732cabeab9a00)
Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip
Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.
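The md5sum values listed for each download can be used to check that a file arrived intact. The following is a minimal Python sketch of such a check, not part of the original data preparation. The function names `md5sum` and `verify` and the local file name in the comment are hypothetical.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte genome files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest equals the listed md5sum."""
    return md5sum(path) == expected.lower()


# Example call (hypothetical local file name, md5sum taken from the hg16 entry):
# verify("hg16.fa", "9c4567258b47b6dd466225c58da65eb4")
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for the 21 GB concatenated file.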
Human Chromosome 1
Chr1 of hg16 (NCBI34) from July 2003
Download (md5sum: f339b1e234e9b708d04ef7928ccbcd7e)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg16/chromosomes/chr1.fa.zip
Comment: Modified file - converted lowercase to uppercase.
Chr1 of hg17 (NCBI35) from May 2004
Download (md5sum: 057693e7e5be4a813610dc49d2647f05)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/chr1.fa.gz
Comment: Modified file - converted lowercase to uppercase.
Chr1 of hg18 (NCBI36) from Mar. 2006
Download (md5sum: b9fb6b270b7e6cf777a5925c904a7f9e)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
Comment: Modified file - converted lowercase to uppercase.
Chr1 of hg19 (GRCh37) from Feb. 2009
Download (md5sum: a46474e572b3be254b8f4e59034d6238)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
Comment: Modified file - converted lowercase to uppercase.
Chr1 of hg38 (GRCh38) from Dec. 2013
Download (md5sum: 358f980f9e54a41f2df778b7a89a620e)
Src: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
Comment: Modified file - converted lowercase to uppercase.
1_NA12878_maternal.fa of NA12878
The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.
Download (md5sum: c4af94539eab79b56de98f7767a72f3f)
Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip
Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.
1_NA12878_paternal.fa of NA12878
The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid.
Download (md5sum: 335f5e6754218a825939a4b485c5c85d)
Src: http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_genome_2012_dec16.zip
Comment: Users of this assembly are requested to cite: Rozowsky J et al. (2011). AlleleSeq: Analysis of allele-specific expression and binding in a network framework. Molecular Systems Biology, 7, 522.
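Several of the files above are marked "converted lowercase to uppercase". The following Python sketch shows one way such a conversion could be done on a FASTA file. It is an illustration only: the exact procedure used for the files on this page is not documented here, and leaving header lines (those starting with ">") unchanged is an assumption. The names `uppercase_fasta` and `convert_file` are hypothetical.

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence characters uppercased.

    Assumption: header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


def convert_file(src_path, dst_path):
    """Stream src_path to dst_path line by line, uppercasing sequence lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(uppercase_fasta(src))
```

Processing the file line by line avoids loading a whole chromosome into memory.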
E. coli
We used (almost) the same E. coli files as described in the Supplementary Data of "SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips". To obtain them, we downloaded the sequences (or, where appropriate, their reverse complements) from www.ncbi.nlm.nih.gov/nuccore/ using the following accession numbers.
Accession number | File name | md5sum | Comment |
FM180568 | 01.FM180568.sequence.fasta | 3939f45d6d97afa76a2fbb603307a237 | |
FN554766 | 02.FN554766.sequence.fasta | b5726c3bae898831d5240f8897736c12 | |
CP000247 | 03.CP000247.sequence.fasta | 08643a4078ec97b36ea3da402bef95f6 | |
CU928145 | 04.CU928145.sequence.fasta | d6e80db065ddf221b5925888ff8edd67 | |
CP001671 | 05.CP001671.sequence.fasta | 97f209b1693b222e97a828d3c5a9c449 | |
CP000468 | 06.CP000468.sequence.fasta | 80d233b0ffab129579f55789b136ad2e | |
CP004009 | 07.CP004009.sequence.fasta | 59b2b0fe09f5c2d4bd93475a0ecee948 | |
CP004009 | 08.CP004009.sequence.fasta | c6ed68fd4899e8deac84f57afbb8e700 | |
AM946981 | 09.AM946981.sequence.fasta | 6bd14dc519e588723715ff04fb289742 | |
CP001396 | 10.CP001396.sequence.fasta | 6aa6aee90306e7800d681ad2a20e0c03 | |
CP000819 | 11.CP000819.sequence.fasta | e79fd22c2716252d440e9f6b7e60546f | |
AE014075 | 12.AE014075.sequence.fasta | 5b6fb26a0e33185fbedc6ef017858c85 | |
CP000946 | 13.CP000946.rev.sequence.fasta | a08d364f46c80c5cffe06c1fe8384e35 | In contrast to SplitMEM, we use the reverse complement of this sequence |
CP001637 | 14.CP001637.rev.sequence.fasta | a6c4702cc30c2bc5b797cdb100652eea | This is the reverse complemented sequence |
AP012030 | 15.AP012030.sequence.fasta | 376ed43655a0b17ab910e953f0ee0f58 | |
CP000800 | 16.CP000800.sequence.fasta | c308c831695ae52b69b3e3a06d91f91b | |
CU928162 | 17.CU928162.sequence.fasta | c0a080d2e1ad0e1f5db1f29de0783802 | |
FN649414 | 18.FN649414.sequence.fasta | 05d5f327b5786f16e569edcd5a4ae4d5 | |
CP000802 | 19.CP000802.sequence.fasta | 5796adc77aaf0ac2803e5a84a3bf9e1a | |
CU928160 | 20.CU928160.sequence.fasta | 96e0455c30720ab5cbf950ba5deef623 | |
CU928164 | 21.CU928164.sequence.fasta | 17af046949d2bc75da6be07441b09b24 | |
CP001969 | 22.CP001969.sequence.fasta | 3ae46456e0f1bad4214e4cad7101cb6e | |
CP006784 | 23.CP006784.sequence.fasta | 6543a8baf0a9aedbfe9661674b38ae4e | |
CP002516 | 24.CP002516.rev.sequence.fasta | 6fd8c3dafd4e5c3c691185429edd9e7e | This is the reverse complemented sequence |
CP002970 | 25.CP002970.rev.sequence.fasta | 5ad67263edba15f2ece68c6afab893cf | This is the reverse complemented sequence |
CP000948 | 26.CP000948.sequence.fasta | c5e591bf4793f8649c4d880151aff13c | |
AP012306 | 27.AP012306.sequence.fasta | ab94cfcbf5015cbca0bba3b35b74b0da | |
U00096 | 28.U00096.sequence.fasta | 7c5486a762455b4b811ea2411aa111d7 | The SplitMEM paper reports a different file size |
AP009048 | 29.AP009048.sequence.fasta | 059c8fae616045cb0a3447ed04316295 | |
CU651637 | 30.CU651637.sequence.fasta | dcbb0c93454bca2b1ab274bf6e14c876 | |
CP006584 | 31.CP006584.sequence.fasta | 5ff79d9d936dec004aa190fec1c91b98 | |
CP002797 | 32.CP002797.sequence.fasta | 59780302ab6da977cf19f0858dd84ba3 | |
NC_013353 | 33.NC_013353.sequence.fasta | e35447195a6eb85b956c93c66b32853d | The SplitMEM paper uses the sequence with the accession number P010958 |
CP003297 | 34.CP003297.rev.sequence.fasta | dfc12bbdc2e4fc66dfe075671f1310db | This is the reverse complemented sequence |
CP003301 | 35.CP003301.rev.sequence.fasta | 4f0036e0cbf337b4424b1d647c9f2717 | This is the reverse complemented sequence |
CP003289 | 36.CP003289.rev.sequence.fasta | 86ce567ff234f60cf515819e83e75d49 | This is the reverse complemented sequence |
AP010960 | 37.AP010960.sequence.fasta | 1f3cadfd676587468621ec6118d0a572 | |
AE005174 | 38.AE005174.sequence.fasta | 0605d65eccb8411792eb03517d8d6e1c | |
BA000007 | 39.BA000007.sequence.fasta | 2819c48fc28e4399eefdcd6b1695c6c0 | |
CP001164 | 40.CP001164.sequence.fasta | 99308a74849430a134967e2645b62ca5 | |
CP001368 | 41.CP001368.sequence.fasta | 36c1491fb544769e24c4dfeb997aeb0b | |
AP010953 | 42.AP010953.sequence.fasta | db3a056d9472995ebfd524c1a0720112 | |
CP001846 | 43.CP001846.sequence.fasta | 284b034e88e11e169de16b46df425f83 | |
CP003109 | 44.CP003109.sequence.fasta | ee68b91e95eb5e9821101df85e98b184 | |
CP003034 | 45.CP003034.sequence.fasta | 466006b02d21f7ff8fb68e43f6ee305e | |
CP001855 | 46.CP001855.sequence.fasta | 60b61bb47971940d1ff7f5ae98e7b678 | |
CP002291 | 47.CP002291.sequence.fasta | 95b8882b832471c5ff4bda6fc53754d7 | |
CU928161 | 48.CU928161.sequence.fasta | f08f59f924fb3d0b2786cacfd95db243 | |
AP009240 | 49.AP009240.sequence.fasta | 6647d2f1657637e133b37401b30c3e68 | |
AP009378 | 50.AP009378.sequence.fasta | e8cc199da10e1fa45157683720e4b68f | |
CP000970 | 51.CP000970.sequence.fasta | 979fe28928cddf5f8bed140173c71077 | |
CP002167 | 52.CP002167.rev.sequence.fasta | 8d84661b1cf63e9b9732cd830812c072 | This is the reverse complemented sequence |
CU928163 | 53.CU928163.sequence.fasta | d8bfddb71f6dd61661fb1a4f9eedc372 | The SplitMEM paper reports a different file size |
CP002729 | 54.CP002729.sequence.fasta | d5344aaf32bd04140be3140624c61817 | |
CP000243 | 55.CP000243.sequence.fasta | d3e1c749af0a39d29190c61f089aa6df | |
CP002185 | 56.CP002185.sequence.fasta | 8e54c38451669a0e1dba0c49d76caf90 | |
CP002967 | 57.CP002967.sequence.fasta | 72dd80f519cc369c19507ad5e522ece2 | |
CP001925 | 58.CP001925.sequence.fasta | 911666bb04b01804400ba351fc1effe1 | |
CP001665 | 59.CP001665.rev.sequence.fasta | fd656e6dcbd4f0806c62d0fce255769c | This is the reverse complemented sequence |
CP002212 | 60.CP002212.sequence.fasta | 93aa4ed287b350492ad41b44e4260e3e | The SplitMEM paper lists the accession number P002212 |
CP002211 | 61.CP002211.sequence.fasta | 57ca40270401c08f240fd9a5b49fe601 | |
CP006698 | 62.CP006698.sequence.fasta | 0a0da9962b86a31ae2a92d59c579399c | |
All files are also available in this archive.
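Some of the E. coli files listed above (those with ".rev." in the file name) contain the reverse complement of the downloaded sequence. As an illustration of what that means, here is a minimal Python sketch of reverse complementation. It is not the tool used to create the files. The name `reverse_complement` is hypothetical, and the sketch assumes that only A, C, G and T are complemented, while any other character (for example N) is kept as it is.

```python
# Translation table for the four standard bases, in both cases.
# Assumption: other characters (e.g. N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # prints CGTT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.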