1 Background
GDx at OUSAMG is planning to upscale WGS production to 192 samples per week (4 x 48 or 2 x 48 + 1 x 96). Do we have enough capacity in the IT and bioinformatics pipelines for this upscaling?
The capacity of the IT and bioinformatics pipelines can be evaluated from the following three aspects:
- Data transfer speed
- Data storage
- Pipeline capacity
This document will focus on the evaluation of NSC storage capacity.
2 NSC Storage Capacity
2.1 Capacity breakdown
The total capacity of NSC storage is 552.6 TiB. The capacity breakdown as of Dec 29, 2024, 12:00 PM is as follows:
2.2 Usable capacity
The usable capacity as of Dec 29, 2024, 12:00 PM is 231.2 TiB.
2.3 Storage volumes
Name | Purpose |
---|---|
/boston | General data area |
/boston/runScratch | Sequencing runs (Illumina, ONT) |
/boston/projects | Research projects |
/boston/common | Software, and repositories |
/boston/diag | Diagnostics production |
/boston/diag/transfer | Transfer area (for TSD) |
/boston/runScratch/demultiplexed/delivery/tsd_sleipnir | Transfer area NSC |
vm-datastore | VMware datastore (virtual hard disks) |
2.4 Used storage breakdown
The used storage as of Dec 29, 2024, 12:00 PM is 228 TiB. Details are as follows:
The /boston/diag/production/data/samples folder (with 255 sample folders in it) is reported as very small (924 GiB) because its big files (mainly .fastq.ora files) are hardlinks.
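The effect of hardlinks on reported folder sizes can be illustrated with a minimal sketch that counts each inode only once. The function name `disk_usage` is hypothetical, and the sketch uses file sizes rather than allocated blocks, so it is not how the storage system itself computes the figures in this document.

```python
import os
import stat


def disk_usage(root):
    """Return (apparent_bytes, unique_bytes) for regular files under root.

    apparent_bytes counts every directory entry, so a file with several
    hardlinks is counted once per link. unique_bytes counts each inode
    (device, inode number) only once.
    """
    seen = set()
    apparent = 0
    unique = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            apparent += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                unique += st.st_size
    return apparent, unique
```

A large gap between the two returned values for a folder suggests that most of its content is shared with other locations through hardlinks.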
2.4.1 /boston (228 TiB)
Directory | Logical | %use of Parent Directory | Physical |
---|---|---|---|
/boston/diag | 96.9 TiB | 51.8% | 113 TiB |
/boston/runScratch | 83.1 TiB | 44.4% | 107 TiB |
/boston/projects | 2.87 TiB | 1.5% | 3.74 TiB |
/boston/home | 2.77 TiB | 1.5% | 2.82 TiB |
/boston/common | 1.42 TiB | 0.8% | 1.54 TiB |
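As a sanity check on the table above, the "%use of Parent Directory" column can be reproduced from the logical sizes. This is a minimal sketch that assumes the parent's logical size is simply the sum of the listed subdirectories; the function name `share_of_parent` is hypothetical.

```python
def share_of_parent(sizes_tib):
    """Return each directory's share of the summed logical size, in percent."""
    total = sum(sizes_tib.values())
    return {name: round(100 * size / total, 1) for name, size in sizes_tib.items()}


# Logical sizes in TiB, copied from the /boston table.
boston = {
    "/boston/diag": 96.9,
    "/boston/runScratch": 83.1,
    "/boston/projects": 2.87,
    "/boston/home": 2.77,
    "/boston/common": 1.42,
}

shares = share_of_parent(boston)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```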
2.4.1.1 /boston/diag
Directory | Logical | %use of Parent Directory | Physical |
---|---|---|---|
/boston/diag/runs | 35.8 TiB | 37.0% | 47.4 TiB |
/boston/diag/production | 33.8 TiB | 34.9% | 34.3 TiB |
/boston/diag/transfer | 15.6 TiB | 16.1% | 16.3 TiB |
/boston/diag/nscDelivery | 7.82 TiB | 8.1% | 10.4 TiB |
/boston/diag/staging | 3.84 TiB | 4.0% | 4.88 TiB |
/boston/diag/diagInternal | 1.48 GiB | 0.0% | 1.88 GiB |
2.4.1.1.1 /boston/diag/production
Directory | Logical | %use of Parent Directory | Physical |
---|---|---|---|
/boston/diag/production/data | 33.5 TiB | 99.1% | 33.9 TiB |
/boston/diag/production/sw | 238 GiB | 0.7% | 297 GiB |
/boston/diag/production/reference | 74.3 GiB | 0.2% | 63.9 GiB |
/boston/diag/production/logs | 3.32 GiB | 0.0% | 557 MiB |
/boston/diag/production/.thirdparty | 110 MiB | 0.0% | 148 MiB |
2.4.1.1.2 /boston/diag/transfer
Directory | Logical | %use of Parent Directory | Physical |
---|---|---|---|
/boston/diag/transfer/production | 14.8 TiB | 97.7% | 15.4 TiB |
2.4.1.1.3 /boston/diag/staging
Directory | Logical | %use of Parent Directory | Physical |
---|---|---|---|
/boston/diag/staging/data | 3.53 TiB | 92.0% | 4.54 TiB |
/boston/diag/staging/sw | 237 GiB | 6.0% | 285 GiB |
/boston/diag/staging/reference | 77.5 GiB | 2.0% | 67.5 GiB |
2.4.1.2 /boston/runScratch
Directory | Logical | %use of Parent Directory | Physical |
---|---|---|---|
/boston/runScratch/NovaSeqX | 34.5 TiB | 41.5% | 45.7 TiB |
/boston/runScratch/analysis | 30.8 TiB | 37.0% | 37.1 TiB |
/boston/runScratch/demultiplexed | 15.6 TiB | 18.8% | 20.7 TiB |
/boston/runScratch/processed | 1.03 TiB | 1.2% | 1.58 TiB |
/boston/runScratch/ONT | 738 GiB | 0.9% | 1.02 TiB |
/boston/runScratch/UserData | 244 GiB | 0.3% | 254 GiB |
/boston/runScratch/test | 64.7 GiB | 0.1% | 86.1 GiB |
/boston/runScratch/PGT | 16.7 GiB | 0.0% | 20.5 GiB |
/boston/runScratch/Upgrade_software | 16.7 GiB | 0.0% | 22.2 GiB |
/boston/runScratch/mik_data | 12.5 GiB | 0.0% | 12.5 GiB |
/boston/runScratch/imm_data | 4.5 GiB | 0.0% | 4.53 GiB |
2.5 Expected data
The data generated by NovaSeqX depend on the settings of secondary analysis and the sequencing depth (current setting is 64 samples per flowcell).
When Onboard DRAGEN is used only for demultiplexing, the inhouse pipelines must be run on external DRAGEN.
2.5.1 Per sample
2.5.1.1 BCL Convert
2.5.1.1.1 NovaSeqX generated data
The {R1,R2}.fastq.ora files are about 14 GB of data per sample (18 GB on disk).
Other files such as BCL files, images, logs, reports, etc. are about 47 GB of data per sample (63 GB on disk).
In total, 61 GB of data per sample (81 GB on disk).
2.5.1.1.2 Inhouse pipelines on external DRAGEN
With BCL Convert only, we need to run inhouse pipelines on external DRAGEN which requires input data and also generates output data.
Given the design of the inhouse pipelines, some files are duplicated in different locations.
Only 1 copy of any fastq.ora file is physically stored on boston, i.e., it is not duplicated. A fastq.ora file appears in 4 different locations:
1. /boston/diag/nscDelivery
2. /boston/diag/transfer/production/{normal,high,urgent}/samples or /boston/diag/transfer/production/transferred/{normal,high,urgent}/samples
3. /boston/diag/production/data/samples
4. /boston/diag/production/data/analyses-work/\*/result/\*/work
Locations 1, 2 and 3 are hardlinks; 4 is a symlink to 3 (within /boston/diag/production/data/analyses-work, files are symlinked from the Nextflow work folder).
Files in /boston/diag/production/data/analyses-results/{singles,trios} are copies of those in /boston/diag/production/data/analyses-work.
Files in /boston/diag/transfer/production/{normal,high,urgent}/analyses-results/{singles,trios} are copies of those in /boston/diag/production/data/analyses-work.
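Whether two paths are hardlinks, symlinks or independent copies can be checked with the standard library. The sketch below is illustrative only; the function name `link_kind` and its return labels are hypothetical and not part of the inhouse pipelines.

```python
import os


def link_kind(path, reference):
    """Classify how `path` relates to `reference`.

    Returns "symlink" if `path` is a symbolic link resolving to `reference`,
    "hardlink" if both names point to the same inode, and "separate file"
    if `path` is independent content (for example a copy).
    """
    same = os.path.samefile(path, reference)  # follows symlinks
    if os.path.islink(path):
        return "symlink" if same else "symlink to another file"
    return "hardlink" if same else "separate file"
```

Only "separate file" results consume additional storage; hardlinks and symlinks to an existing fastq.ora file do not add a second physical copy.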
2.5.1.1.2.1 analyses-work folder size
Average analyses-work basepipe folder size is 71 GB (70 GB on disk).
Average analyses-work triopipe folder size is 11 GB (7 GB on disk).
Average analyses-work annopipe folder size is 9 GB (8 GB on disk).
Adding NovaSeqX generated data, the total data per sample is 139 GB (157 GB on disk).
2.5.1.1.2.2 analyses-results folder size
Average analyses-results singles folder size is 55 GB (55 GB on disk).
Average analyses-results trios folder size is 2 GB (2 GB on disk).
Adding NovaSeqX generated data and analyses-work data, and counting in the 2 duplicates of analyses-results, the total data per sample becomes 251 GB (269 GB on disk).
2.5.1.1.2.3 ella-incoming folder size
Average ella-incoming folder size is 126 MB (126 MB on disk).
The ella-incoming folder is small, so the total data per sample remains 251 GB (269 GB on disk).
2.5.1.2 DRAGEN Germline
When the secondary analysis is DRAGEN Germline with all variant callers, i.e. demultiplexing, mapping and variant calling are all done with Onboard DRAGEN, each sample has about 20 GB of additional pipeline output data (27 GB on disk). See Section 2.5.1.2.1 for details.
In total, 81 GB of data per sample (108 GB on disk).
The inhouse pipeline and nsc-exporter changes needed to accommodate the DRAGEN Germline pipeline are not yet implemented. Some duplication of NovaSeqX generated data is expected.
2.5.1.2.1 Pipeline output files per sample
analysis/wgs435_HB12345678_b08a0667-b221-48c6-8d44-abb516a61a2b/germline_seq
├── [ 20G] germline_seq
│ ├── [ 2.6M] report.html
│ ├── [ 13M] sv
│ │ ├── [ 6.1M] results
│ │ │ ├── [ 44K] stats
│ │ │ │ ├── [ 736] alignmentStatsSummary.txt
│ │ │ │ ├── [ 20K] candidate_metrics.csv
│ │ │ │ ├── [ 535] diploidSV.sv_metrics.csv
│ │ │ │ ├── [ 4.4K] graph_metrics.csv
│ │ │ │ ├── [ 9.0K] svCandidateGenerationStats.tsv
│ │ │ │ ├── [ 6.7K] svCandidateGenerationStats.xml
│ │ │ │ └── [ 1.7K] svLocusGraphStats.tsv
│ │ │ └── [ 6.1M] variants
│ │ │ ├── [ 4.2M] candidateSV.vcf.gz
│ │ │ ├── [ 671K] candidateSV.vcf.gz.tbi
│ │ │ ├── [ 1.1M] diploidSV.vcf.gz
│ │ │ └── [ 121K] diploidSV.vcf.gz.tbi
│ │ └── [ 6.5M] workspace
│ │ ├── [ 56K] alignmentStats.xml
│ │ ├── [ 505] chromDepth.txt
│ │ ├── [ 59K] edgeRuntimeLog.txt
│ │ ├── [ 17K] genomeSegmentScanDebugInfo.txt
│ │ ├── [ 2.3K] logs
│ │ │ └── [ 2.2K] config_log.txt
│ │ └── [ 6.4M] svLocusGraph.bin
│ ├── [ 182K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv.excluded_intervals.bed.gz
│ ├── [ 457K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv.gff3
│ ├── [ 2.8K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv.igv_session.xml
│ ├── [ 793] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv_metrics.csv
│ ├── [ 70K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv.vcf.gz
│ ├── [ 32] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv.vcf.gz.md5sum
│ ├── [ 18K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cnv.vcf.gz.tbi
│ ├── [ 16G] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cram
│ ├── [ 1.3M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cram.crai
│ ├── [ 32] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cram.md5sum
│ ├── [ 302] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cyp2b6.tsv
│ ├── [ 283] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.cyp2d6.tsv
│ ├── [ 420K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.fastqc_metrics.csv
│ ├── [ 272K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.fragment_length_hist.csv
│ ├── [ 185] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.gba.tsv
│ ├── [ 2.4K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.gvcf_hethom_ratio_metrics.csv
│ ├── [ 2.4K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.gvcf_metrics.csv
│ ├── [ 46M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.baf.bw
│ ├── [ 3.8G] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.gvcf.gz
│ ├── [ 32] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.gvcf.gz.md5sum
│ ├── [ 1.2M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.gvcf.gz.tbi
│ ├── [ 365M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.vcf.gz
│ ├── [ 32] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.vcf.gz.md5sum
│ ├── [ 1.6M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.hard-filtered.vcf.gz.tbi
│ ├── [ 16M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.improper.pairs.bw
│ ├── [ 429] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.insert-stats.tab
│ ├── [ 8.9K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.mapping_metrics.csv
│ ├── [ 9.2K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.metrics.json
│ ├── [ 39K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.pcr-model-0.log
│ ├── [ 115] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.pcr-model.log
│ ├── [ 1.4K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.ploidy_estimation_metrics.csv
│ ├── [ 1.8K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.ploidy.vcf.gz
│ ├── [ 32] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.ploidy.vcf.gz.md5sum
│ ├── [ 4.0K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.ploidy.vcf.gz.tbi
│ ├── [ 1.9M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.repeats.bam
│ ├── [ 4.3K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.repeats.vcf.gz
│ ├── [ 3.9K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.repeats.vcf.gz.tbi
│ ├── [ 48K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.roh.bed
│ ├── [ 114] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.roh_metrics.csv
│ ├── [ 242K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.seg
│ ├── [ 69K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.seg.bw
│ ├── [ 247K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.seg.called
│ ├── [ 259K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.seg.called.merged
│ ├── [ 223] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.smn.tsv
│ ├── [ 1.7K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.snperror-sampler.log
│ ├── [ 535] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.sv_metrics.csv
│ ├── [ 1.1M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.sv.vcf.gz
│ ├── [ 121K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.sv.vcf.gz.tbi
│ ├── [ 19M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.target.counts.bw
│ ├── [ 22M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.target.counts.diploid.bw
│ ├── [ 31M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.target.counts.gc-corrected.gz
│ ├── [ 25M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.target.counts.gz
│ ├── [ 1.4K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.targeted.json
│ ├── [ 22M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.tn.bw
│ ├── [ 37M] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.tn.tsv.gz
│ ├── [ 1.4K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.trimmer_metrics.csv
│ ├── [ 7.2K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.vc_hethom_ratio_metrics.csv
│ ├── [ 2.4K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.vc_metrics.csv
│ ├── [ 2.8K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.wgs_contig_mean_cov.csv
│ ├── [ 2.1K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.wgs_coverage_metrics.csv
│ ├── [ 16K] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.wgs_fine_hist.csv
│ ├── [ 558] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.wgs_hist.csv
│ └── [ 43] wgs435_HB12345678_3c52f4e7-826d-40dd-bd98-f4356deeb098.wgs_overall_mean_cov.csv
└── [ 11K] logs
├── [ 258] bcl2fastq.dragen_events.csv
├── [ 1.1K] cmdline_1198902.txt
├── [ 608] cmdline_1608030.txt
├── [ 1.1K] cmdline_3358454.txt
├── [ 219] DCKR_RG-stderr_1608030.txt
├── [ 461] DCKR_RG-stdout_1608030.txt
├── [ 250] ora.dragen_events.csv
├── [ 1] ORA-stderr_1198902.txt
├── [ 3.9K] ORA-stdout_1198902.txt
├── [ 1] P2FSW-stderr_3358454.txt
└── [ 2.4K] P2FSW-stdout_3358454.txt
👆 Generated by tree --du, which shows the actual file sizes instead of the disk space used.
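The difference between the actual file size and the disk space used can be inspected per file. This is a minimal sketch for POSIX systems; it assumes that "on disk" corresponds to allocated blocks as reported by the filesystem, which may not be exactly how the storage system accounts for physical usage. The function name `apparent_and_allocated` is hypothetical.

```python
import os


def apparent_and_allocated(path):
    """Return (apparent_bytes, allocated_bytes) for one file.

    apparent_bytes is the file length (what tree --du sums up).
    allocated_bytes is derived from st_blocks, which POSIX reports
    in 512-byte units regardless of the filesystem block size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```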
2.5.2 Per sequencing run
Each sequencing run can have a different setup.
- Flowcell side:
  - single side (single flowcell)
  - both sides (dual flowcell)
- Flowcell type:
  - 1.5B
  - 10B
  - 25B
- Secondary analysis:
  - BCL Convert
  - DRAGEN Germline
    - variant calling mode = None
    - variant calling mode = SmallVariantCaller
    - variant calling mode = AllVariantCallers (Small, Structural, CNV, Repeat Expansions, ROH, CYP2D6 etc.)
2.5.2.1 Single 25B flowcell; BCL Convert
Single 25B flowcell (64 samples), secondary analysis is BCL Convert, i.e. only demultiplexing.
3.9 TB data (5.1 TB on disk).
2.5.2.2 Dual 25B flowcell; BCL Convert
Dual 25B flowcell (128 samples), secondary analysis is BCL Convert, i.e. only demultiplexing.
7.8 TB data (10.2 TB on disk).
2.5.2.3 Dual 25B flowcell; DRAGEN Germline, AllVariantCallers
Dual 25B flowcell (128 samples), secondary analysis is DRAGEN Germline with all variant callers enabled, i.e. demultiplexing, mapping and variant calling are all done with Onboard DRAGEN.
10.5 TB data (14.0 TB on disk).
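The per-sample figures from Section 2.5.1 can be scaled to a run or to the planned weekly volume of 192 samples. The sketch below is a rough estimate only: it assumes decimal units (1 TB = 1000 GB) and a simple linear scaling, so the results need not match the per-run figures above exactly. The names `PER_SAMPLE_GB` and `estimate_tb` are hypothetical.

```python
GB_PER_TB = 1000  # assumption: GB and TB in this document are decimal units

# (data GB, on-disk GB) per sample, copied from Section 2.5.1.
PER_SAMPLE_GB = {
    "BCL Convert, NovaSeqX data only": (61, 81),
    "BCL Convert + inhouse pipelines": (251, 269),
    "DRAGEN Germline, NovaSeqX data only": (81, 108),
}


def estimate_tb(samples, per_sample_gb):
    """Scale per-sample (data, on-disk) GB figures to TB for `samples` samples."""
    data_gb, disk_gb = per_sample_gb
    return samples * data_gb / GB_PER_TB, samples * disk_gb / GB_PER_TB


for name, figures in PER_SAMPLE_GB.items():
    data_tb, disk_tb = estimate_tb(192, figures)
    print(f"{name}: {data_tb:.1f} TB data, {disk_tb:.1f} TB on disk per week")
```

Comparing these weekly estimates with the usable capacity in Section 2.2 requires converting between TB and TiB and making an assumption about how long data is retained, neither of which is covered by this sketch.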