flowchart LR
subgraph dt[**nsc-exporter**]
check-new-data(((data to transfer))) --> |Yes| transfer(Transfer data to TSD)
transfer --> sleep(Sleep 10 minutes)
sleep --> |check| check-new-data
check-new-data --> |No| sleep
end
subgraph dp[**2 data producers**]
lims-exporter[lims-exporter] ---> |produce| check-new-data
pipeline[NSC pipelines] ---> |produce| check-new-data
end
style dt fill:#e4eda6,stroke-width:3px
style dp fill:#F2D1FF
1 Background
GDx at OUSAMG is planning to upscale the WGS production to:
- 192 samples per week (4 proj x 48, or 2 proj x 48 + 1 proj x 96), 9,216 per year, with 2 NovaSeq 6000.
- 352 samples per week (2 fc x 64 + 3.5 fc x 64), 16,896 per year, with 2 NovaSeq X Plus.
- 416 samples per week (2 fc x 64 + 4.5 fc x 64), 19,968 per year, with 2 NovaSeq X Plus and weekend work.
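The yearly totals above can be reproduced from the weekly figures. The sketch below assumes 48 production weeks per year; that number is not stated in the report and is inferred from 9,216 / 192 = 48. The scenario labels and function name are illustrative only.

```python
# Assumption: 48 production weeks per year (inferred from 9,216 / 192, not stated in the report).
WEEKS_PER_YEAR = 48

# Samples per week for each scenario, using the per-week formulas from the text.
scenarios = {
    "2 x NovaSeq 6000": 4 * 48,                              # 4 projects x 48 samples
    "2 x NovaSeq X Plus": 2 * 64 + 3.5 * 64,                 # flow cells x 64 samples
    "2 x NovaSeq X Plus + weekend work": 2 * 64 + 4.5 * 64,
}


def yearly_samples(samples_per_week, weeks=WEEKS_PER_YEAR):
    """Scale a weekly sample count to a yearly one."""
    return int(samples_per_week * weeks)


if __name__ == "__main__":
    for name, per_week in scenarios.items():
        print(f"{name}: {int(per_week)} samples/week, {yearly_samples(per_week):,} samples/year")
```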
The capacity of the IT infrastructure and bioinformatics pipelines can be assessed in three key areas:
- Data transfer speed (this report)
- Data storage
- Pipeline capacity
2 Data transfer speed
Sequencing data and NSC pipeline results are stored at NSC and must be transferred to TSD. Due to the large data volume, the transfer is managed by nsc-exporter, which checks for new data every 10 minutes and transfers it using s3cmd put via TSD’s s3api wrapper.
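The check, transfer and sleep cycle described above can be sketched as follows. This is a minimal illustration, not the actual nsc-exporter code: the function names, the `find_new_data` callback and the destination URI are hypothetical, and TSD's s3api wrapper is not modeled. Only the basic `s3cmd put FILE s3://BUCKET/PREFIX` form is used.

```python
import subprocess
import time

POLL_INTERVAL_SECONDS = 10 * 60  # from the text: new data is checked every 10 minutes


def build_put_command(path, destination):
    """Build an `s3cmd put` command line; `destination` is a hypothetical s3:// URI."""
    return ["s3cmd", "put", path, destination]


def transfer_with_s3cmd(path, destination):
    """Upload one file; raises CalledProcessError if s3cmd exits with an error."""
    subprocess.run(build_put_command(path, destination), check=True)


def run_exporter(find_new_data, transfer, sleep=time.sleep, max_cycles=None):
    """Poll for new data, transfer whatever is found, then sleep.

    `find_new_data` returns a list of paths (possibly empty); `transfer` uploads one path.
    `max_cycles=None` runs forever; a number makes the loop testable.
    Returns the number of files transferred.
    """
    transferred = 0
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for path in find_new_data():      # "Yes" branch: there is data to transfer
            transfer(path)
            transferred += 1
        sleep(POLL_INTERVAL_SECONDS)      # both branches end in the 10-minute sleep
        cycle += 1
    return transferred
```

Passing `transfer` and `sleep` as parameters keeps the loop independent of s3cmd, so it can be exercised with stand-in functions before pointing it at a real destination.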
2.1 Data Collection
To assess the data transfer speed from NSC to TSD, we analyzed historical transfer records from the nsc-exporter log between 2023-09-01 08:41:40 and 2023-11-30 20:14:10.
[,1]
datetime "2023-09-30 05:49:01"
project "wgs315"
filename "Diag-wgs315-HG72932663C12413.bam"
bytes "64816823173"
seconds "767.4"
speed "80550000"
[,1]
datetime "2023-10-07 13:54:26"
project "EKG231004"
filename "HG53787654-MAMMAE-KIT-CuCaV3_S29_R2_001.fastq.gz"
bytes "329994916"
seconds "4.2"
speed "75220000"
[,1]
datetime "2023-11-06 19:29:07"
project "wgs328"
filename "Diag-wgs328-HG60131363.cnv.vcf"
bytes "188199"
seconds "0"
speed "5890000"
[,1]
datetime "2023-10-09 19:55:04"
project "wgs318"
filename "Diag-wgs318-HG73451077C12478.sample"
bytes "1567"
seconds "0"
speed "49590"
[,1]
datetime "2023-11-30 14:46:04"
project "wgs337"
filename "231127_A01447_0418_AHKG5VDSX7.HG25933427-NevrMusk-KIT-wgs_S15_R2_001.qc.pdf"
bytes "120206"
seconds "0.1"
speed "1844160"
The nsc-exporter log files and the sequencer overview HTML files were excluded for simplicity.
2.2 Data Overview
The transferred file sizes range from 0.0 B to 100.9 GiB, with an average of 1.5 GiB, a median of 9.3 KiB, and a standard deviation of 8.1 GiB.
filesize
Min. 0.0 B
1st Qu. 421.0 B
Median 9.3 KiB
Mean 1.5 GiB
3rd Qu. 968.0 KiB
Max. 100.9 GiB

The transfer speed ranges from 1.0 B/s to 93.1 MiB/s, with an average of 12.2 MiB/s, a median of 286.7 KiB/s, and a standard deviation of 23.4 MiB/s. Because small files usually transfer at very low speed, the average over all files is not a good indicator of the actual transfer speed; see Section 2.3.1.
speed(/s)
Min. 1.0 B
1st Qu. 11.9 KiB
Median 286.7 KiB
Mean 12.2 MiB
3rd Qu. 8.4 MiB
Max. 93.1 MiB
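To illustrate why the per-file average understates the achievable speed, the sketch below compares the mean and median of the logged per-file speeds with an aggregate throughput (total bytes divided by total seconds). It uses only the five example records from Section 2.1, so the numbers are illustrative and not a result of the full analysis; the function names are hypothetical.

```python
from statistics import mean, median

# (bytes, seconds, logged speed in bytes/s) for the five example records in Section 2.1.
records = [
    (64816823173, 767.4, 80550000),
    (329994916, 4.2, 75220000),
    (188199, 0.0, 5890000),
    (1567, 0.0, 49590),
    (120206, 0.1, 1844160),
]


def per_file_mean_speed(rows):
    """Unweighted mean of the logged speeds: every file counts equally."""
    return mean(speed for _, _, speed in rows)


def per_file_median_speed(rows):
    """Median of the logged speeds."""
    return median(speed for _, _, speed in rows)


def aggregate_throughput(rows):
    """Total bytes over total seconds: large files dominate, as they do in practice."""
    total_bytes = sum(size for size, _, _ in rows)
    total_seconds = sum(seconds for _, seconds, _ in rows)
    return total_bytes / total_seconds


if __name__ == "__main__":
    print(f"per-file mean:   {per_file_mean_speed(records) / 1e6:.1f} MB/s")
    print(f"per-file median: {per_file_median_speed(records) / 1e6:.1f} MB/s")
    print(f"aggregate:       {aggregate_throughput(records) / 1e6:.1f} MB/s")
```

For these five records the aggregate throughput is roughly 84 MB/s, while the per-file mean is roughly 33 MB/s and the median about 6 MB/s, because the three small files pull the per-file statistics down without contributing much transfer time.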

The transfer time ranges from 0 to 2084.4 seconds, with an average of 19.5 seconds, a median of 0 seconds, and a standard deviation of 104 seconds.
seconds
Min. : 0.00
1st Qu.: 0.00
Median : 0.00
Mean : 19.47
3rd Qu.: 0.10
Max. :2084.40

2.3 Correlation
Next, we examine how file size affects transfer speed and transfer time.
2.3.1 Transfer speed and time against file size for all files
Small files have lower transfer speed. A good transfer speed of around 80 MB/s can be achieved for large files (>2 GB). However, the best speed is observed for files of around 200 MB (zoom in or see Figure 5).
2.3.2 Transfer speed and time against file size for small files only
Although the transfer speed of small files is very low, their transfer time is usually very short, so small files are not the bottleneck of the data transfer. See also Section 2.5.1.
2.3.3 Maximum transfer speed reached around 200 MB file size?
Small files have lower transfer speed and large files have higher transfer speed, but the best transfer speed appears to be reached for files of around 200 MB.
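One way to check where the speed peaks without reading it off a plot is to group files by order of magnitude of their size and take the median speed per group. The sketch below does this for the five example records from Section 2.1; that is far too few to confirm a peak near 200 MB, and the decade-wide bins and function names are choices made for this illustration only.

```python
from collections import defaultdict
from statistics import median

# (bytes, logged speed in bytes/s) for the five example records in Section 2.1.
records = [
    (64816823173, 80550000),
    (329994916, 75220000),
    (188199, 5890000),
    (1567, 49590),
    (120206, 1844160),
]


def size_bin(num_bytes):
    """Power-of-ten bin of a file size: 3 means 1 kB to 10 kB, 8 means 100 MB to 1 GB."""
    if num_bytes <= 0:
        return 0  # empty files go into the lowest bin
    return len(str(num_bytes)) - 1  # digit count avoids floating-point log10 edge cases


def median_speed_by_bin(rows):
    """Map each size bin to the median logged speed of the files in it."""
    bins = defaultdict(list)
    for num_bytes, speed in rows:
        bins[size_bin(num_bytes)].append(speed)
    return {b: median(speeds) for b, speeds in sorted(bins.items())}


if __name__ == "__main__":
    for b, speed in median_speed_by_bin(records).items():
        print(f"10^{b} to 10^{b + 1} bytes: median {speed / 1e6:.2f} MB/s")
```

Narrower bins (for example half-decades) would be needed on the full log to separate a peak at 200 MB from the plateau observed for multi-gigabyte files.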
2.4 Idle Time
To evaluate whether there is capacity for upscaling, we need to know the idle time of the nsc-exporter. The nsc-exporter is idle when it is not transferring data.
All transfer records are plotted with the start time of each transfer on the x-axis and the time used to finish the transfer on the y-axis. The gaps represent idle periods of nsc-exporter. The color represents the project, e.g. wgs123 or EKG20230901. The shape represents the project type, e.g. wgs or EKG. You can turn off a project by clicking it in the legend to the right of the figure.
For easier visualization, the data is grouped in months.
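The idle periods can also be computed directly from the transfer records rather than read off the plot. The sketch below is a minimal version of that idea, assuming each record's timestamp is the start of the transfer and that its duration is given in seconds; the function names and the example times are hypothetical and not taken from the log.

```python
from datetime import datetime, timedelta


def idle_gaps(transfers):
    """Return (gap_start, gap_end) pairs between consecutive busy periods.

    `transfers` is a list of (start datetime, duration in seconds).
    Overlapping or back-to-back transfers are merged into one busy period.
    """
    gaps = []
    busy_until = None
    for start, seconds in sorted(transfers):
        if busy_until is not None and start > busy_until:
            gaps.append((busy_until, start))
        end = start + timedelta(seconds=seconds)
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


def idle_fraction(transfers):
    """Share of the observed window (first start to last end) spent idle."""
    window_start = min(start for start, _ in transfers)
    window_end = max(start + timedelta(seconds=s) for start, s in transfers)
    idle = sum((end - start for start, end in idle_gaps(transfers)), timedelta())
    return idle / (window_end - window_start)


if __name__ == "__main__":
    t0 = datetime(2023, 9, 1, 8, 0, 0)  # made-up example times
    example = [(t0, 600), (t0 + timedelta(minutes=20), 600)]
    print(idle_gaps(example))
    print(f"idle fraction: {idle_fraction(example):.2f}")
```

Note that `idle_fraction` divides by the window length, so it needs at least one transfer with a non-zero duration or two transfers at different times.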
