WGS Upscaling - Data Transfer Capacity Evaluation

Data transfer, Data storage, Bioinformatics pipeline capacity
Author / Affiliation: GDx OUSAMG

Published: March 13, 2025

1 Background


GDx at OUSAMG is planning to upscale the WGS production to:

  • 192 samples (4 projects x 48, or 2 projects x 48 + 1 project x 96) per week (9,216 per year), with 2 NovaSeq 6000.
  • 352 samples (2 flow cells x 64 + 3.5 flow cells x 64) per week (16,896 per year), with 2 NovaSeq X Plus.
  • 416 samples (2 flow cells x 64 + 4.5 flow cells x 64) per week (19,968 per year), with 2 NovaSeq X Plus and weekend work.
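
The yearly totals above are consistent with 48 production weeks per year (for example, 9,216 / 192 = 48). The sketch below reproduces that arithmetic. The 48-week factor is inferred from the stated numbers, not taken from the plan itself, and the scenario labels are only illustrative.

```python
# Weekly sample counts for the three scenarios listed above.
# ASSUMPTION: 48 production weeks per year, inferred from 9,216 / 192.
WEEKS_PER_YEAR = 48

scenarios = {
    "2 x NovaSeq 6000": 4 * 48,
    "2 x NovaSeq X Plus": int(2 * 64 + 3.5 * 64),
    "2 x NovaSeq X Plus + weekend work": int(2 * 64 + 4.5 * 64),
}


def annual_samples(per_week, weeks=WEEKS_PER_YEAR):
    """Scale a weekly sample count to a yearly total."""
    return per_week * weeks


for name, per_week in scenarios.items():
    print(f"{name}: {per_week} per week, {annual_samples(per_week):,} per year")
```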

The capacity of the IT infrastructure and bioinformatics pipelines can be assessed in three key areas:

  1. Data transfer speed (this report)
  2. Data storage
  3. Pipeline capacity

2 Data transfer speed


Sequencing data and NSC pipeline results are stored at NSC and must be transferred to TSD. Due to the large data volume, the transfer is managed by nsc-exporter, which checks for new data every 10 minutes and transfers it using s3cmd put via TSD’s s3api wrapper.


flowchart LR
    subgraph dt[**nsc-exporter**]
    check-new-data(((data to transfer))) --> |Yes| transfer(Transfer data to TSD) 
    transfer --> sleep(Sleep 10 minutes)
    sleep --> |check| check-new-data
    check-new-data --> |No| sleep
    end
    subgraph dp[**2 data producers**]
    lims-exporter[lims-exporter]  ---> |produce| check-new-data
    pipeline[NSC pipelines] ---> |produce| check-new-data
    end
    style dt fill:#e4eda6,stroke-width:3px
    style dp fill:#F2D1FF
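
The loop in the flowchart can be sketched roughly as follows. This is not the nsc-exporter source: `run_exporter`, `find_new_data`, `transfer`, and `max_cycles` are hypothetical names introduced for illustration, and the real transfer step (s3cmd put via the TSD wrapper) is replaced here by an injected callable.

```python
import time
from typing import Callable, Iterable, Optional


def run_exporter(
    find_new_data: Callable[[], Iterable[str]],
    transfer: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    interval_s: float = 600.0,         # 10 minutes, as described above
    max_cycles: Optional[int] = None,  # None = loop forever; set a number for testing
) -> int:
    """Check for new data, transfer it, then sleep. Returns the number of items transferred."""
    transferred = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for path in find_new_data():   # empty result = the "No" branch: go straight to sleep
            transfer(path)
            transferred += 1
        sleep(interval_s)
        cycles += 1
    return transferred


# Dry run with made-up data: three polling cycles, no real sleeping or uploading.
pending = [["a.bam", "b.vcf"], [], ["c.fastq.gz"]]
sent = []
naps = []
count = run_exporter(lambda: pending.pop(0), sent.append, naps.append, max_cycles=3)
print(count, sent)
```

Injecting the transfer and sleep functions keeps the sketch runnable without network access. Because transfers run one at a time inside the loop, any time spent sleeping or finding no data counts as idle time for the exporter.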


2.1 Data Collection

To assess the data transfer speed from NSC to TSD, we analyzed historical transfer records from the nsc-exporter log between 2023-09-01 08:41:40 and 2023-11-30 20:14:10. Five example records are shown below.

datetime             project    bytes        seconds  speed     filename
2023-09-30 05:49:01  wgs315     64816823173  767.4    80550000  Diag-wgs315-HG72932663C12413.bam
2023-10-07 13:54:26  EKG231004  329994916    4.2      75220000  HG53787654-MAMMAE-KIT-CuCaV3_S29_R2_001.fastq.gz
2023-11-06 19:29:07  wgs328     188199       0        5890000   Diag-wgs328-HG60131363.cnv.vcf
2023-10-09 19:55:04  wgs318     1567         0        49590     Diag-wgs318-HG73451077C12478.sample
2023-11-30 14:46:04  wgs337     120206       0.1      1844160   231127_A01447_0418_AHKG5VDSX7.HG25933427-NevrMusk-KIT-wgs_S15_R2_001.qc.pdf
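
Several of the example records have a logged duration of 0 seconds, so a speed cannot always be recomputed from the bytes and seconds fields alone. The sketch below shows one way to derive an effective speed while guarding against that case. The function name is hypothetical, and the logged speed field is deliberately not used because its unit is not stated in the records.

```python
from typing import Optional

MIB = 1024 ** 2


def effective_speed_mib_s(n_bytes: int, seconds: float) -> Optional[float]:
    """Return bytes / seconds in MiB/s, or None when the logged duration is 0."""
    if seconds <= 0:
        return None
    return n_bytes / seconds / MIB


# (filename, bytes, seconds) taken from two of the example records.
examples = [
    ("Diag-wgs315-HG72932663C12413.bam", 64816823173, 767.4),
    ("Diag-wgs328-HG60131363.cnv.vcf", 188199, 0.0),
]
for name, n_bytes, seconds in examples:
    speed = effective_speed_mib_s(n_bytes, seconds)
    print(name, "n/a" if speed is None else f"{speed:.2f} MiB/s")
```

For the BAM file this gives roughly 80.55 MiB/s. That appears to match the logged speed value of 80550000 if the field is read as MiB/s scaled by 10^6, but that reading is an assumption.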
Excluded files

The nsc-exporter log files and the sequencer overview HTML files were excluded for simplicity. [1]

2.2 Data Overview

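The transferred file sizes range from 0.0 B to 100.9 GiB, with a mean of 1.5 GiB, a median of 9.3 KiB, and a standard deviation of 8.1 GiB. The large gap between mean and median shows that most files are very small, while a few very large files account for most of the data volume.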

         filesize
Min.        0.0 B
1st Qu.   421.0 B
Median    9.3 KiB
Mean      1.5 GiB
3rd Qu. 968.0 KiB
Max.    100.9 GiB
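
The sizes in the table use binary units (KiB, MiB, GiB). A small helper like the one below can turn the raw byte counts from the log into the same notation. It is an illustrative sketch, not the formatting code used to produce the report.

```python
def format_bytes(n: float) -> str:
    """Format a byte count with binary units and one decimal, e.g. 1567 -> '1.5 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value is below 1024, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


# Byte counts from the example records in Section 2.1.
for n in (1567, 188199, 64816823173):
    print(n, "->", format_bytes(n))
```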

The transfer speed ranges from 1.0 B/s to 93.1 MiB/s, with a mean of 12.2 MiB/s, a median of 286.7 KiB/s, and a standard deviation of 23.4 MiB/s. Small files usually transfer at very low speeds and make up most of the records, so the mean per-file speed is not a good indicator of the achievable transfer speed; see Section 2.3.1.

        speed(/s)
Min.        1.0 B
1st Qu.  11.9 KiB
Median  286.7 KiB
Mean     12.2 MiB
3rd Qu.   8.4 MiB
Max.     93.1 MiB
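
To illustrate why the per-file mean understates throughput, the sketch below compares the unweighted mean of per-file speeds with the aggregate throughput (total bytes divided by total seconds). It uses only the three example records from Section 2.1 that have a non-zero duration, so the numbers are illustrative and not a result for the full log.

```python
from statistics import mean

MIB = 1024 ** 2

# (bytes, seconds) from the example records with a non-zero logged duration.
transfers = [
    (64816823173, 767.4),  # BAM file
    (329994916, 4.2),      # FASTQ file
    (120206, 0.1),         # QC PDF
]

# Unweighted: every file counts equally, so small, slow files pull the mean down.
per_file = [n_bytes / seconds for n_bytes, seconds in transfers]

# Aggregate: weighted by time spent, which is what matters for capacity.
aggregate = sum(b for b, _ in transfers) / sum(s for _, s in transfers)

print(f"mean of per-file speeds: {mean(per_file) / MIB:.1f} MiB/s")
print(f"aggregate throughput:    {aggregate / MIB:.1f} MiB/s")
```

The aggregate figure is dominated by the large file because nearly all of the transfer time is spent on it.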

The transfer time ranges from 0 seconds to 2084.4 seconds. The average transfer time is 19.5 seconds. The median transfer time is 0 seconds. The standard deviation is 104 seconds.

    seconds       
 Min.   :   0.00  
 1st Qu.:   0.00  
 Median :   0.00  
 Mean   :  19.47  
 3rd Qu.:   0.10  
 Max.   :2084.40  

2.3 Correlation

Next, we examine how file size affects transfer speed and transfer time.

2.3.1 Transfer speed and time against file size for all files


Small files have lower transfer speeds. Large files (>2 GB) reach a good transfer speed of around 80 MB/s. However, the best speed is observed for files of around 200 MB (zoom in or see Figure 5).

Figure 1: Transfer speed VS file size (all files)
Figure 2: Transfer time VS file size (all files)
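
The relationship shown in these figures can also be summarized numerically by grouping transfers by order of magnitude of file size. The sketch below is one possible way to do that. The function name and the power-of-ten binning are choices made here for illustration and are not the method used to produce the figures.

```python
from collections import defaultdict
from statistics import median


def median_speed_by_size_decade(transfers):
    """Group (bytes, seconds) pairs by power-of-ten file size.

    Returns {lower bound of size bin in bytes: median speed in bytes/s}.
    Records with zero bytes or zero logged duration are skipped.
    """
    bins = defaultdict(list)
    for n_bytes, seconds in transfers:
        if n_bytes <= 0 or seconds <= 0:
            continue
        decade = len(str(int(n_bytes))) - 1  # 120206 -> 5, i.e. the 10^5 bin
        bins[decade].append(n_bytes / seconds)
    return {10 ** d: median(speeds) for d, speeds in sorted(bins.items())}


# (bytes, seconds) from the example records in Section 2.1.
sample = [
    (64816823173, 767.4),
    (329994916, 4.2),
    (188199, 0.0),
    (1567, 0.0),
    (120206, 0.1),
]
for lower_bound, speed in median_speed_by_size_decade(sample).items():
    print(f">= {lower_bound:>14,} B: {speed / 1024 ** 2:6.1f} MiB/s")
```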

2.3.2 Transfer speed and time against file size for small files only


Although the transfer speed of small files is very low, their transfer time is usually very short, so small files are not the bottleneck of the data transfer. See also Section 2.5.1.

Figure 3: Transfer speed VS file size (small files)
Figure 4: Transfer time VS file size (small files)

2.3.3 Maximum transfer speed reached around 200 MB file size?


Small files have lower transfer speeds and large files have higher transfer speeds, but the best transfer speed appears to occur for files of around 200 MB.

Figure 5: maximum transfer speed reached around 200MB file size

2.4 Idle Time

To evaluate whether there is capacity for upscaling, we need to know the idle time of the nsc-exporter. The nsc-exporter is idle when it is not transferring data.

All transfer records are plotted with the start time of each transfer on the x-axis and the time taken to finish the transfer on the y-axis. Gaps represent idle periods of the nsc-exporter. Color represents the project (e.g. wgs123, EKG20230901) and shape represents the project type (e.g. wgs, EKG). You can hide a project by clicking it in the legend to the right of the figure.

For easier visualization, the data is grouped by month.
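
Besides reading the gaps off the plots, idle periods can be derived directly from the transfer records. The sketch below assumes each record has a start time and a duration in seconds. The function name and the timestamps in the demo are made up for illustration.

```python
from datetime import datetime, timedelta


def idle_gaps(transfers):
    """Return (gap_start, gap_end) pairs during which no transfer was running.

    transfers: iterable of (start: datetime, seconds: float).
    ASSUMPTION: the logged datetime is the start of the transfer.
    """
    gaps = []
    busy_until = None
    for start, seconds in sorted(transfers):
        if busy_until is not None and start > busy_until:
            gaps.append((busy_until, start))
        end = start + timedelta(seconds=seconds)
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Made-up records: 10-minute transfer, a short one inside it, then one after a pause.
demo = [
    (datetime(2023, 9, 1, 8, 0, 0), 600.0),
    (datetime(2023, 9, 1, 8, 5, 0), 60.0),
    (datetime(2023, 9, 1, 8, 30, 0), 120.0),
]
gaps = idle_gaps(demo)
idle_seconds = sum((end - start).total_seconds() for start, end in gaps)
print(gaps, idle_seconds)
```

Summing the gap lengths and dividing by the length of the observation window gives the idle fraction, which indicates how much headroom is left for upscaling.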

2.4.1 September


2.4.1.1 Per file transfer time

Figure 6: Transfer time of all files in September